Homopolymers (Repeats) — ERRROR, EROR, ERROR?
It’s been a slightly busy last couple of months — during which I got into some interesting work related to homopolymer sequences in the genome.
Homopolymers are stretches of mono nucleotides (DNA bases) greater than two bases long which occur together.
So for instance, ‘ATCCCCGC’ has a homopolymer of length 4 (base ‘C’). These stretches are quite infamous for being sources of errors while sequencing DNA. It doesn't matter which platform or technology your lab uses (Illumina, Pacific Biosciences, Oxford Nanopore, Roche 454 etc.), these ~1.43 million homopolymer regions (also called mononucleotide microsatellites) in the human genome do not treat the sequencing experiment kindly.
The kind of error that creeps in due to these repeats are insertions and deletions or more popularly, indels. Let’s break it down!
Insertions and deletions, or indels. Image source: https://hackbrightacademy.com/blog/indel-finder-how-the-python-version-of-this-program-works/
Platform-Specific Problems
The current DNA sequencing technologies use different mechanisms to read bases. Sequencing platforms which use pyrosequencing, read off each base (A,T,G,C) from input sample DNA, what they essentially do is reconstruct the DNA by referring to the sample. Since the bases used for reconstruction are attached with a fluorophore, on the addition of each subsequent base the intensity of emitted fluorescence is recorded. The cumulative intensity increases linearly with the number of bases added. However, when a series of identical bases are added, the linearity is lost. This means that the sequencer is unable to distinguish between 5 As from 7As or 4 Cs from 6Cs — confidently. You may want to read the title of the blog post once again now! :)
Since the correct number of bases are not called, this leads to alignment errors for all sequences subsequently — basically an insertion or deletion. The accuracy of these technologies also decreases with increase in homopolymer length. This is a problem common to semiconductor based sequencing platforms too.
Addition of each base leads to a characteristic fluorescence signal, unless the region is a homopolymer in which case the signal is fraught with noise. Image Source: Deoxyribonucleoside Triphosphate — an overview | ScienceDirect Topics
The Illumina Platform (~ 90% market share), on the other hand suffers from base identification issues. In sequences such as ‘TTTTTGTTTT’, the lone ‘G’ base will be drowned by the signal from the ‘T’s all around inferring the ‘G’ as a ‘T’. In an ideal situation, the calls for each base should be unambiguous in each position of the DNA, however reality is far from it. Bases are often influenced by neighbouring nucleotide signals. To mention one such issue — G-Quenching: there is marked quenching of the fluorescence signal if a ‘G’ appears next to the base being read. When a base adjacent to ‘G’ is being read, its fluorescence is absorbed by the latter yielding a reduced signal strength.
Solution
A number of alternatives can be considered to resolve the hiccup of homopolymers -
→ Software tools —
Tools that correct for homopolymer length miscalculation include HECTORand Pollux to name a few. The former realigns the reads and performs with a high accuracy given short-read and long-read data.
→ Increased coverage
Increasing coverage or the number of times a position is read is a common solution sequencing experts deploy to tackle low-confidence calls and increase accuracy. Many times, two types of platforms are used to validate results from each other, a prohibitively expensive method indeed!
→ Elimination of homopolymer regions
A popular solution is simply ignoring the stretches of homopolymer tracts from analysis. And this is principally what my approach to the problem was. Crude, as it may sound, this strategy can lead to impressive gains in time and lower false positive detection. Homopolymer repeats in coding regions of the DNA can be responsible for disease-causing variants (a Cystic Fibrosis gene mutation common in Eastern Europeans — 2184insA is a mutation in an 8A long region) and hence the choice of removing homopolymer regions from non-coding regions.
A combination of Python and bcftools led to a neat procedure for obtaining regions free of homopolymers.
Once I downloaded chromosomal data as FASTA files from the UCSC webpage, I ran them through a Python programme, which wrote the positional coordinates, length, and nucleotide of the homopolymer regions as a BED format file. I chose to identify a region as a homopolymer if it exceeded 6 bases in length.
Now, using a file containing the coding regions of the genome, I was able to identify which of the above homopolymers resided within these regions using bcftools. Bcftools makes it very easy to perform filtering operations on BED files.
Then I removed these overlapping values from my full list of homopolymers to give me a list of regions overlapping with the noncoding region.
As the final step, I intersected a list of genes with the above homopolymer file. Voila! I’ve got what I needed, a list of all homopolymers (their gene coordinates) that one can exclude from variant analysis! The output does look a lot like the image of the homopolymer region file (penultimate image), without the columns 4 and 5!
It’s been a sufficiently long post as it is! As a final thought, maybe we will have a gold-standard method for handling this issue for each platform sooner or later. Only thing that remains to be seen is, will it enhance the platform chemistry or be a sweet algorithm that works ubiquitously? :)