Error Correction Algorithms for Genomic Sequencing Data

The rapid development of high-throughput next-generation sequencing (NGS) platforms has produced massive sets of genomic reads at low cost for a wide range of biomedical applications (e.g., de novo genome assembly, read alignment, resequencing, and single-nucleotide polymorphism discovery). A serious concern with these datasets is that machine-generated sequencing data contain many random errors (such as substitutions, insertions, and deletions). To the best of our knowledge, all existing error correction methods have limitations. This work aims to rectify as many errors as possible by designing strategies adapted to specific cases. Three novel error correction algorithms are designed to provide high-quality sequencing data. The first is a novel instance-based error correction method, designed to provide high-quality reads for any given instance case and implemented as a tool named InsEC. The instance-based strategy uses only the data traits related to an instance, which allows us to approach the ground truth of the instance case and thereby achieve better error correction performance. The second method is the first error correction method for miRNA reads. It proposes a novel lattice structure combining k-mers, (k-1)-mers, and (k+1)-mers to rectify errors, which is particularly effective for correcting indel errors. Extensive tests on datasets with known ground-truth errors demonstrate that the method removes almost all of the errors without introducing any new ones, improving data quality from one error per 50 reads to one error per 1,300 reads. The third method is the first small RNA error correction method that supports substitution, insertion, and deletion error rectification. This method is more robust and supports all kinds of small RNA sequencing reads (read lengths from 20 to 200 nucleotides).
Furthermore, we improve the three-layer lattice structure by building it from reads of the same length, length plus one, and length minus one, which dramatically increases the method's efficiency. Finally, we take RNA isoforms into account and propose applying corrections proportionally to achieve a finer-grained correction. All designed algorithms achieved high performance and provided high-quality sequencing data for downstream analyses.
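To make the three-layer lattice idea more concrete, the following is a minimal Python sketch of how candidate corrections for a rare k-mer might be looked up across the (k-1)-mer, k-mer, and (k+1)-mer layers. It is an illustration under stated assumptions, not the thesis's actual algorithm: the function names (`count_mers`, `build_layers`, `one_edit_neighbors`, `candidates`), the `min_count` threshold, and the ranking by raw count are all hypothetical choices made for this example.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_mers(reads, size):
    """Count every substring of the given size across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - size + 1):
            counts[read[i:i + size]] += 1
    return counts

def build_layers(reads, k):
    """Three count tables keyed by substring length: k-1, k, and k+1."""
    return {size: count_mers(reads, size) for size in (k - 1, k, k + 1)}

def one_edit_neighbors(kmer):
    """Return the set of strings exactly one edit away from kmer.

    Substituting a base stays in the k layer, removing a base lands in
    the (k-1) layer, and adding a base lands in the (k+1) layer.
    """
    k = len(kmer)
    neighbors = set()
    for i in range(k):
        neighbors.add(kmer[:i] + kmer[i + 1:])              # remove one base
        for base in ALPHABET:
            if base != kmer[i]:
                neighbors.add(kmer[:i] + base + kmer[i + 1:])  # substitute
    for i in range(k + 1):
        for base in ALPHABET:
            neighbors.add(kmer[:i] + base + kmer[i:])       # add one base
    return neighbors

def candidates(kmer, layers, min_count):
    """List (layer_size, neighbor, count) for well-supported neighbors.

    min_count is an assumed support threshold; results are sorted by
    count, highest first, with ties broken alphabetically.
    """
    found = []
    for neighbor in one_edit_neighbors(kmer):
        count = layers.get(len(neighbor), {}).get(neighbor, 0)
        if count >= min_count:
            found.append((len(neighbor), neighbor, count))
    return sorted(found, key=lambda item: (-item[2], item[1]))

if __name__ == "__main__":
    # Five clean copies of a toy read plus one read with a substitution.
    reads = ["ACGTACGT"] * 5 + ["ACGAACGT"]
    layers = build_layers(reads, k=4)
    print(candidates("CGAA", layers, min_count=3))
```

In this toy setup, a rare k-mer caused by a substitution finds its supported neighbor in the k layer, while one caused by a missing or extra base finds it in the (k+1) or (k-1) layer, which is one plausible reading of why such a lattice helps with indel errors. A real implementation would also need to normalize counts across layers, since shorter substrings are naturally more frequent.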