Error Correction Algorithms for Genomic Sequencing Data

Zhang, Xuan

Publication Type: Thesis
Issue Date: 2021

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Xuan
dc.date.accessioned	2022-01-25T01:32:22Z
dc.date.available	2022-01-25T01:32:22Z
dc.date.issued	2021
dc.identifier.uri	http://hdl.handle.net/10453/153540
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_US.UTF-8
dc.description.abstract	The rapid development of high-throughput next-generation sequencing (NGS) platforms has produced massive sets of genomic reads at low cost for a wide range of biomedical applications (e.g., de novo genome assembly, read alignment, resequencing, and single-nucleotide polymorphism discovery). A serious concern with these datasets is that machine-generated sequencing data contains many random errors (such as substitutions, insertions, and deletions). To the best of our knowledge, all existing methods have limitations. This work aims to rectify as many errors as possible by designing strategies adapted to specific cases. Three novel error correction algorithms are designed to provide high-quality sequencing data. The first is an instance-based error correction method designed to provide high-quality reads for any given instance case, implemented as a tool named InsEC. The instance-based strategy makes it possible to use data traits related only to an instance, which guarantees that we can approach the ground truth of the instance case and thus achieve better error correction performance. The second method is the first error correction method for miRNA reads. A novel lattice structure combining k-mers, (k-1)-mers, and (k+1)-mers is proposed to rectify errors; it is particularly effective for correcting indel errors. Extensive tests on datasets with known ground-truth errors demonstrate that the method removes almost all of the errors without introducing any new error, improving the data quality from one error in every 50 reads to one error in every 1300 reads. The third method is the first error correction method for small RNA that supports the rectification of substitution, insertion, and deletion errors. This method is more robust and supports all kinds of small RNA sequencing reads (read lengths from 20 to 200 nucleotides). Furthermore, we improve the three-layer lattice structure by building it from reads of the same length, length plus one, and length minus one, which dramatically increases the method’s efficiency. Finally, we take RNA isoforms into account and propose correcting proportionally to achieve a finer correction. All designed algorithms achieved high performance and provided high-quality sequencing data for all downstream analyses.	en_US.UTF-8
dc.format	Thesis (PhD)
dc.language.iso	en_US	en_US.UTF-8
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/153540/2/02whole.pdf
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	au.edu.uts.lib/ppc
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Error Correction Algorithms for Genomic Sequencing Data	en_US.UTF-8
dc.type	Thesis
utslib.copyright.status	open_access	*
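
The three-layer lattice mentioned in the abstract (reads of the same length, length plus one, and length minus one) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `neighbors` and `correct_reads`, the thresholds `max_rare` and `min_ratio`, and the rule "replace a rare read by its most frequent one-edit neighbor" are all assumptions made only for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def neighbors(seq):
    """Yield every sequence exactly one edit away from seq (hypothetical helper)."""
    # Same-length layer: single substitutions.
    for i, base in enumerate(seq):
        for alt in ALPHABET:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]
    # Length-minus-one layer: single deletions.
    for i in range(len(seq)):
        yield seq[:i] + seq[i + 1:]
    # Length-plus-one layer: single insertions.
    for i in range(len(seq) + 1):
        for alt in ALPHABET:
            yield seq[:i] + alt + seq[i:]


def correct_reads(reads, max_rare=1, min_ratio=10):
    """Replace rare reads by a much more frequent one-edit neighbor.

    max_rare and min_ratio are illustrative thresholds, not values
    taken from the thesis.
    """
    counts = Counter(reads)
    mapping = {}
    for seq, n in counts.items():
        if n > max_rare:
            continue  # frequent reads are assumed to be correct
        best = None
        for cand in set(neighbors(seq)):
            c = counts.get(cand, 0)
            # Accept only neighbors that are far more abundant than seq;
            # ties are broken deterministically by sequence order.
            if c >= min_ratio * n and (
                best is None or (c, cand) > (counts[best], best)
            ):
                best = cand
        if best is not None:
            mapping[seq] = best
    return [mapping.get(r, r) for r in reads]


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 20 + ["ACGTACGA", "ACGTACG", "ACGTTACGT", "TTTTGGGG"]
    print(Counter(correct_reads(reads)))
```

In this toy input, the substitution, deletion, and insertion variants are each mapped back to the abundant read, while the unrelated singleton is left untouched because it has no abundant neighbor in any of the three layers.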

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/153540