Instance-based error correction for short reads of disease-associated genes

Zhang, X; Liu, Y; Yu, Z; Blumenstein, M; Hutvagner, G; Li, J


Publisher: BioMed Central
Publication Type: Journal Article
Citation: BMC Bioinformatics, 2021, 22, (Suppl 6), pp. 1-17
Issue Date: 2021-06-02


This item is open access.


Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, X https://orcid.org/0000-0002-3089-9809
dc.contributor.author	Liu, Y https://orcid.org/0000-0002-7680-3155
dc.contributor.author	Yu, Z
dc.contributor.author	Blumenstein, M https://orcid.org/0000-0002-9908-3744
dc.contributor.author	Hutvagner, G https://orcid.org/0000-0002-7231-9446
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413
dc.date.accessioned	2021-10-02T22:59:59Z
dc.date.available	2021-03-02
dc.date.available	2021-10-02T22:59:59Z
dc.date.issued	2021-06-02
dc.identifier.citation	BMC Bioinformatics, 2021, 22, (Suppl 6), pp. 1-17
dc.identifier.issn	1471-2105
dc.identifier.uri	http://hdl.handle.net/10453/150806
dc.format	Electronic
dc.language	eng
dc.publisher	BioMed Central
dc.relation	http://purl.org/au-research/grants/arc/DP180100120
dc.relation.ispartof	BMC Bioinformatics
dc.relation.isbasedon	10.1186/s12859-021-04058-y
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	01 Mathematical Sciences, 06 Biological Sciences, 08 Information and Computing Sciences
dc.subject.classification	Bioinformatics
dc.subject.mesh	Sequence Analysis, DNA
dc.subject.mesh	Genomics
dc.subject.mesh	Genome
dc.subject.mesh	Algorithms
dc.subject.mesh	High-Throughput Nucleotide Sequencing
dc.title	Instance-based error correction for short reads of disease-associated genes
dc.type	Journal Article
utslib.citation.volume	22
utslib.location.activity	England
utslib.for	01 Mathematical Sciences
utslib.for	06 Biological Sciences
utslib.for	08 Information and Computing Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Strength - QSI - Centre for Quantum Software and Information
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Biomedical Engineering
pubs.organisational-group	/University of Technology Sydney/Centre for Health Technologies (CHT)
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2021-10-02T22:59:56Z
pubs.issue	Suppl 6
pubs.publication-status	Published
pubs.volume	22
utslib.citation.issue	Suppl 6

Abstract:

Background: Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed that aim to rectify all possible errors in the reads using generic genome-wide patterns. However, non-uniform sequencing depth hinders the global approach from removing errors effectively. Because some genes may be under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure that the relevant reads, rather than the whole genome, are error-free, which provides significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes.

Results: To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments were conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer-associated genes (including single-end and paired-end reads). The results demonstrate the superiority of our method, which achieves the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, contig length and contig quality).

Conclusion: The instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high-precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations of disease-causing point mutations.
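
The abstract names its evaluation measures (precision, recall, gain rate, N50) without defining them. The sketch below shows how such measures are commonly computed in the read error correction literature; the formulas, the function names `correction_metrics` and `n50`, and the base-level counts `tp`, `fp` and `fn` are assumptions made for illustration and may differ from the exact definitions used in the paper.

```python
from typing import Dict, Sequence


def correction_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and gain from base-level correction counts.

    Assumed meaning of the counts (not taken from the paper):
      tp: erroneous bases that were corrected properly
      fp: correct bases that were wrongly changed
      fn: erroneous bases that were left uncorrected
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Gain penalises newly introduced errors: (TP - FP) / (TP + FN).
    gain = (tp - fp) / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "gain": gain}


def n50(contig_lengths: Sequence[int]) -> int:
    """Length of the contig at which the running sum of contig lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs supplied


if __name__ == "__main__":
    print(correction_metrics(tp=90, fp=10, fn=10))
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Under these assumed definitions, gain becomes negative when a corrector introduces more errors than it removes, which is one way the over-correction described in the Background would show up.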

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/150806