Mining statistically-solid k-mers for accurate NGS error correction

Zhao, L; Xie, J; Bai, L; Chen, W; Wang, M; Zhang, Z; Wang, Y; Zhao, Z; Li, J

Publication Type: Journal Article
Citation: BMC Genomics, 2018, 19
Issue Date: 2018-12-31

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhao, L	en_US
dc.contributor.author	Xie, J	en_US
dc.contributor.author	Bai, L	en_US
dc.contributor.author	Chen, W	en_US
dc.contributor.author	Wang, M	en_US
dc.contributor.author	Zhang, Z	en_US
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Zhao, Z	en_US
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.date.issued	2018-12-31	en_US
dc.identifier.citation	BMC Genomics, 2018, 19	en_US
dc.identifier.uri	http://hdl.handle.net/10453/132205
dc.relation.ispartof	BMC Genomics	en_US
dc.relation.isbasedon	10.1186/s12864-018-5272-y	en_US
dc.subject.classification	Bioinformatics	en_US
dc.subject.mesh	Algorithms	en_US
dc.subject.mesh	Research Design	en_US
dc.subject.mesh	Data Mining	en_US
dc.subject.mesh	High-Throughput Nucleotide Sequencing	en_US
dc.title	Mining statistically-solid k-mers for accurate NGS error correction	en_US
dc.type	Journal Article
utslib.citation.volume	19	en_US
utslib.for	0804 Data Format	en_US
utslib.for	06 Biological Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	11 Medical and Health Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	19	en_US

Abstract:

© 2018 The Author(s).

Background: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f_0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove the small subset of solid k-mers that are likely to contain errors, and (ii) add the small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.

Results: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and combine them to determine f_0. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.

Conclusion: The z-score is adequate for distinguishing solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers of very low frequency. Applying the z-score to k-mers can markedly improve error-correction accuracy.
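
The selection steps named in the abstract (count k-mers, apply a frequency cutoff, refine the solid set with z-scores, store the result in a Bloom filter) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: the cutoff is passed in as a plain number rather than derived from the Gamma and Gaussian mixture fit; the z-score is computed against a k-mer's observed one-substitution neighbours, a reference population the abstract does not specify; the thresholds `z_drop` and `z_add`, the helper names, and the Bloom filter parameters are all hypothetical.

```python
import hashlib
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def neighbours(kmer):
    """Yield all k-mers that differ from kmer by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def z_score(kmer, counts):
    """Standard deviations between a k-mer's frequency and the group mean.

    Assumption: the group is the k-mer plus its observed neighbours.
    """
    group = [counts[kmer]] + [counts[n] for n in neighbours(kmer) if n in counts]
    sigma = pstdev(group)
    if sigma == 0:
        return 0.0
    return (counts[kmer] - mean(group)) / sigma


def select_solid(counts, f0, z_drop=-0.5, z_add=0.5):
    """Apply the cutoff f0, then refine with hypothetical z-score thresholds."""
    solid = set()
    for kmer, freq in counts.items():
        z = z_score(kmer, counts)
        if freq >= f0 and z > z_drop:
            solid.add(kmer)  # frequent, and not flagged as a likely error
        elif freq < f0 and z >= z_add:
            solid.add(kmer)  # infrequent, but stands out among its neighbours
    return solid


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]  # toy data, not from the paper
    counts = count_kmers(reads, k=4)
    bloom = BloomFilter()
    for kmer in select_solid(counts, f0=2):  # f0=2 is an arbitrary cutoff
        bloom.add(kmer)
    print("ACGT" in bloom)
```

A corrector built on this would query the Bloom filter for each k-mer of a read and treat k-mers that are absent as candidates for correction; that step is outside the sketch.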

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/132205