Noise2read: Accurately Rectify Millions of Erroneous Short Reads Through Graph Learning on Edit Distances.

Publisher:
Oxford University Press (OUP)
Publication Type:
Journal Article
Citation:
Genomics Proteomics Bioinformatics, 2025, pp. qzaf120
Issue Date:
2025-11-29
Although the per-base error rate of short-read sequencing data is very low (0.1%-0.5%), the proportion of erroneous reads in a dataset can be as high as 10%-15%, which can amount to millions of reads. Because current methods correct only some errors while introducing many new ones, we address this problem by restoring erroneous reads to their original states without introducing any non-existent reads, thereby preserving data integrity. The novelty lies in a computable rule derived from the polymerase chain reaction (PCR) error mechanism: a rare read is erroneous if it has a neighbouring read of high abundance. Following this principle, we construct a graph that links each pair of reads separated by a tiny edit distance in order to detect a substantial portion of the erroneous reads; we then use these read pairs as training data to learn the error mechanism and identify the remaining hard-case errors between pairs of high-abundance reads. The proposed approach, noise2read, can rectify erroneous reads in short-read sequencing data whenever PCR is involved. Compared with state-of-the-art methods on tens of evaluation datasets with unique molecular identifier (UMI)-based ground truth, noise2read performs significantly better on 19 metrics. Case studies found that noise2read can greatly improve short-read quality and has a substantial impact on genome abundance quantification, isoform identification, single nucleotide polymorphism (SNP) profiling, and genome editing efficiency estimation. Noise2read is publicly available at https://github.com/JappyPing/noise2read and https://ngdc.cncb.ac.cn/biocode/tool/7951.
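The abundance rule stated in the abstract can be illustrated with a minimal Python sketch. This is not noise2read's implementation or API: the function names `edit1_neighbours` and `rectify`, the thresholds `low` and `high`, and the restriction to an edit distance of exactly 1 are all assumptions chosen for illustration, and the learning step for hard cases between high-abundance reads is omitted entirely. The sketch only rewrites a rare read to a read that already exists in the dataset, so no new sequence is introduced.

```python
from collections import Counter

ALPHABET = "ACGT"


def edit1_neighbours(read):
    """Yield every sequence at edit distance exactly 1 from `read` (duplicates possible)."""
    n = len(read)
    for i in range(n):
        yield read[:i] + read[i + 1:]              # deletion at position i
        for base in ALPHABET:
            if base != read[i]:
                yield read[:i] + base + read[i + 1:]  # substitution at position i
    for i in range(n + 1):
        for base in ALPHABET:
            yield read[:i] + base + read[i:]        # insertion before position i


def rectify(reads, low=2, high=5):
    """Rewrite rare reads to a unique high-abundance neighbour.

    `low` and `high` are hypothetical abundance thresholds: a read seen at
    most `low` times is treated as rare, and a read seen at least `high`
    times is treated as high-abundance.
    """
    counts = Counter(reads)
    correction = {}
    for read, count in counts.items():
        if count > low:
            continue  # not rare, so left untouched in this sketch
        # Edges of the graph: existing high-abundance reads one edit away.
        candidates = {
            nb for nb in edit1_neighbours(read) if counts.get(nb, 0) >= high
        }
        if len(candidates) == 1:
            correction[read] = candidates.pop()
        # Zero or several candidates: ambiguous, so the read is kept as-is.
    return [correction.get(r, r) for r in reads]


reads = ["ACGTACGT"] * 6 + ["ACGTACGA", "TTTTCCCC"]
rectified = rectify(reads)
```

In this toy input, the single read `ACGTACGA` differs by one substitution from a read observed six times and is rewritten to it, while `TTTTCCCC` has no high-abundance neighbour and is left unchanged. Leaving ambiguous reads untouched is a conservative choice of the sketch, made to avoid introducing new errors.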