Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Publication Type:
Journal Article
Citation:
Bioinformatics, 2019, 35 (12), pp. 2066 - 2074
Issue Date:
2019-06-01
Filename:
OCC-146545_AM.pdf
Description:
Accepted Manuscript Version
Size:
511.78 kB (Adobe PDF)
© The Author(s) 2018.
Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads according to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on novel ways of constructing de novo assembled contigs.
Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and groups together those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between the two contigs. This merging process is repeated on the longer contigs until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size in 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the read-order-preserving mode. This performance is mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs.
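The pipeline outlined in the abstract (group reads by minimizer, then repeatedly merge contigs whose suffix and prefix overlap) can be illustrated with a minimal Python sketch. This is not the minicom implementation. All function names and the parameters `w`, `k` and `min_len` are hypothetical, minimizers are ordered lexicographically, overlaps must match exactly (the abstract speaks of an overlap similarity, so the real method presumably tolerates differences), and each "contig" is simply a string.

```python
from collections import defaultdict


def kmer_minimizer(seq, k):
    """Smallest k-mer of seq under lexicographic order (an assumed ordering)."""
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def group_by_minimizer(reads, k):
    """Bucket reads that share the same k-minimizer."""
    groups = defaultdict(list)
    for read in reads:
        groups[kmer_minimizer(read, k)].append(read)
    return dict(groups)


def wk_minimizers(seq, w, k):
    """Set of (w, k)-minimizers: the smallest k-mer in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    n_windows = max(1, len(kmers) - w + 1)
    return {min(kmers[i:i + w]) for i in range(n_windows)}


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len >= 1)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge_pair(a, b, min_len):
    """Merge a and b on an exact suffix-prefix overlap, or return None."""
    n = suffix_prefix_overlap(a, b, min_len)
    return a + b[n:] if n else None


def merge_all(contigs, w, k, min_len):
    """Greedily merge contigs until no pair can be merged.

    Only pairs sharing at least one (w, k)-minimizer are tested, which
    stands in for the minimizer index described in the abstract.
    """
    contigs = list(contigs)
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j or not (wk_minimizers(a, w, k) & wk_minimizers(b, w, k)):
                    continue
                m = merge_pair(a, b, min_len)
                if m is not None:
                    rest = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
                    contigs = rest + [m]
                    merged = True
                    break
            if merged:
                break
    return contigs


groups = group_by_minimizer(["ACGTAC", "GACGTA", "TTTT"], k=4)
# groups == {"ACGT": ["ACGTAC", "GACGTA"], "TTTT": ["TTTT"]}
result = merge_all(["GACGTA", "ACGTAC", "TTTT"], w=2, k=4, min_len=3)
# result == ["TTTT", "GACGTAC"]
```

In this toy run, the two reads sharing the minimizer `ACGT` overlap by five bases and collapse into the single contig `GACGTAC`, while `TTTT` shares no minimizer with it and is left alone.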