Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Liu, Y; Yu, Z; Dinger, ME; Li, J

Publication Type: Journal Article
Citation: Bioinformatics, 2019, 35 (12), pp. 2066-2074
Issue Date: 2019-06-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, Y https://orcid.org/0000-0002-7680-3155	en_US
dc.contributor.author	Yu, Z	en_US
dc.contributor.author	Dinger, ME	en_US
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.date.available	2020-05-25T19:04:35Z
dc.date.issued	2019-06-01	en_US
dc.identifier.citation	Bioinformatics, 2019, 35 (12), pp. 2066 - 2074	en_US
dc.identifier.issn	1367-4803	en_US
dc.identifier.uri	http://hdl.handle.net/10453/135522
dc.relation	http://purl.org/au-research/grants/arc/DP180100120
dc.relation.ispartof	Bioinformatics	en_US
dc.relation.isbasedon	10.1093/bioinformatics/bty936	en_US
dc.subject.classification	Bioinformatics	en_US
dc.title	Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression	en_US
dc.type	Journal Article
utslib.citation.volume	12	en_US
utslib.citation.volume	35	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	01 Mathematical Sciences	en_US
utslib.for	06 Biological Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	open_access
pubs.issue	12	en_US
pubs.publication-status	Published	en_US
pubs.volume	35	en_US

Abstract:

© The Author(s) 2018.

Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of reads data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective reads compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on the novel construction of de novo assembled contigs.

Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These excellent performances are mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs.
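The two indexing ideas named in the abstract, grouping reads by a shared k-minimizer and comparing contigs by suffix-prefix overlap, can be illustrated with a minimal Python sketch. This is not the minicom implementation: the function names are hypothetical, the minimizer ordering is assumed to be plain lexicographic order on forward-strand k-mers, and the overlap test is assumed to be an exact match, whereas the abstract describes an overlap similarity whose details are not given here.

```python
from collections import defaultdict


def kmer_minimizer(read, k):
    """Return the smallest k-mer of `read` (lexicographic order is an assumption)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def wk_minimizers(seq, w, k):
    """Return the set of (k-mer, position) pairs that are the smallest k-mer
    in some window of w consecutive k-mers (leftmost wins on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((window[offset], start + offset))
    return picked


def group_reads_by_minimizer(reads, k):
    """Bucket reads that share the same k-minimizer."""
    groups = defaultdict(list)
    for read in reads:
        groups[kmer_minimizer(read, k)].append(read)
    return dict(groups)


def suffix_prefix_overlap(left, right, min_len):
    """Length of the longest exact suffix of `left` equal to a prefix of `right`,
    or 0 if no overlap of at least `min_len` exists (exact matching is an assumption)."""
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG", "TTTTGG"]
    print(group_reads_by_minimizer(reads, k=3))
    print(sorted(wk_minimizers("ACGTAC", w=2, k=3)))
    print(suffix_prefix_overlap("ACGTAC", "GTACGG", min_len=3))
```

In this sketch, reads that fall into the same bucket are candidates for building one contig, and two contigs that share a (w, k)-minimizer near the end of one and the start of the other are candidates for a suffix-prefix overlap check before merging.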

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/135522