Allowing mutations in maximal matches boosts genome compression performance.

Liu, Y; Wong, L; Li, J

Allowing mutations in maximal matches boosts genome compression performance.

Liu, Y

Wong, L Li, J

Permalink

Publisher:: Oxford University Press
Publication Type:: Journal Article
Citation:: Bioinformatics, 2020, 36, (18), pp. 4675-4681
Issue Date:: 2020-09-15

Open Access

Copyright Clearance Process

Recently Added
In Progress
Open Access

This item is open access.

The embargo period expires on 15 Sep 2021

Adobe PDF

Download Accepted Manuscript VersionAdobe PDF (666.41 kB)

View on publisher's site

View statistics

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, Y https://orcid.org/0000-0002-7680-3155
dc.contributor.author	Wong, L
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413
dc.date.accessioned	2021-04-14T04:20:32Z
dc.date.available	2020-06-10
dc.date.available	2021-04-14T04:20:32Z
dc.date.issued	2020-09-15
dc.identifier.citation	Bioinformatics, 2020, 36, (18), pp. 4675-4681
dc.identifier.issn	1367-4803
dc.identifier.issn	1367-4811
dc.identifier.uri	http://hdl.handle.net/10453/148085
dc.description.abstract	Motivation A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme, the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long and MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefits to genome data storage and transmission.
dc.format	Print
dc.language	eng
dc.publisher	Oxford University Press
dc.relation	http://purl.org/au-research/grants/arc/DP180100120
dc.relation.ispartof	Bioinformatics
dc.relation.isbasedon	10.1093/bioinformatics/btaa572
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject	01 Mathematical Sciences, 06 Biological Sciences, 08 Information and Computing Sciences
dc.subject.classification	Bioinformatics
dc.subject.mesh	Algorithms
dc.subject.mesh	Data Compression
dc.subject.mesh	Genome
dc.subject.mesh	High-Throughput Nucleotide Sequencing
dc.subject.mesh	Mutation
dc.subject.mesh	Sequence Analysis, DNA
dc.subject.mesh	Software
dc.subject.mesh	Sequence Analysis, DNA
dc.subject.mesh	Mutation
dc.subject.mesh	Genome
dc.subject.mesh	Algorithms
dc.subject.mesh	Data Compression
dc.subject.mesh	Software
dc.subject.mesh	High-Throughput Nucleotide Sequencing
dc.subject.mesh	Algorithms
dc.subject.mesh	Data Compression
dc.subject.mesh	Genome
dc.subject.mesh	High-Throughput Nucleotide Sequencing
dc.subject.mesh	Mutation
dc.subject.mesh	Sequence Analysis, DNA
dc.subject.mesh	Software
dc.title	Allowing mutations in maximal matches boosts genome compression performance.
dc.type	Journal Article
utslib.citation.volume	36
utslib.location.activity	England
utslib.for	01 Mathematical Sciences
utslib.for	06 Biological Sciences
utslib.for	08 Information and Computing Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/A/DRsch The Data Science Institute
utslib.copyright.status	open_access	*
pubs.consider-herdc	true
utslib.copyright.embargo	2021-09-15T00:00:00+1000Z
dc.date.updated	2021-04-14T04:20:31Z
pubs.issue	18
pubs.publication-status	Published
pubs.volume	36
utslib.citation.issue	18

Abstract:

Motivation A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme, the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long and MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefits to genome data storage and transmission.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148085