Allowing mutations in maximal matches boosts genome compression performance.
- Publisher:
- Oxford University Press
- Publication Type:
- Journal Article
- Citation:
- Bioinformatics, 2020, 36, (18), pp. 4675-4681
- Issue Date:
- 2020-09-15
Open Access
Copyright Clearance Process
- Recently Added
- In Progress
- Open Access
This item is open access.
The embargo period expires on 15 Sep 2021
Full metadata record
Field | Value | Language |
---|---|---|
dc.contributor.author |
Liu, Y https://orcid.org/0000-0002-7680-3155 |
|
dc.contributor.author | Wong, L | |
dc.contributor.author |
Li, J https://orcid.org/0000-0003-1833-7413 |
|
dc.date.accessioned | 2021-04-14T04:20:32Z | |
dc.date.available | 2020-06-10 | |
dc.date.available | 2021-04-14T04:20:32Z | |
dc.date.issued | 2020-09-15 | |
dc.identifier.citation | Bioinformatics, 2020, 36, (18), pp. 4675-4681 | |
dc.identifier.issn | 1367-4803 | |
dc.identifier.issn | 1367-4811 | |
dc.identifier.uri | http://hdl.handle.net/10453/148085 | |
dc.description.abstract | Motivation A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme, the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long and MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefits to genome data storage and transmission. | |
dc.format | ||
dc.language | eng | |
dc.publisher | Oxford University Press | |
dc.relation | http://purl.org/au-research/grants/arc/DP180100120 | |
dc.relation.ispartof | Bioinformatics | |
dc.relation.isbasedon | 10.1093/bioinformatics/btaa572 | |
dc.rights | info:eu-repo/semantics/embargoedAccess | |
dc.subject | 01 Mathematical Sciences, 06 Biological Sciences, 08 Information and Computing Sciences | |
dc.subject.classification | Bioinformatics | |
dc.subject.mesh | Algorithms | |
dc.subject.mesh | Data Compression | |
dc.subject.mesh | Genome | |
dc.subject.mesh | High-Throughput Nucleotide Sequencing | |
dc.subject.mesh | Mutation | |
dc.subject.mesh | Sequence Analysis, DNA | |
dc.subject.mesh | Software | |
dc.subject.mesh | Sequence Analysis, DNA | |
dc.subject.mesh | Mutation | |
dc.subject.mesh | Genome | |
dc.subject.mesh | Algorithms | |
dc.subject.mesh | Data Compression | |
dc.subject.mesh | Software | |
dc.subject.mesh | High-Throughput Nucleotide Sequencing | |
dc.subject.mesh | Algorithms | |
dc.subject.mesh | Data Compression | |
dc.subject.mesh | Genome | |
dc.subject.mesh | High-Throughput Nucleotide Sequencing | |
dc.subject.mesh | Mutation | |
dc.subject.mesh | Sequence Analysis, DNA | |
dc.subject.mesh | Software | |
dc.title | Allowing mutations in maximal matches boosts genome compression performance. | |
dc.type | Journal Article | |
utslib.citation.volume | 36 | |
utslib.location.activity | England | |
utslib.for | 01 Mathematical Sciences | |
utslib.for | 06 Biological Sciences | |
utslib.for | 08 Information and Computing Sciences | |
pubs.organisational-group | /University of Technology Sydney | |
pubs.organisational-group | /University of Technology Sydney/Faculty of Engineering and Information Technology | |
pubs.organisational-group | /University of Technology Sydney/Strength - CHT - Health Technologies | |
pubs.organisational-group | /University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre | |
pubs.organisational-group | /University of Technology Sydney/Faculty of Engineering and Information Technology/A/DRsch The Data Science Institute | |
utslib.copyright.status | open_access | * |
pubs.consider-herdc | true | |
utslib.copyright.embargo | 2021-09-15T00:00:00+1000Z | |
dc.date.updated | 2021-04-14T04:20:31Z | |
pubs.issue | 18 | |
pubs.publication-status | Published | |
pubs.volume | 36 | |
utslib.citation.issue | 18 |
Abstract:
Motivation
A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations.
Results
We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme, the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long and MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefits to genome data storage and transmission.
Please use this identifier to cite or link to this item:
Download statistics for the last 12 months
Not enough data to produce graph