Data mining for high performance compression of genomic reads and sequences

Publication Type:
Thesis
Issue Date:
2019
The rapid development of next-generation sequencing (NGS) technologies has revolutionized almost all fields of genetics. However, the massive amount of genomic data produced by NGS presents great challenges to data storage, transmission and analysis. Among the various NGS-related big data challenges, this thesis focuses on short reads data compression, assembled genome compression and maximal exact match (MEM) detection. First, we propose a new de novo compression algorithm for short reads data. The method utilizes minimizers to exploit the redundant information present in reads. Specifically, large k-minimizers are used to group reads, and (w, k)-minimizers are used to search for suffix-prefix overlaps between contigs. Our experiments show that the proposed method achieves a better compression ratio than existing methods. Furthermore, we present a high-performance reference-based genome compression algorithm. It is based on a 2-bit encoding scheme and an advanced greedy-matching search over a global hash table. The compression ratio of our method is at least 1.9 times better than that of the best competing algorithm in its best case, and our compression speed is also at least 2.9 times faster. Finally, we introduce a method to detect all MEMs between pairs of large genomes. The method conducts fixed k-mer sampling on the query sequence, and the index k-mers of the reference sequence are filtered via a Bloom filter. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. Overall, this thesis develops efficient algorithms for pattern discovery in, and compression of, large-scale genomic sequences.
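To make the minimizer-based grouping concrete, below is a minimal Python sketch of k-minimizer and (w, k)-minimizer selection. It assumes plain lexicographic ordering of k-mers, and the function names (kmers, k_minimizer, wk_minimizers) are illustrative rather than taken from the thesis; the actual ordering, parameters and grouping logic may differ.

```python
# Minimal sketch of minimizer selection, assuming lexicographic ordering of
# k-mers; function names and parameters are illustrative.

def kmers(seq, k):
    """Yield all k-mers of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def k_minimizer(seq, k):
    """A single large-k minimizer, here the lexicographically smallest k-mer,
    can serve as a key for grouping similar reads."""
    return min(kmers(seq, k))

def wk_minimizers(seq, w, k):
    """(w, k)-minimizers: for every window of w consecutive k-mers, keep the
    smallest one; overlapping sequences tend to share these minimizers, so
    the sets can be compared to screen for suffix-prefix overlaps."""
    km = list(kmers(seq, k))
    return {min(km[i:i + w]) for i in range(len(km) - w + 1)}

print(k_minimizer("ACGTACGTTG", 5))           # ACGTA
print(wk_minimizers("ACGTACGTTG", w=3, k=4))  # {'ACGT', 'CGTA'}
```

Reads sharing the same k-minimizer would be placed in the same group, and candidate contig overlaps would be screened by intersecting their (w, k)-minimizer sets.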
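The reference-based compressor is described as combining a 2-bit encoding scheme with a greedy-matching search over a global hash table. The sketch below illustrates only the 2-bit packing step, assuming a clean A/C/G/T alphabet; handling of ambiguous bases and the matching stage are omitted.

```python
# Minimal sketch of 2-bit nucleotide packing; the bit assignment and the
# treatment of ambiguous bases (e.g. 'N') in the thesis may differ.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack up to 4 bases into each output byte (2 bits per base).
    A real codec would also record the sequence length so a partial
    final byte can be decoded unambiguously."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

print(pack_2bit("ACGTACGT").hex())  # '1b1b'
```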
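For the MEM detector, the abstract describes fixed k-mer sampling on the query and Bloom-filter screening of the reference index k-mers. The following sketch shows that screening idea under assumed parameters (filter size, hash count, and a sampling stride equal to k); extending the surviving positions into maximal exact matches is not shown.

```python
# Minimal sketch of fixed-stride k-mer sampling plus Bloom-filter screening;
# sizes, hash functions and the stride are illustrative assumptions.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            h = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little"))
            yield int.from_bytes(h.digest(), "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def sampled_kmers(seq, k, stride):
    """Fixed k-mer sampling: take one k-mer every `stride` positions."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Build the filter from the sampled query k-mers, then keep only reference
# positions whose k-mer may also occur in the query; these survivors are the
# candidates that would be extended into maximal exact matches.
query, reference, k = "ACGTACGTTGCA", "TTGCAACGTACG", 5
bf = BloomFilter()
for km in sampled_kmers(query, k, stride=k):
    bf.add(km)
candidates = [i for i in range(len(reference) - k + 1)
              if reference[i:i + k] in bf]
print(candidates)
```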