Data mining for high performance compression of genmoic reads and sequences

Publication Type:
Issue Date:
Full metadata record
The rapid development of next-generation sequencing (NGS) technologies has revolutionized almost all fields of genetics. However, the massive amount of genomic data produced by NGS presents great challenges to data storage, transmission and analysis. Among various NGS-related big data challenges, in this thesis, we focus on short reads data compression, assembled genome compression and maximal exact matches (MEMs) detection. First we propose a new 𝒹𝑒 𝓃𝑜𝓋𝑜 compression algorithm for short reads data. The method utilizes minimizers to exploit the redundant information presented in reads. Specifically, large 𝘬-minimizers are used to group reads and (𝘸, 𝘬)-minimizers are used to search suffix-prefix overlap similarity between two contigs. Our experiments show that the proposed method achieves better compression ratio than the existing methods. Furthermore, we present a high-performance reference-based genome compression algorithm. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a global hash table. The compression ratio of our method is at least 1.9 times better than the best competing algorithm on its best case, and our compression speed is also at least 2.9 times faster. Finally we introduce a method to detect all MEMs from pairs of large genomes. The method conducts a fixed k-mer sampling on the query sequence and the index 𝘬-mers are filtered from the reference sequence via a Bloom filter. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. Overall, this thesis work has developed efficient algorithms for pattern discovery from and for data compression of genomic sequences of big size.
Please use this identifier to cite or link to this item: