The integration of fine-scale DNA-DNA associations by inclusion of Hi-C DNA cross-linking information into metagenomic community analysis

Publication Type: Thesis
Issue Date: 2019
Much of our understanding of the microbial world has been obtained using culture-based methodologies, a paradigm that has stood since the 19th century. And yet it has long been known that most of the Earth’s microbial species are resistant to laboratory culture. It is reasonable to expect, therefore, that applying equal scrutiny to all microbial life will lead to significant discoveries. Motivated by this, metagenomics eliminates the culturing dependency by directly sampling DNA from an environment, successfully shedding light on the once unseen majority. The technical limitations of present-day sequencing technologies have meant, however, that in achieving culture-independence, traditional shotgun metagenomic sequencing experiments make a considerable sacrifice. That sacrifice comes in the form of information loss: in preparing DNA for sequencing, much of the “same-cell” and “same-chromosome” information is destroyed, information that is essential when reconstructing individual genomes. Purely computational solutions to overcoming this sacrifice have proved insufficient and have been surpassed by strategies that change the experimental design to reduce the information loss. A recent strategy is the inclusion of a new form of sequencing data, provided by the Hi-C sequencing technique. Originally conceived to study the three-dimensional structure of chromatin, the Hi-C sequencing technique captures in vivo proximity interactions between DNA loci in an all-vs-all manner. When applied to direct metagenomic sampling, the physical structure of the microbial community (chromosome, cell and community) strongly influences the probability of observing proximity interactions between loci, and this pronounced modulation can be exploited to recover the information lost during shotgun sequencing.
This thesis details the effective integration of Hi-C into metagenomic sequencing studies to accurately reconstruct individual genomes, thereby deconvoluting the metagenome. To accomplish this, an in silico investigation was first conducted into the effectiveness of graph clustering as a means of metagenome deconvolution, in which assembly contigs defined the nodes and Hi-C proximity interactions defined the edges. A parametric sweep of experimental and community composition parameters was carried out, exploring how the degree of evolutionary divergence (from species to strains) affected the quality of deconvolution. For each iterate in the sweep, a ground truth was constructed and quality was assessed using a novel external validation measure supporting overlapping clusters and variable object weights. This work led to the design and implementation of the first metagenomic Hi-C read-pair simulator, sim3C, capable of simulating complex community definitions and simple three-dimensional structural elements. In pursuit of the final objective of metagenome deconvolution, sim3C enabled an externally validated development process. Lastly, as the outcome of that final objective, bin3C is demonstrated: an open-source solution to Hi-C driven metagenome deconvolution. In an unsupervised manner, bin3C reconstructs individual genomes from metagenomic data. Using external validation of simulated data, bin3C is shown to have high precision and good recall. When a real human microbiome was analysed, bin3C achieved leading performance, resolving 20 more nearly-complete metagenome-assembled genomes (MAGs), a 57% gain, than its closest competitor.
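The graph formulation described above, with contigs as nodes and Hi-C proximity interactions as edges, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the `min_links` threshold and the toy read-pairs are hypothetical, and connected components over thresholded edges stands in for a real graph clustering algorithm. It is not the method used in the thesis or in bin3C.

```python
from collections import defaultdict


def build_contact_graph(read_pairs):
    """Count Hi-C read-pairs that link each pair of distinct contigs.

    read_pairs: iterable of (contig_a, contig_b), one entry per read-pair.
    Returns a dict mapping a sorted (contig, contig) tuple to a link count.
    """
    weights = defaultdict(int)
    for a, b in read_pairs:
        if a != b:  # intra-contig pairs say nothing about which contigs co-occur
            weights[tuple(sorted((a, b)))] += 1
    return weights


def cluster_contigs(contigs, weights, min_links=2):
    """Group contigs into bins: connected components over edges that have
    at least min_links supporting read-pairs (hypothetical threshold)."""
    parent = {c: c for c in contigs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w >= min_links:
            parent[find(a)] = find(b)

    bins = defaultdict(list)
    for c in contigs:
        bins[find(c)].append(c)
    return sorted(sorted(members) for members in bins.values())


# Toy data (invented): two genomes, joined by a single spurious contact.
contigs = ["c1", "c2", "c3", "c4", "c5"]
read_pairs = (
    [("c1", "c2")] * 5
    + [("c3", "c2")] * 3
    + [("c4", "c5")] * 4
    + [("c3", "c4")]       # weak inter-genome noise
    + [("c1", "c1")] * 7   # intra-contig pairs, ignored
)

weights = build_contact_graph(read_pairs)
genome_bins = cluster_contigs(contigs, weights, min_links=2)
print(genome_bins)
```

In this toy case the single contact between `c3` and `c4` falls below the threshold, so the two groups remain separate. Real data would call for normalisation of contact counts and a more robust clustering method than a fixed cut-off.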