Evaluation of various distance computation methods for construction of haplotype-based phylogenies from large MLST datasets.

Jacobson, D; Zheng, Y; Plucinski, MM; Qvarnstrom, Y; Barratt, JLN

Publisher: Elsevier
Publication Type: Journal Article
Citation: Molecular Phylogenetics and Evolution, 2022, 177, pp. 1-13
Issue Date: 2022-12

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Jacobson, D
dc.contributor.author	Zheng, Y
dc.contributor.author	Plucinski, MM
dc.contributor.author	Qvarnstrom, Y
dc.contributor.author	Barratt, JLN
dc.date.accessioned	2022-12-06T02:32:01Z
dc.date.available	2022-08-05
dc.date.available	2022-12-06T02:32:01Z
dc.date.issued	2022-12
dc.identifier.citation	Molecular Phylogenetics and Evolution, 2022, 177, pp. 1-13
dc.identifier.issn	1055-7903
dc.identifier.issn	1095-9513
dc.identifier.uri	http://hdl.handle.net/10453/164146
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Elsevier
dc.relation.ispartof	Molecular Phylogenetics and Evolution
dc.relation.isbasedon	10.1016/j.ympev.2022.107608
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	0603 Evolutionary Biology, 0604 Genetics, 0608 Zoology
dc.subject.classification	Evolutionary Biology
dc.subject.mesh	Cyclospora
dc.subject.mesh	Haplotypes
dc.subject.mesh	Multilocus Sequence Typing
dc.subject.mesh	Nucleotides
dc.subject.mesh	Phylogeny
dc.title	Evaluation of various distance computation methods for construction of haplotype-based phylogenies from large MLST datasets.
dc.type	Journal Article
utslib.citation.volume	177
utslib.location.activity	United States
utslib.for	0603 Evolutionary Biology
utslib.for	0604 Genetics
utslib.for	0608 Zoology
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Life Sciences
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2022-12-06T02:31:59Z
pubs.publication-status	Published
pubs.volume	177

Abstract:

Multi-locus sequence typing (MLST) is widely used to investigate genetic relationships among eukaryotic taxa, including parasitic pathogens. MLST analysis workflows typically involve construction of alignment-based phylogenetic trees, i.e., trees whose structures are computed from nucleotide differences observed in a multiple sequence alignment (MSA). Notably, alignment-based phylogenetic methods require that each isolate/taxon is represented by a single sequence. When multiple loci are sequenced, these sequences may be concatenated to produce one tree that includes information from all loci. Alignment-based phylogenetic techniques are robust and widely used, yet they possess some shortcomings, including how heterozygous sites are handled, intolerance for missing data (i.e., partial genotypes), and differences in the way insertions-deletions (indels) are scored/treated during tree construction. In certain contexts, 'haplotype-based' methods may represent a viable alternative to alignment-based techniques, as they do not possess these limitations. This is because haplotype-based methods assess genetic similarity based on numbers of shared (i.e., intersecting) haplotypes as opposed to similarities in nucleotide composition observed in an MSA. For haplotype-based comparisons, choosing an appropriate distance statistic is fundamental, and several statistics are available. However, a comprehensive assessment of the available statistics for their ability to produce a robust haplotype-based phylogenetic reconstruction has not yet been performed. We evaluated seven distance statistics by applying them to extant MLST datasets from the gastrointestinal parasite Cyclospora cayetanensis and two species of pathogenic nematode of the genus Strongyloides. We compared the genetic relationships identified using each statistic to epidemiologic, geographic, and host metadata. We show that Barratt's heuristic definition of genetic distance was the most robust among the statistics evaluated. Consequently, we propose that Barratt's heuristic represents a useful approach for challenging MLST datasets possessing features (i.e., high heterozygosity, partial genotypes, and indel or repeat-based polymorphisms) that confound or preclude the use of alignment-based methods.
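
The idea of scoring similarity by shared (intersecting) haplotypes can be illustrated with a minimal sketch. Nothing below is taken from the paper: the Jaccard distance is used only as a familiar set-based statistic and is not claimed to be Barratt's heuristic or one of the seven statistics evaluated. The data layout (a dictionary mapping each locus to a set of haplotype labels), the function names, and the rule of comparing only loci genotyped in both isolates are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two isolates given as {locus: set_of_haplotypes}.

    Assumption: only loci genotyped in BOTH isolates are compared, so a
    partial genotype is not counted as a difference. Returns None when the
    two isolates have no locus in common.
    """
    shared_loci = a.keys() & b.keys()
    if not shared_loci:
        return None
    # Tag each haplotype with its locus so labels from different loci never collide.
    haps_a = {(locus, h) for locus in shared_loci for h in a[locus]}
    haps_b = {(locus, h) for locus in shared_loci for h in b[locus]}
    union = haps_a | haps_b
    if not union:
        return None
    return 1.0 - len(haps_a & haps_b) / len(union)


def distance_matrix(isolates):
    """Pairwise distances keyed by (name_x, name_y), with names in sorted order."""
    names = sorted(isolates)
    return {
        (x, y): jaccard_distance(isolates[x], isolates[y])
        for x, y in combinations(names, 2)
    }


# Hypothetical toy data: iso1 is heterozygous at locusA, iso3 lacks locusB.
isolates = {
    "iso1": {"locusA": {"A1", "A2"}, "locusB": {"B1"}},
    "iso2": {"locusA": {"A1"}, "locusB": {"B1"}},
    "iso3": {"locusA": {"A3"}},
}

matrix = distance_matrix(isolates)
for pair, dist in matrix.items():
    print(pair, dist)
```

In this toy example the heterozygous isolate keeps both of its haplotypes at locusA and the isolate missing locusB is still compared at the locus it does have, which are the two situations the abstract identifies as difficult for alignment-based methods. A pairwise matrix of this kind could then be passed to a clustering or tree-building step.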

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/164146