Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes

Peng, H; Lan, C; Liu, Y; Liu, T; Blumenstein, M; Li, J

Publication Type: Journal Article
Citation: Oncotarget, 2017, 8 (45), pp. 78901 - 78916
Issue Date: 2017-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Peng, H https://orcid.org/0000-0002-4379-8097	en_US
dc.contributor.author	Lan, C	en_US
dc.contributor.author	Liu, Y https://orcid.org/0000-0002-7680-3155	en_US
dc.contributor.author	Liu, T	en_US
dc.contributor.author	Blumenstein, M https://orcid.org/0000-0002-9908-3744	en_US
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.date.available	2017-07-19	en_US
dc.date.issued	2017-01-01	en_US
dc.identifier.citation	Oncotarget, 2017, 8 (45), pp. 78901 - 78916	en_US
dc.identifier.uri	http://hdl.handle.net/10453/123635
dc.description.abstract	© Peng et al. Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease-gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the chromosome 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease-gene-enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.	en_US
dc.relation	http://purl.org/au-research/grants/arc/DP130102124
dc.relation.ispartof	Oncotarget	en_US
dc.relation.isbasedon	10.18632/oncotarget.20481	en_US
dc.title	Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes	en_US
dc.type	Journal Article
utslib.citation.volume	45	en_US
utslib.citation.volume	8	en_US
utslib.for	0102 Applied Mathematics	en_US
utslib.for	1112 Oncology and Carcinogenesis	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - QSI - Centre for Quantum Software and Information
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access
pubs.issue	45	en_US
pubs.publication-status	Published	en_US
pubs.volume	8	en_US

Abstract:

© Peng et al. Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease-gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the chromosome 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease-gene-enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
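
The abstract does not give the exact formula behind the 45-element sub-vector, so the following is only a minimal sketch of one plausible reading: each element is the entropy term -p log2(p), where p is the fraction of a disease's known genes that fall in a given chromosome substructure. The function name `entropy_subvector`, the substructure labels, and the toy gene list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy_subvector(gene_locations, substructures):
    """Return one entropy term per chromosome substructure for one disease.

    gene_locations: substructure label of each disease gene (hypothetical input).
    substructures: ordered list of all substructure labels (45 in the paper;
                   only 4 are used in the toy example below).
    ASSUMPTION: element i is -p_i * log2(p_i); the paper may define it differently.
    """
    counts = Counter(gene_locations)
    total = sum(counts[s] for s in substructures)
    vector = []
    for s in substructures:
        if total == 0 or counts[s] == 0:
            # Convention: an empty substructure contributes 0 (0 * log 0 := 0).
            vector.append(0.0)
        else:
            p = counts[s] / total
            vector.append(-p * math.log2(p))
    return vector


# Toy example with made-up gene locations for a single disease.
substructures = ["1p", "1q", "6p", "21p"]
genes = ["6p", "6p", "1q", "6p"]
vec = entropy_subvector(genes, substructures)
```

Under this assumption, substructures that hold none of the disease's genes contribute zero, and summing the elements gives the Shannon entropy of the disease's gene distribution over the substructures.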

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/123635