PCAN: Probabilistic correlation analysis of two non-normal data sets

Zoh, RS; Mallick, B; Ivanov, I; Baladandayuthapani, V; Manyam, G; Chapkin, RS; Lampe, JW; Carroll, RJ

Publication Type: Journal Article
Citation: Biometrics, 2016, 72 (4), pp. 1358 - 1368
Issue Date: 2016-12-01

Closed Access

Filename	Description	Size	Format
Zoh_et_al-2016-Biometrics.pdf	Published Version	521.31 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zoh, RS	en_US
dc.contributor.author	Mallick, B	en_US
dc.contributor.author	Ivanov, I	en_US
dc.contributor.author	Baladandayuthapani, V	en_US
dc.contributor.author	Manyam, G	en_US
dc.contributor.author	Chapkin, RS	en_US
dc.contributor.author	Lampe, JW	en_US
dc.contributor.author	Carroll, RJ	en_US
dc.date.available	2016-02-01	en_US
dc.date.issued	2016-12-01	en_US
dc.identifier.citation	Biometrics, 2016, 72 (4), pp. 1358 - 1368	en_US
dc.identifier.issn	0006-341X	en_US
dc.identifier.uri	http://hdl.handle.net/10453/98658
dc.description.abstract	© 2016, The International Biometric Society. Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and microRNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negatively correlated pairs relative to the standard approaches.	en_US
dc.relation.ispartof	Biometrics	en_US
dc.relation.isbasedon	10.1111/biom.12516	en_US
dc.subject.classification	Statistics & Probability	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Carcinoma, Squamous Cell	en_US
dc.subject.mesh	Lung Neoplasms	en_US
dc.subject.mesh	MicroRNAs	en_US
dc.subject.mesh	RNA, Messenger	en_US
dc.subject.mesh	Models, Statistical	en_US
dc.subject.mesh	Poisson Distribution	en_US
dc.subject.mesh	High-Throughput Nucleotide Sequencing	en_US
dc.title	PCAN: Probabilistic correlation analysis of two non-normal data sets	en_US
dc.type	Journal Article
utslib.citation.volume	4	en_US
utslib.citation.volume	72	en_US
utslib.for	0104 Statistics	en_US
utslib.for	0199 Other Mathematical Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Mathematical and Physical Sciences
utslib.copyright.status	closed_access
pubs.issue	4	en_US
pubs.publication-status	Published	en_US
pubs.volume	72	en_US

Abstract:

© 2016, The International Biometric Society. Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and microRNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negatively correlated pairs relative to the standard approaches.
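
The attenuation described in the abstract can be illustrated with a small simulation. The sketch below is not the PCAN method itself; it only assumes a pair of Poisson counts whose natural parameters (log means) are drawn from a correlated bivariate normal distribution, and compares the sample Pearson correlation of the natural parameters with that of the observed counts. The function names (`pearson`, `poisson_draw`, `simulate`) and all parameter values (`rho`, `log_mean`, `sd`, `n`, `seed`) are hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the small means used here; slow for large means.
    """
    threshold = math.exp(-mean)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate(rho=0.9, log_mean=-1.0, sd=0.5, n=5000, seed=0):
    """Return (correlation of natural parameters, correlation of counts).

    Assumed setup: log means are bivariate normal with correlation rho,
    and each count is Poisson given its own log mean.
    """
    rng = random.Random(seed)
    theta_x, theta_y, count_x, count_y = [], [], [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Correlated natural parameters (log of the Poisson mean).
        tx = log_mean + sd * z1
        ty = log_mean + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        theta_x.append(tx)
        theta_y.append(ty)
        # Low-count observations generated from those parameters.
        count_x.append(poisson_draw(rng, math.exp(tx)))
        count_y.append(poisson_draw(rng, math.exp(ty)))
    return pearson(theta_x, theta_y), pearson(count_x, count_y)


if __name__ == "__main__":
    r_theta, r_counts = simulate()
    print(f"correlation of natural parameters: {r_theta:.3f}")
    print(f"correlation of observed counts:    {r_counts:.3f}")
```

With low means such as these, the Poisson sampling noise dominates the variance of each count, so the correlation computed on the counts should fall much closer to zero than the correlation between the underlying natural parameters. Raising `log_mean` in this sketch reduces the gap.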

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/98658