A novel case-control subsampling approach for rapid model exploration of large clustered binary data

Wright, ST; Ryan, LM; Pham, T

Publication Type: Journal Article
Citation: Statistics in Medicine, 2018, 37 (6), pp. 899 - 913
Issue Date: 2018-03-15

Closed Access

Filename	Description	Size	Format
Wright_et_al-2018-Statistics_in_Medicine.pdf	Published Version	1.23 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wright, ST https://orcid.org/0000-0002-8593-6056	en_US
dc.contributor.author	Ryan, LM https://orcid.org/0000-0001-5957-2490	en_US
dc.contributor.author	Pham, T	en_US
dc.date.available	2017-10-01	en_US
dc.date.issued	2018-03-15	en_US
dc.identifier.citation	Statistics in Medicine, 2018, 37 (6), pp. 899 - 913	en_US
dc.identifier.issn	0277-6715	en_US
dc.identifier.uri	http://hdl.handle.net/10453/123131
dc.description.abstract	Copyright © 2017 John Wiley & Sons, Ltd. In many settings, an analysis goal is the identification of a factor, or set of factors, associated with an event or outcome. Often, these associations are then used for inference and prediction. Unfortunately, in the big data era, the model building and exploration phases of analysis can be time-consuming, especially if constrained by computing power (ie, a typical corporate workstation). To speed up this model development, we propose a novel subsampling scheme to enable rapid model exploration of clustered binary data using flexible yet complex model set-ups (GLMMs with additive smoothing splines). By reframing the binary response prospective cohort study into a case-control–type design, and using our knowledge of sampling fractions, we show one can approximate the model estimates that would be obtained from a full cohort analysis. This idea is extended to derive cluster-specific sampling fractions and thereby incorporate cluster variation into an analysis. Importantly, we demonstrate that previously computationally prohibitive analyses can be conducted in a timely manner on a typical workstation. The approach is applied to analysing risk factors associated with adverse reactions relating to blood donation.	en_US
dc.relation.ispartof	Statistics in Medicine	en_US
dc.relation.isbasedon	10.1002/sim.7543	en_US
dc.subject.classification	Statistics & Probability	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Cluster Analysis	en_US
dc.subject.mesh	Linear Models	en_US
dc.subject.mesh	Logistic Models	en_US
dc.subject.mesh	Risk Factors	en_US
dc.subject.mesh	Regression Analysis	en_US
dc.subject.mesh	Case-Control Studies	en_US
dc.subject.mesh	Cohort Studies	en_US
dc.subject.mesh	Computer Simulation	en_US
dc.title	A novel case-control subsampling approach for rapid model exploration of large clustered binary data	en_US
dc.type	Journal Article
utslib.citation.volume	6	en_US
utslib.citation.volume	37	en_US
utslib.for	0104 Statistics	en_US
utslib.for	1117 Public Health and Health Services	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Mathematical and Physical Sciences
utslib.copyright.status	closed_access
pubs.issue	6	en_US
pubs.publication-status	Published	en_US
pubs.volume	37	en_US
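
The core idea in the abstract, that a model fitted to a case-control style subsample can approximate the full cohort estimates once the sampling fractions are known, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper works with GLMMs with additive smoothing splines and cluster-specific sampling fractions, whereas the sketch below uses plain logistic regression with a single covariate and no clustering. It relies on the standard result that, when cases are kept with probability f1 and controls with probability f0, the subsample log-odds equal the cohort log-odds plus log(f1/f0), so that term can be supplied as a fixed offset. The simulated data, the coefficient values, the sampling fractions and the helper name `fit_logistic` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, offset=0.0, max_iter=50, tol=1e-8):
    """Newton-Raphson fit of logit P(y=1) = offset + b0 + b1*x.

    Hypothetical helper for this sketch: one covariate, fixed known offset.
    Starting at b0 = -offset makes the initial linear predictor zero.
    """
    b0, b1 = -offset, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(offset + b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Simulated cohort with a fairly rare binary outcome (assumed values).
rng = random.Random(2018)
n = 50_000
true_b0, true_b1 = -3.0, 0.7
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if rng.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

# Case-control style subsample: keep every case, 5% of controls (assumed).
f1, f0 = 1.0, 0.05
keep = [(x, y) for x, y in zip(xs, ys) if rng.random() < (f1 if y == 1 else f0)]
sub_x = [x for x, _ in keep]
sub_y = [y for _, y in keep]

# The known sampling fractions enter as a fixed offset, log(f1 / f0).
offset = math.log(f1 / f0)
b0_sub, b1_sub = fit_logistic(sub_x, sub_y, offset=offset)

# Full cohort fit, for comparison only.
b0_full, b1_full = fit_logistic(xs, ys)

print(f"subsample size: {len(sub_x)} of {n}")
print(f"subsample fit : b0={b0_sub:.3f}, b1={b1_sub:.3f}")
print(f"full cohort   : b0={b0_full:.3f}, b1={b1_full:.3f}")
```

In this toy setting the subsample is roughly a tenth of the cohort, and the offset puts the intercept back on the cohort scale, so both coefficients can be compared directly with the full fit. Fitting the subsample without the offset would leave the slope essentially unchanged but shift the intercept by log(f1/f0).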

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/123131