On the impact of model selection on predictor identification and parameter inference

Pfeiffer, RM; Redd, A; Carroll, RJ

Publication Type: Journal Article
Citation: Computational Statistics, 2017, 32 (2), pp. 667 - 690
Issue Date: 2017-06-01

Closed Access

Filename	Description	Size	Format
10.1007_s00180-016-0690-2.pdf	Published Version	617.47 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Pfeiffer, RM	en_US
dc.contributor.author	Redd, A	en_US
dc.contributor.author	Carroll, RJ https://orcid.org/0000-0002-5465-9682	en_US
dc.date.available	2016-09-24	en_US
dc.date.issued	2017-06-01	en_US
dc.identifier.citation	Computational Statistics, 2017, 32 (2), pp. 667 - 690	en_US
dc.identifier.issn	0943-4062	en_US
dc.identifier.uri	http://hdl.handle.net/10453/125206
dc.description.abstract	© 2016, The Author(s). We assessed the ability of several penalized regression methods for linear and logistic models to identify outcome-associated predictors and the impact of predictor selection on parameter inference for practical sample sizes. We studied effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). For linear models, penalized linear regression, elastic net, smoothly clipped absolute deviation (SCAD), least angle regression and LASSO had low false negative (FN) predictor selection rates but false positive (FP) rates above 20 % for all sample and effect sizes. Partial least squares regression had few FPs but many FNs. Only relaxo had low FP and FN rates. For logistic models, LASSO and penalized logistic regression had many FPs and few FNs for all sample and effect sizes. SCAD and adaptive logistic regression had low or moderate FP rates but many FNs. 95 % confidence interval coverage of predictors with null effects was approximately 100 % for Algorithm 1 for all methods, and 95 % for Algorithm 2 for large sample and effect sizes. Coverage was low only for penalized partial least squares (linear regression). For outcome-associated predictors, coverage was close to 95 % for Algorithm 2 for large sample and effect sizes for all methods except penalized partial least squares and penalized logistic regression. Coverage was sub-nominal for Algorithm 1. In conclusion, many methods performed comparably, and while Algorithm 2 is preferred to Algorithm 1 for estimation, it yields valid inference only for large effect and sample sizes.	en_US
dc.relation.ispartof	Computational Statistics	en_US
dc.relation.isbasedon	10.1007/s00180-016-0690-2	en_US
dc.subject.classification	Statistics & Probability	en_US
dc.title	On the impact of model selection on predictor identification and parameter inference	en_US
dc.type	Journal Article
utslib.citation.volume	2	en_US
utslib.citation.volume	32	en_US
utslib.for	0104 Statistics	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	01 Mathematical Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Mathematical and Physical Sciences
utslib.copyright.status	closed_access
pubs.issue	2	en_US
pubs.publication-status	Published	en_US
pubs.volume	32	en_US

Abstract:

© 2016, The Author(s). We assessed the ability of several penalized regression methods for linear and logistic models to identify outcome-associated predictors and the impact of predictor selection on parameter inference for practical sample sizes. We studied effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). For linear models, penalized linear regression, elastic net, smoothly clipped absolute deviation (SCAD), least angle regression and LASSO had low false negative (FN) predictor selection rates but false positive (FP) rates above 20 % for all sample and effect sizes. Partial least squares regression had few FPs but many FNs. Only relaxo had low FP and FN rates. For logistic models, LASSO and penalized logistic regression had many FPs and few FNs for all sample and effect sizes. SCAD and adaptive logistic regression had low or moderate FP rates but many FNs. 95 % confidence interval coverage of predictors with null effects was approximately 100 % for Algorithm 1 for all methods, and 95 % for Algorithm 2 for large sample and effect sizes. Coverage was low only for penalized partial least squares (linear regression). For outcome-associated predictors, coverage was close to 95 % for Algorithm 2 for large sample and effect sizes for all methods except penalized partial least squares and penalized logistic regression. Coverage was sub-nominal for Algorithm 1. In conclusion, many methods performed comparably, and while Algorithm 2 is preferred to Algorithm 1 for estimation, it yields valid inference only for large effect and sample sizes.
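
Illustrative sketch (not from the paper): the difference between Algorithm 1 and Algorithm 2, and the FP and FN selection rates, can be shown in a deliberately simplified setting. The code below assumes an orthonormal design, in which the LASSO estimate of each coefficient is the soft-thresholded least squares estimate and refitting the selected predictors returns the unpenalized least squares estimate. The function names, the penalty value `lam`, the numbers, and the rate definitions (selected nulls over all nulls, missed signals over all signals) are assumptions made for illustration; the paper's simulation design and definitions may differ.

```python
def soft_threshold(z, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def select_and_estimate(ols_estimates, lam):
    """Return (algorithm1, algorithm2) estimates for each predictor.

    Algorithm 1: keep the penalized (soft-thresholded) estimate.
    Algorithm 2: refit the selected predictors without a penalty; with
    orthonormal columns the refit returns the least squares estimate.
    """
    alg1 = [soft_threshold(b, lam) for b in ols_estimates]
    alg2 = [b if a != 0.0 else 0.0 for a, b in zip(alg1, ols_estimates)]
    return alg1, alg2


def selection_error_rates(estimates, true_effects):
    """FP rate = selected nulls / nulls; FN rate = missed signals / signals."""
    nulls = [e for e, t in zip(estimates, true_effects) if t == 0.0]
    signals = [e for e, t in zip(estimates, true_effects) if t != 0.0]
    fp_rate = sum(e != 0.0 for e in nulls) / len(nulls)
    fn_rate = sum(e == 0.0 for e in signals) / len(signals)
    return fp_rate, fn_rate


# Hypothetical numbers, for illustration only (not results from the paper).
true_effects = [1.0, 0.5, 0.0, 0.0, 0.0]
ols_estimates = [1.1, 0.2, 0.4, -0.1, 0.05]

alg1, alg2 = select_and_estimate(ols_estimates, lam=0.3)
fp_rate, fn_rate = selection_error_rates(alg1, true_effects)

print("Algorithm 1 (penalized):", alg1)
print("Algorithm 2 (refit):    ", alg2)
print("FP rate:", fp_rate, "FN rate:", fn_rate)
```

In this toy setting both algorithms select the same predictors, so their FP and FN rates coincide; they differ only in the estimates, since Algorithm 1 shrinks every selected coefficient toward zero by `lam` while Algorithm 2 does not.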

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/125206