Feature screening with large-scale and high-dimensional survival data.

Yi, GY; He, W; Carroll, RJ

Publisher: Wiley
Publication Type: Journal Article
Citation: Biometrics, 2022, 78, (3), pp. 894-907
Issue Date: 2022-04-21

Closed Access

Filename	Description	Size
Biometrics - 2021 - Yi - Feature screening with large‐scale and high‐dimensional survival data.pdf	Published version	292.81 kB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Yi, GY
dc.contributor.author	He, W
dc.contributor.author	Carroll, RJ
dc.date.accessioned	2023-06-05T03:15:13Z
dc.date.available	2020-04-07
dc.date.available	2023-06-05T03:15:13Z
dc.date.issued	2022-04-21
dc.identifier.citation	Biometrics, 2022, 78, (3), pp. 894-907
dc.identifier.issn	0006-341X
dc.identifier.issn	1541-0420
dc.identifier.uri	http://hdl.handle.net/10453/170621
dc.description.abstract	Very large data sets pose great challenges in modeling, inference, and computation. In handling big data, much attention has been directed to "large p, small n" settings, where p is the number of variables and n is the sample size; relatively little work has addressed problems in which both p and n are large, even though such data have become more accessible than before. A large volume of data does not automatically ensure good-quality inference, because many unimportant variables may be collected along with the informative ones. To carry out valid statistical analysis, it is imperative to screen out noisy variables that have no predictive value for the outcome variable. In this paper, we develop a screening method for large survival data in which the sample size n is large and the dimension p of the covariates is of non-polynomial order in n, the so-called NP-dimension. We rigorously establish theoretical results for the proposed method and conduct numerical studies to assess its performance. Our research offers multiple extensions of existing work and enlarges the scope of high-dimensional data analysis. The proposed method capitalizes on the connections among useful regression settings and offers a computationally efficient screening procedure. It can be applied in a range of settings with large-scale data, including genomic data.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Wiley
dc.relation	National Cancer Institute U01CA057030
dc.relation.ispartof	Biometrics
dc.relation.isbasedon	10.1111/biom.13479
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0104 Statistics, 0199 Other Mathematical Sciences
dc.subject.classification	Statistics & Probability
dc.subject.mesh	Genome
dc.subject.mesh	Genomics
dc.subject.mesh	Proportional Hazards Models
dc.subject.mesh	Sample Size
dc.title	Feature screening with large-scale and high-dimensional survival data.
dc.type	Journal Article
utslib.citation.volume	78
utslib.location.activity	United States
utslib.for	0104 Statistics
utslib.for	0199 Other Mathematical Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Mathematical and Physical Sciences
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2023-06-05T03:15:12Z
pubs.issue	3
pubs.publication-status	Published
pubs.volume	78
utslib.citation.issue	3

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170621