Tackling Missing Data in Community Health Studies Using Additive LS-SVM Classifier

Wang, G; Deng, Z; Choi, KS

Publication Type: Journal Article
Citation: IEEE Journal of Biomedical and Health Informatics, 2018, 22 (2), pp. 579 - 587
Issue Date: 2018-03-01

Closed Access

Filename	Description	Size	Format
07763749.pdf	Published Version	443.51 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, G	en_US
dc.contributor.author	Deng, Z	en_US
dc.contributor.author	Choi, KS	en_US
dc.date.issued	2018-03-01	en_US
dc.identifier.citation	IEEE Journal of Biomedical and Health Informatics, 2018, 22 (2), pp. 579 - 587	en_US
dc.identifier.issn	2168-2194	en_US
dc.identifier.uri	http://hdl.handle.net/10453/134384
dc.relation.ispartof	IEEE Journal of Biomedical and Health Informatics	en_US
dc.relation.isbasedon	10.1109/JBHI.2016.2634587	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Data Interpretation, Statistical	en_US
dc.subject.mesh	Least-Squares Analysis	en_US
dc.subject.mesh	Public Health	en_US
dc.subject.mesh	Quality of Life	en_US
dc.subject.mesh	Medical Informatics	en_US
dc.subject.mesh	Databases, Factual	en_US
dc.subject.mesh	Aged	en_US
dc.subject.mesh	Aged, 80 and over	en_US
dc.subject.mesh	Female	en_US
dc.subject.mesh	Male	en_US
dc.subject.mesh	Support Vector Machine	en_US
dc.title	Tackling Missing Data in Community Health Studies Using Additive LS-SVM Classifier	en_US
dc.type	Journal Article
utslib.citation.volume	2	en_US
utslib.citation.volume	22	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access
pubs.issue	2	en_US
pubs.publication-status	Published	en_US
pubs.volume	22	en_US

Abstract:

© 2013 IEEE. Missing data is a common issue in community health and epidemiological studies. Directly removing samples with missing data can reduce the sample size and introduce information bias, which weakens the significance of the results. While data imputation methods are available to deal with missing data, they are limited in performance and can introduce noise into the dataset. Instead of data imputation, a novel method based on the additive least squares support vector machine (LS-SVM) is proposed in this paper for predictive modeling when the input features of the model contain missing data. The method simultaneously determines the influence of the features with missing values on the classification accuracy using the fast leave-one-out cross-validation strategy. The performance of the method is evaluated by applying it to predict the quality of life (QOL) of elderly people using health data collected in the community. The dataset involves demographics, socioeconomic status, health history, and the outcomes of health assessments of 444 community-dwelling elderly people, with 5% to 60% of data missing in some of the input features. The QOL is measured using a standard questionnaire of the World Health Organization. Results show that the proposed method outperforms four conventional methods for handling missing data (case deletion, feature deletion, mean imputation, and K-nearest neighbor imputation), with the average QOL prediction accuracy reaching 0.7418. The method is a potentially promising technique for tackling missing data in community health research and other applications.
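
The abstract does not spell out the fast leave-one-out cross-validation step, so the following is only a minimal sketch of the well-known closed-form leave-one-out shortcut for a standard (non-additive) LS-SVM trained on complete data. It is not the authors' additive method and does not handle missing features. The function names, the RBF kernel, the synthetic data, and the parameters `gamma` and `width` are all assumptions made for illustration. The shortcut computes every leave-one-out prediction from a single fit as `y_i - alpha_i / inv(C)[i, i]`, and the sketch checks it against explicit retraining.

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def fit_ls_svm(K, y, gamma):
    # Solve the LS-SVM linear system
    #   [K + I/gamma  1] [alpha]   [y]
    #   [1^T          0] [  b  ] = [0]
    # with labels y in {-1, +1} treated as regression targets.
    n = len(y)
    C = np.zeros((n + 1, n + 1))
    C[:n, :n] = K + np.eye(n) / gamma
    C[:n, n] = 1.0
    C[n, :n] = 1.0
    sol = np.linalg.solve(C, np.append(y, 0.0))
    return sol[:n], sol[n], C

def fast_loo_predictions(K, y, gamma):
    # Closed-form leave-one-out outputs from a single fit:
    #   f_i^(-i) = y_i - alpha_i / [C^{-1}]_{ii}
    alpha, _, C = fit_ls_svm(K, y, gamma)
    c_inv_diag = np.diag(np.linalg.inv(C))[:len(y)]
    return y - alpha / c_inv_diag

def naive_loo_predictions(K, y, gamma):
    # Reference implementation: retrain n times, each time holding out one sample.
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        alpha, b, _ = fit_ls_svm(K[np.ix_(keep, keep)], y[keep], gamma)
        out[i] = K[i, keep] @ alpha + b
    return out

# Synthetic two-class data (hypothetical; unrelated to the study's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

K = rbf_kernel(X, X, width=1.5)
loo_fast = fast_loo_predictions(K, y, gamma=10.0)
loo_naive = naive_loo_predictions(K, y, gamma=10.0)
loo_accuracy = float(np.mean(np.sign(loo_fast) == y))

print("max |fast - naive|:", np.max(np.abs(loo_fast - loo_naive)))
print("leave-one-out accuracy:", loo_accuracy)
```

In this sketch the closed form needs one matrix inversion instead of n separate fits, which is what makes leave-one-out accuracy cheap enough to use as a model-selection criterion.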

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/134384