Machine learning-based prediction model construction for type 2 diabetes mellitus: A comparison of algorithms and multi-level risk factor analysis

Xu, Q; Ball, J; Sun, J

Publisher: Wiley
Publication Type: Journal Article
Citation: Journal of Diabetes Research, 2026, 2026, (1), pp. e4525736
Issue Date: 2026-01-19

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Xu, Q
dc.contributor.author	Ball, J
dc.contributor.author	Sun, J https://orcid.org/0000-0002-0097-2438
dc.contributor.editor	Ye, W
dc.date.accessioned	2026-04-20T03:17:45Z
dc.date.available	2026-01-21
dc.date.available	2026-04-20T03:17:45Z
dc.date.issued	2026-01-19
dc.identifier.citation	Journal of Diabetes Research, 2026, 2026, (1), pp. e4525736
dc.identifier.issn	2314-6745
dc.identifier.issn	2314-6753
dc.identifier.uri	http://hdl.handle.net/10453/194767
dc.format	Print
dc.language	eng
dc.publisher	Wiley
dc.relation.ispartof	Journal of Diabetes Research
dc.relation.isbasedon	10.1155/jdr/4525736
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	1116 Medical Physiology
dc.subject.classification	3202 Clinical sciences
dc.subject.mesh	Humans
dc.subject.mesh	Diabetes Mellitus, Type 2
dc.subject.mesh	Machine Learning
dc.subject.mesh	Risk Factors
dc.subject.mesh	Middle Aged
dc.subject.mesh	Female
dc.subject.mesh	Male
dc.subject.mesh	Adult
dc.subject.mesh	Algorithms
dc.subject.mesh	Nutrition Surveys
dc.subject.mesh	Aged
dc.subject.mesh	Risk Assessment
dc.subject.mesh	Young Adult
dc.title	Machine learning-based prediction model construction for type 2 diabetes mellitus: A comparison of algorithms and multi-level risk factor analysis
dc.type	Journal Article
utslib.citation.volume	2026
utslib.location.activity	United States
utslib.for	1116 Medical Physiology
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.rights.license	This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
dc.date.updated	2026-04-20T03:17:43Z
pubs.issue	1
pubs.publication-status	Accepted
pubs.volume	2026
utslib.citation.issue	1

Abstract:

BACKGROUND: Type 2 diabetes mellitus (T2DM) has a high incidence worldwide, yet existing prediction models are largely confined to single-dimensional risk factors and lack integrated multilevel analysis. Given the severe impact of T2DM on individual health and healthcare systems, a comprehensive and accurate prediction model is needed.

OBJECTIVE: This study aimed to construct a T2DM prediction model, identify multilevel risk factors, and enable early screening, helping clinicians identify high-risk individuals and providing targets for public health interventions.

METHODS: Data from the National Health and Nutrition Examination Survey (NHANES) 2021-2023 were used, including 6337 participants aged 18 years and older. Missing values were handled using Monte Carlo multiple imputation, collinearity was reduced via principal component analysis (PCA), and feature selection was performed using random forest (RF) and recursive feature elimination (RFE). The adaptive synthetic sampling (ADASYN) method was applied to address class imbalance. The performance of seven machine learning models, including decision tree, random forest, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost), was compared.

RESULTS: The AdaBoost model performed best, with an area under the curve (AUC) of 0.85 (95% confidence interval: 0.85-0.86), an accuracy of 0.71 (95% confidence interval: 0.70-0.72), and an F1 score of 0.71; its performance improved further after parameter optimization. A total of 24 key risk factors were identified: 19 at the individual trait level, 3 at the individual behavior level, and 2 related to working and living conditions.

CONCLUSIONS: Machine learning models that integrate multidimensional risk factors based on the health ecology framework can predict T2DM risk more accurately, providing a scientific basis for multilevel interventions. The novelty of this study lies in integrating, for the first time, the health ecology model with machine learning to systematically identify cross-level risk factors. Compared with traditional models, the approach is more comprehensive, addresses limitations of previous studies, and provides a new tool for the precise prevention of T2DM and for public health interventions.
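
The results above report an AUC together with a 95% confidence interval. The abstract does not say how that interval was obtained, so the following is only an illustrative sketch under one common assumption: a percentile bootstrap over the evaluation set. The function names (`roc_auc`, `bootstrap_auc_ci`), the number of resamples, and the synthetic scores are hypothetical and are not taken from the study.

```python
import random


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with the score at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC requires at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Synthetic scores for illustration only: positives tend to score higher.
    rng = random.Random(42)
    y = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
    s = [rng.gauss(1.0 if label else 0.0, 1.0) for label in y]
    print("AUC:", round(roc_auc(y, s), 3))
    print("95% CI:", tuple(round(v, 3) for v in bootstrap_auc_ci(y, s)))
```

With a large evaluation set the bootstrap interval narrows, which is consistent with the tight interval reported in the abstract, although the authors' actual procedure may differ.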

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/194767