Machine learning-based prediction model construction for type 2 diabetes mellitus: A comparison of algorithms and multi-level risk factor analysis

Publisher:
Wiley
Publication Type:
Journal Article
Citation:
Journal of Diabetes Research, 2026, 2026, (1), pp. e4525736
Issue Date:
2026-01-19
BACKGROUND: Type 2 diabetes mellitus (T2DM) has a high global incidence, yet existing prediction models are largely confined to single-dimensional risk factors and lack multilevel integrated analysis. Given the severe impact of T2DM on individual health and healthcare systems, a comprehensive and accurate prediction model is of great significance.
OBJECTIVE: This study aimed to construct a T2DM prediction model, identify multilevel risk factors, and enable early screening, so as to help clinicians identify high-risk individuals and provide targets for public health interventions.
METHODS: Data from the National Health and Nutrition Examination Survey (NHANES) 2021-2023 were used, including 6337 participants aged 18 years and older. Missing values were handled using Monte Carlo multiple imputation, collinearity was reduced via principal component analysis (PCA), and feature selection was performed using random forest (RF) and recursive feature elimination (RFE). The adaptive synthetic sampling (ADASYN) method was applied to address class imbalance. The performance of seven machine learning models, including decision tree, random forest, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost), was compared.
RESULTS: The AdaBoost model performed best, with an area under the curve (AUC) of 0.85 (95% confidence interval: 0.85-0.86), an accuracy of 0.71 (95% confidence interval: 0.70-0.72), and an F1 score of 0.71; its performance improved further after parameter optimization. A total of 24 key risk factors were identified, including 19 at the individual trait level, 3 at the individual behavior level, and 2 related to working and living conditions.
CONCLUSIONS: Machine learning models that integrate multidimensional risk factors based on the health ecology framework can predict T2DM risk more accurately, providing a scientific basis for multilevel interventions. The innovation of this study lies in integrating, for the first time, the health ecology model with machine learning to systematically identify cross-level risk factors. Compared with traditional models, this approach is more comprehensive, overcomes limitations of previous studies, and provides a new and effective tool for the precise prevention of T2DM and for public health interventions.
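The abstract reports AUC, accuracy, and F1 score with 95% confidence intervals but does not say how they were computed. The following is a minimal sketch, not the authors' method, of how such metrics could be derived from predicted probabilities using only the Python standard library. The function names, the 0.5 decision threshold, the percentile bootstrap for the interval, and the toy labels and scores are all illustrative assumptions and are unrelated to the NHANES data.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(y_true, scores, threshold=0.5):
    """Accuracy and F1 at a fixed threshold (0.5 is an assumption, not from the paper)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(y_true), f1


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed choice of interval method)."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy example: 1 = T2DM, 0 = no T2DM; scores are hypothetical model probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

print(roc_auc(y_true, scores))           # 0.875
print(accuracy_and_f1(y_true, scores))   # (0.75, 0.75)
print(bootstrap_auc_ci(y_true, scores))
```

The pairwise AUC above is quadratic in sample size, which is fine for illustration; for a cohort of several thousand participants a rank-based computation would be the more practical choice.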