Chinese General Practice, 2019, Vol. 22, Issue 9: 1021-1026. DOI: 10.12114/j.issn.1007-9572.2018.00.429

• Monographic Research •

Predictive Value of the Random Forest Algorithm for Diabetes Risk in People Undergoing Physical Examination

  

  1. School of Public Health, Xinjiang Medical University, Urumqi 830011, China
  2. Health Management Center, Xinjiang Medical University First Affiliated Hospital, Urumqi 830011, China
  3. Xinjiang Medical University First Affiliated Hospital, Urumqi 830011, China
    *Corresponding authors: DAI Jianghong, Professor, Doctoral supervisor; E-mail: epil02@sina.com
    YAO Hua, Professor, Doctoral supervisor; E-mail: yaohua01@sina.com
  • Published: 2019-03-20 Online: 2019-03-20

  • Funding:
    Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01C425)

Abstract: Background In 2017, China had 114 million people with diabetes, the largest number of any country in the world. Early identification of high-risk populations and effective intervention can reduce the risk of diabetes mellitus.
Objective To explore the value of the random forest algorithm in predicting diabetes mellitus risk in people undergoing physical examination.
Methods We used national health examination data from people aged 35 to 74 years who underwent physical examination at the community health service centers of Shiyouxincun and Kaziwan, Urumqi, from September 2016 to March 2017. After considering data completeness, data from 6 727 examinees were included, comprising questionnaires, physical measurements, and laboratory tests. The questionnaire covered general demographic data; physical measurements included height, body mass, and waist circumference; laboratory tests covered blood, blood glucose, and serum chemistry indicators. The dataset was divided into a training set and a test set at a ratio of 3∶1. Multivariate logistic regression and the random forest algorithm were each used to build a diabetes risk prediction model on the training set, and the models were validated on the test set. Predictive performance was evaluated by the prediction consistency rate and the area under the receiver operating characteristic (ROC) curve (AUC).
Results Of the 6 727 participants, 717 had previously diagnosed or newly detected diabetes, a prevalence of 10.7%. Among the diabetic patients, 37.1% (266/717) were aged 65 years or older, 51.0% (366/717) were women, 94.0% (674/717) were Han Chinese, 35.3% (253/717) had a junior high school education, 48.0% (344/717) were overweight, 72.8% (522/717) had never smoked, and 77.0% (552/717) had never drunk alcohol. When the multivariate logistic regression model built on the training set was applied to the test set, the sensitivity was 0.202, the specificity was 0.950, the prediction consistency rate was 0.696, the Youden index was 0.151, and the AUC was 0.685. When the random forest model built on the training set was applied to the test set, the sensitivity was 0.608, the specificity was 0.953, the prediction consistency rate was 0.864, the Youden index was 0.561, and the AUC was 0.702.
Conclusion The random forest algorithm shows higher predictive performance for diabetes risk in people undergoing physical examination, whereas multivariate logistic regression offers an intuitive interpretation of the factors influencing diabetes mellitus. We recommend combining the advantages of the two models in practical applications to maximize their value in disease risk prediction.
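
The evaluation described in the Methods (a 3∶1 split, then sensitivity, specificity, Youden index, and AUC on the test set) can be sketched in Python as follows. This is a minimal illustration on made-up toy labels and scores, not the authors' code or data. The function names are hypothetical, and the prediction consistency rate is assumed here to mean overall agreement between predicted and true labels, which the abstract does not define.

```python
import random


def split_3_to_1(records, seed=0):
    """Shuffle records and split them into training (3/4) and test (1/4) sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 3) // 4
    return shuffled[:cut], shuffled[cut:]


def confusion_metrics(y_true, y_pred):
    """Compute test-set metrics from binary labels (1 = diabetes, 0 = no diabetes)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumption: "consistency rate" is taken as overall agreement.
        "consistency_rate": (tp + tn) / len(pairs),
        "youden_index": sensitivity + specificity - 1,
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    case gets the higher score; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 3 positives and 5 negatives with model scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
toy_pred = [int(s >= 0.5) for s in toy_scores]  # classify at a 0.5 cut-off

toy_metrics = confusion_metrics(toy_labels, toy_pred)
toy_auc = auc(toy_labels, toy_scores)
```

With the sensitivity and specificity reported for the random forest model, the same formula gives a Youden index of 0.608 + 0.953 - 1 = 0.561, which matches the abstract.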

Key words: Diabetes mellitus, Prevalence, Random forest, Forecasting
