Chinese General Practice ›› 2024, Vol. 27 ›› Issue (30): 3763-3771. DOI: 10.12114/j.issn.1007-9572.2024.0019

• Original Article •

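  • Author contributions:

    LIU Zhongdian, XU Qi, CHEN Yijing, QIN Lingqiao, CHEN Shuping, and TANG Weiting carried out the study and collected and organized the data; LIU Zhongdian performed the statistical analysis, analyzed and interpreted the results, and wrote the manuscript; LIU Zhongdian and ZHONG Qiuan revised the manuscript; ZHONG Qiuan conceived and designed the study, conducted the feasibility analysis, and was responsible for quality control and review of the article.

  • Funding:
    National Natural Science Foundation of China (82060088)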

Identification of Carotid Atherosclerosis in Medium-high Risk Population of Cardiovascular Disease: Prediction Model and Validation Based on Machine Learning

LIU Zhongdian, XU Qi, CHEN Yijing, QIN Lingqiao, CHEN Shuping, TANG Weiting, ZHONG Qiuan*

  1. Department of Epidemiology, School of Public Health, Guangxi Medical University, Nanning 530021, China
  • Received: 2024-02-19 Revised: 2024-04-30 Published: 2024-10-20 Online: 2024-07-09
  • Contact: ZHONG Qiuan

Abstract:

Background

Carotid atherosclerosis (CAS) is often regarded as an early warning sign of cardiovascular disease (CVD). Carotid Doppler ultrasonography, the technique used to diagnose CAS, is not included in public health service programs, and the Framingham Risk Score (FRS) is not sufficiently accurate in assessing CAS risk. Both factors make it difficult for primary healthcare personnel to identify CAS. Research on machine learning methods for identifying CAS in the population at medium-high risk according to FRS remains scarce.

Objective

To construct CAS prediction models for the population at medium-high risk according to FRS using machine learning methods, compare their discriminative efficacy, and select the best-performing model, with the aim of helping primary healthcare personnel identify CAS more simply and accurately.

Methods

Using convenience sampling, a total of 674 local residents who met the inclusion criteria were recruited from two townships in Liuzhou City, Guangxi Zhuang Autonomous Region, in 2019-2021 and in 2023. Relevant information was collected, and fasting blood and urine samples were taken to measure biochemical indicators. FRS was used to assess the risk of CVD, and carotid ultrasound was used to diagnose CAS. The 517 subjects recruited in 2019-2021 were randomly divided into a training set and a validation set at a ratio of 8∶2. The training set was used to build logistic regression, random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and gradient boosting decision tree (GBDT) models, and the validation set was used for internal validation. The 157 subjects recruited in 2023 served as the test set for external validation. Feature variables were selected by Lasso regression, and discriminative efficacy was evaluated by sensitivity, specificity, accuracy, F1 score, and area under the curve (AUC). In external validation, the generalization ability of the optimal model was assessed by AUC, and the Shapley Additive exPlanation (SHAP) method was used to explore the important variables influencing the optimal model's identification of CAS.
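
The random 8∶2 split and the AUC criterion described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `split_indices` and `auc`, the seed, and the choice to round the validation-set size up are all assumptions made for illustration, and the four scores passed to `auc` are toy values rather than study data.

```python
import math
import random


def split_indices(n, val_fraction=0.2, seed=0):
    """Randomly split range(n) into training and validation index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Assumption: the validation-set size is rounded up; the abstract
    # states only the 8:2 ratio, not the exact set sizes.
    n_val = math.ceil(n * val_fraction)
    return indices[n_val:], indices[:n_val]


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 517 subjects, as in the 2019-2021 cohort; only the count comes from the text.
train_idx, val_idx = split_indices(517)
print(len(train_idx), len(val_idx))  # 413 104

# Toy labels and predicted probabilities (hypothetical values).
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise definition of AUC is quadratic in the number of cases, which is acceptable for sets of this size; a rank-based formula gives the same value more efficiently on larger data.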

Results

Lasso regression identified 15 feature variables with non-zero coefficients: age, BMI, systolic blood pressure (SBP), smoking, drinking, hypertension, total cholesterol, high-density lipoprotein cholesterol, C-reactive protein (CRP), fasting plasma glucose, apolipoprotein B (ApoB), lipoprotein(a) (LPA), aspartate aminotransferase (AST), the AST/alanine aminotransferase ratio, and the urinary microalbumin-to-creatinine ratio. The logistic regression, RF, SVM, XGBoost, and GBDT models all had high AUC values, and the GBDT model showed the best discriminative performance. Its sensitivity, specificity, accuracy, F1 score, and AUC were 0.7551, 0.8364, 0.7981, 0.7789, and 0.8349, respectively, and its AUC in external validation was 0.7940. The SHAP method showed that age, SBP, CRP, LPA, and ApoB were the top five factors influencing the GBDT model's identification of CAS.
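
As a worked illustration of how the four threshold-based metrics relate to one another, the Python sketch below computes them from a confusion matrix. The counts used (37 true positives, 12 false negatives, 46 true negatives, 9 false positives) are not reported in the abstract. They are one hypothetical set of integers that happens to reproduce the rounded GBDT figures if the validation set held 104 subjects, which is itself an assumption (20% of 517, rounded up). The function name `confusion_metrics` is likewise illustrative.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),             # true positive rate (recall)
        "specificity": tn / (tn + fp),             # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),         # harmonic mean of precision and recall
    }


# Hypothetical counts, NOT taken from the paper: chosen only because they are
# consistent with the rounded metrics reported for the GBDT model.
metrics = confusion_metrics(tp=37, fn=12, tn=46, fp=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# sensitivity: 0.7551
# specificity: 0.8364
# accuracy: 0.7981
# f1: 0.7789
```

AUC cannot be recovered from these counts alone, because it depends on the ranking of predicted probabilities across all thresholds rather than on a single cut-off.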

Conclusion

The machine learning-based logistic regression, RF, SVM, XGBoost, and GBDT models for identifying CAS all showed high discriminative performance. Among them, the GBDT model had the best overall discriminative efficacy and also showed strong generalization ability.

Key words: Cardiovascular diseases, Carotid atherosclerosis, Machine learning, Framingham risk score, Identification, Forecasting
