Background Carotid atherosclerosis (CAS) is often considered an early warning signal for cardiovascular diseases (CVD). Carotid artery Doppler ultrasonography, the diagnostic technique for CAS, is not included in public health service programs, and the Framingham Risk Score (FRS) lacks accuracy in assessing CAS risk; both factors hinder the identification of CAS by primary healthcare personnel. Research on machine learning methods for identifying CAS in the population assessed as medium-high risk by FRS is currently lacking.
Objective To construct CAS risk prediction models for the population assessed as medium-high risk by FRS using machine learning methods, compare their discriminative efficacy, select the optimal model, and help primary healthcare personnel identify CAS more conveniently and accurately.
Methods Using convenience sampling, a total of 674 local residents from two townships in Liuzhou City, Guangxi Zhuang Autonomous Region, who met the inclusion criteria in 2019 to 2021 and in 2023 were selected as the study subjects. Relevant information was collected, and biochemical indicators were measured in fasting blood and urine samples. FRS was used to assess the risk of CVD, and carotid ultrasound was used to diagnose CAS. The 517 subjects from 2019 to 2021 were randomly split 8:2 into a training set and a validation set. The training set was used to build Logistic regression, Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Gradient Boosting Decision Tree (GBDT) models, and the validation set was used for internal validation. The 157 subjects from 2023 served as the test set for external validation. Feature variables were selected using Lasso regression analysis, and discriminative efficacy was evaluated using sensitivity, specificity, accuracy, F1 score, and area under the curve (AUC). External validation assessed the generalization ability of the optimal model using AUC, and the Shapley Additive exPlanation (SHAP) method was used to explore the variables that most influenced the optimal model's identification of CAS.
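The workflow described in the Methods (random 8:2 split, Lasso-based feature selection, GBDT fitting, AUC on the validation set) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions: the data are synthetic stand-ins generated with `make_classification`, not the study data; scikit-learn is assumed as the toolkit, which the abstract does not state; and `LassoCV` with 5-fold cross-validation is one possible way to realise "Lasso regression analysis", not necessarily the one the authors used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 517-subject development cohort (NOT the study data).
X, y = make_classification(
    n_samples=517, n_features=30, n_informative=8, random_state=0
)

# Random 8:2 split into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumption: Lasso with a cross-validated penalty; features whose
# coefficients shrink to (near) zero are dropped by SelectFromModel.
selector = SelectFromModel(LassoCV(cv=5, random_state=0))

# Standardise, select features, then fit a gradient boosting classifier.
model = make_pipeline(
    StandardScaler(), selector, GradientBoostingClassifier(random_state=0)
)
model.fit(X_train, y_train)

# Internal validation: AUC from predicted probabilities of the positive class.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
print(f"selected features: {n_selected}, validation AUC: {val_auc:.3f}")
```

In this sketch an external test set (here, the 2023 cohort) would be scored with the same fitted `model`, without refitting the scaler, selector, or classifier.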
Results Lasso regression analysis identified 15 feature variables: age, BMI, systolic blood pressure (SBP), smoking, drinking, hypertension, total cholesterol, high-density lipoprotein cholesterol, C-reactive protein (CRP), fasting plasma glucose, apolipoprotein B (ApoB), lipoprotein(a) (LPA), aspartate aminotransferase (AST), AST/alanine aminotransferase ratio, and urinary microalbumin-to-creatinine ratio. The Logistic regression, RF, SVM, XGBoost, and GBDT models all exhibited high AUC values, with the GBDT model showing the best discriminative performance. Its sensitivity, specificity, accuracy, F1 score, and AUC were 0.7551, 0.8364, 0.7981, 0.7789, and 0.8349, respectively, and its external validation AUC was 0.7940. The SHAP method revealed that age, SBP, CRP, LPA, and ApoB were the top five factors influencing the GBDT model's identification of CAS.
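As an arithmetic illustration of how the threshold-based metrics relate to one another, the sketch below computes sensitivity, specificity, accuracy, and F1 score from a confusion matrix. The counts used (37 true positives, 12 false negatives, 46 true negatives, 9 false positives) are not reported in the abstract; they are an assumption, chosen because they sum to 104 subjects (about 20% of 517) and reproduce the four reported GBDT values to four decimal places. The function name `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall for CAS-positive subjects
    specificity = tn / (tn + fp)            # recall for CAS-negative subjects
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Assumed counts (not from the source), consistent with the reported values.
metrics = classification_metrics(tp=37, fn=12, tn=46, fp=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# sensitivity: 0.7551
# specificity: 0.8364
# accuracy: 0.7981
# f1: 0.7789
```

The AUC cannot be recovered from these counts alone, since it depends on the ranking of predicted probabilities across all decision thresholds rather than on a single confusion matrix.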
Conclusion The machine learning models built to identify CAS (Logistic regression, RF, SVM, XGBoost, and GBDT) all demonstrated high discriminative performance, with the GBDT model exhibiting the best overall discriminative efficacy and strong generalization ability.