Background Coronary atherosclerotic heart disease (CHD) is one of the leading causes of mortality worldwide, and research on CHD risk assessment is growing year by year. However, data imbalance is often overlooked in these studies, even though addressing it is crucial to improving the accuracy with which classification algorithms identify CHD risk.
Objective To investigate the factors influencing CHD, to build CHD risk prediction models by combining two data balancing methods with five algorithms, and to compare the predictive value of these models.
Methods Using cross-sectional survey data from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) in the United States, 112 606 participants were included, with 24 variables related to risk behaviors and health status and self-reported CHD as the outcome measure. Factors influencing CHD were explored through univariate analysis and stepwise logistic regression, which also selected the variables entered into the predictive models. A random sample of 10% of the participants (11 261 individuals) was drawn and randomly divided into training and testing datasets at an 8∶2 ratio. To address data imbalance, two oversampling techniques were applied: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). With each balancing method, CHD predictive models were constructed using five algorithms: K-Nearest Neighbors (KNN), logistic regression, Support Vector Machine (SVM), decision tree, and XGBoost.
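The split-then-balance procedure described in the Methods can be sketched in a few lines. This is a minimal standard-library illustration, not the authors' implementation: the helper names (`train_test_split`, `random_oversample`, `smote`), the toy data, the seed, and the neighbour count `k=5` are all assumptions, and the sketch assumes that balancing is applied to the training portion only, which the abstract does not state. Plain SMOTE interpolates numeric features, so categorical survey variables would need separate handling.

```python
import math
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices and split into training and testing sets (8:2 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(round(len(rows) * test_fraction))

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(idx[n_test:]), pick(idx[:n_test])


def _minority_and_deficit(X, y, minority):
    """Return the minority rows and how many more are needed to match the majority."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(minority_rows)) - len(minority_rows)
    return minority_rows, max(deficit, 0)


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows, deficit = _minority_and_deficit(X, y, minority)
    extra = [list(rng.choice(minority_rows)) for _ in range(deficit)]
    return X + extra, y + [minority] * len(extra)


def smote(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority rows interpolated toward one of the k nearest minority neighbours."""
    rng = random.Random(seed)
    minority_rows, deficit = _minority_and_deficit(X, y, minority)
    if len(minority_rows) < 2:
        raise ValueError("SMOTE needs at least two minority rows")
    synthetic = []
    for _ in range(deficit):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours of the base row, excluding the row itself.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return X + synthetic, y + [minority] * len(synthetic)


# Toy data (hypothetical): 200 rows, two numeric features, 10% positive class.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [1 if i % 10 == 0 else 0 for i in range(200)]

(train_X, train_y), (test_X, test_y) = train_test_split(X, y)
ros_X, ros_y = random_oversample(train_X, train_y)
smote_X, smote_y = smote(train_X, train_y)

print("train class counts before:", train_y.count(0), train_y.count(1))
print("after random oversampling:", ros_y.count(0), ros_y.count(1))
print("after SMOTE:", smote_y.count(0), smote_y.count(1))
```

Either balanced training set could then be passed to any of the five classifiers, while the untouched test set keeps the original class ratio for evaluation.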
Results Univariate analysis revealed significant differences (P<0.05) between the CHD and non-CHD groups for all input variables except rental housing and being informed of prediabetic status. Stepwise logistic regression identified age, gender, BMI, ethnicity, education level, income level, being informed of hypertension, being informed of prehypertension, being informed of pregnancy-induced hypertension, current use of antihypertensive medication, being informed of hyperlipidemia, being informed of diabetes, smoking status, alcohol consumption within the last 30 days, heavy drinking status, and self-assessed health as factors influencing CHD. With SMOTE, the five models (KNN, logistic regression, SVM, decision tree, and XGBoost, in that order) achieved overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall rates of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F1 scores of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC values of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. With random oversampling, the models achieved classification accuracies of 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall rates of 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision of 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F1 scores of 0.258, 0.291, 0.302, 0.244, and 0.297; and AUC values of 0.80, 0.77, 0.72, 0.72, and 0.83, respectively.
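The F measures reported in the Results agree with the standard F1 definition, the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes each value from the precision and recall figures given above; the function name `f1_from_precision_recall` and the 0.001 rounding tolerance are illustrative assumptions, and the tuples follow the order in which the Results list the models.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F measure), in the order given in the Results.
reported = {
    "SMOTE": [
        (0.154, 0.752, 0.256),
        (0.182, 0.714, 0.290),
        (0.175, 0.705, 0.280),
        (0.176, 0.629, 0.275),
        (0.287, 0.348, 0.315),
    ],
    "random oversampling": [
        (0.158, 0.700, 0.258),
        (0.184, 0.695, 0.291),
        (0.191, 0.719, 0.302),
        (0.148, 0.690, 0.244),
        (0.190, 0.676, 0.297),
    ],
}

for method, rows in reported.items():
    for precision, recall, stated in rows:
        computed = f1_from_precision_recall(precision, recall)
        # Inputs are rounded to three decimals, so allow a small tolerance.
        agrees = abs(computed - stated) < 0.001
        print(f"{method}: P={precision:.3f} R={recall:.3f} "
              f"computed={computed:.3f} stated={stated:.3f} agrees={agrees}")
```

The check also makes the trade-off visible: the model with the highest accuracy under SMOTE (85.9%) has the lowest recall (34.8%), which is why accuracy alone is a weak criterion on imbalanced data.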
Conclusion This study not only confirmed known factors affecting CHD but also identified potential effects of self-assessed health, income level, and education level on CHD. The performance of the five algorithms was significantly enhanced after applying the two data balancing methods. Among them, the XGBoost model performed best and can serve as a reference for future optimization of CHD prediction models. Given the strong performance of the XGBoost model and the convenience and interpretability of stepwise logistic regression, combining the two approaches after data balancing is recommended for CHD risk prediction.