Chinese General Practice ›› 2022, Vol. 25 ›› Issue (02): 217-226. DOI: 10.12114/j.issn.1007-9572.2021.01.313

Special Issue: Collection of Articles on Respiratory Diseases

• Article · Chronic Obstructive Pulmonary Disease

Using Machine Learning to Build an Early Warning Model for the Risk of Severe Airflow Limitation in Patients with Chronic Obstructive Pulmonary Disease

  

  1. Department of Respiratory and Critical Care Medicine, University of Electronic Science and Technology of China Affiliated Hospital & Sichuan Provincial People's Hospital, Chengdu 610072, China

  2. Department of Nursing, University of Electronic Science and Technology of China Affiliated Hospital & Sichuan Provincial People's Hospital, Chengdu 610072, China

  3. School of Medicine, University of Electronic Science and Technology of China, Chengdu 610072, China

  4. Department of Pharmacy, University of Electronic Science and Technology of China Affiliated Hospital & Sichuan Provincial People's Hospital, Chengdu 610072, China

  5. Personalized Drug Therapy Key Laboratory of Sichuan Province, School of Medicine, University of Electronic Science and Technology of China, Chengdu 610072, China

  *Corresponding author: WEN Xianxiu, Professor of nursing; E-mail: 392083173@qq.com

  • Received: 2021-06-09 Revised: 2021-11-04 Published: 2022-01-15 Online: 2021-12-29

  • Funding:
    National Natural Science Foundation of China (72004020); Cadre Health Care Research Project of Sichuan Province (川干研 2021-219)

Abstract: Background

The degree of airflow limitation is a key indicator of disease progression in patients with chronic obstructive pulmonary disease (COPD). However, contraindications to testing and poor compliance prevent some patients from undergoing the relevant tests, so the severity of their disease cannot be evaluated.

Objective

To develop and evaluate a machine learning algorithm-based early warning model for the risk of severe airflow limitation in COPD patients.

Methods

A cross-sectional design was used to study COPD inpatients at a tertiary hospital in Sichuan Province from January 2019 to June 2020. General clinical indicators and pulmonary function test data were collected. The data were randomly divided into training and test sets at a ratio of 8:2, and 216 risk warning models were constructed on the training set using four missing-value imputation methods, three feature selection methods, and 17 machine learning algorithms plus one ensemble learning algorithm. The area under the receiver operating characteristic (ROC) curve (AUC), accuracy, precision, recall and F1 score were used to evaluate predictive performance; ten-fold cross-validation and Bootstrapping were used for internal and external validation, respectively. The test set was used for model testing and selection, and a post hoc method was used to verify the sample size.
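The model count in the Methods follows from the full combination of processing choices: 4 × 3 × (17 + 1) = 216. The sketch below enumerates that grid and illustrates an 8:2 split followed by a ten-fold partition of the training indices, using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: only "none" (no filling), "lasso", "boruta", "sgd" and "ensemble" are named in the abstract; all other labels, the random seed, and the rounding used for the split are hypothetical.

```python
import itertools
import random

# Names stated in the abstract: "none" (no filling), "lasso", "boruta",
# "sgd" (stochastic gradient descent) and "ensemble". Every other label
# is a hypothetical placeholder, because the abstract does not name it.
IMPUTERS = ["none", "imputer_2", "imputer_3", "imputer_4"]
SELECTORS = ["lasso", "boruta", "selector_3"]
LEARNERS = ["sgd"] + [f"learner_{i}" for i in range(2, 18)] + ["ensemble"]


def model_grid():
    """Every (imputer, selector, learner) combination: 4 * 3 * 18 = 216."""
    return list(itertools.product(IMPUTERS, SELECTORS, LEARNERS))


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)  # floor; the source's rounding is unknown
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=10):
    """Partition indices into k near-equal folds for cross-validation."""
    return [indices[i::k] for i in range(k)]


grid = model_grid()
train, test = split_indices(418)  # 418 patients, as reported in the Results
folds = k_folds(train)
print(len(grid), len(train), len(test), [len(f) for f in folds])
```

Each fold would serve once as the validation part while the other nine are used for fitting; the held-out test indices are never touched during that loop.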

Results

A total of 418 patients were included, of whom 212 (50.7%) were at risk of severe or worse airflow limitation. Four missing-value treatments and three feature selection methods yielded 12 processed datasets and 12 importance rankings of the factors affecting airflow limitation. The modified Medical Research Council (mMRC) dyspnea scale grade, age, body mass index (BMI), smoking history (yes/no), COPD Assessment Test (CAT) score, and dyspnea (yes/no) ranked near the top and were the key indicators for constructing the model, playing an important role in outcome prediction. With no imputation and Lasso selection, mMRC grade, smoking history (yes/no) and dyspnea (yes/no) were the top three predictors, with mMRC grade accounting for 54.15% of feature importance. With no imputation and Boruta selection, CAT score, age and mMRC grade were the top three predictors, with CAT score accounting for 26.64% of feature importance. Applying 17 machine learning algorithms and one ensemble learning algorithm to each of the 12 datasets produced 216 prediction models. Ten-fold cross-validation of the 17 machine learning algorithms showed statistically significant differences in predictive performance among algorithms (P<0.05); the stochastic gradient descent algorithm had the highest mean AUC (0.738±0.089). External validation on the test set using Bootstrapping also showed statistically significant differences among the models obtained by different algorithms (P<0.05); the ensemble learning algorithm had the highest mean AUC (0.757±0.057).
Bootstrapping-based evaluation of the four missing-value treatments and three feature selection methods showed that model performance improved with no imputation and with Lasso selection, a statistically significant difference (P<0.05). When the 216 machine learning models were tested on the test set, the best model had an AUC of 0.7909, accuracy of 75.90%, precision of 75.00%, recall of 78.57%, and F1 score of 0.7674. Sample size verification suggested that the sample size was sufficient for modeling.
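The four threshold-based metrics reported for the best model are tied together by the confusion matrix, and F1 is the harmonic mean of precision and recall. The sketch below computes them from counts. The counts themselves are an assumption and are not reported in the abstract: they are one set of integers that sums to 83 (about 20% of 418) and happens to reproduce the reported percentages.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, NOT reported in the abstract: chosen because they
# sum to 83 (about 20% of 418) and reproduce the reported percentages.
accuracy, precision, recall, f1 = classification_metrics(tp=33, fp=11, fn=9, tn=30)
print(f"{accuracy:.2%} {precision:.2%} {recall:.2%} {f1:.4f}")
# prints: 75.90% 75.00% 78.57% 0.7674
```

The AUC is not derivable from a single confusion matrix, since it summarizes performance across all decision thresholds.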

Conclusion

In this study, a risk warning model for severe airflow limitation in COPD patients was developed and evaluated. mMRC grade, age, BMI, CAT score, smoking history and dyspnea were the key indicators affecting airflow limitation. The model shows good predictive performance and has potential for clinical application.

Key words: Pulmonary disease, chronic obstructive, Machine learning, Degree of airflow limitation, Lung function, Respiratory function tests, Prediction model

