Predicting COVID-19 mortality risk in Toronto, Canada: a comparison of tree-based and regression-based machine learning methods

Cindy Feng; George Kephart; Elizabeth Juarez-Colunga

doi:10.1186/s12874-021-01441-4

Medicine

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Background: Coronavirus disease (COVID-19) presents an unprecedented threat to global health worldwide. Accurately predicting the mortality risk among the infected individuals is crucial for prioritizing medical care and mitigating the healthcare system’s burden. The present study aimed to assess the predictive accuracy of machine learning methods to predict the COVID-19 mortality risk.

Methods: We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) to predict the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and predictive accuracy of the methods for the testing samples was assessed using the area under the receiver operating characteristic curve, Brier’s score, calibration intercept and calibration slope.

Results: We found XGBoost is highly discriminative, with an AUC of 0.9669 and has superior performance over conventional tree-based methods, i.e., classification tree or RF methods for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to the XGBoost with slightly lower AUCs and higher Brier’s scores.

Conclusions: XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.
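
The four accuracy measures named in the abstract (AUC, Brier's score, calibration intercept and calibration slope) can be illustrated with a minimal Python sketch that uses only the standard library. This is not the authors' code: the function names, the simulated data and the choice to define the calibration intercept as calibration-in-the-large (slope fixed at 1) are assumptions made here for illustration, and the study may have used different conventions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-12):
    """Log-odds of p, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def auc(y_true, p_hat):
    """AUC as the share of (death, survivor) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, p_hat):
    """Mean squared difference between predicted risk and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def calibration_line(y_true, p_hat, iters=25):
    """Fit y ~ a + b * logit(p_hat) by Newton-Raphson; b is the slope."""
    x = [logit(p) for p in p_hat]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def calibration_intercept(y_true, p_hat, iters=25):
    """Assumed convention: intercept with logit(p_hat) as a fixed offset."""
    x = [logit(p) for p in p_hat]
    a = 0.0
    for _ in range(iters):
        g = h = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + xi)
            g += yi - mu
            h += mu * (1.0 - mu)
        a += g / h
    return a


def simulate(n, seed=0):
    """Hypothetical data: outcomes drawn from the predicted risks themselves."""
    rng = random.Random(seed)
    y, p = [], []
    for _ in range(n):
        risk = sigmoid(-2.0 + 1.5 * rng.gauss(0.0, 1.0))
        p.append(risk)
        y.append(1 if rng.random() < risk else 0)
    return y, p


if __name__ == "__main__":
    y, p = simulate(2000, seed=1)
    print("AUC:", auc(y, p))
    print("Brier score:", brier(y, p))
    print("Calibration intercept:", calibration_intercept(y, p))
    print("Calibration slope:", calibration_line(y, p)[1])
```

Because the simulated outcomes are drawn from the predicted risks, the predictions are well calibrated by construction, so the intercept should land near 0 and the slope near 1. Applied to a real model, a slope below 1 would suggest predictions that are too extreme and a nonzero intercept would suggest systematic over- or under-prediction.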

Original language	English
Article number	267
Journal	BMC Medical Research Methodology
Volume	21
Issue number	1
DOIs	https://doi.org/10.1186/s12874-021-01441-4
Publication status	Published - Dec 2021

Bibliographical note

Funding Information:
The authors would like to thank the suggestions and comments from the Editor and reviewers, which significantly helped to improve the quality of this manuscript. The authors would also like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada Calcul Canada (www.computecanada.ca).

Publisher Copyright:
© 2021, The Author(s).

ASJC Scopus Subject Areas

Epidemiology
Health Informatics

PubMed: MeSH publication types

Journal Article
Research Support, Non-U.S. Gov't

Access to Document

10.1186/s12874-021-01441-4

Cite this

@article{3ad784f8952341868b5452f210016aa5,

title = "Predicting COVID-19 mortality risk in Toronto, Canada: a comparison of tree-based and regression-based machine learning methods",

abstract = "Background: Coronavirus disease (COVID-19) presents an unprecedented threat to global health worldwide. Accurately predicting the mortality risk among the infected individuals is crucial for prioritizing medical care and mitigating the healthcare system{\textquoteright}s burden. The present study aimed to assess the predictive accuracy of machine learning methods to predict the COVID-19 mortality risk. Methods: We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) to predict the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and predictive accuracy of the methods for the testing samples was assessed using the area under the receiver operating characteristic curve, Brier{\textquoteright}s score, calibration intercept and calibration slope. Results: We found XGBoost is highly discriminative, with an AUC of 0.9669 and has superior performance over conventional tree-based methods, i.e., classification tree or RF methods for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to the XGBoost with slightly lower AUCs and higher Brier{\textquoteright}s scores. Conclusions: XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.",

author = "Cindy Feng and George Kephart and Elizabeth Juarez-Colunga",

note = "Funding Information: The authors would like to thank the suggestions and comments from the Editor and reviewers, which significantly helped to improve the quality of this manuscript. The authors would also like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada Calcul Canada (www.computecanada.ca). Publisher Copyright: {\textcopyright} 2021, The Author(s).",

year = "2021",

month = dec,

doi = "10.1186/s12874-021-01441-4",

language = "English",

volume = "21",

journal = "BMC Medical Research Methodology",

issn = "1471-2288",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Predicting COVID-19 mortality risk in Toronto, Canada

T2 - a comparison of tree-based and regression-based machine learning methods

AU - Feng, Cindy

AU - Kephart, George

AU - Juarez-Colunga, Elizabeth

N1 - Funding Information: The authors would like to thank the suggestions and comments from the Editor and reviewers, which significantly helped to improve the quality of this manuscript. The authors would also like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada Calcul Canada (www.computecanada.ca). Publisher Copyright: © 2021, The Author(s).

PY - 2021/12

Y1 - 2021/12

N2 - Background: Coronavirus disease (COVID-19) presents an unprecedented threat to global health worldwide. Accurately predicting the mortality risk among the infected individuals is crucial for prioritizing medical care and mitigating the healthcare system’s burden. The present study aimed to assess the predictive accuracy of machine learning methods to predict the COVID-19 mortality risk. Methods: We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) to predict the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and predictive accuracy of the methods for the testing samples was assessed using the area under the receiver operating characteristic curve, Brier’s score, calibration intercept and calibration slope. Results: We found XGBoost is highly discriminative, with an AUC of 0.9669 and has superior performance over conventional tree-based methods, i.e., classification tree or RF methods for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to the XGBoost with slightly lower AUCs and higher Brier’s scores. Conclusions: XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.

AB - Background: Coronavirus disease (COVID-19) presents an unprecedented threat to global health worldwide. Accurately predicting the mortality risk among the infected individuals is crucial for prioritizing medical care and mitigating the healthcare system’s burden. The present study aimed to assess the predictive accuracy of machine learning methods to predict the COVID-19 mortality risk. Methods: We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) to predict the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and predictive accuracy of the methods for the testing samples was assessed using the area under the receiver operating characteristic curve, Brier’s score, calibration intercept and calibration slope. Results: We found XGBoost is highly discriminative, with an AUC of 0.9669 and has superior performance over conventional tree-based methods, i.e., classification tree or RF methods for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to the XGBoost with slightly lower AUCs and higher Brier’s scores. Conclusions: XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.

UR - http://www.scopus.com/inward/record.url?scp=85120090280&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85120090280&partnerID=8YFLogxK

U2 - 10.1186/s12874-021-01441-4

DO - 10.1186/s12874-021-01441-4

M3 - Article

C2 - 34837951

AN - SCOPUS:85120090280

SN - 1471-2288

VL - 21

JO - BMC Medical Research Methodology

JF - BMC Medical Research Methodology

IS - 1

M1 - 267

ER -
