Challenge participants were asked to develop, train, and validate computational models to identify pediatric patients at risk of hospitalization, ventilation, and cardiovascular interventions, using de-identified electronic health record (EHR) data available through NCATS’ National COVID Cohort Collaborative (N3C) Data Enclave.
"We launched the Pediatric COVID-19 Data Challenge during a critical time in the pandemic, when pediatric patients largely didn't have access to vaccinations and were highly vulnerable as a result. This challenge provided an opportunity to develop and test algorithms from commonly available data that could empower clinicians with better insights to predict severe outcomes and hospitalizations more accurately so they can make critical decisions to reduce hospital burden and improve pediatric patient outcomes. We are excited about the potential for the results of this challenge to create new capabilities that can be available in the future."
- Dr. Sandeep Patel, DRIVe Director
In Task 1, teams developed computational models to predict the need for hospitalization among pediatric patients who test positive for COVID-19 in an outpatient setting.
Winner and Award of $100,000
University of Wisconsin – Madison
Department of Biostatistics & Medical Informatics (BMI)
Approach: Gradient boosting method and hand-crafted features extracted from multisite EHR data.
Additional Details: The team tailored a widely used machine learning approach (gradient boosting), reduced the dimensionality of the EHR data, and enhanced model interpretability by summarizing patients' medical conditions and drug exposures using medically meaningful concepts such as International Classification of Diseases (ICD-10) and Anatomical Therapeutic Chemical (ATC) codes. Their model achieved the best quantitative performance of the scored models. The team also used a subset of COVID-19 related lab measurements, taking recent values from before the patient's COVID-19 diagnosis, and customized the model training and tuning procedure so that the model was resistant to sample size bias, making it more generalizable across multiple sites.
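The summarization step described above can be illustrated with a minimal sketch. It is not the team's code: the function names, the feature prefixes, and the example patient are hypothetical, and it assumes one simple form of roll-up, truncating ICD-10 codes to their 3-character category and ATC codes to their 3-character second level before counting them.

```python
from collections import Counter

def icd10_category(code: str) -> str:
    """Return the 3-character ICD-10 category, e.g. 'J45.909' -> 'J45'."""
    return code.strip().upper().replace(".", "")[:3]

def atc_level2(code: str) -> str:
    """Return the ATC second-level prefix, e.g. 'R03AC02' -> 'R03'."""
    return code.strip().upper()[:3]

def summarize_patient(icd10_codes, atc_codes):
    """Collapse raw codes into a small dictionary of count features."""
    features = Counter()
    for code in icd10_codes:
        features["dx_" + icd10_category(code)] += 1
    for code in atc_codes:
        features["rx_" + atc_level2(code)] += 1
    return dict(features)

# Hypothetical patient record (illustrative codes only, not challenge data).
patient = summarize_patient(
    icd10_codes=["J45.909", "J45.20", "E66.9"],
    atc_codes=["R03AC02", "R03BA02"],
)
print(patient)  # {'dx_J45': 2, 'dx_E66': 1, 'rx_R03': 2}
```

Five raw codes become three features here, which is the kind of dimensionality reduction that also keeps each model input readable to a clinician.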
Post Challenge: The team is interested in refining their model to tie into therapeutic interventions for high-risk groups and incorporate additional information such as clinical notes into the model.
In Task 2, teams developed computational models to predict the need for respiratory and cardiovascular interventions in hospitalized pediatric patients, including children with multisystem inflammatory syndrome in children (MIS-C), a life-threatening inflammation of organs and tissues.
Winner and Award of $100,000
Vir Biotechnology
Approach: Missingness-aware gradient boosted tree classifier capable of extracting patterns from a complex set of electronic health records.
Additional Details: The team focused on extracting data from laboratory measurements, disease conditions, and past medical interventions, combining manual data cleaning, creation of new aggregate variables, and further harmonization of the data model. Not only did this group have the highest quantitative score, they also employed a missingness-aware classifier that learns from patterns of data availability and avoids imputing missing data. The team also guarded against overfitting by evaluating their trained classifier. When the model was evaluated in a simulation of a live clinical scenario, it maintained its high performance.
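The source does not say how the team's classifier handles missing values internally. As a hedged illustration of one common way a tree can be missingness aware, the sketch below learns a default direction for missing values at a single split instead of imputing them, similar in spirit to the sparsity-aware splits used by libraries such as XGBoost. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, missing_goes_left, weighted_impurity) for one feature.

    Missing values are represented as None and are never imputed. Each
    candidate threshold is tried twice, routing missing values left and
    then right, and the option with the lower impurity is kept.
    """
    observed = sorted({v for v in values if v is not None})
    best = None
    for threshold in observed:
        for missing_left in (True, False):
            left, right = [], []
            for v, y in zip(values, labels):
                goes_left = missing_left if v is None else v <= threshold
                (left if goes_left else right).append(y)
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (threshold, missing_left, score)
    return best

# Toy data: a lab value that was simply not ordered for two low-risk patients.
lab_values = [1.0, 2.0, None, None, 8.0, 9.0]
needed_intervention = [0, 0, 0, 0, 1, 1]

threshold, missing_left, impurity = best_split(lab_values, needed_intervention)
print(threshold, missing_left, impurity)  # 2.0 True 0.0
```

In this toy case the absence of a lab order is itself informative, so sending missing values to the low-risk side gives a perfect split without inventing any lab values.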
Post Challenge: The team hopes to further evaluate their model in clinics and create standards and privacy-preserving analytics to foster a new generation of decision support tools. They envision similar models in the future with the ability to accurately forecast the burden of disease for patients and hospital systems to become critical components of pandemic preparedness and real-time response.
Honorable Mentions go to the following three teams for their outstanding solutions:
Feature Interpretability and Design:
Oregon Health and Science University
Approach: Ensemble classifier that uses demographic, laboratory, and diagnosis data and emphasizes model applicability and interpretability. Used SHapley Additive exPlanations (SHAP) values to explain model predictions and highlight possible clinical applications of predictive models.
Additional Details: The team used a common set of predictors, including demographics, laboratory values, and associated diagnosis codes, in an ensemble classifier that combined individual predictions from logistic regression, random forest, gradient boosted tree, and artificial neural network models. They used SHAP values to provide individual-level and population-level explanations for model predictions. This high-performing approach provides clinicians with an outcome prediction and an individualized explanation with predictors for intervention.
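The way the team combined its four models is not detailed in the source. A minimal sketch of one standard option, soft voting (a weighted average of each model's predicted probability), is shown below. The four "models" are hypothetical hand-written stand-ins rather than fitted classifiers, and the feature names are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-ins for four fitted models; each returns a probability.
def logistic_like(x):
    return sigmoid(-3.0 + 0.05 * x["crp_mg_l"] + 1.0 * x["has_asthma"])

def forest_like(x):
    return 0.6 if x["crp_mg_l"] > 30 else 0.1

def boosted_like(x):
    return 0.7 if x["has_asthma"] and x["age_years"] < 5 else 0.2

def neural_like(x):
    return sigmoid(0.02 * x["crp_mg_l"] - 0.8)

def ensemble_predict(models, x, weights=None):
    """Soft voting: weighted average of the models' predicted probabilities."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return sum(w * model(x) for w, model in zip(weights, models))

models = [logistic_like, forest_like, boosted_like, neural_like]
patient = {"age_years": 4, "crp_mg_l": 40.0, "has_asthma": 1}

risk = ensemble_predict(models, patient)
print(round(risk, 3))
```

Equal weights are an assumption made for simplicity; weights could instead be chosen from each model's validation performance.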
Post Challenge: The team began to explore how the model could be applied to patient populations to help clinicians prioritize allocation of monoclonal antibodies. They would like to further optimize their model to address different sub-populations that may have underlying biases (e.g., racial or socioeconomic disparities) and to validate it further so that it can support early intervention for high-risk children to prevent severe outcomes.
Clinical Utility:
Bruce Cragin, PhD, Citizen Scientist, Wind City Applied Research
Approach: Open-source "extreme boosting" algorithm (XGBoost) combined with a modern Shapley value analysis technique that generalizes a model's feature importance vector to a feature importance matrix.
Additional Details: A retired physicist and electrical engineer at Wind City Applied Research in rural New Hampshire, B. L. Cragin, PhD, received an Honorable Mention for Clinical Utility. Dr. Cragin noticed that model features derived from existing electronic health record codesets, as defined in the National COVID Cohort Collaborative (N3C) Data Enclave, gave consistently better performance than those based on machine-selected codes, allowing him to also benefit from the extensive clinical expertise of that community. In developing his model, he applied an open-source "extreme boosting" algorithm called XGBoost that has proven to be a top performer in earlier predictive modeling challenges. The XGBoost code also facilitated the introduction of a modern Shapley value analysis technique that generalizes a model's feature importance vector to a feature importance matrix. Each row of the matrix applies to an individual case or patient, allowing clinicians to identify specific population sub-cohorts for which a given feature is expected to be an especially good indicator of increased risk.
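To make the idea of a per-patient feature importance matrix concrete, the sketch below computes exact Shapley values by brute force over all feature orderings. It is an illustration only: the risk function is a hypothetical toy formula rather than an XGBoost model, the baseline patient is an assumption, and real tree-based tooling computes these values far more efficiently.

```python
from itertools import permutations

def toy_risk(features):
    """Hypothetical risk score with an interaction term (not a fitted model)."""
    fever, low_spo2, asthma = features["fever"], features["low_spo2"], features["asthma"]
    return 0.1 + 0.2 * fever + 0.3 * low_spo2 + 0.2 * fever * low_spo2 + 0.1 * asthma

def shapley_row(model, patient, baseline):
    """Exact Shapley values for one patient, averaged over all feature orderings."""
    names = list(patient)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference patient
        previous = model(current)
        for name in order:
            current[name] = patient[name]  # reveal one feature at a time
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

baseline = {"fever": 0, "low_spo2": 0, "asthma": 0}
patients = [
    {"fever": 1, "low_spo2": 1, "asthma": 0},
    {"fever": 1, "low_spo2": 0, "asthma": 1},
]

# One row per patient, one column per feature.
importance_matrix = [shapley_row(toy_risk, p, baseline) for p in patients]
for row in importance_matrix:
    print({name: round(value, 3) for name, value in row.items()})
```

Fever receives a contribution of about 0.3 for the first patient but only about 0.2 for the second, because the interaction with low oxygen saturation is present only in the first case. That row-by-row difference is what a single global importance vector cannot show.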
Post Challenge: Dr. Cragin hopes to establish an informal association with an existing academic or industry research team to join their effort to make a marketable product.
Computational Methodology:
ARI Science
Approach: Ensemble of Ensembles
Additional Details: The team received an Honorable Mention for Computational Methodology. The team took into account clinical and laboratory indicators from pre-visit and during-visit data that was normalized by age, gender and other demographic attributes and fed into Random Forest, Neural Network, Regression-based, Naïve Bayes and Neighborhood-based artificial intelligence (AI) models to create ensembles of predictions.
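The source does not describe how the team's ensembles were combined. As a rough sketch under that caveat, one simple reading of an "ensemble of ensembles" is a two-level average: predictions are first averaged within each model family, and the family-level scores are then averaged. The family names and probabilities below are made up for illustration.

```python
from statistics import mean

def ensemble_of_ensembles(family_predictions):
    """Average within each model family, then average the family-level scores."""
    family_scores = {
        family: mean(predictions)
        for family, predictions in family_predictions.items()
    }
    overall = mean(list(family_scores.values()))
    return overall, family_scores

# Hypothetical predicted probabilities for one patient, grouped by model family.
predictions = {
    "random_forest": [0.60, 0.70],
    "neural_network": [0.50, 0.55, 0.60],
    "naive_bayes": [0.80],
}

overall, family_scores = ensemble_of_ensembles(predictions)
flat_average = mean([p for preds in predictions.values() for p in preds])
print(round(overall, 3), round(flat_average, 3))
```

The two-level average gives each family equal weight regardless of how many member models it has, so it can differ from a flat average over all models.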
Post Challenge: The team hopes that the sub-model of their ensemble of ensembles can identify the highest risk children even prior to COVID-19 infection.