Machine learning model from the largest US COVID-19 dataset predicts disease severity

A centralized repository of COVID-19 health records is beginning to show results, starting with a new paper published today. Built last year by a team of researchers and data experts to help make sense of COVID-19, the repository is the largest set of COVID-19 records assembled to date.

The study, published in the journal JAMA Network Open, examined risk factors for severe cases of COVID-19 and traced the progression of the disease over time. The authors built machine learning models to predict which hospitalized patients would develop severe disease based on information collected on their first day in the hospital.

Using the centralized database, called the National COVID Cohort Collaborative Data Enclave, or N3C, meant the research team was able to include hundreds of thousands of patients’ records in its analysis. The study used data from 34 medical centers and included more than 1.3 million adults — 174,568 who tested positive for COVID-19 and 1,133,848 who tested negative. The records stretch from January 2020 to December 2020.

The analysis shows how treatment for COVID-19 changed over the course of 2020, as doctors tried new treatments and gained more experience with the condition. The percentage of patients treated with the antimalarial drug hydroxychloroquine, which was promoted by former President Donald Trump before being shown to be ineffective, dropped to nearly zero by May 2020. Use of the steroid dexamethasone ticked up in June 2020, after studies showed it could improve survival rates.

The analysis also confirmed that survival rates for patients with COVID-19 improved over the course of 2020. In March and April, 16 percent of people admitted to the hospital with COVID-19 died. By September and October, that figure had dropped to just under 9 percent.

People who had higher heart rates, breathing rates, and temperatures when they arrived at the hospital were more likely to need drastic interventions like ventilation. They were also more likely to die. Abnormal white blood cell counts and abnormal markers of inflammation, blood acidity, and kidney function were also linked to more severe cases. The research team built machine learning models using those and other data points to predict which patients would become seriously ill. With additional testing, the models could eventually serve as the basis for clinical decision-making tools, the authors wrote.
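To give a rough sense of how day-one measurements can feed a severity prediction, here is a minimal, purely illustrative sketch of a logistic-style risk score. The feature names, baselines, weights, and intercept below are all invented for demonstration — they are not the study’s actual model or coefficients, only a toy example of the general technique.

```python
import math

# HYPOTHETICAL weights on day-one vitals and labs, loosely inspired by the
# kinds of risk factors the study reports (heart rate, breathing rate,
# temperature, white blood cell count). These numbers are invented and do
# NOT come from the paper.
HYPOTHETICAL_WEIGHTS = {
    "heart_rate": 0.03,        # per beat/min above a 70 bpm baseline
    "respiratory_rate": 0.10,  # per breath/min above a 16/min baseline
    "temperature_c": 0.50,     # per degree C above a 37.0 C baseline
    "wbc_count": 0.08,         # per 1,000 cells/uL above an 8.0 baseline
}
BASELINES = {
    "heart_rate": 70.0,
    "respiratory_rate": 16.0,
    "temperature_c": 37.0,
    "wbc_count": 8.0,
}
INTERCEPT = -2.0  # invented; shifts the baseline risk downward


def severity_risk(day_one: dict) -> float:
    """Map day-one measurements to a probability-like score in (0, 1)
    using a logistic function over a weighted sum of deviations."""
    z = INTERCEPT + sum(
        weight * (day_one[name] - BASELINES[name])
        for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# A patient near baseline scores low; a patient with elevated vitals and
# an elevated white blood cell count scores high.
mild = severity_risk({"heart_rate": 75, "respiratory_rate": 16,
                      "temperature_c": 37.0, "wbc_count": 7.5})
severe = severity_risk({"heart_rate": 120, "respiratory_rate": 28,
                        "temperature_c": 39.5, "wbc_count": 15.0})
```

In practice, models like the study’s are fitted to real patient records rather than hand-set weights, but the shape of the idea is the same: elevated day-one measurements push the predicted risk of severe disease upward.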

Researchers have been analyzing the trajectory of COVID-19 since the very start of the pandemic. This study has the advantage of pulling from a large and diverse dataset — it’s not restricted to one hospital or one state. In the US, researchers are often limited to studying the medical records from patients at the institutions where they work. That means the number of records they’re able to include in studies can be limited, and they’re not able to easily check if their conclusions would apply in other places.

A resource like N3C, which pulls together records from dozens of institutions, sidesteps those limitations. By now, N3C includes data from 73 health institutions and has records from over 2 million COVID-19 patients. More than 200 research projects using the data are underway, including studies examining risk factors for COVID-19 re-infection and the disease’s impact on pregnancy. It’s not perfect — standardizing information across hospitals is hard, and there may not be complete data on many patients.

Still, having such a large set of data is invaluable. Researchers are using the resource to run studies that they may not have been able to tackle with just their own institution’s resources, Elaine Hill, a health economist at the University of Rochester working on pregnancy research, told The Verge last fall. “It makes it possible to shed light on things we wouldn’t be able to,” she said.