2021-02-24

In January 2020, Samuel Scarpino didn’t know what to make of Covid-19. As director of the Emergent Epidemics Lab at Northeastern University, he was working alongside epidemiologists around the world to interpret the earliest data on the new virus.

He quickly joined an effort, started by an international group of epidemiologists, to build a spreadsheet collecting and openly sharing detailed data on individual Covid-19 cases around the world. Today, the project launched a full website, Global.health, which allows open access to more than 5 million anonymized Covid-19 records from 160 countries. Each record can contain dozens of data points about the case, such as demographics, travel history, test dates, and outcomes.

The project is supported by $1.25 million in funding and other resources from Google.org, along with additional support from The Rockefeller Foundation. It is led by scholars at Oxford University, Harvard, Northeastern, Boston Children’s Hospital, Georgetown, the University of Washington, and the Johns Hopkins Center for Health Security.

STAT spoke with Scarpino about the challenges of scaling up the project and what it will take to sustain this kind of large-scale epidemiological data collection long enough to help avoid the next pandemic.

How has this project evolved from a simple spreadsheet to the current millions of records?

Around this time last year, a group of volunteers was manually entering Covid-19 records as they were reported. A Japanese news source, for example, would break the news that there was a case. We would enter it into Google Sheets and publish the data through the spreadsheet and GitHub.

We hit the size limit of what Google Sheets could handle at about 80,000 cases. So I contacted Google.org about its fellowship program, in which Google employees spend three to six months on projects that have some kind of well-defined positive social impact. We pitched the idea to them and they bit.

I thought we would get a few engineers working with us for a month or so. Instead, about 12 people spanning the entire product and engineering stack, from designers and researchers to engineers, worked on it for six months.

Over those six months, we interviewed our target audiences (journalists, public health authorities, and researchers) and then built a cloud data platform to store anonymized case records at the individual level. It was originally intended for Covid, but it is the kind of rapid-response data system that could be deployed in real time after an event.

What is the value of recording these individual row-level records compared to the case and mortality aggregate data that we are all familiar with?

It’s interesting, because when we saw that Johns Hopkins had a system taking off for aggregated data, we actually stopped trying to aggregate data ourselves. You need both types of systems, but it’s clear that the same infrastructure can’t handle both types of data.

The individual-level data often contains more detailed information about each case: things like travel history; age, race, and ethnicity, if reported; symptoms, if reported; and outcomes such as death or hospitalization, when reported. So you can get a much better idea of what’s going on.

What kind of infrastructure did you need to build to track all these variables?

One of the things we tried to design around was the assumption that the data model would need to change. This is an emerging infectious disease, and you don’t know what you don’t know. At launch, we expect 10 million records from 160 countries. At least 12 fields are filled in for 90% of them, and about 25 fields for 50% of cases.

You don’t see it, but we built a front-end system for entering records. That’s because we know we will often be collecting data from informal sources, sometimes even by hand, when cases are reported in press releases, by hospitals, and on social media.

What’s happening now really resembles what happened early on in Wuhan: we have started to see most of these variants of concern reported in newspapers, such as “An individual in the Czech Republic tested positive for B.1.1.7.” So we added a field, asking whether a variant of concern is associated with the case, and once we started recording them, we built a map of the variants.

However, not all countries can provide that granularity of data.

We actually have a view showing what percentage of all reported cases are in our system. You might think, “This is a map of how poorly we are doing our job in different places.” I don’t see it that way. We have all of the readily accessible data publicly reported by the various ministries and departments of health. So I see this as a map of the quality of public health data around the world.

One thing you’ll see is that many of the places with a high percentage of data have few cases. That’s because the quality of the data system is almost always closely related to the quality of the response.

Why didn’t the World Health Organization do this kind of tracking?

There are several reasons. First of all, it’s expensive. It takes expensive software engineers. I don’t know exactly what it would cost to hire a team like that; actually, I have an idea, probably about $5 million over six months. The WHO would have had to find those kinds of resources. And as soon as the pandemic surged, they were so short on resources that they couldn’t even keep up with everything they were already supposed to do, much less maintain a large international data system.

On top of that, the politics of international data sharing are fraught. We were able to do this by taking advantage of the fact that we are a volunteer organization. We have gone quite deep into the legal and ethical aspects of this. It turns out to be a difficult thing to do, but given the regulations on data use, data-sharing agreements, and data scraping, we believe everything we do is in fact legal.

What is Global.health’s role at this point in the pandemic? This data has been available to researchers for a long time — why are you releasing it publicly now?

There are so many cases right now that you can get by with aggregated case counts. But Covid-19 will become rare. It will fade back into the general background of things that cause respiratory illness.

As a result, we will need a higher-fidelity system that captures a lot of information, informs a rapid public health response, identifies new variants, and tracks their spread. So the one- to two-year plan is to make sure the data keeps being captured as we move into that more complicated phase. Eventually, we’ll be back in the regime where we’re looking at travel histories and age distributions and keeping track of them.

The five-year plan is tuberculosis and malaria. And the next time someone shows up at a seafood market with an emerging infectious disease …

What will it take to keep supporting this until the next pandemic? How do you actually get it to work as intended?

To be honest, that’s one of the things we’re trying to figure out. How do we organize ourselves over the next five years so that we don’t get bogged down by the same political and financial problems we discussed with the WHO? The nice thing is that we managed to pull this off because no one was paying attention; everyone was looking at Hopkins while we went off in the other direction and got all the data. Then a large number of software engineers donated $10 million worth of time to build this thing. That won’t happen again.

We have received very generous support, not only financially but also in expertise and resources, from Rockefeller, Google.org, and MapBox. But I don’t know how far that will carry us: two, three, four, five years, or perhaps longer. Maybe someone will decide to simply put in enough capital for us to be self-sustaining. But there are also business models where you don’t have to be Palantir and can still be self-sustaining. That is also one of the things we are working on.