
Open Datasets To Build AI Applications To Fight COVID-19

Merlin Peter
October 6, 2020

A worldwide health crisis of this magnitude calls for extensive public unity to achieve better global outcomes. From automated diagnostics and developing therapeutics to ensuring faster drug manufacturing and improved logistics, technology is stepping up its game to serve governments and people during such unprecedented times. 

Data And The Fight Against COVID-19

Data is essential to every effort being made to stop the outbreak from getting worse. For instance, advances in rapid genome sequencing technologies have helped us learn a great deal about the virus that causes COVID-19 faster than ever before. AI can use this data to predict compounds that could be effective against the evolving virus and accelerate the drug development process.

However, curating authentic information and articles from valid sources in the age of misinformation and fake news is a major roadblock in the battle against this novel coronavirus. Further, research on vaccines and therapeutics requires large amounts of data on factors such as transmission, incubation, risk factors, medical care, the effectiveness of non-pharmaceutical interventions, environmental stability, effects on COVID-positive patients, virus genetics, and clinical studies and their findings.

Many research institutions and organisations working for the cause have released their datasets on the web to foster public cooperation against the pandemic. Here’s a curated list of freely accessible public datasets you can use for your COVID-19 research or ML initiatives.

Freely Accessible COVID-19 Datasets 

COVID-19 Open Research Dataset Challenge (CORD-19)

The CORD-19 dataset, compiled by the Allen Institute for AI and the White House in coalition with other credible research groups, is one of the top COVID-19 datasets on Kaggle, with a usability score of 8.8. The resource has over 200K scholarly articles, including over 100K articles with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. The dataset is claimed to be the most extensive machine-readable coronavirus literature collection available for data mining to date.
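
As a rough illustration of the kind of data mining the collection supports, here is a minimal sketch of a keyword search over article metadata. It assumes a metadata CSV with cord_uid, title, and abstract columns; the three sample rows are made up for the example and are not real CORD-19 records.

```python
import csv
import io

# Made-up stand-in for a CORD-19 style metadata file (assumed column names).
SAMPLE_METADATA = """cord_uid,title,abstract,publish_time
abc123,Incubation period of SARS-CoV-2,We estimate the incubation period from case reports.,2020-03-10
def456,Chest imaging in pneumonia,Radiographic findings in viral pneumonia.,2020-02-01
ghi789,Transmission dynamics of MERS,Household transmission and incubation estimates.,2015-06-20
"""


def search_metadata(rows, keyword):
    """Return rows whose title or abstract contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if needle in (row.get("title") or "").lower()
        or needle in (row.get("abstract") or "").lower()
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_METADATA)))
matches = search_metadata(rows, "incubation")
for row in matches:
    print(row["cord_uid"], row["title"])
```

For the real collection, the same function could be fed rows read from the downloaded metadata file instead of the inline sample.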


The World Health Organisation has created a global research database curating all the latest international multilingual scientific findings and knowledge on COVID-19 to coordinate global research efforts to control the pandemic. The database is built by BIREME, the Specialised Centre of PAHO/AMRO, and part of the Regional Office’s Department of Evidence and Intelligence for Action in Health. The literature is updated daily from hand searching, bibliographic databases, and other expert-referred scientific articles. 

COVID-19 Data Repository by CSSE at Johns Hopkins University

The COVID-19 Data Repository is one of the most comprehensive sources of daily information about the coronavirus outbreak. The visual dashboard offers daily updates about case reports and time-series summary tables, including confirmed new cases, deaths, and recovered cases. The dataset is maintained by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).
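
The time-series summary tables are easier to analyse once they are reshaped into one row per location and date. The sketch below assumes a wide layout with Province/State, Country/Region, Lat, and Long columns followed by one column per date in M/D/YY form; the sample counts are illustrative placeholders, not real figures.

```python
import csv
import io
from datetime import datetime

# Illustrative wide-format sample (assumed layout, placeholder numbers).
SAMPLE_WIDE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Italy,41.87,12.56,0,0,2
Hubei,China,30.97,112.27,444,444,549
"""

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(csv_text):
    """Turn one-column-per-date rows into one record per location and date."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue  # skip location columns; the rest are dates
            records.append({
                "country": row["Country/Region"],
                "province": row["Province/State"] or None,
                "date": datetime.strptime(column, "%m/%d/%y").date(),
                "confirmed": int(value),
            })
    return records


long_rows = wide_to_long(SAMPLE_WIDE)
for record in long_rows:
    print(record)
```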

Image: COVID-19 Dashboard by CSSE at JHU

nCOV19 by Open COVID-19 Data Working Group

The nCOV19 dataset was published by the Open COVID-19 Data Working Group, a global, multi-organizational initiative for rapid public health data sharing to improve responses to infectious diseases. The associated publication, “Epidemiological data from the COVID-19 outbreak, real-time case information,” was published on March 24, 2020. The dataset includes individual-level data from national, provincial, and municipal health reports, and additional information from online reports. All data are geo-coded and, where available, include symptoms, key dates (date of onset, admission, and confirmation), and travel history.

COVID-19 Tweet IDs

The COVID-19 Tweet IDs dataset is an ongoing collection of tweet IDs associated with the coronavirus outbreak. The first tweet in the dataset dates back to January 21, 2020. You can also hydrate the tweets (i.e., retrieve the complete tweet info) using the Hydrator tool available on GitHub. The dataset is updated every week.

Associated Research Paper: Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
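
Hydration tools typically look up IDs in batches rather than one at a time. The following is a minimal sketch of that preparation step only, assuming the IDs arrive as text with one ID per line; the batch size of 100 and the sample IDs are assumptions for illustration, and the actual lookup call is left to whichever hydration tool you use.

```python
from itertools import islice


def read_tweet_ids(lines):
    """Strip whitespace and drop blank lines from an iterable of ID lines."""
    return [line.strip() for line in lines if line.strip()]


def batched(ids, size=100):
    """Yield successive lists of at most `size` IDs (size is an assumed limit)."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder IDs, not real tweets.
sample_lines = [
    "1000000000000000001\n",
    "1000000000000000002\n",
    "\n",
    "1000000000000000003\n",
]
ids = read_tweet_ids(sample_lines)
batches = list(batched(ids, size=2))
print(batches)
```

Keeping the IDs as strings avoids any risk of precision loss if they later pass through JSON or spreadsheet tools.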

COVID Tracking Project Data (US)

The COVID Tracking Project dataset by Liz Friedman collects the most comprehensive COVID-19 testing data from 50 US states, the District of Columbia, and 5 other US territories. The dataset is updated daily via The COVID Racial Data Tracker. It provides information on positive and negative test results, pending tests, and total people tested, along with racial and ethnic demographic information, to outline the US COVID-19 testing effort and the outbreak’s effects on people and communities.

Oxford Covid-19 Government Response Tracker

The Oxford COVID-19 Government Response Tracker (OxCGRT) collects data on government policy responses to the pandemic from more than 180 countries worldwide. The tracker covers 17 indicators, such as travel restrictions and school closures. The team has also collated a Lockdown Rollback Checklist, which looks at how closely countries meet four of the six WHO recommendations for lockdown relaxation.
They have also published a secondary USA COVID Policy dataset on US states' responses to COVID-19, outlining all policies affecting the residents of a state.

MIDAS 2019 Novel Coronavirus Repository

The MIDAS Coordination Centre released an online portal that makes COVID-19 information easier to search, to support modelling research on the novel coronavirus outbreak. The online portal serves as the landing page for all the latest COVID-19 data and findings, and the GitHub repository allows easy sharing of computable (CSV) files with data, parameter estimates, software, and metadata.


This ACAPS dataset outlines the government measures taken worldwide to tackle the coronavirus outbreak. The government measures dataset is available through the Humanitarian Data Exchange, a service provided by the United Nations Office for the Coordination of Humanitarian Affairs.

Image: ACAPS Dashboard

COVID-Net and COVIDx Dataset

The COVID-Net Open Source Initiative provides datasets for developing deep learning models that support effective screening of infected patients using chest radiography. Early diagnostic studies found that COVID-19-positive patients often present characteristic abnormalities in chest radiography images.

COVID-Net is a deep convolutional neural network design tailored for COVID-19 detection from chest X-ray (CXR) images. COVIDx is an open-access benchmark dataset comprising 13,975 CXR images across 13,870 patient cases.

The initiative is not a production-ready solution but aims to accelerate the development of highly accurate yet practical deep learning solutions for detecting COVID-19 cases more effectively. 

GISAID — Global Initiative on Sharing All Influenza Data

Many research laboratories around the world are rapidly generating genome sequences of the novel coronavirus to understand the virus and quickly develop therapeutics to contain the spread. GISAID data submitters and curators ensure that the real-time data shared about COVID-19 is reliable for effective research and intervention design.

COVID-19 Coronavirus Data (EU)

The European Centre for Disease Prevention and Control has published the COVID-19 Coronavirus Data including daily situation updates, the epidemiological curve, and the global geographical distribution (EU/EEA and the UK, worldwide) of the virus. The dataset is updated every day to provide the latest information on the development of cases in the EU region.

Coronavirus (Covid-19) Data in the United States

The New York Times has tracked coronavirus cases in real time since early January and has made the dataset publicly available to help researchers and organisations better understand how the virus is developing in the United States. They provide live data with the current number of cases in each geographic region and the historical tallied data for each geography at three main levels: U.S. Region Data (overall), U.S. State-Level Data, and U.S. County-Level Data.
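
Historical tallies like these are usually running totals, so a common first step is converting them into daily new cases. The sketch below assumes a state-level CSV with date, state, fips, cases, and deaths columns holding cumulative counts; the sample rows are illustrative placeholders, not real figures.

```python
import csv
import io
from collections import defaultdict

# Illustrative state-level sample (assumed columns, placeholder numbers).
SAMPLE_STATES = """date,state,fips,cases,deaths
2020-03-01,Washington,53,10,1
2020-03-01,New York,36,1,0
2020-03-02,Washington,53,14,2
2020-03-02,New York,36,4,0
2020-03-03,Washington,53,21,3
"""


def daily_new_cases(csv_text):
    """Convert cumulative case counts into per-day increases for each state."""
    by_state = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_state[row["state"]].append((row["date"], int(row["cases"])))

    result = {}
    for state, series in by_state.items():
        series.sort()  # ISO dates sort chronologically as strings
        previous = 0   # the first day's total is treated as new cases
        daily = []
        for date, cumulative in series:
            daily.append((date, cumulative - previous))
            previous = cumulative
        result[state] = daily
    return result


new_cases = daily_new_cases(SAMPLE_STATES)
for state, series in new_cases.items():
    print(state, series)
```

Real tallies are sometimes revised downward, so a negative daily value in practice would signal a correction rather than a bug in the calculation.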

AI Technology For COVID-19 and Beyond

The use of Artificial Intelligence can help us gain control over the novel coronavirus much faster. Several players in different parts of the world are venturing into using AI tools for early detection and diagnosis, monitoring treatments and patient data, contact tracing, cluster identification, projection of cases and mortality rates, and developing drugs and vaccines for controlling the spread.

A data-driven approach is the key to ensuring victory over this seemingly endless pandemic. Intelligent data curation, better data quality, easy data availability on public platforms, and effective data analysis all play a pivotal role in shaping an exit strategy from the coronavirus crisis.

This worldwide health crisis is a huge wake-up call for increased investments and initiatives to fight other deadly virus attacks in the future.

Are you building an AI solution for COVID-19 and beyond? Reach out to us for high-quality datasets for your next ML initiative.

P.S. We will update this blog frequently with the latest information on the datasets mentioned above.