This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Challenge Tasks

Tathagata Chakraborti edited this page Dec 13, 2021 · 8 revisions

We welcome any and all contributions, either directly to the codebase or in the form of blogs, analysis, etc. by consuming the data. If you are looking for ideas, we describe a few challenging and fun tasks for you to get started.

💡 Flex Your Brains

Data Source: COVID-19 India SQL DB

This is an open-ended task. Contribute your insights in the form of analysis or anomalies. You can use this data to validate or extend models developed for other countries to India [1] [2] [3]; develop epidemiological models which integrate additional variables [4] [5] [6] [7]; and understand various aspects of the pandemic in detail [8] [9], among others.

💡 OCR on Health Bulletins

Data Source: COVID-19 India Bulletin Download

Computer Vision · Beginner

Classic PDF parsers fail when tabular and textual data are embedded as images inside the document. This task asks you to help us enrich the data by extending the automated data extraction pipeline with open-source OCR techniques so that it can parse data inside images as well. To get started:

  1. Read more about the data extraction pipeline here
  2. Become familiar with the code here by going through the setup for an already existing state.
  3. Get started! States where this will be of immediate use include Rajasthan, Karnataka, and Goa (shown below, respectively). Rajasthan (left) publishes its district-level data as an image (the snapshot shows the results of running Python-tesseract on it), while much of the vaccination data for Karnataka (middle) and the daily case data for Goa (right) are presented as images.
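As a starting point, the OCR step and the clean-up that follows it could be sketched as below. This is a minimal sketch, not the pipeline's actual code: it assumes the third-party pytesseract and Pillow packages (and the Tesseract binary) are installed, and the row layout it expects (a district name followed by numeric counts) is a hypothetical simplification of what a real bulletin image yields.

```python
import re

# A count token such as "12" or "1,204".
NUMBER = re.compile(r"\d[\d,]*")


def ocr_image_to_text(image_path):
    """Run Tesseract OCR on an image and return the raw text.

    Assumes pytesseract and Pillow are installed; they are imported lazily
    so the parsing helper below works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def rows_from_ocr_text(text):
    """Turn OCR output into (district, [counts]) rows, skipping noise lines."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        counts = []
        # Peel numeric tokens off the end of the line; they are the counts.
        while tokens and NUMBER.fullmatch(tokens[-1]):
            counts.insert(0, int(tokens.pop().replace(",", "")))
        # Keep the line only if what remains looks like a name.
        name_ok = tokens and all(t.replace(".", "").isalpha() for t in tokens)
        if counts and name_ok:
            rows.append((" ".join(tokens), counts))
    return rows
```

Keeping the OCR call separate from the row parsing makes it possible to test the parsing logic on saved OCR output without re-running Tesseract.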

💥 Start experimenting on the following bulletins. 💥

💡 Translating Health Bulletins

Data Source: COVID-19 India Bulletin Download

Natural Language Processing · Advanced

Not all Indian states report their data in English. Help enrich the COVID-19 India data by creating translation models and parsers that work with Hindi and other regional languages (and extend the state of the art in "natural language" processing 😉).

Madhya Pradesh bulletins are a fantastic place to start for this task. The daily data for MP is available in Hindi in the opening segment of the bulletin.

For bulletins in other regional languages, Kerala is a good starting point: its bulletins contain detailed information in both Malayalam and English, and the latter can be used as ground truth. Gujarat is another example, but without an English equivalent. Open-source translation libraries, such as Googletrans, may be a good place to start experimenting with this task.
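Where an English version exists, as with Kerala, a translation can be scored against it. The sketch below uses a crude character-level similarity from Python's standard library rather than a proper machine translation metric; the function names are illustrative, and the translated strings are assumed to come from whichever engine you are evaluating.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def translation_similarity(candidate, reference):
    """Return a 0..1 similarity ratio between a translation and its reference."""
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()


def rank_translations(candidates, reference):
    """Sort (engine_name, translated_text) pairs from best to worst match."""
    return sorted(
        candidates,
        key=lambda pair: translation_similarity(pair[1], reference),
        reverse=True,
    )
```

A score like this is only a rough signal. For bulletins, checking that every number and date in the reference also appears in the translation is likely to matter more than fluent wording.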

💡 Positional Entity Parser

Data Source: COVID-19 India Bulletin Download

Natural Language Processing · Intermediate

Not all information in the health bulletins is tabular. In this task, you are asked to build a domain-dependent precision parser that extracts patient and pandemic data from plain-text information in health bulletins. As a concrete example, let's look at a couple of individual case records from Tamil Nadu bulletins. This is a wealth of information for scientists modeling the spread and evolution of the pandemic in detail.

In this task, we want to build precision-parsers that are trained in this domain to extract all entities from such text. As an example, we have followed up the sample case data with what a regex-based parser extracts in a structured JSON form. The schema of this data can be found here.

Death case No.161:
A 78 years old male from Chennai with Diabetes Mellitus / Systemic Hypertension / Coronary Artery Disease 
admitted on 28.05.2020 with complaints of fever, sore throat and breathing difficulty at a private hospital, 
Chennai and died on 29.05.2020 at 10.05 PM due to Pneumonia.
{
    "case_id": 161,
    "category": null,
    "age": 78,
    "gender": "male",
    "location": "Chennai",
    "comorbidity": "with Diabetes Mellitus / Systemic Hypertension / Coronary Artery Disease",
    "test": null,
    "admission": {
        "symptoms": {
            "days": null,
            "details": "fever, sore thro"
        },
        "location": null,
        "date": "28.05.2020",
        "time": null
    },
    "death": {
        "cause": "Pneumonia",
        "date": "29.05.2020",
        "time": "10.05 PM"
    }
}

Death Case No. 34923
A 65 years old Male from Trichy admitted on 17.08.2021 in Mahathma
Gandhi Memorial Government Hospital, Trichy. Outcome of COVID test
positive result on 17.08.2021. The patient died on 30.08.2021 at 05.35AM
due to COVID-19 Pneumonia.
{
    "case_id": 34923,
    "category": null,
    "age": 65,
    "gender": "Male",
    "location": "Trichy",
    "comorbidity": null,
    "test": {
        "date": "17.08.2021",
        "details": "COVID test positive result"
    },
    "admission": {
        "symptoms": null,
        "location": "Mahathma Gandhi Memorial Government Hospital,",
        "date": "17.08.2021",
        "time": null
    },
    "death": {
        "cause": "COVID-19 Pneumonia",
        "date": "30.08.2021",
        "time": "05.35AM"
    }
}

As you can see, there are variations to these texts, and regex-based parsing is bound to be brittle. However, in building entity parsers that can parse such knowledge with high accuracy, you can use the regex-based parser as the generator for the ground truth data.
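To make the idea concrete, here is a minimal regex-based sketch that handles the two sample cases above. It is not the project's parser: the flat field names (such as admission_date) are a hypothetical simplification of the nested schema, and the patterns cover only the phrasings seen in these two samples.

```python
import re

DATE = r"\d{2}\.\d{2}\.\d{4}"

# One pattern per field; each captures its value in group 1.
FIELD_PATTERNS = {
    "case_id": re.compile(r"Death\s+case\s+No\.?\s*(\d+)", re.IGNORECASE),
    "age": re.compile(r"(\d+)\s+years?\s+old", re.IGNORECASE),
    "gender": re.compile(r"years?\s+old\s+(male|female)", re.IGNORECASE),
    "location": re.compile(r"\bfrom\s+([A-Z][A-Za-z]+)"),
    "admission_date": re.compile(rf"admitted\s+on\s+({DATE})"),
    "death_date": re.compile(rf"died\s+on\s+({DATE})"),
    "death_time": re.compile(r"\bat\s+(\d{1,2}[.:]\d{2}\s*[AP]M)", re.IGNORECASE),
    "death_cause": re.compile(r"due\s+to\s+(.+?)\.?\s*$", re.IGNORECASE),
}


def parse_case(text):
    """Extract a flat record from one case description; missing fields are None."""
    flat = " ".join(text.split())  # undo the line wrapping from the PDF
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(flat)
        record[field] = match.group(1) if match else None
    for field in ("case_id", "age"):
        if record[field] is not None:
            record[field] = int(record[field])
    return record
```

Records produced this way can serve as noisy labels for training a learned entity parser, provided a sample of them is checked by hand first.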

👉 Start by browsing some examples of such case information here and run the regex-based parser on them here. Once you are ready to test at scale, use the bulletin downloader for Tamil Nadu [link] to fetch the bulletins (see here for an example of how to extract the text using the pdfplumber library).
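For orientation, extracting text with pdfplumber and splitting it into individual case entries might look like the sketch below. It assumes the third-party pdfplumber package is installed; the helper names and the header pattern are illustrative and based only on the two sample entries shown earlier.

```python
import re

# Start of a case entry, e.g. "Death case No.161:" or "Death Case No. 34923".
CASE_HEADER = re.compile(
    r"^Death\s+case\s+No\.?\s*\d+", re.IGNORECASE | re.MULTILINE
)


def extract_pdf_text(pdf_path):
    """Concatenate the text of every page in a PDF.

    Assumes pdfplumber is installed; it is imported lazily so the splitting
    helper below can be used on text that has already been extracted.
    """
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def split_into_cases(text):
    """Split bulletin text into one string per 'Death case No.' entry."""
    starts = [m.start() for m in CASE_HEADER.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

Note that the last entry runs to the end of the text, so any footer that follows the final case will be attached to it and may need trimming.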

You can also download the parsed text in bulk directly below.

Download

While the Tamil Nadu bulletins provide a specific parsing challenge, you are welcome to create such entity parsers for any state. Bulletins from Kerala, for example, start with a wealth of information provided in plain text. See here, as an example.

💡 The task of translating bulletin text into English interleaves with this task for certain states. For example, the text from the Madhya Pradesh bulletin presented above, once translated, would need to flow into this solution in order to extract all the relevant data.

💡 Speak to the Data

Data Source: COVID-19 India SQL DB

Natural Language Processing · Human-Computer Interaction · Beginner

The primary motivation behind this project is to make this data as easily accessible as possible. This applies to interest from scientists, researchers, journalists, and policy-makers, all of whom come with different levels of expertise in dealing with structured data like JSON, SQL, and other programmatic interfaces. While making the data available in such structured forms allows its consumption at scale, this challenge asks you to build easy-to-use interfaces to the data using natural language. Two immediate use cases emerge.

Natural Language to SQL

The NL2SQL task -- i.e. converting database queries in natural language to their SQL form -- is fast gaining traction in the natural language processing community. This task asks you to build such querying capabilities into our landing page so that users can generate complex "Highlights" themselves. Check out the current analysis page, or the highlights section on the individual state-level pages, for examples of such queries and possible cases of integration of NL2SQL capabilities with the data.

👉 To get started on this task, start with this fantastic intro to the EditSQL model.
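Before reaching for a learned model, a template-based baseline can help pin down which questions the interface should answer. The sketch below is such a baseline, not EditSQL: the table name, column names, and question templates are all hypothetical and do not reflect the actual schema of the COVID-19 India SQL DB.

```python
import re
import sqlite3

# Hypothetical table and column names, for illustration only.
SCHEMA = "CREATE TABLE daily_cases (state TEXT, date TEXT, confirmed INTEGER)"

# Each template pairs a question pattern with a parameterised SQL query.
TEMPLATES = [
    (
        re.compile(r"total (?:confirmed )?cases in (?P<state>[\w ]+?)\??$", re.I),
        "SELECT SUM(confirmed) FROM daily_cases "
        "WHERE state = :state COLLATE NOCASE",
    ),
    (
        re.compile(r"which day had the most cases in (?P<state>[\w ]+?)\??$", re.I),
        "SELECT date FROM daily_cases WHERE state = :state COLLATE NOCASE "
        "ORDER BY confirmed DESC LIMIT 1",
    ),
]


def question_to_sql(question):
    """Return (sql, params) for the first matching template, or None."""
    for pattern, sql in TEMPLATES:
        match = pattern.search(question.strip())
        if match:
            return sql, {"state": match.group("state").strip()}
    return None


def answer(conn, question):
    """Translate the question, run it, and return the single result value."""
    translated = question_to_sql(question)
    if translated is None:
        return None  # no template understands this question
    sql, params = translated
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None
```

Passing the state name as a bound parameter, rather than formatting it into the SQL string, keeps user input from being interpreted as SQL. The same precaution applies to the output of a learned NL2SQL model.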

Spider 1.0 · SParC

Q&A with the Data

NL2SQL covers a large part of the ways a user can interface with the data in natural language, but not all. The full scope of a natural language interface involves not just constructing an SQL query from text but also conversing with the user to elicit the desired questions, extracting the query text from the conversation, and helping the user explore the data and arrive at relevant conclusions. It can also involve bringing together and presenting multiple sources of information with explanations, along with argumentation that compares and contrasts potentially conflicting sources. Let your imagination run wild. 🙂

💡 The following examples might provide some inspiration.

The Research Literature Q&A Service allows you to ask questions in natural language to extract answers from the COVID-19 Open Research Dataset (CORD-19), a collection of over 70,000 scientific articles.

Explore

The Deep Search Service allows you to query thousands of peer-reviewed papers and licensed databases to extract critical COVID-19 knowledge.

Explore

Smart Assistants that can help you figure out your local COVID situation, navigate government mandates, provide guidance on trends, travel, etc.

Explore

💡 Data Aggregation

Data Source: Other

Natural Language Processing · Computer Vision · Intermediate

Currently, we use only the daily health bulletins put out by individual Indian states on their websites. However, the sources of COVID data from India are many and varied. See here, for example, for a wealth of Twitter sources used by covid19india.org. Help us extend the automated data extraction pipeline to include other sources of information, from other kinds of documents to social media posts.

To get started with this task, start exploring the Twitter API.

💡 Data Validation

Data Source: COVID-19 India SQL DB

Data Source: Other

As explained previously, different websites have been tracking key aspects of the pandemic in India. Some of these are listed here as "Additional Resource" (if you find more, please add them to this list). While that data is not at the same level of detail as the data extracted from the full health bulletins, there is of course some overlap in the key metrics. For this task, help validate the accuracy of the automatically extracted data against the manually maintained data.
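One simple way to frame the comparison is sketched below. It assumes both sources have already been loaded into dictionaries keyed by (state, date, metric); that key layout and the function name are illustrative choices, not part of the project's code.

```python
def compare_sources(extracted, manual, tolerance=0.0):
    """Compare two {(state, date, metric): value} mappings.

    Returns a report listing the keys that match, mismatch, or appear in
    only one source. `tolerance` is the allowed relative difference
    (0.05 means values within 5% of each other count as a match).
    """
    report = {"match": [], "mismatch": [], "only_extracted": [], "only_manual": []}
    for key in sorted(set(extracted) | set(manual)):
        if key not in manual:
            report["only_extracted"].append(key)
        elif key not in extracted:
            report["only_manual"].append(key)
        else:
            ours, theirs = extracted[key], manual[key]
            allowed = tolerance * max(abs(ours), abs(theirs))
            bucket = "match" if abs(ours - theirs) <= allowed else "mismatch"
            report[bucket].append(key)
    return report
```

Keys that appear in only one source are reported separately from mismatches, since a missing value usually points to a coverage gap rather than an extraction error.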

You can also potentially use these manual sources as ground-truth data for any of the tasks you are attempting above.