This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Adding a new state to the data extraction pipeline

Mayank Agarwal edited this page Aug 17, 2021 · 10 revisions

The data extraction pipeline executes the following steps sequentially:

  1. Downloads all the health bulletins for all states
  2. Sets up the database and the tables for all the states, and
  3. Extracts the information from each state's health bulletins and inserts it into that state's tables

The pipeline skips these three steps for any bulletin of a state that has already been downloaded and processed.
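The skip behaviour can be pictured with a small sketch. This is not the pipeline's actual code: the function name process_new_bulletins, the (state code, date) key, and the handle callback are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical key: a bulletin is identified by (state code, ISO date string).
BulletinKey = Tuple[str, str]


def process_new_bulletins(
    available: Dict[BulletinKey, str],
    processed: Set[BulletinKey],
    handle: Callable[[BulletinKey, str], None],
) -> List[BulletinKey]:
    """Run `handle` on every bulletin not seen before and return the keys handled."""
    handled: List[BulletinKey] = []
    for key in sorted(available):
        if key in processed:
            continue  # already downloaded and processed on an earlier run
        handle(key, available[key])  # stands in for download + extract + insert
        processed.add(key)
        handled.append(key)
    return handled
```

Running the function twice with the same inputs does nothing the second time, which is the property the pipeline relies on to avoid reprocessing bulletins.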

To add a new state to the data extraction pipeline, follow the steps in the same order. Each step is described in detail below.

  1. Add the bulletin download routine for the state

    • Create a new file in the data_extractor/bulletin_download/states/ folder. Use the ISO 3166-2:IN standard to name your file.
    • Inherit from the Bulletin class in the bulletins.py file. This will allow you to use functions commonly used across this procedure and will also allow the pipeline to track and save the metadata associated with the state.
    • Implement a run function in the newly created file, along with any other utility functions you find useful. The main procedure calls the run function to execute the script.
    • Create a dictionary that maps each date to the URL of that day's bulletin, then use the download_bulletin function in the Bulletin class to download and save the PDFs from these links.
    • Call the _save_state_ function to save the metadata associated with the script.
    • References:
      • Delhi (DL) : Parses the HTML on the Health Department website to create the {date: url} dictionary.
      • Telangana (TG) : Uses a set URL format to create the dictionary.
    • Finally, add the newly created state file to the data_extractor/bulletin_download/main.py file.
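As a rough illustration of step 1, the sketch below follows the fixed-URL-format approach described for Telangana. It is a minimal sketch under stated assumptions: the Bulletin class here is a stand-in stub, not the real class from bulletins.py, and the signatures of download_bulletin and _save_state_ are assumed. The state code XX, the class name ExampleState, the build_links helper, and the URL pattern are all hypothetical.

```python
import datetime
from typing import Dict


class Bulletin:
    """Stand-in for the Bulletin base class in bulletins.py (signatures assumed)."""

    def __init__(self, state_code: str) -> None:
        self.state_code = state_code
        self.downloaded: Dict[str, str] = {}
        self.state_saved = False

    def download_bulletin(self, date: str, url: str) -> None:
        # The real method downloads and saves the PDF; this stub only records it.
        self.downloaded[date] = url

    def _save_state_(self) -> None:
        # The real method saves the metadata; this stub only sets a flag.
        self.state_saved = True


class ExampleState(Bulletin):
    """Hypothetical state 'XX' whose bulletins follow a fixed URL pattern."""

    URL_FORMAT = "https://health.example.org/bulletins/{date}.pdf"  # placeholder URL

    def __init__(self, start: datetime.date, days: int) -> None:
        super().__init__("XX")
        self.start = start
        self.days = days

    def build_links(self) -> Dict[str, str]:
        """Build the {date: url} dictionary from the fixed URL format."""
        links: Dict[str, str] = {}
        for offset in range(self.days):
            date = (self.start + datetime.timedelta(days=offset)).isoformat()
            links[date] = self.URL_FORMAT.format(date=date)
        return links

    def run(self) -> None:
        """Entry point called by the main procedure."""
        for date, url in self.build_links().items():
            if date not in self.downloaded:  # skip bulletins already fetched
                self.download_bulletin(date, url)
        self._save_state_()
```

A state whose bulletins are listed on a web page, as in the Delhi reference, would replace build_links with HTML parsing and keep the rest of the structure.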
  2. Define the table structure for the state

    • Once you have completed the bulletin download routine, define the table structures that will hold the data for the state.
    • Create a folder with the name <state>_tables in the data_extractor/db folder.
    • Create one file per table in the newly created folder. Each table class should implement a create_table function and an insert_row function. See the structure for Telangana for reference.
    • Create a new file in the data_extractor/db folder, inheriting from the Database class in the db.py file. As before, use the ISO 3166-2:IN standard to name your file.
    • This new file should initialize an instance variable self.tables, a dictionary with a table identifier as the key and the table class instance as the value. Thereafter, call the create_tables function to create these tables in the database.
    • See the Telangana file for reference.
    • Finally, add the entry for the newly created state in the main.py file.
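Step 2 could look roughly like the sketch below. It assumes an SQLite backend through Python's sqlite3 module, which this page does not confirm. The Database class is a stand-in stub for the one in db.py, and the table name, column names, and the argument lists of create_table and insert_row are hypothetical.

```python
import sqlite3
from typing import Any, Dict


class CaseInfoTable:
    """Hypothetical table class; the columns are illustrative only."""

    def __init__(self) -> None:
        self.table_name = "XX_case_info"

    def create_table(self, cursor: sqlite3.Cursor) -> None:
        cursor.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table_name} "
            "(date TEXT PRIMARY KEY, cases_new INTEGER, deaths_new INTEGER)"
        )

    def insert_row(self, cursor: sqlite3.Cursor, row: Dict[str, Any]) -> None:
        # Parameterized query; a row for an existing date overwrites the old one.
        cursor.execute(
            f"INSERT OR REPLACE INTO {self.table_name} VALUES (?, ?, ?)",
            (row["date"], row["cases_new"], row["deaths_new"]),
        )


class Database:
    """Stand-in for the Database base class in db.py (interface assumed)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.cursor = self.conn.cursor()
        self.tables: Dict[str, Any] = {}

    def create_tables(self) -> None:
        for table in self.tables.values():
            table.create_table(self.cursor)
        self.conn.commit()


class ExampleStateDB(Database):
    """Hypothetical state database: registers its tables, then creates them."""

    def __init__(self, path: str = ":memory:") -> None:
        super().__init__(path)
        self.tables = {"case-info": CaseInfoTable()}
        self.create_tables()
```

Keeping the tables in a dictionary keyed by identifier lets the extraction code look up a table by name and call insert_row on it without knowing the table class.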
  3. Write the data extraction logic for the state

    • TBD
  4. Document

    • TBD