This repository has been archived by the owner on Jul 7, 2023. It is now read-only.
Adding a new state to the data extraction pipeline
Mayank Agarwal edited this page Aug 17, 2021 · 10 revisions
The data extraction pipeline executes the following steps sequentially:
- Downloads all the health bulletins for all states
- Sets up the database and the tables for all the states, and
- Extracts the information from all the health bulletins for each state and inserts it into the tables
These 3 steps are executed unless the particular health bulletin for a state has already been downloaded and processed.
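The ordering and skip behaviour described above can be sketched as follows. This is only an illustration, not the repository's code: `run_pipeline`, the `(state, date)` keys, and the action tuples are hypothetical names, and the sketch assumes that table creation is safe to repeat on every run.

```python
def run_pipeline(states, available, processed):
    """Return the ordered actions for one pipeline run.

    states:    list of state codes, e.g. ["DL", "TG"]
    available: {state: [date strings]} of bulletins published so far
    processed: set of (state, date) pairs handled in earlier runs
    All three arguments are hypothetical stand-ins for the real bookkeeping.
    """
    # Only bulletins that have not been downloaded and processed are pending.
    pending = [
        (state, day)
        for state in states
        for day in available[state]
        if (state, day) not in processed
    ]

    actions = []
    # Step 1: download bulletins for all states.
    actions += [("download", state, day) for state, day in pending]
    # Step 2: set up the tables for all states (assumed idempotent).
    actions += [("create_tables", state) for state in states]
    # Step 3: extract and insert data for each pending bulletin.
    actions += [("extract", state, day) for state, day in pending]

    processed.update(pending)
    return actions
```

Running the function a second time with the same `processed` set would schedule no downloads or extractions, which mirrors the skip condition above.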
To add a new state to the data extraction pipeline, it is best to follow the steps in the same order. Each step is described in detail below.
- Add the bulletin download routine for the state
  - Create a new file in the `data_extractor/bulletin_download/states/` folder. Use the ISO 3166-2:IN standard to name your file.
  - Inherit from the `Bulletin` class in the `bulletins.py` file. This gives you access to functions used throughout this procedure and lets the pipeline track and save the metadata associated with the state.
  - Implement a `run` function in the newly created file, along with any other utility functions you find useful. The main procedure calls the `run` function to execute the script.
  - Create a dictionary with the date as the key and the URL of that day's bulletin as the value, and use the `download_bulletin` function in the `Bulletin` class to automatically download and save the PDFs from these links.
  - Call the `_save_state_` function to save the metadata associated with the script.
  - References:
    - Delhi (DL): Parses the HTML on the Health Department website to create the `{date: url}` dictionary.
    - Telangana (TG): Uses a set URL format to create the dictionary.
  - Finally, add the newly created state file to the `data_extractor/bulletin_download/main.py` file.
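As a rough sketch of the download routine described above, the example below builds a `{date: url}` dictionary from a fixed URL format, in the spirit of the Telangana approach. The `Bulletin` class here is a simplified stand-in written for this example, not the one in `bulletins.py`; its constructor, the signature of `download_bulletin`, the `ExampleState` class, the `XX` state code, and the URL format are all assumptions.

```python
from datetime import date, timedelta


class Bulletin:
    """Simplified stand-in for the repository's Bulletin class (assumed API)."""

    def __init__(self, state_code):
        self.state_code = state_code
        self.downloaded = {}      # date string -> URL, in place of saved PDFs
        self.state_saved = False

    def download_bulletin(self, datestr, url):
        # The real function downloads and saves the PDF; this only records it.
        self.downloaded[datestr] = url

    def _save_state_(self):
        # The real function persists the metadata; this only sets a flag.
        self.state_saved = True


class ExampleState(Bulletin):
    # Hypothetical URL format; a real state file would use the actual scheme.
    URL_FORMAT = "https://example.org/bulletins/{day:02d}-{month:02d}-{year}.pdf"

    def __init__(self, start, end):
        super().__init__("XX")    # placeholder, not a real ISO 3166-2:IN code
        self.start = start
        self.end = end

    def get_date_url_map(self):
        """Build the {date: url} dictionary for every day in the range."""
        urls = {}
        day = self.start
        while day <= self.end:
            urls[day.isoformat()] = self.URL_FORMAT.format(
                day=day.day, month=day.month, year=day.year
            )
            day += timedelta(days=1)
        return urls

    def run(self):
        """Entry point that the main procedure is expected to call."""
        for datestr, url in self.get_date_url_map().items():
            self.download_bulletin(datestr, url)
        self._save_state_()
        return self.downloaded
```

A state whose bulletins are listed on a web page, as with Delhi, would replace `get_date_url_map` with HTML parsing while keeping `run` unchanged.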
- Define the table structure for the state
  - Once you have completed the bulletin download routine, define the table structures that will hold the data for the state.
  - Create a folder named `<state>_tables` in the `data_extractor/db` folder.
  - Create a file for each table in the newly created folder. Each table class should implement a `create_table` and an `insert_row` function. See the structure for Telangana for reference.
  - Create a new file in the `data_extractor/db` folder, inheriting from the `Database` class in the `db.py` file. As before, use the ISO 3166-2:IN standard to name your file.
  - This new file should initialize an instance variable `self.tables`, a dictionary with a table identifier as the key and the table class instance as the value. Thereafter, call the `create_tables` function to create these tables in the database.
  - See the Telangana file for reference.
  - Finally, add the entry for the newly created state in the `main.py` file.
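The table and database layout described above might look roughly like the sketch below. It assumes SQLite via Python's standard `sqlite3` module, which the page does not confirm. The `Database` class is a simplified stand-in for the one in `db.py`, and `CaseInfoTable`, its columns, the `case-info` identifier, and `ExampleStateDatabase` are hypothetical names chosen for illustration.

```python
import sqlite3


class Database:
    """Simplified stand-in for the repository's Database class (assumed API)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.tables = {}

    def create_tables(self):
        # Ask every registered table object to create itself.
        for table in self.tables.values():
            table.create_table()


class CaseInfoTable:
    """Hypothetical table with one row of case counts per bulletin date."""

    def __init__(self, conn):
        self.conn = conn

    def create_table(self):
        # IF NOT EXISTS keeps repeated pipeline runs from failing.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS case_info ("
            "date TEXT PRIMARY KEY, cases_new INTEGER, deaths_new INTEGER)"
        )
        self.conn.commit()

    def insert_row(self, row):
        # row is a dict whose keys match the named placeholders.
        self.conn.execute(
            "INSERT OR REPLACE INTO case_info "
            "VALUES (:date, :cases_new, :deaths_new)",
            row,
        )
        self.conn.commit()


class ExampleStateDatabase(Database):
    def __init__(self, path=":memory:"):
        super().__init__(path)
        # Table identifier -> table class instance, as described above.
        self.tables = {"case-info": CaseInfoTable(self.conn)}
        self.create_tables()
```

Keying `self.tables` by identifier lets the extraction code look up the right table by name and call `insert_row` without knowing the table's schema.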
- Write the data extraction logic for the state
  - TBD
- Document
  - TBD