Data Pipelines course material #56

Status: Open. 6 commits to be merged into base: main

Conversation

bhupatiraju

This pull request adds tutorial notes for the Data Pipelines topic (part of the advanced topics section of the Python course). It uses the BOOST financial data from Kenya to illustrate how to construct a data pipeline with the medallion schema and then automate that pipeline.

It contains the following:

  1. Introduction File: Created a file named intro-to-data-pipelines that provides an overview of the data pipeline topic and illustrates its importance in data processing workflows.

  2. Data Processing Walk-through: Created files called Bronze, Silver and Gold, which contain the data processing code using the medallion schema for the Kenya data.

  3. Get additional data: Created a file called subnational_population that retrieves data from the WB API and keeps only the columns needed for merging with the cleaned Kenya data.

  4. Aggregation: Simple aggregation is done using the subnational population and the cleaned Kenya data to illustrate a simple use case

  5. Orchestration: Added a section on orchestration using Databricks Workflows, detailing how to automate and manage the data processing pipeline effectively (contained in the intro-to-data-pipelines file).

weilu (Member) commented Oct 10, 2024

Thanks @bhupatiraju! Following the convention of other modules, can you add a README.md and clear the outputs of the notebooks?

@bhupatiraju bhupatiraju marked this pull request as draft December 10, 2024 18:54
@weilu weilu marked this pull request as ready for review December 17, 2024 16:32
## Contact

If you have any questions you can contact one of the teams behind this training
on dimeanalytics@worldbank.org.

@bhupatiraju this README seems identical to what's currently at the root of this repo. Did you mean to copy and modify?

@weilu weilu left a comment

@bhupatiraju I think the key thing to consider is to re-organize the flow to have a smaller code example (subnational population) moved up front, right after the motivations for data pipelines – see my inline comments.

General:
Consider running the content through ChatGPT for proofreading. It's generally well written, but I feel there are some potential improvements that ChatGPT could easily pick up.

Some questions to anticipate:

  • Why do we need to use pyspark to process the Kenya BOOST data? Can we not just use pandas? When to use pyspark vs plain pandas?
  • Why do we have separate scripts for different stages of the medallion architecture? Can we not just have a single script?
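
On the second question, one possible way to frame the answer is the minimal plain-Python sketch below (it is not the PR's PySpark code). It assumes each medallion layer has a single job, so a stage can be re-run or tested on its own. The rows, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw rows standing in for a raw extract; not real BOOST data.
RAW_ROWS = [
    {"County": " Nairobi ", "Year": "2020", "Spent": "100.5"},
    {"County": "Mombasa", "Year": "2020", "Spent": "n/a"},
    {"County": "Nairobi", "Year": "2020", "Spent": "50"},
]


def bronze(rows):
    """Bronze: keep the data as received, copying rows without changing values."""
    return [dict(row) for row in rows]


def silver(rows):
    """Silver: clean names and types, dropping rows whose amount cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            spent = float(row["Spent"])
        except ValueError:
            continue  # skip unparseable amounts
        cleaned.append(
            {"county": row["County"].strip(), "year": int(row["Year"]), "spent": spent}
        )
    return cleaned


def gold(rows):
    """Gold: aggregate spending per (county, year) for analysis."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["county"], row["year"])] += row["spent"]
    return dict(totals)


result = gold(silver(bronze(RAW_ROWS)))
print(result)
```

With the stages split like this, a change to the cleaning rules only requires re-running `silver` and `gold`, and the untouched bronze copy remains available for debugging.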

"\n",
" In this tutorial we will use the workflows feature in **databricks** for orchestration, although it's possible to consrtuct it in pySpark itself. \n",
"\n",
"**Monitoring and Logging:**\n",

Add a new line after this section title, so it is consistent with the rest

}
},
"source": [
"#### The Core of the Data Pipeline Process\n",

This section title doesn't quite read well. How about something like "What are the building blocks of a data pipeline?" to keep the question format used for the previous 2 sections?

}
},
"source": [
"## Introduction to Data Pipelines"

Consider using # as this is the h1 level heading for this notebook

}
},
"source": [
"#### What is a data pipeline?\n",

This is the next level of section heading, so should use ## (h2). Same consistency should be applied to all subsequent headings.

"#### The Core of the Data Pipeline Process\n",
"A data pipeline is a structured framework that enables the flow of data from source to destination, encompassing several key processes. Specific implementation may vary but the fundamental components of a data pipeline can be abstracted as follows:\n",
"\n",
"**Data Ingestion:**\n",

Use proper headings (e.g. ###) so the table of contents will nest properly. Same below.

}
},
"source": [
"The flow diagram below shows the flow of data through these layers, and we will illstrae this model using the Kenya BOOST data. \n",

s/illstrae/illustrate

" .option(\"inferSchema\", \"true\")\n",
" .load(Data_DIR))\n",
"\n",
"# Clean column names by replacing spaces and special characters\n",

Can you check if this step is still necessary? I think I read somewhere that the latest version of delta(?) accommodates spaces, so perhaps special characters too?
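
For reference, the cleaning step under discussion could be sketched in plain Python as below. This is an illustration only, not the code in the PR: the helper name and the sample headers are hypothetical. Whether the step can be dropped depends on whether the target table has Delta column mapping enabled, which is an assumption worth checking against the Delta documentation.

```python
import re


def clean_column_name(name: str) -> str:
    """Collapse each run of non-alphanumeric characters into one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop underscores left at either end, e.g. from a trailing ")".
    return cleaned.strip("_")


# Hypothetical headers, not taken from the BOOST file.
columns = ["Fiscal Year", "Approved Budget (KES)", "County/Region"]
cleaned_columns = [clean_column_name(c) for c in columns]
print(cleaned_columns)
```

In PySpark, the same helper could be applied to a whole DataFrame with `df.toDF(*[clean_column_name(c) for c in df.columns])`.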

"source": [
"#### Silver\n",
"\n",
"In the silver stage (again the script is found in the data_pipelines project folder as silver.py), we read the data produced in the bronze stage and transform and refine the data. \n",

Link to the silver script instead of "(again ...)".

}
},
"source": [
"##### Subnational Population\n",

I'd suggest moving this example up to be right before the "The Core of the Data Pipeline.." section. Not only is this a good stand-alone example of a simple data pipeline covering ingestion, processing, and writing, it also provides a quick recap of basic data processing using pandas. The only thing new to someone who already knows pandas is writing to the data lake in delta format, which would be a good opportunity to introduce the concepts of Spark and delta.

Then you can move on to the next section on the basic components / building blocks of a data pipeline, refer back to this simple example, and add the orchestration with scheduling demo to show how this can automate updating and saving the subnational population dataset, which can be reused. The later BOOST example can then focus on illustrating the medallion architecture and showcasing the re-use/merging with the subnational dataset. This way, we build the concepts on top of each other and gradually introduce them to the participants. Thoughts?


It also breaks up the instruction/concept-only wall of text up front.
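
To make the suggestion concrete, a stripped-down stand-alone pipeline could look like the sketch below. It is an assumption-laden illustration: it uses only the standard library and hypothetical in-memory data in place of the WB API call, and it writes a CSV file where the tutorial writes a delta table. All column names and values are placeholders.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical stand-in for an API response; values are placeholders, not real data.
RAW_CSV = """region,year,population,source_note
Region A,2020,1000,placeholder
Region B,2020,2500,placeholder
"""

KEEP_COLUMNS = ["region", "year", "population"]


def ingest(text):
    """Ingestion: parse the raw text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Processing: keep only the columns needed for a later merge and cast types."""
    return [
        {"region": r["region"], "year": int(r["year"]), "population": int(r["population"])}
        for r in rows
    ]


def write(rows, path):
    """Writing: persist the cleaned rows (a CSV file here, standing in for a delta table)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=KEEP_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


output_path = Path(tempfile.mkdtemp()) / "subnational_population.csv"
cleaned_rows = process(ingest(RAW_CSV))
write(cleaned_rows, output_path)
```

Each function maps to one building block (ingestion, processing, writing), so the later section on pipeline components can refer back to it, and a scheduler only needs to call the three steps in order.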

Comment on lines +970 to +976
"source": [
"### Lakehouse\n",
"\n",
"A lakehouse is a unified architecture which enables storage of structured and unstructured data in a unified system. The databricks lakehouse is a specific implementation, which offers tools to process this data in the lakehouse setup. \n",
"\n",
"It allows for unified data management, and more importantly avoids **data silos**. In our setting, the delta tables constructed are not necessarily tied to the project, and can be accessed across multiple projects. For instance, the table containing subnational population for Kenya can be accessed for the purposes of a different different project. "
]

Not sure if it's necessary to introduce the lakehouse, especially towards the end. Consider removing it, or moving it up to where we first read from or write to the lake.
