Home
Welcome to the Deeploans Data Lakehouse ETL Pipeline Wiki! This repository hosts the ETL pipeline designed for creating and managing the Deeploans data lakehouse, integrating raw data from external providers (such as ESMA Sec Rep and Quandl) into a structured format for analytics and machine learning.
The Lakehouse Architecture combines the scalability of a data lake with the data management capabilities of a data warehouse, providing a flexible and efficient solution for data storage and processing.
- Architecture Overview
- Lakehouse Schema
- Data Profiling and Quality Checks
- Data Assumptions
- Running the ETL Pipeline
- Tools and Technologies
- Troubleshooting
- Contributing
Architecture Overview
Data Location and Infrastructure
- Storage: Raw data is securely stored in a Google Cloud Storage (GCS) bucket.
- Processing: Data transformations are managed using GCP Dataproc Serverless and Google Cloud Composer, ensuring efficient automated processing across different layers of the lakehouse.
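As a rough illustration of the processing setup, the sketch below submits a PySpark job to Dataproc Serverless with the Python client library; the project ID, region, and script URI are placeholders rather than values taken from this repository.

```python
from google.cloud import dataproc_v1

# All identifiers below are placeholders for illustration.
PROJECT_ID = "my-gcp-project"
REGION = "europe-west1"
JOB_URI = "gs://my-etl-bucket/jobs/bronze_ingest.py"

# Dataproc Serverless batches are regional, so point the client at the regional endpoint.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = JOB_URI

operation = client.create_batch(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}",
    batch=batch,
)
print(f"Submitted batch: {operation.result().name}")
```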
Architecture Layers
- Bronze Layer: Stores minimally processed raw data, with essential profiling checks.
- Silver Layer: Cleans and normalizes data, separating dimensions for efficient querying.
- Gold Layer: Prepares data for business metrics and ML model features.
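For orientation, here is a minimal PySpark sketch of the Bronze step under assumed bucket names and paths (not taken from this repository): land the raw files one-to-one, adding only an ingestion timestamp to support profiling and SCD Type 2 tracking.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical locations; substitute the GCS bucket configured in the Makefile.
RAW_PATH = "gs://deeploans-raw/esma_sec_rep/assets/*.csv"
BRONZE_PATH = "gs://deeploans-lakehouse/bronze/assets/"

spark = SparkSession.builder.appName("bronze_ingest_assets").getOrCreate()

# Bronze keeps a one-to-one copy of the raw data, adding only an ingestion
# timestamp so downstream profiling and SCD Type 2 logic can track loads.
raw_df = (
    spark.read.option("header", True).csv(RAW_PATH)
    .withColumn("ingested_at", F.current_timestamp())
)
raw_df.write.mode("append").parquet(BRONZE_PATH)
```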
Lakehouse Schema
Bronze Layer
- Objective: Store a one-to-one copy of the raw data, with essential data profiling checks.
- Transformations: Minimal; mainly data profiling and SCD Type 2 support.
Silver Layer
- Objective: Provide cleaned and normalized data, with dimensions separated for BI and ML.
- Transformations: Dimensional normalization to enhance querying and feature engineering.
Gold Layer
- Objective: Prepare data for KPIs and ML models.
- Transformations: Final refinement steps for business metrics and machine learning features.
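A hedged sketch of a Silver-style normalization step, assuming Parquet storage and illustrative column names; the actual dimension layout is defined in the repository code.

```python
from pyspark.sql import SparkSession

# Illustrative paths and columns only.
BRONZE_PATH = "gs://deeploans-lakehouse/bronze/assets/"
SILVER_FACT = "gs://deeploans-lakehouse/silver/assets_fact/"
SILVER_DIM = "gs://deeploans-lakehouse/silver/obligor_dim/"

spark = SparkSession.builder.appName("silver_assets").getOrCreate()

bronze = spark.read.parquet(BRONZE_PATH).dropDuplicates()

# Split descriptive attributes into a dimension and keep measures in a fact
# table, so BI queries and feature pipelines can join on a narrow key.
dim_cols = ["dl_code", "obligor_name", "country"]   # hypothetical columns
fact_cols = ["dl_code", "AS3", "current_balance"]   # hypothetical columns

bronze.select(*dim_cols).dropDuplicates().write.mode("overwrite").parquet(SILVER_DIM)
bronze.select(*fact_cols).write.mode("overwrite").parquet(SILVER_FACT)
```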
Data Profiling and Quality Checks
Initial checks for raw data (Bronze layer):
- Primary Key Integrity: Ensures uniqueness and completeness.
- Table & Column Integrity: Ensures non-null values in required fields.
- Essential Column Checks: Confirms presence of required columns.
Additional checks based on data type and asset class (Silver layer):
- Validations tailored to each asset class.
- Ensures normalized data quality before advancing to the Gold layer.
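These checks can be expressed as simple DataFrame assertions. The sketch below is an illustrative version of the initial (Bronze-level) checks, with hypothetical column names passed in by the caller; the real checks live in the pipeline code.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def profile_bronze(df: DataFrame, key_cols: list, required_cols: list) -> None:
    """Illustrative Bronze profiling checks (hypothetical helper)."""
    # Essential column check: all required columns must be present.
    missing = set(required_cols) - set(df.columns)
    assert not missing, f"Missing required columns: {missing}"

    # Table & column integrity: required fields must be non-null.
    for col in required_cols:
        nulls = df.filter(F.col(col).isNull()).count()
        assert nulls == 0, f"{nulls} null values found in required column {col}"

    # Primary key integrity: the composite key must be unique.
    total = df.count()
    distinct_keys = df.select(*key_cols).dropDuplicates().count()
    assert total == distinct_keys, "Duplicate composite keys detected"

# Hypothetical usage:
# profile_bronze(assets_df, ["dl_code", "AS3"], ["dl_code", "AS3"])
```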
Data Assumptions
- Assets: `dl_code + AS3`
- Collateral: `dl_code + CS1`
- Bond Information: `dl_code + BS1 + BS2`
- Amortization: `dl_code + AS3`
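If these field combinations act as composite keys (an assumption made here for illustration), they can be captured in a small lookup that the profiling code reuses; a minimal sketch:

```python
# Assumed composite-key columns per table, following the list above.
COMPOSITE_KEYS = {
    "assets": ["dl_code", "AS3"],
    "collateral": ["dl_code", "CS1"],
    "bond_information": ["dl_code", "BS1", "BS2"],
    "amortization": ["dl_code", "AS3"],
}

def key_columns(table: str) -> list:
    """Return the assumed composite key for a table (names are illustrative)."""
    return COMPOSITE_KEYS[table]
```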
Running the ETL Pipeline
- Clone the Repository: Clone this repo and ensure the `gcloud` CLI is installed and configured for your GCP project.
- Edit Configuration: Modify the `Makefile` if using a different GCS bucket.
- Run the setup commands: `make setup && make build`
- Upload the DAG to Google Cloud Composer (a hedged DAG sketch follows below):
  - Stage 1: Run the DAG for Bronze-level profiling. Set `max_active_tasks` to enable parallel execution.
  - Stage 2: Modify the DAG for Silver-level processing. Set `max_active_tasks` to 1 to avoid conflicts.
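For orientation, here is a hedged sketch of what a Stage 1 DAG might look like using the Google provider's Dataproc batch operator; the project, region, bucket, and table names are placeholders, and the real DAG ships with this repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Placeholders; use your own project, region, and bucket.
PROJECT_ID = "my-gcp-project"
REGION = "europe-west1"
JOBS_URI = "gs://my-etl-bucket/jobs"

with DAG(
    dag_id="deeploans_bronze_profiling",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    # Stage 1: raise max_active_tasks for parallel Bronze profiling;
    # Stage 2: set it to 1 for Silver-level processing to avoid conflicts.
    max_active_tasks=4,
) as dag:
    for table in ["assets", "collateral", "bond_information", "amortization"]:
        DataprocCreateBatchOperator(
            task_id=f"profile_{table}",
            project_id=PROJECT_ID,
            region=REGION,
            batch={
                "pyspark_batch": {
                    "main_python_file_uri": f"{JOBS_URI}/bronze_profiling.py",
                    "args": [table],
                }
            },
        )
```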
Tools and Technologies
- GCP Dataproc Serverless: Handles data processing and transformation in a scalable, serverless environment.
- Google Cloud Composer: Orchestrates ETL tasks and workflows with Airflow.
- Looker Studio: For data visualization and business intelligence dashboards.
- Dataflow: Processes data for machine learning feature engineering and transformation.
Troubleshooting
Issue: GCP Permissions Error
- Solution: Ensure your GCP IAM roles have the correct permissions for Dataproc, GCS, and Cloud Composer.
Issue: Data Integrity Check Failed in Bronze Layer
- Solution: Verify raw data files for required columns and re-upload if necessary.
Contributing
Hello!
Thank you for your interest in contributing to the Deeploans Data Lakehouse ETL Pipeline! We welcome improvements, bug fixes, and new features to help make this project even better.
If you're submitting a pull request, please make sure to read and agree to our Contributor License Agreement (CLA). By submitting a contribution, you indicate that you agree to the terms of the CLA.
To maintain a clean and informative commit history, please follow these naming conventions for commit messages:
- `bugfix/Something` - Fixes for bugs in the codebase
- `feature/Something` - New features or major changes
- `docfix/Something` - Documentation fixes
- `refactor/Something` - Code refactoring and improvements
- `performance/Something` - Performance optimizations
- `test/Something` - New or updated tests
- `enhancement/Something` - Enhancements to existing features
- `security/Something` - Security-related updates or fixes
The Deeploans Data Lakehouse ETL Pipeline is licensed under the Apache 2.0 License, so contributions must also be compatible with this license. For issues or other information, contact support@algoritmica.ai.
Welcome! We look forward to your contributions!