
Deeploans Data Lakehouse ETL Pipeline

Welcome to the Deeploans Data Lakehouse ETL Pipeline Wiki! This repository hosts the ETL pipeline designed for creating and managing the Deeploans data lakehouse, integrating raw data from external providers (such as ESMA Sec Rep and Quandl) into a structured format for analytics and machine learning.

The Lakehouse Architecture combines the scalability of a data lake with the data management capabilities of a data warehouse, providing a flexible and efficient solution for data storage and processing.


Table of Contents

  1. Architecture Overview
  2. Lakehouse Schema
  3. Data Profiling and Quality Checks
  4. Data Assumptions
  5. Running the ETL Pipeline
  6. Tools and Technologies
  7. Troubleshooting
  8. Contributing

1. Architecture Overview

The Deeploans lakehouse runs on Google Cloud Platform and organizes data into three layers (Bronze, Silver, Gold), each with its own storage location and processing jobs.

Data Location and Infrastructure

  • Storage: Raw data is securely stored in a Google Cloud Storage (GCS) bucket.
  • Processing: Data transformations are managed using GCP Dataproc Serverless and Google Cloud Composer, ensuring efficient automated processing across different layers of the lakehouse.

Architecture Layers

  1. Bronze Layer: Stores minimally processed raw data, with essential profiling checks.
  2. Silver Layer: Cleans and normalizes data, separating dimensions for efficient querying.
  3. Gold Layer: Prepares data for business metrics and ML model features.
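
As a concrete illustration of how data enters the lakehouse, the sketch below shows a minimal PySpark job that copies raw provider files from a raw GCS prefix into the Bronze layer. The bucket name, prefixes, and file format are illustrative assumptions, not the project's actual layout.

    # Minimal sketch of a Bronze ingestion step on Dataproc Serverless.
    # Assumptions: raw CSVs land under a raw/ prefix and each layer has its own
    # prefix in the same GCS bucket; the bucket name below is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deeploans-bronze-ingest").getOrCreate()

    BUCKET = "gs://deeploans-lakehouse"             # hypothetical bucket
    raw_path = f"{BUCKET}/raw/esma_sec_rep/"        # raw data from the provider
    bronze_path = f"{BUCKET}/bronze/esma_sec_rep/"  # Bronze layer destination

    # One-to-one copy of the raw data into the Bronze layer, stored as Parquet.
    raw_df = spark.read.option("header", True).csv(raw_path)
    raw_df.write.mode("append").parquet(bronze_path)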

2. Lakehouse Schema

Each layer of the data lakehouse has a distinct purpose and its own set of transformations, detailed below.

Bronze Layer

  • Objective: Store a one-to-one copy of the raw data, with essential data profiling checks.
  • Transformations: Minimal; limited to data profiling and SCD Type 2 history tracking (see the sketch below).
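
A common way to implement SCD Type 2 tracking is to stamp each incoming record with validity metadata. The helper below is a hedged sketch of that idea; the actual column names used by the pipeline may differ.

    # Illustrative SCD Type 2 bookkeeping, assuming history is tracked with
    # valid_from / valid_to / is_current columns (assumed names).
    from pyspark.sql import DataFrame, functions as F

    def add_scd2_columns(df: DataFrame) -> DataFrame:
        """Stamp incoming Bronze records with SCD Type 2 metadata."""
        return (
            df.withColumn("valid_from", F.current_timestamp())
              .withColumn("valid_to", F.lit(None).cast("timestamp"))
              .withColumn("is_current", F.lit(True))
        )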

Silver Layer

  • Objective: Store cleaned and normalized data, with dimensions separated for BI and ML.
  • Transformations: Dimensional normalization to enhance querying and feature engineering.

Gold Layer

  • Objective: Prepare data for KPIs and ML models.
  • Transformations: Final refinement steps for business metrics and machine learning features (an illustrative aggregation is sketched below).
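
To make the Gold layer's role concrete, the sketch below aggregates a Silver assets table into simple portfolio-level metrics. The metric, column names, and paths are assumptions for illustration only, not the project's actual KPI definitions.

    # Illustrative Gold-layer aggregation from Silver data.
    # Assumptions: a Silver "assets" table keyed by dl_code with a
    # current_balance column; paths and metric names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("deeploans-gold-kpis").getOrCreate()

    silver_assets = spark.read.parquet("gs://deeploans-lakehouse/silver/assets/")

    portfolio_kpis = (
        silver_assets.groupBy("dl_code")
        .agg(
            F.count("*").alias("n_loans"),
            F.sum("current_balance").alias("total_outstanding_balance"),
        )
    )
    portfolio_kpis.write.mode("overwrite").parquet(
        "gs://deeploans-lakehouse/gold/portfolio_kpis/"
    )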

3. Data Profiling and Quality Checks

Profiling checks run at each layer to maintain data quality before data advances to the next stage.

Bronze-Level Profiling

Initial checks for raw data:

  • Primary Key Integrity: Ensures uniqueness and completeness.
  • Table & Column Integrity: Ensures non-null values in required fields.
  • Essential Column Checks: Confirms presence of required columns.
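
The checks above could be expressed as simple PySpark assertions, as in the hedged sketch below; the real pipeline may collect and report failures differently.

    # Minimal sketch of the Bronze-level checks listed above.
    from pyspark.sql import DataFrame, functions as F

    def profile_bronze(df: DataFrame, key_cols: list, required_cols: list) -> None:
        # Essential column check: required columns must be present.
        missing = [c for c in required_cols if c not in df.columns]
        assert not missing, f"Missing required columns: {missing}"

        # Table & column integrity: required fields must be non-null.
        for col in required_cols:
            nulls = df.filter(F.col(col).isNull()).count()
            assert nulls == 0, f"{nulls} null values in column {col}"

        # Primary key integrity: the composite key must be unique.
        dupes = df.groupBy(*key_cols).count().filter(F.col("count") > 1).count()
        assert dupes == 0, f"{dupes} duplicated primary key values"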

Silver-Level Profiling

Additional checks based on data type and asset class:

  • Validations tailored to each asset class.
  • Ensures normalized data quality before advancing to the Gold layer.
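
One way to tailor validations to each asset class is a small rule catalogue keyed by asset class, as sketched below; the specific rules and column names are illustrative assumptions.

    # Hedged example of asset-class-specific Silver checks.
    from pyspark.sql import DataFrame, functions as F

    SILVER_RULES = {
        # asset class -> boolean validation expressions (assumed rules)
        "assets": [F.col("current_balance") >= 0],
        "collateral": [F.col("CS1").isNotNull()],
    }

    def count_silver_violations(df: DataFrame, asset_class: str) -> int:
        """Return the number of rows violating any rule for this asset class."""
        violations = 0
        for rule in SILVER_RULES.get(asset_class, []):
            violations += df.filter(~rule).count()
        return violations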

4. Data Assumptions

The following assumptions underpin the data structure and keep record keys consistent across tables.

Primary Key Definitions

  • Assets: dl_code + AS3
  • Collateral: dl_code + CS1
  • Bond Information: dl_code + BS1 + BS2
  • Amortization: dl_code + AS3
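
For reference, these composite keys can be expressed as a single lookup table that a profiling job consumes; the table identifiers below are illustrative.

    # Composite primary keys per table, as listed above.
    PRIMARY_KEYS = {
        "assets": ["dl_code", "AS3"],
        "collateral": ["dl_code", "CS1"],
        "bond_information": ["dl_code", "BS1", "BS2"],
        "amortization": ["dl_code", "AS3"],
    }

A profiling job can then pass PRIMARY_KEYS[table] as the key columns of a uniqueness check like the one sketched in section 3.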

5. Running the ETL Pipeline

Follow the steps below to set up, build, and run the ETL pipeline.

Setup and Configuration

  1. Clone the Repository: Clone this repository locally and ensure the gcloud CLI is installed and configured for your GCP project.
  2. Edit Configuration: Modify the Makefile if using a different GCS bucket.

Build and Deploy

  1. Run the setup commands:
    make setup && make build
  2. Upload DAG to Google Cloud Composer:
    • Stage 1: Run the DAG for Bronze-level profiling. Set max_active_tasks to a value greater than 1 to enable parallel execution (see the DAG sketch below).
    • Stage 2: Modify the DAG for Silver-level processing. Set max_active_tasks to 1 to avoid conflicts.
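
The sketch below shows where max_active_tasks fits in a DAG definition, assuming a recent Airflow 2.x environment on Cloud Composer; the DAG ID, task IDs, and the value 8 are placeholders, and EmptyOperator stands in for the actual Dataproc Serverless submission tasks.

    # Hedged sketch of the DAG-level setting described above (Airflow 2.4+ syntax).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="deeploans_bronze_profiling",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        max_active_tasks=8,  # Stage 1: allow per-table profiling tasks in parallel
        catchup=False,
    ) as dag:
        profile_assets = EmptyOperator(task_id="profile_assets")
        profile_collateral = EmptyOperator(task_id="profile_collateral")

    # For Stage 2 (Silver-level processing), set max_active_tasks=1 so tasks run
    # sequentially and avoid conflicts.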

6. Tools and Technologies

The project relies on the following Google Cloud tools and services.

  • GCP Dataproc Serverless: Handles data processing and transformation in a scalable, serverless environment.
  • Google Cloud Composer: Orchestrates ETL tasks and workflows with Airflow.
  • Looker Studio: For data visualization and business intelligence dashboards.
  • Dataflow: Processes data for machine learning feature engineering and transformation.

7. Troubleshooting

Common issues and their solutions are listed below.

Issue: GCP Permissions Error

  • Solution: Ensure your GCP IAM roles have the correct permissions for Dataproc, GCS, and Cloud Composer.

Issue: Data Integrity Check Failed in Bronze Layer

  • Solution: Verify raw data files for required columns and re-upload if necessary.

8. Contributing

Thank you for your interest in contributing to the Deeploans Data Lakehouse ETL Pipeline! We welcome improvements, bug fixes, and new features that help make this project even better.

Pull Requests

If you're submitting a pull request, please make sure to read and agree to our Contributor License Agreement (CLA). By submitting a contribution, you indicate that you agree to the terms of the CLA.

Git Commit Messages

To maintain a clean and informative commit history, please follow these naming conventions for commit messages:

  • bugfix/Something - Fixes for bugs in the codebase
  • feature/Something - New features or major changes
  • docfix/Something - Documentation fixes
  • refactor/Something - Code refactoring and improvements
  • performance/Something - Performance optimizations
  • test/Something - New or updated tests
  • enhancement/Something - Enhancements to existing features
  • security/Something - Security-related updates or fixes

Licenses

The Deeploans Data Lakehouse ETL Pipeline is licensed under the Apache 2.0 License, so contributions must be compatible with this license. For issues or other information, contact support@algoritmica.ai.

Welcome! We look forward to your contributions!