
Deeploans Data Lakehouse ETL Pipeline

Welcome to the Deeploans Data Lakehouse ETL Pipeline Wiki! This repository hosts the ETL pipeline designed for creating and managing the Deeploans data lakehouse, integrating raw data from external providers (such as ESMA Sec Rep and Quandl) into a structured format for analytics and machine learning.

The Lakehouse Architecture combines the scalability of a data lake with the data management capabilities of a data warehouse, providing a flexible and efficient solution for data storage and processing.


Table of Contents

  1. Architecture Overview
  2. Lakehouse Schema
  3. Data Profiling and Quality Checks
  4. Data Assumptions
  5. Running the ETL Pipeline
  6. Tools and Technologies
  7. Troubleshooting
  8. Contributing

1. Architecture Overview

The Deeploans lakehouse runs on Google Cloud Platform and organizes data into three layers (Bronze, Silver, Gold), each with its own storage location and processing jobs.

Data Location and Infrastructure

  • Storage: Raw data is securely stored in a Google Cloud Storage (GCS) bucket.
  • Processing: Data transformations are managed using GCP Dataproc Serverless and Google Cloud Composer, ensuring efficient automated processing across different layers of the lakehouse.

Architecture Layers

  1. Bronze Layer: Stores minimally processed raw data, with essential profiling checks.
  2. Silver Layer: Cleans and normalizes data, separating dimensions for efficient querying.
  3. Gold Layer: Prepares data for business metrics and ML model features.
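
As a concrete illustration of how data enters the lakehouse, the sketch below shows a minimal PySpark job that copies raw provider files from a raw GCS prefix into the Bronze layer. The bucket name, prefixes, and file format are illustrative assumptions, not the project's actual layout.

    # Minimal sketch of a Bronze ingestion step on Dataproc Serverless.
    # Assumptions: raw CSVs land under a raw/ prefix and each layer has its own
    # prefix in the same GCS bucket; the bucket name below is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deeploans-bronze-ingest").getOrCreate()

    BUCKET = "gs://deeploans-lakehouse"             # hypothetical bucket
    raw_path = f"{BUCKET}/raw/esma_sec_rep/"        # raw data from the provider
    bronze_path = f"{BUCKET}/bronze/esma_sec_rep/"  # Bronze layer destination

    # One-to-one copy of the raw data into the Bronze layer, stored as Parquet.
    raw_df = spark.read.option("header", True).csv(raw_path)
    raw_df.write.mode("append").parquet(bronze_path)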

2. Lakehouse Schema

Each layer of the data lakehouse has a distinct purpose and its own set of transformations, detailed below.

Bronze Layer

  • Objective: Store a one-to-one copy of the raw data, with essential data profiling checks.
  • Transformations: Minimal; limited to data profiling and SCD Type 2 history tracking (see the sketch below).
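
A common way to implement SCD Type 2 tracking is to stamp each incoming record with validity metadata. The helper below is a hedged sketch of that idea; the actual column names used by the pipeline may differ.

    # Illustrative SCD Type 2 bookkeeping, assuming history is tracked with
    # valid_from / valid_to / is_current columns (assumed names).
    from pyspark.sql import DataFrame, functions as F

    def add_scd2_columns(df: DataFrame) -> DataFrame:
        """Stamp incoming Bronze records with SCD Type 2 metadata."""
        return (
            df.withColumn("valid_from", F.current_timestamp())
              .withColumn("valid_to", F.lit(None).cast("timestamp"))
              .withColumn("is_current", F.lit(True))
        )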

Silver Layer

  • Objective: Store cleaned and normalized data, with dimensions separated for BI and ML.
  • Transformations: Dimensional normalization to enhance querying and feature engineering.

Gold Layer

  • Objective: Prepare data for KPIs and ML models.
  • Transformations: Final refinement steps for business metrics and machine learning features (an illustrative aggregation is sketched below).
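
To make the Gold layer's role concrete, the sketch below aggregates a Silver assets table into simple portfolio-level metrics. The metric, column names, and paths are assumptions for illustration only, not the project's actual KPI definitions.

    # Illustrative Gold-layer aggregation from Silver data.
    # Assumptions: a Silver "assets" table keyed by dl_code with a
    # current_balance column; paths and metric names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("deeploans-gold-kpis").getOrCreate()

    silver_assets = spark.read.parquet("gs://deeploans-lakehouse/silver/assets/")

    portfolio_kpis = (
        silver_assets.groupBy("dl_code")
        .agg(
            F.count("*").alias("n_loans"),
            F.sum("current_balance").alias("total_outstanding_balance"),
        )
    )
    portfolio_kpis.write.mode("overwrite").parquet(
        "gs://deeploans-lakehouse/gold/portfolio_kpis/"
    )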

3. Data Profiling and Quality Checks

Profiling checks run at each layer to maintain data quality before data advances to the next stage.

Bronze-Level Profiling

Initial checks for raw data:

  • Primary Key Integrity: Ensures uniqueness and completeness.
  • Table & Column Integrity: Ensures non-null values in required fields.
  • Essential Column Checks: Confirms presence of required columns.
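
The checks above could be expressed as simple PySpark assertions, as in the hedged sketch below; the real pipeline may collect and report failures differently.

    # Minimal sketch of the Bronze-level checks listed above.
    from pyspark.sql import DataFrame, functions as F

    def profile_bronze(df: DataFrame, key_cols: list, required_cols: list) -> None:
        # Essential column check: required columns must be present.
        missing = [c for c in required_cols if c not in df.columns]
        assert not missing, f"Missing required columns: {missing}"

        # Table & column integrity: required fields must be non-null.
        for col in required_cols:
            nulls = df.filter(F.col(col).isNull()).count()
            assert nulls == 0, f"{nulls} null values in column {col}"

        # Primary key integrity: the composite key must be unique.
        dupes = df.groupBy(*key_cols).count().filter(F.col("count") > 1).count()
        assert dupes == 0, f"{dupes} duplicated primary key values"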

Silver-Level Profiling

Additional checks based on data type and asset class:

  • Validations tailored to each asset class.
  • Ensures normalized data quality before advancing to the Gold layer.
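
One way to tailor validations to each asset class is a small rule catalogue keyed by asset class, as sketched below; the specific rules and column names are illustrative assumptions.

    # Hedged example of asset-class-specific Silver checks.
    from pyspark.sql import DataFrame, functions as F

    SILVER_RULES = {
        # asset class -> boolean validation expressions (assumed rules)
        "assets": [F.col("current_balance") >= 0],
        "collateral": [F.col("CS1").isNotNull()],
    }

    def count_silver_violations(df: DataFrame, asset_class: str) -> int:
        """Return the number of rows violating any rule for this asset class."""
        violations = 0
        for rule in SILVER_RULES.get(asset_class, []):
            violations += df.filter(~rule).count()
        return violations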

4. Data Assumptions

The following assumptions underpin the data structure and keep record keys consistent across tables.

Primary Key Definitions

  • Assets: dl_code + AS3
  • Collateral: dl_code + CS1
  • Bond Information: dl_code + BS1 + BS2
  • Amortization: dl_code + AS3
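
For reference, these composite keys can be expressed as a single lookup table that a profiling job consumes; the table identifiers below are illustrative.

    # Composite primary keys per table, as listed above.
    PRIMARY_KEYS = {
        "assets": ["dl_code", "AS3"],
        "collateral": ["dl_code", "CS1"],
        "bond_information": ["dl_code", "BS1", "BS2"],
        "amortization": ["dl_code", "AS3"],
    }

A profiling job can then pass PRIMARY_KEYS[table] as the key columns of a uniqueness check like the one sketched in section 3.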

5. Running the ETL Pipeline

Follow the steps below to set up, build, and run the ETL pipeline.

Setup and Configuration

  1. Clone the Repository: Clone this repository locally and ensure the gcloud CLI is installed and configured for your GCP project.
  2. Edit Configuration: Modify the Makefile if using a different GCS bucket.

Build and Deploy

  1. Run the setup commands:
    make setup && make build
  2. Upload DAG to Google Cloud Composer:
    • Stage 1: Run the DAG for Bronze-level profiling. Set max_active_tasks to a value greater than 1 to enable parallel execution (see the DAG sketch below).
    • Stage 2: Modify the DAG for Silver-level processing. Set max_active_tasks to 1 to avoid conflicts.
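
The sketch below shows where max_active_tasks fits in a DAG definition, assuming a recent Airflow 2.x environment on Cloud Composer; the DAG ID, task IDs, and the value 8 are placeholders, and EmptyOperator stands in for the actual Dataproc Serverless submission tasks.

    # Hedged sketch of the DAG-level setting described above (Airflow 2.4+ syntax).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="deeploans_bronze_profiling",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        max_active_tasks=8,  # Stage 1: allow per-table profiling tasks in parallel
        catchup=False,
    ) as dag:
        profile_assets = EmptyOperator(task_id="profile_assets")
        profile_collateral = EmptyOperator(task_id="profile_collateral")

    # For Stage 2 (Silver-level processing), set max_active_tasks=1 so tasks run
    # sequentially and avoid conflicts.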

6. Tools and Technologies

The project relies on the following Google Cloud tools and services.

  • GCP Dataproc Serverless: Handles data processing and transformation in a scalable, serverless environment.
  • Google Cloud Composer: Orchestrates ETL tasks and workflows with Airflow.
  • Looker Studio: For data visualization and business intelligence dashboards.
  • Dataflow: Processes data for machine learning feature engineering and transformation.

7. Troubleshooting

Common issues and their solutions are listed below.

Issue: GCP Permissions Error

  • Solution: Ensure your GCP IAM roles have the correct permissions for Dataproc, GCS, and Cloud Composer.

Issue: Data Integrity Check Failed in Bronze Layer

  • Solution: Verify raw data files for required columns and re-upload if necessary.

8. Contributing

Thank you for your interest in contributing to the Deeploans Data Lakehouse ETL Pipeline! We welcome improvements, bug fixes, and new features that help make this project even better.

Pull Requests

If you're submitting a pull request, please make sure to read and agree to our Contributor License Agreement (CLA). By submitting a contribution, you indicate that you agree to the terms of the CLA.

Git Commit Messages

To maintain a clean and informative commit history, please follow these naming conventions for commit messages:

  • bugfix/Something - Fixes for bugs in the codebase
  • feature/Something - New features or major changes
  • docfix/Something - Documentation fixes
  • refactor/Something - Code refactoring and improvements
  • performance/Something - Performance optimizations
  • test/Something - New or updated tests
  • enhancement/Something - Enhancements to existing features
  • security/Something - Security-related updates or fixes

Licenses

The Deeploans Data Lakehouse ETL Pipeline is licensed under the Apache 2.0 License, so contributions must be compatible with this license. For issues or other information, contact support@algoritmica.ai.

Welcome! We look forward to your contributions!