📊🤖 Fake Data Generator for your demo projects

🚀 Introduction

This project generates synthetic data for user profiles, products, and financial transactions. It uses the Faker library to create realistic-looking fake data and DuckDB to query it. You can run it within a Docker container or locally.

The generated tables share keys and are related to each other, so you can run SQL joins across them.

You can also process the generated CSV files with pandas, Polars, or Spark.
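As an illustration of working with the CSV output in pandas, here is a minimal sketch of the same left join used in the SQL examples in this README. The column names (product_id, product_name, category, user_id, transaction_type, amount) are taken from those queries; the helper name products_with_transactions and the sample rows are hypothetical and only stand in for the real files.

```python
import pandas as pd


def products_with_transactions(products: pd.DataFrame,
                               transactions: pd.DataFrame) -> pd.DataFrame:
    """Left-join products to transactions on product_id.

    Products without any transaction are kept, with missing values
    in the transaction columns.
    """
    return products.merge(transactions, on="product_id", how="left")


# Hand-written sample rows (assumption: the real data comes from
# products.csv and transactions.csv and has at least these columns).
products = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Tools", "Toys"],
})
transactions = pd.DataFrame({
    "product_id": [1, 1],
    "user_id": [10, 11],
    "transaction_type": ["purchase", "refund"],
    "amount": [9.99, 9.99],
})

joined = products_with_transactions(products, transactions)
print(joined)
```

With the generated files, the two frames would instead be loaded with pd.read_csv("products.csv") and pd.read_csv("transactions.csv").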

📝 Prerequisites

Docker Desktop (if using Docker)
Python 3.9 (if running locally)
pip (if running locally)
Make (optional, for simplified commands)

Setup and Usage

📦 Using Docker

Build the Docker Image

First, you need to build the Docker image that contains all the necessary dependencies. Run the following command in your terminal:

make build

Run the Docker Container

Once the image is built, you can run the container with:

make run

Optional: if you are using Docker, run this command to make the CSV files available on your local machine:

make csvs

This generates the fake data and makes the CSV files and the DuckDB database available on your machine.

Access the DuckDB Database

After running the container, you are inside the DuckDB shell within the container. Alternatively, you can copy the database file to your local machine for further analysis.

You can run any queries you like:

SELECT * FROM user_profiles LIMIT 5;

SELECT DISTINCT product_name, category
FROM products
LEFT JOIN transactions ON products.product_id = transactions.product_id
LIMIT 5;

SELECT DISTINCT product_name, category, user_id, transaction_type, amount
FROM products
LEFT JOIN transactions ON products.product_id = transactions.product_id
LIMIT 5;

To exit the DuckDB shell, use Ctrl+C or Ctrl+D. You can always open DuckDB again by running:

make duckdb

Clean the container

This command will remove the container, the image and any files that were copied:

make clean

🖥️ Running Locally

Install requirements.txt

pip install -r requirements.txt

Generate Data

export NUM_RECORDS=1000  # Adjust the number as needed
python3 generate_data.py

Access DuckDB Database

duckdb fakedata_duckdb.db

Clean up files

You can run this script to clean up the generated .csv and .db files:

python3 clean_files.py

🔃 Data Modeling

Here's a graphic representation of how the datasets are related to each other.
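To make the relations concrete, here is a minimal sketch of generating three tables whose keys line up, so that every transaction points at an existing user and an existing product. It is not the project's generate_data.py: it uses the standard library random module in place of Faker so that it runs without dependencies. The function name generate_tables, the transaction_id column, and the category values are hypothetical, and it assumes that transactions.user_id refers to user_profiles.user_id in the same way that transactions.product_id refers to products.product_id.

```python
import random


def generate_tables(num_users, num_products, num_transactions, seed=0):
    """Return (users, products, transactions) as lists of dicts.

    Transactions only reference user_id and product_id values that
    exist in the other two tables, so joins never hit a dangling key.
    """
    rng = random.Random(seed)  # seeded for reproducible output

    users = [{"user_id": i} for i in range(1, num_users + 1)]

    products = [
        {
            "product_id": i,
            # Hypothetical category values, for illustration only
            "category": rng.choice(["Books", "Games", "Tools"]),
        }
        for i in range(1, num_products + 1)
    ]

    transactions = [
        {
            "transaction_id": i,  # hypothetical column
            # Foreign keys are drawn from the rows generated above
            "user_id": rng.choice(users)["user_id"],
            "product_id": rng.choice(products)["product_id"],
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, num_transactions + 1)
    ]
    return users, products, transactions


users, products, transactions = generate_tables(5, 3, 20)
print(transactions[0])
```

Generating the parent tables first and sampling keys from them is what keeps the joins meaningful; a real generator would fill the remaining columns with Faker providers.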

📁 Output Format

The script generates the following output files:

user_profiles.csv: Contains user profile data.
products.csv: Contains product data.
transactions.csv: Contains financial transaction data.

These CSV files are then imported into a DuckDB database named fakedata_duckdb.db, with tables corresponding to each CSV file.
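The import step can be pictured with the following minimal sketch, which builds one CREATE TABLE statement per CSV file and names each table after the file. This is an assumption about how such an import could be done, not the project's actual code: the helper names import_statements and load_into_duckdb are hypothetical, and the statements rely on DuckDB's read_csv_auto function. The duckdb Python package is imported only inside load_into_duckdb, so the statement builder runs without it.

```python
from pathlib import Path

# File names as listed in the output section of this README
CSV_FILES = ["user_profiles.csv", "products.csv", "transactions.csv"]


def import_statements(csv_files):
    """Build one CREATE TABLE statement per CSV file.

    Each table is named after the file stem, for example
    products.csv becomes the table products.
    """
    statements = []
    for csv_file in csv_files:
        table = Path(csv_file).stem
        escaped = str(csv_file).replace("'", "''")  # escape quotes for SQL
        statements.append(
            f"CREATE OR REPLACE TABLE {table} AS "
            f"SELECT * FROM read_csv_auto('{escaped}')"
        )
    return statements


def load_into_duckdb(db_path, csv_files):
    """Run the import statements against a DuckDB database file."""
    import duckdb  # third-party dependency: pip install duckdb

    con = duckdb.connect(db_path)
    try:
        for statement in import_statements(csv_files):
            con.execute(statement)
    finally:
        con.close()


for statement in import_statements(CSV_FILES):
    print(statement)
```

Calling load_into_duckdb("fakedata_duckdb.db", CSV_FILES) from the directory that contains the CSV files would create or overwrite the three tables in that database file.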

😎 Follow me on LinkedIn

Get tips, learnings and tricks for your Data career!

📩 Subscribe to The Pipe & The Line

Join the Substack newsletter to get similar content to this one and more to improve your Data career!
