This project generates synthetic data for user profiles, products, and financial transactions. It uses the Faker library to create realistic-looking fake data and DuckDB to query it. You can run it within a Docker container or locally.
Because the data is generated with the relationships between tables in mind, you can run SQL joins across them.
You can also process the generated CSV files with pandas, Polars, Spark, or any other tool.
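To illustrate what generating data "by considering relations" can look like, here is a minimal sketch. It is not the project's actual generator: it uses the standard library's `random` module in place of Faker so it runs without extra dependencies. The column names `user_id`, `product_id`, `product_name`, `category`, `transaction_type`, and `amount` come from the queries in this README; the function name, the `name` and `transaction_id` columns, and the sample values are hypothetical.

```python
import random


def generate_tables(num_users=5, num_products=3, num_transactions=10, seed=42):
    """Build three related tables as lists of dicts (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data

    users = [{"user_id": i, "name": f"user_{i}"} for i in range(1, num_users + 1)]
    products = [
        {
            "product_id": i,
            "product_name": f"product_{i}",
            "category": rng.choice(["books", "games"]),
        }
        for i in range(1, num_products + 1)
    ]

    # Each transaction takes its foreign keys from rows that already exist,
    # which is what makes joins between the tables return matching rows.
    transactions = [
        {
            "transaction_id": i,
            "user_id": rng.choice(users)["user_id"],
            "product_id": rng.choice(products)["product_id"],
            "transaction_type": rng.choice(["purchase", "refund"]),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, num_transactions + 1)
    ]
    return users, products, transactions


users, products, transactions = generate_tables()
user_ids = {u["user_id"] for u in users}
product_ids = {p["product_id"] for p in products}

# Check that every transaction references an existing user and product.
print(all(t["user_id"] in user_ids and t["product_id"] in product_ids
          for t in transactions))
```

Generating the parent tables first and sampling keys from them is one simple way to guarantee referential integrity; the real script may do this differently.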
- Docker Desktop (if using Docker)
- Python 3.9 (if running locally)
- pip (if running locally)
- Make (optional, for simplified commands)
- Build the Docker Image
First, you need to build the Docker image that contains all the necessary dependencies. Run the following command in your terminal:
make build
- Run the Docker Container
Once the image is built, you can run the container with:
make run
Optional: if you are using Docker, run this command to make the CSV files available on your local machine:
make csvs
This generates the fake data and makes the CSV files and the DuckDB database available on your machine.
- Access the DuckDB Database
After running the container, you are inside the DuckDB shell within the container. You can query the database there, or copy the database file to your local machine for further analysis.
You can run any queries you like:
SELECT * FROM user_profiles LIMIT 5;
SELECT DISTINCT product_name, category
FROM products
LEFT JOIN transactions ON products.product_id = transactions.product_id
LIMIT 5;
SELECT DISTINCT product_name, category, user_id, transaction_type, amount
FROM products
LEFT JOIN transactions ON products.product_id = transactions.product_id
LIMIT 5;
To exit the DuckDB shell, use Ctrl+C or Ctrl+D. You can always open DuckDB again by running:
make duckdb
- Clean the container
This command will remove the container, the image and any files that were copied:
make clean
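A note on the `LEFT JOIN` queries shown in the DuckDB step: a left join keeps every product, including products that have no transactions, and fills the transaction columns with NULL for them. The sketch below demonstrates this on a tiny hand-made dataset. It uses Python's built-in `sqlite3` as a stand-in for DuckDB purely to avoid extra dependencies, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only as a dependency-free stand-in for DuckDB.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE products (product_id INTEGER, product_name TEXT, category TEXT);
    CREATE TABLE transactions (
        product_id INTEGER, user_id INTEGER, transaction_type TEXT, amount REAL
    );
    -- 'Novel' deliberately has no transactions.
    INSERT INTO products VALUES (1, 'Laptop', 'Electronics'), (2, 'Novel', 'Books');
    INSERT INTO transactions VALUES (1, 10, 'purchase', 999.0), (1, 11, 'refund', 999.0);
    """
)

rows = con.execute(
    """
    SELECT DISTINCT product_name, category, user_id, transaction_type, amount
    FROM products
    LEFT JOIN transactions ON products.product_id = transactions.product_id
    ORDER BY product_name, user_id
    """
).fetchall()

for row in rows:
    print(row)
# ('Laptop', 'Electronics', 10, 'purchase', 999.0)
# ('Laptop', 'Electronics', 11, 'refund', 999.0)
# ('Novel', 'Books', None, None, None)
```

If you only want products that actually appear in a transaction, an inner `JOIN` would drop the last row.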
- Install the requirements (when running locally, without Docker)
pip install -r requirements.txt
- Generate Data
export NUM_RECORDS=1000  # Adjust the number as needed
python3 generate_data.py
- Access DuckDB Database
duckdb fakedata_duckdb.db
- Clean up files
Run this script to remove the generated .csv and .db files:
python3 clean_files.py
Here's a graphic representation of how the datasets are related to each other.
The script generates the following output files:
- user_profiles.csv: Contains user profile data.
- products.csv: Contains product data.
- transactions.csv: Contains financial transaction data.
These CSV files are then imported into a DuckDB database named fakedata_duckdb.db, with tables corresponding to each CSV file.
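As a rough sketch of how such an import could be done (the project's own script may differ), the example below builds one `CREATE TABLE` statement per CSV file using DuckDB's `read_csv_auto` function and names each table after its file. The helper names `build_import_statements` and `load_into_duckdb` are hypothetical; the second one needs the `duckdb` Python package installed.

```python
from pathlib import Path


def build_import_statements(csv_files):
    """Return one CREATE TABLE statement per CSV, with the table named after the file stem."""
    return [
        f"CREATE OR REPLACE TABLE {Path(f).stem} AS SELECT * FROM read_csv_auto('{f}');"
        for f in csv_files
    ]


def load_into_duckdb(csv_files, db_path="fakedata_duckdb.db"):
    """Run the statements against a DuckDB file (requires `pip install duckdb`)."""
    import duckdb  # imported lazily so the helper above works without the package

    con = duckdb.connect(db_path)
    try:
        for statement in build_import_statements(csv_files):
            con.execute(statement)
    finally:
        con.close()


statements = build_import_statements(
    ["user_profiles.csv", "products.csv", "transactions.csv"]
)
for statement in statements:
    print(statement)
```

Calling `load_into_duckdb([...])` from the directory that holds the CSV files would then create or refresh the three tables in `fakedata_duckdb.db`.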