The main task of this project is to design an ETL pipeline that builds a Cloud Data Warehouse for a fictional company called Sparkify, a music streaming platform similar to Spotify or Pandora. Sparkify has logged user behaviour on its website, and the log data and JSON metadata about the songs reside in an Amazon S3 bucket.
Here I designed an OLAP data model and an ETL pipeline that build the Data Warehouse on Amazon Redshift, so that the analytics team can conveniently wrangle the data.
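At a high level, the pipeline boils down to two steps: a bulk COPY from S3 into staging tables, followed by SQL transforms from staging into the analytics tables. The sketch below shows that shape in Python with psycopg2; the cluster endpoint, credentials, bucket name, IAM role, and staging table are all placeholders, not the project's actual configuration.

```python
import psycopg2

# Placeholder cluster endpoint and credentials -- substitute your own values.
conn = psycopg2.connect(
    host="redshift-cluster.example.us-west-2.redshift.amazonaws.com",
    dbname="sparkify",
    user="awsuser",
    password="<password>",
    port=5439,
)
cur = conn.cursor()

# Step 1: bulk-load raw JSON events from S3 into a staging table
# (assumes staging_events was created with columns matching the JSON keys).
cur.execute("""
    COPY staging_events
    FROM 's3://my-sparkify-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    JSON 'auto'
    REGION 'us-west-2';
""")

# Step 2: transform staged rows into the star schema (example: users dimension).
cur.execute("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
""")

conn.commit()
conn.close()
```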
This dataset consists of logs on user behaviour in JSON format. It is partitioned by year and month. The following are example filepaths to two files in the dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
This data was generated by an event simulator based on the songs in the Song Dataset discussed in the next section; it simulates activity logs from a music streaming app according to specified configurations.
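To eyeball one of these files locally, note that each daily log is (by my reading of the format) newline-delimited JSON, with one event object per line, so pandas can load it directly:

```python
import pandas as pd

# Each daily log file holds one JSON event per line (newline-delimited JSON).
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Inspect the raw event fields the ETL will extract from.
print(df.columns.tolist())
print(df.head())
```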
This is a subset of real data taken from the Million Song Dataset. Each file in the dataset contains metadata about an individual song and its artist. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
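A song file, in contrast, appears to hold a single JSON object describing one song, so it reads naturally as a one-record series (again an assumption about the layout, not a documented guarantee):

```python
import pandas as pd

# A song file contains a single JSON object: metadata for one song and its artist.
record = pd.read_json("song_data/A/B/C/TRABCEI128F424C983.json", typ="series")
print(record)
```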
The following is the star schema for the fact and dimension tables of the Data Warehouse.
songplays - records in log data associated with song plays
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users in the app
- user_id, first_name, last_name, gender, level
songs - songs in music database
- song_id, title, artist_id, year, duration
artists - artists in music database
- artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
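The project's full DDL is not reproduced here, but as a minimal sketch of how the songplays fact table might be declared on Redshift (the IDENTITY key, DISTKEY, and SORTKEY choices below are my assumptions, not confirmed design decisions):

```python
# Hypothetical Redshift DDL for the fact table; key choices are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
    start_time   TIMESTAMP NOT NULL SORTKEY,  -- analytics queries filter on time
    user_id      INT NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR DISTKEY,             -- co-locate with the songs dimension
    artist_id    VARCHAR,
    session_id   INT,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```

Distributing on song_id keeps fact rows on the same slices as the matching songs rows for joins, while sorting on start_time speeds up the time-range scans the analysis team is likely to run.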