The main task of this project is to design an ETL pipeline that builds a Cloud Data Warehouse for a fictional company called Sparkify, a music streaming platform similar to Spotify or Pandora. Sparkify has logged user behaviour on its website, and the log data and JSON metadata about the songs reside in an Amazon S3 bucket.
Here I designed an OLAP data model and an ETL pipeline that build the Data Warehouse on Amazon Redshift, so that the analytics team can conveniently wrangle the data.
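At a high level, the pipeline boils down to two steps: a bulk COPY from S3 into staging tables, followed by SQL transforms from staging into the analytics tables. The sketch below shows that shape in Python with psycopg2; the cluster endpoint, credentials, bucket name, IAM role, and staging table are all placeholders, not the project's actual configuration.

```python
import psycopg2

# Placeholder cluster endpoint and credentials -- substitute your own values.
conn = psycopg2.connect(
    host="redshift-cluster.example.us-west-2.redshift.amazonaws.com",
    dbname="sparkify",
    user="awsuser",
    password="<password>",
    port=5439,
)
cur = conn.cursor()

# Step 1: bulk-load raw JSON events from S3 into a staging table
# (assumes staging_events was created with columns matching the JSON keys).
cur.execute("""
    COPY staging_events
    FROM 's3://my-sparkify-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    JSON 'auto'
    REGION 'us-west-2';
""")

# Step 2: transform staged rows into the star schema (example: users dimension).
cur.execute("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
""")

conn.commit()
conn.close()
```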
This dataset consists of logs on user behaviour in JSON format. It is partitioned by year and month. The following are example filepaths to two files in the dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
This data was generated by an event simulator based on the songs in the Song Dataset discussed in the next section; it simulates activity logs from a music streaming app according to specified configurations.
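To eyeball one of these files locally, note that each daily log is (by my reading of the format) newline-delimited JSON, with one event object per line, so pandas can load it directly:

```python
import pandas as pd

# Each daily log file holds one JSON event per line (newline-delimited JSON).
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Inspect the raw event fields the ETL will extract from.
print(df.columns.tolist())
print(df.head())
```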
This is a subset of real data taken from the Million Song Dataset. Each file in the dataset contains metadata about an individual song and its artist. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
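A song file, in contrast, appears to hold a single JSON object describing one song, so it reads naturally as a one-record series (again an assumption about the layout, not a documented guarantee):

```python
import pandas as pd

# A song file contains a single JSON object: metadata for one song and its artist.
record = pd.read_json("song_data/A/B/C/TRABCEI128F424C983.json", typ="series")
print(record)
```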
The following is the star schema for the fact and dimension tables of the Data Warehouse.
songplays - records in log data associated with song plays
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users in the app
- user_id, first_name, last_name, gender, level
songs - songs in music database
- song_id, title, artist_id, year, duration
artists - artists in music database
- artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
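The project's full DDL is not reproduced here, but as a minimal sketch of how the songplays fact table might be declared on Redshift (the IDENTITY key, DISTKEY, and SORTKEY choices below are my assumptions, not confirmed design decisions):

```python
# Hypothetical Redshift DDL for the fact table; key choices are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
    start_time   TIMESTAMP NOT NULL SORTKEY,  -- analytics queries filter on time
    user_id      INT NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR DISTKEY,             -- co-locate with the songs dimension
    artist_id    VARCHAR,
    session_id   INT,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```

Distributing on song_id keeps fact rows on the same slices as the matching songs rows for joins, while sorting on start_time speeds up the time-range scans the analysis team is likely to run.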