This project was created as part of my contract work with Lighthouse Labs: Canada's Leading Coding Bootcamp. The material was prepared for a startup located in Toronto, ON, as part of its effort to adopt tools from machine learning and data science. This notebook, among others, is used to onboard new members of the team.
Here are the tools you will learn in order to understand how this project works:
Because we will be running PySpark on Google's servers, there is no need to install or configure the Spark package locally. If you ever do need a local setup, this tutorial (https://medium.com/swlh/pyspark-on-macos-installation-and-use-31f84ca61400) is a good reference.
We need to fetch and store data for future analysis. To store the data, we will create a bucket; the exact name of the bucket is not important. To create a bucket, take the following steps in the GCP Console (a command-line equivalent is sketched after the list):
- Click `Create a bucket`.
- Choose a name.
- Skip the other options.
- Click the `Create` button at the bottom of the list.
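If you prefer the command line, the bucket can also be created with `gsutil` from Cloud Shell or a notebook cell. This is a minimal sketch: the bucket name is a placeholder (bucket names must be globally unique), and `us-central1` is assumed so it matches the region we use later for the cluster.

```
# Create a bucket in us-central1; replace the placeholder name with your own, globally unique one
!gsutil mb -l us-central1 gs://name_of_created_bucket/
```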
Once you create the bucket, we need to get data into it. First, we start with the actual NYC Taxi data (a looped version of these commands is sketched after them):
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-01.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-02.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-03.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-04.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-05.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-06.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-07.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-08.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-09.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-10.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-11.csv
!curl -L https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv | gsutil cp - gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-12.csv
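Typing twelve nearly identical commands works, but the same transfer can be written as a single loop. This is just a convenience sketch, assuming the same source URLs and bucket layout as above; run it in a notebook cell via the `%%bash` magic.

```
%%bash
# Stream each 2019 monthly file from S3 straight into the bucket (nothing is stored locally)
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
  curl -L "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-${month}.csv" \
    | gsutil cp - "gs://name_of_created_bucket/trip_data/yellow_tripdata_2019-${month}.csv"
done
```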
To enrich our dataset, we will add data available as public datasets on GCP BigQuery (a quick way to inspect these tables beforehand is sketched after the commands):
!bq --location=US extract --compression GZIP 'bigquery-public-data:new_york_taxi_trips.tlc_yellow_trips_2019' gs://name_of_created_bucket/nyc_taxi_trips/nyc_taxi_2019-*.csv.gz
!bq --location=US extract --compression GZIP 'bigquery-public-data:new_york_taxi_trips.taxi_zone_geom' gs://name_of_created_bucket/nyc_taxi_trips/nyc_taxi_zones-*.csv.gz
!bq --location=US extract --compression GZIP 'bigquery-public-data:noaa_gsod.stations' gs://name_of_created_bucket/nyc_weather/noaa_stations-*.csv.gz
!bq --location=US extract --compression GZIP 'bigquery-public-data:noaa_gsod.gsod2019' gs://name_of_created_bucket/nyc_weather/noaa_weather-*.csv.gz
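If you want to see which columns a public table will give you before extracting it, `bq show --schema` prints the schema without scanning any data. For example, for the yellow-trips table used above:

```
# Print the schema of the public 2019 yellow-trips table (no query cost)
!bq show --schema --format=prettyjson bigquery-public-data:new_york_taxi_trips.tlc_yellow_trips_2019
```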
The main reason for using GCP Dataproc rather than AWS EMR comes down to billing policies: GCP lets you create and experiment with Hadoop clusters for free, while AWS does not, and the charges incurred while using EMR can skyrocket.
To start with Dataproc, click `Enable` to enable the Dataproc API, and then create a cluster:
- Click `Create Cluster`.
- Choose a name for `Cluster name` (feel free to leave the default). For the region, choose `us-central1`, and for the cluster type, `Standard`.
- For the image version, choose `1.3 (Ubuntu 18.04 LTS, Hadoop 2.9, Spark 2.3)`.
- For `Components`, choose `Enable Component Gateway`, as we will need it to open Jupyter notebooks. For optional components, choose `Jupyter Notebook` and `Anaconda`.
- In the `Customise Cluster` settings, scroll down to `Cloud Storage staging bucket`, click `Browse`, and then `Create new bucket`. You will need this bucket to store Jupyter notebooks.
- Click `Create`. (An equivalent `gcloud` command is sketched after these steps.)
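The same cluster can also be created from the command line. The sketch below is only an approximation of the console choices above: the cluster name `onboarding-cluster` and the staging-bucket name are placeholders, and flag values may need adjusting for your project.

```
# Placeholder names; mirrors the console settings described above
!gcloud dataproc clusters create onboarding-cluster \
    --region=us-central1 \
    --image-version=1.3-ubuntu18 \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --bucket=name_of_staging_bucket
```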
After creating the cluster, click on it, go to `Web Interfaces`, and choose the `Jupyter` option. From there, treat it as a general Jupyter environment, but remember to create a `PySpark` notebook when you start a new one.
Note: It can take some time to enable the API and to create and start Hadoop clusters. Don't panic if nothing happens for 5-10 minutes.
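If you would rather not click through the console, the Component Gateway endpoints (including the Jupyter URL) can also be read from the cluster description. A small sketch, assuming the placeholder cluster name from the `gcloud` example above:

```
# Describe the cluster; the Component Gateway HTTP endpoints (Jupyter among them) appear in the output
!gcloud dataproc clusters describe onboarding-cluster --region=us-central1
```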
While this project was created with a specific goal in mind, anyone from Lighthouse Labs or anywhere else can contribute. Any contributions you make are greatly appreciated and will be reviewed.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/Innovation`)
- Commit your Changes (`git commit -m "Add some innovative feature that Emin didn't do"`)
- Push to the Branch (`git push origin feature/Innovation`)
- Open a Pull Request
After that, I will review it and either merge it or comment back.
Distributed under the MIT License. See LICENSE for more information.
Emin Mammadov - @LinkedIn - emin.e.mammadov@gmail.com