Exploratory analysis of the dataset that contains historical time series data of public bike sharing system in Warsaw.

piekarsky/Exploratory-Data-Analysis-of-Bike-Sharing-Systems-in-Warsaw

Exploratory Data Analysis of Bike Sharing Systems in Warsaw

Overview

Exploratory analysis of a dataset containing historical time series data from the public bike sharing system in Warsaw. The dataset was built from 4184 JSON files, saved every 10 minutes, each describing the state of every bike station in Warsaw (e.g. the number of bikes at the station, the number of racks and the number of free racks). The project explores patterns in bike routes using a clustering algorithm and shows, among other things, how the number of bike rentals relates to the weather and the day of the week. The notebook covers every stage, from data preparation (cleaning, checking for outliers) to analysis, visualization and pattern discovery.


About the dataset

Main information about the data:

  • 4184 JSON files, saved every 10 minutes between 03/04/2018 - 04/04/2018
  • each JSON file contains information about 355 bike stations in Warsaw on average

The most important attributes included in this dataset used in the analysis and visualization:

  • uid - the ID of the bike station
  • bikes - the number of bikes at the station
  • bike_racks - the number of racks at the station
  • free_racks - the number of free positions at the station
  • bike_numbers - numbers of bikes docked at the station
  • lat, lng - the coordinates (latitude and longitude) of the bike station

Preparing data for analysis

Data preprocessing

After loading and preprocessing the data (removing redundant columns, changing data types and splitting the JSON file name into year, month, day, hour and minutes), the basic dataframe was extended with, among other things, the temperature and the amount of rainfall in particular time intervals. The analyzed, grouped dataframe is presented below. The columns day_of_the_week, city_code and date_normalize were created for analysis purposes.
The analyzed dataframe with multiple indexes is shown below:

The first part of the notebook contains the analysis and visualization of bike routes (based on the numbers of bikes attached to each bike station in a given period of time). This collection contains 19 783 745 observations. The second part of the notebook contains the analysis of bike rentals. This collection contains 1 472 021 observations.
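The file-name split mentioned in the preprocessing step can be sketched as follows. The source does not state how the JSON files are named, so the `YYYYMMDD_HHMM.json` pattern and the helper `parse_snapshot_name` are assumptions made purely for illustration; the weekday is returned as an integer (0 = Monday), which may differ from how the notebook stores day_of_the_week.

```python
from datetime import datetime
from pathlib import Path


def parse_snapshot_name(file_name: str) -> dict:
    """Split a snapshot file name into date and time parts.

    Assumes a hypothetical naming pattern 'YYYYMMDD_HHMM.json';
    the real files may be named differently.
    """
    stamp = datetime.strptime(Path(file_name).stem, "%Y%m%d_%H%M")
    return {
        "year": stamp.year,
        "month": stamp.month,
        "day": stamp.day,
        "hour": stamp.hour,
        "minutes": stamp.minute,
        "day_of_the_week": stamp.weekday(),  # 0 = Monday ... 6 = Sunday
    }


parts = parse_snapshot_name("20180314_0700.json")
```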

Checking for missing values

The analysis began by checking whether the data for each station contains enough information about the number of bikes in each time period. As seen below, missing values were noted for 29 stations (at most 4184 values could be recorded per station, because that is how many JSON files make up the dataset).

For the following stations, missing values for the number of bikes were replaced with the values from the previous time interval, because they make up only a small share of the entire dataset:

  • Czerniakowska - Gagarina
  • Marszałkowska - al. Solidarności
  • Wołoska - Odyńca
  • al. Jana Pawła II - Grzybowska

The remaining stations from the dataframe above were removed from the analysis because of the large amount of missing information, e.g. for the Fieldorf - Bukowski station 1092 NaN / 4184 = 26%.
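A minimal sketch of this step with pandas is shown below. The data, the column names and the `MAX_MISSING_SHARE` cut-off are illustrative assumptions; the source does not say which threshold separated the stations that were filled from those that were removed.

```python
import numpy as np
import pandas as pd

# Illustrative data: rows are 10-minute snapshots, columns are stations.
bikes = pd.DataFrame(
    {
        "station_a": [5, np.nan, 4, 4, 3, 3, 2, 2],
        "station_b": [np.nan, np.nan, np.nan, 7, 7, 6, np.nan, 6],
    },
    index=pd.date_range("2018-03-04 00:00", periods=8, freq="10min"),
)

MAX_MISSING_SHARE = 0.20  # hypothetical cut-off, not taken from the notebook

# Share of missing snapshots per station.
missing_share = bikes.isna().mean()

# Drop stations with too many gaps, then carry the last known value forward.
kept = bikes.loc[:, missing_share <= MAX_MISSING_SHARE].ffill()
```

Forward filling assumes the number of bikes did not change during the gap, which is only reasonable when the gaps are short.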

Checking outliers

The occurrence of outliers for particular days was checked using a box plot. It shows that one station on March 14 and March 25 recorded a much larger number of bike rentals compared to all other stations on that day.

Due to the large variety of locations of bike stations and the fact that they can be very popular in the event of major sports or music events, these values do not have to mean a data collection error. The popularity of the station was assessed on the basis of the median, which is not sensitive to outliers, so extreme points were not removed.


Using a box plot, it is also possible to check for outliers at particular hours of each day, based on the total number of bikes rented from all stations. The chart shows far more rentals than usual at 2 a.m. and 7 a.m. on single days; at those hours, values more than five times the median are certainly unrealistic.
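The "more than five times the median" rule of thumb can be sketched as a simple filter. The numbers below are made up and the column names are assumptions; the notebook itself identified the suspicious hours from the box plot rather than necessarily by such a rule.

```python
import pandas as pd

# Illustrative hourly totals of rentals summed over all stations.
hourly = pd.DataFrame(
    {
        "day": ["2018-03-26"] * 2 + ["2018-03-27"] * 2 + ["2018-03-28"] * 2,
        "hour": [2, 7, 2, 7, 2, 7],
        "rentals": [4, 120, 60, 130, 5, 110],
    }
)

FACTOR = 5  # "over five times higher than the median", as in the text

# Median number of rentals for the same hour across all days.
hour_median = hourly.groupby("hour")["rentals"].transform("median")

# Flag hours whose total exceeds FACTOR times the typical value for that hour.
hourly["suspicious"] = hourly["rentals"] > FACTOR * hour_median
```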

Analyzing the data in terms of the largest number of bikes rented from all time periods, it can be seen that for many stations on March 27 at 2:30 and March 14 at 7:00 there were above-average numbers of bikes rented.

In view of the above, the data on the number of bikes at stations from 2018-03-27 02:30, 2018-03-27 02:40 and 2018-03-14 07:00 have been deleted. After adding up the number of bikes rented at all stations throughout the dataset period, it can be seen that there are stations from which no bike has left in the considered time. These stations, in the context of the popularity rating, were not taken into account and were removed from the dataset.

Exploratory data analysis

Interactive grouping of bike stations using the Folium library

The analysis and visualization of this data uses the Folium library to create a map with interactive markers that automatically group stations on the map. Markers are grouped together if their locations are close enough to each other.

The picture below shows the map of Warsaw with the location of 380 bike stations using interactive grouping.


The map shows the station names along with the number of bike stands at each station. This information becomes visible after zooming in on the map and hovering the cursor over the selected marker.

Analysis of the popularity of bike routes

To analyze popular bike routes, the main dataset was transformed into a dataset of the bike numbers docked at each station. This set, which records which bikes are docked at each station in each time interval, comprises 19 783 745 observations. In the analyzed period, information on 5249 bikes was recorded. The analyzed dataframe is presented below.
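One way to turn docked bike numbers into routes is to track where each bike was last seen and count a trip whenever it reappears at a different station. The sketch below works under that assumption; the snapshot structure and the `count_routes` helper are hypothetical, and the notebook may derive routes differently (for example, it may treat bikes relocated by service vehicles separately).

```python
from collections import Counter

# Illustrative snapshots: (time, {station uid: set of docked bike numbers}).
snapshots = [
    ("2018-03-04 08:00", {"A": {"101", "102"}, "B": {"201"}}),
    ("2018-03-04 08:10", {"A": {"102"}, "B": {"201"}}),         # bike 101 left A
    ("2018-03-04 08:20", {"A": {"102"}, "B": {"201", "101"}}),  # bike 101 docked at B
    ("2018-03-04 08:30", {"A": {"102", "201"}, "B": {"101"}}),  # bike 201 moved B -> A
]


def count_routes(snapshots):
    """Count (origin, destination) pairs from consecutive sightings of each bike."""
    last_station = {}  # bike number -> station where it was last docked
    routes = Counter()
    for _, stations in snapshots:
        for uid, bike_numbers in stations.items():
            for bike in bike_numbers:
                previous = last_station.get(bike)
                if previous is not None and previous != uid:
                    routes[(previous, uid)] += 1
                last_station[bike] = uid
    return routes


routes = count_routes(snapshots)
```

Filtering `routes` by a minimum count then gives the most frequently travelled routes.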

By using the Folium library, the most popular routes can also be shown. Those that were counted at least 50 times over the analyzed period are presented on the map below.

Exploration of patterns of bike routes using clustering algorithm

Routes can be represented as a graph. Such a graph with distinguished clusters is presented below. The clustering method used was the Louvain algorithm, a hierarchical community detection method that groups nodes by maximizing modularity.

The composition of three exemplary clusters is presented below.

The vast majority of stations in a particular cluster are stations located in one region, so it can be concluded that people most often move between stations located in close proximity to each other. This relationship can also be seen on the heatmap below.

It shows that most people cycle between stations not far from each other (stations with similar ID numbers).
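The clustering step can be sketched with NetworkX, whose `louvain_communities` function implements the Louvain method. This is an illustration under assumptions: the source does not say which library the notebook uses, the graph here is undirected, and the station labels and trip counts are made up.

```python
import networkx as nx

# Stations are nodes; edge weights are illustrative trip counts between them.
G = nx.Graph()
G.add_weighted_edges_from(
    [
        ("A", "B", 40), ("B", "C", 35), ("A", "C", 30),  # one busy neighbourhood
        ("D", "E", 45), ("E", "F", 25), ("D", "F", 30),  # another busy neighbourhood
        ("C", "D", 2),                                    # few trips between the two
    ]
)

# Louvain community detection; the seed makes the result reproducible.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
```

Stations that exchange many bikes end up in the same community, which matches the observation that clusters mostly contain stations from one region.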

The table below presents the most popular bike routes (the count column indicates the number of bike trips from station A to station B in a given period).

The picture below shows the map of Warsaw with the 15 most popular bike routes marked.

The most popular routes can also be presented as a graph.

A fragment of the map of Warsaw illustrating the most popular bike route, Stefana Banacha - UW <-> al. Niepodległości - Batory, is presented below.

Analysis of the length of bike rentals

The histogram below illustrates how long bikes are typically rented. Unfortunately, because of the coarse time resolution of the data (snapshots were collected every 10 minutes), the histogram carries a large error: a trip that lasted 12 minutes can be recorded in the same way as a 28-minute ride. For example, a bike rented at 2:39 and returned at 2:51 is classified as a 30-minute ride, and so is a bike rented at 2:31 and returned at 2:59. The 30-minute bar therefore does not necessarily reflect the actual rental time; with finer-grained data it could be a 20-minute value. The length of a rental matters because the first 20 minutes of a bike rental are free.
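The rounding effect in the example above can be reproduced with a short calculation. The sketch assumes that snapshots fall on exact multiples of 10 minutes and that the observed duration runs from the last snapshot in which the bike was still docked to the first snapshot in which it is docked again; the helper `observed_duration` and the calendar date are illustrative.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=10)  # snapshot interval stated in the text


def observed_duration(rented_at, returned_at, step=STEP):
    """Rental duration as it appears in snapshots taken every `step`."""
    midnight = rented_at.replace(hour=0, minute=0, second=0, microsecond=0)
    # Last snapshot at or before the rental (bike still docked): round down.
    last_seen = midnight + ((rented_at - midnight) // step) * step
    # First snapshot at or after the return (bike docked again): round up.
    seen_again = midnight + (-((midnight - returned_at) // step)) * step
    return seen_again - last_seen


short_trip = observed_duration(datetime(2018, 3, 4, 2, 39), datetime(2018, 3, 4, 2, 51))
long_trip = observed_duration(datetime(2018, 3, 4, 2, 31), datetime(2018, 3, 4, 2, 59))
```

Both the 12-minute and the 28-minute trip come out as 30 minutes, which is why the histogram cannot separate rentals within the free 20-minute window from longer ones.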

Analysis of bike rentals

The notebook includes many charts, including ones showing how the number of bike rentals depends on temperature and on the day of the week.
The chart below shows the relationship between the number of bike rentals and temperature.

This relationship can also be seen in the scatter plot of number of bike rentals depending on temperature.

The number of bikes rented per hour on each day is as follows:

The table of correlation values between the number of bikes rented and temperature or total rainfall during the day is presented below.

The Pearson correlation coefficient between the number of bikes rented and temperature is 0.67, which indicates a fairly strong positive correlation. The coefficient between the number of bikes rented and total rainfall is -0.41, a moderate negative correlation.
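For reference, the Pearson coefficient can be computed directly from its definition. The daily values below are invented to show the sign of each relationship and do not reproduce the 0.67 and -0.41 reported above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Illustrative daily values (made up).
temperature = [2.0, 5.0, 8.0, 11.0, 14.0]     # degrees Celsius
rainfall = [6.0, 0.0, 3.0, 1.0, 0.0]          # total rainfall per day
rentals = [900, 2100, 2400, 3600, 4100]       # bikes rented per day

r_temperature = pearson(temperature, rentals)  # positive: warmer days, more rentals
r_rainfall = pearson(rainfall, rentals)        # negative: wetter days, fewer rentals
```

With pandas, `Series.corr` returns the same quantity (Pearson is its default method).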

The graph below shows the number of bike rentals depending on the hour on each day of the week.

The table below presents a detailed average number of bikes rented at specific hours of each day of the week.

The graph and table above show that the largest number of bike rentals is recorded on working days in the afternoon, between 4 and 5 p.m., and that higher than usual values are also visible in the morning, between 7 and 8 a.m. It can therefore be concluded that city bikes are a popular means of transport for commuting to and from work or school, or for getting to a subway station. The graph shows that city bikes are also popular on weekends, with great interest in the afternoon (especially on Sundays).
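A table of average rentals per hour and day of the week can be built with a pivot table. This is a sketch with made-up numbers; the `timestamp` and `rentals` column names are assumptions about how the notebook's data is laid out.

```python
import pandas as pd

# Illustrative hourly rental totals; the values are made up.
rentals = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2018-03-05 07:00", "2018-03-05 16:00",  # Monday
                "2018-03-12 07:00", "2018-03-12 16:00",  # Monday
                "2018-03-11 16:00", "2018-03-18 16:00",  # Sunday
            ]
        ),
        "rentals": [300, 520, 340, 560, 410, 450],
    }
)

rentals["day_of_the_week"] = rentals["timestamp"].dt.day_name()
rentals["hour"] = rentals["timestamp"].dt.hour

# Rows: hour of the day, columns: day of the week, cells: mean rentals.
average = rentals.pivot_table(
    index="hour", columns="day_of_the_week", values="rentals", aggfunc="mean"
)
```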

Popularity analysis of bike stations

The most popular bike stations in Warsaw (those with the largest median of daily bike rentals) are Al. Niepodległości - Batory, Stefana Banacha - UW and Rondo Jazdy Polskiej. The dataframe of the 10 bike stations with the highest median of bike rentals is presented below.
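Ranking by median rather than mean keeps a single exceptional day from deciding a station's position. The sketch below uses two hypothetical stations with made-up daily totals to show the difference.

```python
import pandas as pd

# Illustrative daily rentals; station "A" has one exceptional day (900).
daily = pd.DataFrame(
    {
        "station": ["A"] * 5 + ["B"] * 5,
        "rentals": [100, 110, 105, 95, 900, 150, 160, 155, 145, 150],
    }
)

# Median and mean of daily rentals per station, ranked by median.
ranking = (
    daily.groupby("station")["rentals"]
    .agg(["median", "mean"])
    .sort_values("median", ascending=False)
)
```

Here station B ranks first by median even though the outlier gives station A the higher mean.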
