Skip to content

Data analysis and visualisation of UK house price data with pyspark

Notifications You must be signed in to change notification settings

chrisk314/spark-uk-house-data-map

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Spark UK House Data Map

This project demonstrates how to use ESRI shapefiles to generate a map of the UK (actually the UK minus Northern Ireland due to data availability) using pyspark with pyshp, pyproj and matplotlib. Shapefiles which contain the geometry of the administrative districts of England, Wales and Scotland with coordinates specified in OSGB36 format are available in the Code-Point Open data set from Ordnance Survey (OS). The districts of England and Wales are coloured based on the mean sales price for a given year. House sale price data for England and Wales is available from The Office for National Statistics (ONS) in the Price Paid dataset. Data for Scotland and Northern Ireland is not available in this dataset.

The Price Paid dataset turned out to be too large (it's ~4GB) to load in to a Jupyter notebook with Pandas on my laptop (16GB memory). I decided it would be nice to demonstrate using the Spark DataFrame API for the data aggregations as Spark can be run in "standalone" mode on a single machine and it tends to handle memory a lot better than Pandas in my experience.

Anyway, here's the map. Take a look at the notebook to see how it's made.

map-mean-price-anim-small

Data sources

Below is a table of data sources used in this project and where to find them (they are all freely available). To recreate the map these data sources need to be available with there locations set in the .env file.

Dataset Description Available from Env var
OS Boundary Line Contains shapefiles specifying map geometry link DATA_BDLINE
OS Code-Point Open Contains data about postcode locations etc link DATA_CODEPO
ONS Price Paid Historic house sale price data link DATA_PP_CSV
ONS London Postcodes ...London Postcodes link DATA_LDN_POSTCODES_CSV
OS RPI Historical Retail Prices Index (RPI) data link DATA_RPI_CSV
OS GDP Historical Gross Domestic Product (GDP) data link DATA_GDP_CSV
OS Population Historical population data. This is in .xls format. I manually exported the data of interest in .csv format link DATA_POP_CSV
OS Labour Market Survey Historical employment related data link DATA_LMS_CSV

Install

Installing Spark standalone

Spark must be installed in order to run the notebook. Spark requires java. Currently Spark 2.3.2 only works with java version 8. To install java 8 on ubuntu

sudo apt-get install openjdk-8-jdk openjdk-8-jre

Then add these lines to your ~/.profile

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PATH=${PATH}:${JAVA_HOME}/bin

Download a precompiled binary version of Spark with Hadoop

cd /opt
wget http://apache.claz.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar -xzf spark-2.3.2-bin-hadoop2.7.tgz

Add the following lines to your ~/.profile

export SPARK_HOME=/opt/spark-2.3.2-bin-hadoop2.7
export PATH=${PATH}:${SPARK_HOME}/bin
export PYTHONPATH=${SPARK_HOME}/python:${PYTHONPATH}

Installing project dependencies

git clone git@github.com:chrisk314/spark-uk-house-data-map.git
cd spark-uk-house-data-map
virtualenv -p python3 venv
source .env
python -m pip install -r requirements.in

Run

The house price map is generated interactively by running the Jupyter notebook. To start the notebook server using pyspark run the below commands

source .env
pyspark

About

Data analysis and visualisation of UK house price data with pyspark

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published