This project demonstrates how to use ESRI shapefiles to generate a map of the UK (actually the UK
minus Northern Ireland due to data availability) using pyspark
with pyshp
, pyproj
and
matplotlib
. Shapefiles which contain the geometry of the administrative districts of England,
Wales and Scotland with coordinates specified in OSGB36 format are available in the Code-Point Open
data set from Ordnance Survey (OS). The districts of England and Wales are coloured based on the
mean sales price for a given year. House sale price data for England and Wales is available from
The Office for National Statistics (ONS) in the Price Paid dataset. Data for Scotland and Northern
Ireland is not available in this dataset.
The Price Paid dataset turned out to be too large (it's ~4GB) to load in to a Jupyter notebook with Pandas on my laptop (16GB memory). I decided it would be nice to demonstrate using the Spark DataFrame API for the data aggregations as Spark can be run in "standalone" mode on a single machine and it tends to handle memory a lot better than Pandas in my experience.
Anyway, here's the map. Take a look at the notebook to see how it's made.
Below is a table of data sources used in this project and where to find them (they are all
freely available). To recreate the map these data sources need to be available with there
locations set in the .env
file.
Dataset | Description | Available from | Env var |
---|---|---|---|
OS Boundary Line | Contains shapefiles specifying map geometry | link | DATA_BDLINE |
OS Code-Point Open | Contains data about postcode locations etc | link | DATA_CODEPO |
ONS Price Paid | Historic house sale price data | link | DATA_PP_CSV |
ONS London Postcodes | ...London Postcodes | link | DATA_LDN_POSTCODES_CSV |
OS RPI | Historical Retail Prices Index (RPI) data | link | DATA_RPI_CSV |
OS GDP | Historical Gross Domestic Product (GDP) data | link | DATA_GDP_CSV |
OS Population | Historical population data. This is in .xls format. I manually exported the data of interest in .csv format |
link | DATA_POP_CSV |
OS Labour Market Survey | Historical employment related data | link | DATA_LMS_CSV |
Spark must be installed in order to run the notebook. Spark requires java. Currently Spark 2.3.2 only works with java version 8. To install java 8 on ubuntu
sudo apt-get install openjdk-8-jdk openjdk-8-jre
Then add these lines to your ~/.profile
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PATH=${PATH}:${JAVA_HOME}/bin
Download a precompiled binary version of Spark with Hadoop
cd /opt
wget http://apache.claz.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar -xzf spark-2.3.2-bin-hadoop2.7.tgz
Add the following lines to your ~/.profile
export SPARK_HOME=/opt/spark-2.3.2-bin-hadoop2.7
export PATH=${PATH}:${SPARK_HOME}/bin
export PYTHONPATH=${SPARK_HOME}/python:${PYTHONPATH}
git clone git@github.com:chrisk314/spark-uk-house-data-map.git
cd spark-uk-house-data-map
virtualenv -p python3 venv
source .env
python -m pip install -r requirements.in
The house price map is generated interactively by running the Jupyter notebook. To start the
notebook server using pyspark
run the below commands
source .env
pyspark