This project is a data pipeline designed to extract and parse monthly chess games from the Lichess database.
Lichess is a popular online chess platform where millions of chess games are played every day. Each month, these games are compiled and published for free on the Lichess database, making them a great data source for chess-related projects. However, extracting and processing theses games poses two key challenges:
-
The PGN format: The Portable Game Notation format doesn't have a natural indexing system, making it difficult to access specific games or perform filtering and aggregation. As a result, PGN games must first be parsed into a JSON or tabular format to support these operations.
-
Size: Each monthly file from the database contains up to 100 million games which is about 350GB when decompressed. Processing files of this size can be time-consuming and expensive using traditional methods like Python.
This data pipeline aims to address these issues by providing the following features:
- Processing with Spark: Utilizing Apache Spark for parallel processing, the pipeline can handle 100 million games in just under 60 minutes.
- Parquet Format: Chess games are parsed from PGN to Parquet format, optimizing storage and data analysis capabilities.
- Flexible Transformations: Spark enables customizable, large-scale data transformations such as filtering and aggregation.
- Automated Processing: A serverless architecture enables automatic monthly processing of new data files.
This solution streamlines the handling of Lichess's monthly data files, making the data more accessible and manageable for developers and researchers.
- Spark Databricks for data processing
- Azure Data Factory (ADF) for orchestration, and
- Azure Data Lake Storage Gen2 (ADLS2) for storage.
See the process diagram for this data pipeline below:
The pipline has four main stages:
- Copy Data: Data Factory copies the compressed data file from the Lichess database to ADLS2.
- Decompress File: The downloaded ZST file is decompressed into PGN format.
- Parse Games: Spark parses the PGN file, extracting individual chess games and storing them into Parquet format.
- (Optional) Analyze Games: Spark can be used to further filter, enhance, or aggregate the dataset.
The core of the data pipeline consists of several Spark notebooks, each responsible for a specific stage of processing:
1-decompress-file
: Extracts data from the Raw layer to the Bronze layer.2-parse-games
: Parses data from Bronze to Silver layer, converting it from PGN to Parquet format3-analyze-games
: Performs custom analysis on parsed games, saving results in the Gold layer. Input your custom Spark logic here to transform the data as needed.0-run-all
: Orchestrates the execution of the above notebooks.
All the above notebooks can be found in this folder.
The 2-parse-games
notebook is the most important, handling the task of parsing PGN data into Parquet format. This process involves:
-
Reading Decompressed Data: The parsing process starts by reading the decompressed PGN data file into a DataFrame, where each row represents a single line from the original file.
-
Extracting Key and Value Information: Key and Value information are extracted from each line. For example, keys like Event, Site, Date, etc., are identified and their corresponding values are extracted.
-
Assigning GameIDs: A start-of-game identifier called GameID is assigned to the Event tag. This identifier effectively groups each Event line and the lines following it into a unique game:
- Pivot Operation: Once every lines are assigned with their GameIDs, a pivot operation transforms the grouped lines with the same GameID into one game record. This results in a final table where each row represents a complete game with all its attributes:
This pivot operation, while necessary, can be computationally expensive as it requires sorting the entire dataset to assign the correct GameID to each line. To perform this sorting task, Spark needs to move (or shuffle) all records into a single partition. Considering the large size of the dataset, this can lead to massive data spills and significantly slow down processing time.
To address this issue, the pipeline implements two key optimizations:
- Data Chunking: In the
1-decompress-file
notebook, the data file is split into smaller, more manageable chunks during the decompression step. These chunks are then passed to the2-parse-games
notebook for parsing. The idea is that each chunk will still be parsed in a single partition but since their size is now significantly smaller, Spark can handle them efficiently without the risk of spilling. - Parallel Processing: The
concurrent.futures
module is utilized to process these chunks in parallel. This ensures optimal resource utilization across all available nodes, with each node handling one data chunk at a time.
As an example of how this pipeline can be used and customized for specific applications, I use the data from this pipeline for my game "Guess The ELO". It's a chess-based quiz game where your goal is to guess the Elo rating of a chess match.
For “Guess the ELO”, I made some modifications to the default pipeline:
- Silver layer: Added additional filtering logic after parsing to select suitable chess games such as games with evaluation, with more than 20 moves, etc.
- Gold layer: Applied custom sampling logic to ensure ELO ratings are randomly distributed (as opposed to normally distributed). This makes sure that games from all ELO ranges have the same chance to be shown to the player.
- Added a final step to transfer the processed dataset into MongoDB for application usage.
The modified notebooks are provided here.
Addtionally, if you're interested in "Guess the ELO", feel free to check out the game here and its source code.