Skip to content

Blocking records for record linkage and data deduplication based on ANN algorithms in Python.

License

Notifications You must be signed in to change notification settings

ncn-foreigners/BlockingPy

Repository files navigation

License Project Status: Active – The project has reached a stable, usable state and is being actively developed. Python version Code Coverage
PyPI version Ruff Tests GitHub last commit Documentation Status PyPI Downloads

BlockingPy

BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on R blocking package.

Purpose

When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.

Installation

BlockingPy requires Python 3.10 or later. Installation is handled via PIP as follows:

pip install blockingpy

or i.e. with poetry:

poetry add blockingpy

Note

You may need to run the following beforehand:

sudo apt-get install -y libmlpack-dev # on Linux
brew install mlpack # on MacOS

Basic Usage

Record Linkage

from blockingpy import Blocker
import pandas as pd

# Example data for record linkage
x = pd.DataFrame({
    "txt": [
            "johnsmith",
            "smithjohn",
            "smiithhjohn",
            "smithjohnny",
            "montypython",
            "pythonmonty",
            "errmontypython",
            "monty",
        ]})

y = pd.DataFrame({
    "txt": [
            "montypython",
            "smithjohn",
            "other",
        ]})

# Initialize blocker instance
blocker = Blocker()

# Perform blocking with default ANN : FAISS
block_result = blocker.block(x = x['txt'], y = y['txt'])

Printing block_result contains:

  • The method used (faiss - refers to Facebook AI Similarity Search)
  • Number of blocks created (3 in this case)
  • Number of columns (features) used for blocking (intersecting n-grams generated from both datasets, 17 in this example)
  • Reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.8750 which means blocking reduces comparison by over 87.5%).
print(block_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns used for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 3  

By printing block_result.result we can take a look at the results table containing:

  • row numbers from the original data,
  • block number (integers),
  • distance (from the ANN algorithm).
print(block_result.result)
#    x  y  block  dist
# 0  4  0      0   0.0
# 1  1  1      1   0.0
# 2  7  2      2   6.0

Deduplication

We can perform deduplication by putting previously created DataFrame in the block() method.

dedup_result = blocker.block(x = x['txt'])
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns used for blocking: 25
# Reduction ratio: 0.5714
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          4 | 2 
print(dedup_result.result)
#    x  y  block  dist
# 0  1  0      0   2.0
# 1  1  2      0   2.0
# 2  1  3      0   2.0
# 3  5  4      1   2.0
# 4  4  6      1   3.0
# 5  4  7      1   6.0

Features

  • Multiple ANN algorithms available:

    • FAISS (Facebook AI Similarity Search)
    • Voyager (Spotify)
    • HNSW (Hierarchical Navigable Small World)
    • MLPACK (both LSH and k-d tree)
    • NND (Nearest Neighbor Descent)
    • Annoy (Spotify)
  • Multiple distance metrics such as:

    • Euclidean
    • Cosine
    • Inner Product

    and more...

  • Comprehensive algorithm parameters customization with control_ann and control_txt

  • Support for already created Document-Term-Matrices (as np.ndarray or csr_matrix)

  • Support for both record linkage and deduplication

  • Evaluation metrics when true blocks are known

You can find detailed information about BlockingPy in documentation.

Example Datasets

BlockingPy comes with built-in example datasets:

  • Census-Cis dataset created by Paula McLeod, Dick Heasman and Ian Forbes, ONS, for the ESSnet DI on-the-job training course, Southampton, 25-28 January 2011

  • Deduplication dataset taken from RecordLinkage R package developed by Murat Sariyar and Andreas Borg. Package is licensed under GPL-3 license. Also known as RLdata10000.

Disclaimer

BlockingPy is still under development, API and features may change. Also bugs or errors can occur.

License

BlockingPy is released under MIT license.

Third Party

BlockingPy benefits from many open-source packages such as Faiss or Annoy. For detailed information see third party notice.

Contributing

Please see CONTRIBUTING.md for more information.

Code of Conduct

You can find it here

Acknowledgements

This package is based on the R blocking package developed by BERENZ.

Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations -- quality, data integration and estimation)

About

Blocking records for record linkage and data deduplication based on ANN algorithms in Python.

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Languages