Splink

Splink is a Python library for probabilistic record linkage. It deduplicates and links datasets that lack unique identifiers, using the Fellegi-Sunter model, and scales from a laptop to big-data backends such as Spark or Athena.

Visit github.com →

Questions & Answers

What is Splink?
Splink is a Python library for probabilistic record linkage and entity resolution. It deduplicates and links records across datasets, even when unique identifiers are absent. Its linkage algorithm is based on the Fellegi-Sunter model, with customisations such as term frequency adjustments and user-defined fuzzy matching logic to improve accuracy.
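To make the Fellegi-Sunter idea concrete, here is a minimal sketch in plain Python. It does not use Splink's API. The column names, the m and u values, and the prior are illustrative assumptions, and `match_weight` is a hypothetical helper written for this example.

```python
import math

# Hypothetical m and u probabilities per column (illustrative values only).
# m = P(column agrees | the two records are a true match)
# u = P(column agrees | the two records are not a match)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.01},
    "dob": {"m": 0.95, "u": 0.001},
    "city": {"m": 0.80, "u": 0.10},
}


def match_weight(agreements, params=PARAMS, prior=0.001):
    """Return (total log2 match weight, match probability) for one record pair."""
    # Start from the prior odds that two randomly chosen records match
    # (the prior value is an assumption for this sketch).
    weight = math.log2(prior / (1 - prior))
    for column, agrees in agreements.items():
        m, u = params[column]["m"], params[column]["u"]
        # Agreement contributes log2(m/u); disagreement log2((1-m)/(1-u)).
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(bayes_factor)
    odds = 2 ** weight
    return weight, odds / (1 + odds)


strong = match_weight({"first_name": True, "dob": True, "city": True})
weak = match_weight({"first_name": False, "dob": False, "city": True})
print(f"all columns agree: weight={strong[0]:.2f}, probability={strong[1]:.4f}")
print(f"only city agrees:  weight={weak[0]:.2f}, probability={weak[1]:.6f}")
```

Each column adds or subtracts evidence independently, which is why the model works best when the columns are not highly correlated with one another.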
Who should use Splink?
Splink is for data practitioners and analysts who need to link and deduplicate data from multiple sources without relying on pre-existing unique identifiers. It is particularly effective for datasets with several columns that are not highly correlated with one another, and it is suitable for use in government, academia, and the private sector.
How does Splink differ from other data linkage tools?
Splink offers fast and scalable linkage, capable of processing millions of records on a laptop or hundreds of millions with big-data backends like Spark or Athena. A key differentiator is its unsupervised learning approach, which means no labelled training data is needed to train the model. It also provides interactive visualisations that help users understand the model.
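One common way to fit Fellegi-Sunter parameters without labelled data is expectation-maximisation (EM). The sketch below illustrates that general technique on simulated agreement patterns. It is not Splink's implementation, and the function names, starting guesses, and simulated probabilities are all assumptions made for this example.

```python
import random


def simulate_pairs(n, match_rate, m, u, rng):
    """Generate agreement vectors (1 = column agrees) for n record pairs."""
    pairs = []
    for _ in range(n):
        is_match = rng.random() < match_rate
        probs = m if is_match else u
        pairs.append(tuple(int(rng.random() < p) for p in probs))
    return pairs


def pattern_likelihood(pattern, probs):
    """P(agreement pattern | class), assuming columns are independent."""
    result = 1.0
    for agrees, p in zip(pattern, probs):
        result *= p if agrees else 1 - p
    return result


def estimate_em(pairs, n_cols, iterations=50):
    """Estimate the match rate, m and u from unlabelled agreement patterns."""
    lam, m, u = 0.1, [0.9] * n_cols, [0.1] * n_cols  # rough starting guesses
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        posteriors = []
        for pattern in pairs:
            pm = lam * pattern_likelihood(pattern, m)
            pu = (1 - lam) * pattern_likelihood(pattern, u)
            posteriors.append(pm / (pm + pu))
        # M-step: re-estimate parameters as posterior-weighted averages.
        total_match = sum(posteriors)
        total_non = len(pairs) - total_match
        lam = total_match / len(pairs)
        for c in range(n_cols):
            agree_match = sum(w for w, p in zip(posteriors, pairs) if p[c])
            agree_non = sum(1 - w for w, p in zip(posteriors, pairs) if p[c])
            m[c] = agree_match / total_match
            u[c] = agree_non / total_non
    return lam, m, u


rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_m, true_u = [0.9, 0.95, 0.8], [0.02, 0.01, 0.1]
pairs = simulate_pairs(5000, 0.2, true_m, true_u, rng)
lam, m_hat, u_hat = estimate_em(pairs, n_cols=3)
print("estimated match rate:", round(lam, 3))
print("estimated m:", [round(x, 3) for x in m_hat])
print("estimated u:", [round(x, 3) for x in u_hat])
```

No pair is ever labelled as a match or non-match. The algorithm recovers the parameters purely from how often the columns agree together.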
When is Splink most effective for data linkage?
Splink is most effective when input data contains multiple columns that are not highly correlated, such as name, date of birth, and city for people, or name, turnover, and sector for companies. It is not designed for linking a single column containing a 'bag of words'.
What technical details are notable about Splink?
Splink uses Fellegi-Sunter's model of record linkage and can execute linkage in Python using DuckDB, or with big-data backends like AWS Athena or Spark. It supports term frequency adjustments and user-defined fuzzy matching logic to improve accuracy.
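The intuition behind term frequency adjustments can be sketched in plain Python: agreement on a rare value is stronger evidence of a match than agreement on a common one. This is a simplified illustration, not Splink's actual calculation. The helper names, the surname data, and the m value are assumptions made for the example.

```python
import math
from collections import Counter


def term_frequencies(values):
    """Relative frequency of each value in a column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def tf_adjusted_weight(value, m, frequencies):
    """log2 Bayes factor for an exact match on `value`, using its frequency as u."""
    # The chance that two unrelated records share this value is roughly the
    # value's relative frequency, so it stands in for u in this sketch.
    u_value = frequencies[value]
    return math.log2(m / u_value)


# Made-up surname column: one very common value and one very rare value.
surnames = ["Smith"] * 50 + ["Jones"] * 30 + ["Khan"] * 19 + ["Zelenko"]
freqs = term_frequencies(surnames)
common = tf_adjusted_weight("Smith", m=0.9, frequencies=freqs)
rare = tf_adjusted_weight("Zelenko", m=0.9, frequencies=freqs)
print(f"match weight for agreeing on 'Smith':   {common:.2f}")
print(f"match weight for agreeing on 'Zelenko': {rare:.2f}")
```

Without the adjustment, both agreements would receive the same weight, even though two unrelated records are far more likely to share the common surname.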