Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of probabilistic linkage solutions available for record linkage applications.
GNU Affero General Public License v3.0 (AGPL-3.0): http://www.gnu.org/licenses/agpl-3.0.html
This database represents a synthetic population of Nebraska from 1920-2022. It was created using a publicly available GitHub repository that allows a user to build a synthetic population for a specific state. See that repository for an in-depth background on the project.
One of the primary uses of this dataset is training record linkage models. In public health, records often lack a single reliable unique person identifier (like a Social Security Number). By creating a synthetic dataset with yearly snapshots of the population from 1920-2022 and known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See the repository for links to further information on the record linkage process.
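As an illustration of how such golden data can be derived, here is a minimal sketch only; the snapshot file names and column names (person_id, first_name, last_name, dob) are assumptions, not the dataset's documented schema.

```python
import pandas as pd

# Hypothetical yearly snapshots of the synthetic population; file and column
# names are assumptions used for illustration.
snap_a = pd.read_csv("snapshot_1950.csv")
snap_b = pd.read_csv("snapshot_1960.csv")

# Candidate pairs: block on date of birth to avoid the full cross product.
pairs = snap_a.merge(snap_b, on="dob", suffixes=("_a", "_b"))

# The known unique identifier supplies the "golden" label for supervised learning.
pairs["is_match"] = (pairs["person_id_a"] == pairs["person_id_b"]).astype(int)

# Simple comparison features a record linkage model could be trained on.
pairs["same_first"] = (pairs["first_name_a"] == pairs["first_name_b"]).astype(int)
pairs["same_last"] = (pairs["last_name_a"] == pairs["last_name_b"]).astype(int)

training_data = pairs[["same_first", "same_last", "is_match"]]
```

Because the true person identifier exists for every synthetic record, every candidate pair receives a correct match/non-match label without any manual review.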
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRLF match pair overlap with Link Plus match pairs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Record linkage is a complex process and is used to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below:

```python
# Simple deterministic match: exact agreement on first and last name
if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
    out.match = True
else:
    out.match = False
```

while others may choose to create more complex decision trees or even machine learning approaches to record linkage.
When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While the fields used can vary on an organization-to-organization level, several fields appear more frequently than others. They are as follows:

* First Name
* Middle Name
* Last Name
* Date of Birth
* Sex at Birth
* Race
* SSN (maybe)
* Address house
* Address ZIP
* Address city
* Address county
* Address state
* Phone
* Email
* Date Data Submitted
By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether two records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt new last names, erroneously fill out data, or have information that closely resembles another individual's (e.g., twins).
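As a concrete illustration (not any particular production model), the sketch below builds numeric comparison features for a pair of records and trains a small classifier to make that yes/no decision; the field names, the toy training pairs, and the choice of RandomForestClassifier are all assumptions for demonstration.

```python
from difflib import SequenceMatcher

from sklearn.ensemble import RandomForestClassifier

def name_sim(a: str, b: str) -> float:
    """Fuzzy string similarity in [0, 1], useful for typos and nicknames."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compare(r1: dict, r2: dict) -> list:
    """Turn a pair of records into a numeric comparison vector."""
    return [
        name_sim(r1["first"], r2["first"]),
        name_sim(r1["last"], r2["last"]),
        float(r1["dob"] == r2["dob"]),
        float(r1["zip"] == r2["zip"]),
    ]

# Tiny labeled sample: 1 = same person, 0 = different people.
# Real training labels would come from manual review or golden synthetic data.
pairs = [
    ({"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
     {"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"}, 1),
    ({"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
     {"first": "Maria", "last": "Lopez", "dob": "1985-02-01", "zip": "98101"}, 0),
]
X = [compare(r1, r2) for r1, r2, _ in pairs]
y = [label for _, _, label in pairs]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a new, unlabeled candidate pair.
candidate = compare(
    {"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
    {"first": "Wanda", "last": "Turner", "dob": "1992-09-13", "zip": "98682"},
)
print(model.predict_proba([candidate]))
```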
As described above, record linkage can have many complex elements to it.
Consider a situation where you are manually reviewing two records.
These two records contain only basic information about the individuals, and you are tasked with deciding whether Record #1 and Record #2 belong to the same person.
| Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
|---|---|---|---|---|---|---|---|---|
| 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
| 2 | Wanda | Turner | | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |
At a glance, these records are significantly different, so you would likely mark them as different persons. For the purposes of record linkage manual review, you probably made the correct decision; after all, most record linkage models prefer false negatives to false positives.
When groups validate record linkage models, they often turn to manually-reviewed record comparisons as their "gold standard". There are two separate goals for record linkage that I would like us to consider:

1. Creating a model that simulates a human's decision-making process
2. Creating a model that seeks a deeper record equality "Truth"
I believe many groups aim for, and are content with, accomplishing goal #1. That approach is inarguably useful. However, I believe it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriages/civil unions (more often those recorded as "Female" Sex at Birth). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the two records displayed earlier: there is a real chance that Wanda adopted a new last name and moved during the six years between when the two records were collected.
Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether or not "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.
We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. It will have a relational-database structure with tables containing all relevant supporting structures for record linkage, such as:

* birth records
* partnership (marriage) records
* moving records
We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that let a user capture all data as of a specific year.
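A minimal sketch of what such a "Time Travel" query could look like, assuming a hypothetical schema in which each address row carries valid_from/valid_to years; the table layout and column names are illustrative, not the project's actual design.

```python
import sqlite3

# Hypothetical schema: each address row records the years it was valid for,
# so a query can reconstruct the population "as of" any given year.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person  (person_id INTEGER, first_name TEXT, birth_year INTEGER);
    CREATE TABLE address (person_id INTEGER, street TEXT, valid_from INTEGER, valid_to INTEGER);
    INSERT INTO person  VALUES (1, 'Wanda', 1992);
    INSERT INTO address VALUES (1, '1768 Walker Rd. Unit 209', 2010, 2018),
                               (1, '4545 Pennsylvania Ct.',    2018, 9999);
""")

def snapshot(year: int):
    """Return each person born by `year` with the address they had in that year."""
    return con.execute(
        """
        SELECT p.person_id, p.first_name, a.street
        FROM person p
        JOIN address a ON a.person_id = p.person_id
        WHERE p.birth_year <= ? AND a.valid_from <= ? AND ? < a.valid_to
        """,
        (year, year, year),
    ).fetchall()

print(snapshot(2015))   # -> [(1, 'Wanda', '1768 Walker Rd. Unit 209')]
print(snapshot(2021))   # -> [(1, 'Wanda', '4545 Pennsylvania Ct.')]
```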
By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models considering goal #2.
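With that absolute truth in hand, assessing a linkage model reduces to standard classification metrics over candidate pairs; a minimal sketch with made-up labels follows.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Ground truth from the simulated society: does each candidate pair share a person identifier?
y_true = [1, 0, 1, 1, 0, 0]
# Decisions produced by the record linkage model under evaluation.
y_pred = [1, 0, 0, 1, 0, 1]

# For record linkage, precision (avoiding false positives) is often weighted
# more heavily than recall, as noted above.
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```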
After we wrap up this work, we will build processes and functions for simulating human error in filling out records, as well as functions to help support bias recognition when using our dataset as a test set.
PJ Gibson - peter.gibson@doh.wa.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic characteristics of the 2016 California birth cohort.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.
Other: https://choosealicense.com/licenses/other/
Updates
Add CA dataset
Whereabouts: Reference databases
This is a space containing reference databases to be used by whereabouts. Whereabouts is a geocoding package in Python that implements some clever record linkage algorithms in SQL using DuckDB. The package itself is available at whereabouts and can be installed via `pip install whereabouts`.
Installation of reference databases
Put the downloaded duckdb database files in the Model folder where your… See the full description on the dataset page: https://huggingface.co/datasets/2SFCA/whereabouts-db-us.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.
What's Included:
* Raw Data: 20K book metadata records (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
* Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization
* Ready-to-Run Code: Fully documented scripts with practice exercises
* Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels

Key Features:

* Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price)
* Comprehensive documentation and setup instructions
* Generates 6+ professional visualizations
* Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
* Perfect for business analytics, market research, and data science education

Use Cases:

* Learning data analytics fundamentals
* Book market analysis and trends
* Customer behavior insights
* Price optimization studies
* Review sentiment analysis
* Academic coursework and projects

This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.
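For orientation, here is a hedged sketch of the kind of load/clean/visualize step the tutorial scripts walk through; the file name amazon_books.csv and the column names are placeholders rather than the dataset's actual schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names for illustration; check the dataset's
# documentation for the real schema.
books = pd.read_csv("amazon_books.csv")

# Basic cleaning: drop rows missing price or rating, coerce price to numeric.
books = books.dropna(subset=["price", "rating"])
books["price"] = pd.to_numeric(books["price"], errors="coerce")

# Simple exploratory view: how average rating varies across price bands.
books["price_band"] = pd.cut(books["price"], bins=[0, 10, 20, 40, 100, 1000])
books.groupby("price_band", observed=True)["rating"].mean().plot(kind="bar")
plt.ylabel("Average rating")
plt.tight_layout()
plt.show()
```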
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate detection or entity resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule. Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2

* New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
* Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
* Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
* Improved data validation and formatting for address, email, and date fields.
* Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases)

1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2. Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
3. Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
4. Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format

The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration

The included Python script can be used to fully regenerate the dataset and supports:

* Addition of new duplication rules
* Regional, linguistic, or domain-specific variations
* Volume scaling for large-scale testing scenarios

Files Included

* spider_dataset_v2_6_20251027_022215.csv
* spider_dataset_v2_6_20251027_022215.json
* spider_readme_v2.md
* SPIDER_generation_script_v2.py
* SupportingDocuments/ folder containing:
  * benchmark_comparison_script.py – script used to derive the F1 score.
  * Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
  * ssa_firstnames.csv – Social Security Administration names dataset.
  * simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
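To illustrate the style of generation described (a sketch only, not the actual SPIDER_generation_script_v2.py), the following uses the Python Faker library to create base records and one nickname-style duplicate; the field values and the single rule shown are assumptions.

```python
import copy
import uuid

from faker import Faker

fake = Faker()
NICKNAMES = {"Robert": "Bob", "Elizabeth": "Liz"}   # illustrative subset only

def base_record() -> dict:
    """One synthetic person record with typical customer/citizen fields."""
    return {
        "record_id": str(uuid.uuid4()),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "address": fake.address().replace("\n", ", "),
        "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "is_duplicate_of": None,
        "duplication_rule": None,
    }

def nickname_duplicate(base: dict) -> dict:
    """Rule-5-style duplicate: same person recorded under an informal first name."""
    dup = copy.deepcopy(base)
    dup["record_id"] = str(uuid.uuid4())
    dup["first_name"] = NICKNAMES.get(base["first_name"], base["first_name"])
    dup["is_duplicate_of"] = base["record_id"]
    dup["duplication_rule"] = 5
    return dup

records = [base_record() for _ in range(5)]
records += [nickname_duplicate(r) for r in records[:2]]
```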
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-linking and mass spectrometry (XL-MS) workflows are increasingly popular techniques for generating low-resolution structural information about interacting biomolecules. xQuest is an established software package for analysis of protein–protein XL-MS data, supporting stable isotope-labeled cross-linking reagents. Resultant paired peaks in mass spectra aid sensitivity and specificity of data analysis. The recently developed cross-linking of isotope-labeled RNA and mass spectrometry (CLIR-MS) approach extends the XL-MS concept to protein–RNA interactions, also employing isotope-labeled cross-link (XL) species to facilitate data analysis. Data from CLIR-MS experiments are broadly compatible with core xQuest functionality, but the required analysis approach for this novel data type presents several technical challenges not optimally served by the original xQuest package. Here we introduce RNxQuest, a Python package extension for xQuest, which automates the analysis approach required for CLIR-MS data, providing bespoke, state-of-the-art processing and visualization functionality for this novel data type. Using functions included with RNxQuest, we evaluate three false discovery rate control approaches for CLIR-MS data. We demonstrate the versatility of the RNxQuest-enabled data analysis pipeline by also reanalyzing published protein–RNA XL-MS data sets that lack isotope-labeled RNA. This study demonstrates that RNxQuest provides a sensitive and specific data analysis pipeline for detection of isotope-labeled XLs in protein–RNA XL-MS experiments.