Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of probabilistic linkage solutions available for record linkage applications.
GNU Affero General Public License v3.0 (AGPL-3.0): http://www.gnu.org/licenses/agpl-3.0.html
This database represents a synthetic population of Nebraska from 1920-2022. It was created using a publicly available GitHub repository that allows a user to build a synthetic population for a specific state. See that repository for an in-depth background on the project.
One of the primary uses of this dataset is training record linkage models. In public health, records often lack a single reliable unique person identifier (like a Social Security Number). By creating a synthetic dataset with yearly snapshots of the population from 1920-2022 and known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See the repository for links to further information on the record linkage process.
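As an illustration of how such golden data can be derived, here is a minimal sketch only; the snapshot file names and column names (person_id, first_name, last_name, dob) are assumptions, not the dataset's documented schema.

```python
import pandas as pd

# Hypothetical yearly snapshots of the synthetic population; file and column
# names are assumptions used for illustration.
snap_a = pd.read_csv("snapshot_1950.csv")
snap_b = pd.read_csv("snapshot_1960.csv")

# Candidate pairs: block on date of birth to avoid the full cross product.
pairs = snap_a.merge(snap_b, on="dob", suffixes=("_a", "_b"))

# The known unique identifier supplies the "golden" label for supervised learning.
pairs["is_match"] = (pairs["person_id_a"] == pairs["person_id_b"]).astype(int)

# Simple comparison features a record linkage model could be trained on.
pairs["same_first"] = (pairs["first_name_a"] == pairs["first_name_b"]).astype(int)
pairs["same_last"] = (pairs["last_name_a"] == pairs["last_name_b"]).astype(int)

training_data = pairs[["same_first", "same_last", "is_match"]]
```

Because the true person identifier exists for every synthetic record, every candidate pair receives a correct match/non-match label without any manual review.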
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRLF match pair overlap with Link Plus match pairs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Record linkage is a complex process and is used to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below:

```python
# Simple deterministic match: exact agreement on first and last name
if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
    out.match = True
else:
    out.match = False
```

while others may choose to create more complex decision trees or even machine learning approaches to record linkage.
When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While the fields used can vary on an organization-to-organization level, several fields appear more frequently than others. They are as follows:

* First Name
* Middle Name
* Last Name
* Date of Birth
* Sex at Birth
* Race
* SSN (maybe)
* Address house
* Address ZIP
* Address city
* Address county
* Address state
* Phone
* Email
* Date Data Submitted
By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether two records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt new last names, erroneously fill out data, or have information that closely resembles another individual's (e.g., twins).
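As a concrete illustration (not any particular production model), the sketch below builds numeric comparison features for a pair of records and trains a small classifier to make that yes/no decision; the field names, the toy training pairs, and the choice of RandomForestClassifier are all assumptions for demonstration.

```python
from difflib import SequenceMatcher

from sklearn.ensemble import RandomForestClassifier

def name_sim(a: str, b: str) -> float:
    """Fuzzy string similarity in [0, 1], useful for typos and nicknames."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compare(r1: dict, r2: dict) -> list:
    """Turn a pair of records into a numeric comparison vector."""
    return [
        name_sim(r1["first"], r2["first"]),
        name_sim(r1["last"], r2["last"]),
        float(r1["dob"] == r2["dob"]),
        float(r1["zip"] == r2["zip"]),
    ]

# Tiny labeled sample: 1 = same person, 0 = different people.
# Real training labels would come from manual review or golden synthetic data.
pairs = [
    ({"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
     {"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"}, 1),
    ({"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
     {"first": "Maria", "last": "Lopez", "dob": "1985-02-01", "zip": "98101"}, 0),
]
X = [compare(r1, r2) for r1, r2, _ in pairs]
y = [label for _, _, label in pairs]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a new, unlabeled candidate pair.
candidate = compare(
    {"first": "Wanda", "last": "Smith", "dob": "1992-09-13", "zip": "99301"},
    {"first": "Wanda", "last": "Turner", "dob": "1992-09-13", "zip": "98682"},
)
print(model.predict_proba([candidate]))
```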
As described above, record linkage can have many complex elements to it.
Consider a situation where you are manually reviewing two records.
These two records contain only basic information about the individuals, and you are tasked with deciding whether Record #1 and Record #2 belong to the same person.
| Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
|---|---|---|---|---|---|---|---|---|
| 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
| 2 | Wanda | Turner | | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |
At a glance, these records are significantly different, so you would likely mark them as different persons. For the purposes of record linkage manual review, you probably made the correct decision; after all, most record linkage models prefer false negatives to false positives.
When groups validate record linkage models, they often turn to manually-reviewed record comparisons as their "gold standard". There are two separate goals for record linkage that I would like us to consider:

1. Creating a model that simulates a human's decision-making process
2. Creating a model that seeks a deeper record equality "Truth"
I believe many groups aim for, and are content with, accomplishing goal #1. That approach is inarguably useful. However, I believe it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriages/civil unions (more often those recorded as "Female" Sex at Birth). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the two records displayed earlier: there is a real chance that Wanda adopted a new last name and moved during the six years between when the two records were collected.
Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether or not "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.
We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. It will have a relational-database structure with tables containing all relevant supporting structures for record linkage, such as:

* birth records
* partnership (marriage) records
* moving records
We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that let a user capture all data as of a specific year.
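A minimal sketch of what such a "Time Travel" query could look like, assuming a hypothetical schema in which each address row carries valid_from/valid_to years; the table layout and column names are illustrative, not the project's actual design.

```python
import sqlite3

# Hypothetical schema: each address row records the years it was valid for,
# so a query can reconstruct the population "as of" any given year.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person  (person_id INTEGER, first_name TEXT, birth_year INTEGER);
    CREATE TABLE address (person_id INTEGER, street TEXT, valid_from INTEGER, valid_to INTEGER);
    INSERT INTO person  VALUES (1, 'Wanda', 1992);
    INSERT INTO address VALUES (1, '1768 Walker Rd. Unit 209', 2010, 2018),
                               (1, '4545 Pennsylvania Ct.',    2018, 9999);
""")

def snapshot(year: int):
    """Return each person born by `year` with the address they had in that year."""
    return con.execute(
        """
        SELECT p.person_id, p.first_name, a.street
        FROM person p
        JOIN address a ON a.person_id = p.person_id
        WHERE p.birth_year <= ? AND a.valid_from <= ? AND ? < a.valid_to
        """,
        (year, year, year),
    ).fetchall()

print(snapshot(2015))   # -> [(1, 'Wanda', '1768 Walker Rd. Unit 209')]
print(snapshot(2021))   # -> [(1, 'Wanda', '4545 Pennsylvania Ct.')]
```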
By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models considering goal #2.
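With that absolute truth in hand, assessing a linkage model reduces to standard classification metrics over candidate pairs; a minimal sketch with made-up labels follows.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Ground truth from the simulated society: does each candidate pair share a person identifier?
y_true = [1, 0, 1, 1, 0, 0]
# Decisions produced by the record linkage model under evaluation.
y_pred = [1, 0, 0, 1, 0, 1]

# For record linkage, precision (avoiding false positives) is often weighted
# more heavily than recall, as noted above.
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```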
After we wrap up this work, we will build processes and functions for simulating human error in filling out records, as well as functions to help support bias recognition when using our dataset as a test set.
PJ Gibson - peter.gibson@doh.wa.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic characteristics of the 2016 California birth cohort.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.
Other: https://choosealicense.com/licenses/other/
Updates
Add CA dataset
Whereabouts: Reference databases
This is a space containing reference databases to be used by whereabouts. Whereabouts is a geocoding package in Python that implements some clever record linkage algorithms in SQL using DuckDB. The package itself is available at whereabouts and can be installed via `pip install whereabouts`.
Installation of reference databases
Put the downloaded duckdb database files in the Model folder where your… See the full description on the dataset page: https://huggingface.co/datasets/2SFCA/whereabouts-db-us.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.
What's Included:
* Raw Data: 20K book metadata records (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
* Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization
* Ready-to-Run Code: Fully documented scripts with practice exercises
* Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels

Key Features:

* Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price)
* Comprehensive documentation and setup instructions
* Generates 6+ professional visualizations
* Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
* Perfect for business analytics, market research, and data science education

Use Cases:

* Learning data analytics fundamentals
* Book market analysis and trends
* Customer behavior insights
* Price optimization studies
* Review sentiment analysis
* Academic coursework and projects

This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.
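For orientation, here is a hedged sketch of the kind of load/clean/visualize step the tutorial scripts walk through; the file name amazon_books.csv and the column names are placeholders rather than the dataset's actual schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names for illustration; check the dataset's
# documentation for the real schema.
books = pd.read_csv("amazon_books.csv")

# Basic cleaning: drop rows missing price or rating, coerce price to numeric.
books = books.dropna(subset=["price", "rating"])
books["price"] = pd.to_numeric(books["price"], errors="coerce")

# Simple exploratory view: how average rating varies across price bands.
books["price_band"] = pd.cut(books["price"], bins=[0, 10, 20, 40, 100, 1000])
books.groupby("price_band", observed=True)["rating"].mean().plot(kind="bar")
plt.ylabel("Average rating")
plt.tight_layout()
plt.show()
```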
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate detection or entity resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule. Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2

* New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
* Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
* Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
* Improved data validation and formatting for address, email, and date fields.
* Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases)

1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2. Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
3. Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
4. Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format

The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration

The included Python script can be used to fully regenerate the dataset and supports:

* Addition of new duplication rules
* Regional, linguistic, or domain-specific variations
* Volume scaling for large-scale testing scenarios

Files Included

* spider_dataset_v2_6_20251027_022215.csv
* spider_dataset_v2_6_20251027_022215.json
* spider_readme_v2.md
* SPIDER_generation_script_v2.py
* SupportingDocuments/ folder containing:
  * benchmark_comparison_script.py – script used to derive the F1 score.
  * Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
  * ssa_firstnames.csv – Social Security Administration names dataset.
  * simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
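To illustrate the style of generation described (a sketch only, not the actual SPIDER_generation_script_v2.py), the following uses the Python Faker library to create base records and one nickname-style duplicate; the field values and the single rule shown are assumptions.

```python
import copy
import uuid

from faker import Faker

fake = Faker()
NICKNAMES = {"Robert": "Bob", "Elizabeth": "Liz"}   # illustrative subset only

def base_record() -> dict:
    """One synthetic person record with typical customer/citizen fields."""
    return {
        "record_id": str(uuid.uuid4()),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "address": fake.address().replace("\n", ", "),
        "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "is_duplicate_of": None,
        "duplication_rule": None,
    }

def nickname_duplicate(base: dict) -> dict:
    """Rule-5-style duplicate: same person recorded under an informal first name."""
    dup = copy.deepcopy(base)
    dup["record_id"] = str(uuid.uuid4())
    dup["first_name"] = NICKNAMES.get(base["first_name"], base["first_name"])
    dup["is_duplicate_of"] = base["record_id"]
    dup["duplication_rule"] = 5
    return dup

records = [base_record() for _ in range(5)]
records += [nickname_duplicate(r) for r in records[:2]]
```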
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-linking and mass spectrometry (XL-MS) workflows are increasingly popular techniques for generating low-resolution structural information about interacting biomolecules. xQuest is an established software package for analysis of protein–protein XL-MS data, supporting stable isotope-labeled cross-linking reagents. Resultant paired peaks in mass spectra aid sensitivity and specificity of data analysis. The recently developed cross-linking of isotope-labeled RNA and mass spectrometry (CLIR-MS) approach extends the XL-MS concept to protein–RNA interactions, also employing isotope-labeled cross-link (XL) species to facilitate data analysis. Data from CLIR-MS experiments are broadly compatible with core xQuest functionality, but the required analysis approach for this novel data type presents several technical challenges not optimally served by the original xQuest package. Here we introduce RNxQuest, a Python package extension for xQuest, which automates the analysis approach required for CLIR-MS data, providing bespoke, state-of-the-art processing and visualization functionality for this novel data type. Using functions included with RNxQuest, we evaluate three false discovery rate control approaches for CLIR-MS data. We demonstrate the versatility of the RNxQuest-enabled data analysis pipeline by also reanalyzing published protein–RNA XL-MS data sets that lack isotope-labeled RNA. This study demonstrates that RNxQuest provides a sensitive and specific data analysis pipeline for detection of isotope-labeled XLs in protein–RNA XL-MS experiments.