11 datasets found
  1. Features of probabilistic linkage solutions available for record linkage applications.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Features of probabilistic linkage solutions available for record linkage applications. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t001
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Features of probabilistic linkage solutions available for record linkage applications.

  2. synthetic-gold-database

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    PJ Gibson (2023). synthetic-gold-database [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-gold-database
    Available download formats: zip (9292035305 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    PJ Gibson
    License

    GNU Affero General Public License v3.0: http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Synthetic Gold

    This database represents a synthetic population of Nebraska from 1920-2022. It was created using a publicly available GitHub repository that lets a user generate a synthetic population for a specific state; see that repository for an in-depth background on the project.

    Record Linkage

    One of the primary uses of this dataset is training record linkage models. In public health settings, health records often lack a single reliable unique person identifier (such as a Social Security Number). By creating a synthetic dataset with snapshots of the population each year from 1920-2022, with known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See below for some links to useful information on the record linkage process:
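    Because every record carries a known unique person identifier, labeled pairs for supervised linkage models can be derived mechanically from any yearly snapshot. A minimal sketch of that idea, assuming hypothetical column names (person_id, first_name, last_name, dob) rather than this dataset's actual schema:

    ```python
    import itertools

    import pandas as pd

    # Hypothetical one-year snapshot; these column names are illustrative only.
    snapshot = pd.DataFrame({
        "person_id":  [101, 101, 202],
        "first_name": ["Wanda", "Wanda", "Maria"],
        "last_name":  ["Smith", "Turner", "Lopez"],
        "dob":        ["1992-09-13", "1992-09-13", "1988-02-01"],
    })

    # Enumerate candidate record pairs and label each one from the known
    # identifier -- the label is the "golden" ground truth for training.
    pairs = []
    for (i, r1), (j, r2) in itertools.combinations(snapshot.iterrows(), 2):
        pairs.append({
            "left": i,
            "right": j,
            "same_first": r1["first_name"] == r2["first_name"],
            "same_last":  r1["last_name"] == r2["last_name"],
            "same_dob":   r1["dob"] == r2["dob"],
            "is_match":   r1["person_id"] == r2["person_id"],
        })

    training_frame = pd.DataFrame(pairs)
    print(training_frame)
    ```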

  3. Fit statistics for scored XGBoost models with 50,000 rows per dataset.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  4. PRLF match pair overlap with Link Plus match pairs.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). PRLF match pair overlap with Link Plus match pairs. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t003
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PRLF match pair overlap with Link Plus match pairs.

  5. Synthetic Oregon

    • kaggle.com
    zip
    Updated Jul 12, 2023
    Cite
    PJ Gibson (2023). Synthetic Oregon [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-oregon/suggestions
    Available download formats: zip (7872753697 bytes)
    Dataset updated
    Jul 12, 2023
    Authors
    PJ Gibson
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Oregon
    Description

    synthetic-gold

    Introduction

    Record linkage is a complex process and is utilized to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below:

    ```python
    # Pseudo-code!
    if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
        out.match = True
    else:
        out.match = False
    ```

    while others may choose to create more complex decision trees or even machine learning approaches to record linkage.

    When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While these ML-utilized fields can vary on an organization-to-organization level, there are several fields that appear more frequently than others. They are as follows:
    * First Name
    * Middle Name
    * Last Name
    * Date of Birth
    * Sex at Birth
    * Race
    * SSN (maybe)
    * Address house
    * Address zip
    * Address city
    * Address county
    * Address state
    * Phone
    * Email
    * Date Data Submitted

    By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether two records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt a new last name, fill out data erroneously, or have information that closely resembles another individual's (e.g., twins).
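    To make the field-by-field comparison concrete, here is a minimal sketch of turning a record pair into a numeric feature vector that a trained classifier could score. The field names and the similarity measure (difflib's SequenceMatcher) are illustrative choices, not the approach of any particular linkage package:

    ```python
    from difflib import SequenceMatcher


    def string_sim(a: str, b: str) -> float:
        """Rough string similarity in [0, 1]; 1.0 means identical."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()


    def pair_features(r1: dict, r2: dict) -> dict:
        """Compare two records field by field and return numeric features."""
        return {
            "first_name_sim": string_sim(r1["first_name"], r2["first_name"]),
            "last_name_sim":  string_sim(r1["last_name"], r2["last_name"]),
            "dob_exact":      float(r1["dob"] == r2["dob"]),
            "zip_exact":      float(r1["zip"] == r2["zip"]),
        }


    r1 = {"first_name": "Wanda", "last_name": "Smith",  "dob": "1992-09-13", "zip": "99301"}
    r2 = {"first_name": "Wanda", "last_name": "Turner", "dob": "1992-09-13", "zip": "98682"}

    features = pair_features(r1, r2)

    # A trained model (decision tree, gradient boosting, logistic regression, ...)
    # would consume this vector; a fixed weighted-sum threshold is the simplest stand-in.
    score = (0.4 * features["first_name_sim"] + 0.3 * features["last_name_sim"]
             + 0.2 * features["dob_exact"] + 0.1 * features["zip_exact"])
    print(features, "match" if score > 0.8 else "non-match")
    ```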

    The Problem

    As described above, record linkage has many complex elements.
    Consider a situation where you are manually reviewing two records. These records contain only basic information on the individuals, and you are tasked with deciding whether Record #1 and Record #2 belong to the same person.

    | Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
    |---|---|---|---|---|---|---|---|---|
    | 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
    | 2 | Wanda | Turner |  | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |

    At a glance, these records look significantly different, so you would mark them as different persons. For the purposes of manual review in record linkage, you probably made the correct decision; after all, most record linkage applications prefer false negatives to false positives.

    When groups validate record linkage models, they often turn to manually reviewed record comparisons as their "gold standard". There are two separate standards of judgment for record linkage that I would like us to consider:
    1. Creating a model that simulates a human's decision-making process
    2. Creating a model that seeks a deeper record-equality "Truth"

    I believe many groups aim for, and are content with, accomplishing goal #1. That approach is inarguably useful. However, I believe it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriages/civil unions (more often those with "Female" Sex at Birth). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the two records displayed earlier: there is a real chance that Wanda adopted a new last name and moved during the six years between when the data was collected.

    Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.

    The Solution

    We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. It will have a relational-database-style structure, with tables containing all relevant supporting structures for record linkage, such as:
    * birth records
    * partnership (marriage) records
    * moving records

    We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that let a user capture all data from a specific year.
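    As an illustration of what such a "Time Travel" query could look like, here is a sketch using DuckDB from Python. The storage engine, table names, and columns (persons, residence_history, start_year, end_year) are assumptions made for illustration, not the project's actual schema:

    ```python
    import duckdb

    # Hypothetical database file holding the simulated society.
    con = duckdb.connect("synthetic_society.duckdb")

    snapshot_year = 1975

    # Capture everyone's name and address as of the chosen year by joining the
    # person table to the residence interval covering that year.
    snapshot = con.execute(
        """
        SELECT p.person_id, p.first_name, p.last_name, r.address, r.zip
        FROM persons AS p
        JOIN residence_history AS r
          ON r.person_id = p.person_id
         AND ? BETWEEN r.start_year AND r.end_year
        """,
        [snapshot_year],
    ).df()
    print(snapshot.head())
    ```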

    By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models against goal #2.

    Following Work

    After we wrap up this work, we will develop processes/functions for simulating human error in record filling, as well as functions to help support bias recognition when using our dataset as a test set.

    Contact

    PJ Gibson - peter.gibson@doh.wa.gov

  6. Demographic characteristics of the 2016 California birth cohort.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Demographic characteristics of the 2016 California birth cohort. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t004
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description

    Demographic characteristics of the 2016 California birth cohort.

  7. Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t005
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.

  8. whereabouts-db-us

    • huggingface.co
    Updated Sep 16, 2025
    Cite
    Eric S. (2025). whereabouts-db-us [Dataset]. https://huggingface.co/datasets/2SFCA/whereabouts-db-us
    Dataset updated
    Sep 16, 2025
    Authors
    Eric S.
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Updates

    Add CA dataset

      Whereabouts: Reference databases
    

    This is a space containing reference databases to be used by whereabouts. Whereabouts is a geocoding package in Python that implements some clever record linkage algorithms in SQL using DuckDB. The package itself is available at whereabouts and can be installed via pip install whereabouts.

      Installation of reference databases
    

    Put the downloaded duckdb database files in the Model folder where your… See the full description on the dataset page: https://huggingface.co/datasets/2SFCA/whereabouts-db-us.

  9. Amazon Books Dataset (20K Books + 727K Reviews)

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Hadi Fariborzi (2025). Amazon Books Dataset (20K Books + 727K Reviews) [Dataset]. https://www.kaggle.com/datasets/hadifariborzi/amazon-books-dataset-20k-books-727k-reviews
    Available download formats: zip (233373889 bytes)
    Dataset updated
    Oct 21, 2025
    Authors
    Hadi Fariborzi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.

    What's Included:
    * Raw Data: 20K book metadata (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
    * Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization
    * Ready-to-Run Code: Fully documented scripts with practice exercises
    * Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels

    Key Features:
    * Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price)
    * Comprehensive documentation and setup instructions
    * Generates 6+ professional visualizations
    * Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
    * Perfect for business analytics, market research, and data science education

    Use Cases:
    * Learning data analytics fundamentals
    * Book market analysis and trends
    * Customer behavior insights
    * Price optimization studies
    * Review sentiment analysis
    * Academic coursework and projects

    This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.

  10. SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution

    • figshare.com
    csv
    Updated Oct 29, 2025
    + more versions
    Cite
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur (2025). SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution [Dataset]. http://doi.org/10.6084/m9.figshare.30472712.v1
    Available download formats: csv
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking Duplicate or Entity Resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule. Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

    Enhancements in Version 2
    * New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
    * Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
    * Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
    * Improved data validation and formatting for address, email, and date fields.
    * Updated Python generation script for modular configuration, reproducibility, and extensibility.

    Duplicate Rules (with real-world use cases)
    1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
    2. Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
    3. Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
    4. Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
    5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
    6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
    7. Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

    Output Format
    The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

    Data Regeneration
    The included Python script can be used to fully regenerate the dataset and supports:
    * Addition of new duplication rules
    * Regional, linguistic, or domain-specific variations
    * Volume scaling for large-scale testing scenarios

    Files Included
    * spider_dataset_v2_6_20251027_022215.csv
    * spider_dataset_v2_6_20251027_022215.json
    * spider_readme_v2.md
    * SPIDER_generation_script_v2.py
    * SupportingDocuments/ folder containing:
      * benchmark_comparison_script.py – script used to derive the F1 score.
      * Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
      * ssa_firstnames.csv – Social Security Administration names dataset.
      * simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
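    Since the description names the linkage fields explicitly (is_duplicate_of, duplication_rule, cluster_id), deriving ground-truth duplicate pairs for benchmarking is mostly bookkeeping. A minimal sketch, assuming a record_id column as the base-record identifier (an assumption; only the three fields above are named in the description):

    ```python
    import pandas as pd

    # Filename taken from the "Files Included" list above; record_id is assumed.
    df = pd.read_csv("spider_dataset_v2_6_20251027_022215.csv")

    # Ground-truth duplicate pairs: each duplicate row points back at its base record.
    dups = df[df["is_duplicate_of"].notna()]
    truth_pairs = set(zip(dups["record_id"], dups["is_duplicate_of"]))

    # Entity-level view: cluster_id groups a base record with all of its duplicates.
    cluster_sizes = df.groupby("cluster_id").size()
    print(f"{len(truth_pairs)} duplicate pairs across {len(cluster_sizes)} clusters")

    # Duplicate-rule breakdown mirrors the seven transformation rules listed above.
    print(dups["duplication_rule"].value_counts())

    # Evaluating a candidate linkage output then reduces to set arithmetic:
    #   precision = |predicted & truth| / |predicted|
    #   recall    = |predicted & truth| / |truth|
    ```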

  11. Data from: RNxQuest: An Extension to the xQuest Pipeline Enabling Analysis of Protein–RNA Cross-Linking/Mass Spectrometry Data

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Sep 6, 2023
    Cite
    Chris P. Sarnowski; Michael Götze; Alexander Leitner (2023). RNxQuest: An Extension to the xQuest Pipeline Enabling Analysis of Protein–RNA Cross-Linking/Mass Spectrometry Data [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00341.s002
    Available download formats: xlsx
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chris P. Sarnowski; Michael Götze; Alexander Leitner
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-linking and mass spectrometry (XL-MS) workflows are increasingly popular techniques for generating low-resolution structural information about interacting biomolecules. xQuest is an established software package for analysis of protein–protein XL-MS data, supporting stable isotope-labeled cross-linking reagents. Resultant paired peaks in mass spectra aid sensitivity and specificity of data analysis. The recently developed cross-linking of isotope-labeled RNA and mass spectrometry (CLIR-MS) approach extends the XL-MS concept to protein–RNA interactions, also employing isotope-labeled cross-link (XL) species to facilitate data analysis. Data from CLIR-MS experiments are broadly compatible with core xQuest functionality, but the required analysis approach for this novel data type presents several technical challenges not optimally served by the original xQuest package. Here we introduce RNxQuest, a Python package extension for xQuest, which automates the analysis approach required for CLIR-MS data, providing bespoke, state-of-the-art processing and visualization functionality for this novel data type. Using functions included with RNxQuest, we evaluate three false discovery rate control approaches for CLIR-MS data. We demonstrate the versatility of the RNxQuest-enabled data analysis pipeline by also reanalyzing published protein–RNA XL-MS data sets that lack isotope-labeled RNA. This study demonstrates that RNxQuest provides a sensitive and specific data analysis pipeline for detection of isotope-labeled XLs in protein–RNA XL-MS experiments.

