6 datasets found
  1. Generated output for our example data sets.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 30, 2023
    Cite
    Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran (2023). Generated output for our example data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0124449.t004
    Available download formats: xls
    Dataset updated: May 30, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generated output for our example data sets.

  2. Record Linkage Datasets

    • figshare.com
    txt
    Updated Apr 2, 2022
    Cite
    Ahmed Soliman (2022). Record Linkage Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19500671.v1
    Available download formats: txt
    Dataset updated: Apr 2, 2022
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Ahmed Soliman
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This simulated dataset is a corrupted segment from the Social Security Death Master File (SSDMF) available at https://ssdmf.info/. There are 11 original datasets: dsxo where x runs from 1...11 and the suffix o stands for original. The sizes (number of original records) of these datasets are as follows:

    | dataset | size |
    |:-------:|:----:|
    | ds1o | 10K |
    | ds2o | 20K |
    | ds3o | 40K |
    | ds4o | 80K |
    | ds5o | 120K |
    | ds6o | 160K |
    | ds7o | 200K |
    | ds8o | 400K |
    | ds9o | 600K |
    | ds10o | 800K |
    | ds11o | 1M |

    These original records are then corrupted via a modified version of the dsgen Python script by Peter Christen. The modified/corrupted files are saved as dsxm, where the suffix m stands for modified. The modified records plus four original replicates are concatenated and mixed up (by the Linux command-line tool shuf). The resultant datasets are named dsx.0 (dsx.1) before (after) shuffling. So, the sizes of these datasets are as follows:

    | dataset | size |
    |:-------:|:----:|
    | ds1.1 | 50k |
    | ds2.1 | 100k |
    | ds3.1 | 200k |
    | ds4.1 | 400k |
    | ds5.1 | 600k |
    | ds6.1 | 800k |
    | ds7.1 | 1M |
    | ds8.1 | 2M |
    | ds9.1 | 3M |
    | ds10.1 | 4M |
    | ds11.1 | 5M |

    Furthermore, each dataset is split into two halves to serve as input for record linkage algorithms. For example, ds1.1 is split into ds1.1.1 & ds1.1.2.
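The construction described above (corrupt, concatenate with four original replicates, shuffle, split in half) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_linkage_inputs` and `toy_corrupt` are hypothetical names, `toy_corrupt` is a trivial stand-in for the modified dsgen script (whose actual behavior is not described here), and Python's `random` module replaces the `shuf` tool.

```python
import random


def toy_corrupt(record):
    """Hypothetical stand-in for the dsgen corruption step: drop the last character."""
    return record[:-1] if len(record) > 1 else record


def build_linkage_inputs(original, corrupt, n_replicates=4, seed=0):
    """Return the two halves of a shuffled dataset built from `original`.

    Steps mirror the description: modified records (dsxm) plus
    `n_replicates` copies of the originals are concatenated (dsx.0),
    shuffled (dsx.1), and split into two halves (dsx.1.1, dsx.1.2).
    """
    modified = [corrupt(rec) for rec in original]
    combined = modified + list(original) * n_replicates
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    rng.shuffle(combined)
    half = len(combined) // 2
    return combined[:half], combined[half:]


originals = [f"rec{i}" for i in range(10)]
first_half, second_half = build_linkage_inputs(originals, toy_corrupt)
print(len(first_half), len(second_half))
```

With 10 original records the combined set holds 50 records, which matches the 1:5 ratio between the dsxo and dsx.1 sizes listed in the tables.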

  3. Data from: A Probabilistic Record Linkage Model for Survival Data

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 30, 2023
    Cite
    Michel H. Hof; Anita C. Ravelli; Aeilko H. Zwinderman (2023). A Probabilistic Record Linkage Model for Survival Data [Dataset]. http://doi.org/10.6084/m9.figshare.4876799.v4
    Available download formats: zip
    Dataset updated: May 30, 2023
    Dataset provided by: Taylor & Francis (https://taylorandfrancis.com/)
    Authors: Michel H. Hof; Anita C. Ravelli; Aeilko H. Zwinderman
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low causing imperfect linkage; some true nonmatches are identified as match and, on the other hand, some true matches as nonmatch. To avoid bias in further analyses, it is necessary to correct for imperfect linkage. In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics from the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material.
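To make the idea of "combined strength of partially identifying variables" concrete, here is a minimal sketch of the classical Fellegi-Sunter match weight. It is not the joint model developed in the article. The m-probabilities (agreement among true matches) and u-probabilities (agreement among true nonmatches) below are invented for illustration, and `FIELD_PARAMS` and `match_weight` are hypothetical names.

```python
import math

# Hypothetical parameters, chosen only for illustration:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are a true nonmatch)
FIELD_PARAMS = {
    "gender": {"m": 0.98, "u": 0.50},
    "initials": {"m": 0.95, "u": 0.02},
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, p in params.items():
        if record_a[field] == record_b[field]:
            total += math.log2(p["m"] / p["u"])  # agreement weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total


pair_agree = match_weight(
    {"gender": "F", "initials": "AB"}, {"gender": "F", "initials": "AB"}
)
pair_disagree = match_weight(
    {"gender": "F", "initials": "AB"}, {"gender": "M", "initials": "CD"}
)
print(pair_agree > 0, pair_disagree < 0)
```

A field such as gender agrees by chance in about half of all nonmatching pairs, so it contributes little weight on its own. This is why weak identifiers lead to the imperfect linkage the description mentions.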

  4. Privacy‑Preserving Record Linkage Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Privacy‑Preserving Record Linkage Market Research Report 2033 [Dataset]. https://dataintelo.com/report/privacypreserving-record-linkage-market
    Available download formats: pdf, csv, pptx
    Dataset updated: Sep 30, 2025
    Dataset authored and provided by: Dataintelo
    License: https://dataintelo.com/privacy-and-policy
    Time period covered: 2024 - 2032
    Area covered: Global
    Description

    Privacy‑Preserving Record Linkage Market Outlook

    According to the latest research, the global Privacy‑Preserving Record Linkage (PPRL) market size reached USD 1.02 billion in 2024. The market is poised for robust expansion, projected to grow at a CAGR of 14.7% during the forecast period from 2025 to 2033. By 2033, the market is expected to achieve a value of USD 3.09 billion. This growth is primarily driven by escalating data privacy regulations, the exponential rise in data sharing across sectors, and the increasing need for secure, compliant data integration solutions worldwide.




    A key growth driver for the Privacy‑Preserving Record Linkage market is the intensifying regulatory landscape around data privacy and protection. With the enforcement of stringent frameworks such as the General Data Protection Regulation (GDPR) in Europe, the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and similar legislations globally, organizations are under immense pressure to ensure that sensitive personal data remains confidential, even during data linkage or integration processes. This has led to a surge in the adoption of PPRL solutions, which enable entities to combine and analyze datasets from disparate sources without exposing or compromising individual privacy. The ability of these technologies to facilitate compliance while unlocking the value of linked data is a significant factor fueling market growth.




    Another substantial growth factor is the proliferation of data-driven initiatives across healthcare, government, financial services, and research sectors. As organizations increasingly depend on collaborative analytics and cross-institutional studies, there is a mounting demand for technologies that can securely link records without revealing personal identifiers. In healthcare, for example, PPRL enables the aggregation of patient data from multiple providers to support clinical research and population health management, all while maintaining strict confidentiality. Similarly, government agencies leverage PPRL to streamline public service delivery and detect fraud, while financial institutions use it for secure customer verification and risk assessment. The versatility and scalability of PPRL solutions across these diverse applications are propelling the market forward.




    Technological advancements are also playing a pivotal role in the expansion of the Privacy‑Preserving Record Linkage market. Innovations in cryptographic techniques, secure multiparty computation, and differential privacy are enhancing the efficacy, speed, and scalability of PPRL solutions. These advancements are enabling organizations to handle larger datasets, more complex linkage scenarios, and higher security requirements. The integration of artificial intelligence and machine learning further augments the accuracy of record linkage, reducing false positives and negatives. As these technologies mature and become more accessible, their adoption is expected to accelerate, providing a fertile ground for market growth over the coming years.




    From a regional perspective, North America currently leads the Privacy‑Preserving Record Linkage market, accounting for the largest share in 2024, followed closely by Europe and the Asia Pacific. The dominance of North America is attributed to its advanced digital infrastructure, early adoption of privacy-enhancing technologies, and the presence of key market players. Europe’s strong regulatory environment and focus on data sovereignty are driving significant investments in PPRL, while rapid digitalization and expanding healthcare and financial sectors in Asia Pacific are creating new growth opportunities. Emerging markets in Latin America and the Middle East & Africa are also demonstrating increasing interest, particularly as data privacy awareness and regulatory frameworks evolve.



    Component Analysis

    The Privacy‑Preserving Record Linkage market is segmented by component into software and services, each playing a critical role in the ecosystem. Software solutions form the backbone of the market, providing the core algorithms and platforms that enable secure record linkage. These include standalone applications, integration modules, and cloud-based platforms that support a variety of use cases across industries. The software segment is characterized by continuous innovation, with vendors focusing on enhancing usability, interoperability, and security features to meet the evolving

  5. Synthetic Oregon

    • kaggle.com
    zip
    Updated Jul 12, 2023
    Cite
    PJ Gibson (2023). Synthetic Oregon [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-oregon/suggestions
    Available download formats: zip (7872753697 bytes)
    Dataset updated: Jul 12, 2023
    Authors: PJ Gibson
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically
    Area covered: Oregon
    Description

    synthetic-gold

    Introduction

    Record linkage is a complex process and is used to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below:

    ```python
    # Pseudo-code!
    if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
        out.match = True
    else:
        out.match = False
    ```

    Others may choose to create more complex decision trees or even machine learning approaches to record linkage.

    When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While these ML-utilized fields can vary from organization to organization, several fields appear more frequently than others:

    * First Name
    * Middle Name
    * Last Name
    * Date of Birth
    * Sex at Birth
    * Race
    * SSN (maybe)
    * Address house
    * Address zip
    * Address city
    * Address county
    * Address state
    * Phone
    * Email
    * Date Data Submitted

    By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether 2 records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt a new last name, erroneously fill out data, or have information that closely resembles another individual (e.g., twins).

    The Problem

    As described above, record linkage can have many complex elements to it.
    Consider a situation where you are manually reviewing 2 records. These two records only contain basic information on the individuals and you are tasked to decide if Record #1 and Record #2 belong to the same person.

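    | Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
    |:--------:|:----------:|:---------:|:---:|:---:|:-------:|:-----------:|:-------------:|:-------------:|
    | 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
    | 2 | Wanda | Turner | | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |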

    At a glance, these records are significantly different and you should therefore mark them as different persons. For the purposes of record linkage manual review, you probably made the correct decision. After all, for record linkage, most models prefer False Negatives to False Positives.
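The manual comparison of the two records above can be sketched as a field-by-field agreement vector, which is also the kind of input an ML linkage model would typically consume. This is a minimal sketch: the dictionary keys and the `compare_records` helper are hypothetical, and the missing Sex value on record 2 is treated as "unknown" rather than as a disagreement.

```python
record_1 = {
    "first_name": "Wanda", "last_name": "Smith", "sex": "F",
    "dob": "1992-09-13", "address": "1768 Walker Rd. Unit 209",
    "zip": "99301", "state": "WA",
}
record_2 = {
    "first_name": "Wanda", "last_name": "Turner", "sex": None,
    "dob": "1992-09-13", "address": "4545 Pennsylvania Ct.",
    "zip": "98682", "state": "WA",
}


def compare_records(r1, r2):
    """Return True/False per field, or None when either value is missing."""
    result = {}
    for field in r1:
        a, b = r1[field], r2.get(field)
        if a is None or b is None:
            result[field] = None  # cannot judge agreement on a missing value
        else:
            result[field] = a == b
    return result


agreement = compare_records(record_1, record_2)
for field, agrees in agreement.items():
    print(f"{field}: {agrees}")
```

Only first name, date of birth, and state agree, which is why a reviewer looking at these fields alone would mark the records as different persons.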

    When groups validate record linkage models, they often turn to manually reviewed record comparisons as their "gold standard". There are two separate standards of judgment for record linkage that I would like us to consider:

    1. Creating a model that simulates a human's decision-making process
    2. Creating a model that seeks a deeper record equality "Truth"

    I believe many groups aim for, and are content with, accomplishing goal #1. That approach is undeniably useful. However, I believe it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriage or civil union (more often people whose Sex at Birth is "Female"). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the 2 records displayed earlier. There is a real chance that Wanda adopted a new last name and moved in the 6 years between the dates the two records were collected.

    Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether or not "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.

    The Solution

    We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. It will have a relational-database structure, with tables containing all relevant supporting records for record linkage, such as:

    * birth records
    * partnership (marriage) records
    * moving records

    We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that allow a user to capture all data from a specific year in time.
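A "Time Travel" query of this kind could look roughly like the sketch below, which uses Python's built-in `sqlite3` module. The `residence` table, its columns, and the two rows are invented for illustration and are not the dataset's actual schema. The sketch assumes each row carries a validity interval in years, with `valid_to` left NULL for the current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE residence ("
    "person_id INTEGER, last_name TEXT, address TEXT, "
    "valid_from INTEGER, valid_to INTEGER)"
)
# Invented example rows: one person who changes name and address in 2018.
conn.executemany(
    "INSERT INTO residence VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Doe", "12 Elm St", 2010, 2018),
        (1, "Roe", "34 Oak Ave", 2018, None),
    ],
)


def snapshot(connection, year):
    """Return the rows that were valid during the given year."""
    return connection.execute(
        "SELECT person_id, last_name, address FROM residence "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY person_id",
        (year, year),
    ).fetchall()


print(snapshot(conn, 2015))
print(snapshot(conn, 2021))
```

Because both rows share the same `person_id`, the database retains the fact that the 2015 and 2021 snapshots describe one individual even though the name and address differ.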

    By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models considering goal #2.
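With ground truth available, a linkage model's output can be scored directly. The following is a minimal sketch, assuming that both the model's predictions and the truth are given as pairs of record IDs. The `evaluate_linkage` name and the example IDs are hypothetical.

```python
def evaluate_linkage(predicted_pairs, true_pairs):
    """Score predicted matches against ground-truth matches.

    Pairs are unordered, so (1, 2) and (2, 1) count as the same link.
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    tp = len(predicted & truth)  # links the model found correctly
    fp = len(predicted - truth)  # false positives: links that are not real
    fn = len(truth - predicted)  # false negatives: real links the model missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


scores = evaluate_linkage(
    predicted_pairs=[(1, 2), (3, 4)],
    true_pairs=[(2, 1), (5, 6)],
)
print(scores)
```

A model tuned to imitate cautious manual review would be expected to show few false positives but more false negatives, for example on records separated by a name change. Scoring against true links makes that trade-off visible.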

    Following Work

    After we wrap up this work, we will work on processes/functions for simulating human error in record filling. We will also write functions to help support bias recognition when using our dataset as a test set.

    Contact

    PJ Gibson - peter.gibson@doh.wa.gov

  6. Description of missing data on variables used for the linkage from the...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 3, 2023
    Cite
    Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar (2023). Description of missing data on variables used for the linkage from the laboratory, case notifications and an example pre-entry screening dataset, by NHS number availability and validity. [Dataset]. http://doi.org/10.1371/journal.pone.0136179.t003
    Available download formats: xls
    Dataset updated: Jun 3, 2023
    Dataset provided by: PLOS ONE
    Authors: Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of missing data on variables used for the linkage from the laboratory, case notifications and an example pre-entry screening dataset, by NHS number availability and validity.

    Table footnotes: E.g. house number and street name. *E.g. city.
