6 datasets found

Generated output for our example data sets.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 30, 2023
Cite
Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran (2023). Generated output for our example data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0124449.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0124449.t004
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Generated output for our example data sets.
Record Linkage Datasets
figshare.com
txt
Updated Apr 2, 2022
Cite
Ahmed Soliman (2022). Record Linkage Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19500671.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.19500671.v1
Dataset updated
Apr 2, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ahmed Soliman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This simulated dataset is a corrupted segment from the Social Security Death Master File (SSDMF) available at https://ssdmf.info/. There are 11 original datasets: dsxo, where x runs from 1 to 11 and the suffix o stands for original. The sizes (number of original records) of these datasets are as follows:

| dataset | size |
|:-------:|:----:|
| ds1o | 10K |
| ds2o | 20K |
| ds3o | 40K |
| ds4o | 80K |
| ds5o | 120K |
| ds6o | 160K |
| ds7o | 200K |
| ds8o | 400K |
| ds9o | 600K |
| ds10o | 800K |
| ds11o | 1M |

These original records are then corrupted via a modified version of the dsgen Python script by Peter Christen. The modified/corrupted files are saved as dsxm, where the suffix m stands for modified. The modified records plus four original replicates are concatenated and mixed up (by the Linux command tool shuf). The resultant datasets are named dsx.0 before shuffling and dsx.1 after shuffling. So, the sizes of these datasets are as follows:

| dataset | size |
|:-------:|:----:|
| ds1.1 | 50K |
| ds2.1 | 100K |
| ds3.1 | 200K |
| ds4.1 | 400K |
| ds5.1 | 600K |
| ds6.1 | 800K |
| ds7.1 | 1M |
| ds8.1 | 2M |
| ds9.1 | 3M |
| ds10.1 | 4M |
| ds11.1 | 5M |

Furthermore, each dataset is split into two halves to serve as input for record linkage algorithms. For example, ds1.1 is split into ds1.1.1 and ds1.1.2.
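The construction described above (modified records plus four original replicates, shuffled, then split in half) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_linkage_inputs`, the in-memory lists, and the use of `random.shuffle` in place of `shuf` are hypothetical and are not taken from the dataset author's scripts.

```python
import random

def build_linkage_inputs(original, modified, replicates=4, seed=0):
    """Return two halves of a shuffled pool of modified + replicated original records."""
    # dsx.0: modified records followed by `replicates` copies of the originals.
    combined = list(modified) + list(original) * replicates
    # dsx.1: the same records in random order (stand-in for the `shuf` tool).
    shuffled = combined[:]
    random.Random(seed).shuffle(shuffled)
    # dsx.1.1 and dsx.1.2: the two halves used as linkage inputs.
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy example: 10 original records yield 50 records in total, 25 per half.
original = [f"rec{i}" for i in range(10)]
modified = [f"rec{i}_m" for i in range(10)]
half_a, half_b = build_linkage_inputs(original, modified)
print(len(half_a), len(half_b))
```

With 10K original records the same arithmetic gives 10K + 4 × 10K = 50K, which agrees with the size listed for ds1.1.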
Data from: A Probabilistic Record Linkage Model for Survival Data
tandf.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated May 30, 2023
Cite
Michel H. Hof; Anita C. Ravelli; Aeilko H. Zwinderman (2023). A Probabilistic Record Linkage Model for Survival Data [Dataset]. http://doi.org/10.6084/m9.figshare.4876799.v4
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.4876799.v4
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Michel H. Hof; Anita C. Ravelli; Aeilko H. Zwinderman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low causing imperfect linkage; some true nonmatches are identified as match and, on the other hand, some true matches as nonmatch. To avoid bias in further analyses, it is necessary to correct for imperfect linkage. In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics from the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material.
Privacy‑Preserving Record Linkage Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Cite
Dataintelo (2025). Privacy‑Preserving Record Linkage Market Research Report 2033 [Dataset]. https://dataintelo.com/report/privacypreserving-record-linkage-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Privacy‑Preserving Record Linkage Market Outlook

According to the latest research, the global Privacy‑Preserving Record Linkage (PPRL) market size reached USD 1.02 billion in 2024. The market is poised for robust expansion, projected to grow at a CAGR of 14.7% during the forecast period from 2025 to 2033. By 2033, the market is expected to achieve a value of USD 3.09 billion. This growth is primarily driven by escalating data privacy regulations, the exponential rise in data sharing across sectors, and the increasing need for secure, compliant data integration solutions worldwide.
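The quoted figures can be sanity-checked with the standard compound-growth formula, value = base × (1 + CAGR)^periods. The sketch below is illustrative only: the function names are hypothetical, and the number of compounding periods is an assumption, since the report does not say whether growth is counted from 2024 (nine periods) or from 2025 (eight periods).

```python
def project_value(base, cagr, periods):
    """Compound `base` forward at rate `cagr` for `periods` years."""
    return base * (1 + cagr) ** periods

def implied_cagr(start, end, periods):
    """Constant annual growth rate that turns `start` into `end` over `periods` years."""
    return (end / start) ** (1 / periods) - 1

BASE_2024 = 1.02    # USD billion, as quoted in the report
TARGET_2033 = 3.09  # USD billion, as quoted in the report
STATED_CAGR = 0.147

for periods in (8, 9):  # assumption: either 2025-2033 or 2024-2033
    projected = project_value(BASE_2024, STATED_CAGR, periods)
    rate = implied_cagr(BASE_2024, TARGET_2033, periods)
    print(f"{periods} periods: projected {projected:.2f}B, implied CAGR {rate:.1%}")
```

Under these assumptions, eight compounding periods come closer to the stated 2033 value than nine do.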

A key growth driver for the Privacy‑Preserving Record Linkage market is the intensifying regulatory landscape around data privacy and protection. With the enforcement of stringent frameworks such as the General Data Protection Regulation (GDPR) in Europe, the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and similar legislations globally, organizations are under immense pressure to ensure that sensitive personal data remains confidential, even during data linkage or integration processes. This has led to a surge in the adoption of PPRL solutions, which enable entities to combine and analyze datasets from disparate sources without exposing or compromising individual privacy. The ability of these technologies to facilitate compliance while unlocking the value of linked data is a significant factor fueling market growth.

Another substantial growth factor is the proliferation of data-driven initiatives across healthcare, government, financial services, and research sectors. As organizations increasingly depend on collaborative analytics and cross-institutional studies, there is a mounting demand for technologies that can securely link records without revealing personal identifiers. In healthcare, for example, PPRL enables the aggregation of patient data from multiple providers to support clinical research and population health management, all while maintaining strict confidentiality. Similarly, government agencies leverage PPRL to streamline public service delivery and detect fraud, while financial institutions use it for secure customer verification and risk assessment. The versatility and scalability of PPRL solutions across these diverse applications are propelling the market forward.

Technological advancements are also playing a pivotal role in the expansion of the Privacy‑Preserving Record Linkage market. Innovations in cryptographic techniques, secure multiparty computation, and differential privacy are enhancing the efficacy, speed, and scalability of PPRL solutions. These advancements are enabling organizations to handle larger datasets, more complex linkage scenarios, and higher security requirements. The integration of artificial intelligence and machine learning further augments the accuracy of record linkage, reducing false positives and negatives. As these technologies mature and become more accessible, their adoption is expected to accelerate, providing a fertile ground for market growth over the coming years.

From a regional perspective, North America currently leads the Privacy‑Preserving Record Linkage market, accounting for the largest share in 2024, followed closely by Europe and the Asia Pacific. The dominance of North America is attributed to its advanced digital infrastructure, early adoption of privacy-enhancing technologies, and the presence of key market players. Europe’s strong regulatory environment and focus on data sovereignty are driving significant investments in PPRL, while rapid digitalization and expanding healthcare and financial sectors in Asia Pacific are creating new growth opportunities. Emerging markets in Latin America and the Middle East & Africa are also demonstrating increasing interest, particularly as data privacy awareness and regulatory frameworks evolve.

Component Analysis

The Privacy‑Preserving Record Linkage market is segmented by component into software and services, each playing a critical role in the ecosystem. Software solutions form the backbone of the market, providing the core algorithms and platforms that enable secure record linkage. These include standalone applications, integration modules, and cloud-based platforms that support a variety of use cases across industries. The software segment is characterized by continuous innovation, with vendors focusing on enhancing usability, interoperability, and security features to meet the evolving
Synthetic Oregon
kaggle.com
zip
Updated Jul 12, 2023
Cite
PJ Gibson (2023). Synthetic Oregon [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-oregon/suggestions
Explore at:
Available download formats: zip (7872753697 bytes)
Dataset updated
Jul 12, 2023
Authors
PJ Gibson
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Oregon
Description
synthetic-gold

Introduction

Record linkage is a complex process and is utilized to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below:

```python
# Pseudo-code! r1, r2, and out stand in for record objects.
if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
    out.match = True
else:
    out.match = False
```

Others may choose to create more complex decision trees or even machine learning approaches to record linkage.

When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While these ML-utilized fields can vary from organization to organization, there are several fields that appear more frequently than others. They are as follows:

* First Name
* Middle Name
* Last Name
* Date of Birth
* Sex at Birth
* Race
* SSN (maybe)
* Address house
* Address zip
* Address city
* Address county
* Address state
* Phone
* Email
* Date Data Submitted

By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether 2 records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt a new last name, erroneously fill out data, or have information that closely resembles another individual's (e.g., twins).
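A field-by-field comparison of this kind can be sketched as a "comparison vector" that a downstream model would consume. The example below is a minimal illustration, not the author's method: the field names, the helper `compare_records`, the two sample records, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for demonstration.

```python
from difflib import SequenceMatcher

# Hypothetical subset of the commonly used fields listed above.
FIELDS = ["first_name", "last_name", "dob", "address_zip"]

def compare_records(r1, r2, fields=FIELDS):
    """Return a per-field similarity in [0, 1], or None when a value is missing."""
    vector = {}
    for field in fields:
        a, b = r1.get(field), r2.get(field)
        if not a or not b:
            vector[field] = None  # a missing value is not evidence for or against a match
        else:
            vector[field] = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return vector

# Invented records for illustration only.
rec_a = {"first_name": "Jon", "last_name": "Smith", "dob": "1980-01-02", "address_zip": "98001"}
rec_b = {"first_name": "John", "last_name": "Smith", "dob": "1980-01-02", "address_zip": ""}

vector = compare_records(rec_a, rec_b)
print(vector)
```

Keeping missing values as `None` rather than 0 lets a model treat "unknown" differently from "disagrees", which matters for fields that are often left blank.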

The Problem

As described above, record linkage can have many complex elements to it.
Consider a situation where you are manually reviewing 2 records. These two records only contain basic information on the individuals and you are tasked to decide if Record #1 and Record #2 belong to the same person.

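| Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
|:--------:|:----------:|:---------:|:---:|:---:|:-------:|:-----------:|:-------------:|:-------------:|
| 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
| 2 | Wanda | Turner | | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |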

At a glance, these records are significantly different and you should therefore mark them as different persons. For the purposes of record linkage manual review, you probably made the correct decision. After all, for record linkage, most models prefer False Negatives to False Positives.

When groups validate record linkage models, they often turn to manually reviewed record comparisons as their "gold standard". There are two separate standards of judgement for record linkage that I would like us to consider:

1. Creating a model that simulates a human's decision-making process
2. Creating a model that seeks a deeper record equality "Truth"

I believe many groups aim for and are content with accomplishing goal #1. That approach is inarguably useful. However, I believe that it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriage or civil union (more often those with "Female" Sex at Birth). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the 2 records displayed earlier. There is a real chance that Wanda adopted a new last name and moved in the 6 years between the dates the two records were received.

Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether or not "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.

The Solution

We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. This will have a relational-database type structure with tables containing all relevant supporting structures for record linkage, such as:

* birth records
* partnership (marriage) records
* moving records

We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that allow a user to capture all data from a specific year in time.
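One way to picture a "Time Travel" query is a history table per changing attribute, with validity years on each row. The sketch below uses Python's built-in `sqlite3` module; the schema, table names, column names, and the inserted rows are all hypothetical and describe an imagined scenario, not the actual structure or contents of the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, first_name TEXT, dob TEXT);
CREATE TABLE last_name_history (
    person_id INTEGER, last_name TEXT, valid_from INTEGER, valid_to INTEGER);
CREATE TABLE residence_history (
    person_id INTEGER, zip TEXT, valid_from INTEGER, valid_to INTEGER);

-- Invented scenario: one person who changes last name in 2018 and moves in 2019.
-- valid_to is exclusive; NULL means the row is still current.
INSERT INTO person VALUES (1, 'Wanda', '1992-09-13');
INSERT INTO last_name_history VALUES (1, 'Smith', 1992, 2018), (1, 'Turner', 2018, NULL);
INSERT INTO residence_history VALUES (1, '99301', 2010, 2019), (1, '98682', 2019, NULL);
""")

def snapshot(conn, year):
    """Return each person's name and ZIP code as they stood in the given year."""
    return conn.execute("""
        SELECT p.person_id, p.first_name, n.last_name, r.zip
        FROM person AS p
        JOIN last_name_history AS n
          ON n.person_id = p.person_id
         AND n.valid_from <= :year AND (n.valid_to IS NULL OR n.valid_to > :year)
        JOIN residence_history AS r
          ON r.person_id = p.person_id
         AND r.valid_from <= :year AND (r.valid_to IS NULL OR r.valid_to > :year)
        ORDER BY p.person_id
    """, {"year": year}).fetchall()

print(snapshot(conn, 2015))
print(snapshot(conn, 2021))
```

Because both snapshots share the same `person_id`, the simulated ground truth links the two views of the person even though the last name and ZIP code differ.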

By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models considering goal #2.
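With known ground truth, a linkage model's output can be scored directly. The following is a minimal sketch assuming that predictions and truth are both available as pairs of record IDs; the function name `evaluate` and the sample pairs are hypothetical.

```python
def evaluate(predicted_pairs, true_pairs):
    """Score predicted matches against ground-truth matches (pair order is ignored)."""
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_pos = len(predicted & truth)   # predicted matches that are real
    false_pos = len(predicted - truth)  # predicted matches that are not real
    false_neg = len(truth - predicted)  # real matches the model missed
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {"tp": true_pos, "fp": false_pos, "fn": false_neg,
            "precision": precision, "recall": recall}

# Invented record-ID pairs for illustration only.
scores = evaluate(predicted_pairs=[(1, 2), (3, 4), (5, 6)],
                  true_pairs=[(2, 1), (3, 4), (7, 8)])
print(scores)
```

Reporting false positives and false negatives separately keeps the trade-off visible, which is relevant given the stated preference of most models for false negatives over false positives.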

Following Work

After we wrap up this work, we will work on processes/functions for simulating human error in record filling. We will also work on functions to help support bias recognition when using our dataset as a test set.

Contact

PJ Gibson - peter.gibson@doh.wa.gov
Description of missing data on variables used for the linkage from the...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 3, 2023
Cite
Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar (2023). Description of missing data on variables used for the linkage from the laboratory, case notifications and an example pre-entry screening dataset, by NHS number availability and validity. [Dataset]. http://doi.org/10.1371/journal.pone.0136179.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0136179.t003
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
E.g. house number and street name. *E.g. city. Description of missing data on variables used for the linkage from the laboratory, case notifications and an example pre-entry screening dataset, by NHS number availability and validity.