Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:

The ABE fully automated approach: a fully automated method for linking historical datasets (e.g., complete-count censuses) by first name, last name, and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact (a minimal sketch of this logic appears below).

A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
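As a rough illustration of the ABE-style logic described above (a sketch, not the repository's own code; the record fields are hypothetical), the following Python snippet links two record lists on NYSIIS-standardized names within an age band and keeps only unambiguous matches. The jellyfish library supplies NYSIIS and Jaro-Winkler.

import jellyfish

def key(rec):
    # NYSIIS phonetic standardization absorbs misspellings and mistranscriptions
    return (jellyfish.nysiis(rec["first"]), jellyfish.nysiis(rec["last"]))

def link(records_a, records_b, age_band=2):
    # Keep a link only when exactly one candidate agrees on the standardized
    # name within the age band (a simplified form of the uniqueness requirement).
    links = []
    for i, a in enumerate(records_a):
        cands = [j for j, b in enumerate(records_b)
                 if key(a) == key(b) and abs(a["age"] - b["age"]) <= age_band]
        if len(cands) == 1:
            links.append((i, cands[0]))
    return links

a = [{"first": "John", "last": "Smith", "age": 34}]
b = [{"first": "Jon", "last": "Smith", "age": 35}]
print(link(a, b))  # [(0, 0)]: NYSIIS maps "John" and "Jon" to the same code
print(jellyfish.jaro_winkler_similarity("Smith", "Smyth"))  # distance-based robustness check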
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of probabilistic linkage solutions available for record linkage applications.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low, causing imperfect linkage: some true nonmatches are identified as matches and, conversely, some true matches as nonmatches. To avoid bias in further analyses, it is necessary to correct for imperfect linkage. In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics from the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material.
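As a toy illustration of the linkage setting described above (not the article's joint model; the variables are hypothetical), the sketch below compares two records on partially identifying variables. The resulting agreement pattern is what a record linkage procedure must classify as match or nonmatch, and partial agreement is exactly where errors arise.

def agreement_pattern(rec_a, rec_b, fields=("gender", "initials", "birth_year")):
    # 1 = the two records agree on the variable, 0 = they disagree
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)

a = {"gender": "F", "initials": "JD", "birth_year": 1985}
b = {"gender": "F", "initials": "JD", "birth_year": 1986}
print(agreement_pattern(a, b))  # (1, 1, 0): could be a true match with a typo,
                                # or a false link between two different women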
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project points to an article in The Stata Journal describing a set of routines to preprocess nominal data (firm names and addresses), perform probabilistic linking of two datasets, and display candidate matches for clerical review. The ado files and supporting pattern files are downloadable within Stata.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Ethics Reference No: 209113723/2023/1

Source code is available on GitHub, together with the datasets used to reproduce the results: https://github.com/DHollenbach/record-linkage-and-deduplication/blob/main/README.md

Abstract: The research emphasised the vital role of a Master Patient Index (MPI) solution in addressing the challenges public healthcare facilities face in eliminating duplicate patient records and improving record linkage. The study recognised that traditional MPI systems may have limitations in terms of efficiency and accuracy. To address this, the study focused on utilising machine learning techniques to enhance the effectiveness of MPI systems, aiming to support the growing record linkage healthcare ecosystem. Integrating machine learning into MPI systems is crucial for optimising their capabilities: the study aimed to improve the data linking and deduplication processes within MPI systems, a shift towards more sophisticated and intelligent healthcare technologies. Ultimately, the goal was to ensure safe and efficient patient care, benefiting individuals and the broader healthcare industry.

This research investigated the performance of five machine learning classification algorithms (random forests, extreme gradient boosting, logistic regression, stacking ensemble, and deep multilayer perceptron) for data linkage and deduplication on four datasets.

The findings demonstrate the applicability of machine learning models for effective data linkage and deduplication of electronic health records. The random forest algorithm achieved the best performance (identifying duplicates correctly), based on accuracy, F1-score, and AUC, on three datasets (Electronic Practice-Based Research Network (ePBRN): Acc = 99.83%, F1-score = 81.09%, AUC = 99.98%; Freely Extensible Biomedical Record Linkage (FEBRL) 3: Acc = 99.55%, F1-score = 96.29%, AUC = 99.77%; custom-synthetic: Acc = 99.98%, F1-score = 99.18%, AUC = 99.99%). In contrast, on the FEBRL4 dataset the multilayer perceptron artificial neural network (MLP-ANN) and logistic regression algorithms outperformed random forests: the MLP-ANN achieved Acc = 99.93%, F1-score = 96.95%, AUC = 99.97%, and logistic regression achieved Acc = 99.99%, F1-score = 96.91%, AUC = 99.97%.

In conclusion, these results have significant implications for the healthcare industry: they are expected to enhance the utilisation of MPI systems and improve their effectiveness in the record linkage healthcare ecosystem. By improving patient record linking and deduplication, healthcare providers can ensure safer and more efficient care, ultimately benefiting patients and the industry.
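A minimal sketch of the classification setup the study describes (not the authors' code; the features and labels below are synthetic stand-ins): candidate record pairs are represented by similarity features, and a random forest predicts duplicate vs. non-duplicate, scored with F1 and AUC as in the reported results.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((1000, 3))               # per-pair similarity scores (e.g., name, DOB, address)
y = (X.mean(axis=1) > 0.7).astype(int)  # stand-in duplicate labels for illustration only

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:800], y[:800])
probs = clf.predict_proba(X[800:])[:, 1]
print("F1: ", f1_score(y[800:], (probs >= 0.5).astype(int)))
print("AUC:", roc_auc_score(y[800:], probs))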
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of various (and increasingly large) datasets presents the challenging problem of discovering common entities dispersed across disparate datasets. Solutions to the private record linkage problem (PRL) aim to enable such explorations of datasets in a secure manner. A two-party PRL protocol allows two parties to determine for which entities they each possess a record (either an exact matching record or a fuzzy matching record) in their respective datasets — without revealing to one another information about any entities for which they do not both possess records. Although several solutions have been proposed to solve the PRL problem, no current solution offers a fully cryptographic security guarantee while maintaining both high accuracy of output and subquadratic runtime efficiency. To this end, we propose the first known efficient PRL protocol that runs in subquadratic time, provides high accuracy, and guarantees cryptographic security.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aims: Our study aimed to identify the common themes and knowledge gaps in data linkage research on diabetes in Australia, and to evaluate the quality of that research.

Methods: This systematic review was developed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (the PRISMA Statement). Six biomedical databases and the Australian Population Health Research Network (PHRN) website were searched. A narrative synthesis was conducted to comprehensively identify the common themes and knowledge gaps. The guidelines for studies involving data linkage were used to appraise the methodological quality of included studies.

Results: After screening and hand-searching, 118 studies were included in the final analysis. Data linkage publications confirmed negative health outcomes in people with diabetes, reported risk factors for diabetes and its complications, and found an inverse association between primary care use and hospitalization. Linked data were used to validate data sources and diabetes instruments. There were limited publications investigating healthcare expenditure and adverse drug reactions (ADRs) in people with diabetes. Regarding methodological assessment, important information about the linkage performed was under-reported in the included studies.

Conclusions: In the future, more up-to-date data linkage research addressing the costs of diabetes and its complications in a contemporary Australian setting, as well as research assessing ADRs of recently approved antidiabetic medications, is required.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
A set of software tools for privacy preserving record linkage.

anonlink: a library for carrying out the low-level hash comparisons required server side. Available from GitHub at http://github.com/n1analytics/anonlink/

entity-service: server-side component of the private record linkage REST API, utilizing the anonlink library.

clkhash: a client utility and library for turning personally identifiable information into Bloom filter hashes. Available from GitHub at https://github.com/n1analytics/clkhash/

encoding-service: a REST API wrapper around clkhash for encoding PII data into CLKs. Available from GitHub at https://github.com/n1analytics/encoding-service/

The metadata and files (if any) are available to the public.
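To make the Bloom-filter idea behind CLKs concrete, here is a toy Python sketch (explicitly not the clkhash API): character bigrams of an identifier are hashed into a fixed-length bit array, and two encodings are compared with a Dice coefficient, so similarity can be estimated without exchanging raw PII.

import hashlib

def clk(value, size=64, k=2):
    # Hash each character bigram into the filter with k seeded hash functions.
    bits = [0] * size
    for g in (value[i:i + 2] for i in range(len(value) - 1)):
        for seed in range(k):
            h = int(hashlib.sha256(f"{seed}:{g}".encode()).hexdigest(), 16)
            bits[h % size] = 1
    return bits

def dice(x, y):
    # Dice coefficient of the two bit vectors: 2|X AND Y| / (|X| + |Y|)
    both = sum(a & b for a, b in zip(x, y))
    return 2 * both / (sum(x) + sum(y))

print(dice(clk("john smith"), clk("jon smyth")))  # high score, yet no raw PII is exchanged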
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of missing data on variables used for the linkage from the laboratory, case notifications, and an example pre-entry screening dataset, by NHS number availability and validity. (Table footnotes: e.g. house number and street name; *e.g. city.)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were produced for my thesis project. The thesis (in Czech) explores the application of approximate string matching to the record linkage of scientific publications. An introduction to record matching is provided, along with five commonly used string distance metrics (Levenshtein, Jaro, Jaro-Winkler, and Cosine distances, and the Jaccard coefficient). These metrics are applied to publication metadata from the V3S current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds under the F1, F2 and F3 measures are determined for each metric (a small sketch of the metrics follows the citation below).
Thesis citation:
DOBIÁŠOVSKÝ, Jan. Approximate equality of character strings and its application to record linkage in metadata of scientific publications [online]. Praha, 2020 [cit. 2020-05-04]. Master's thesis. Charles University, Faculty of Arts, Institute of Information Studies and Librarianship.
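A small Python sketch of the five metrics applied to a pair of titles (illustrative strings only; the thresholds themselves come from the thesis). The jellyfish library provides the edit-distance family; the token-based measures are computed directly.

import math
from collections import Counter
import jellyfish

def jaccard(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

t1 = "Approximate equality of character strings"
t2 = "Approximate equality of character string"
print(jellyfish.levenshtein_distance(t1, t2))     # edit distance
print(jellyfish.jaro_similarity(t1, t2))          # Jaro
print(jellyfish.jaro_winkler_similarity(t1, t2))  # Jaro-Winkler
print(cosine(t1, t2), jaccard(t1, t2))            # token-based measures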
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a comprehensive, privacy-preserving linkage between insurance claims and anonymized medical records, including claim details, patient demographics, provider information, and medical coding. It enables advanced analytics for healthcare utilization, cost analysis, and outcomes research across the insurance and medical sectors.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges: even when two strings appear similar to humans, fuzzy matching often fails because it does not adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration, and we make all data and methods open source. Keywords: record linkage; interest groups; text as data; unstructured data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the amazon-google products dataset for benchmarking entity matching/record linkage methods, found at: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio...

The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing, as well as their corresponding feature vectors, are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 1,363 records describing products from amazon, which are matched against 3,226 product records from google. The gold standards have manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4, and the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
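A brief sketch of what "feature vectors built using data type specific similarity metrics" can look like (hypothetical attribute names, not the benchmark's exact feature set): each candidate pair gets one similarity score per attribute, with the metric chosen by the attribute's data type.

import jellyfish

def numeric_sim(x, y):
    # Scaled absolute difference for numeric attributes
    return 1 - abs(x - y) / max(x, y) if max(x, y) else 1.0

def feature_vector(a, b):
    return [
        jellyfish.jaro_winkler_similarity(a["title"], b["title"]),  # string attribute
        numeric_sim(a["price"], b["price"]),                        # numeric attribute
    ]

a = {"title": "canon eos 50d", "price": 999.0}
b = {"title": "canon eos 50d dslr camera", "price": 949.0}
print(feature_vector(a, b))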
Processed LiDAR data and environmental covariates from 2015 and 2019 LiDAR scans in the vicinity of Snodgrass Mountain (western Colorado, USA), covering the geographic subset used in the primary analysis for the associated research paper. This package contains LiDAR-derived canopy height maps for 2015 and 2019, crown polygons derived from the height maps using a segmentation algorithm, and environmental covariates supporting the model of forest growth. Source datasets include August 2015 and August 2019 discrete-return LiDAR point clouds collected by Quantum Geospatial for terrain mapping purposes on behalf of the Colorado Hazard Mapping Program and the Colorado Water Conservation Board. Both datasets adhere to the USGS QL2 quality standard. The point cloud data were processed using the R package lidR to generate a canopy height model representing maximum vegetation height above the ground surface, using a pit-free algorithm. This dataset was compiled to assess how spatial patterns of tree growth in montane and subalpine forests are influenced by water and energy availability. Understanding these growth patterns can provide insight into forest dynamics in the Southern Rocky Mountains under changing climatic conditions. This dataset contains .tif, .csv, and .txt files. It additionally includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata, and a data dictionary (dd.csv) file that contains the column/row headers used throughout the files along with a definition, units, and data type.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PATRON is a human ethics approved program of research incorporating an enduring de-identified repository of primary care data, facilitating research and knowledge generation. PATRON is part of the 'Data for Decisions' initiative of the Department of General Practice, University of Melbourne. 'Data for Decisions' is a research initiative in partnership with general practices that makes possible primary care research projects to increase knowledge and improve healthcare practices and policy.

Principal Researcher: Jon Emery
Data Custodian: Lena Sanci
Data Steward: Douglas Boyle
Manager: Rachel Canaway

More information about Data for Decisions and utilising PATRON data is available from the Data for Decisions website.
https://uow.libguides.com/uow-ro-copyright-all-rights-reserved
Curated independent genetic linkage results for type 2 diabetes and related quantitative traits. Includes results of genetic studies of family units with disease, observing inheritance patterns and causative genes. Text files of genetic data generated from literature searches for personal research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Capture of diagnosis of dementia and overlap between data sources.
Abstract: Since most social science research relies upon multiple data sources, merging datasets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over the deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
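For intuition about the canonical probabilistic model referred to above (the Fellegi-Sunter mixture; this is a minimal sketch, not the authors' package), here is an EM loop over binary per-field agreement vectors that estimates match and nonmatch agreement rates and returns a posterior match probability for each pair.

import numpy as np

def em(gamma, n_iter=50):
    # gamma: (n_pairs, n_fields) binary agreement matrix
    lam = 0.1                         # share of true matches
    m = np.full(gamma.shape[1], 0.9)  # P(field agrees | match)
    u = np.full(gamma.shape[1], 0.1)  # P(field agrees | nonmatch)
    for _ in range(n_iter):
        pm = lam * np.prod(np.where(gamma == 1, m, 1 - m), axis=1)
        pu = (1 - lam) * np.prod(np.where(gamma == 1, u, 1 - u), axis=1)
        w = pm / (pm + pu)            # E-step: posterior match probability
        lam = w.mean()                # M-step: update parameters
        m = (w[:, None] * gamma).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()
    return w

gamma = np.array([[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]])
print(em(gamma))  # fully agreeing pairs get posteriors near 1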
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the wdc phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4

The augmented version adds fixed splits for training, validation and testing, as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics. The dataset contains 447 records describing products from 17 e-shops, which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
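To illustrate why fixed splits matter (the data below are synthetic stand-ins, not the benchmark's files), the sketch trains a matcher on a fixed training split and reports precision/recall/F1 on a fixed test split, so numbers from different methods are directly comparable.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(1)
X_train = rng.random((500, 26))               # 26 pair features, one per attribute
y_train = (X_train[:, 0] > 0.5).astype(int)   # stand-in labels tied to one feature
X_test = rng.random((200, 26))
y_test = (X_test[:, 0] > 0.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p, r, f1, _ = precision_recall_fscore_support(y_test, model.predict(X_test), average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")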
Linking survey and administrative data offers the possibility of combining the strengths, and mitigating the weaknesses, of both. Such linkage is therefore an extremely promising basis for future empirical research in social science. For ethical and legal reasons, linking administrative data to survey responses will usually require obtaining explicit consent, and it is well known that not all respondents give consent. Past research on consent has generated many null and inconsistent findings. A weakness of the existing literature is that little effort has been made to understand the cognitive processes by which respondents decide whether or not to consent.

The overall aim of this project was to improve our understanding of how to pursue the twin goals of maximizing consent and ensuring that consent is genuinely informed. The ultimate objective is to strengthen the data infrastructure for social science and policy research in the UK. Specific aims were:

1. To understand how respondents process requests for data linkage: which factors influence their understanding of data linkage, which factors influence their decision to consent, and to open the black box of consent decisions to begin to understand how respondents make the decision.

2. To develop and test methods of maximising consent in web surveys, by understanding why web respondents are less likely to give consent than face-to-face respondents.

3. To develop and test methods of maximising consent with requests for linkage to multiple data sets, by understanding how respondents process multiple requests.

4. As a by-product of testing hypotheses about the previous points, to test the effects of different approaches to wording consent questions on informed consent.

Our findings are based on a series of experiments conducted in four surveys using two different studies: the Understanding Society Innovation Panel (IP) and the PopulusLive online access panel (AP). The Innovation Panel is part of Understanding Society: the UK Household Longitudinal Study. It is a probability sample of households in Great Britain used for methodological testing, with a design that mirrors that of the main Understanding Society survey. The Innovation Panel survey was conducted in wave 11, fielded in 2018. The Innovation Panel data are available from the UK Data Service (SN: 6849, http://doi.org/10.5255/UKDA-SN-6849-12). Since the Innovation Panel sample size (around 2,900 respondents) constrained the number of experimental treatment groups we could implement, we fielded a parallel survey with additional experiments, using a different sample. PopulusLive is a non-probability online panel with around 130,000 active sample members, who are recruited through web advertising, word of mouth, and database partners. We used age, gender and education quotas to match the sample composition of the Innovation Panel.

A total of nine experiments were conducted across the two sample sources. Experiments 1 to 5 all used variations of a single consent question about linkage to tax data (held by HM Revenue and Customs, HMRC). Experiments 6 and 7 also used single consent questions, but respondents were assigned to questions on either tax data or health data (held by the National Health Service, NHS).
Experiments 8 and 9 used five different data linkage requests: tax data (held by HMRC), health data (held by the NHS), education data (held by the Department for Education in England, DfE, and equivalent departments in Scotland and Wales), household energy data (held by the Department for Business, Energy and Industrial Strategy, BEIS), and benefit and pensions data (held by the Department for Work and Pensions, DWP). The experiments, and the survey(s) on which they were conducted, are briefly summarized here:

1. Easy vs. standard wording of consent request (IP and AP). Half the respondents were allocated to the 'standard' question wording, used previously in Understanding Society. The balance was allocated to an 'easy' version, where the text was rewritten to reduce reading difficulty and to provide all essential information about the linkage in the question text rather than in an additional information leaflet.

2. Early vs. late placement of consent question (IP). Half the respondents were asked for consent early in the interview, the other half at the end.

3. Web vs. face-to-face interview (IP). This experiment exploits the random assignment of IP cases to explore mode effects on consent.

4. Default question wording (AP). Experiment 4 tested a default approach to giving consent, asking respondents to “Press ‘next’ to continue” or explicitly opt out, versus the standard opt-in consent procedure.

5. Additional information question wording (AP). This experiment tested the effect of offering additional information, with a version that added a third response option (“I need more information before making a decision”) to the standard ‘yes’ or ‘no’ options.

6. Data linkage domain (AP). Half the respondents were assigned to a question asking for consent to link to HMRC data; the other half were asked about linkage to NHS data.

7. Trust priming (AP). This experiment was crossed with the data linkage domain experiment and focused on the effect of priming trust on consent. Half the sample saw an additional statement, “HMRC / The NHS is a trusted data holder”, on an introductory screen prior to the consent question, followed by an icon symbolizing data security: a shield and lock symbol with the heading “Trust”. The balance was not shown the additional statement or icon.

8. Format of multiple consents (AP). For one group, the five consent questions were each presented on a separate page, with respondents consenting to each in turn. For the second group, the questions were all presented on one page, but the respondent still had to answer each consent question individually. For the third group, all five data requests were presented on a single page and the respondent answered a single yes/no question covering all the linkages.

9. Order of multiple consents (AP). One version asked the five consent questions in ascending order of sensitivity of the request (based on previous data), with NHS asked first. The other version reversed the order, with consent to linkage to HMRC data asked first.

For all of the experiments described above, we examined the rates of consent. We also tested comprehension of the consent request, using a series of knowledge questions about the consent process, and measured subjective understanding, to get a sense of how much respondents felt they understood about the request. Finally, we ascertained respondents' subjective confidence in the decision they had made.
In addition to the experiments, we used digital audio-recordings of the IP11 face-to-face interviews (recorded with respondents' permission) to explore how interviewers communicate the consent request to respondents, whether and how they provide additional information or attempt to persuade respondents to consent, and whether respondents raise questions when asked for consent to data linkage.

Key Findings

Correlates of consent: (1) Respondents who have a better understanding of the data linkage request (as measured by a set of knowledge questions) are more likely to consent. (2) As in previous studies, we find no socio-demographic characteristics that consistently predict consent in all samples. The only consistent predictors are positive attitudes towards data sharing, trust in HMRC, and knowledge of what data HMRC have. (3) Respondents are less likely to consent to data linkage if the wording of the request is difficult and the question is asked late in the questionnaire. Position has no effect on consent if the wording is easy; wording has no effect on consent if the position is early. (4) Priming respondents to think about trust in the organisations involved in the data linkage increases consent. (5) The only socio-demographic characteristic that consistently predicts objective understanding of the linkage request is education. Understanding is positively associated with the number of online data sharing behaviours (e.g., posting text or images on social media, downloading apps, online purchases or banking) and with trust in HMRC. (6) Easy wording of the consent question increases objective understanding of the linkage request. Position of the consent question in the questionnaire has no effect on understanding.

The consent decision process: (7) Respondents decide about the consent request in different ways: some use more reflective decision-making strategies, others use less reflective strategies. (8) Different decision processes are associated with very different levels of consent, comprehension, and confidence in the consent decision. (9) Placing the consent request earlier in the survey increases the probability of the respondent using a reflective decision-making process.

Effects of mode of data collection on consent: (10) As in previous studies, respondents are less likely to consent online than with an interviewer. (11) Web respondents have lower levels of understanding than face-to-face respondents. (12) There is no difference by mode in respondents' confidence in their decisions. (13) Web respondents report higher levels of concern about data security than face-to-face respondents. (14) Web respondents are less likely to use reflective strategies to make their decision than face-to-face respondents, and more likely to make habit-based decisions. (15) Easier wording of the consent request does not reduce mode effects on rates of consent. (16) Respondents rarely ask questions and interviewers rarely provide additional information.

Multiple consent requests: (17) The format in which a sequence of consent requests is asked does not seem to matter. (18) The order of multiple consent requests affects consent rates, but not in a