Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:

The ABE fully automated approach: a fully automated method for linking historical datasets (e.g. complete-count censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact (see the sketch following this description).

A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
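As an illustration of the ABE-style robustness checks, the following Python sketch (ours, not the repository's actual code; it assumes the jellyfish library and hypothetical record fields) matches on NYSIIS-standardized names and requires the match to be unique within a five-year age window:

    # Illustrative sketch only, not the repository's code (pip install jellyfish).
    import jellyfish

    def abe_match(rec, candidates, exact_age=False, band=2):
        """Link `rec` to the unique candidate that agrees on NYSIIS-standardized
        first and last name, with age exact or within `band` years (band=2
        gives a five-year window). Returns None unless exactly one candidate
        qualifies -- the uniqueness check that limits false positives."""
        key = (jellyfish.nysiis(rec["first"]), jellyfish.nysiis(rec["last"]))
        hits = [c for c in candidates
                if (jellyfish.nysiis(c["first"]), jellyfish.nysiis(c["last"])) == key
                and (c["age"] == rec["age"] if exact_age
                     else abs(c["age"] - rec["age"]) <= band)]
        return hits[0] if len(hits) == 1 else None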
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of probabilistic linkage solutions available for record linkage applications.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project points to an article in The Stata Journal describing a set of routines to preprocess nominal data (firm names and addresses), perform probabilistic linking of two datasets, and display candidate matches for clerical review. The ado files and supporting pattern files are downloadable within Stata.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT

OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate TB records, as well as the characteristics of discordant pairs.

METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on combinations of fragments of the key variables with or without modification (Soundex or substring); each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which links were automatically classified as belonging to the same individual. The cutoff point was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated for the accuracy analysis.

RESULTS: Sensitivity ranged from 87.2% to 95.2% and specificity from 99.8% to 99.9% for probabilistic and deterministic record linkage, respectively. Missing values in the key variables and low similarity scores for name and date of birth were mainly responsible for failures to identify records of the same individual with either technique.

CONCLUSIONS: The two techniques showed a high level of agreement in pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing the best technique to be used.
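For illustration only (the study's 70 rules are not reproduced in this record), one such deterministic rule built from three fragments of key variables, with and without Soundex modification, might look like the Python sketch below; the field names are hypothetical:

    # Hypothetical example of a single deterministic rule (pip install jellyfish).
    import jellyfish

    def rule_example(a, b):
        """True when records a and b agree on three fragments: the Soundex
        of the patient name, the birth year, and the first four letters of
        the mother's name (a substring fragment)."""
        return (jellyfish.soundex(a["name"]) == jellyfish.soundex(b["name"])
                and a["birth_date"][:4] == b["birth_date"][:4]
                and a["mother_name"][:4] == b["mother_name"][:4])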
Abstract: Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
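The canonical probabilistic model referenced here is the Fellegi-Sunter mixture, whose core can be sketched in a few lines. The Python sketch below is ours, not the authors' package: it runs a basic EM loop over binary agreement vectors under conditional independence, without the scaling, missing-data, and auxiliary-information machinery the paper contributes:

    # Minimal Fellegi-Sunter EM sketch (illustrative; not the authors' package).
    import numpy as np

    def fs_em(gamma, n_iter=100):
        """gamma: (n_pairs, n_fields) 0/1 matrix of field agreements.
        Returns per-pair match probabilities plus the estimated match
        share `lam` and per-field agreement rates among matches (m)
        and non-matches (u), assuming fields agree independently."""
        n_pairs, n_fields = gamma.shape
        lam, m, u = 0.1, np.full(n_fields, 0.9), np.full(n_fields, 0.1)
        for _ in range(n_iter):
            # E-step: posterior probability that each pair is a match.
            pm = lam * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
            pu = (1 - lam) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
            w = pm / (pm + pu)
            # M-step: re-estimate parameters from the posterior weights.
            lam = w.mean()
            m = (w @ gamma) / w.sum()
            u = ((1 - w) @ gamma) / (1 - w).sum()
        return w, lam, m, u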
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Material 1
This study includes a synthetically generated version of the Ministry of Justice Data First Probation datasets. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used and joined in the same way as the real datasets. As well as underpinning training, the synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.
The Ministry of Justice Data First probation dataset provides data on people under the supervision of the probation service in England and Wales from 2014. This is a statutory criminal justice service that supervises high-risk offenders released into the community. The data has been extracted from the national Delius (nDelius) management information system, used by His Majesty's Prison and Probation Service (HMPPS) to manage people on probation.
Information is included on service users' characteristics and offence, and on their pre-sentence reports, sentence requirements, licence conditions and post-sentence supervision; for example, age, gender, ethnicity, offence category, key dates relating to sentence and recalls, activities and programmes required as part of rehabilitation (e.g. drug and alcohol treatment, skills training) and limitations set on their activities (e.g. curfew, location monitoring, drugs testing).
Each record in the dataset gives information about a single person and probation journey. As part of Data First, records have been deidentified and deduplicated, using our probabilistic record linkage package, Splink, so that a unique identifier is assigned to all records believed to relate to the same person, allowing for longitudinal analysis and investigation of repeat interactions with probation. This aims to improve on links already made within probation services. This opens up the potential to better understand probation service users and address questions on, for example, what works to reduce reoffending.
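Splink is open source, so the style of linkage described here can be reproduced on other data. A minimal deduplication sketch against Splink's public API follows; the column names are hypothetical, and the calls shown follow recent (Splink 4) releases, so check them against the version you install:

    # Minimal Splink deduplication sketch (pip install splink; API as of Splink 4).
    import splink.comparison_library as cl
    from splink import DuckDBAPI, Linker, SettingsCreator, block_on

    settings = SettingsCreator(
        link_type="dedupe_only",
        comparisons=[
            cl.NameComparison("first_name"),      # hypothetical column names
            cl.NameComparison("surname"),
            cl.DateOfBirthComparison("date_of_birth", input_is_string=True),
        ],
        blocking_rules_to_generate_predictions=[
            block_on("surname"), block_on("date_of_birth"),
        ],
    )
    linker = Linker(df, settings, db_api=DuckDBAPI())  # df: a pandas DataFrame
    linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
    pairs = linker.inference.predict(threshold_match_probability=0.9)
    clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(pairs, 0.95)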
The Ministry of Justice Data First linking dataset can be used in combination with this and other Data First datasets to join up administrative records about people from across justice services (courts, prisons and probation) to increase understanding around users' interactions, pathways and outcomes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used ORBIS as a business register. It is a global corporate database maintained by Bureau van Dijk (http://bvdinfo.com/orbis) and contains detailed corporate information on over 200 million private companies worldwide. The database has been claimed to ‘suffer from some structural biases’ (Ribeiro et al., 2010). However, for European companies with an annual turnover of more than €100,000, the data set is practically complete (Garcia‐Bernardo and Takes, 2018). Data from business registers on smaller foreign companies are not needed in our analysis, as these companies do not have to file tax returns in the Netherlands. The ORBIS database is used because it contains the principal and secondary NACE (revision 2) codes for companies established in the EU. The NACE code can be used to select all active (and inactive) companies that are established in the EU and that are principally or secondarily economically active in retail trade. The result is a data set of 6,996,468 companies, from which companies established in the Netherlands have been excluded. This data set, including each company's country of establishment, was extracted from ORBIS on June 24th, 2017.

The data has been used in the following publication: https://doi.org/10.1111/rssa.12487
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The survival risk ratios (SRRs) were calculated using linked hospitalisation and mortality data from New South Wales (NSW), Australia. Hospital admissions data were obtained from the NSW Ministry of Health and included all injury-related admissions identified using a principal diagnosis of injury (ICD-10-AM: S00-T89) during 1 January 2010 to 30 June 2014. Mortality data were obtained from the NSW Registry of Births, Deaths and Marriages from 1 January 2010 to 31 March 2015. Hospitalisation and mortality data were probabilistically linked by the Centre for Health Record Linkage (CHeReL). NSW covers an area of 800,628 km² with a population of around 7.7 million.
The SRRs were calculated for each injury diagnosis. An SRR is the ratio of the number of individuals with a given injury diagnosis who did not die to the total number of individuals with that diagnosis. The SRRs can be used to estimate injury severity via the International Classification of Injury Severity Score (ICISS), which is calculated by applying the SRRs to each injury diagnosis code in your data. Two methods are commonly used to estimate ICISS values: (i) multiplicative-injury ICISS, where ICISS is the product of the SRRs for all of the individual's injuries; and (ii) single worst-injury ICISS, where ICISS includes only the worst injury, i.e. the injury diagnosis with the lowest SRR.
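As a worked example with hypothetical SRR values, the two variants are computed as follows:

    # Worked example with hypothetical SRRs for one person's three injuries.
    from math import prod

    srrs = [0.95, 0.80, 0.99]   # one SRR per injury diagnosis code

    # (i) Multiplicative-injury ICISS: product of all of the person's SRRs.
    iciss_multiplicative = prod(srrs)   # 0.95 * 0.80 * 0.99 = 0.7524

    # (ii) Single worst-injury ICISS: the lowest SRR on its own.
    iciss_worst = min(srrs)             # 0.80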
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
The study used deterministic and probabilistic data linking to match eligible members with birth registry records and so identify births paid for by Virginia Medicaid during calendar year (CY) 2021. Medicaid member, claims, and encounter data files were matched against birth registry data fields in each data linkage process. All probabilistically or deterministically linked birth registry records were included in the eligible focus study population.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic characteristics of the 2016 California birth cohort.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2000 to 2015.
https://cprd.com/data-access
A list of all likely mother-baby pairs in the CPRD GOLD database, generated using a probabilistic algorithm applied to the primary care data. The algorithm links on household number plus maternity information from the mother’s primary care record, the infant’s month of birth, and the care records of newly registered babies.
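As a rough sketch of the candidate-pair step implied by that description (not CPRD's actual algorithm; field names are hypothetical, and the real algorithm scores pairs probabilistically rather than requiring exact agreement):

    # Rough candidate-pair sketch; not CPRD's algorithm, field names hypothetical.
    def candidate_pairs(mothers, babies):
        """Yield (mother, baby) pairs sharing a household number whose
        recorded delivery month matches the infant's month of birth."""
        by_household = {}
        for b in babies:
            by_household.setdefault(b["household_id"], []).append(b)
        for m in mothers:
            for b in by_household.get(m["household_id"], []):
                if b["birth_month"] == m["delivery_month"]:
                    yield m, b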
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset accompanies a paper to be published in "Morphology" (JOMO, Springer). Under the present DOI, all data generated for this research, as well as all scripts used, are stored. The paper itself is not CC-licensed; refer to Springer's "Morphology" website for details!
Abstract
In this paper, we take a closer theoretical and empirical look at the linking elements in German N1+N2 compounds which are identical to the plural marker of N1 (such as -er with umlaut, as in Häus-er-meer 'sea of houses'). Various perspectives on the actual extent of plural interpretability of these pluralic linking elements are expressed in the literature. We aim to clarify this question by empirically examining to what extent there may be a relationship between plural form and meaning which informs in which sorts of compounds pluralic linking elements appear. Specifically, we investigate whether pluralic linking elements occur especially frequently in compounds where a plural meaning of the first constituent is induced either externally (through plural inflection of the entire compound) or internally (through a relation between the constituents such that N2 forces N1 to be conceptually plural, as in the example above). The results of a corpus study using the DECOW16A corpus and a split-100 experiment show that in the internal but not external plural meaning conditions, a pluralic linking element is preferred over a non-pluralic one, though there is considerable inter-speaker variability, and limitations imposed by other constraints on linking element distribution also play a role. However, we show the overall tendency that German language users do use pluralic linking elements as cues to the plural interpretation of N1+N2 compounds. Our interpretation does not reference a specific morphological framework. Instead, we view our data as strengthening the general approach of probabilistic morphology.
Abstract

Introduction: The American English Nickname Collection was developed by Intelius, Inc. and is a compilation of mappings from American English nicknames to given names, based on information in US government records, public web profiles, and financial and property reports. This corpus is intended as a tool for the quantitative study of nickname usage in the United States, such as in demographic and sociological studies. It also has multiple potential human language technology applications, including entity extraction, coreference resolution, people search, language modeling and machine translation.

Data: The American English Nickname Collection contains 331,237 distinct mappings encompassing millions of names. The data was collected and processed through a record linkage pipeline with four steps: (1) data cleaning, (2) blocking, (3) pair-wise linkage and (4) clustering. In the cleaning step, material was categorized, processed to remove junk and spam records, and normalized to an approximately common representation. The blocking step used an algorithm to group records by shared properties, determining which record pairs should be examined by the pairwise linker as potential duplicates. The linkage step assigned a score to record pairs using a supervised pairwise machine learning model. The clustering step combined record pairs into connected components and further partitioned each connected component to remove inconsistent pairwise links. The result is that input records were partitioned into disjoint sets called profiles, where each profile corresponds to a single person. The material is presented as a comma-delimited text file. Each line contains a first name, a nickname or alias, its conditional probability and its frequency. The conditional probability for each nickname is derived from the base data using an algorithm that calculates both the probability with which any alias refers to a given name and a threshold below which the mapping is most likely an error. This threshold eliminates typographic errors and other noise from the data.
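A short sketch of consuming the file, assuming the column order given above (the file name and probability threshold here are hypothetical):

    # Read the comma-delimited file described above; file name is hypothetical.
    import csv

    def load_nicknames(path, min_prob=0.01):
        """Map each nickname to (given name, conditional probability) pairs,
        keeping only mappings whose probability clears `min_prob`."""
        mappings = {}
        with open(path, newline="") as f:
            for first, nickname, prob, freq in csv.reader(f):
                if float(prob) >= min_prob:
                    mappings.setdefault(nickname, []).append((first, float(prob)))
        return mappings

    nicknames = load_nicknames("nickname_collection.csv")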
https://cprd.com/data-access
A list of all likely mother-baby pairs in the CPRD Aurum database, generated using a probabilistic algorithm applied to the primary care data. The algorithm links on household number plus maternity information from the mother’s primary care record, the infant’s month of birth, and the care records of newly registered babies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil: 2000–2015.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Calculation of sensitivity and specificity for probabilistic matching without manual review, excluding address variables and using an ETS dataset that includes only non-UK-born individuals.