Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:

The ABE fully automated approach: a fully automated method for linking historical datasets (e.g. complete-count censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact (see the sketch following this description).

A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
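As an illustration of the ABE-style robustness checks, the following Python sketch (ours, not the repository's actual code; it assumes the jellyfish library and hypothetical record fields) matches on NYSIIS-standardized names and requires the match to be unique within a five-year age window:

    # Illustrative sketch only, not the repository's code (pip install jellyfish).
    import jellyfish

    def abe_match(rec, candidates, exact_age=False, band=2):
        """Link `rec` to the unique candidate that agrees on NYSIIS-standardized
        first and last name, with age exact or within `band` years (band=2
        gives a five-year window). Returns None unless exactly one candidate
        qualifies -- the uniqueness check that limits false positives."""
        key = (jellyfish.nysiis(rec["first"]), jellyfish.nysiis(rec["last"]))
        hits = [c for c in candidates
                if (jellyfish.nysiis(c["first"]), jellyfish.nysiis(c["last"])) == key
                and (c["age"] == rec["age"] if exact_age
                     else abs(c["age"] - rec["age"]) <= band)]
        return hits[0] if len(hits) == 1 else None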
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of probabilistic linkage solutions available for record linkage applications.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project points to an article in The Stata Journal describing a set of routines to preprocess nominal data (firm names and addresses), perform probabilistic linking of two datasets, and display candidate matches for clerical review. The ado files and supporting pattern files are downloadable within Stata.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT

OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate TB records, as well as the characteristics of discordant pairs.

METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on combinations of fragments of the key variables with or without modification (Soundex or substring); each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which links were automatically classified as belonging to the same individual. The cutoff point was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated for the accuracy analysis.

RESULTS: Sensitivity ranged from 87.2% to 95.2% and specificity from 99.8% to 99.9% for probabilistic and deterministic record linkage, respectively. Missing values in the key variables and low similarity scores for name and date of birth were mainly responsible for failures to identify records of the same individual with either technique.

CONCLUSIONS: The two techniques showed a high level of agreement in pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing the best technique to be used.
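For illustration only (the study's 70 rules are not reproduced in this record), one such deterministic rule built from three fragments of key variables, with and without Soundex modification, might look like the Python sketch below; the field names are hypothetical:

    # Hypothetical example of a single deterministic rule (pip install jellyfish).
    import jellyfish

    def rule_example(a, b):
        """True when records a and b agree on three fragments: the Soundex
        of the patient name, the birth year, and the first four letters of
        the mother's name (a substring fragment)."""
        return (jellyfish.soundex(a["name"]) == jellyfish.soundex(b["name"])
                and a["birth_date"][:4] == b["birth_date"][:4]
                and a["mother_name"][:4] == b["mother_name"][:4])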
Abstract: Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
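The canonical probabilistic model referenced here is the Fellegi-Sunter mixture, whose core can be sketched in a few lines. The Python sketch below is ours, not the authors' package: it runs a basic EM loop over binary agreement vectors under conditional independence, without the scaling, missing-data, and auxiliary-information machinery the paper contributes:

    # Minimal Fellegi-Sunter EM sketch (illustrative; not the authors' package).
    import numpy as np

    def fs_em(gamma, n_iter=100):
        """gamma: (n_pairs, n_fields) 0/1 matrix of field agreements.
        Returns per-pair match probabilities plus the estimated match
        share `lam` and per-field agreement rates among matches (m)
        and non-matches (u), assuming fields agree independently."""
        n_pairs, n_fields = gamma.shape
        lam, m, u = 0.1, np.full(n_fields, 0.9), np.full(n_fields, 0.1)
        for _ in range(n_iter):
            # E-step: posterior probability that each pair is a match.
            pm = lam * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
            pu = (1 - lam) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
            w = pm / (pm + pu)
            # M-step: re-estimate parameters from the posterior weights.
            lam = w.mean()
            m = (w @ gamma) / w.sum()
            u = ((1 - w) @ gamma) / (1 - w).sum()
        return w, lam, m, u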
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Material 1
This study includes a synthetically generated version of the Ministry of Justice Data First Probation datasets. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used and joined in the same way as the real datasets. As well as underpinning training, the synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.
The Ministry of Justice Data First probation dataset provides data on people under the supervision of the probation service in England and Wales from 2014. This is a statutory criminal justice service that supervises high-risk offenders released into the community. The data has been extracted from the national Delius (nDelius) management information system, used by His Majesty's Prison and Probation Service (HMPPS) to manage people on probation.
Information is included on service users' characteristics and offence, and on their pre-sentence reports, sentence requirements, licence conditions and post-sentence supervision; for example, age, gender, ethnicity, offence category, key dates relating to sentence and recalls, activities and programmes required as part of rehabilitation (e.g. drug and alcohol treatment, skills training) and limitations set on their activities (e.g. curfew, location monitoring, drugs testing).
Each record in the dataset gives information about a single person and probation journey. As part of Data First, records have been deidentified and deduplicated, using our probabilistic record linkage package, Splink, so that a unique identifier is assigned to all records believed to relate to the same person, allowing for longitudinal analysis and investigation of repeat interactions with probation. This aims to improve on links already made within probation services. This opens up the potential to better understand probation service users and address questions on, for example, what works to reduce reoffending.
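Splink is open source, so the style of linkage described here can be reproduced on other data. A minimal deduplication sketch against Splink's public API follows; the column names are hypothetical, and the calls shown follow recent (Splink 4) releases, so check them against the version you install:

    # Minimal Splink deduplication sketch (pip install splink; API as of Splink 4).
    import splink.comparison_library as cl
    from splink import DuckDBAPI, Linker, SettingsCreator, block_on

    settings = SettingsCreator(
        link_type="dedupe_only",
        comparisons=[
            cl.NameComparison("first_name"),      # hypothetical column names
            cl.NameComparison("surname"),
            cl.DateOfBirthComparison("date_of_birth", input_is_string=True),
        ],
        blocking_rules_to_generate_predictions=[
            block_on("surname"), block_on("date_of_birth"),
        ],
    )
    linker = Linker(df, settings, db_api=DuckDBAPI())  # df: a pandas DataFrame
    linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
    pairs = linker.inference.predict(threshold_match_probability=0.9)
    clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(pairs, 0.95)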
The Ministry of Justice Data First linking dataset can be used in combination with this and other Data First datasets to join up administrative records about people from across justice services (courts, prisons and probation) to increase understanding around users' interactions, pathways and outcomes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used ORBIS as a business register. It is a global corporate database maintained by Bureau van Dijk (http://bvdinfo.com/orbis) and contains detailed corporate information on over 200 million private companies worldwide. The database has been claimed to ‘suffer from some structural biases’ (Ribeiro et al., 2010). However, for European companies with an annual turnover of more than €100,000, the data set is practically complete (Garcia‐Bernardo and Takes, 2018). Data from business registers on smaller foreign companies are not needed in our analysis, as these companies do not have to file tax returns in the Netherlands. The ORBIS database is used because it contains the principal and secondary NACE (revision 2) codes for companies established in the EU. The NACE code can be used to select all active (and inactive) companies that are established in the EU and that are principally or secondarily economically active in retail trade. The result is a data set of 6,996,468 companies, from which companies established in the Netherlands have been excluded. This data set, including each company's country of establishment, was extracted from ORBIS on June 24th, 2017.

The data has been used in the following publication: https://doi.org/10.1111/rssa.12487
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The survival risk ratios (SRRs) were calculated using linked hospitalisation and mortality data from New South Wales (NSW), Australia. Hospital admissions data were obtained from the NSW Ministry of Health and included all injury-related admissions identified using a principal diagnosis of injury (ICD-10-AM: S00-T89) during 1 January 2010 to 30 June 2014. Mortality data were obtained from the NSW Registry of Births, Deaths and Marriages from 1 January 2010 to 31 March 2015. Hospitalisation and mortality data were probabilistically linked by the Centre for Health Record Linkage (CHeReL). NSW covers an area of 800,628 km² with a population of around 7.7 million.
The SRRs were calculated for each injury diagnosis. An SRR is the ratio of the number of individuals with a given injury diagnosis who did not die to the total number of individuals with that diagnosis. The SRRs can be used to estimate injury severity via the International Classification of Injury Severity Score (ICISS), which is calculated by applying the SRRs to each injury diagnosis code in your data. Two methods are commonly used to estimate ICISS values: (i) multiplicative-injury ICISS, where ICISS is the product of the SRRs for all of the individual's injuries; and (ii) single worst-injury ICISS, where ICISS includes only the worst injury, i.e. the injury diagnosis with the lowest SRR.
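As a worked example with hypothetical SRR values, the two variants are computed as follows:

    # Worked example with hypothetical SRRs for one person's three injuries.
    from math import prod

    srrs = [0.95, 0.80, 0.99]   # one SRR per injury diagnosis code

    # (i) Multiplicative-injury ICISS: product of all of the person's SRRs.
    iciss_multiplicative = prod(srrs)   # 0.95 * 0.80 * 0.99 = 0.7524

    # (ii) Single worst-injury ICISS: the lowest SRR on its own.
    iciss_worst = min(srrs)             # 0.80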
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
The study used deterministic and probabilistic data linking to match eligible members with birth registry records and so identify births paid for by Virginia Medicaid during calendar year (CY) 2021. Medicaid member, claims, and encounter data files were matched against birth registry data fields in each data linkage process. All probabilistically or deterministically linked birth registry records were included in the eligible focus study population.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic characteristics of the 2016 California birth cohort.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2000 to 2015.
https://cprd.com/data-access
A list of all likely mother-baby pairs in the CPRD GOLD database, generated using a probabilistic algorithm applied to the primary care data. The algorithm links on household number plus maternity information from the mother’s primary care record, the infant’s month of birth, and the care records of newly registered babies.
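As a rough sketch of the candidate-pair step implied by that description (not CPRD's actual algorithm; field names are hypothetical, and the real algorithm scores pairs probabilistically rather than requiring exact agreement):

    # Rough candidate-pair sketch; not CPRD's algorithm, field names hypothetical.
    def candidate_pairs(mothers, babies):
        """Yield (mother, baby) pairs sharing a household number whose
        recorded delivery month matches the infant's month of birth."""
        by_household = {}
        for b in babies:
            by_household.setdefault(b["household_id"], []).append(b)
        for m in mothers:
            for b in by_household.get(m["household_id"], []):
                if b["birth_month"] == m["delivery_month"]:
                    yield m, b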
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset accompanies a paper to be published in "Morphology" (JOMO, Springer). Under the present DOI, all data generated for this research, as well as all scripts used, are stored. The paper itself is not CC-licensed; refer to Springer's "Morphology" website for details!
Abstract
In this paper, we take a closer theoretical and empirical look at the linking elements in German N1+N2 compounds which are identical to the plural marker of N1 (such as -er with umlaut, as in Häus-er-meer 'sea of houses'). Various perspectives on the actual extent of plural interpretability of these pluralic linking elements are expressed in the literature. We aim to clarify this question by empirically examining to what extent there may be a relationship between plural form and meaning which informs in which sorts of compounds pluralic linking elements appear. Specifically, we investigate whether pluralic linking elements occur especially frequently in compounds where a plural meaning of the first constituent is induced either externally (through plural inflection of the entire compound) or internally (through a relation between the constituents such that N2 forces N1 to be conceptually plural, as in the example above). The results of a corpus study using the DECOW16A corpus and a split-100 experiment show that in the internal but not external plural meaning conditions, a pluralic linking element is preferred over a non-pluralic one, though there is considerable inter-speaker variability, and limitations imposed by other constraints on linking element distribution also play a role. However, we show the overall tendency that German language users do use pluralic linking elements as cues to the plural interpretation of N1+N2 compounds. Our interpretation does not reference a specific morphological framework. Instead, we view our data as strengthening the general approach of probabilistic morphology.
Abstract

Introduction: The American English Nickname Collection was developed by Intelius, Inc. and is a compilation of mappings from American English nicknames to given names, based on information in US government records, public web profiles, and financial and property reports. This corpus is intended as a tool for the quantitative study of nickname usage in the United States, such as in demographic and sociological studies. It also has multiple potential human language technology applications, including entity extraction, coreference resolution, people search, language modeling and machine translation.

Data: The American English Nickname Collection contains 331,237 distinct mappings encompassing millions of names. The data was collected and processed through a record linkage pipeline with four steps: (1) data cleaning, (2) blocking, (3) pair-wise linkage and (4) clustering. In the cleaning step, material was categorized, processed to remove junk and spam records, and normalized to an approximately common representation. The blocking step used an algorithm to group records by shared properties, determining which record pairs should be examined by the pairwise linker as potential duplicates. The linkage step assigned a score to record pairs using a supervised pairwise machine learning model. The clustering step combined record pairs into connected components and further partitioned each connected component to remove inconsistent pairwise links. The result is that input records were partitioned into disjoint sets called profiles, where each profile corresponds to a single person. The material is presented as a comma-delimited text file. Each line contains a first name, a nickname or alias, its conditional probability and its frequency. The conditional probability for each nickname is derived from the base data using an algorithm that calculates both the probability with which any alias refers to a given name and a threshold below which the mapping is most likely an error. This threshold eliminates typographic errors and other noise from the data.
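A short sketch of consuming the file, assuming the column order given above (the file name and probability threshold here are hypothetical):

    # Read the comma-delimited file described above; file name is hypothetical.
    import csv

    def load_nicknames(path, min_prob=0.01):
        """Map each nickname to (given name, conditional probability) pairs,
        keeping only mappings whose probability clears `min_prob`."""
        mappings = {}
        with open(path, newline="") as f:
            for first, nickname, prob, freq in csv.reader(f):
                if float(prob) >= min_prob:
                    mappings.setdefault(nickname, []).append((first, float(prob)))
        return mappings

    nicknames = load_nicknames("nickname_collection.csv")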
https://cprd.com/data-access
A list of all likely mother-baby pairs in the CPRD Aurum database, generated using a probabilistic algorithm applied to the primary care data. The algorithm links on household number plus maternity information from the mother’s primary care record, the infant’s month of birth, and the care records of newly registered babies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil: 2000–2015.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Calculation of sensitivity and specificity for probabilistic matching without manual review, excluding address variables and using an ETS dataset that includes only non-UK-born individuals.