35 datasets found
  1. Data from: Automated Linking of Historical Data

    • linkagelibrary.icpsr.umich.edu
    Updated Aug 20, 2020
    Cite
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez (2020). Automated Linking of Historical Data [Dataset]. http://doi.org/10.3886/E120703V1
    Dataset updated
    Aug 20, 2020
    Authors
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1850 - 1940
    Area covered
    United States
    Description

    Currently, the repository provides code for two such methods:

    The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.

    A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
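
    As a rough illustration of the name-and-age matching idea described above, here is a minimal Python sketch. It is not the repository's code: the function name `abe_style_links`, the record layout `(record_id, first, last, birth_year)`, and the default age band of two years are all assumptions made for this example, and only exact (case-insensitive) name matching is shown, without NYSIIS or Jaro-Winkler.

```python
from collections import defaultdict


def abe_style_links(a_records, b_records, max_age_gap=2):
    """Link records by exact name and nearest unique birth year.

    Each record is a tuple (record_id, first, last, birth_year).
    Hypothetical simplification: a link is kept only when the candidate
    is unique in both directions within the age band being tried.
    """
    def index(records):
        by_name = defaultdict(list)
        for rec_id, first, last, year in records:
            by_name[(first.lower(), last.lower())].append((rec_id, year))
        return by_name

    a_index, b_index = index(a_records), index(b_records)
    links = []
    for name, a_group in a_index.items():
        b_group = b_index.get(name, [])
        for a_id, a_year in a_group:
            # Try an exact birth-year match first, then widen by one year at a time.
            for gap in range(max_age_gap + 1):
                candidates = [b for b in b_group if abs(b[1] - a_year) == gap]
                if not candidates:
                    continue
                if len(candidates) == 1:
                    b_id, b_year = candidates[0]
                    # Require uniqueness from the other file's point of view too.
                    rivals = [a for a in a_group if abs(a[1] - b_year) <= gap]
                    if len(rivals) == 1:
                        links.append((a_id, b_id))
                break  # stop at the first band that contains any candidate
    return links


file_a = [("a1", "John", "Smith", 1850), ("a2", "Mary", "Jones", 1860), ("a3", "Ann", "Lee", 1870)]
file_b = [("b1", "john", "smith", 1851), ("b2", "Mary", "Jones", 1860),
          ("b3", "Mary", "Jones", 1860), ("b4", "Ann", "Lee", 1875)]
links = abe_style_links(file_a, file_b)
```

    In this toy input only John Smith is linked: Mary Jones has two equally good candidates, and Ann Lee's only candidate falls outside the age band. Tightening `max_age_gap` to 0 corresponds to the "exact age" robustness check mentioned in the description.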

  2. Features of probabilistic linkage solutions available for record linkage...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Features of probabilistic linkage solutions available for record linkage applications. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t001
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Features of probabilistic linkage solutions available for record linkage applications.

  3. Record linkage using Stata

    • linkagelibrary.icpsr.umich.edu
    Updated Jan 3, 2019
    Cite
    Nada Wasi; Aaron Flaaen (2019). Record linkage using Stata [Dataset]. http://doi.org/10.3886/E107948V1
    Dataset updated
    Jan 3, 2019
    Dataset provided by
    University of Michigan/ISR
    Board of Governors of the Federal Reserve System, Division of Research and Statistics
    Authors
    Nada Wasi; Aaron Flaaen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project points to an article in The Stata Journal describing a set of routines to preprocess nominal data (firm names and addresses), perform probabilistic linking of two datasets, and display candidate matches for clerical review. The ado files and supporting pattern files are downloadable within Stata.

  4. Data from: Accuracy of probabilistic and deterministic record linkage: the...

    • scielo.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Gisele Pinto de Oliveira; Ana Luiza de Souza Bierrenbach; Kenneth Rochel de Camargo Júnior; Cláudia Medina Coeli; Rejane Sobrino Pinheiro (2023). Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis [Dataset]. http://doi.org/10.6084/m9.figshare.19934264.v1
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    SciELO journals
    Authors
    Gisele Pinto de Oliveira; Ana Luiza de Souza Bierrenbach; Kenneth Rochel de Camargo Júnior; Cláudia Medina Coeli; Rejane Sobrino Pinheiro
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Objective: To analyze the accuracy of deterministic and probabilistic record linkage to identify TB duplicate records, as well as the characteristics of discordant pairs.

    Methods: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on the combination of fragments of the key variables with or without modification (Soundex or substring). Each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which the links would be automatically classified as belonging to the same individual. The cutoff point was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated for the accuracy analysis.

    Results: Accuracy ranged from 87.2% to 95.2% for sensitivity and 99.8% to 99.9% for specificity for probabilistic and deterministic record linkage, respectively. The occurrence of missing values for the key variables and the low percentage of similarity measure for name and date of birth were mainly responsible for the failure to identify records of the same individual with the techniques used.

    Conclusions: The two techniques showed a high level of correlation for pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User need and experience should be considered when choosing the best technique to be used.
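
    For readers unfamiliar with the accuracy measures named in the abstract, the following Python sketch shows how sensitivity and specificity are conventionally computed from pair classifications checked against a manual review. The function name and the toy labels are illustrative assumptions and are not drawn from the study's data.

```python
def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) for boolean pair classifications.

    predicted[i] is True when the linkage method called pair i a duplicate;
    truth[i] is True when manual review confirmed it as a duplicate.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    sensitivity = tp / (tp + fn)  # share of true duplicates that were found
    specificity = tn / (tn + fp)  # share of non-duplicates correctly rejected
    return sensitivity, specificity


# Toy labels, invented for illustration only.
predicted = [True, False, True, False]
truth = [True, True, False, False]
sens, spec = sensitivity_specificity(predicted, truth)
```

    With these four toy pairs there is one of each outcome, so both measures come out at 0.5. The sketch assumes at least one true duplicate and one true non-duplicate, otherwise the divisions are undefined.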

  5. Replication Data for: Using a Probabilistic Model to Assist Merging of...

    • dataverse.harvard.edu
    pdf, tar
    Updated Apr 21, 2020
    Cite
    Harvard Dataverse (2020). Replication Data for: Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records [Dataset]. http://doi.org/10.7910/DVN/YGUHTD
    Available download formats: tar(78616576), pdf(136543)
    Dataset updated
    Apr 21, 2020
    Dataset provided by
    Harvard Dataverse
    Description

    Abstract: Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are severe especially when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.

  6. Additional file 1 of Accuracy, potential, and limitations of probabilistic...

    • springernature.figshare.com
    xlsx
    Updated Jun 2, 2024
    Cite
    Ricardo de Mattos Russo Rafael; Kleison Pereira da Silva; Helena Gonçalves de Souza Santos; Davi Gomes Depret; Jaime Alonso Caravaca-Morera; Karen Marie Lucas Breda (2024). Additional file 1 of Accuracy, potential, and limitations of probabilistic record linkage in identifying deaths by gender identity and sexual orientation in the state of Rio De Janeiro, Brazil [Dataset]. http://doi.org/10.6084/m9.figshare.25953318.v1
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2024
    Dataset provided by
    figshare
    Authors
    Ricardo de Mattos Russo Rafael; Kleison Pereira da Silva; Helena Gonçalves de Souza Santos; Davi Gomes Depret; Jaime Alonso Caravaca-Morera; Karen Marie Lucas Breda
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    State of Rio de Janeiro, Brazil
    Description

    Supplementary Material 1

  7. Ministry of Justice Synthetic Data First Probation Iteration 2, England and...

    • beta.ukdataservice.ac.uk
    Updated 2025
    + more versions
    Cite
    Ministry of Justice (2025). Ministry of Justice Synthetic Data First Probation Iteration 2, England and Wales, 2014-2023 [Dataset]. http://doi.org/10.5255/ukda-sn-9398-3
    Dataset updated
    2025
    Dataset provided by
    UK Data Service https://ukdataservice.ac.uk/
    datacite
    Authors
    Ministry of Justice
    Area covered
    Wales, England
    Description

    The Ministry of Justice (MoJ) Data First Synthetic Data Project aims to improve engagement with Data First datasets by making synthetic versions of content available to enable more rapid development of research proposals and to thereby enhance the potential for linked administrative data to improve understanding and outcomes across justice systems. The project has led the development of two components: a dataset generation platform and an initial release of low-fidelity, synthetic data tables.

    This study includes a synthetically-generated version of the Ministry of Justice Data First Probation datasets. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used / joined in the same way as the real datasets. As well as underpinning training, synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.

    The Ministry of Justice Data First probation dataset provides data on people under the supervision of the probation service in England and Wales from 2014. This is a statutory criminal justice service that supervises high-risk offenders released into the community. The data has been extracted from the management information system national Delius (nDelius), used by His Majesty's Prisons and Probation Service (HMPPS) to manage people on probation.

    Information is included on service users' characteristics and offence, and on their pre-sentence reports, sentence requirements, licence conditions and post-sentence supervision; for example, age, gender, ethnicity, offence category, key dates relating to sentence and recalls, activities and programmes required as part of rehabilitation (e.g. drug and alcohol treatment, skills training) and limitations set on their activities (e.g. curfew, location monitoring, drugs testing).

    Each record in the dataset gives information about a single person and probation journey. As part of Data First, records have been deidentified and deduplicated, using our probabilistic record linkage package, Splink, so that a unique identifier is assigned to all records believed to relate to the same person, allowing for longitudinal analysis and investigation of repeat interactions with probation. This aims to improve on links already made within probation services. This opens up the potential to better understand probation service users and address questions on, for example, what works to reduce reoffending.

    The Ministry of Justice Data First linking dataset can be used in combination with this and other Data First datasets to join up administrative records about people from across justice services (courts, prisons and probation) to increase understanding around users' interactions, pathways and outcomes.

  8. Global business register used to estimate cross-border internet purchases...

    • uvaauas.figshare.com
    Updated Dec 7, 2020
    Cite
    Q.A. Meertens (2020). Global business register used to estimate cross-border internet purchases within the EU [Dataset]. http://doi.org/10.21942/uva.13303082.v1
    Dataset updated
    Dec 7, 2020
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    Q.A. Meertens
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    European Union
    Description

    We used ORBIS as the business register. It is a global corporate database maintained by Bureau van Dijk (http://bvdinfo.com/orbis) and contains detailed corporate information on over 200 million private companies worldwide. The database has been claimed to ‘suffer from some structural biases’ (Ribeiro et al., 2010). However, regarding European companies with an annual turnover of more than €100,000, the data set is practically complete (Garcia‐Bernardo and Takes, 2018). Data from business registers regarding smaller foreign companies are not needed in our analysis, as these companies do not have to file tax returns in the Netherlands. The ORBIS database is used because it contains the principal and secondary NACE (revision 2) codes for companies that are established in the EU. The NACE code can be used to select all active (and inactive) companies that are established in the EU and that are principally or secondarily economically active in retail trade. The result is a data set of 6,996,468 companies, from which companies established in the Netherlands have been excluded. This data set, including each company's country of establishment, was extracted from ORBIS on June 24th, 2017.

    The data has been used in the following publication: https://doi.org/10.1111/rssa.12487

  9. Survival risk ratios for ICD-10-AM injury diagnosis classifications for all...

    • figshare.mq.edu.au
    • researchdata.edu.au
    txt
    Updated May 30, 2023
    Cite
    Rebecca Mitchell; Hsuen P Ting (2023). Survival risk ratios for ICD-10-AM injury diagnosis classifications for all ages [Dataset]. http://doi.org/10.25949/14852925.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Macquarie University
    Authors
    Rebecca Mitchell; Hsuen P Ting
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The survival risk ratios (SRRs) were calculated using linked hospitalisation and mortality data from New South Wales (NSW), Australia. Hospital admissions data were obtained from the NSW Ministry of Health and included all injury-related admissions identified using a principal diagnosis of injury (ICD-10-AM: S00-T89) from 1 January 2010 to 30 June 2014. Mortality data were obtained from the NSW Registry of Births, Deaths and Marriages from 1 January 2010 to 31 March 2015. Hospitalisation and mortality data were probabilistically linked by the Centre for Health Record Linkage (CHeReL). NSW covers an area of 800,628 km² with a population of around 7.7 million.

    The SRRs were calculated for each injury diagnosis. An SRR represents the ratio of the number of individuals with each injury diagnosis who did not die to the total number of individuals with the injury diagnosis. The SRRs can be used to estimate injury severity (i.e. the International Classification of Injury Severity Score: ICISS). The ICISS is calculated by applying the SRRs to each injury diagnosis code in your data. There are two methods commonly used to then estimate ICISS values: (i) multiplicative-injury ICISS, where ICISS is the product of the SRRs for all of the individual’s injuries; and (ii) single worst-injury ICISS, where ICISS is the SRR of the worst injury only (i.e. the injury diagnosis with the lowest SRR).
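
    The two ICISS variants described above can be sketched in a few lines of Python. The diagnosis codes `CODE_A` and `CODE_B`, the counts behind their SRRs, and the function names are placeholders invented for this example; real SRR values should be taken from the dataset's text file.

```python
from math import prod


def survival_risk_ratio(survivors, total):
    """SRR: people with the diagnosis who did not die, divided by all people with it."""
    if total <= 0:
        raise ValueError("total must be positive")
    return survivors / total


def iciss_multiplicative(diagnosis_codes, srr_table):
    """Multiplicative-injury ICISS: product of the SRRs of all the person's injuries."""
    return prod(srr_table[code] for code in diagnosis_codes)


def iciss_worst_injury(diagnosis_codes, srr_table):
    """Single worst-injury ICISS: the lowest SRR among the person's injuries."""
    return min(srr_table[code] for code in diagnosis_codes)


# Placeholder SRR table built from made-up counts, for illustration only.
srr_table = {
    "CODE_A": survival_risk_ratio(90, 100),
    "CODE_B": survival_risk_ratio(80, 100),
}
patient_codes = ["CODE_A", "CODE_B"]
multiplicative = iciss_multiplicative(patient_codes, srr_table)
worst = iciss_worst_injury(patient_codes, srr_table)
```

    For the placeholder patient, the multiplicative score is about 0.72 (0.9 × 0.8) and the worst-injury score is 0.8. The multiplicative variant can never exceed the worst-injury variant, since every SRR is at most 1.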

  10. Fit statistics for scored XGBoost models with 50,000 rows per dataset.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  11. 2021–22 Medicaid and CHIP Maternal and Child Health Focus Study Report -...

    • data.virginia.gov
    pdf
    Updated Feb 3, 2024
    Cite
    Other (2024). 2021–22 Medicaid and CHIP Maternal and Child Health Focus Study Report - Datathon23 [Dataset]. https://data.virginia.gov/dataset/2021-22-medicaid-and-chip-maternal-and-child-health-focus-study-report-datathon23
    Available download formats: pdf
    Dataset updated
    Feb 3, 2024
    Dataset authored and provided by
    Other
    Description

    The study used deterministic and probabilistic data linking to match eligible members with birth registry records to identify births paid by Virginia Medicaid during calendar year (CY) 2021. Medicaid member, claims, and encounter data files were used with birth registry data fields to match members from each data linkage process. All probabilistically or deterministically linked birth registry records were included in the eligible focus study population.

  12. Demographic characteristics of the 2016 California birth cohort.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Demographic characteristics of the 2016 California birth cohort. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t004
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description

    Demographic characteristics of the 2016 California birth cohort.

  13. Sensitivity analysis of multiple linear regression model with non-exclusive...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Isabela Maia Diniz; Augusto Afonso Guerra Junior; Livia Lovato Pires de Lemos; Kathiaja M. Souza; Brian Godman; Marion Bennie; Björn Wettermark; Francisco de Assis Acurcio; Juliana Alvares; Eli Iola Gurgel Andrade; Mariangela Leal Cherchiglia; Vânia Eloisa de Araújo (2023). Sensitivity analysis of multiple linear regression model with non-exclusive user of medicines. [Dataset]. http://doi.org/10.1371/journal.pone.0199446.t003
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Isabela Maia Diniz; Augusto Afonso Guerra Junior; Livia Lovato Pires de Lemos; Kathiaja M. Souza; Brian Godman; Marion Bennie; Björn Wettermark; Francisco de Assis Acurcio; Juliana Alvares; Eli Iola Gurgel Andrade; Mariangela Leal Cherchiglia; Vânia Eloisa de Araújo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    2000 to 2015.

  14. Mother-Baby Link for CPRD GOLD

    • healthdatagateway.org
    unknown
    Cite
    CPRD, Mother-Baby Link for CPRD GOLD [Dataset]. http://doi.org/10.48329/dchj-w803
    Available download formats: unknown
    Dataset authored and provided by
    CPRD
    License

    HTTPS://CPRD.COM/DATA-ACCESS

    Description

    A list of all likely mother-baby pairs in the CPRD GOLD database generated using a probabilistic algorithm applied to the primary care data. Algorithmic linkage is done based on household number plus maternity information from the mother’s primary care record, the infant’s month of birth and care records of newly registered babies.

  15. Generalized linear modeling parameter estimates of birth characteristics...

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t005
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generalized linear modeling parameter estimates of birth characteristics predicting CPS referral by age 3.

  16. Dataset: The plural interpretability of German linking elements...

    • zenodo.org
    • live.european-language-grid.eu
    • +2more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Roland Schäfer; Elizabeth Pankratz (2020). Dataset: The plural interpretability of German linking elements ("Morphology") [Dataset]. http://doi.org/10.5281/zenodo.1322791
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Roland Schäfer; Elizabeth Pankratz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset accompanies a paper to be published in "Morphology" (JOMO, Springer). All data generated for this research, as well as all scripts used, are stored under the present DOI. The paper itself is not CC-licensed; refer to Springer's "Morphology" website for details.

    Abstract

    In this paper, we take a closer theoretical and empirical look at the linking elements in German N1+N2 compounds which are identical to the plural marker of N1 (such as -er with umlaut, as in Häus-er-meer 'sea of houses'). Various perspectives on the actual extent of plural interpretability of these pluralic linking elements are expressed in the literature. We aim to clarify this question by empirically examining to what extent there may be a relationship between plural form and meaning which informs in which sorts of compounds pluralic linking elements appear. Specifically, we investigate whether pluralic linking elements occur especially frequently in compounds where a plural meaning of the first constituent is induced either externally (through plural inflection of the entire compound) or internally (through a relation between the constituents such that N2 forces N1 to be conceptually plural, as in the example above). The results of a corpus study using the DECOW16A corpus and a split-100 experiment show that in the internal but not external plural meaning conditions, a pluralic linking element is preferred over a non-pluralic one, though there is considerable inter-speaker variability, and limitations imposed by other constraints on linking element distribution also play a role. However, we show the overall tendency that German language users do use pluralic linking elements as cues to the plural interpretation of N1+N2 compounds. Our interpretation does not reference a specific morphological framework. Instead, we view our data as strengthening the general approach of probabilistic morphology.

  17. American English Nickname Collection

    • abacus.library.ubc.ca
    bin, txt
    Updated Aug 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abacus Data Network (2022). American English Nickname Collection [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=52176414a19f44c6b96758184b48?persistentId=hdl%3A11272.1%2FAB2%2FJR1WG6&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=
    Available download formats: txt(211), bin(3622267)
    Dataset updated
    Aug 9, 2022
    Dataset provided by
    Abacus Data Network
    Area covered
    United States
    Description

    Abstract

    Introduction

    American English Nickname Collection was developed by Intelius, Inc. and is a compilation of American English nickname to given name mappings based on information in US government records, public web profiles and financial and property reports. This corpus is intended as a tool for the quantitative study of nickname usage in the United States, such as in demographic and sociological studies. It has multiple potential human language technology applications as well, including entity extraction, coreference resolution, people search, language modeling and machine translation.

    Data

    The American English Nickname Collection contains 331,237 distinct mappings encompassing millions of names. The data was collected and processed through a record linkage pipeline. The steps in the pipeline were (1) data cleaning, (2) blocking, (3) pair-wise linkage and (4) clustering. In the cleaning step, material was categorized, processed to remove junk and spam records and normalized to an approximately common representation. The blocking process utilized an algorithm to group records by shared properties for determining which record pairs should be examined by the pairwise linker as potential duplicates. The linkage step assigned a score to record pairs using a supervised pairwise-based machine learning model. The clustering step combined record pairs into connected components and further partitioned each connected component to remove inconsistent pairwise links. The result is that input records were partitioned into disjoint sets called profiles, where each profile corresponded to a single person.

    The material is presented in the form of a comma-delimited text file. Each line contains a first name, a nickname or alias, its conditional probability and its frequency. The conditional probability for each nickname is derived from the base data using an algorithm that calculates both the probability with which any alias refers to a given name and a threshold below which the mapping is most likely an error. This threshold eliminates typographic errors and other noise from the data.
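
    The clustering step described above, which groups scored record pairs into connected components, can be illustrated with a small union-find sketch in Python. This is not the pipeline used to build the collection: the function name, the score threshold of 0.5, and the toy record identifiers are assumptions for illustration, and the further partitioning of each component that the description mentions is not shown.

```python
def connected_components(record_ids, scored_pairs, threshold=0.5):
    """Group records into components using pairs whose score meets the threshold.

    scored_pairs is an iterable of (record_a, record_b, score) tuples.
    Returns a sorted list of sorted record-id lists, one list per component.
    """
    parent = {rec: rec for rec in record_ids}

    def find(rec):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[rec] != rec:
            parent[rec] = parent[parent[rec]]
            rec = parent[rec]
        return rec

    for rec_a, rec_b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(rec_a), find(rec_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    groups = {}
    for rec in record_ids:
        groups.setdefault(find(rec), []).append(rec)
    return sorted(sorted(members) for members in groups.values())


# Toy input, invented for illustration only.
records = ["r1", "r2", "r3", "r4"]
pairs = [("r1", "r2", 0.9), ("r2", "r3", 0.8), ("r3", "r4", 0.2)]
profiles = connected_components(records, pairs)
```

    Here `r1`, `r2` and `r3` end up in one group because they are chained by two high-scoring pairs, while `r4` stays on its own. Chaining of this kind is the reason a pipeline might further split components, since `r1` and `r3` are joined without ever being compared directly.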

  18. Mother-Baby Link for CPRD Aurum

    • healthdatagateway.org
    unknown
    Cite
    CPRD, Mother-Baby Link for CPRD Aurum [Dataset]. http://doi.org/10.48329/e9py-2133
    Available download formats: unknown
    Dataset authored and provided by
    CPRD
    License

    HTTPS://CPRD.COM/DATA-ACCESS

    Description

    A list of all likely mother-baby pairs in the CPRD Aurum database generated using a probabilistic algorithm applied to the primary care data. Algorithmic linkage is done based on household number plus maternity information from the mother’s primary care record, the infant’s month of birth and care records of newly registered babies.

  19. Mean annual cost per patient according to clinical and demographic...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Isabela Maia Diniz; Augusto Afonso Guerra Junior; Livia Lovato Pires de Lemos; Kathiaja M. Souza; Brian Godman; Marion Bennie; Björn Wettermark; Francisco de Assis Acurcio; Juliana Alvares; Eli Iola Gurgel Andrade; Mariangela Leal Cherchiglia; Vânia Eloisa de Araújo (2023). Mean annual cost per patient according to clinical and demographic variables, DMT drug at study entry and sequence of events for the 23,082 MS patients. [Dataset]. http://doi.org/10.1371/journal.pone.0199446.t001
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Isabela Maia Diniz; Augusto Afonso Guerra Junior; Livia Lovato Pires de Lemos; Kathiaja M. Souza; Brian Godman; Marion Bennie; Björn Wettermark; Francisco de Assis Acurcio; Juliana Alvares; Eli Iola Gurgel Andrade; Mariangela Leal Cherchiglia; Vânia Eloisa de Araújo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brazil: 2000–2015.

  20. Calculation of sensitivity and specificity for probabilistic matching...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar (2023). Calculation of sensitivity and specificity for probabilistic matching without manual review, not including address variables and using an ETS dataset that only including non-UK born individuals. [Dataset]. http://doi.org/10.1371/journal.pone.0136179.t005
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Robert W. Aldridge; Kunju Shaji; Andrew C. Hayward; Ibrahim Abubakar
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    Calculation of sensitivity and specificity for probabilistic matching without manual review, not including address variables and using an ETS dataset that only including non-UK born individuals.
