100+ datasets found
  1. Data from: Automated Linking of Historical Data

    • linkagelibrary.icpsr.umich.edu
    Updated Aug 20, 2020
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez (2020). Automated Linking of Historical Data [Dataset]. http://doi.org/10.3886/E120703V1
    Dataset updated
    Aug 20, 2020
    Authors
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1850 - 1940
    Area covered
    United States
    Description

    Currently, the repository provides code for two such methods:

    The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five year window and/or requiring the match on age to be exact.

    A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
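    The name-and-age matching idea described above can be illustrated with a short Python sketch. This is a simplified illustration and not the authors' released code: the record fields (`id`, `first`, `last`, `birth_year`), the function name `link_records`, and the default two-year band are all assumptions made for the example. It matches on raw names only (no NYSIIS or Jaro-Winkler) and keeps a link only when it is unique in both directions.

```python
def link_records(source, target, max_year_gap=2):
    """Link records on exact (first, last) name and nearest birth year.

    Hypothetical sketch: a link is kept only if it is the single
    candidate at the smallest year gap, in both directions.
    """
    def name_key(rec):
        return (rec["first"].lower(), rec["last"].lower())

    def index_by_name(records):
        index = {}
        for rec in records:
            index.setdefault(name_key(rec), []).append(rec)
        return index

    source_index = index_by_name(source)
    target_index = index_by_name(target)

    def unique_partner(rec, other_index):
        candidates = other_index.get(name_key(rec), [])
        # Try an exact birth-year match first, then widen the band.
        for gap in range(max_year_gap + 1):
            hits = [c for c in candidates
                    if abs(c["birth_year"] - rec["birth_year"]) == gap]
            if len(hits) == 1:
                return hits[0]
            if len(hits) > 1:
                return None  # ambiguous: discard to limit false positives
        return None

    links = []
    for rec in source:
        partner = unique_partner(rec, target_index)
        # Require the match to be unique from the target side as well.
        if partner is not None and unique_partner(partner, source_index) is rec:
            links.append((rec["id"], partner["id"]))
    return links


# Toy records (invented for illustration only).
source = [
    {"id": "a1", "first": "John", "last": "Smith", "birth_year": 1880},
    {"id": "a2", "first": "Mary", "last": "Jones", "birth_year": 1875},
    {"id": "a3", "first": "Anna", "last": "Lee", "birth_year": 1890},
]
target = [
    {"id": "b1", "first": "john", "last": "smith", "birth_year": 1881},
    {"id": "b2", "first": "Mary", "last": "Jones", "birth_year": 1875},
    {"id": "b3", "first": "Mary", "last": "Jones", "birth_year": 1875},
    {"id": "b4", "first": "Anna", "last": "Lee", "birth_year": 1895},
]

links = link_records(source, target)
exact_age_links = link_records(source, target, max_year_gap=0)
```

    Setting `max_year_gap=0` corresponds to the stricter robustness check of requiring an exact match on age; the duplicated "Mary Jones" records show how ambiguous candidates are dropped rather than guessed.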

  2. Understanding and Improving Data Linkage Consent in Surveys, 2018-2019

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Mar 26, 2025
    Jäckle, A; Burton, J; Couper, M; Crossley, T (2025). Understanding and Improving Data Linkage Consent in Surveys, 2018-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-855036
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    University of Essex
    European University Institute
    University of Michigan
    Authors
    Jäckle, A; Burton, J; Couper, M; Crossley, T
    Time period covered
    May 1, 2018 - Dec 31, 2019
    Area covered
    Great Britain
    Variables measured
    Individual
    Measurement technique
    The data were collected on two independent samples from the PopulusLive online access panel. The first sample was surveyed twice, with a one-year interval. The first wave (AP1-1) was fielded in June 2018 and included eleven experimental conditions with n~500 respondents each. A total of 46,206 panelists were invited to AP1-1, of whom 6,532 started the survey and 5,633 completed it (401 broke off and 498 were screened out), for a survey response rate of 12.2%. To track changes in consent over time, four of these eleven experimental groups were re-interviewed about a year later (AP1-2). Of the 2,053 panelists invited to AP1-2, 1,693 started the survey and 1,630 completed it, for a response rate of 79.4%. As a follow up to the results from these two surveys, a second sample was drawn (AP2) and surveyed, with eight experimental groups designed to address further research questions. This sample was fielded in December 2019. A total of 30,682 panelists were invited to AP2, of whom 6,459 started the survey and 3,850 completed it (301 broke off and 2,308 were screened out), for a response rate of 21.1%. The samples were restricted to Great Britain with quotas to match the composition of the Understanding Society Innovation Panel: gender (50% male, 50% female), age (33% 16-40, 33% 41-59, 33% 60+), and highest educational qualification (40% degree or equivalent, 20% A-level or equivalent, 40% GCSE or lower). All surveys included either a single question asking for consent to link the survey data to government administrative records or a set of five consent questions, as well as background questions on socio-demographics, understanding of the linkage request, perceived sensitivity of the consent request, trust in data holding institutions, and general data sharing attitudes and behaviours. Dependent on experimental group, median times for completion of the questionnaire ranged between 9 and 12 minutes (in AP1-1).
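    The response rates quoted above can be recomputed from the counts given in the text. The sketch below is only an arithmetic check: the text does not state the formula used, so it assumes a rate is a simple ratio to the number of invited panelists, and the dictionary layout and function name are invented for the example. Under that assumption, the reported 12.2% (AP1-1) and 79.4% (AP1-2) correspond to completed/invited, while the reported 21.1% for AP2 corresponds to started/invited.

```python
def percent(numerator, invited):
    """Return numerator / invited as a percentage rounded to one decimal."""
    return round(100 * numerator / invited, 1)


# Counts copied from the measurement description above.
surveys = {
    "AP1-1": {"invited": 46206, "started": 6532, "completed": 5633},
    "AP1-2": {"invited": 2053, "started": 1693, "completed": 1630},
    "AP2": {"invited": 30682, "started": 6459, "completed": 3850},
}

rates = {
    name: {
        "started": percent(counts["started"], counts["invited"]),
        "completed": percent(counts["completed"], counts["invited"]),
    }
    for name, counts in surveys.items()
}

for name, r in rates.items():
    print(f"{name}: started {r['started']}%, completed {r['completed']}%")
```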
    Description

    Linking survey and administrative data offers the possibility of combining the strengths, and mitigating the weaknesses, of both. Such linkage is therefore an extremely promising basis for future empirical research in social science. For ethical and legal reasons, linking administrative data to survey responses will usually require obtaining explicit consent. It is well known that not all respondents give consent. Past research on consent has generated many null and inconsistent findings. A weakness of the existing literature is that little effort has been made to understand the cognitive processes of how respondents make the decision whether or not to consent. The overall aim of this project was to improve our understanding about how to pursue the twin goals of maximizing consent and ensuring that consent is genuinely informed. The ultimate objective is to strengthen the data infrastructure for social science and policy research in the UK.

    Specific aims were:
    1. To understand how respondents process requests for data linkage: which factors influence their understanding of data linkage, which factors influence their decision to consent, and to open the black box of consent decisions to begin to understand how respondents make the decision.
    2. To develop and test methods of maximising consent in web surveys, by understanding why web respondents are less likely to give consent than face-to-face respondents.
    3. To develop and test methods of maximising consent with requests for linkage to multiple data sets, by understanding how respondents process multiple requests.
    4. As a by-product of testing hypotheses about the previous points, to test the effects of different approaches to wording consent questions on informed consent.

    Our findings are based on a series of experiments conducted in four surveys using two different studies: The Understanding Society Innovation Panel (IP) and the PopulusLive online access panel (AP). The Innovation Panel is part of Understanding Society: the UK Household Longitudinal Study. It is a probability sample of households in Great Britain used for methodological testing, with a design that mirrors that of the main Understanding Society survey. The Innovation Panel survey was conducted in wave 11, fielded in 2018. The Innovation Panel data are available from the UK Data Service (SN: 6849, http://doi.org/10.5255/UKDA-SN-6849-12). Since the Innovation Panel sample size (around 2,900 respondents) constrained the number of experimental treatment groups we could implement, we fielded a parallel survey with additional experiments, using a different sample. PopulusLive is a non-probability online panel with around 130,000 active sample members, who are recruited through web advertising, word of mouth, and database partners. We used age, gender and education quotas to match the sample composition of the Innovation Panel. A total of nine experiments were conducted across the two sample sources. Experiments 1 to 5 all used variations of a single consent question, about linkage to tax data (held by HM Revenue and Customs, HMRC). Experiments 6 and 7 also used single consent questions, but respondents were either assigned to questions on tax or health data (held by the National Health Service, NHS) linkage. Experiments 8 and 9 used five different data linkage requests: tax data (held by HMRC), health data (held by the NHS), education data (held by the Department for Education in England, DfE, and equivalent departments in Scotland and Wales), household energy data (held by the Department for Business, Energy and Industrial Strategy, BEIS), and benefit and pensions data (held by the Department for Work and Pensions, DWP).
    The experiments, and the survey(s) on which they were conducted, are briefly summarized here:
    1. Easy vs. standard wording of consent request (IP and AP). Half the respondents were allocated to the ‘standard’ question wording, used previously in Understanding Society. The balance was allocated to an ‘easy’ version, where the text was rewritten to reduce reading difficulty and to provide all essential information about the linkage in the question text rather than an additional information leaflet.
    2. Early vs. late placement of consent question (IP). Half the respondents were asked for consent early in the interview, the other half were asked at the end.
    3. Web vs. face-to-face interview (IP). This experiment exploits the random assignment of IP cases to explore mode effects on consent.
    4. Default question wording (AP). Experiment 4 tested a default approach to giving consent, asking respondents to “Press ‘next’ to continue” or explicitly opt out, versus the standard opt-in consent procedure.
    5. Additional information question wording (AP). This experiment tested the effect of offering additional information, with a version that added a third response option (“I need more information before making a decision”) to the standard ‘yes’ or ‘no’ options.
    6. Data linkage...

  3. HES-MHLD Data Linkage Report

    • digital.nhs.uk
    pdf
    Updated Jan 8, 2016
    (2016). HES-MHLD Data Linkage Report [Dataset]. https://digital.nhs.uk/data-and-information/publications/statistical/hes-mhld-data-linkage-report
    Available download formats: pdf (178.0 kB), pdf (181.6 kB), pdf (561.3 kB)
    Dataset updated
    Jan 8, 2016
    License

    https://digital.nhs.uk/about-nhs-digital/terms-and-conditions

    Time period covered
    Sep 1, 2015 - Sep 30, 2015
    Area covered
    England
    Description

    This is the latest monthly (September 2015) statistical publication in relation to the linked HES (Hospital Episode Statistics) and MHLDDS (Mental Health and Learning Disabilities Data Set) data. The two data sets have been linked using specific patient identifiers collected in HES and MHLDDS. The linkage allows the data sets to be linked in this manner from 2006-07; however, this report focuses on patients who were present in the two data sets from April 2015. The bridging file used for this publication was also released on 08 January 2016; it utilises the latest published provisional (Monthly) HES data and year-to-date MHLDDS data relating to September 2015. The HES-MHLDDS linkage provides the ability to undertake national (within England) analysis along acute patient pathways for mental health and learning disability service users' interactions with acute secondary care.

  4. Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical...

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2025
    UCL Institute Of Education University College London (2025). Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Scottish Birth Records, 2000-2002: Secure Access [Dataset]. http://doi.org/10.5255/ukda-sn-8712-1
    Dataset updated
    2025
    Dataset provided by
    UK Data Service https://ukdataservice.ac.uk/
    datacite
    Authors
    UCL Institute Of Education University College London
    Area covered
    Scotland
    Description

    Background:
    The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:

    • to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
    • to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
    • to collect information on previously neglected topics, such as fathers' involvement in children's care and development
    • to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
    • to emphasise intergenerational links including those back to the parents' own childhood
    • to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
    Additional objectives subsequently included for MCS were:
    • to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
    • to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England

    Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.

    The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

    End User Licence versions of MCS studies:
    The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

    Sub-sample studies:
    Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

    Release of Sweeps 1 to 4 to Long Format (Summer 2020)
    To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.
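    The wide-to-long change described above can be illustrated with a small Python sketch. The column names (`family_id`, `mother_age`, `father_age`) and the function name are invented for the example and are not actual MCS variable names; the sketch only shows the general idea of turning one row per family into one row per respondent.

```python
def wide_to_long(family_rows, respondents=("mother", "father")):
    """Reshape one-row-per-family records into one row per respondent.

    Hypothetical layout: wide columns are named '<respondent>_<variable>'.
    Respondents with no recorded values are skipped.
    """
    long_rows = []
    for row in family_rows:
        for who in respondents:
            prefix = who + "_"
            # Collect this respondent's variables, dropping the prefix.
            values = {
                key[len(prefix):]: value
                for key, value in row.items()
                if key.startswith(prefix)
            }
            if any(value is not None for value in values.values()):
                long_rows.append(
                    {"family_id": row["family_id"], "respondent": who, **values}
                )
    return long_rows


# Toy wide-format data (invented for illustration only).
wide = [
    {"family_id": 1, "mother_age": 31, "father_age": 33},
    {"family_id": 2, "mother_age": 28, "father_age": None},
]

long = wide_to_long(wide)
for record in long:
    print(record)
```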

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Secure Access datasets:
    Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).

    Secure Access versions of the MCS include:
    • detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
    • detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
    • linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
    • linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
    • linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
    • linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
    • linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714
    • Banded Distances to English Grammar Schools for MCS5 held under SN 8394
    • linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
    • linked Health Administrative Datasets (SAIL) for Wales held under SN 9310
    • linked Hospital of Birth data held under SN 5724.
    The linked education administrative datasets held under SNs 8481, 7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application. Users are also allowed access to only one of the 2001 or 2011 Census Boundaries versions of the geographical identifier studies: for MCS5, either SN 7762 (2001 Census Boundaries) or SN 7763 (2011 Census Boundaries); for MCS6, either SN 8231 (2001 Census Boundaries) or SN 8232 (2011 Census Boundaries); and for MCS7, either SN 8758 (2001 Census Boundaries) or SN 8759 (2011 Census Boundaries).

    Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).

    The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Scottish Birth Records, 2000-2002: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep, and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Scottish Birth Records.

    Other datasets are available from the Scottish Medical Records database, these include:

    • Child Health Reviews (CHR) held under SN 8709
    • Prescribing Information System (PIS) held under SN 8710
    • Scottish Immunisation and Recall System

  5. HES-DID Data Linkage Report

    • digital.nhs.uk
    pdf
    Updated Jul 7, 2016
    (2016). HES-DID Data Linkage Report [Dataset]. https://digital.nhs.uk/data-and-information/publications/statistical/hes-did-data-linkage-report
    Available download formats: pdf (210.8 kB), pdf (165.5 kB)
    Dataset updated
    Jul 7, 2016
    License

    https://digital.nhs.uk/about-nhs-digital/terms-and-conditions

    Time period covered
    Apr 1, 2015 - Feb 29, 2016
    Area covered
    England
    Description

    This is the latest statistical publication of linked HES (Hospital Episode Statistics) and DID (Diagnostic Imaging Dataset) data held by the Health and Social Care Information Centre. The HES-DID linkage provides the ability to undertake national (within England) analysis along acute patient pathways to understand typical imaging requirements for given procedures, and/or the outcomes after particular imaging has been undertaken, thereby enabling a much deeper understanding of outcomes of imaging and to allow assessment of variation in practice. This publication aims to highlight to users the availability of this updated linkage and provide users of the data with some standard information to assess their analysis approach against. The two data sets have been linked using specific patient identifiers collected in HES and DID. The linkage allows the data sets to be linked from April 2012 when the DID data was first collected; however this report focuses on patients who were present in either data set for the period April 2015-February 2016 only. For DID this is provisional 2015/16 data. For HES this is provisional 2015/16 data. The linkage used for this publication was created on 06 June 2016 and released together with this publication on 07 July 2016.

  6. Serpstat: Backlinks Database | 1.7 trillion links and 170 million domains

    • datarade.ai
    .json, .csv, .xls
    Serpstat, Serpstat: Backlinks Database | 1.7 trillion links and 170 million domains [Dataset]. https://datarade.ai/data-products/serpstat-backlink-index-for-commercial-use-reseller-program-serpstat
    Available download formats: .json, .csv, .xls
    Dataset authored and provided by
    Serpstat
    Area covered
    Bolivia (Plurinational State of), Northern Mariana Islands, Cayman Islands, Bhutan, Montenegro, Cameroon, Cabo Verde, Paraguay, Anguilla, Congo
    Description

    Delve into Serpstat's Backlinks Database, boasting an impressive 1.7 trillion links and covering 170 million domains. Our unique architectural feature ensures live and up-to-date data, eliminating any separation between fresh and historical records.

    With Serpstat's Backlink Index, gain a complete and detailed picture of a domain's backlink profile. Our update period guarantees freshness, with full index data refreshed every 70 days, ensuring that you have access to the most current and relevant backlink information.

    Experience the power of Serpstat's Backlinks Database in uncovering insights, optimizing strategies, and staying ahead in your SEO endeavors. Choose between purchasing the entire dataset or customizable subsets tailored to your specific needs.

  7. Records for 5 people having 4 attributes.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran (2023). Records for 5 people having 4 attributes. [Dataset]. http://doi.org/10.1371/journal.pone.0124449.t002
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row of the table represents a row of the Input02.csv file. Records for 5 people having 4 attributes.

  8. Global import data of Linkage

    • volza.com
    csv
    Updated Feb 17, 2025
    Volza.LLC (2025). Global import data of Linkage [Dataset]. https://www.volza.com/imports-india/india-import-data-of-linkage-from-finland
    Available download formats: csv
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Volza.LLC
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
    Description

    28 Global import shipment records of Linkage with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.

  9. Alesco Email Database - Email Address Data - 2.3+ Billion US email records -...

    • datarade.ai
    .csv, .xls, .txt
    Alesco Data, Alesco Email Database - Email Address Data - 2.3+ Billion US email records - available acquisition marketing and identify resolution! [Dataset]. https://datarade.ai/data-products/alesco-email-database-over-1-8-billion-us-email-records-alesco-data
    Available download formats: .csv, .xls, .txt
    Dataset authored and provided by
    Alesco Data
    Area covered
    United States of America
    Description

    Alesco’s aggregated consumer email database consists of over 2.3 billion U.S. records with name, address and email. The database is fully CAN-SPAM and privacy compliant, and records include referring URL, IP address and date stamp. Postal addresses are address standardized and processed through NCOA. Available for licensing!

    File size: 2.3 Billion
    IP Address: 1.9 Billion
    eAppend data: 1.8 Billion (full name/postal)
    Acquisition: 269 Million (full demo’s)

    Fields Included: Name, Address, Email, Phone, IP Address

  10. Millennium Cohort Study: Linkage with the Point of Interest Data

    • datacatalogue.cessda.eu
    Updated Nov 29, 2024
    University College London, UCL Institute of Education (2024). Millennium Cohort Study: Linkage with the Point of Interest Data [Dataset]. http://doi.org/10.5255/UKDA-SN-8824-1
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Centre for Longitudinal Studies
    Authors
    University College London, UCL Institute of Education
    Time period covered
    Jan 1, 2008 - Jan 1, 2014
    Area covered
    Scotland, Wales, England
    Variables measured
    Individuals, National
    Measurement technique
    Compilation/Synthesis
    Description

    Abstract copyright UK Data Service and data collection copyright owner.

    Background:
    The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:

    • to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
    • to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
    • to collect information on previously neglected topics, such as fathers' involvement in children's care and development
    • to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
    • to emphasise intergenerational links including those back to the parents' own childhood
    • to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
    Additional objectives subsequently included for MCS were:
    • to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
    • to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England

    Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.

    The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

    End User Licence versions of MCS studies:
    The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

    Sub-sample studies:
    Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

    Release of Sweeps 1 to 4 to Long Format (Summer 2020)
    To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Secure Access datasets:
    Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).

    Secure Access versions of the MCS...

  11. Global export data of Linkage

    • volza.com
    csv
    Updated Mar 24, 2025
    + more versions
    Cite
    Volza FZ LLC (2025). Global export data of Linkage [Dataset]. https://www.volza.com/p/linkage/export/export-from-japan/
    Available download formats: csv
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    Volza
    Authors
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of exporters, Sum of export value, 2014-01-01/2021-09-30, Count of export shipments
    Description

    3,722 global export shipment records of Linkage, with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.

  12. Genotype, phenotype and linkage data for Mimulus parishii x M. cardinalis...

    • data.niaid.nih.gov
    zip
    Updated Aug 14, 2023
    Cite
    Genotype, phenotype and linkage data for Mimulus parishii x M. cardinalis hybrid incompatibility study [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_v6wwpzh1m
    Available download formats: zip
    Dataset updated
    Aug 14, 2023
    Dataset provided by
    University of Connecticut
    University of Georgia
    California State Polytechnic University
    University of Montana
    Authors
    V. Alex Sotola; Colette Berg; Matt Samuli; Hongfei Chen; Samuel Mantel; Paul Beardsley; Yao-wu Yuan; Andrea Sweigart; Lila Fishman
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The evolution of genomic incompatibilities causing postzygotic barriers to hybridization is a key step in species divergence. Incompatibilities take two general forms – structural divergence between chromosomes leading to severe hybrid sterility in F1 hybrids and epistatic interactions between genes causing reduced fitness of hybrid gametes or zygotes (Dobzhansky-Muller incompatibilities). Despite substantial recent progress in understanding the molecular mechanisms and evolutionary origins of both types of incompatibility, how each behaves across multiple generations of hybridization remains relatively unexplored. Here, we use genetic mapping in F2 and RIL hybrid populations between the phenotypically divergent but naturally hybridizing monkeyflowers Mimulus cardinalis and M. parishii to characterize the genetic basis of hybrid incompatibility and examine its changing effects over multiple generations of experimental hybridization. In F2s, we found severe hybrid pollen inviability (< 50% reduction vs. parental genotypes) and pseudolinkage caused by a reciprocal translocation between Chromosomes 6 and 7 in the parental species. RILs retained excess heterozygosity around the translocation breakpoints, which caused substantial pollen inviability when interstitial crossovers had not created compatible heterokaryotypic configurations. Strong transmission ratio distortion and inter-chromosomal linkage disequilibrium in both F2s and RILs identified a novel two-locus genic incompatibility causing sex-independent gametophytic (haploid) lethality. The latter interaction eliminated three of the expected nine F2 genotypic classes via F1 gamete loss without detectable effects on the pollen number or viability of F2 double heterozygotes. Along with the mapping of numerous milder incompatibilities, these key findings illuminate the complex genetics of plant hybrid breakdown and are an important step toward understanding the genomic consequences of natural hybridization in this model system.

    Methods

    Study system and plant lines
    The plants in this study were all derived from two highly (>10 generations) inbred lines of Sierran M. cardinalis (CE10) and M. parishii (PAR), which were also used in previous investigations of species barriers (Bradshaw et al. 1998; Schemske and Bradshaw 1999; Ramsey et al. 2003; Bradshaw and Schemske 2003; Fishman et al. 2013, 2015; Nelson et al. 2021a). We generated PAR x CE10 F1 hybrids by hand-pollination (with prior emasculation of the PAR seed parent in the bud) and F2 hybrids by self-pollination of F1 hybrids. The F2 hybrids were grown in two separate greenhouse common gardens at the University of Montana (UM-F2; total N = 524) and the University of Connecticut (UC-F2 N = 253), along with parental control lines, and were phenotyped for numerous floral and vegetative traits including the pollen fertility traits presented here. Recombinant inbred lines (RILs) were generated by single-seed-descent from additional F2 individuals grown at the University of Georgia and California State Polytechnic University, Pomona; a total of 167 RILs were formed through 3–6 generations of self-fertilization.

    DNA extraction and sequencing
    Genomic DNA was extracted from bud and leaf tissue of the greenhouse-grown F2 and RIL mapping populations using a CTAB chloroform protocol modified for 96-well plates (dx.doi.org/10.17504/protocols.io.bgv6jw9e). We used a double-digest restriction-site associated DNA sequencing (ddRADSeq) protocol to generate genome-wide sequence clusters (tags), following the BestRAD library preparation protocol (dx.doi.org/10.17504/protocols.io.6awhafe), using restriction enzymes PstI and BfaI (New England Biolabs, Ipswich, MA). Post-digestion, half plates of individual DNAs were labeled by ligation of 48 unique in-line barcoded adapters, then pooled for size selection. Libraries were prepared using NEBNext Ultra II library preparation kits for Illumina (New England BioLabs, Ipswich, MA). Each pool was indexed with a unique NEBNext i7 adapter and an i5 adapter containing a degenerate barcode and PCR amplified with 12 cycles. The F2 libraries were size-selected to 200-700bp using BluePippin 2% agarose cassettes (Sage Science, Beverly, MA) and sequenced (150-bp paired-end reads) in a partial lane of an Illumina HiSeq4000 sequencer at GC3F, the University of Oregon Genomics Core Facility. The RIL library was sequenced (150-bp paired-ends) without size-selection on an Illumina HiSeq4000 at Genewiz (South Plainfield, NJ).

    Sequence processing and linkage mapping
    After sequencing, two separate ddRAD datasets were analyzed: one with samples from both F2 populations (N = 283 UM-F2 hybrids with 3 M. parishii and 2 M. cardinalis controls, and 253 UC-F2 hybrids, with 3 each F1, M. parishii and M. cardinalis controls) and one with samples from the RIL population (N = 167). Samples from both datasets were demultiplexed using a custom Python script (dx.doi.org/10.17504/protocols.io.bjnbkman), trimmed using Trimmomatic (Bolger et al. 2014), mapped to the M. cardinalis CE10 v2.0 reference genome (http://mimubase.org/FTP/Genomes/CE10g_v2.0) using BWA MEM, and indexed using SAMtools (Li et al. 2009). The RIL dataset was also filtered in SAMtools using a mapping quality ≥ 29. We called SNPs in both datasets using HaplotypeCaller in GATK v3.3 in F2s, v4.1.8.1 in RILs (McKenna et al. 2010). Next, we performed a series of filtering steps to generate sets of high-quality SNPs. In the F2 dataset, we filtered using vcftools (Danecek et al. 2011), retaining sites with read depth ≥ 5, mapping quality ≥ 10, and < 40% missing data. We also filtered out loci deviating from Hardy-Weinberg Equilibrium at a p-value < 0.00005. In the RIL dataset, we filtered a combined GVCF file using GATK, retaining sites with read depth ≥ 4*N (with N = number of RIL samples) and < the mean + 2 * the standard deviation, QD score < 2.0, FS score > 60, MQ < 40, MQ rank sum < -12.5, ReadPosRankSum < -8.0, and < 10% missing genotypes. For both datasets, we used custom scripts to remove sites that were not polymorphic in the parents and heterozygous in the F1 hybrids (F2: https://github.com/bergcolette/F2_genotype_processing). We excluded individuals from the F2 dataset with > 10% missing data and from the RIL dataset with low coverage, high missingness, or excessive heterozygosity (> 50%, indicating line contamination). These filtering steps produced an F2 dataset with 18,119 SNPs (N = 252 UM-F2 and 253 UC-F2) and a RIL dataset with 47,851 SNPs (N = 145).

    To produce sets of high-quality marker genotypes for mapping, we binned each dataset into 18-SNP windows using custom Python and R scripts (provided at GitHub links above), requiring ≥ 8 sites to have SNP genotype calls to assign a windowed genotype. In the F2 binning script, M. cardinalis homozygotes were coded as 2, M. parishii homozygotes as 0, and heterozygotes as 1. We called windows with mean values < 0.2 as parishii homozygotes, > 1.8 as cardinalis homozygotes, and between 0.8 and 1.2 as heterozygotes. Windows with means outside of these ranges were coded as missing genotypes. For the RILs, we required ≥ 88% of SNP calls to match each other to assign each parental homozygous genotype (e.g., 16/18 sites = homozygous for M. cardinalis alleles; https://github.com/vasotola/GenomicsScripts).

    We generated linkage maps for each dataset using Lep-MAP3 (Rastas 2017). First, we used the SeparateChromosomes2 module to assign markers to linkage groups (F2: LodLimit = 25, theta = 3, RIL: LodLimit = 28, theta = 0.2). In the RIL dataset, 10 markers were assigned to linkage groups inconsistent with the reference genome assembly; we manually re-assigned these markers to linkage groups corresponding to their reference assembly chromosomes. Next, we performed iterative ordering using the OrderMarkers2 module (Kosambi mapping function; 6 iterations per linkage group in the F2s, 10 in the RILs); the order with the highest likelihood for each linkage group was chosen. This resulted in an F2 map with 997 markers in seven linkage groups and a RIL map with 2,535 markers in eight linkage groups.

    In the RIL dataset, the genotype matrix output by Lep-MAP3 differed in two important respects from the input file. First, due to stringent thresholds for calling windowed genotypes, our input file includes a high percentage of missing data (23% of genotypes are coded as ‘no call’), whereas the output file contains no missing data (Lep-MAP3 converts each ‘no call’ genotype to a called genotype). Second, the Lep-MAP3 output file contains more heterozygous genotype calls than the input file. The reason for this increase in heterozygosity is that Lep-MAP3 disproportionately converts ‘no call’ genotypes to heterozygotes: relative to the input file, the output genotype matrix includes 115% more heterozygotes, compared to only 18% more M. cardinalis homozygotes and 20% more M. parishii homozygotes. Notably, Lep-MAP3 frequently converted ‘no call’ genotypes to heterozygotes when they occur at single markers between recombination breakpoints. Because most recombinational switches in this RIL population are between alternative homozygotes, any window that contains an actual breakpoint will carry a mixture of M. cardinalis and M. parishii homozygotes at the 18 SNPs (and thus be coded as ‘no call’ in our windowed genotype matrix). To circumvent these problems, for all downstream analyses, we used a modified version of the genotype matrix output from Lep-MAP3 in which genotypes were recoded as ‘no call’ as in the input file.
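    The windowed genotype-calling rules described above (18-SNP windows, at least 8 called sites, mean-based thresholds in the F2s and an 88% agreement rule in the RILs) can be sketched roughly as follows. This is not the authors' script (those are at the GitHub links cited in the text); the function names and genotype labels are hypothetical, and two details are assumptions: the 0.8 to 1.2 heterozygote range is treated as inclusive, and the 88% rule is computed over called sites only.

```python
from statistics import mean

MIN_CALLED = 8  # minimum called SNPs per 18-SNP window (from the text)


def call_f2_window(snp_codes):
    """Call one F2 window from per-SNP codes: 0 = parishii homozygote,
    1 = heterozygote, 2 = cardinalis homozygote, None = missing."""
    called = [c for c in snp_codes if c is not None]
    if len(called) < MIN_CALLED:
        return None  # too few calls: missing genotype
    m = mean(called)
    if m < 0.2:
        return "parishii_hom"
    if m > 1.8:
        return "cardinalis_hom"
    if 0.8 <= m <= 1.2:  # assumption: range endpoints are inclusive
        return "het"
    return None  # mean outside all three ranges: missing genotype


def call_ril_window(snp_codes, min_match=0.88):
    """Call one RIL window as a parental homozygote if at least 88% of the
    called SNPs agree (assumption: fraction is taken over called sites)."""
    called = [c for c in snp_codes if c is not None]
    if len(called) < MIN_CALLED:
        return None
    for code, label in ((2, "cardinalis_hom"), (0, "parishii_hom")):
        if called.count(code) / len(called) >= min_match:
            return label
    return None  # mixed window, e.g. one spanning a recombination breakpoint


print(call_f2_window([2] * 18))
print(call_ril_window([2] * 16 + [0] * 2))
print(call_ril_window([2] * 9 + [0] * 9))
```

    Under these rules a RIL window that spans a breakpoint between alternative homozygotes receives no call, which matches the behaviour the text describes for such windows.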

    QTL mapping of pollen traits
    In the UM-F2 and RIL populations, we directly assessed male fertility by collecting all four anthers of the first flower from each plant into 50 ml of lactophenol-aniline blue dye. We counted viable (darkly stained) and inviable (unstained) pollen grains using a hemocytometer (≥ 100 grains/flower). We estimated total pollen grains per flower

  13. Data fields used to link Correctional Service Canada data to the Registered...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Kathryn E. McIsaac; Shanna Farrell MacDonald; Nelson Chong; Andrea Moser; Rahim Moineddin; Angela Colantonio; Avery Nathens; Flora I. Matheson (2023). Data fields used to link Correctional Service Canada data to the Registered Persons Database, by pass number [Dataset]. http://doi.org/10.1371/journal.pone.0161173.t001
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kathryn E. McIsaac; Shanna Farrell MacDonald; Nelson Chong; Andrea Moser; Rahim Moineddin; Angela Colantonio; Avery Nathens; Flora I. Matheson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data fields used to link Correctional Service Canada data to the Registered Persons Database, by pass number

  14. eLIXIR - Early Life Data Cross-Linkage in Research

    • dtechtive.com
    • find.data.gov.scot
    Updated May 26, 2023
    Cite
    TISSUE DIRECTORY (2023). eLIXIR - Early Life Data Cross-Linkage in Research [Dataset]. https://dtechtive.com/datasets/26264
    Dataset updated
    May 26, 2023
    Dataset provided by
    TISSUE DIRECTORY
    Area covered
    England, United Kingdom
    Description

    Collection of samples and data across the following diseases: Fit and well

  15. Clinical Database to Support Comparative Effectiveness Studies of Complex...

    • icpsr.umich.edu
    Updated Sep 8, 2013
    + more versions
    Cite
    Blaum, Caroline (2013). Clinical Database to Support Comparative Effectiveness Studies of Complex Patients, 2005-2010 [United States] [Dataset]. http://doi.org/10.3886/ICPSR34644.v1
    Dataset updated
    Sep 8, 2013
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Blaum, Caroline
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/34644/terms

    Time period covered
    2005 - 2010
    Area covered
    United States
    Description

    Overview: The goal of the project was to develop a unique database linking chronic disease clinical data from an electronic medical record (EMR) of a large academic healthcare system to multi-payer claims data. The longitudinal relational database can be used to study clinical effectiveness of many diagnostic and treatment interventions. The population of patients used consisted of those patients who were attributed to the University of Michigan Health System (UMHS) as continuing care patients, who are also in adjudicated and validated chronic disease registries. Data Access: These data are not available from ICPSR. The data are restricted to use by the principal investigator and cannot be shared.

  16. Data from "A Bayesian Record Linkage Approach to Applications in Tree...

    • osti.gov
    • search.dataone.org
    Updated Jan 1, 2024
    + more versions
    Cite
    Watershed Function SFA (2024). Data from "A Bayesian Record Linkage Approach to Applications in Tree Demography Using Overlapping LiDAR Scans" [Dataset]. http://doi.org/10.15485/2476543
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
    U.S. DOE > Office of Science > Biological and Environmental Research (BER)
    Watershed Function SFA
    Description

    Processed LiDAR data and environmental covariates from 2015 and 2019 LiDAR scans in the vicinity of Snodgrass Mountain (Western Colorado, USA), in a geographic subset used in primary analysis for the research paper. This package contains LiDAR-derived canopy height maps for 2015 and 2019, crown polygons derived from the height maps using a segmentation algorithm, and environmental covariates supporting the model of forest growth. Source datasets include August 2015 and August 2019 discrete-return LiDAR point clouds collected by Quantum Geospatial for terrain mapping purposes on behalf of the Colorado Hazard Mapping Program and the Colorado Water Conservation Board. Both datasets adhere to the USGS QL2 quality standard. The point cloud data were processed using the R package lidR to generate a canopy height model representing maximum vegetation height above the ground surface, using a pit-free algorithm. This dataset was compiled to assess how spatial patterns of tree growth in montane and subalpine forests are influenced by water and energy availability. Understanding these growth patterns can provide insight into forest dynamics in the Southern Rocky Mountains under changing climatic conditions. This dataset contains .tif, .csv, and .txt files. It additionally includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata, and a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type.

  17. HES-MHLD data linkage report, summary statistics, October 2014

    • gov.uk
    Updated Feb 10, 2015
    + more versions
    Cite
    HES-MHLD data linkage report, summary statistics, October 2014 [Dataset]. https://www.gov.uk/government/statistics/hes-mhld-data-linkage-report-summary-statistics-october-2014
    Dataset updated
    Feb 10, 2015
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Health and Social Care Information Centre
    Description

    Official statistics are produced impartially and free from political influence. This replaces the series “HES-MHMDS data linkage report”.

  18. Number and proportion of babies with BSI caused by clearly pathogenic...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Caroline Fraser; Berit Muller-Pebody; Ruth Blackburn; Jim Gray; Sam J. Oddie; Ruth E. Gilbert; Katie Harron (2023). Number and proportion of babies with BSI caused by clearly pathogenic organisms between 2010–2017 and rate ratios (representing monthly rate change) based on Poisson regression for BSI identified through deterministic + probabilistic linkage, deterministic linkage alone, clinical records of BSI in NNRD and any record of BSI (either from linkage or clinical record of BSI in NNRD). [Dataset]. http://doi.org/10.1371/journal.pone.0226040.t002
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Caroline Fraser; Berit Muller-Pebody; Ruth Blackburn; Jim Gray; Sam J. Oddie; Ruth E. Gilbert; Katie Harron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number and proportion of babies with BSI caused by clearly pathogenic organisms between 2010–2017 and rate ratios (representing monthly rate change) based on Poisson regression for BSI identified through deterministic + probabilistic linkage, deterministic linkage alone, clinical records of BSI in NNRD and any record of BSI (either from linkage or clinical record of BSI in NNRD).

  19. Jyutping Project - Raw Data and Clean Data

    • rdr.ucl.ac.uk
    application/csv
    Updated Aug 19, 2024
    + more versions
    Cite
    Joseph Lam (2024). Jyutping Project - Raw Data and Clean Data [Dataset]. http://doi.org/10.5522/04/26504347.v1
    Available download formats: application/csv
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    University College London
    Authors
    Joseph Lam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I retained only Chinese names and Hong Kong Government romanised English names. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and review was required, so the code itself is not fully automated. The code is stored on my personal GitHub, https://github.com/Jo-Lam/Jyutping_project/tree/main. Please cite this data resource: doi:10.5522/04/26504347

  20. Data Management Plan Examples Database

    • search.dataone.org
    • borealisdata.ca
    Updated Sep 4, 2024
    Cite
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    Borealis
    Authors
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak
    Time period covered
    Jan 1, 2011 - Jan 1, 2023
    Description

    This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data type, description of project, link to the DMP, and, where possible, external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
