Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:

The ABE fully automated approach: a fully automated method for linking historical datasets (e.g. complete-count censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.

A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
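For orientation, here is a minimal, hedged sketch of the kind of name comparison these methods rely on: it standardizes names with NYSIIS, falls back on Jaro-Winkler similarity, and requires close agreement on age. It is not the repository's code; the field names are invented, and the jellyfish function names (nysiis, jaro_winkler_similarity) reflect recent versions of that third-party package.

```python
# Hypothetical sketch of ABE-style name/age comparison; not the repository's code.
# Assumes the third-party `jellyfish` package (pip install jellyfish).
import jellyfish

def standardize(name: str) -> str:
    """Collapse common spelling variants with the NYSIIS phonetic code."""
    return jellyfish.nysiis(name.strip().upper())

def is_candidate_match(rec_a: dict, rec_b: dict,
                       max_age_gap: int = 0, min_jw: float = 0.85) -> bool:
    """True if two records (dicts with hypothetical 'first', 'last', 'age'
    keys) plausibly refer to the same person. max_age_gap = 0 mimics the
    stricter exact-age robustness check described above."""
    def names_agree(a: str, b: str) -> bool:
        return (standardize(a) == standardize(b)
                or jellyfish.jaro_winkler_similarity(a, b) >= min_jw)
    return (names_agree(rec_a["first"], rec_b["first"])
            and names_agree(rec_a["last"], rec_b["last"])
            and abs(rec_a["age"] - rec_b["age"]) <= max_age_gap)

# Two census records that likely belong to the same individual.
print(is_candidate_match({"first": "Jon", "last": "Smith", "age": 34},
                         {"first": "John", "last": "Smyth", "age": 34}))  # True
```

In the ABE procedure proper, a candidate pair would additionally have to be a unique match (with robustness checks requiring uniqueness within a five-year age window) before being accepted.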
Linking survey and administrative data offers the possibility of combining the strengths, and mitigating the weaknesses, of both. Such linkage is therefore an extremely promising basis for future empirical research in social science. For ethical and legal reasons, linking administrative data to survey responses will usually require obtaining explicit consent, and it is well known that not all respondents give consent. Past research on consent has generated many null and inconsistent findings. A weakness of the existing literature is that little effort has been made to understand the cognitive processes by which respondents decide whether or not to consent. The overall aim of this project was to improve our understanding of how to pursue the twin goals of maximising consent and ensuring that consent is genuinely informed. The ultimate objective is to strengthen the data infrastructure for social science and policy research in the UK. Specific aims were: 1. To understand how respondents process requests for data linkage: which factors influence their understanding of data linkage, which factors influence their decision to consent, and to open the black box of consent decisions to begin to understand how respondents make the decision. 2. To develop and test methods of maximising consent in web surveys, by understanding why web respondents are less likely to give consent than face-to-face respondents. 3. To develop and test methods of maximising consent with requests for linkage to multiple data sets, by understanding how respondents process multiple requests. 4. As a by-product of testing hypotheses about the previous points, to test the effects of different approaches to wording consent questions on informed consent.
Our findings are based on a series of experiments conducted in four surveys, drawing on two different sample sources: the Understanding Society Innovation Panel (IP) and the PopulusLive online access panel (AP). The Innovation Panel is part of Understanding Society: the UK Household Longitudinal Study. It is a probability sample of households in Great Britain used for methodological testing, with a design that mirrors that of the main Understanding Society survey. The Innovation Panel survey was conducted in wave 11, fielded in 2018. The Innovation Panel data are available from the UK Data Service (SN: 6849, http://doi.org/10.5255/UKDA-SN-6849-12).
Since the Innovation Panel sample size (around 2,900 respondents) constrained the number of experimental treatment groups we could implement, we fielded a parallel survey with additional experiments, using a different sample. PopulusLive is a non-probability online panel with around 130,000 active sample members, who are recruited through web advertising, word of mouth, and database partners. We used age, gender and education quotas to match the sample composition of the Innovation Panel.
A total of nine experiments were conducted across the two sample sources. Experiments 1 to 5 all used variations of a single consent question, about linkage to tax data (held by HM Revenue and Customs, HMRC). Experiments 6 and 7 also used single consent questions, but respondents were assigned to questions on either tax or health data (held by the National Health Service, NHS) linkage. Experiments 8 and 9 used five different data linkage requests: tax data (held by HMRC), health data (held by the NHS), education data (held by the Department for Education in England, DfE, and equivalent departments in Scotland and Wales), household energy data (held by the Department for Business, Energy and Industrial Strategy, BEIS), and benefit and pensions data (held by the Department for Work and Pensions, DWP).
The experiments, and the survey(s) on which they were conducted, are briefly summarized here:
1. Easy vs. standard wording of consent request (IP and AP). Half the respondents were allocated to the ‘standard’ question wording, used previously in Understanding Society. The other half were allocated to an ‘easy’ version, in which the text was rewritten to reduce reading difficulty and to provide all essential information about the linkage in the question text rather than in an additional information leaflet.
2. Early vs. late placement of consent question (IP). Half the respondents were asked for consent early in the interview, the other half were asked at the end.
3. Web vs. face-to-face interview (IP). This experiment exploits the random assignment of IP cases to interview modes to explore mode effects on consent.
4. Default question wording (AP). Experiment 4 tested a default approach to giving consent, asking respondents to “Press ‘next’ to continue” or explicitly opt out, versus the standard opt-in consent procedure.
5. Additional information question wording (AP). This experiment tested the effect of offering additional information, using a version that added a third response option (“I need more information before making a decision”) to the standard ‘yes’ or ‘no’ options.
6. Data linkage...
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
This is the latest monthly (September 2015) statistical publication in relation to the linked HES (Hospital Episode Statistics) and MHLDDS (Mental Health and Learning Disabilities Data Set) data. The two data sets have been linked using specific patient identifiers collected in HES and MHLDDS. The linkage allows the data sets to be joined in this manner from 2006-07 onwards; however, this report focuses on patients who were present in the two data sets from April 2015. The bridging file used for this publication was also released on 08 January 2016; it utilises the latest published provisional (monthly) HES data and year-to-date MHLDDS data relating to September 2015. The HES-MHLDDS linkage provides the ability to undertake national (within England) analysis along acute patient pathways for mental health and learning disability service users' interactions with acute secondary care.
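As a purely illustrative aside, record-level linkage of this kind amounts to joining the two data sets on a shared (pseudonymised) patient identifier. Below is a minimal sketch with hypothetical column names; it is not the actual HES-MHLDDS bridging-file methodology.

```python
# Illustrative deterministic linkage on a shared patient identifier;
# column names and values are hypothetical.
import pandas as pd

hes = pd.DataFrame({
    "patient_id": ["A1", "A2", "A3"],
    "admission_date": ["2015-04-02", "2015-05-17", "2015-09-30"],
})
mhldds = pd.DataFrame({
    "patient_id": ["A2", "A3", "A4"],
    "mh_contact_date": ["2015-06-01", "2015-09-12", "2015-07-20"],
})

# An inner join keeps only patients present in both data sets.
linked = hes.merge(mhldds, on="patient_id", how="inner")
print(linked)
```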
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five; the fourth sweep (MCS4) in 2008, when they were seven; the fifth sweep (MCS5) in 2012-2013, when they were eleven; the sixth sweep (MCS6) in 2015, when they were fourteen; and the seventh sweep (MCS7) in 2018, when they were seventeen.

The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Scottish Birth Records, 2000-2002: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Scottish Birth Records.
Other datasets are available from the Scottish Medical Records database; these include:
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
This is the latest statistical publication of linked HES (Hospital Episode Statistics) and DID (Diagnostic Imaging Dataset) data held by the Health and Social Care Information Centre. The HES-DID linkage provides the ability to undertake national (within England) analysis along acute patient pathways to understand typical imaging requirements for given procedures, and/or the outcomes after particular imaging has been undertaken, thereby enabling a much deeper understanding of imaging outcomes and allowing assessment of variation in practice. This publication aims to highlight to users the availability of this updated linkage and to provide some standard information against which they can assess their analysis approach. The two data sets have been linked using specific patient identifiers collected in HES and DID. The linkage allows the data sets to be linked from April 2012, when the DID data was first collected; however, this report focuses on patients who were present in either data set for the period April 2015 to February 2016 only. For both DID and HES, this is provisional 2015/16 data. The linkage used for this publication was created on 06 June 2016 and released together with this publication on 07 July 2016.
Delve into Serpstat's Backlinks Database, boasting an impressive 1.7 trillion links and covering 170 million domains. Our unique architectural feature ensures live and up-to-date data, eliminating any separation between fresh and historical records.
With Serpstat's Backlink Index, gain a complete and detailed picture of a domain's backlink profile. Our update period guarantees freshness, with full index data refreshed every 70 days, ensuring that you have access to the most current and relevant backlink information.
Experience the power of Serpstat's Backlinks Database in uncovering insights, optimizing strategies, and staying ahead in your SEO endeavors. Choose between purchasing the entire dataset or customizable subsets tailored to your specific needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row of the table corresponds to one row of the Input02.csv file. The file contains records for 5 people, each with 4 attributes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
28 global import shipment records of Linkage with prices, volume and current buyer-supplier relationships, based on an actual global export trade database.
Alesco’s aggregated consumer email database consists of over 2.3 billion U.S. records with name, address and email. The database is fully CAN-SPAM and privacy compliant, and records include referring URL, IP address and date stamp. Postal addresses are address standardized and processed through NCOA. Available for licensing!
File size: 2.3 billion records; IP address: 1.9 billion; eAppend data: 1.8 billion (full name/postal); acquisition: 269 million (full demographics).
Fields included: name, address, email, phone, IP address.
Abstract copyright UK Data Service and data collection copyright owner.
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five; the fourth sweep (MCS4) in 2008, when they were seven; the fifth sweep (MCS5) in 2012-2013, when they were eleven; the sixth sweep (MCS6) in 2015, when they were fourteen; and the seventh sweep (MCS7) in 2018, when they were seventeen.
End User Licence versions of MCS studies:
The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.
Sub-sample studies:
Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).
Release of Sweeps 1 to 4 to Long Format (Summer 2020)
To support longitudinal research and make it easier to compare data from different time points, all data across all sweeps are now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which have been converted from the old/wide format to a new/long format to match the data from sweeps 5 and 6 (the age 11 and age 14 sweeps). The old/wide-format datasets contained one row per family, with multiple variables for different respondents. The new/long-format datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.
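For readers unfamiliar with the two layouts, here is a minimal pandas sketch of a wide-to-long reshape of this kind; the variable names are invented and do not correspond to the actual MCS files.

```python
# Hypothetical illustration of the old/wide vs new/long layouts; variable
# names are invented and do not match the actual MCS data files.
import pandas as pd

# Old/wide format: one row per family, separate columns per respondent.
wide = pd.DataFrame({
    "family_id": [1, 2],
    "age_parent1": [34, 41],
    "age_parent2": [36, None],
    "age_cm1": [0.75, 0.75],   # cohort member, age in years
})

# New/long format: one row per respondent within each family.
long = pd.wide_to_long(
    wide, stubnames="age", i="family_id", j="respondent",
    sep="_", suffix=r"\w+"
).reset_index().dropna(subset=["age"])

print(long)   # one row per respondent (parent or cohort member) per family
```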
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Secure Access datasets:
Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).
Secure Access versions of the MCS...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
3722 global export shipment records of Linkage with prices, volume and current buyer-supplier relationships, based on an actual global export trade database.
https://spdx.org/licenses/CC0-1.0.html
The evolution of genomic incompatibilities causing postzygotic barriers to hybridization is a key step in species divergence. Incompatibilities take two general forms – structural divergence between chromosomes leading to severe hybrid sterility in F1 hybrids and epistatic interactions between genes causing reduced fitness of hybrid gametes or zygotes (Dobzhansky-Muller incompatibilities). Despite substantial recent progress in understanding the molecular mechanisms and evolutionary origins of both types of incompatibility, how each behaves across multiple generations of hybridization remains relatively unexplored. Here, we use genetic mapping in F2 and RIL hybrid populations between the phenotypically divergent but naturally hybridizing monkeyflowers Mimulus cardinalis and M. parishii to characterize the genetic basis of hybrid incompatibility and examine its changing effects over multiple generations of experimental hybridization. In F2s, we found severe hybrid pollen inviability (< 50% reduction vs. parental genotypes) and pseudolinkage caused by a reciprocal translocation between Chromosomes 6 and 7 in the parental species. RILs retained excess heterozygosity around the translocation breakpoints, which caused substantial pollen inviability when interstitial crossovers had not created compatible heterokaryotypic configurations. Strong transmission ratio distortion and inter-chromosomal linkage disequilibrium in both F2s and RILs identified a novel two-locus genic incompatibility causing sex-independent gametophytic (haploid) lethality. The latter interaction eliminated three of the expected nine F2 genotypic classes via F1 gamete loss without detectable effects on the pollen number or viability of F2 double heterozygotes. Along with the mapping of numerous milder incompatibilities, these key findings illuminate the complex genetics of plant hybrid breakdown and are an important step toward understanding the genomic consequences of natural hybridization in this model system.
Methods
Study system and plant lines
The plants in this study were all derived from two highly (>10 generations) inbred lines of Sierran M. cardinalis (CE10) and M. parishii (PAR), which were also used in previous investigations of species barriers (Bradshaw et al. 1998; Schemske and Bradshaw 1999; Ramsey et al. 2003; Bradshaw and Schemske 2003; Fishman et al. 2013, 2015; Nelson et al. 2021a). We generated PAR x CE10 F1 hybrids by hand-pollination (with prior emasculation of the PAR seed parent in the bud) and F2 hybrids by self-pollination of F1 hybrids. The F2 hybrids were grown in two separate greenhouse common gardens at the University of Montana (UM-F2; total N = 524) and the University of Connecticut (UC-F2 N = 253), along with parental control lines, and were phenotyped for numerous floral and vegetative traits including the pollen fertility traits presented here. Recombinant inbred lines (RILs) were generated by single-seed-descent from additional F2 individuals grown at the University of Georgia and California State Polytechnic University, Pomona; a total of 167 RILs were formed through 3–6 generations of self-fertilization.
DNA extraction and sequencing
Genomic DNA was extracted from bud and leaf tissue of the greenhouse-grown F2 and RIL mapping populations using a CTAB chloroform protocol modified for 96-well plates (dx.doi.org/10.17504/protocols.io.bgv6jw9e). We used a double-digest restriction-site associated DNA sequencing (ddRADSeq) protocol to generate genome-wide sequence clusters (tags), following the BestRAD library preparation protocol (dx.doi.org/10.17504/protocols.io.6awhafe), using restriction enzymes PstI and BfaI (New England Biolabs, Ipswich, MA). Post-digestion, half plates of individual DNAs were labeled by ligation of 48 unique in-line barcoded adapters, then pooled for size selection. Libraries were prepared using NEBNext Ultra II library preparation kits for Illumina (New England BioLabs, Ipswich, MA). Each pool was indexed with a unique NEBNext i7 adapter and an i5 adapter containing a degenerate barcode and PCR amplified with 12 cycles. The F2 libraries were size-selected to 200-700bp using BluePippin 2% agarose cassettes (Sage Science, Beverly, MA) and sequenced (150-bp paired-end reads) in a partial lane of an Illumina HiSeq4000 sequencer at GC3F, the University of Oregon Genomics Core Facility. The RIL library was sequenced (150-bp paired-ends) without size-selection on an Illumina HiSeq4000 at Genewiz (South Plainfield, NJ).
Sequence processing and linkage mapping
After sequencing, two separate ddRAD datasets were analyzed: one with samples from both F2 populations (N = 283 UM-F2 hybrids with 3 M. parishii and 2 M. cardinalis controls, and 253 UC-F2 hybrids, with 3 each F1, M. parishii and M. cardinalis controls) and one with samples from the RIL population (N = 167). Samples from both datasets were demultiplexed using a custom Python script (dx.doi.org/10.17504/protocols.io.bjnbkman), trimmed using Trimmomatic (Bolger et al. 2014), mapped to the M. cardinalis CE10 v2.0 reference genome (http://mimubase.org/FTP/Genomes/CE10g_v2.0) using BWA MEM, and indexed using SAMtools (Li et al. 2009). The RIL dataset was also filtered in SAMtools using a mapping quality ≥ 29 threshold. We called SNPs in both datasets using GATK HaplotypeCaller (v3.3 for the F2s, v4.1.8.1 for the RILs; McKenna et al. 2010).
Next, we performed a series of filtering steps to generate sets of high-quality SNPs. In the F2 dataset, we filtered using vcftools (Danecek et al. 2011), retaining sites with read depth ≥ 5, mapping quality ≥ 10, and < 40% missing data. We also filtered out loci deviating from Hardy-Weinberg equilibrium at a p-value < 0.00005. In the RIL dataset, we filtered a combined GVCF file using GATK, retaining sites with read depth ≥ 4*N (with N = the number of RIL samples) and below the mean + 2 standard deviations, and with < 10% missing genotypes, while excluding sites failing standard hard-filter thresholds (QD < 2.0, FS > 60, MQ < 40, MQRankSum < -12.5, ReadPosRankSum < -8.0). For both datasets, we used custom scripts to remove sites that were not polymorphic in the parents and heterozygous in the F1 hybrids (F2: https://github.com/bergcolette/F2_genotype_processing). We excluded individuals from the F2 dataset with > 10% missing data and from the RIL dataset with low coverage, high missingness, or excessive heterozygosity (> 50%, indicating line contamination). These filtering steps produced an F2 dataset with 18,119 SNPs (N = 252 UM-F2 and 253 UC-F2) and a RIL dataset with 47,851 SNPs (N = 145).
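For illustration only, the RIL site filters described above could be expressed over a per-site annotation table roughly as follows; the column names are hypothetical, and the actual filtering was done with GATK and vcftools as stated.

```python
# Hypothetical re-statement of the RIL site filters over a pandas table of
# per-site annotations; the real pipeline used GATK/vcftools directly.
import pandas as pd

def filter_ril_sites(sites: pd.DataFrame, n_samples: int) -> pd.DataFrame:
    """Keep sites passing the depth, missingness, and hard-filter thresholds."""
    depth_min = 4 * n_samples
    depth_max = sites["DP"].mean() + 2 * sites["DP"].std()
    keep = (
        (sites["DP"] >= depth_min) & (sites["DP"] < depth_max)
        & (sites["QD"] >= 2.0) & (sites["FS"] <= 60)
        & (sites["MQ"] >= 40) & (sites["MQRankSum"] >= -12.5)
        & (sites["ReadPosRankSum"] >= -8.0)
        & (sites["frac_missing"] < 0.10)
    )
    return sites[keep]
```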
To produce sets of high-quality marker genotypes for mapping, we binned each dataset into 18-SNP windows using custom Python and R scripts (provided at GitHub links above), requiring ≥ 8 sites to have SNP genotype calls to assign a windowed genotype. In the F2 binning script, M. cardinalis homozygotes were coded as 2, M. parishii homozygotes as 0, and heterozygotes as 1. We called windows with mean values < 0.2 as parishii homozygotes, > 1.8 as cardinalis homozygotes, and between 0.8 and 1.2 as heterozygotes. Windows with means outside of these ranges were coded as missing genotypes. For the RILs, we required ≥ 88% of SNP calls to match each other to assign each parental homozygous genotype (e.g., 16/18 sites = homozygous for M. cardinalis alleles; https://github.com/vasotola/GenomicsScripts).
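A minimal sketch of the F2 windowed-genotype rule just described; the published custom scripts at the GitHub links above are the authoritative implementation.

```python
# Illustrative version of the F2 18-SNP window genotype call; the published
# custom scripts (see GitHub links above) are the authoritative implementation.
import numpy as np

def call_f2_window(snp_calls):
    """snp_calls: per-SNP codes within one 18-SNP window
    (2 = M. cardinalis homozygote, 1 = heterozygote, 0 = M. parishii
    homozygote, np.nan = missing). Returns 'CC', 'PP', 'CP', or None."""
    calls = np.asarray(snp_calls, dtype=float)
    called = calls[~np.isnan(calls)]
    if called.size < 8:              # require >= 8 genotyped SNPs per window
        return None
    mean = called.mean()
    if mean < 0.2:
        return "PP"                  # M. parishii homozygote
    if mean > 1.8:
        return "CC"                  # M. cardinalis homozygote
    if 0.8 <= mean <= 1.2:
        return "CP"                  # heterozygote
    return None                      # ambiguous window -> missing genotype

print(call_f2_window([2] * 8 + [np.nan] + [2] * 9))  # 'CC'
```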
We generated linkage maps for each dataset using Lep-MAP3 (Rastas 2017). First, we used the SeparateChromosomes2 module to assign markers to linkage groups (F2: LodLimit = 25, theta = 3; RIL: LodLimit = 28, theta = 0.2). In the RIL dataset, 10 markers were assigned to linkage groups inconsistent with the reference genome assembly; we manually re-assigned these markers to linkage groups corresponding to their reference assembly chromosomes. Next, we performed iterative ordering using the OrderMarkers2 module (Kosambi mapping function; 6 iterations per linkage group in the F2s, 10 in the RILs); the order with the highest likelihood for each linkage group was chosen. This resulted in an F2 map with 997 markers in seven linkage groups and a RIL map with 2,535 markers in eight linkage groups. In the RIL dataset, the genotype matrix output by Lep-MAP3 differed in two important respects from the input file. First, due to stringent thresholds for calling windowed genotypes, our input file includes a high percentage of missing data (23% of genotypes are coded as ‘no call’), whereas the output file contains no missing data (Lep-MAP3 converts each ‘no call’ genotype to a called genotype). Second, the Lep-MAP3 output file contains more heterozygous genotype calls than the input file. The reason for this increase in heterozygosity is that Lep-MAP3 disproportionately converts ‘no call’ genotypes to heterozygotes: relative to the input file, the output genotype matrix includes 115% more heterozygotes, compared to only 18% more M. cardinalis homozygotes and 20% more M. parishii homozygotes. Notably, Lep-MAP3 frequently converted ‘no call’ genotypes to heterozygotes when they occurred at single markers between recombination breakpoints. Because most recombinational switches in this RIL population are between alternative homozygotes, any window that contains an actual breakpoint will carry a mixture of M. cardinalis and M. parishii homozygotes at the 18 SNPs (and thus be coded as ‘no call’ in our windowed genotype matrix). To circumvent these problems, for all downstream analyses, we used a modified version of the genotype matrix output from Lep-MAP3 in which genotypes were recoded as ‘no call’ wherever the input file had ‘no call’.
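A sketch of that final recoding step, assuming the input and Lep-MAP3 output genotype matrices have already been parsed into same-shaped pandas DataFrames (file parsing is omitted):

```python
# Restore 'no call' genotypes that Lep-MAP3 imputed, as described above.
# Assumes `lepmap_input` and `lepmap_output` are same-shaped DataFrames
# (markers x individuals) already parsed from the respective files.
import pandas as pd

def recode_output(lepmap_input: pd.DataFrame, lepmap_output: pd.DataFrame,
                  no_call: str = "NC") -> pd.DataFrame:
    """Wherever the input matrix had no call, force the output back to no call,
    leaving all other Lep-MAP3 genotype calls untouched."""
    recoded = lepmap_output.copy()
    recoded[lepmap_input == no_call] = no_call
    return recoded
```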
QTL mapping of pollen traits
In the UM-F2 and RIL populations, we directly assessed male fertility by collecting all four anthers of the first flower from each plant into 50 ml of lactophenol-aniline blue dye. We counted viable (darkly stained) and inviable (unstained) pollen grains using a hemocytometer (≥ 100 grains/flower). We estimated total pollen grains per flower
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data fields used to link Correctional Service Canada data to the Registered Persons Database, by pass number
Collection of samples and data across the following diseases: Fit and well
https://www.icpsr.umich.edu/web/ICPSR/studies/34644/terms
Overview: The goal of the project was to develop a unique database linking chronic disease clinical data from an electronic medical record (EMR) of a large academic healthcare system to multi-payer claims data. The longitudinal relational database can be used to study the clinical effectiveness of many diagnostic and treatment interventions. The patient population consisted of those attributed to the University of Michigan Health System (UMHS) as continuing care patients who were also in adjudicated and validated chronic disease registries. Data Access: These data are not available from ICPSR. The data are restricted to use by the principal investigator and cannot be shared.
Processed LiDAR data and environmental covariates from 2015 and 2019 LiDAR scans in the vicinity of Snodgrass Mountain (western Colorado, USA), in a geographic subset used in primary analysis for the research paper. This package contains LiDAR-derived canopy height maps for 2015 and 2019, crown polygons derived from the height maps using a segmentation algorithm, and environmental covariates supporting the model of forest growth. Source datasets include August 2015 and August 2019 discrete-return LiDAR point clouds collected by Quantum Geospatial for terrain mapping purposes on behalf of the Colorado Hazard Mapping Program and the Colorado Water Conservation Board. Both datasets adhere to the USGS QL2 quality standard. The point cloud data were processed using the R package lidR to generate a canopy height model representing maximum vegetation height above the ground surface, using a pit-free algorithm. This dataset was compiled to assess how spatial patterns of tree growth in montane and subalpine forests are influenced by water and energy availability. Understanding these growth patterns can provide insight into forest dynamics in the Southern Rocky Mountains under changing climatic conditions. This dataset contains .tif, .csv, and .txt files. This dataset additionally includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata, and a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type.
Official statistics are produced impartially and free from political influence. This replaces the series “HES-MHMDS data linkage report”.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number and proportion of babies with BSI caused by clearly pathogenic organisms between 2010 and 2017, and rate ratios (representing monthly rate change) based on Poisson regression, for BSI identified through deterministic + probabilistic linkage, deterministic linkage alone, clinical records of BSI in the NNRD, and any record of BSI (either from linkage or from a clinical record of BSI in the NNRD).
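For context, monthly rate ratios of this kind can be obtained from a Poisson regression with a log link and an exposure offset. Below is a minimal, synthetic-data sketch using statsmodels; it is not the study's own code or data.

```python
# Illustrative Poisson regression for a monthly rate ratio; synthetic data,
# not the study's analysis code.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
months = np.arange(96)                         # e.g. a 2010-2017 monthly index
babies = rng.integers(800, 1200, size=96)      # denominator per month
bsi = rng.poisson(lam=0.01 * babies)           # BSI counts per month

X = sm.add_constant(pd.DataFrame({"month": months}))
model = sm.GLM(bsi, X, family=sm.families.Poisson(),
               offset=np.log(babies)).fit()

# exp(coefficient on month) = rate ratio per additional month
print(np.exp(model.params["month"]))
```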
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I only retained Chinese Names and Hong Kong Government Romanised English Names. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and review was required, so the code itself is not fully automated. The code is stored on my personal GitHub: https://github.com/Jo-Lam/Jyutping_project/tree/main. Please cite this data resource: doi:10.5522/04/26504347
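For readers wanting to reproduce the general idea, here is a hedged sketch of obtaining Pinyin and Jyutping readings from Chinese characters. It assumes the third-party pypinyin and pycantonese packages; the function names (lazy_pinyin, characters_to_jyutping) are assumptions about those libraries, and the project's actual, partly manual workflow lives in the GitHub repository above.

```python
# Hedged sketch only; the project's actual (partly manual) workflow is in its
# own GitHub repository. Library function names are assumptions about the
# third-party packages pypinyin and pycantonese.
from pypinyin import lazy_pinyin          # pip install pypinyin
import pycantonese                        # pip install pycantonese

def romanise(name_zh: str) -> dict:
    """Return Mandarin Pinyin and Cantonese Jyutping readings for a name."""
    pinyin = " ".join(lazy_pinyin(name_zh))
    # characters_to_jyutping is assumed to return (segment, jyutping) pairs;
    # some segments may have no reading, hence the `or ""` guard.
    jyutping = " ".join(jp or "" for _, jp in
                        pycantonese.characters_to_jyutping(name_zh))
    return {"chinese": name_zh, "pinyin": pinyin, "jyutping": jyutping}

print(romanise("陳大文"))
```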
This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data type, description of project, link to the DMP, and, where possible, external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.