34 datasets found
  1. Simulated Data for Patient Time Series Record Linkage

    • figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Ahmed Soliman (2023). Simulated Data for Patient Time Series Record Linkage [Dataset]. http://doi.org/10.6084/m9.figshare.19224786.v1
    Available download formats: zip
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ahmed Soliman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This simulated dataset consists of two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.

    1. ergo.csv contains heart rate time series data for 1600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file contains three field values: Ergo_ID, heart rate (BPM), and timestamp.
    2. pat.csv contains only four sample readings from each of the patient's 20 ergometric tests. Each row contains three values: patient_ID, heart rate, and timestamp.

    The goal is to link patients (identified by their patient_ID in the pat.csv file) to their corresponding ergometric tests (identified by their Ergo_ID in the ergo.csv file). This is done solely by matching the timestamp-value pairs from both files.

    The time series record linkage task described above is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2

    The data is simulated such that correctly linked/matched identifiers satisfy the following formula:

    |Ergo_ID - patient_ID| mod 104 == 0

    The above formula is useful in evaluating the linkage algorithm's performance.
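
    The evaluation formula in the description can be sketched in a few lines of Python. This is only an illustration, not part of tslink2: the function names and the sample links are hypothetical, and the modulus is copied verbatim from the description.

```python
from typing import Iterable, Tuple

# Modulus copied verbatim from the dataset description.
MODULUS = 104


def is_correct_link(ergo_id: int, patient_id: int, modulus: int = MODULUS) -> bool:
    """Return True if |Ergo_ID - patient_ID| mod modulus == 0."""
    return abs(ergo_id - patient_id) % modulus == 0


def linkage_precision(links: Iterable[Tuple[int, int]], modulus: int = MODULUS) -> float:
    """Fraction of predicted (Ergo_ID, patient_ID) links that satisfy the formula."""
    links = list(links)
    if not links:
        return 0.0
    correct = sum(is_correct_link(e, p, modulus) for e, p in links)
    return correct / len(links)


# Hypothetical predicted links, made up for illustration only.
predicted = [(208, 0), (105, 1), (7, 3)]
print(linkage_precision(predicted))
```

    A linkage algorithm's output can be passed to linkage_precision as (Ergo_ID, patient_ID) pairs; recall would additionally require the full set of true pairs.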

  2. Privacy Preserving Linkage Software

    • data.gov.au
    zip
    Updated Apr 2, 2019
    + more versions
    Cite
    The Commonwealth Scientific and Industrial Research Organisation (2019). Privacy Preserving Linkage Software [Dataset]. https://data.gov.au/dataset/ds-dap-csiro%3A26733
    Available download formats: zip
    Dataset updated
    Apr 2, 2019
    Dataset provided by
    The Commonwealth Scientific and Industrial Research Organisation
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A set of software tools for privacy-preserving record linkage:

    - anonlink: a library for carrying out the low-level hash comparisons required server side. Available from GitHub at http://github.com/n1analytics/anonlink/
    - entity-service: the server-side component of the private record linkage REST API, utilizing the anonlink library.
    - clkhash: a client utility and library for turning personally identifiable information (PII) into Bloom filter hashes. Available from GitHub at https://github.com/n1analytics/clkhash/
    - encoding-service: a REST API wrapper around clkhash for encoding PII data into CLKs. Available from GitHub at https://github.com/n1analytics/encoding-service/

    The metadata and files (if any) are available to the public.
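
    To give a rough idea of what "turning PII into Bloom filter hashes" and "hash comparisons" can mean, here is a minimal sketch using only the Python standard library. It does not use or reproduce the clkhash or anonlink APIs; the function names, filter size, number of hash functions, and shared secret are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Split a normalized, space-padded string into character bigrams."""
    padded = f" {value.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes, size: int = 1024, num_hashes: int = 4) -> set:
    """Return the set of bit positions set in a keyed Bloom filter for `value`."""
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Keyed hash, so encodings cannot be recomputed without the shared secret.
            digest = hmac.new(secret, f"{k}:{gram}".encode("utf-8"), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice_similarity(a: set, b: set) -> float:
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"hypothetical-shared-secret"  # assumption: both parties agree on this key
smith = bloom_encode("Smith", secret)
smyth = bloom_encode("Smyth", secret)
jones = bloom_encode("Jones", secret)
print(dice_similarity(smith, smyth), dice_similarity(smith, jones))
```

    Because similar strings share most of their bigrams, their encodings overlap heavily, which is what lets a server compare records without seeing the underlying names.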

  3. synthetic-gold-database

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    PJ Gibson (2023). synthetic-gold-database [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-gold-database
    Available download formats: zip (9292035305 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    PJ Gibson
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Synthetic Gold

    This database represents a synthetic population of Nebraska from 1920-2022. It was created using this publicly available Github Repository that allows a user to make a synthetic population for a specific state. See that repository for an in-depth background for the project.

    Record Linkage

    One of the primary uses of this dataset is for training record linkage models. In public health, records often lack a single reliable unique person identifier (such as a Social Security Number). By creating a synthetic dataset with snapshots of the population each year from 1920-2022 with known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See below for some links to useful information on the record linkage process:

  4. E-Commerce Products Dataset For Record Linkage

    • kaggle.com
    zip
    Updated Nov 30, 2025
    Cite
    Furkan Gözükara (2025). E-Commerce Products Dataset For Record Linkage [Dataset]. https://www.kaggle.com/furkangozukara/ecommerce-products-dataset-for-record-linkage
    Available download formats: zip (215619488 bytes)
    Dataset updated
    Nov 30, 2025
    Authors
    Furkan Gözükara
    Description

    -> If you use Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234

    @article{10.1093/comjnl/bxab179, author = {Gözükara, Furkan and Özel, Selma Ayşe}, title = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}", journal = {The Computer Journal}, year = {2021}, month = {11}, abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. 
The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}", issn = {0010-4620}, doi = {10.1093/comjnl/bxab179}, url = {https://doi.org/10.1093/comjnl/bxab179}, note = {bxab179}, eprint = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf}, }

    -> elki-bundle-0.7.2-SNAPSHOT.jar is the ELKI bundle that we compiled from the GitHub source code of ELKI. The source code is dated 6 June 2016. The compile command is:
    ->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package
    ->-> GitHub repository of ELKI: https://github.com/elki-project/elki
    ->-> This bundle file was used for all of the experiments presented in the article.

    -> The Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset was composed as follows:
    ->-> The top 50 E-commerce websites operating in Turkey were crawled, and their attributes were extracted.
    ->-> The crawl ran from 2015-01-13 15:12:46 to 2015-01-17 19:07:53.
    ->-> Then 250 product offers from Vatanbilgisayar were randomly selected.
    ->-> Then the entire dataset was manually scanned to find which products sold on other E-commerce websites are the same as the selected ones.
    ->-> Then each product was classified accordingly.
    ->-> The dataset contains these products along with the following attributes: price (if available), title, categories (if available), free-text description (if available), wrapped features (if available), and crawled URL (the URL might have expired).

    -> The dataset files are provided as used in the study.
    -> ARFF files for All_Products and Only_Price_Having_Products are generated with raw term frequencies rather than the weighting schemes used in the article. The reason is that we tested these datasets only with our system, and since our system does incremental clustering, even if we provided TF-IDF weightings, they would not be the same as those used in the article. More information is provided in the article.
    ->-> For Macro_Average_Datasets we provide both raw frequency and TF-IDF weightings, as used in the experiments.

    -> There are 3 main folders:
    -> All_Products: This folder contains 1800 products.
    ->-> This is the entire collection that was manually labeled.
    ->-> They are from 250 different classes.
    -> Only_Price_Having_Products: This folder contains all of the products that have the price feature set.
    ->-> The collection has 1721 products from 250 classes.
    ->-> This is the dataset that we experimented with.
    -> Macro_Average_Datasets: This folder contains 100 datasets that we used to conduct more reliable experiments.
    ->-> Each dataset is composed by selecting 1000 different products from the price-having products dataset and then randomly ordering them...

  5. Jyutping Project - Raw Data and Clean Data

    • rdr.ucl.ac.uk
    application/csv
    Updated Aug 19, 2024
    Cite
    Joseph Lam (2024). Jyutping Project - Raw Data and Clean Data [Dataset]. http://doi.org/10.5522/04/26504347.v1
    Available download formats: application/csv
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    University College London
    Authors
    Joseph Lam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I only retained Chinese names and Hong Kong Government romanised English names. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and reviewing was required, so the code itself is not fully automated. The code is stored on my personal GitHub (https://github.com/Jo-Lam/Jyutping_project/tree/main). Please cite this data resource: doi:10.5522/04/26504347

  6. SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Nov 22, 2024
    Cite
    Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen (2024). SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets - WikidataTables2024R1 and WikidataTables2024R2 [Dataset]. http://doi.org/10.5281/zenodo.14207232
    Available download formats: application/gzip
    Dataset updated
    Nov 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.

    For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

    For 2024 edition, see: https://sem-tab-challenge.github.io/2024/

    Note on License: This data includes data from the following sources. Refer to each source for license details:
    - Wikidata https://www.wikidata.org/

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  7. Datasets for Out-of-KB Mention Discovery with Entity Linking

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Aug 10, 2023
    Cite
    Hang Dong; Jiaoyan Chen; Yuan He; Liu Yinan; Ian Horrocks (2023). Datasets for Out-of-KB Mention Discovery with Entity Linking [Dataset]. http://doi.org/10.5281/zenodo.8228371
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hang Dong; Jiaoyan Chen; Yuan He; Liu Yinan; Ian Horrocks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The repository contains datasets for out-of-KB mention discovery from texts, documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).

    Each data setting (as a sub-folder) contains train, valid, and test files and also 100 random sample files for each data split for debugging.

    Data folder names with “syn_full” at the end are synonym augmented data (each synonym as an entity) for the setting.

    Ontology .jsonl files have two versions each: the "syn_attr" setting treats synonyms as attributes, while the "syn_full" setting treats synonyms as entities.

    Data scripts are available at https://github.com/KRR-Oxford/BLINKout#data-scripts

    Acknowledgement of the data sources below:

    ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/

    MedMention dataset is from https://github.com/chanzuckerberg/MedMentions

    UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html

    SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html

    NILK dataset is from https://zenodo.org/record/6607514

    WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2

  8. ONC Patient Matching Algorithm Challenge Data

    • linkagelibrary.icpsr.umich.edu
    Updated Sep 20, 2019
    Cite
    Office of the National Coordinator for Health (2019). ONC Patient Matching Algorithm Challenge Data [Dataset]. http://doi.org/10.3886/E111962V1
    Dataset updated
    Sep 20, 2019
    Dataset authored and provided by
    Office of the National Coordinator for Health
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of the Patient Matching Algorithm Challenge is to bring about greater transparency and data on the performance of existing patient matching algorithms, spur the adoption of performance metrics for patient data matching algorithm vendors, and positively impact other aspects of patient matching such as deduplication and linking to clinical data. Participants will be provided a data set and will have their answers evaluated and scored against a master key. Up to 6 cash prizes will be awarded with a total purse of up to $75,000.00. https://www.patientmatchingchallenge.com/

    The test dataset used in the ONC Patient Matching Algorithm Challenge is available for download by students, researchers, or anyone else interested in additional analysis and patient matching algorithm development. More information about the Patient Matching Algorithm Challenge can be found at https://www.patientmatchingchallenge.com/

    The dataset containing 1 million patients was split into eight files of alphabetical groupings by the patient's last name, plus an additional file containing test patients with no last name recorded (Null). All files should be downloaded and merged for analysis. https://github.com/onc-healthit/patient-matching

  9. Synthetic Administrative Data: Census 1991, 2023

    • datacatalogue.ukdataservice.ac.uk
    Updated Feb 21, 2024
    Cite
    Shlomo, N, University of Manchester; Kim, M, University of Manchester (2024). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
    Dataset updated
    Feb 21, 2024
    Authors
    Shlomo, N, University of Manchester; Kim, M, University of Manchester
    Area covered
    United Kingdom
    Description

    We create a synthetic administrative dataset that mimics the properties of a real administrative dataset according to specifications by the ONS, to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin). Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography, and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1033664 individuals.
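
    The delete / move / duplicate procedure described above might look roughly like the following sketch. It is not the authors' code: the field names, the rate values, the geography labels, and the choice to treat the three operations as mutually exclusive per record are all assumptions made for illustration.

```python
import random


def synthesize_admin(census, rates, geographies, seed=0):
    """Derive a synthetic administrative list from census records.

    `rates` maps (ethnic_group, gender) to the proportions of records to
    delete, move, or duplicate; the three operations are assumed to be
    mutually exclusive for any single record.
    """
    rng = random.Random(seed)
    admin = []
    for record in census:
        r = rates[(record["ethnic_group"], record["gender"])]
        elsewhere = [g for g in geographies if g != record["geography"]]
        u = rng.random()
        if u < r["delete"]:
            continue  # record is missing from the administrative data
        if u < r["delete"] + r["move"]:
            admin.append({**record, "geography": rng.choice(elsewhere)})
        elif u < r["delete"] + r["move"] + r["duplicate"]:
            admin.append(dict(record))
            admin.append({**record, "geography": rng.choice(elsewhere)})
        else:
            admin.append(dict(record))
    return admin


# Hypothetical inputs, made up for illustration only.
census = [
    {"id": i, "ethnic_group": "White", "gender": "female", "geography": "A"}
    for i in range(1000)
]
rates = {("White", "female"): {"delete": 0.05, "move": 0.03, "duplicate": 0.02}}
admin = synthesize_admin(census, rates, geographies=["A", "B", "C"])
print(len(census), len(admin))
```

    Varying the rates per group is what produces the under-coverage and over-coverage that quality indicators are meant to detect.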

    National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data, one of the main sources of error is coverage and representation to the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative to the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data. 
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.

  10. Frequency Report Utility

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Sep 3, 2025
    + more versions
    Cite
    (2025). Frequency Report Utility [Dataset]. https://healthdata.gov/d/bsvr-yipx
    Available download formats: xlsx, csv, xml
    Dataset updated
    Sep 3, 2025
    Description

    The Frequency Report Utility can generate a report that provides an overview of the number and type of raw responses being generated for a particular data element to help title IV-E agencies assess and improve data quality.

    Metadata-only record linking to the original dataset. Open original dataset below.

  11. Data from: Genetic Architecture Reconciles Linkage and Association Studies...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Dec 21, 2023
    Cite
    Loic Yengo (2023). Genetic Architecture Reconciles Linkage and Association Studies of Complex Traits [Dataset]. http://doi.org/10.5281/zenodo.10416893
    Available download formats: application/gzip
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Loic Yengo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 21, 2023
    Description

    This (zipped) folder contains 3 sub-folders:

    #**********************************************************************************************************
    The "bin" folder contains functions and genetic maps needed for analyses
    bin \
    predLink.R - function to predict linkage
    sibREML_v0.1.1.R - function to run SibREML
    sim-sib-array.R - script to simulate sib-pairs from parental haplotypes
    Summarised_genetic_map_bcf.txt - genetic map per 0.5-cM long segments, based on map from bcftools
    (BCFtools: https://samtools.github.io/bcftools/bcftools.html)
    Summarised_genetic_map_OMNI.txt - genetic map per 0.5-cM long segments, based on OMNI map
    (https://github.com/joepickrell/1000-genomes-genetic-maps/tree/master/interpolated_OMNI)
    #**********************************************************************************************************

    #**********************************************************************************************************
    The "SIM" folder contains the simulation pipeline (scripts 01-15) as well as IBD sharing and simulated phenotypes for Simulated sib-pairs.
    SIM \
    01_sim-sib-array.sh *pre-run*
    02_bed_recode_bcf_map.sh *pre-run*
    03_make_merlin.R *pre-run*
    04_error_merlin.sh *pre-run*
    05_merlin_IBD.sh *pre-run*
    06_sample_causal_snps.R *pre-run*
    07_simulate_pheno.sh *pre-run*
    08_bhat_gwas.R *can be run using provided data*
    09_Linkage_VH.R *can be run using provided data*
    10_predLink.R *can be run using provided data*
    11_phi_hat.R *can be run using provided data*
    12_IBD_Mb.R *can be run using provided data*
    13_IBD_cM_recombrate_stratified.R *can be run using provided data*
    14_SibREML.R *can be run using provided data*
    15_SibREML_stratified_Q4.R *can be run using provided data*
    causal_snps \ *provided causal SNPs*
    IBD_results \ *provided IBD-probabilities for 1000 simulated sib-pairs*
    Linkage_VH_results \
    pheno \ *provided simulated phenotypes (h2=1) for 8 genetic architectures*
    Phi_hat_results.txt
    predicted \
    README
    SibREML_results.txt
    SibREML_stratified_Q4.txt

    The data can be used to run Linkage analysis, predict linkage, estimate phi_hat,
    as well as estimate non-stratified and recombination rate stratified sib-heritability (h2_FS and c).
    The README is provided within the folder.
    #**********************************************************************************************************

    #**********************************************************************************************************
    The "HT_BMI" folder contains data and scripts to predict linkage and estimate phi_hat for height and BMI.
    HT_BMI \
    01_predLink_HT_BMI.R
    02_phi_hat_HT_BMI.R
    gws_sumstats \ *provided GWAS summary statistics to predict linkage for height and BMI*
    Linkage_results \ *provided linkage meta-analysis results for height and BMI from this study*
    Phi_hat_results_HT_BMI.txt
    PREDLINK_bmi.txt
    PREDLINK_height.txt
    README
    The README is provided within the folder.
    #**********************************************************************************************************

  12. Public Utility Data Liberation Project (PUDL) Data Release

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 14, 2025
    Cite
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella (2025). Public Utility Data Liberation Project (PUDL) Data Release [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3653158
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Catalyst Cooperative
    Authors
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PUDL v2025.2.0 Data Release

    This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.

    One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.

    Some potentially breaking changes to be aware of:

    In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.

    We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.

    Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.

    New Data

    EIA 176

    Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.

    Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.

    EIA 860

    Added EIA 860 Multifuel table. See #3438 and #3946.

    FERC 1

    Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:

    out_ferc1_yearly_detailed_income_statements

    out_ferc1_yearly_detailed_balance_sheet_assets

    out_ferc1_yearly_detailed_balance_sheet_liabilities

    SEC Form 10-K Parent-Subsidiary Ownership

    We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.

    See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:

    out_sec10k_parents_and_subsidiaries

    core_sec10k_quarterly_filings

    core_sec10k_quarterly_exhibit_21_company_ownership

    core_sec10k_quarterly_company_information

    Expanded Data Coverage

    EPA CEMS

    Added 2024 Q4 of CEMS data. See #4041 and #4052.

    EPA CAMD EIA Crosswalk

    In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.

    The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs: some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issue #4039 and PR #4056 for a discussion of the changes observed in the course of this update.

    EIA 860M

    Added EIA 860m through December 2024. See #4038 and #4047.

    EIA 923

    Added EIA 923 monthly data through September 2024. See #4038 and #4047.

    EIA Bulk Electricity Data

    Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.

    EIA 930

    Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.

    Bug Fixes

    Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.

    Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.

    Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.

    Quality of Life Improvements

    We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.

    Other PUDL v2025.2.0 Resources

    PUDL v2025.2.0 Data Dictionary

    PUDL v2025.2.0 Documentation

    PUDL in the AWS Open Data Registry

    PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/

    PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/

    Zenodo archive of the PUDL GitHub repo for this release

    PUDL v2025.2.0 release on GitHub

    PUDL v2025.2.0 package in the Python Package Index (PyPI)
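
    As a rough illustration of how the bucket locations listed above might be used, here is a minimal Python sketch. The bucket roots and the release name come from this page; the per-table `<table>.parquet` layout, the helper names `pudl_table_url` and `load_table`, and the exact object names are assumptions made for illustration, so check the PUDL documentation for the real paths. Table names are spelled as they appear in these notes and may differ from the actual object keys.

```python
"""Build storage URLs for PUDL v2025.2.0 tables (illustrative sketch)."""

# Bucket roots quoted in the release notes above.
BUCKETS = {
    "s3": "s3://pudl.catalyst.coop",
    "gs": "gs://pudl.catalyst.coop",
}


def pudl_table_url(table: str, release: str = "v2025.2.0", scheme: str = "s3") -> str:
    """Return the URL of one table, ASSUMING a '<release>/<table>.parquet' layout."""
    if scheme not in BUCKETS:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return f"{BUCKETS[scheme]}/{release}/{table}.parquet"


def load_table(table: str, release: str = "v2025.2.0"):
    """Read one table anonymously from the public S3 bucket.

    Needs network access plus pandas, pyarrow and s3fs installed.
    """
    import pandas as pd  # imported lazily so the URL helper has no dependencies

    return pd.read_parquet(
        pudl_table_url(table, release),
        storage_options={"anon": True},  # unsigned requests to a public bucket
    )


if __name__ == "__main__":
    # Only builds and prints the URL; no network access is needed for this step.
    print(pudl_table_url("out_sec10k_parents_and_subsidiaries"))
```

    Keeping the URL construction separate from the download makes the path convention easy to check or change without touching the network. The GCS bucket is described as requester-pays, so reading from it would need authenticated, billed access rather than the anonymous option shown here.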

    Contact Us

    If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:

    Follow us on GitHub

    Use the PUDL GitHub issue tracker to let us know about any bugs or data issues you encounter

    GitHub Discussions is where we provide user support.

    Watch our GitHub Project to see what we're working on.

    Email us at hello@catalyst.coop for private communications.

    On Mastodon: @CatalystCoop@mastodon.energy

    On BlueSky: @catalyst.coop

    On Twitter: @CatalystCoop

    Connect with us on LinkedIn

    Play with our data and notebooks on Kaggle

    Combine our data with ML models on HuggingFace

    Learn more about us on our website: https://catalyst.coop

    Subscribe to our announcements list for email updates.

  13. CFSR Round 3 Statewide Data Indicators Workbook

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Sep 4, 2025
    Cite
    (2025). CFSR Round 3 Statewide Data Indicators Workbook [Dataset]. https://healthdata.gov/d/pguh-hcak
    Available download formats: xlsx, csv, xml
    Dataset updated: Sep 4, 2025
    Description

    This workbook provides state-by-state performance data for Round 3 of the Child and Family Services Reviews in addition to national performance comparisons for the past 12-month reporting period included in data profiles transmitted to states in February 2021.

    Metadata-only record linking to the original dataset. Open original dataset below.

  14. How do I submit a letter of intent?

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Sep 4, 2025
    Cite
    (2025). How do I submit a letter of intent? [Dataset]. https://healthdata.gov/d/ypuq-bya9
    Available download formats: xlsx, csv, xml
    Dataset updated: Sep 4, 2025
    Description

    ACF Children's Bureau resource

    Metadata-only record linking to the original dataset. Open original dataset below.

  15. piq2011-01

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). piq2011-01 [Dataset]. https://healthdata.gov/d/y96k-ga24
    Available download formats: xml, csv, xlsx
    Dataset updated: Sep 3, 2025
    Description

    ACF Agency Wide resource

    Metadata-only record linking to the original dataset. Open original dataset below.

  16. Information Memorandum (IM-16-03)

    • healthdata.gov
    • odgavaprod.ogopendata.com
    • +1more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). Information Memorandum (IM-16-03) [Dataset]. https://healthdata.gov/d/3tv3-uwzp
    Available download formats: xlsx, csv, xml
    Dataset updated: Sep 3, 2025
    Description

    This Information Memorandum (IM) informs state and tribal IV-E agencies about the publication of the Executive’s Guide to CCWIS.

    Metadata-only record linking to the original dataset. Open original dataset below.

  17. Information Memorandum (IM-01-07)

    • healthdata.gov
    • odgavaprod.ogopendata.com
    • +2more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). Information Memorandum (IM-01-07) [Dataset]. https://healthdata.gov/d/ccgq-mhjh
    Available download formats: csv, xml, xlsx
    Dataset updated: Sep 3, 2025
    Description

    This Information Memorandum (IM) provides information and guidance for use by States and Regional Offices on Updated National Standards for the Child and Family Service Reviews and Guidance on Program Improvement Plans.

    Metadata-only record linking to the original dataset. Open original dataset below.

  18. Information Memorandum (IM-01-09)

    • healthdata.gov
    • odgavaprod.ogopendata.com
    • +1more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). Information Memorandum (IM-01-09) [Dataset]. https://healthdata.gov/d/yev2-tpss
    Available download formats: xlsx, csv, xml
    Dataset updated: Sep 3, 2025
    Description

    This Information Memorandum (IM) provides States, Tribes, and Territories a comprehensive list of those Children's Bureau policies withdrawn in 2000 and 2001.

    Metadata-only record linking to the original dataset. Open original dataset below.

  19. Information Memorandum (IM-92-04)

    • healthdata.gov
    • data.virginia.gov
    • +2more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). Information Memorandum (IM-92-04) [Dataset]. https://healthdata.gov/d/5k6b-ug9z
    Available download formats: xml, csv, xlsx
    Dataset updated: Sep 3, 2025
    Description

    This Information Memorandum (IM) addresses the Aid to Families with Dependent Children expansion of the definition of specified caretaker relative as it relates to Title IV-E eligibility.

    Metadata-only record linking to the original dataset. Open original dataset below.

  20. The Comprehensive Child Welfare Information System Final Rule: Overview

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Sep 3, 2025
    Cite
    (2025). The Comprehensive Child Welfare Information System Final Rule: Overview [Dataset]. https://healthdata.gov/d/9vfv-dh33
    Available download formats: xml, csv, xlsx
    Dataset updated: Sep 3, 2025
    Description

    This document provides an overview of CCWIS and key provisions of the final rule.

    Metadata-only record linking to the original dataset. Open original dataset below.
