34 datasets found

Simulated Data for Patient Time Series Record Linkage
figshare.com
zip
Updated Jun 3, 2023
Cite
Ahmed Soliman (2023). Simulated Data for Patient Time Series Record Linkage [Dataset]. http://doi.org/10.6084/m9.figshare.19224786.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.19224786.v1
Dataset updated
Jun 3, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ahmed Soliman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This simulated dataset consists of two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.
1. ergo.csv contains heart rate time series data for 1600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file consists of three field values: Ergo_ID, Heart Rate (BPM), and timestamp.
2. pat.csv contains only four sample readings from each of the patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.
The goal is to link patients (identified by their patient_ID in the pat.csv file) to their corresponding ergometric tests (identified by their Ergo_ID in the ergo.csv file). Linkage is based solely on matching the timestamp-value pairs from both files.
The time series record linkage task described above is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2
Data is simulated such that correctly linked/matched identifiers satisfy the following formula:
|Ergo_ID - patient_ID| mod 104 == 0
This formula is useful for evaluating the performance of the linkage algorithm.
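The evaluation formula above can be turned into a small scoring helper. The following is a minimal sketch, not part of tslink2: the function names are hypothetical, and the modulus is taken exactly as it is rendered in this description (104), so it is exposed as a parameter in case the published value differs.

```python
from typing import Iterable, Tuple

# Modulus exactly as rendered in the dataset description above.
# Assumption: kept as a parameter in case the rendered value is not the intended one.
MODULUS = 104


def is_correct_link(ergo_id: int, patient_id: int, modulus: int = MODULUS) -> bool:
    """Return True when a proposed (Ergo_ID, patient_ID) link satisfies the formula."""
    return abs(ergo_id - patient_id) % modulus == 0


def linkage_precision(links: Iterable[Tuple[int, int]], modulus: int = MODULUS) -> float:
    """Fraction of proposed links that satisfy the formula (0.0 for no links)."""
    links = list(links)
    if not links:
        return 0.0
    correct = sum(is_correct_link(e, p, modulus) for e, p in links)
    return correct / len(links)


if __name__ == "__main__":
    # Made-up identifier pairs, purely for illustration.
    predicted = [(208, 0), (105, 1), (300, 92), (17, 5)]
    print(linkage_precision(predicted))
```

Only precision is computed here; recall would additionally need the full set of true links, which the formula alone does not enumerate.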
Privacy Preserving Linkage Software
data.gov.au
zip
Updated Apr 2, 2019
+ more versions
Cite
The Commonwealth Scientific and Industrial Research Organisation (2019). Privacy Preserving Linkage Software [Dataset]. https://data.gov.au/dataset/ds-dap-csiro%3A26733
Available download formats: zip
Dataset updated
Apr 2, 2019
Dataset provided by
The Commonwealth Scientific and Industrial Research Organisation
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
A set of software tools for privacy preserving record linkage.
anonlink: A library for carrying out the low-level hash comparisons required server side. Available on GitHub at http://github.com/n1analytics/anonlink/
entity-service: Server-side component of the private record linkage REST API, utilizing the anonlink library.
clkhash: A client utility and library for turning personally identifiable information (PII) into Bloom filter hashes. Available on GitHub at https://github.com/n1analytics/clkhash/
encoding-service: A REST API wrapper around clkhash for encoding PII data into CLKs. Available on GitHub at https://github.com/n1analytics/encoding-service/
The metadata and files (if any) are available to the public.
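To illustrate the general idea behind Bloom filter hashing of PII and the server-side comparison of the resulting hashes, here is a minimal sketch using only the Python standard library. It does not use or reproduce the anonlink or clkhash APIs. The function names, the filter size, the number of hash functions, and the choice of character bigrams with keyed HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import List, Set


def bigrams(value: str) -> List[str]:
    """Split a normalised, space-padded string into character bigrams."""
    padded = f" {value.strip().lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(value: str, key: bytes, size: int = 1024, num_hashes: int = 4) -> Set[int]:
    """Return the set of bit positions set in a Bloom filter of `size` bits.

    Each bigram is hashed `num_hashes` times with a keyed HMAC, so parties
    sharing the secret key produce comparable encodings.
    """
    bits: Set[int] = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hmac.new(key, f"{i}:{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed between data providers
    smith = bloom_encode("Smith", key)
    print(dice(smith, bloom_encode("Smyth", key)))  # similar names share most bits
    print(dice(smith, bloom_encode("Jones", key)))  # unrelated names share few bits
```

The comparison side only ever sees bit positions, not the original strings, which is what allows similarity scoring without exchanging the PII itself.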
synthetic-gold-database
kaggle.com
zip
Updated Aug 4, 2023
Cite
PJ Gibson (2023). synthetic-gold-database [Dataset]. https://www.kaggle.com/datasets/pjgibson/synthetic-gold-database
Available download formats: zip (9292035305 bytes)
Dataset updated
Aug 4, 2023
Authors
PJ Gibson
License
http://www.gnu.org/licenses/agpl-3.0.html
Description
Synthetic Gold

This database represents a synthetic population of Nebraska from 1920-2022. It was created using this publicly available Github Repository that allows a user to make a synthetic population for a specific state. See that repository for an in-depth background for the project.

Record Linkage

One of the primary uses of this dataset is for training record linkage models. Coming from a public health background, we often find that health records lack a single reliable unique person identifier (like Social Security Number). By creating a synthetic dataset with snapshots of the population each year from 1920-2022 with known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See below for some links to useful information on the record linkage process:

Building a Scalable Record Linkage System with Apache Spark, Python 3, and Machine Learning YouTube video.

Learning Blocking Schemes for Record Linkage article that covers blocking schemas (and record linkage) in good detail for beginners.

Record Linkage, a python library with great record linkage functions, for local python users (non-cloud). Has great markdown coverage of the different steps in a record linkage process - Preprocessing, Indexing, Comparing, Classification, Evaluation.

Jellyfish, a python library for string comparison functions.
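The indexing (blocking), comparing, and classification steps mentioned above can be sketched with the Python standard library alone. This is a simplified illustration, not the API of the Record Linkage or Jellyfish libraries: the record fields (id, name, birth_year), the 0.85 threshold, and the use of difflib's similarity ratio in place of a dedicated string metric are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_by(records, key):
    """Indexing step: group records by a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def candidate_pairs(left, right, key):
    """Yield only the pairs that share the same blocking key value."""
    right_blocks = block_by(right, key)
    for a in left:
        for b in right_blocks.get(a[key], []):
            yield a, b


def similarity(a: str, b: str) -> float:
    """Comparing step: case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, block_key="birth_year", name_key="name", threshold=0.85):
    """Classification step: accept candidate pairs above a fixed threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(left, right, block_key)
        if similarity(a[name_key], b[name_key]) >= threshold
    ]


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    left = [
        {"id": "A1", "name": "Jonathan Smith", "birth_year": 1950},
        {"id": "A2", "name": "Mary Jones", "birth_year": 1962},
    ]
    right = [
        {"id": "B1", "name": "Jonathon Smith", "birth_year": 1950},
        {"id": "B2", "name": "Mary Jones", "birth_year": 1971},
        {"id": "B3", "name": "Peter Brown", "birth_year": 1962},
    ]
    print(link(left, right))
```

Blocking trades recall for speed: the identical "Mary Jones" records are never compared because their blocking keys differ, which is the kind of error a "golden" dataset with known identifiers makes measurable.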
E-Commerce Products Dataset For Record Linkage
kaggle.com
zip
Updated Nov 30, 2025
Cite
Furkan Gözükara (2025). E-Commerce Products Dataset For Record Linkage [Dataset]. https://www.kaggle.com/furkangozukara/ecommerce-products-dataset-for-record-linkage
Available download formats: zip (215619488 bytes)
Dataset updated
Nov 30, 2025
Authors
Furkan Gözükara
Description
-> If you use Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234

@article{10.1093/comjnl/bxab179,
  author = {Gözükara, Furkan and Özel, Selma Ayşe},
  title = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}",
  journal = {The Computer Journal},
  year = {2021},
  month = {11},
  abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}",
  issn = {0010-4620},
  doi = {10.1093/comjnl/bxab179},
  url = {https://doi.org/10.1093/comjnl/bxab179},
  note = {bxab179},
  eprint = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf},
}

-> elki-bundle-0.7.2-SNAPSHOT.jar is the ELKI bundle that we compiled from the GitHub source code of ELKI. The date of the source code is 6 June 2016. The compile command is as below:
->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package
->-> GitHub repository of ELKI: https://github.com/elki-project/elki
->-> This bundle file is used for all of the experiments that are presented in the article

-> The Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset is composed as below:
->-> The top 50 E-commerce websites that operate in Turkey were crawled, and their attributes were extracted.
->-> The crawling was done between 2015-01-13 15:12:46 and 2015-01-17 19:07:53.
->-> Then 250 product offers from Vatanbilgisayar were randomly selected.
->-> Then the entire dataset was manually scanned to find which products sold on other E-commerce websites are the same as the selected ones.
->-> Then each product was classified accordingly.
->-> This dataset contains these products along with their price (if available), title, categories (if available), free text description (if available), wrapped features (if available), and crawled URL (the URL might have expired) attributes

-> The dataset files are provided as used in the study.
-> ARFF files for All_Products and Only_Price_Having_Products are generated with raw term frequencies rather than the weighting schemes used in the article. The reason is that we tested these datasets only with our system, and since our system does incremental clustering, even if we provided TF-IDF weightings, they would not be the same as those used in the article. More information is provided in the article.
->-> For Macro_Average_Datasets we provide both raw frequency and TF-IDF scheme weightings as used in the experiments

-> There are 3 main folders
-> All_Products: This folder contains 1800 products.
->-> This is the entire collection that is manually labeled.
->-> They are from 250 different classes.
-> Only_Price_Having_Products: This folder contains all of the products that have the price feature set.
->-> The collection has 1721 products from 250 classes.
->-> This is the dataset that we experimented with.
-> Macro_Average_Datasets: This folder contains 100 datasets that we used to conduct more reliable experiments.
->-> Each dataset is composed by selecting 1000 different products from the price having products dataset and then randomly ordering them...
Jyutping Project - Raw Data and Clean Data
rdr.ucl.ac.uk
application/csv
Updated Aug 19, 2024
Cite
Joseph Lam (2024). Jyutping Project - Raw Data and Clean Data [Dataset]. http://doi.org/10.5522/04/26504347.v1
Available download formats: application/csv
Unique identifier
https://doi.org/10.5522/04/26504347.v1
Dataset updated
Aug 19, 2024
Dataset provided by
University College London
Authors
Joseph Lam
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I retained only Chinese names and Hong Kong Government Romanised English names. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and reviewing was required, so the code itself is not fully automated. The code is stored on my personal GitHub: https://github.com/Jo-Lam/Jyutping_project/tree/main
Please cite this data resource: doi:10.5522/04/26504347
SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph...
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Nov 22, 2024
Cite
Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen (2024). SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets - WikidataTables2024R1 and WikidataTables2024R2 [Dataset]. http://doi.org/10.5281/zenodo.14207232
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.14207232
Dataset updated
Nov 22, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.

For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

For 2024 edition, see: https://sem-tab-challenge.github.io/2024/

Note on License: This data includes data from the following sources. Refer to each source for license details:
- Wikidata https://www.wikidata.org/

THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Datasets for Out-of-KB Mention Discovery with Entity Linking
zenodo.org
data.niaid.nih.gov
+1 more
Updated Aug 10, 2023
Cite
Hang Dong; Jiaoyan Chen; Yuan He; Liu Yinan; Ian Horrocks (2023). Datasets for Out-of-KB Mention Discovery with Entity Linking [Dataset]. http://doi.org/10.5281/zenodo.8228371
Unique identifier
https://doi.org/10.5281/zenodo.8228371
Dataset updated
Aug 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hang Dong; Jiaoyan Chen; Yuan He; Liu Yinan; Ian Horrocks
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The repository contains datasets for out-of-KB mention discovery from texts, documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).

Each data setting (as a sub-folder) contains train, valid, and test files and also 100 random sample files for each data split for debugging.

Data folder names with “syn_full” at the end are synonym augmented data (each synonym as an entity) for the setting.

Ontology .jsonl files come in two versions each: the "syn_attr" setting treats synonyms as attributes, and the "syn_full" setting treats synonyms as entities.

Data scripts are available at https://github.com/KRR-Oxford/BLINKout#data-scripts

Acknowledgement of the data sources below:

ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/

MedMention dataset is from https://github.com/chanzuckerberg/MedMentions

UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html

SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html

NILK dataset is from https://zenodo.org/record/6607514

WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2
ONC Patient Matching Algorithm Challenge Data
linkagelibrary.icpsr.umich.edu
Updated Sep 20, 2019
Cite
Office of the National Coordinator for Health (2019). ONC Patient Matching Algorithm Challenge Data [Dataset]. http://doi.org/10.3886/E111962V1
Unique identifier
https://doi.org/10.3886/E111962V1
Dataset updated
Sep 20, 2019
Dataset authored and provided by
Office of the National Coordinator for Health
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The goal of the Patient Matching Algorithm Challenge is to bring about greater transparency and data on the performance of existing patient matching algorithms, spur the adoption of performance metrics for patient data matching algorithm vendors, and positively impact other aspects of patient matching such as deduplication and linking to clinical data. Participants will be provided a data set and will have their answers evaluated and scored against a master key. Up to 6 cash prizes will be awarded with a total purse of up to $75,000.00.
The test dataset used in the ONC Patient Matching Algorithm Challenge is available for download by students, researchers, or anyone else interested in additional analysis and patient matching algorithm development. More information about the Patient Matching Algorithm Challenge can be found at https://www.patientmatchingchallenge.com/
The dataset containing 1 million patients was split into eight files of alphabetical groupings by the patient's last name, plus an additional file containing test patients with no last name recorded (Null). All files should be downloaded and merged for analysis.
https://github.com/onc-healthit/patient-matching
Synthetic Administrative Data: Census 1991, 2023
datacatalogue.ukdataservice.ac.uk
Updated Feb 21, 2024
Cite
Shlomo, N, University of Manchester; Kim, M, University of Manchester (2024). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
Unique identifier
https://doi.org/10.5255/UKDA-SN-856310
Dataset updated
Feb 21, 2024
Authors
Shlomo, N, University of Manchester; Kim, M, University of Manchester
Area covered
United Kingdom
Description
We create a synthetic administrative dataset for use in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin). The dataset mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1033664 individuals.
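The delete / move / duplicate procedure described above can be sketched as follows. This is an illustrative sketch only, not the authors' code: the field names (ethnic_group, sex, area), the per-group rate tuples, and the way a "different geography" is chosen are assumptions, and the actual proportions specified by the ONS are not given here.

```python
import random


def other_area(current, areas, rng):
    """Pick an area different from the record's current one."""
    return rng.choice([a for a in areas if a != current])


def perturb(records, rates, areas, seed=0):
    """Delete, move, or duplicate records according to per-group proportions.

    `rates` maps (ethnic_group, sex) to (p_delete, p_move, p_duplicate);
    each record undergoes at most one of the three operations.
    """
    rng = random.Random(seed)
    result = []
    for rec in records:
        p_delete, p_move, p_dup = rates[(rec["ethnic_group"], rec["sex"])]
        u = rng.random()
        if u < p_delete:
            continue  # under-coverage: the record is dropped
        if u < p_delete + p_move:
            # the record appears only in a different geography
            result.append({**rec, "area": other_area(rec["area"], areas, rng)})
            continue
        result.append(dict(rec))
        if u < p_delete + p_move + p_dup:
            # over-coverage: an extra copy appears in a different geography
            result.append({**rec, "area": other_area(rec["area"], areas, rng)})
    return result


if __name__ == "__main__":
    # Made-up records and rates, purely for illustration.
    people = [
        {"id": i, "ethnic_group": "White", "sex": "female", "area": "A"}
        for i in range(10)
    ]
    rates = {("White", "female"): (0.1, 0.1, 0.1)}
    print(len(perturb(people, rates, areas=["A", "B", "C"])))
```

Fixing the seed keeps the perturbation reproducible, which matters when the resulting dataset is used to check quality indicators against known deletion and duplication rates.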
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.
Frequency Report Utility
healthdata.gov
data.virginia.gov
+1 more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). Frequency Report Utility [Dataset]. https://healthdata.gov/d/bsvr-yipx
Available download formats: xlsx, csv, xml
Dataset updated
Sep 3, 2025
Description
The Frequency Report Utility can generate a report that provides an overview of the number and type of raw responses being generated for a particular data element to help title IV-E agencies assess and improve data quality.

Metadata-only record linking to the original dataset. Open original dataset below.
Data from: Genetic Architecture Reconciles Linkage and Association Studies...
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Dec 21, 2023
Cite
Loic Yengo (2023). Genetic Architecture Reconciles Linkage and Association Studies of Complex Traits [Dataset]. http://doi.org/10.5281/zenodo.10416893
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.10416893
Dataset updated
Dec 21, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Loic Yengo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 21, 2023
Description
This (zipped) folder contains 3 sub-folders:

#**********************************************************************************************************
The "bin" folder contains functions and genetic maps needed for analyses
bin \
predLink.R - function to predict linkage
sibREML_v0.1.1.R - function to run SibREML
sim-sib-array.R - script to simulate sib-pairs from parental haplotypes
Summarised_genetic_map_bcf.txt - genetic map per 0.5-cM long segments, based on map from bcftools
(BCFtools: https://samtools.github.io/bcftools/bcftools.html)
Summarised_genetic_map_OMNI.txt - genetic map per 0.5-cM long segments, based on OMNI map
(https://github.com/joepickrell/1000-genomes-genetic-maps/tree/master/interpolated_OMNI)
#**********************************************************************************************************

#**********************************************************************************************************
The "SIM" folder contains the simulation pipeline (scripts 01-15) as well as IBD sharing and simulated phenotypes for Simulated sib-pairs.
SIM \
01_sim-sib-array.sh *pre-run*
02_bed_recode_bcf_map.sh *pre-run*
03_make_merlin.R *pre-run*
04_error_merlin.sh *pre-run*
05_merlin_IBD.sh *pre-run*
06_sample_causal_snps.R *pre-run*
07_simulate_pheno.sh *pre-run*
08_bhat_gwas.R *can be run using provided data*
09_Linkage_VH.R *can be run using provided data*
10_predLink.R *can be run using provided data*
11_phi_hat.R *can be run using provided data*
12_IBD_Mb.R *can be run using provided data*
13_IBD_cM_recombrate_stratified.R *can be run using provided data*
14_SibREML.R *can be run using provided data*
15_SibREML_stratified_Q4.R *can be run using provided data*
causal_snps \ *provided causal SNPs*
IBD_results \ *provided IBD-probabilities for 1000 simulated sib-pairs*
Linkage_VH_results \
pheno \ *provided simulated phenotypes (h2=1) for 8 genetic architectures*
Phi_hat_results.txt
predicted \
README
SibREML_results.txt
SibREML_stratified_Q4.txt

The data can be used to run Linkage analysis, predict linkage, estimate phi_hat,
as well as estimate non-stratified and recombination rate stratified sib-heritability (h2_FS and c).
The README is provided within the folder.
#**********************************************************************************************************

#**********************************************************************************************************
The "HT_BMI" folder contains data and scripts to predict linkage and estimate phi_hat for height and BMI.
HT_BMI \
01_predLink_HT_BMI.R
02_phi_hat_HT_BMI.R
gws_sumstats \ *provided GWAS summary statistics to predict linkage for height and BMI*
Linkage_results \ *provided linkage meta-analysis results for height and BMI from this study*
Phi_hat_results_HT_BMI.txt
PREDLINK_bmi.txt
PREDLINK_height.txt
README
The README is provided within the folder.
#**********************************************************************************************************
Public Utility Data Liberation Project (PUDL) Data Release
data.niaid.nih.gov
zenodo.org
Updated Feb 14, 2025
Cite
Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella (2025). Public Utility Data Liberation Project (PUDL) Data Release [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3653158
Dataset updated
Feb 14, 2025
Dataset provided by
Catalyst Cooperative
Authors
Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PUDL v2025.2.0 Data Release

This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.

One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.

Some potentially breaking changes to be aware of:

In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.

We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.

Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.

New Data

EIA 176

Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.

Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.

EIA 860

Added EIA 860 Multifuel table. See #3438 and #3946.

FERC 1

Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:

out_ferc1_yearly_detailed_income_statements

out_ferc1_yearly_detailed_balance_sheet_assets

out_ferc1_yearly_detailed_balance_sheet_liabilities

SEC Form 10-K Parent-Subsidiary Ownership

We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.

See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:

out_sec10k_parents_and_subsidiaries

core_sec10k_quarterly_filings

core_sec10k_quarterly_exhibit_21_company_ownership

core_sec10k_quarterly_company_information

Expanded Data Coverage

EPA CEMS

Added 2024 Q4 of CEMS data. See #4041 and #4052.

EPA CAMD EIA Crosswalk

In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.

The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.

EIA 860M

Added EIA 860M through December 2024. See #4038 and #4047.

EIA 923

Added EIA 923 monthly data through September 2024. See #4038 and #4047.

EIA Bulk Electricity Data

Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.

EIA 930

Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.

Bug Fixes

Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.

Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.

Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.

Quality of Life Improvements

We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.

Other PUDL v2025.2.0 Resources

PUDL v2025.2.0 Data Dictionary

PUDL v2025.2.0 Documentation

PUDL in the AWS Open Data Registry

PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/

PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/

Zenodo archive of the PUDL GitHub repo for this release

PUDL v2025.2.0 release on GitHub

PUDL v2025.2.0 package in the Python Package Index (PyPI)
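
A minimal sketch of how the bucket paths listed above might be used to address one of this release's tables. It assumes each table is published as a Parquet file named `<table_name>.parquet` directly under the versioned prefix. That layout, the helper names `table_url` and `read_table`, and the use of pandas with s3fs for anonymous S3 access are assumptions for illustration and are not stated in these release notes.

```python
# Versioned prefix of the free, public S3 bucket named in the release notes.
S3_PREFIX = "s3://pudl.catalyst.coop/v2025.2.0/"


def table_url(table_name: str, prefix: str = S3_PREFIX) -> str:
    """Build the URL of one table, ASSUMING a flat '<table>.parquet' layout."""
    return f"{prefix.rstrip('/')}/{table_name}.parquet"


def read_table(table_name: str):
    """Load one table into a DataFrame (hypothetical helper).

    Needs pandas, pyarrow and s3fs installed, plus network access.
    """
    import pandas as pd  # imported lazily so table_url works without pandas

    # anon=True asks s3fs for unauthenticated access to the public bucket.
    return pd.read_parquet(table_url(table_name), storage_options={"anon": True})


if __name__ == "__main__":
    print(table_url("out_sec10k_parents_and_subsidiaries"))
```

Keeping the URL construction separate from the download makes it possible to check the path against the data dictionary before fetching anything.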

Contact Us

If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:

Follow us on GitHub

Use the PUDL GitHub issue tracker to let us know about any bugs or data issues you encounter

GitHub Discussions is where we provide user support.

Watch our GitHub Project to see what we're working on.

Email us at hello@catalyst.coop for private communications.

On Mastodon: @CatalystCoop@mastodon.energy

On BlueSky: @catalyst.coop

On Twitter: @CatalystCoop

Connect with us on LinkedIn

Play with our data and notebooks on Kaggle

Combine our data with ML models on HuggingFace

Learn more about us on our website: https://catalyst.coop

Subscribe to our announcements list for email updates.
CFSR Round 3 Statewide Data Indicators Workbook
healthdata.gov
data.virginia.gov
+1more
csv, xlsx, xml
Updated Sep 4, 2025
+ more versions
Cite
(2025). CFSR Round 3 Statewide Data Indicators Workbook [Dataset]. https://healthdata.gov/d/pguh-hcak
Explore at:
Available download formats: xlsx, csv, xml
Dataset updated
Sep 4, 2025
Description
This workbook provides state-by-state performance data for Round 3 of the Child and Family Services Reviews in addition to national performance comparisons for the past 12-month reporting period included in data profiles transmitted to states in February 2021.

Metadata-only record linking to the original dataset. Open original dataset below.
How do I submit a letter of intent?
healthdata.gov
data.virginia.gov
+1more
csv, xlsx, xml
Updated Sep 4, 2025
Cite
(2025). How do I submit a letter of intent? [Dataset]. https://healthdata.gov/d/ypuq-bya9
Explore at:
Available download formats: xlsx, csv, xml
Dataset updated
Sep 4, 2025
Description
ACF Children's Bureau resource

Metadata-only record linking to the original dataset. Open original dataset below.
piq2011-01
healthdata.gov
data.virginia.gov
+1more
csv, xlsx, xml
Updated Sep 3, 2025
Cite
(2025). piq2011-01 [Dataset]. https://healthdata.gov/d/y96k-ga24
Explore at:
Available download formats: xml, csv, xlsx
Dataset updated
Sep 3, 2025
Description
ACF Agency Wide resource

Metadata-only record linking to the original dataset. Open original dataset below.
Information Memorandum (IM-16-03)
healthdata.gov
odgavaprod.ogopendata.com
+1more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). Information Memorandum (IM-16-03) [Dataset]. https://healthdata.gov/d/3tv3-uwzp
Explore at:
Available download formats: xlsx, csv, xml
Dataset updated
Sep 3, 2025
Description
This Information Memorandum (IM) informs state and tribal IV-E agencies about the publication of the Executive’s Guide to CCWIS.

Metadata-only record linking to the original dataset. Open original dataset below.
Information Memorandum (IM-01-07)
healthdata.gov
odgavaprod.ogopendata.com
+2more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). Information Memorandum (IM-01-07) [Dataset]. https://healthdata.gov/d/ccgq-mhjh
Explore at:
Available download formats: csv, xml, xlsx
Dataset updated
Sep 3, 2025
Description
This Information Memorandum (IM) provides information and guidance for use by States and Regional Offices on Updated National Standards for the Child and Family Service Reviews and Guidance on Program Improvement Plans.

Metadata-only record linking to the original dataset. Open original dataset below.
Information Memorandum (IM-01-09)
healthdata.gov
odgavaprod.ogopendata.com
+1more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). Information Memorandum (IM-01-09) [Dataset]. https://healthdata.gov/d/yev2-tpss
Explore at:
Available download formats: xlsx, csv, xml
Dataset updated
Sep 3, 2025
Description
This Information Memorandum (IM) provides States, Tribes, and Territories a comprehensive list of those Children's Bureau policies withdrawn in 2000 and 2001.

Metadata-only record linking to the original dataset. Open original dataset below.
Information Memorandum (IM-92-04)
healthdata.gov
data.virginia.gov
+2more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). Information Memorandum (IM-92-04) [Dataset]. https://healthdata.gov/d/5k6b-ug9z
Explore at:
Available download formats: xml, csv, xlsx
Dataset updated
Sep 3, 2025
Description
This Information Memorandum (IM) addresses the Aid to Families with Dependent Children expansion of the definition of specified caretaker relative as it relates to Title IV-E eligibility.

Metadata-only record linking to the original dataset. Open original dataset below.
The Comprehensive Child Welfare Information System Final Rule: Overview
healthdata.gov
data.virginia.gov
+1more
csv, xlsx, xml
Updated Sep 3, 2025
+ more versions
Cite
(2025). The Comprehensive Child Welfare Information System Final Rule: Overview [Dataset]. https://healthdata.gov/d/9vfv-dh33
Explore at:
Available download formats: xml, csv, xlsx
Dataset updated
Sep 3, 2025
Description
This document provides an overview of CCWIS and key provisions of the final rule.

Metadata-only record linking to the original dataset. Open original dataset below.

Cite

Ahmed Soliman (2023). Simulated Data for Patient Time Series Record Linkage [Dataset]. http://doi.org/10.6084/m9.figshare.19224786.v1

Simulated Data for Patient Time Series Record Linkage

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.6084/m9.figshare.19224786.v1

Dataset updated

Jun 3, 2023

Dataset provided by

Figshare (http://figshare.com/)

Authors

Ahmed Soliman

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This simulated dataset consists of two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.

1. ergo.csv contains heart rate time series data for 1600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file has three fields: Ergo_ID, Heart Rate (BPM), and timestamp.

2. pat.csv contains only four sample readings from each of the patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.

The goal is to link patients (identified by their patient_ID in the pat.csv file) to their corresponding ergometric tests (identified by their Ergo_ID in the ergo.csv file). This is done solely by matching the timestamp-value pairs from both files.

The time series record linkage task described above is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2

Data is simulated such that correctly linked identifiers satisfy the following formula:

|Ergo_ID - patient_ID| mod 104 == 0

The above formula is useful in evaluating the linkage algorithm's performance.
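
The ground-truth formula above can be turned into a small evaluation check. The following is a minimal sketch, not part of tslink2: it assumes the identifiers are integers and that a linkage result is available as (Ergo_ID, patient_ID) pairs. The function names `is_correct_link` and `linkage_precision` are hypothetical.

```python
from typing import Iterable, Tuple

MODULUS = 104  # from the dataset description: |Ergo_ID - patient_ID| mod 104 == 0


def is_correct_link(ergo_id: int, patient_id: int, modulus: int = MODULUS) -> bool:
    """Return True if the pair satisfies the dataset's ground-truth formula."""
    return abs(ergo_id - patient_id) % modulus == 0


def linkage_precision(links: Iterable[Tuple[int, int]]) -> float:
    """Fraction of proposed (Ergo_ID, patient_ID) links that are correct."""
    links = list(links)
    if not links:
        return 0.0  # no links proposed, so nothing is counted as correct
    correct = sum(is_correct_link(ergo_id, patient_id) for ergo_id, patient_id in links)
    return correct / len(links)


if __name__ == "__main__":
    # Made-up identifier pairs for illustration only, not taken from the dataset.
    proposed = [(208, 104), (105, 1), (5, 6), (300, 92)]
    print(linkage_precision(proposed))
```

This measures only precision over the proposed links. Computing recall would also require the full set of expected links, which the formula alone does not enumerate.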

Simulated Data for Patient Time Series Record Linkage

Privacy Preserving Linkage Software

synthetic-gold-database

Synthetic Gold

Record Linkage

E-Commerce Products Dataset For Record Linkage

Jyutping Project - Raw Data and Clean Data

SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph...

Datasets for Out-of-KB Mention Discovery with Entity Linking

ONC Patient Matching Algorithm Challenge Data

Synthetic Administrative Data: Census 1991, 2023

Frequency Report Utility

Data from: Genetic Architecture Reconciles Linkage and Association Studies...

Public Utility Data Liberation Project (PUDL) Data Release

CFSR Round 3 Statewide Data Indicators Workbook

How do I submit a letter of intent?

piq2011-01

Information Memorandum (IM-16-03)

Information Memorandum (IM-01-07)

Information Memorandum (IM-01-09)

Information Memorandum (IM-92-04)

The Comprehensive Child Welfare Information System Final Rule: Overview

Simulated Data for Patient Time Series Record Linkage