Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This simulated dataset comprises two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.
1. ergo.csv contains heart rate time-series data for 1600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file holds three field values: Ergo_ID, Heart Rate (BPM), and timestamp.
2. pat.csv contains only four sample readings from each of a patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.
The goal is to link patients (identified by patient_ID in pat.csv) to their corresponding ergometric tests (identified by Ergo_ID in ergo.csv). This is done solely by matching the timestamp-value pairs from both files.
The time-series record linkage task described above is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2
Data are simulated such that correctly linked/matched identifiers satisfy the following formula:
|Ergo_ID - patient_ID| mod 104 == 0
This formula is useful for evaluating the performance of the linkage algorithm.
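As an illustration of how the evaluation formula might be applied, here is a hedged sketch in Python; the predicted pairs below are made-up placeholders, not tslink2 output.

```python
# Hedged sketch: scoring a set of predicted (Ergo_ID, patient_ID) links with the
# ground-truth rule above. The example pairs are made-up placeholders.
def is_correct_link(ergo_id: int, patient_id: int) -> bool:
    """A link is correct when |Ergo_ID - patient_ID| is a multiple of 104."""
    return abs(ergo_id - patient_id) % 104 == 0

predicted_links = [(312, 104), (417, 106), (520, 0)]  # hypothetical linkage output
accuracy = sum(is_correct_link(e, p) for e, p in predicted_links) / len(predicted_links)
print(f"linkage accuracy: {accuracy:.2%}")  # 2 of the 3 example pairs satisfy the rule
```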
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A set of software tools for privacy-preserving record linkage.
anonlink: A library for carrying out the low-level hash comparisons required server side. Available from GitHub at http://github.com/n1analytics/anonlink/
entity-service: Server-side component of the private record linkage REST API, utilizing the anonlink library.
clkhash: A client utility and library for turning personally identifiable information into Bloom filter hashes. Available from GitHub at https://github.com/n1analytics/clkhash/
encoding-service: A REST API wrapper around clkhash for encoding PII data into CLKs. Available from GitHub at https://github.com/n1analytics/encoding-service/
The metadata and files (if any) are available to the public.
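For intuition only, a minimal sketch of the general idea behind these tools: n-grams of a PII value are hashed into a fixed-length Bloom filter (a CLK), and two encodings are compared with the Dice coefficient. This is not the clkhash/anonlink API; real deployments use keyed hashing (HMACs with shared secrets) and the libraries' own parameters.

```python
# Hedged sketch of Bloom-filter (CLK-style) encoding plus Dice-coefficient
# comparison. Plain SHA-256 is used for illustration; real tools such as
# clkhash use keyed hashing (HMACs) with shared secrets and tuned parameters.
import hashlib

BF_LEN = 1024     # Bloom filter length in bits
NUM_HASHES = 20   # hash functions per bigram

def bigrams(value: str):
    padded = f"_{value.lower()}_"  # pad so leading/trailing characters are kept
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def encode(value: str) -> set:
    """Hash the bigrams of a PII value into a Bloom filter, represented here
    as the set of bit positions that are switched on."""
    bits = set()
    for gram in bigrams(value):
        for k in range(NUM_HASHES):
            digest = hashlib.sha256(f"{k}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % BF_LEN)
    return bits

def dice(a: set, b: set) -> float:
    """Dice coefficient between two encodings: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(encode("jonathan smith"), encode("johnathan smith")))  # high: likely match
print(dice(encode("jonathan smith"), encode("maria garcia")))     # low: non-match
```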
GNU AGPL v3.0: http://www.gnu.org/licenses/agpl-3.0.html
This database represents a synthetic population of Nebraska from 1920-2022. It was created using this publicly available GitHub repository, which allows a user to make a synthetic population for a specific state. See that repository for an in-depth background on the project.
One of the primary uses of this dataset is training record linkage models. In public health, records often don't have a single reliable unique person identifier (like a Social Security Number). By creating a synthetic dataset with snapshots of the population for each year from 1920-2022, with known unique person identifiers, we can produce "golden" training / testing / validation data for supervised machine learning models. See below for some links to useful information on the record linkage process:
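As a toy illustration of that "golden" pair construction (column names and values here are hypothetical; the synthetic Nebraska data has its own schema), labelled match/non-match pairs can be built by cross-joining two snapshots and comparing the known identifier:

```python
# Hedged sketch: build labelled record pairs from two synthetic snapshots that
# share a known unique person identifier. All columns and values are invented.
import pandas as pd

snap_1950 = pd.DataFrame({
    "person_id": [1, 2, 3],
    "first": ["Mary", "John", "Ann"], "last": ["Smith", "Doe", "Lee"],
})
snap_1960 = pd.DataFrame({
    "person_id": [1, 2, 4],
    "first": ["Mary", "Jon", "Tom"], "last": ["Smyth", "Doe", "Ray"],
})

# Candidate pairs: full cross join of the two snapshots
pairs = snap_1950.merge(snap_1960, how="cross", suffixes=("_a", "_b"))

# Gold label: 1 when the known identifiers agree, 0 otherwise
pairs["is_match"] = (pairs["person_id_a"] == pairs["person_id_b"]).astype(int)
print(pairs[["first_a", "last_a", "first_b", "last_b", "is_match"]])
```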
-> If you use the Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234
@article{10.1093/comjnl/bxab179, author = {Gözükara, Furkan and Özel, Selma Ayşe}, title = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}", journal = {The Computer Journal}, year = {2021}, month = {11}, abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}", issn = {0010-4620}, doi = {10.1093/comjnl/bxab179}, url = {https://doi.org/10.1093/comjnl/bxab179}, note = {bxab179}, eprint = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf}, }
-> elki-bundle-0.7.2-SNAPSHOT.jar is the ELKI bundle that we compiled from the ELKI GitHub source code. The date of the source code is 6 June 2016. The compile command is as below:
->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package
->-> GitHub repository of ELKI: https://github.com/elki-project/elki
->-> This bundle file is used for all of the experiments presented in the article.
-> The Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset is composed as below:
->-> The top 50 E-commerce websites operating in Turkey were crawled, and their attributes were extracted.
->-> The crawling was carried out between 2015-01-13 15:12:46 and 2015-01-17 19:07:53.
->-> Then 250 product offers from Vatanbilgisayar were randomly selected.
->-> Then the entire dataset was manually scanned to find which products sold on other E-commerce websites are the same as the selected ones.
->-> Then each product was classified accordingly.
->-> This dataset contains these products along with their price (if available), title, categories (if available), free-text description (if available), wrapped features (if available), and crawled URL (the URL might have expired) attributes.
-> The dataset files are provided as used in the study.
-> ARFF files for All_Products and Only_Price_Having_Products are generated with raw term frequencies rather than the weighting schemes used in the study. The reason is that we tested these datasets only with our system, and since our system does incremental clustering, even if we provided TF-IDF weightings they would not be the same as those used in the article. More information is provided in the article.
->-> For Macro_Average_Datasets we provide both raw frequency and TF-IDF weightings as used in the experiments.
-> There are 3 main folders:
-> All_Products: This folder contains 1800 products.
->-> This is the entire collection that was manually labeled.
->-> They are from 250 different classes.
-> Only_Price_Having_Products: This folder contains all of the products that have the price feature set.
->-> The collection has 1721 products from 250 classes.
->-> This is the dataset that we experimented on.
-> Macro_Average_Datasets: This folder contains 100 datasets that we used to conduct more reliable experiments.
->-> Each dataset is composed by selecting 1000 different products from the price-having products dataset and then randomly ordering them...
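As a rough, generic illustration of the clustering task described in the citation above (this is not the paper's modified incremental HAC; the titles and distance threshold below are invented), product titles can be grouped with plain agglomerative clustering over bag-of-words vectors:

```python
# Hedged sketch: bag-of-words titles + average-linkage hierarchical clustering.
# Example titles and the 0.5 cosine-distance cut-off are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

titles = [
    "Samsung Galaxy S6 32GB Gold",
    "Samsung Galaxy S6 32 GB Altin",
    "Apple iPhone 6 64GB Space Gray",
    "Apple iPhone 6 64GB Uzay Grisi",
]

# Bag-of-words representation of the product titles
X = CountVectorizer().fit_transform(titles).toarray()

# Pairwise cosine distances, then average-linkage agglomerative clustering
Z = linkage(pdist(X, metric="cosine"), method="average")

# Cut the dendrogram at a fixed distance threshold to obtain cluster labels
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # the two Samsung offers share a label, as do the two iPhone offers
```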
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I only retained the Chinese name and the Hong Kong Government romanised English name. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and review was required, so the code is not fully automated. The code is stored on my personal GitHub: https://github.com/Jo-Lam/Jyutping_project/tree/main. Please cite this data resource: doi:10.5522/04/26504347
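A minimal sketch of the romanisation step, assuming the pypinyin package for the Pinyin half; the Jyutping lookup below is a hypothetical stand-in for a proper Cantonese romanisation resource, and the real project required manual review:

```python
# Hedged sketch: Pinyin via the pypinyin package; the Jyutping dictionary below
# is a hypothetical stand-in for a proper Cantonese romanisation resource.
from pypinyin import lazy_pinyin

name = "陳大文"
print(" ".join(lazy_pinyin(name)))  # e.g. "chen da wen"

# Illustrative character-to-Jyutping lookup (not a real data source)
JYUTPING = {"陳": "can4", "大": "daai6", "文": "man4"}
print(" ".join(JYUTPING.get(ch, "?") for ch in name))  # "can4 daai6 man4"
```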
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.
For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
For the 2024 edition, see: https://sem-tab-challenge.github.io/2024/
Note on License: This data includes data from the following sources. Refer to each source for license details:
- Wikidata https://www.wikidata.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository contains datasets for out-of-KB mention discovery from texts, documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).
Each data setting (as a sub-folder) contains train, valid, and test files and also 100 random sample files for each data split for debugging.
Data folders whose names end with “syn_full” contain synonym-augmented data (each synonym treated as an entity) for that setting.
Ontology .jsonl files come in two versions each: the "syn_attr" setting treats synonyms as attributes, while the "syn_full" setting treats synonyms as entities.
Data scripts are available at https://github.com/KRR-Oxford/BLINKout#data-scripts
Acknowledgement of the data sources below:
ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/
MedMentions dataset is from https://github.com/chanzuckerberg/MedMentions
UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html
SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html
NILK dataset is from https://zenodo.org/record/6607514
WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of the Patient Matching Algorithm Challenge is to bring about greater transparency and data on the performance of existing patient matching algorithms, spur the adoption of performance metrics for patient data matching algorithm vendors, and positively impact other aspects of patient matching such as deduplication and linking to clinical data. Participants will be provided a data set and will have their answers evaluated and scored against a master key. Up to 6 cash prizes will be awarded with a total purse of up to $75,000.00. See https://www.patientmatchingchallenge.com/
The test dataset used in the ONC Patient Matching Algorithm Challenge is available for download by students, researchers, or anyone else interested in additional analysis and patient matching algorithm development. More information about the Patient Matching Algorithm Challenge can be found at https://www.patientmatchingchallenge.com/
The dataset containing 1 million patients was split into eight files of alphabetical groupings by the patient's last name, plus an additional file containing test patients with no last name recorded (Null). All files should be downloaded and merged for analysis, as in the sketch below.
https://github.com/onc-healthit/patient-matching
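A minimal sketch of that merge step; the file names below are hypothetical, so substitute the actual names from the challenge download:

```python
# Hedged sketch: merge the eight alphabetical files plus the Null-last-name file
# into one table. File names are hypothetical; use the actual download names.
import glob
import pandas as pd

parts = sorted(glob.glob("patient_matching_*.csv"))  # nine hypothetical CSV parts
patients = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
print(f"{len(patients):,} patient records after merging {len(parts)} files")
```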
We create a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin) that mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography, and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
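As a loose illustration of the record-deletion step described above (the ONS specification, proportions, and column names are not reproduced here; everything below is hypothetical):

```python
# Hedged sketch: delete a pre-specified fraction of records per
# ethnic-group x gender cell. Proportions, column names and data are invented.
import pandas as pd

census = pd.DataFrame({
    "person_id": range(8),
    "ethnic_group": ["White", "White", "Non-white", "Non-white"] * 2,
    "sex": ["M", "F"] * 4,
})

# Hypothetical deletion proportions for each (ethnic_group, sex) cell
delete_props = {("White", "M"): 0.25, ("White", "F"): 0.10,
                ("Non-white", "M"): 0.40, ("Non-white", "F"): 0.20}

def drop_fraction(group: pd.DataFrame) -> pd.DataFrame:
    """Randomly drop the cell-specific fraction of rows from one group."""
    return group.drop(group.sample(frac=delete_props[group.name], random_state=0).index)

admin = census.groupby(["ethnic_group", "sex"], group_keys=False).apply(drop_fraction)
print(len(census), "->", len(admin), "records after simulated under-coverage")
```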
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs, and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data.
For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data are delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess whether the administrative data are representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. The third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors, and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.
The Frequency Report Utility can generate a report that provides an overview of the number and type of raw responses being generated for a particular data element to help title IV-E agencies assess and improve data quality.
Metadata-only record linking to the original dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This (zipped) folder contains 3 sub-folders:
#**********************************************************************************************************
The "bin" folder contains fuctions and gentic maps needed for analyes
bin \
predLink.R - function to predict linkage
sibREML_v0.1.1.R - function to run SibREML
sim-sib-array.R - script to simulate sib-pairs from parental haplotypes
Summarised_genetic_map_bcf.txt - genetic map per 0.5-cM long segments, based on map from bcftools
(BCFtools: https://samtools.github.io/bcftools/bcftools.html)
Summarised_genetic_map_OMNI.txt - genetic map per 0.5-cM long segments, based on OMNI map
(https://github.com/joepickrell/1000-genomes-genetic-maps/tree/master/interpolated_OMNI)
#**********************************************************************************************************
#**********************************************************************************************************
The "SIM" folder contains the simulation pipeline (scripts 01-15) as well as IBD sharing and simulated phenotypes for Simulated sib-pairs.
SIM \
01_sim-sib-array.sh *pre-run*
02_bed_recode_bcf_map.sh *pre-run*
03_make_merlin.R *pre-run*
04_error_merlin.sh *pre-run*
05_merlin_IBD.sh *pre-run*
06_sample_causal_snps.R *pre-run*
07_simulate_pheno.sh *pre-run*
08_bhat_gwas.R *can be run using provided data*
09_Linkage_VH.R *can be run using provided data*
10_predLink.R *can be run using provided data*
11_phi_hat.R *can be run using provided data*
12_IBD_Mb.R *can be run using provided data*
13_IBD_cM_recombrate_stratified.R *can be run using provided data*
14_SibREML.R *can be run using provided data*
15_SibREML_stratified_Q4.R *can be run using provided data*
causal_snps \ *provided causal SNPs*
IBD_results \ *provided IBD-probabilities for 1000 simulated sib-pairs*
Linkage_VH_results \
pheno \ *provided simulated phenotypes (h2=1) for 8 genetic architectures*
Phi_hat_results.txt
predicted \
README
SibREML_results.txt
SibREML_stratified_Q4.txt
The data can be used to run Linkage analysis, predict linkage, estimate phi_hat,
as well as estimate non-stratified and recombination rate stratified sib-heritability (h2_FS and c).
The README is provided within the folder.
#**********************************************************************************************************
#**********************************************************************************************************
The "HT_BMI" folder contains data and scripts to predict linkage and estimate phi_hat for height and BMI.
HT_BMI \
01_predLink_HT_BMI.R
02_phi_hat_HT_BMI.R
gws_sumstats \ *provided GWAS summary statistics to predict linkage for height and BMI*
Linkage_results \ *provided linkage meta-analysis results for height and BMI from this study*
Phi_hat_results_HT_BMI.txt
PREDLINK_bmi.txt
PREDLINK_height.txt
README
The README is provided within the folder.
#**********************************************************************************************************
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
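For example, a processed table can be read straight from the public S3 bucket listed under Resources below with pandas; the exact object layout under the versioned prefix is an assumption here, so check the Data Access documentation for the authoritative paths.

```python
# Hedged sketch: read one of the new Parquet outputs from the public S3 bucket
# with pandas (requires s3fs). The object path layout is an assumption; see the
# Data Access documentation for the authoritative locations.
import pandas as pd

url = "s3://pudl.catalyst.coop/v2025.2.0/out_ferc1_yearly_detailed_income_statements.parquet"
df = pd.read_parquet(url, storage_options={"anon": True})  # public bucket, no credentials
print(df.head())
```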
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1_yearly_detailed_income_statements
out_ferc1_yearly_detailed_balance_sheet_assets
out_ferc1_yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k_parents_and_subsidiaries
core_sec10k_quarterly_filings
core_sec10k_quarterly_exhibit_21_company_ownership
core_sec10k_quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
EIA 860M
Added EIA 860m through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.
Bug Fixes
Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.
Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL Github issue tracker to let us know about any bugs or data issues you encounter
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.
This workbook provides state-by-state performance data for Round 3 of the Child and Family Services Reviews, in addition to national performance comparisons, for the past 12-month reporting period included in data profiles transmitted to states in February 2021.
Metadata-only record linking to the original dataset.
ACF Children's Bureau resource
Metadata-only record linking to the original dataset.
ACF Agency Wide resource
Metadata-only record linking to the original dataset.
This Information Memorandum (IM) informs state and tribal IV-E agencies about the publication of the Executive’s Guide to CCWIS.
Metadata-only record linking to the original dataset.
This Information Memorandum (IM) provides information and guidance for use by States and Regional Offices on Updated National Standards for the Child and Family Service Reviews and Guidance on Program Improvement Plans.
Metadata-only record linking to the original dataset.
This Information Memorandum (IM) provides States, Tribes, and Territories a comprehensive list of those Children's Bureau policies withdrawn in 2000 and 2001.
Metadata-only record linking to the original dataset.
This Information Memorandum (IM) concerns Aid to Families with Dependent Children: Expansion of the Definition of Specified Caretaker Relative as it Relates to Title IV-E Eligibility.
Metadata-only record linking to the original dataset.
This document provides an overview of CCWIS and key provisions of the final rule.
Metadata-only record linking to the original dataset.