100+ datasets found
  1. example-generate-preference-dataset

    • huggingface.co
    Updated Aug 23, 2024
    + more versions
    Cite
    distilabel-internal-testing (2024). example-generate-preference-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    Dataset Card for example-preference-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
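
    As a quick way to inspect the rows, the dataset can also be loaded with the Hugging Face `datasets` library; a minimal sketch, assuming the library is installed and the default "train" split exists:

```python
# Minimal sketch: load the preference dataset from the Hugging Face Hub.
# Assumes `pip install datasets` and that a default "train" split exists.
from datasets import load_dataset

ds = load_dataset(
    "distilabel-internal-testing/example-generate-preference-dataset",
    split="train",
)
print(ds)     # column names and row count
print(ds[0])  # first preference example
```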

  2. Data from: Simulated Radar Waveform and RF Dataset Generator for Incumbent...

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Simulated Radar Waveform and RF Dataset Generator for Incumbent Signals in the 3.5 GHz CBRS Band [Dataset]. https://catalog.data.gov/dataset/simulated-radar-waveform-and-rf-dataset-generator-for-incumbent-signals-in-the-3-5-ghz-cbr-a6a00
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This software tool generates simulated radar signals and creates RF datasets. The datasets can be used to develop and test detection algorithms that apply machine learning/deep learning techniques to the 3.5 GHz Citizens Broadband Radio Service (CBRS) or similar bands. In these bands, the primary users are federal incumbent radar systems. The software tool generates radar waveforms and randomizes the radar waveform parameters. The pulse modulation types for the radar signals and their parameters are selected based on NTIA testing procedures for ESC certification, available at http://www.its.bldrdoc.gov/publications/3184.aspx. Furthermore, the tool mixes the waveforms with interference and packages them into one RF dataset file. The tool uses a graphical user interface (GUI) to simplify the selection of parameters and the mixing process. A reference RF dataset generated using this software is published at https://doi.org/10.18434/M32116.

  3. Data from: Dataset Generator Dataset

    • universe.roboflow.com
    zip
    Updated Jan 7, 2025
    Cite
    Science Lab (2025). Dataset Generator Dataset [Dataset]. https://universe.roboflow.com/science-lab-lpukz/dataset-generator
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Science Lab
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    FTC Element Bounding Boxes
    Description

    Dataset Generator

    ## Overview
    
    Dataset Generator is a dataset for object detection tasks - it contains FTC Element annotations for 1,512 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
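
    For programmatic access, a hypothetical sketch using the `roboflow` Python package; the API key, version number, and export format below are placeholders to be replaced with values from the dataset page:

```python
# Hypothetical sketch: download the dataset with the roboflow package
# (pip install roboflow). API key, version number, and export format
# are placeholders, not values confirmed by the dataset page.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("science-lab-lpukz").project("dataset-generator")
dataset = project.version(1).download("coco")  # placeholder version/format
print(dataset.location)  # local folder with images and annotations
```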
    
  4. Dataset for: Simulation and data-generation for random-effects network...

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Svenja Elisabeth Seide; Katrin Jensen; Meinhard Kieser (2023). Dataset for: Simulation and data-generation for random-effects network meta-analysis of binary outcome [Dataset]. http://doi.org/10.6084/m9.figshare.8001863.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Svenja Elisabeth Seide; Katrin Jensen; Meinhard Kieser
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The performance of statistical methods is frequently evaluated by means of simulation studies. In the case of network meta-analysis of binary data, however, available data-generating models are restricted to either the inclusion of two-armed trials or the fixed-effect model. Based on data-generation in the pairwise case, we propose a framework for the simulation of random-effects network meta-analyses including multi-arm trials with binary outcome. The only common data-generating model that is directly applicable to a random-effects network setting relies on strongly restrictive assumptions. To overcome these limitations, we modify this approach and derive a related simulation procedure using odds ratios as the effect measure. The performance of this procedure is evaluated with synthetic data and in an empirical example.

  5. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
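
    The distributed R script is not reproduced here, but the two-stage design reads naturally as code. A minimal sketch with an invented sampling frame (stratum names, EA counts, and household IDs are all made up):

```python
# Illustrative sketch of the two-stage design described above: allocate
# EAs to strata proportionally, then draw a fixed 25 households per EA.
# The frame below is invented toy data, not the actual census frame.
import random

HH_PER_EA = 25
N_EAS_TOTAL = 8000 // HH_PER_EA  # 320 enumeration areas in total

# stratum -> {ea_id: [household ids]}; toy frame with 320 EAs of 300 households
frame = {
    ("prov1", "urban"): {f"u{i}": [f"u{i}_hh{j}" for j in range(300)] for i in range(140)},
    ("prov1", "rural"): {f"r{i}": [f"r{i}_hh{j}" for j in range(300)] for i in range(180)},
}

total_eas = sum(len(eas) for eas in frame.values())
sample = []
for stratum, eas in frame.items():
    # stage 1: number of EAs proportional to the stratum's share of the frame
    n_eas = round(N_EAS_TOTAL * len(eas) / total_eas)
    for ea_id in random.sample(sorted(eas), n_eas):
        # stage 2: 25 households at random within each selected EA
        sample.extend(random.sample(eas[ea_id], HH_PER_EA))

print(len(sample))  # 8000 households
```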

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  6. clinical-synthetic-text-kg

    • huggingface.co
    Updated Jun 23, 2024
    + more versions
    Cite
    Ran Xu (2024). clinical-synthetic-text-kg [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2024
    Authors
    Ran Xu
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is drawn from knowledge graphs.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000 synthetic… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg.
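
    A minimal sketch for loading the released data with the `datasets` library; if the repository defines multiple configurations (one per source dataset), the configuration name would be passed as a second argument:

```python
# Minimal sketch: load the synthetic clinical text from the Hub.
# Assumes `pip install datasets`; config/split names should be checked
# against the dataset card, as the layout here is an assumption.
from datasets import load_dataset

ds = load_dataset("ritaranx/clinical-synthetic-text-kg", split="train")
print(ds.column_names)
print(ds[0])
```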

  7. Gridded Weather Generator Perturbations of Historical Detrended and...

    • data.cnra.ca.gov
    • data.ca.gov
    • +1more
    csv, jpeg, netcdf +2
    Updated May 14, 2025
    Cite
    California Department of Water Resources (2025). Gridded Weather Generator Perturbations of Historical Detrended and Stochastically Generated Temperature and Precipitation for the State of CA and HUC8s [Dataset]. https://data.cnra.ca.gov/dataset/ca-weather-generator-gridded-climate-pr-tmin-tmax-2023
    Explore at:
    Available download formats: csv(4454), xlsx(19137), csv, txt, jpeg(183900), netcdf, xlsx(469606)
    Dataset updated
    May 14, 2025
    Dataset authored and provided by
    California Department of Water Resources
    Area covered
    California
    Description

    The Weather Generator Gridded Data consists of two products:

    [1] statistically perturbed gridded 100-year historic daily weather data, including precipitation [mm] and detrended maximum and minimum temperature [degrees Celsius], and

    [2] stochastically generated and statistically perturbed gridded 1000-year daily weather data, including precipitation [mm], maximum temperature [degrees Celsius], and minimum temperature [degrees Celsius].

    The base climate of this dataset is a combination of historically observed gridded data, including Livneh Unsplit 1915-2018 (Pierce et al. 2021), Livneh 1915-2015 (Livneh et al. 2013), and PRISM 2016-2018 (PRISM Climate Group, 2014). Daily precipitation is from Livneh Unsplit 1915-2018; daily temperature is from Livneh 2013 spanning 1915-2015 and was extended to 2018 with daily 4 km PRISM data rescaled to the Livneh grid resolution (1/16 deg). The Livneh temperature was bias-corrected by month to the corresponding monthly PRISM climate over the same period. Baseline temperature was then detrended by month over the entire time series based on the average monthly temperature from 1991-2020. Statistical perturbations and stochastic generation of the time series were performed by the Weather Generator (Najibi et al. 2024a and Najibi et al. 2024b).

    The repository consists of 30 climate perturbation scenarios that range from -25 to +25 % change in mean precipitation and from 0 to +5 degrees Celsius change in mean temperature. Thermodynamic changes scale precipitation during extreme events by a factor per degree Celsius of mean-temperature increase: primarily 7%/degree-Celsius, with 14%/degree-Celsius as a sensitivity perturbation (see the worked example below). Further detail on thermodynamic scaling can be found in the full report linked below or in Najibi et al. 2024a and Najibi et al. 2024b.
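
    As a worked example of the scaling rule, assuming the percentage compounds per degree Celsius of mean warming (a common convention, not stated explicitly above), a +3 degree scenario at 7%/degree scales extreme-event precipitation by about 1.07^3, or roughly 1.23:

```python
# Worked example of the thermodynamic scaling described above, assuming
# the per-degree percentage compounds with mean warming.
def extreme_precip_factor(delta_t_c: float, pct_per_degree: float = 7.0) -> float:
    """Multiplier applied to precipitation during extreme events."""
    return (1.0 + pct_per_degree / 100.0) ** delta_t_c

print(extreme_precip_factor(3.0))        # ~1.225 at the primary 7 %/deg-C
print(extreme_precip_factor(3.0, 14.0))  # ~1.482 at the 14 %/deg-C sensitivity
```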

    The data presented here was created by the Weather Generator, which was developed by Dr. Scott Steinschneider and Dr. Nasser Najibi (Cornell University). If a separate weather generator product is desired apart from this gridded climate dataset, the weather generator code can be adapted to suit the specific needs of the user. The weather generator code and supporting information can be found here: https://github.com/nassernajibi/WGEN-v2.0/tree/main. The full report on the model and its performance can be found here: https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/All-Programs/Climate-Change-Program/Resources-for-Water-Managers/Files/WGENCalifornia_Final_Report_final_20230808.pdf

  8. Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data

    • catalog.data.gov
    • data.transportation.gov
    • +5more
    Updated Jun 16, 2025
    Cite
    Federal Highway Administration (2025). Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data [Dataset]. https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories-and-supporting-data
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Federal Highway Administration
    Description

    Click “Export” on the right to download the vehicle trajectory data. The associated metadata and additional data can be downloaded below under "Attachments".

    Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA, and Peachtree Street in Atlanta, Georgia. Data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. This vehicle trajectory data provided the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles. Click the "Show More" button below to find additional contextual data and metadata for this dataset.

    For site-specific NGSIM video file datasets, please see the following:

    - NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
    - NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
    - NGSIM Lankershim Boulevard Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Lankershi/uv3e-y54k
    - NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf
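
    Since positions are sampled every one-tenth of a second, per-vehicle speeds can be derived by differencing. A sketch with pandas; the column names follow the published NGSIM schema (Vehicle_ID, Frame_ID, Local_X, Local_Y, in feet) and should be verified against the metadata:

```python
# Sketch: derive speeds from the 0.1 s trajectory samples with pandas.
# File path is a placeholder; column names assume the NGSIM schema.
import pandas as pd

df = pd.read_csv("ngsim_us101.csv")  # placeholder local export
df = df.sort_values(["Vehicle_ID", "Frame_ID"])
dt = 0.1  # one-tenth of a second between frames
for axis in ("Local_X", "Local_Y"):
    df[f"v_{axis}"] = df.groupby("Vehicle_ID")[axis].diff() / dt
df["speed_ft_s"] = (df["v_Local_X"] ** 2 + df["v_Local_Y"] ** 2) ** 0.5
print(df[["Vehicle_ID", "Frame_ID", "speed_ft_s"]].head())
```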

  9. Language Generation Dataset: 200M Samples

    • kaggle.com
    zip
    Updated Sep 7, 2019
    Cite
    Abhishek Chatterjee (2019). Language Generation Dataset: 200M Samples [Dataset]. https://www.kaggle.com/datasets/imdeepmind/language-generation-dataset-200m-samples
    Explore at:
    Available download formats: zip (3416608411 bytes)
    Dataset updated
    Sep 7, 2019
    Authors
    Abhishek Chatterjee
    Description

    Context

    Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.

    This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.

    To learn how the dataset was prepared, please check the GitHub repository for this dataset: https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset

    Content

    The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns: sequence and next.

    The sequence column contains sequences of characters. In this dataset, each sequence is 40 characters long.

    The next column contains the next character after the sequence.

    There are about 200 million samples in the dataset.
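
    A minimal sketch for streaming training pairs out of the database with Python's built-in sqlite3 module (the database filename is a placeholder; "next" is quoted in the query only as a precaution since it doubles as a column name):

```python
# Minimal sketch: read (sequence, next) pairs from the reviews table.
# Database filename is a placeholder for the downloaded file.
import sqlite3

conn = sqlite3.connect("language_generation.db")  # placeholder path
cursor = conn.execute('SELECT sequence, "next" FROM reviews LIMIT 5')
for sequence, next_char in cursor:
    print(repr(sequence), "->", repr(next_char))
conn.close()
```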

    Acknowledgements

    Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

    Inspiration

    This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.

  10. Public Dataset Access and Usage

    • data.sfgov.org
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • +3more
    application/rdfxml +5
    Updated Aug 9, 2025
    Cite
    (2025). Public Dataset Access and Usage [Dataset]. https://data.sfgov.org/City-Infrastructure/Public-Dataset-Access-and-Usage/su99-qvi4
    Explore at:
    Available download formats: csv, application/rssxml, json, tsv, application/rdfxml, xml
    Dataset updated
    Aug 9, 2025
    Description

    A. SUMMARY This dataset is used to report on public dataset access and usage within the open data portal. Each row sums the number of users who accessed a dataset each day, grouped by access type (API Read, Download, Page View, etc.).

    B. HOW THE DATASET IS CREATED This dataset is created by joining two internal analytics datasets generated by the SF Open Data Portal. We remove non-public information during the process.

    C. UPDATE PROCESS This dataset is scheduled to update every 7 days via ETL.

    D. HOW TO USE THIS DATASET This dataset can help you identify stale datasets, highlight the most popular datasets and calculate other metrics around the performance and usage in the open data portal.

    Please note a special call-out for two fields:

    - "derived": This field shows whether an asset is an original source (derived = "False") or is made from another asset through filtering (derived = "True"). Essentially, whether or not it is derived from another source.
    - "provenance": This field shows whether an asset is "official" (created by someone in the City of San Francisco) or "community" (created by a member of the community, not official). All community assets are derived, as members of the community cannot add data to the open data portal.
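
    The portal is a Socrata instance, so rows can also be pulled through the standard SODA endpoint derived from the dataset URL; a sketch, noting that the filter values for "derived" and "provenance" are assumptions about the stored casing:

```python
# Sketch: query the SODA endpoint for this dataset (resource id su99-qvi4).
# Filter values for "provenance"/"derived" are assumptions; check the data.
import requests

url = "https://data.sfgov.org/resource/su99-qvi4.json"
params = {"provenance": "official", "derived": "False", "$limit": 10}
rows = requests.get(url, params=params, timeout=30).json()
for row in rows:
    print(row)
```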

  11. VegeNet - Image datasets and Codes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan; Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original : Images of vegetables captured manually in data acquisition stage
    2. vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
    3. non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
    5. food_image_dataset_split : Image dataset (4) split into train and test sets
    6. process : Images created when cropping (pre-processing step) to create dataset (2).

  12. Solar Plant Generation Data

    • kaggle.com
    zip
    Updated Apr 5, 2024
    Cite
    Afroz (2024). Solar Plant Generation Data [Dataset]. https://www.kaggle.com/datasets/pythonafroz/solar-plant-generation-data
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 5, 2024
    Authors
    Afroz
    Description

    Dataset

    This dataset was created by Afroz

    Contents

  13. Data from: EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge...

    • physionet.org
    Updated Jan 11, 2024
    Cite
    Konstantin Kotschenreuther (2024). EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems [Dataset]. http://doi.org/10.13026/25fx-f706
    Explore at:
    Dataset updated
    Jan 11, 2024
    Authors
    Konstantin Kotschenreuther
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion parameter Meta Llama 2 model, this dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summaries.

    This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. Additionally, accompanying the dataset is code facilitating question-and-answer pair generation from any medical and non-medical text.

    Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6,000 input tokens, owing to hardware constraints. The large language model's nature in generating these question-and-answer pairs may introduce an underlying bias or a lack of diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures as well as the employment of more powerful large language models.
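
    The released generation code is linked from the dataset page; purely as an illustration of the prompting approach described above, a hypothetical sketch (the prompt wording and any model client are stand-ins, not the authors' pipeline):

```python
# Hypothetical sketch of the QA-pair prompting approach described above.
# The prompt text is invented; pair it with any Llama 2 13B chat client.
def build_qa_prompt(discharge_summary: str, n_pairs: int = 5) -> str:
    return (
        f"Read the following discharge summary and write {n_pairs} "
        "question-and-answer pairs grounded only in its content.\n\n"
        f"Summary:\n{discharge_summary}\n\nQ&A pairs:"
    )

prompt = build_qa_prompt("Patient admitted with chest pain ...")  # toy text
print(prompt)
```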

  14. Data used by EPA researchers to generate illustrative figures for overview...

    • s.cnmilf.com
    • datasets.ai
    • +1more
    Updated Nov 14, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data used by EPA researchers to generate illustrative figures for overview article "Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management" [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-used-by-epa-researchers-to-generate-illustrative-figures-for-overview-article-multisc
    Explore at:
    Dataset updated
    Nov 14, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”.

    Overview: The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS.

    The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31, averaged over all grid cells located in Colorado, were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources to ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article.

    The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

    CMAQ Model Data: The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

    This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
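
    For readers working with the ioapi/netcdf files described above, a minimal sketch of reading an hourly ozone field with the netCDF4 library; the filename and the "O3" variable name are assumptions to be checked against the file header:

```python
# Sketch: read an hourly ozone field from a CMAQ ioapi/netcdf file.
# Filename and variable name are assumptions; CMAQ CONC files store O3
# as a (TSTEP, LAY, ROW, COL) array.
from netCDF4 import Dataset

with Dataset("CCTM_CONC_20160701.nc") as nc:  # placeholder filename
    o3 = nc.variables["O3"][:]
    print(o3.shape, "max:", float(o3.max()))
```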

  15. pv-generation

    • huggingface.co
    Updated Mar 7, 2024
    Cite
    EDS Lab (2024). pv-generation [Dataset]. https://huggingface.co/datasets/EDS-lab/pv-generation
    Explore at:
    Dataset updated
    Mar 7, 2024
    Dataset authored and provided by
    EDS Lab
    License

    https://choosealicense.com/licenses/bsd-3-clause/

    Description

    PV Generation Dataset

    This dataset compiles and harmonizes multiple open PV datasets.

    Curated by: Attila Balint
    License: BSD 3-Clause "New" or "Revised" License

      Uses
    

    This PV dataset primarily facilitates solar generation forecasting.

      Dataset Structure
    

    The dataset contains three main files.

    data/generation.parquet
    data/metadata.parquet
    data/weather.parquet

      data/generation.parquet
    

    This file contains the electricity generation values and has three… See the full description on the dataset page: https://huggingface.co/datasets/EDS-lab/pv-generation.
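
    A minimal sketch for reading the generation table directly from the Hub with pandas (assumes pandas, pyarrow, and huggingface_hub are installed so the hf:// protocol resolves):

```python
# Minimal sketch: read data/generation.parquet straight from the Hub.
# Requires pandas + pyarrow + huggingface_hub for hf:// paths.
import pandas as pd

generation = pd.read_parquet(
    "hf://datasets/EDS-lab/pv-generation/data/generation.parquet"
)
print(generation.head())
```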

  16. Synthetic Suicide Prevention Dataset with SDoH

    • data.va.gov
    • datahub.va.gov
    • +3more
    application/rdfxml +5
    Updated Feb 18, 2021
    Cite
    VHA (2021). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://www.data.va.gov/dataset/Synthetic-Suicide-Prevention-Dataset-with-SDoH/h5zp-pekf
    Explore at:
    Available download formats: application/rssxml, application/rdfxml, xml, tsv, csv, json
    Dataset updated
    Feb 18, 2021
    Dataset authored and provided by
    VHA
    Description

    The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
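
    Since the records follow the standard Synthea CSV layout documented in the data dictionary linked above, a sketch of joining encounters to patients with pandas (filenames and column names follow that convention and should be verified against the download):

```python
# Sketch: join encounters to patients, assuming the standard Synthea CSV
# layout (patients.csv with an Id column, encounters.csv with a PATIENT
# foreign key); verify against the CSV data dictionary linked above.
import pandas as pd

patients = pd.read_csv("patients.csv")
encounters = pd.read_csv("encounters.csv")
merged = encounters.merge(patients, left_on="PATIENT", right_on="Id",
                          suffixes=("_enc", "_pat"))
print(merged.head())
```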

  17. Unimelb Corridor Synthetic dataset

    • figshare.unimelb.edu.au
    png
    Updated May 30, 2023
    Cite
    Debaditya Acharya; KOUROSH KHOSHELHAM; STEPHAN WINTER (2023). Unimelb Corridor Synthetic dataset [Dataset]. http://doi.org/10.26188/5dd8b8085b191
    Explore at:
    Available download formats: png
    Dataset updated
    May 30, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    Debaditya Acharya; KOUROSH KHOSHELHAM; STEPHAN WINTER
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data-set is a supplementary material related to the generation of synthetic images of a corridor in the University of Melbourne, Australia from a building information model (BIM). This data-set was generated to check the ability of deep learning algorithms to learn the task of indoor localisation from synthetic images, when being tested on real images.

    The following is the name convention used for the data-sets. The brackets show the number of images in the data-set.

    REAL DATA

    Real ---------------------> Real images (949 images)
    Gradmag-Real -------> Gradmag of real data (949 images)

    SYNTHETIC DATA

    Syn-Car ----------------> Cartoonish images (2500 images)
    Syn-pho-real ----------> Synthetic photo-realistic images (2500 images)
    Syn-pho-real-tex -----> Synthetic photo-realistic textured (2500 images)
    Syn-Edge --------------> Edge render images (2500 images)
    Gradmag-Syn-Car ---> Gradmag of Cartoonish images (2500 images)

    Each folder contains the images and their respective groundtruth poses in the following format [ImageName X Y Z w p q r].

    To generate the synthetic data-set, we define a trajectory in the 3D indoor model. The points in the trajectory serve as the ground truth poses of the synthetic images. The height of the trajectory was kept in the range of 1.5–1.8 m from the floor, which is the usual height of holding a camera in hand. Artificial point light sources were placed to illuminate the corridor (except for Edge render images). The length of the trajectory was approximately 30 m. A virtual camera was moved along the trajectory to render four different sets of synthetic images in Blender*. The intrinsic parameters of the virtual camera were kept identical to the real camera (VGA resolution, focal length of 3.5 mm, no distortion modeled). We have rendered images along the trajectory at 0.05 m intervals and ± 10° tilt.

    The main difference between the cartoonish (Syn-car) and photo-realistic images (Syn-pho-real) is the model of rendering. Photo-realistic rendering is a physics-based model that traces the path of light rays in the scene, which is similar to the real world, whereas the cartoonish rendering roughly traces the path of light rays. The photo-realistic textured images (Syn-pho-real-tex) were rendered by adding repeating synthetic textures to the 3D indoor model, such as the textures of brick, carpet and wooden ceiling. The realism of the photo-realistic rendering comes at the cost of rendering times. However, the rendering times of the photo-realistic data-sets were considerably reduced with the help of a GPU. Note that the naming convention used for the data-sets (e.g. Cartoonish) is according to Blender terminology.

    An additional data-set (Gradmag-Syn-car) was derived from the cartoonish images by taking the edge gradient magnitude of the images and suppressing weak edges below a threshold. The edge rendered images (Syn-edge) were generated by rendering only the edges of the 3D indoor model, without taking into account the lighting conditions. This data-set is similar to the Gradmag-Syn-car data-set; however, it does not contain the effect of illumination of the scene, such as reflections and shadows.

    *Blender is an open-source 3D computer graphics software and finds its applications in video games, animated films, simulation and visual art. For more information please visit: http://www.blender.org

    Please cite the papers if you use the data-set:

    1) Acharya, D., Khoshelham, K., and Winter, S., 2019. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing. 150: 245-258.

    2) Acharya, D., Singha Roy, S., Khoshelham, K. and Winter, S. 2019. Modelling uncertainty of single image indoor localisation using a 3D model and deep learning. In ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, IV-2/W5, pages 247-254.
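
    A minimal sketch for parsing a groundtruth pose file in the [ImageName X Y Z w p q r] format described above (the filename is a placeholder):

```python
# Sketch: parse poses in the [ImageName X Y Z w p q r] format described
# above (position plus orientation quaternion). Filename is a placeholder.
import numpy as np

poses = {}
with open("groundtruth.txt") as f:  # placeholder filename
    for line in f:
        name, *values = line.split()
        x, y, z, w, p, q, r = map(float, values)
        poses[name] = {"position": np.array([x, y, z]),
                       "orientation": np.array([w, p, q, r])}
print(len(poses), "poses loaded")
```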

  18. Insider Threat Test Dataset

    • impactcybertrust.org
    Updated Sep 18, 2019
    + more versions
    Cite
    External Data Source (2019). Insider Threat Test Dataset [Dataset]. http://doi.org/10.23721/100/1504339
    Explore at:
    Dataset updated
    Sep 18, 2019
    Authors
    External Data Source
    Description

    The CERT Division, in partnership with ExactData, LLC, and under sponsorship from DARPA I2O, generated a collection of synthetic insider threat test datasets. These datasets provide both synthetic background data and data from synthetic malicious actors. Datasets are organized according to the data generator release that created them. Most releases include multiple datasets (e.g., r3.1 and r3.2). Generally, later releases include a superset of the data generation functionality of earlier releases. Each dataset file contains a readme file that provides detailed notes about the features of that release. The answer key file answers.tar.bz2 contains the details of the malicious activity included in each dataset, including descriptions of the scenarios enacted and the identifiers of the synthetic users involved.

  19. Magnetic Tape Recorder Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2023
    Cite
    Moliner, Eloi (2023). Magnetic Tape Recorder Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8026271
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Välimäki, Vesa
    Wright, Alec
    Moliner, Eloi
    Mikkonen, Otto
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the datasets collected and used in the research project:

    O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.

    A pre-print of the article is available on arXiv. The code is open-source and published on GitHub. The accompanying web page can be found here.

    Overview

    The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder, as well as real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, similarly as was done in the research article.

    Toy and Real Data

    The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first, the fluctuating delay produced by the simulated tape transport was disabled; in the second, it was enabled. The latter variants are denoted with the string WOWFLUTTER.

    The real data is collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.

    Directories

    For training the models, a fraction of the inputs from the SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the following subsets:

    ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)

    ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)

    Silence_AKAI_*/ (noise generator, real data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)

    For visualizing the model behavior, the following subsets can be used:

    LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)

    SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)

    Directory structure

    Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation and test sets from each other. Thus, a typical directory will look like the following:

    [DIRECTORY_NAME]
    ├── Train
    │   ├── input_x_.wav
    │   ...
    │   ├── target_x_.wav
    │   ...
    ├── Val
    │   ├── input_y_.wav
    │   ...
    │   ├── target_y_.wav
    │   ...
    └── Test
        ├── input_z_.wav
        ...
        ├── target_z_.wav
        ...

    While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.

    The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
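
    A minimal sketch of loading one input/target pair and its optional delay trajectory, following the naming scheme above (the directory path is a placeholder assembled from the naming convention; assumes the soundfile and numpy packages are installed):

```python
# Sketch: load an input/target pair plus the optional trajectory file.
# The directory path is a placeholder built from the naming convention.
import os
import numpy as np
import soundfile as sf

root = "ReelToReel_Dataset_MiniPulse100_AKAI_3.75IPS_MAXELL/Train"  # placeholder
x, sr = sf.read(os.path.join(root, "input_100_.wav"))
y, _ = sf.read(os.path.join(root, "target_100_.wav"))
traj = os.path.join(root, "trajectory_100_.npy")
trajectory = np.load(traj) if os.path.exists(traj) else None
print(sr, x.shape, y.shape, None if trajectory is None else trajectory.shape)
```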

  20. Data from: International Climate Benchmarks and Input Parameters for a...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). International Climate Benchmarks and Input Parameters for a Stochastic Weather Generator, CLIGEN [Dataset]. https://catalog.data.gov/dataset/international-climate-benchmarks-and-input-parameters-for-a-stochastic-weather-generator-c-74051
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset represents CLIGEN input parameters for locations in 68 countries. CLIGEN is a point-scale stochastic weather generator that produces long-term weather simulations with daily output. The input parameters are essentially monthly climate statistics that also serve as climate benchmarks. Three unique input parameter sets are differentiated by having been produced from 30-year, 20-year and 10-year minimum record lengths, corresponding to 7673, 2336, and 2694 stations, respectively. The primary source of data is the NOAA GHCN-Daily dataset, and due to data gaps, records longer than the three minimum record lengths were often queried to produce the needed number of complete monthly records. The vast majority of stations used at least some data from the 2000s, and temporal coverages are shown in the Excel table for each station. CLIGEN has various applications, including being used to force soil erosion models. This dataset may reduce the effort needed in preparing climate inputs for such applications.

    Revised input files added on 11/16/20. These files were revised from the original dataset: fixed metadata issues with the headings of each file, and fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.

    Second revision input files added on 2/12/20. A formatting error was fixed that affected transition probabilities for 238 stations with zero recorded precipitation for one or more months. The affected stations were predominantly in Australia and Mexico.

    Resources in this dataset:

    Resource Title: 30-year input files. File Name: 30-year.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 20-year input files. File Name: 20-year.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 10-year input files. File Name: 10-year.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: Map Layer. File Name: MapLayer.kmz. Resource Description: Map layer showing locations of the new CLIGEN stations. This layer may be imported into Google Earth and used to find the station closest to an area of interest. Resource Software Recommended: Google Earth, url: https://www.google.com/earth/

    Resource Title: Temporal Ranges of Years Queried. File Name: GHCN-Daily Year Ranges.xlsx. Resource Description: Excel tables of the first and last years queried from GHCN-Daily when searching for complete monthly records (with no gaps in data). Any ranges in excess of 30 years, 20 years and 10 years, for the respective datasets, are due to data gaps.

    Resource Title: 30-year input files (revised). File Name: 30-year revised.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file, and fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 20-year input files (revised). File Name: 20-year revised.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file, and fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 10-year input files (revised). File Name: 10-year revised.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Revised from the original dataset: fixed metadata issues with the headings of each file, and fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 30-year input files (revised 2). File Name: 30-year revised 2.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 20-year input files (revised 2). File Name: 20-year revised 2.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

    Resource Title: 10-year input files (revised 2). File Name: 10-year revised 2.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Fixed a formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
