23 datasets found

Synthetic Integrated Services Data
data.wprdc.org
csv, html, pdf, zip
Updated Jun 25, 2024
Cite
Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
Explore at:
Available download formats: csv (1375554033), html, pdf, zip (39231637)
Dataset updated
Jun 25, 2024
Dataset provided by
Allegheny County
Description
Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.
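As a rough illustration of generating synthetic values from a probability distribution fitted to confidential data, the sketch below estimates the marginal distribution of a single categorical column and samples new values from it. This is not the County's modeling process, which is described in the technical brief; the column values and function names are hypothetical.

```python
import random
from collections import Counter


def fit_marginal(values):
    """Estimate category probabilities from the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def sample_synthetic(probabilities, n, seed=0):
    """Draw n synthetic values from the fitted categorical distribution."""
    rng = random.Random(seed)
    categories = list(probabilities)
    weights = [probabilities[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Hypothetical confidential column (illustrative values only).
confidential = ["housing", "housing", "mental_health", "aging", "housing"]

model = fit_marginal(confidential)
synthetic = sample_synthetic(model, n=1000)
```

A single-column marginal model like this preserves category frequencies but none of the relationships between columns, which is one reason candidate datasets need to be evaluated for utility as well as privacy.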

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
Data from: Many Models in R: A Tutorial - National Child Development Study:...
beta.ukdataservice.ac.uk
Updated 2023
Cite
Liam Wright (2023). Many Models in R: A Tutorial - National Child Development Study: Age 46, Sweep 7, 2004-2005: Synthetic Data, 2023 [Dataset]. http://doi.org/10.5255/ukda-sn-856610
Explore at:
Unique identifier
https://doi.org/10.5255/ukda-sn-856610
Dataset updated
2023
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
DataCite (https://www.datacite.org/)
Authors
Liam Wright
Description
The deposit contains a dataset created for the paper, 'Many Models in R: A Tutorial'. ncds.Rds is an R format synthetic dataset created with the synthpop package in R using data from the National Child Development Study (NCDS), a birth cohort of individuals born in a single week of March 1958 in Britain. The dataset contains data on fourteen biomarkers collected at the age 46/47 sweep of the survey, four measures of cognitive ability from ages 11 and 16, and three covariates: sex, body mass index at age 11, and father's social class. The data is only intended to be used in the tutorial; it is not to be used for drawing statistical inferences.
Dataset for a tutorial dedicated to the Sankey diagram
data.niaid.nih.gov
Updated Aug 18, 2022
Cite
Antoine Lamer (2022). Dataset for a tutorial dedicated to the Sankey diagram [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7004010
Explore at:
Dataset updated
Aug 18, 2022
Dataset provided by
Antoine Lamer
Manel Ismail
Rémi Lenain
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a standard table representing steps of patient care. It contains 4 standard variables: a patient identifier, the label of the step, the start date, and the end date of the step. One patient may have several steps. The step labels are synthetic (i.e., A, B, C, D, E, F) and may correspond to stays in a care unit, successive administrations of drugs, or medical procedures.

This dataset is used for a tutorial dedicated to the Sankey diagram: https://gitlab.com/d8096/health_data_science_tutorials/-/tree/main/tutorials/sankey_diagram
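A table of this shape can be turned into Sankey links by counting transitions between consecutive steps for each patient. The sketch below is a minimal illustration, not the tutorial's code; the row layout (patient identifier, step label, start date, end date) follows the description above, while the function name and sample rows are hypothetical.

```python
from collections import Counter
from datetime import date


def sankey_links(rows):
    """Count transitions between consecutive steps of each patient.

    rows: iterable of (patient_id, step_label, start_date, end_date).
    Returns a Counter mapping (source_step, target_step) to a count.
    """
    by_patient = {}
    for patient_id, label, start, _end in rows:
        by_patient.setdefault(patient_id, []).append((start, label))

    links = Counter()
    for steps in by_patient.values():
        steps.sort()  # order each patient's steps by start date
        labels = [label for _start, label in steps]
        for source, target in zip(labels, labels[1:]):
            links[(source, target)] += 1
    return links


# Hypothetical sample rows (illustrative values only).
rows = [
    (1, "A", date(2022, 1, 1), date(2022, 1, 3)),
    (1, "B", date(2022, 1, 3), date(2022, 1, 5)),
    (2, "B", date(2022, 2, 2), date(2022, 2, 4)),
    (2, "A", date(2022, 2, 1), date(2022, 2, 2)),
    (2, "C", date(2022, 2, 4), date(2022, 2, 6)),
]

links = sankey_links(rows)
```

Each (source, target) pair and its count corresponds to one link of the diagram, with the count as the link width.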
Synthetic total-field magnetic anomaly data and code to perform Euler...
figshare.com
txt
Updated May 30, 2023
Cite
Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.923450.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material along with the manuscript can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

Synthetic data and model

Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down) and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

Reproducing the results in the tutorial

The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
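The four-column layout of synthetic_data.txt can be read with a few lines of standard-library Python. The sketch below assumes the file is whitespace-delimited and that any header lines start with "#"; neither detail is stated in the description, so check the file before relying on it. The function name and sample values are hypothetical.

```python
def load_anomaly(lines):
    """Parse rows of x, y, z, anomaly from an iterable of text lines.

    Assumes whitespace-delimited columns and '#' comment lines
    (an assumption, not confirmed by the dataset description).
    """
    columns = {"x": [], "y": [], "z": [], "anomaly": []}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        x, y, z, anomaly = (float(value) for value in line.split())
        columns["x"].append(x)        # north, meters
        columns["y"].append(y)        # east, meters
        columns["z"].append(z)        # down, meters
        columns["anomaly"].append(anomaly)  # total field anomaly, nT
    return columns


# Hypothetical sample rows standing in for the real file.
sample = [
    "# x y z total_field_anomaly",
    "0.0 0.0 -150.0 12.5",
    "0.0 100.0 -150.0 -3.25",
]

data = load_anomaly(sample)
```

With the real file, the same function can be applied to an open file handle, for example `with open("synthetic_data.txt") as f: data = load_anomaly(f)`.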
Synthetic Fruit Object Detection Dataset
public.roboflow.com
zip
Updated Aug 11, 2021
Cite
Brad Dwyer (2021). Synthetic Fruit Object Detection Dataset [Dataset]. https://public.roboflow.com/object-detection/synthetic-fruit
Explore at:
Available download formats: zip
Dataset updated
Aug 11, 2021
Dataset authored and provided by
Brad Dwyer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Bounding Boxes of Fruits
Description
About this dataset

This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.

To generate your own images, follow our tutorial or download the code.

Example: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
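The placement step described above (superimposing an object with a random orientation and scale, then recording its bounding box) can be sketched without any image library. The code below is an illustrative assumption, not Roboflow's code: it computes only the geometry, omits the color transformation and pixel compositing, and uses a hypothetical scale range and sprite size.

```python
import math
import random

CANVAS_W, CANVAS_H = 416, 550  # image size stated in the description


def rotated_bbox(cx, cy, w, h, angle_deg):
    """Axis-aligned bounding box of a w x h rectangle rotated about its center."""
    a = math.radians(angle_deg)
    half_w = (abs(w * math.cos(a)) + abs(h * math.sin(a))) / 2
    half_h = (abs(w * math.sin(a)) + abs(h * math.cos(a))) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def random_placement(sprite_w, sprite_h, rng):
    """Pick a random scale, angle, and center; return the resulting box."""
    scale = rng.uniform(0.5, 1.5)  # hypothetical scale range
    angle = rng.uniform(0, 360)
    w, h = sprite_w * scale, sprite_h * scale
    # Half-extents of the rotated box, used to keep it inside the canvas.
    _, _, half_w, half_h = rotated_bbox(0, 0, w, h, angle)
    cx = rng.uniform(half_w, CANVAS_W - half_w)
    cy = rng.uniform(half_h, CANVAS_H - half_h)
    return rotated_bbox(cx, cy, w, h, angle)


rng = random.Random(42)
# Hypothetical 64x64 fruit sprite placed five times on one background.
boxes = [random_placement(64, 64, rng) for _ in range(5)]
```

Each tuple is (x_min, y_min, x_max, y_max) in pixels, which is the kind of annotation an object detection dataset stores alongside each image.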
Synthetic Fruit Old Dataset
universe.roboflow.com
zip
Updated Apr 14, 2020
Cite
Brad Dwyer (2020). Synthetic Fruit Old Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/synthetic-fruit-old/3
Explore at:
Available download formats: zip
Dataset updated
Apr 14, 2020
Dataset authored and provided by
Brad Dwyer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Fruits Bounding Boxes
Description
About this dataset

This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.

To generate your own images, follow our tutorial or download the code.

Example: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
Land Cover Fraction Mapping with FORCE - Supplemental Data
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 13, 2023
Cite
Franz Schug; David Frantz (2023). Land Cover Fraction Mapping with FORCE - Supplemental Data [Dataset]. http://doi.org/10.5281/zenodo.7529763
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7529763
Dataset updated
Jan 13, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Franz Schug; David Frantz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

This upload contains data required to replicate a tutorial that applies regression-based unmixing of spectral-temporal metrics for sub-pixel land cover mapping with synthetically created training data. The tutorial uses the Framework for Operational Radiometric Correction for Environmental monitoring.

This dataset contains intermediate and final results of the workflow described in that tutorial as well as auxiliary data such as parameter files.

Please refer to the above mentioned tutorial for more information.
Data from: SMART-DS Synthetic Electrical Network Data OpenDSS Models for...
data.openei.org
catalog.data.gov
code, data, website
Updated Dec 18, 2020
Cite
Bryan Palmintier; Carlos Mateo Domingo; Fernando Emilio Postigo Marcos; Tomas Gomez San Roman; Fernando de Cuadra; Nicolas Gensollen; Tarek Elgindy; Pablo Duenas (2020). SMART-DS Synthetic Electrical Network Data OpenDSS Models for SFO, GSO, and AUS [Dataset]. https://data.openei.org/submissions/2981
Explore at:
Available download formats: website, code, data
Dataset updated
Dec 18, 2020
Dataset provided by
Open Energy Data Initiative (OEDI)
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
National Renewable Energy Laboratory (NREL)
Authors
Bryan Palmintier; Carlos Mateo Domingo; Fernando Emilio Postigo Marcos; Tomas Gomez San Roman; Fernando de Cuadra; Nicolas Gensollen; Tarek Elgindy; Pablo Duenas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SMART-DS datasets (Synthetic Models for Advanced, Realistic Testing: Distribution systems and Scenarios) are realistic large-scale U.S. electrical distribution models for testing advanced grid algorithms and technology analysis. This document provides a user guide for the datasets.

This dataset contains synthetic detailed electrical distribution network models, and connected timeseries loads for the greater San Francisco (SFO), Greensboro, and Austin areas. It is intended to provide researchers with very realistic and complete models that can be used for extensive powerflow simulations under a variety of scenarios. The data is synthetic, but has been validated against thousands of utility feeders to ensure statistical and operational similarity to electrical distribution networks in the US.

The OpenDSS data is partitioned into several regions (each zipped separately). After unzipping these files, each region has a folder for each substation, and subsequent folders for each feeder within the substation. This allows users to simulate smaller sections of the full dataset. Each of these folders (region, substation and feeder) has a folder titled "analysis" which contains CSV files listing voltages and overloads throughout the network for the peak loading time in the year. It also contains .png files showing the loading of residential and commercial loads on the network for every day of the year, and daily breakdowns of loads for commercial building categories. Time series data is provided in the "profiles" folder including real and reactive power at 15 minute resolution along with parquet files in the "endues" folder with breakdowns of building end-uses.
Go Positions Dataset
universe.roboflow.com
zip
Updated Aug 4, 2023
Cite
Synthetic Data (2023). Go Positions Dataset [Dataset]. https://universe.roboflow.com/synthetic-data-3ol2y/go-positions/model/4
Explore at:
Available download formats: zip
Dataset updated
Aug 4, 2023
Dataset authored and provided by
Synthetic Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Go Pieces Bounding Boxes
Description
Synthetic dataset of black and white stones on Go boards, generated using Unity Perception.

Use Case

The goal is to take a picture of a Go game and determine the position of each stone in order to score the game or analyze it with AI. Project inspiration stems from this blog post along with past ideas we've had for this: https://blog.roboflow.com/chess-boards/

Classes

blackStone: Black Go stones, 90,501 labels
whiteStone: White Go stones, 89,963 labels
grid: Cross section grid of a Go board, 1,000 labels
GenoCAD Tutorials
figshare.com
zip
Updated May 31, 2023
Cite
Mary Mangan; Mandy Wilson; Laura Adam; Jean Peccoud (2023). GenoCAD Tutorials [Dataset]. http://doi.org/10.6084/m9.figshare.153827.v15
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.153827.v15
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Mary Mangan; Mandy Wilson; Laura Adam; Jean Peccoud
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This tutorial includes two PowerPoint presentations developed by Mary Mangan from OpenHelix. Students should start with the Introduction before moving on to the Advanced tutorial. The slide decks include numerous comments that will help students go through the tutorials. To perform the hands-on activities, students need to download the GenoCAD Training Set. This dataset includes a list of parts and a grammar used as part of the GenoCAD Introductory tutorial. To import this data set into GenoCAD, proceed as follows:
1) Log into GenoCAD; create an account if you don't already have one.
2) Click on the Parts tab.
3) Click on the Grammars tab.
4) Click on the Add/Import Grammar button.
5) Using the "choose file" button, select the grammar file (.genocad) and click on import grammar.
6) Click on "use existing icon set" and click on "continue import".
Upon completion of this procedure you should have a new grammar with a library of 37 parts in your workspace.

The tutorial also includes a series of additional exercises that will be used to reinforce the concepts introduced in the tutorial. Please visit the GenoCAD page for videos of the tutorials.
datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World...
frontiersin.figshare.com
pdf
Updated Jun 3, 2023
Cite
Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto (2023). datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks.pdf [Dataset]. http://doi.org/10.3389/frai.2021.612551.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/frai.2021.612551.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. 
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
Dataset for RNA-seq basic tutorial in Galaxy Australia
zenodo.org
bin
Updated Jan 24, 2020
Cite
Jessica Chung (2020). Dataset for RNA-seq basic tutorial in Galaxy Australia [Dataset]. http://doi.org/10.5281/zenodo.1409427
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.1409427
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jessica Chung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset of synthetic RNA-seq data for Drosophila melanogaster: 2 conditions, 3 replicates in each, paired-end reads. Reference genome in GTF format.
Test Data In Spreadsheet Format
explore.openaire.eu
Updated Oct 6, 2017
Cite
J. Goedhart (2017). Test Data In Spreadsheet Format [Dataset]. http://doi.org/10.5281/zenodo.1003222
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.1003222
Dataset updated
Oct 6, 2017
Authors
J. Goedhart
Description
Synthetic test data for a tutorial that explains how to convert spreadsheet data to tidy data.
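For readers unfamiliar with the term, converting spreadsheet data to tidy data usually means reshaping a wide table (one column per sample) into a long one (one row per observation). The sketch below is a minimal standard-library illustration under that assumption; it is not the tutorial's method, and the column names and values are hypothetical.

```python
def to_tidy(header, rows, id_column):
    """Reshape wide rows into one record per (id, variable, value)."""
    id_index = header.index(id_column)
    tidy = []
    for row in rows:
        for i, name in enumerate(header):
            if i == id_index:
                continue  # the id column is repeated on every record instead
            tidy.append({
                id_column: row[id_index],
                "variable": name,
                "value": row[i],
            })
    return tidy


# Hypothetical wide spreadsheet: one column per measured sample.
header = ["time", "cell_1", "cell_2"]
rows = [
    [0, 1.0, 1.2],
    [1, 1.1, 1.4],
]

tidy = to_tidy(header, rows, "time")
```

Each wide row with two measurement columns becomes two tidy records, so the two rows here produce four records.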
Synthetic data generating parameters. The table summarizes the generating...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi (2025). Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2. [Dataset]. http://doi.org/10.1371/journal.pone.0319031.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0319031.t002
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.
V2X-SIM Dataset
paperswithcode.com
Updated May 9, 2023
Cite
Yiming Li; Dekun Ma; Ziyan An; Zixun Wang; Yiqi Zhong; Siheng Chen; Chen Feng (2023). V2X-SIM Dataset [Dataset]. https://paperswithcode.com/dataset/v2x-sim
Explore at:
Dataset updated
May 9, 2023
Authors
Yiming Li; Dekun Ma; Ziyan An; Zixun Wang; Yiqi Zhong; Siheng Chen; Chen Feng
Description
V2X-Sim, short for vehicle-to-everything simulation, is a synthetic collaborative perception dataset for autonomous driving developed by AI4CE Lab at NYU and MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both the roadside and vehicles when they are present near the same intersection. With information from both the roadside infrastructure and vehicles, the dataset aims to encourage research on collaborative perception tasks.

Although not collected from the real world, highly realistic traffic simulation software is used to ensure the representativeness of the dataset compared to real-world driving scenarios. To be more exact, the traffic flow of the recording files is managed by CARLA-SUMO co-simulation, and three town maps from CARLA are currently used to increase the diversity of the dataset.

Here is a tutorial showing how to load the dataset: https://ai4ce.github.io/V2X-Sim/tutorial.html
Sentinel-1 RTC imagery processed by ASF over central Himalaya in High...
zenodo.org
explore.openaire.eu
zip
Updated Oct 28, 2022
Cite
Emma Marshall; Scott Henderson; Deepak Cherian; Jessica Scheick (2022). Sentinel-1 RTC imagery processed by ASF over central Himalaya in High Mountain Asia [Dataset]. http://doi.org/10.5281/zenodo.7236413
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7236413
Dataset updated
Oct 28, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Emma Marshall; Scott Henderson; Deepak Cherian; Jessica Scheick
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Himalayas, High-mountain Asia
Description
This is a dataset of Sentinel-1 radiometric terrain corrected (RTC) imagery processed by the Alaska Satellite Facility covering a region within the Central Himalaya. It accompanies a tutorial that demonstrates how to access and work with Sentinel-1 RTC imagery using xarray and other open-source Python packages.
Multi-Sensor Ice Analysis Data: Analysis for Belgica Bank, North East...
data.niaid.nih.gov
data.europa.eu
Updated Sep 8, 2022
Cite
Hughes, Nick (2022). Multi-Sensor Ice Analysis Data: Analysis for Belgica Bank, North East Greenland 2019-20 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7053974
Explore at:
Dataset updated
Sep 8, 2022
Dataset provided by
Hughes, Nick
Amdal, Frank
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Greenland
Description
The intention is that this dataset can be used for machine learning and deep neural network training/validation, and it distinguishes sea ice concentration, type and form derived from manual analysis of a combination of different satellite sensors including ALOS-2, Sentinel-1, COSMO-SkyMed, Sentinel-2, and ICESAT-2. The region chosen for the analysis was the Belgica Bank area offshore of North East Greenland, as this is an area which experiences a wide variety of sea ice, and iceberg, conditions throughout the year. The dataset consists of two parts: 11 days of individual sea ice interpretations, one for each month in the period from April 2019 to March 2020, with the exception of October 2019, and iceberg surveys derived from Sentinel-2 for spring in 2019 and 2020.

The dataset includes a user guide issued by MET Norway as report 10/2022 (see https://www.met.no/publikasjoner/met-report) in which the first part describes the data sources, nomenclature, file formats and data in the analysis. A second part of the report compares synthetic aperture radar (SAR) data from both L-band ALOS-2 and C-band Sentinel-1 satellites, and identifies the visible synergies and anomalies. The results confirm that there are variations in backscatter signatures between ALOS-2 and Sentinel-1 data when comparing them for different sea ice situations and conditions. ALOS-2 data in many cases is proven to be a reliable and beneficial source of data when it comes to identifying icebergs, ridges, determining sea ice type, and also distinguishing ice and water compared to standalone Sentinel-1 data.
Open CAN IDS datasets’ metadata.
figshare.com
plos.figshare.com
xls
Updated Jan 22, 2024
Cite
Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs (2024). Open CAN IDS datasets’ metadata. [Dataset]. http://doi.org/10.1371/journal.pone.0296879.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0296879.t002
Dataset updated
Jan 22, 2024
Dataset provided by
PLOS ONE
Authors
Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions or anomalies on CANs. Producing vehicular CAN data with a variety of intrusions is a difficult task for most researchers as it requires expensive assets and deep expertise. To illuminate this task, we introduce the first comprehensive guide to the existing open CAN intrusion detection system (IDS) datasets. We categorize attacks on CANs including fabrication (adding frames, e.g., flooding or targeting an ID), suspension (removing an ID’s frames), and masquerade attacks (spoofed frames sent in lieu of suspended ones). We provide a quality analysis of each dataset; an enumeration of each dataset’s attacks, benefits, and drawbacks; categorization as real vs. simulated CAN data and real vs. simulated attacks; whether the data is raw CAN data or signal-translated; number of vehicles/CANs; quantity in terms of time; and finally a suggested use case of each dataset. State-of-the-art public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, lacking fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but is missing a corresponding “raw” binary version. This issue pigeonholes CAN IDS research into testing on limited and often inappropriate data (usually with attacks that are too easily detectable to truly test the method). The scarcity of appropriate data has stymied comparability and reproducibility of results for researchers. As our primary contribution, we present the Real ORNL Automotive Dynamometer (ROAD) CAN IDS dataset, consisting of over 3.5 hours of one vehicle’s CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real (i.e. non-simulated) fuzzing, fabrication, unique advanced attacks, and simulated masquerade attacks. To facilitate a benchmark for CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS research field.
Logs in ROAD CAN intrusion detection dataset.
plos.figshare.com
xls
Updated Jan 22, 2024
Cite
Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs (2024). Logs in ROAD CAN intrusion detection dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0296879.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0296879.t005
Dataset updated
Jan 22, 2024
Dataset provided by
PLOS ONE
Authors
Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from: SCScore: Synthetic Complexity Learned from a Reaction Corpus
acs.figshare.com
txt
Updated Jun 9, 2023
Cite
Connor W. Coley; Luke Rogers; William H. Green; Klavs F. Jensen (2023). SCScore: Synthetic Complexity Learned from a Reaction Corpus [Dataset]. http://doi.org/10.1021/acs.jcim.7b00622.s004
Available download formats: txt
Unique identifier
https://doi.org/10.1021/acs.jcim.7b00622.s004
Dataset updated
Jun 9, 2023
Dataset provided by
ACS Publications
Authors
Connor W. Coley; Luke Rogers; William H. Green; Klavs F. Jensen
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Several definitions of molecular complexity exist to facilitate prioritization of lead compounds, to identify diversity-inducing and complexifying reactions, and to guide retrosynthetic searches. In this work, we focus on synthetic complexity and reformalize its definition to correlate with the expected number of reaction steps required to produce a target molecule, with implicit knowledge about what compounds are reasonable starting materials. We train a neural network model on 12 million reactions from the Reaxys database to impose a pairwise inequality constraint enforcing the premise of this definition: that on average, the products of published chemical reactions should be more synthetically complex than their corresponding reactants. The learned metric (SCScore) exhibits highly desirable nonlinear behavior, particularly in recognizing increases in synthetic complexity throughout a number of linear synthetic routes.
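The pairwise inequality constraint described above can be illustrated with a small sketch. The abstract does not state the exact loss, so the hinge-style penalty, the `margin` value, the function name `pairwise_hinge_loss`, and the hand-picked scores below are all assumptions made for illustration; none of them come from the dataset or the authors' code.

```python
def pairwise_hinge_loss(reactant_scores, product_scores, margin=0.25):
    """Mean hinge penalty over reactant/product pairs.

    A pair contributes zero when the product's score exceeds the
    reactant's score by at least `margin` (an assumed value), and a
    positive penalty otherwise.
    """
    if len(reactant_scores) != len(product_scores):
        raise ValueError("each reactant score needs a matching product score")
    if not reactant_scores:
        raise ValueError("at least one pair is required")
    penalties = [
        max(0.0, margin + reactant - product)
        for reactant, product in zip(reactant_scores, product_scores)
    ]
    return sum(penalties) / len(penalties)


# Hypothetical complexity scores for three reactions (illustrative only).
reactants = [1.5, 2.0, 3.0]
products = [2.5, 2.1, 2.8]
loss = pairwise_hinge_loss(reactants, products)
print(f"mean pairwise penalty: {loss:.3f}")
```

Only the second and third pairs are penalized here: in the second the product outscores the reactant by less than the margin, and in the third the product scores lower than the reactant. Minimizing a penalty of this kind pushes a model toward ranking products above their reactants on average, which is the premise the description states.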

Cite

Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data

Synthetic Integrated Services Data

2 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: csv(1375554033), html, pdf, zip(39231637)

Dataset updated

Jun 25, 2024

Dataset provided by

Allegheny County

Description

Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
