23 datasets found
  1. Synthetic Integrated Services Data

    • data.wprdc.org
    csv, html, pdf, zip
    Updated Jun 25, 2024
    Cite
    Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
    Explore at:
    Available download formats: csv (1375554033), html, pdf, zip (39231637)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Allegheny County
    Description

    Motivation

    This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

    This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

    Collection

    The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

    Preprocessing

    Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
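
    As a rough illustration of the general idea only (not the modeling approach used for this dataset, which is described in the technical report), the Python sketch below estimates per-column value frequencies from a tiny made-up table and samples new records from them. The column names, the records, and the assumption that columns are independent are all invented for illustration; real synthesizers also model the relationships between variables.

```python
import random
from collections import Counter

def fit_marginals(records):
    """Estimate, for each column, the empirical frequency of every value."""
    columns = records[0].keys()
    return {col: Counter(row[col] for row in records) for col in columns}

def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic records, sampling each column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic

# Hypothetical "confidential" records, invented for this example.
confidential = [
    {"age_group": "18-34", "service": "housing"},
    {"age_group": "18-34", "service": "behavioral_health"},
    {"age_group": "35-64", "service": "housing"},
    {"age_group": "65+", "service": "aging"},
]

marginals = fit_marginals(confidential)
synthetic = sample_synthetic(marginals, n=10)
```

    No synthetic record is copied from a real one, but every value it contains comes from the fitted frequencies. That is the utility and privacy trade-off the candidate datasets were evaluated on.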

    For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

    Recommended Uses

    This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

    Known Limitations/Biases

    Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

    Feedback

    Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

    Further Documentation and Resources

    1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
    2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
    3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
    4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.

  2. Data from: Many Models in R: A Tutorial - National Child Development Study:...

    • beta.ukdataservice.ac.uk
    Updated 2023
    Cite
    Liam Wright (2023). Many Models in R: A Tutorial - National Child Development Study: Age 46, Sweep 7, 2004-2005: Synthetic Data, 2023 [Dataset]. http://doi.org/10.5255/ukda-sn-856610
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    DataCite (https://www.datacite.org/)
    Authors
    Liam Wright
    Description

    The deposit contains a dataset created for the paper, 'Many Models in R: A Tutorial'. ncds.Rds is an R format synthetic dataset created with the synthpop package in R using data from the National Child Development Study (NCDS), a birth cohort of individuals born in a single week of March 1958 in Britain. The dataset contains data on fourteen biomarkers collected at the age 46/47 sweep of the survey, four measures of cognitive ability from ages 11 and 16, and three covariates: sex, body mass index at age 11, and father's social class. The data is only intended to be used in the tutorial; it is not to be used for drawing statistical inferences.

  3. Dataset for a tutorial dedicated to the Sankey diagram

    • data.niaid.nih.gov
    Updated Aug 18, 2022
    Cite
    Antoine Lamer (2022). Dataset for a tutorial dedicated to the Sankey diagram [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7004010
    Explore at:
    Dataset updated
    Aug 18, 2022
    Dataset provided by
    Antoine Lamer
    Manel Ismail
    Rémi Lenain
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a standard table representing steps of patient care. It contains 4 standard variables: a patient identifier, the label of the step, the start date, and the end date of the step. One patient may have several steps. The step labels are synthetic (i.e., A, B, C, D, E, F) and may correspond to stays in a care unit, successive administrations of drugs, or the carrying out of medical procedures.

    This dataset is used for a tutorial dedicated to the Sankey diagram: https://gitlab.com/d8096/health_data_science_tutorials/-/tree/main/tutorials/sankey_diagram
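
    To show how a table of this shape could feed a Sankey diagram, the Python sketch below counts transitions between consecutive steps of each patient. It is not taken from the linked tutorial; the function name, the tuple layout, and the example rows are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

def sankey_links(rows):
    """Count transitions between consecutive steps of each patient.

    rows: iterable of (patient_id, step_label, start_date, end_date).
    Returns a Counter mapping (source_label, target_label) -> count.
    """
    by_patient = {}
    for patient_id, label, start, _end in rows:
        by_patient.setdefault(patient_id, []).append((start, label))

    links = Counter()
    for steps in by_patient.values():
        steps.sort()  # order each patient's steps by start date
        for (_, source), (_, target) in zip(steps, steps[1:]):
            links[(source, target)] += 1
    return links

# Invented example rows in the four-variable layout described above.
rows = [
    (1, "A", date(2022, 1, 1), date(2022, 1, 3)),
    (1, "B", date(2022, 1, 3), date(2022, 1, 5)),
    (2, "A", date(2022, 2, 1), date(2022, 2, 2)),
    (2, "B", date(2022, 2, 2), date(2022, 2, 4)),
    (2, "C", date(2022, 2, 4), date(2022, 2, 6)),
]
links = sankey_links(rows)
```

    Each key of `links` is one flow in the diagram and its count is the flow's width.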

  4. Synthetic total-field magnetic anomaly data and code to perform Euler...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material along with the manuscript can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

    Synthetic data and model

    Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down) and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

    Reproducing the results in the tutorial

    The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
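
    A minimal sketch of reading a four-column table like synthetic_data.txt in plain Python follows. It assumes whitespace-separated values and that lines starting with "#" are comments; neither detail is confirmed by the description, and the sample rows are invented.

```python
import io

def read_anomaly_table(lines):
    """Parse rows of x, y, z, anomaly into four column lists."""
    x, y, z, anomaly = [], [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment convention
            continue
        xi, yi, zi, ai = (float(v) for v in line.split()[:4])
        x.append(xi)
        y.append(yi)
        z.append(zi)
        anomaly.append(ai)
    return x, y, z, anomaly

# Invented stand-in for the real file contents.
sample = io.StringIO(
    "# x(north) y(east) z(down) anomaly(nT)\n"
    "0.0 0.0 -100.0 12.5\n"
    "0.0 50.0 -100.0 -3.25\n"
)
x, y, z, anomaly = read_anomaly_table(sample)
```

    With the downloaded file, the same function can be given an open file handle instead of the in-memory sample.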

  5. Synthetic Fruit Object Detection Dataset

    • public.roboflow.com
    zip
    Updated Aug 11, 2021
    + more versions
    Cite
    Brad Dwyer (2021). Synthetic Fruit Object Detection Dataset [Dataset]. https://public.roboflow.com/object-detection/synthetic-fruit
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 11, 2021
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of Fruits
    Description

    About this dataset

    This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

    The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.

    To generate your own images, follow our tutorial or download the code.

    Example: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
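
    The placement step described above can be sketched without any image library: pick a random scale and position for each object and record the resulting bounding box. This is not Roboflow's code; the labels, source sizes, and scale range are hypothetical, and the orientation and color transformations are left out.

```python
import random

IMAGE_W, IMAGE_H = 416, 550  # image size stated in the description

def place_objects(object_sizes, scale_range=(0.5, 1.5), seed=0):
    """Pick a random scale and position for each object and return its box.

    object_sizes: list of (label, width, height) in pixels before scaling.
    Returns a list of (label, x_min, y_min, x_max, y_max) inside the image.
    """
    rng = random.Random(seed)
    boxes = []
    for label, w, h in object_sizes:
        scale = rng.uniform(*scale_range)
        # Clamp the scaled size so the object always fits in the image.
        sw = min(int(w * scale), IMAGE_W)
        sh = min(int(h * scale), IMAGE_H)
        x_min = rng.randint(0, IMAGE_W - sw)
        y_min = rng.randint(0, IMAGE_H - sh)
        boxes.append((label, x_min, y_min, x_min + sw, y_min + sh))
    return boxes

# Hypothetical fruit cut-outs: (label, width, height).
boxes = place_objects([("apple", 100, 100), ("banana", 120, 60)])
```

    Because the boxes are produced at the same time as the composite, the annotations come for free, which is the main appeal of this kind of synthetic dataset.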

  6. Synthetic Fruit Old Dataset

    • universe.roboflow.com
    zip
    Updated Apr 14, 2020
    Cite
    Brad Dwyer (2020). Synthetic Fruit Old Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/synthetic-fruit-old/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 14, 2020
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Fruits Bounding Boxes
    Description

    About this dataset

    This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

    The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.

    To generate your own images, follow our tutorial or download the code.

    Example: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg

  7. Land Cover Fraction Mapping with FORCE - Supplemental Data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 13, 2023
    Cite
    Franz Schug; David Frantz (2023). Land Cover Fraction Mapping with FORCE - Supplemental Data [Dataset]. http://doi.org/10.5281/zenodo.7529763
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Franz Schug; David Frantz
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains data required to replicate a tutorial that applies regression-based unmixing of spectral-temporal metrics for sub-pixel land cover mapping with synthetically created training data. The tutorial uses the Framework for Operational Radiometric Correction for Environmental monitoring.

    This dataset contains intermediate and final results of the workflow described in that tutorial as well as auxiliary data such as parameter files.

    Please refer to the above mentioned tutorial for more information.

  8. Data from: SMART-DS Synthetic Electrical Network Data OpenDSS Models for...

    • data.openei.org
    • catalog.data.gov
    code, data, website
    Updated Dec 18, 2020
    + more versions
    Cite
    Bryan Palmintier; Carlos Mateo Domingo; Fernando Emilio Postigo Marcos; Tomas Gomez San Roman; Fernando de Cuadra; Nicolas Gensollen; Tarek Elgindy; Pablo Duenas (2020). SMART-DS Synthetic Electrical Network Data OpenDSS Models for SFO, GSO, and AUS [Dataset]. https://data.openei.org/submissions/2981
    Explore at:
    Available download formats: website, code, data
    Dataset updated
    Dec 18, 2020
    Dataset provided by
    Open Energy Data Initiative (OEDI)
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    National Renewable Energy Laboratory (NREL)
    Authors
    Bryan Palmintier; Carlos Mateo Domingo; Fernando Emilio Postigo Marcos; Tomas Gomez San Roman; Fernando de Cuadra; Nicolas Gensollen; Tarek Elgindy; Pablo Duenas
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SMART-DS datasets (Synthetic Models for Advanced, Realistic Testing: Distribution systems and Scenarios) are realistic large-scale U.S. electrical distribution models for testing advanced grid algorithms and technology analysis. This document provides a user guide for the datasets.

    This dataset contains synthetic detailed electrical distribution network models, and connected timeseries loads for the greater San Francisco (SFO), Greensboro, and Austin areas. It is intended to provide researchers with very realistic and complete models that can be used for extensive powerflow simulations under a variety of scenarios. The data is synthetic, but has been validated against thousands of utility feeders to ensure statistical and operational similarity to electrical distribution networks in the US.

    The OpenDSS data is partitioned into several regions (each zipped separately). After unzipping these files, each region has a folder for each substation, and subsequent folders for each feeder within the substation. This allows users to simulate smaller sections of the full dataset. Each of these folders (region, substation and feeder) has a folder titled "analysis" which contains CSV files listing voltages and overloads throughout the network for the peak loading time in the year. It also contains .png files showing the loading of residential and commercial loads on the network for every day of the year, and daily breakdowns of loads for commercial building categories. Time series data is provided in the "profiles" folder including real and reactive power at 15 minute resolution along with parquet files in the "endues" folder with breakdowns of building end-uses.
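
    Since every region, substation, and feeder folder carries its own "analysis" folder, one way to locate them after unzipping is a recursive search. The Python sketch below builds a tiny stand-in tree and walks it; the folder names region_1, substation_1, and feeder_1 are hypothetical and do not reflect the dataset's real naming.

```python
from pathlib import Path
import tempfile

def find_analysis_dirs(root):
    """Return every folder named 'analysis' under root, sorted by path."""
    return sorted(p for p in Path(root).rglob("analysis") if p.is_dir())

# Build a small stand-in tree with the nesting described above.
with tempfile.TemporaryDirectory() as tmp:
    feeder = Path(tmp) / "region_1" / "substation_1" / "feeder_1"
    for level in (feeder, feeder.parent, feeder.parent.parent):
        (level / "analysis").mkdir(parents=True, exist_ok=True)
    found = [p.relative_to(tmp).as_posix() for p in find_analysis_dirs(tmp)]
```

    Pointing `find_analysis_dirs` at a single substation or feeder folder instead of the whole region limits the search to that smaller section of the network.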

  9. Go Positions Dataset

    • universe.roboflow.com
    zip
    Updated Aug 4, 2023
    Cite
    Synthetic Data (2023). Go Positions Dataset [Dataset]. https://universe.roboflow.com/synthetic-data-3ol2y/go-positions/model/4
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Synthetic Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Go Pieces Bounding Boxes
    Description

    Synthetic dataset of black and white stones on go boards. Generated using Unity Perception.

    Use Case

    To be able to take a picture of a go game and figure out the position of each stone in order to score the game or analyze with AI. Project inspiration stems from this blog post along with past ideas we've had for this: https://blog.roboflow.com/chess-boards/

    Classes

    blackStone: Black go stones, 90,501 labels
    whiteStone: White go stones, 89,963 labels
    grid: Cross section grid of a go board, 1,000 labels

  10. GenoCAD Tutorials

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Mary Mangan; Mandy Wilson; Laura Adam; Jean Peccoud (2023). GenoCAD Tutorials [Dataset]. http://doi.org/10.6084/m9.figshare.153827.v15
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Mary Mangan; Mandy Wilson; Laura Adam; Jean Peccoud
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This tutorial includes two PowerPoint presentations developed by Mary Mangan from OpenHelix. Students should start with the Introduction prior to moving on to the Advanced tutorial. The slide decks include numerous comments that will help students go through the tutorials. In order to perform the hands-on activities, students need to download the GenoCAD Training Set. This dataset includes a list of parts and a grammar used as part of the GenoCAD Introductory tutorial. In order to import this data set in GenoCAD, proceed as follows:

    1) Log into GenoCAD, creating an account if you don't already have one.
    2) Click on the Parts tab.
    3) Click on the Grammars tab.
    4) Click on the Add/Import Grammar button.
    5) Using the "choose file" button, select the grammar file (.genocad) and click on import grammar.
    6) Click on "use existing icon set" and click on "continue import".

    Upon completion of this procedure you should have a new grammar with a library of 37 parts in your workspace.

    The tutorial also includes a series of additional exercises that will be used to reinforce the concepts introduced in the tutorial. Please visit the GenoCAD page for videos of the tutorials.

  11. datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World...

    • frontiersin.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto (2023). datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks.pdf [Dataset]. http://doi.org/10.3389/frai.2021.612551.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. 
    First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
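
    To make the idea of generating a synthetic Bayesian network and an associated dataset concrete, here is a small Python sketch. It is not the authors' tool: the random-DAG construction, the binary variables, the random conditional probability tables, and all parameter values are assumptions chosen to keep the example short.

```python
import random

def random_dag(n_nodes, edge_prob, rng):
    """Return parents[i] for a random DAG; edges only go from lower to higher index."""
    return {
        child: [p for p in range(child) if rng.random() < edge_prob]
        for child in range(n_nodes)
    }

def sample_dataset(parents, n_rows, rng):
    """Ancestral sampling of binary variables in index (topological) order."""
    # One probability of "1" per configuration of parent values (random CPTs).
    cpts = {
        node: [rng.random() for _ in range(2 ** len(pa))]
        for node, pa in parents.items()
    }
    data = []
    for _ in range(n_rows):
        row = [0] * len(parents)
        for node in sorted(parents):
            # Encode the parent values as an integer index into the node's CPT.
            config = sum(row[p] << i for i, p in enumerate(parents[node]))
            row[node] = int(rng.random() < cpts[node][config])
        data.append(row)
    return data

rng = random.Random(0)
parents = random_dag(n_nodes=5, edge_prob=0.4, rng=rng)
data = sample_dataset(parents, n_rows=100, rng=rng)
```

    Because the generating graph `parents` is known, a structure learning algorithm run on `data` can be scored against it, which is the kind of ground truth that real observational data lacks.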

  12. Dataset for RNA-seq basic tutorial in Galaxy Australia

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Jessica Chung (2020). Dataset for RNA-seq basic tutorial in Galaxy Australia [Dataset]. http://doi.org/10.5281/zenodo.1409427
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jessica Chung
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of synthetic RNA-seq data for Drosophila melanogaster: 2 conditions, 3 replicates in each, paired-end reads. Reference genome in GTF format.

  13. Test Data In Spreadsheet Format

    • explore.openaire.eu
    Updated Oct 6, 2017
    Cite
    J. Goedhart (2017). Test Data In Spreadsheet Format [Dataset]. http://doi.org/10.5281/zenodo.1003222
    Explore at:
    Dataset updated
    Oct 6, 2017
    Authors
    J. Goedhart
    Description

    Synthetic test data for a tutorial that explains how to convert spreadsheet data to tidy data.
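
    For readers unfamiliar with the term, the Python sketch below shows one common form of this conversion: a wide table with one column per condition becomes a long table with one observation per row. It is not taken from the tutorial; the column names and values are invented.

```python
def wide_to_tidy(header, rows, id_column):
    """Convert wide rows (one column per condition) into tidy records.

    Each output record holds one observation: the identifier, the name of
    the original column, and the value found in that column.
    """
    id_index = header.index(id_column)
    tidy = []
    for row in rows:
        for index, name in enumerate(header):
            if index == id_index:
                continue
            tidy.append(
                {id_column: row[id_index], "condition": name, "value": row[index]}
            )
    return tidy

# Invented spreadsheet-style data: one row per time point, one column per condition.
header = ["time", "control", "treated"]
rows = [[0, 1.0, 1.1], [1, 1.2, 2.3]]
tidy = wide_to_tidy(header, rows, id_column="time")
```

    The two wide rows become four tidy records, each carrying its own condition label.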

  14. Synthetic data generating parameters. The table summarizes the generating...

    • plos.figshare.com
    xls
    Updated Jun 5, 2025
    Cite
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi (2025). Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2. [Dataset]. http://doi.org/10.1371/journal.pone.0319031.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.

  15. V2X-SIM Dataset

    • paperswithcode.com
    Updated May 9, 2023
    Cite
    Yiming Li; Dekun Ma; Ziyan An; Zixun Wang; Yiqi Zhong; Siheng Chen; Chen Feng (2023). V2X-SIM Dataset [Dataset]. https://paperswithcode.com/dataset/v2x-sim
    Explore at:
    Dataset updated
    May 9, 2023
    Authors
    Yiming Li; Dekun Ma; Ziyan An; Zixun Wang; Yiqi Zhong; Siheng Chen; Chen Feng
    Description

    V2X-Sim, short for vehicle-to-everything simulation, is a synthetic collaborative perception dataset for autonomous driving developed by AI4CE Lab at NYU and MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both the roadside and vehicles when they are present near the same intersection. With information from both the roadside infrastructure and vehicles, the dataset aims to encourage research on collaborative perception tasks.

    Although not collected from the real world, highly realistic traffic simulation software is used to ensure the representativeness of the dataset compared to real-world driving scenarios. To be more exact, the traffic flow of the recording files is managed by CARLA-SUMO co-simulation, and three town maps from CARLA are currently used to increase the diversity of the dataset.

    Here is a tutorial showing how to load the dataset: https://ai4ce.github.io/V2X-Sim/tutorial.html

  16. Sentinel-1 RTC imagery processed by ASF over central Himalaya in High...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Oct 28, 2022
    Cite
    Emma Marshall; Scott Henderson; Deepak Cherian; Jessica Scheick (2022). Sentinel-1 RTC imagery processed by ASF over central Himalaya in High Mountain Asia [Dataset]. http://doi.org/10.5281/zenodo.7236413
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emma Marshall; Scott Henderson; Deepak Cherian; Jessica Scheick
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Himalayas, High-mountain Asia
    Description

    This is a dataset of Sentinel-1 radiometric terrain corrected (RTC) imagery processed by the Alaska Satellite Facility covering a region within the Central Himalaya. It accompanies a tutorial demonstrating how to access and work with Sentinel-1 RTC imagery using xarray and other open source Python packages.

  17. Multi-Sensor Ice Analysis Data: Analysis for Belgica Bank, North East...

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Sep 8, 2022
    Cite
    Hughes, Nick (2022). Multi-Sensor Ice Analysis Data: Analysis for Belgica Bank, North East Greenland 2019-20 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7053974
    Explore at:
    Dataset updated
    Sep 8, 2022
    Dataset provided by
    Hughes, Nick
    Amdal, Frank
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Greenland
    Description

    The intention is that this dataset can be used for machine learning and deep neural network training/validation, and it distinguishes sea ice concentration, type and form derived from manual analysis of a combination of different satellite sensors including ALOS-2, Sentinel-1, COSMO-SkyMed, Sentinel-2, and ICESAT-2. The region chosen for the analysis was the Belgica Bank area offshore of North East Greenland, as this is an area which experiences a wide variety of sea ice, and iceberg, conditions throughout the year. The dataset consists of two parts: 11 days of individual sea ice interpretations, one for each month in the period from April 2019 to March 2020, with the exception of October 2019, and iceberg surveys derived from Sentinel-2 for spring in 2019 and 2020.

    The dataset includes a user guide issued by MET Norway as report 10/2022 (see https://www.met.no/publikasjoner/met-report) in which the first part describes the data sources, nomenclature, file formats and data in the analysis. A second part of the report compares synthetic aperture radar (SAR) data from both L-band ALOS-2 and C-band Sentinel-1 satellites, and identifies the visible synergies and anomalies. The results confirm that there are variations in backscatter signatures between ALOS-2 and Sentinel-1 data when comparing them for different sea ice situations and conditions. ALOS-2 data in many cases is proven to be a reliable and beneficial source of data when it comes to identifying icebergs, ridges, determining sea ice type, and also distinguishing ice and water compared to standalone Sentinel-1 data.

  18. Open CAN IDS datasets’ metadata.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jan 22, 2024
    + more versions
    Cite
    Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs (2024). Open CAN IDS datasets’ metadata. [Dataset]. http://doi.org/10.1371/journal.pone.0296879.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions or anomalies on CANs. Producing vehicular CAN data with a variety of intrusions is a difficult task for most researchers as it requires expensive assets and deep expertise. To illuminate this task, we introduce the first comprehensive guide to the existing open CAN intrusion detection system (IDS) datasets. We categorize attacks on CANs including fabrication (adding frames, e.g., flooding or targeting an ID), suspension (removing an ID’s frames), and masquerade attacks (spoofed frames sent in lieu of suspended ones). We provide a quality analysis of each dataset; an enumeration of each dataset’s attacks, benefits, and drawbacks; categorization as real vs. simulated CAN data and real vs. simulated attacks; whether the data is raw CAN data or signal-translated; number of vehicles/CANs; quantity in terms of time; and finally a suggested use case of each dataset. State-of-the-art public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, lacking fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but is missing a corresponding “raw” binary version. This issue pigeon-holes CAN IDS research into testing on limited and often inappropriate data (usually with attacks that are too easily detectable to truly test the method). The scarcity of appropriate data has stymied comparability and reproducibility of results for researchers. As our primary contribution, we present the Real ORNL Automotive Dynamometer (ROAD) CAN IDS dataset, consisting of over 3.5 hours of one vehicle’s CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real (i.e., non-simulated) fuzzing, fabrication, unique advanced attacks, and simulated masquerade attacks. To facilitate a benchmark for CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS research field.
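
    The three attack categories named in the description (fabrication, suspension, masquerade) can be illustrated on a toy frame log. This is only a minimal sketch: the `Frame` type, the function names, and the example traffic are hypothetical and are not taken from the ROAD dataset or its file formats.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Frame:
    """Hypothetical, simplified CAN frame (not the ROAD log format)."""
    t: float      # timestamp in seconds
    can_id: int   # arbitration ID
    data: bytes   # payload


def fabricate(frames: List[Frame], injected: List[Frame]) -> List[Frame]:
    """Fabrication: attacker frames are added to the existing traffic."""
    return sorted(frames + injected, key=lambda f: f.t)


def suspend(frames: List[Frame], target_id: int, start: float, end: float) -> List[Frame]:
    """Suspension: the target ID's frames are removed inside [start, end)."""
    return [f for f in frames
            if not (f.can_id == target_id and start <= f.t < end)]


def masquerade(frames: List[Frame], target_id: int, start: float, end: float,
               payload: bytes) -> List[Frame]:
    """Masquerade: spoofed frames take the place of the suspended ones,
    so frame count and timing stay the same and only the payload changes."""
    return [replace(f, data=payload)
            if f.can_id == target_id and start <= f.t < end else f
            for f in frames]


# Toy ambient traffic: two IDs alternating every 0.1 s (assumed values).
ambient = [
    Frame(0.0, 0x100, b"\x00"),
    Frame(0.1, 0x200, b"\x01"),
    Frame(0.2, 0x100, b"\x00"),
    Frame(0.3, 0x200, b"\x01"),
]
```

    In this toy model, fabrication changes the frame count and inter-arrival times, suspension leaves a gap for one ID, and masquerade preserves timing entirely, which is consistent with the description's point that simple injection attacks are the easiest to detect.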

  19. f

    Logs in ROAD CAN intrusion detection dataset.

    • plos.figshare.com
    xls
    Updated Jan 22, 2024
    Cite
    Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs (2024). Logs in ROAD CAN intrusion detection dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0296879.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Miki E. Verma; Robert A. Bridges; Michael D. Iannacone; Samuel C. Hollifield; Pablo Moriano; Steven C. Hespeler; Bill Kay; Frank L. Combs
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions or anomalies on CANs. Producing vehicular CAN data with a variety of intrusions is a difficult task for most researchers as it requires expensive assets and deep expertise. To illuminate this task, we introduce the first comprehensive guide to the existing open CAN intrusion detection system (IDS) datasets. We categorize attacks on CANs including fabrication (adding frames, e.g., flooding or targeting an ID), suspension (removing an ID’s frames), and masquerade attacks (spoofed frames sent in lieu of suspended ones). We provide a quality analysis of each dataset; an enumeration of each dataset’s attacks, benefits, and drawbacks; categorization as real vs. simulated CAN data and real vs. simulated attacks; whether the data is raw CAN data or signal-translated; number of vehicles/CANs; quantity in terms of time; and finally a suggested use case of each dataset. State-of-the-art public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, lacking fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but is missing a corresponding “raw” binary version. This issue pigeon-holes CAN IDS research into testing on limited and often inappropriate data (usually with attacks that are too easily detectable to truly test the method). The scarcity of appropriate data has stymied comparability and reproducibility of results for researchers. As our primary contribution, we present the Real ORNL Automotive Dynamometer (ROAD) CAN IDS dataset, consisting of over 3.5 hours of one vehicle’s CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real (i.e., non-simulated) fuzzing, fabrication, unique advanced attacks, and simulated masquerade attacks. To facilitate a benchmark for CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS research field.

  20. Data from: SCScore: Synthetic Complexity Learned from a Reaction Corpus

    • acs.figshare.com
    txt
    Updated Jun 9, 2023
    Cite
    Connor W. Coley; Luke Rogers; William H. Green; Klavs F. Jensen (2023). SCScore: Synthetic Complexity Learned from a Reaction Corpus [Dataset]. http://doi.org/10.1021/acs.jcim.7b00622.s004
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    ACS Publications
    Authors
    Connor W. Coley; Luke Rogers; William H. Green; Klavs F. Jensen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Several definitions of molecular complexity exist to facilitate prioritization of lead compounds, to identify diversity-inducing and complexifying reactions, and to guide retrosynthetic searches. In this work, we focus on synthetic complexity and reformalize its definition to correlate with the expected number of reaction steps required to produce a target molecule, with implicit knowledge about what compounds are reasonable starting materials. We train a neural network model on 12 million reactions from the Reaxys database to impose a pairwise inequality constraint enforcing the premise of this definition: that on average, the products of published chemical reactions should be more synthetically complex than their corresponding reactants. The learned metric (SCScore) exhibits highly desirable nonlinear behavior, particularly in recognizing increases in synthetic complexity throughout a number of linear synthetic routes.
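
    The pairwise inequality constraint mentioned above (products should, on average, score higher than their reactants) is commonly expressed as a hinge-style ranking loss. The following is a minimal sketch under that assumption; the function name, the `margin` value, and the example scores are illustrative and are not taken from the SCScore implementation.

```python
from typing import Sequence


def pairwise_hinge_loss(reactant_scores: Sequence[float],
                        product_scores: Sequence[float],
                        margin: float = 0.25) -> float:
    """Mean hinge penalty over (reactant, product) score pairs.

    A pair contributes zero loss when the product's score exceeds the
    reactant's score by at least `margin`; otherwise the shortfall is
    penalized linearly. `margin` is an assumed hyperparameter.
    """
    if len(reactant_scores) != len(product_scores):
        raise ValueError("score lists must have the same length")
    losses = [max(0.0, margin + r - p)
              for r, p in zip(reactant_scores, product_scores)]
    return sum(losses) / len(losses)


# Pair 1 satisfies the constraint (2.0 > 1.0 + 0.25): loss 0.
# Pair 2 violates it (2.5 < 3.0): loss 0.25 + 3.0 - 2.5 = 0.75.
example_loss = pairwise_hinge_loss([1.0, 3.0], [2.0, 2.5])
```

    Because the penalty is averaged over pairs, individual reactions may violate the ordering while the model still learns the "on average" tendency the description refers to.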

Cite
Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data

Synthetic Integrated Services Data

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: csv(1375554033), html, pdf, zip(39231637)
Dataset updated
Jun 25, 2024
Dataset provided by
Allegheny County
Description

Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
