86 datasets found
  1. Surgical-Synthetic-Data-Generation-and-Segmentation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2025
    Cite
    Leoncini, Pietro (2025). Surgical-Synthetic-Data-Generation-and-Segmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14671905
    Dataset updated
    Jan 16, 2025
    Authors
    Leoncini, Pietro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains synthetic and real images, with their labels, for Computer Vision in robotic surgery. It is part of ongoing research on sim-to-real applications in surgical robotics. The dataset will be updated with further details and references once the related work is published. For further information see the repository on GitHub: https://github.com/PietroLeoncini/Surgical-Synthetic-Data-Generation-and-Segmentation

  2. MOSTLY AI Prize Data

    • kaggle.com
    zip
    Updated May 16, 2025
    + more versions
    Cite
    ivonaK (2025). MOSTLY AI Prize Data [Dataset]. https://www.kaggle.com/datasets/ivonav/mostly-ai-prize-data/code
    Available download formats: zip (9871594 bytes)
    Dataset updated
    May 16, 2025
    Authors
    ivonaK
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Competition

    • Generate the BEST tabular synthetic data and win 100,000 USD in cash.
    • Competition runs for 50 days: May 14 - July 3, 2025.
    • MOSTLY AI Prize

    This competition features two independent synthetic data challenges that you can join separately:

    • The FLAT DATA Challenge
    • The SEQUENTIAL DATA Challenge

    For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
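
    For intuition, the "not significantly closer to the original than to the holdout" criterion resembles a distance-to-closest-record (DCR) check. Below is a minimal Python sketch under assumed conditions (purely numeric columns, invented file names); the official scoring is done by the Synthetic Data Quality Assurance toolkit, not this code:

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import NearestNeighbors

    def dcr(synthetic: pd.DataFrame, reference: pd.DataFrame) -> np.ndarray:
        """Distance from each synthetic row to its closest reference row."""
        nn = NearestNeighbors(n_neighbors=1).fit(reference.values)
        distances, _ = nn.kneighbors(synthetic.values)
        return distances.ravel()

    syn = pd.read_csv("submission.csv")   # invented file names
    train = pd.read_csv("train.csv")      # released original samples
    holdout = pd.read_csv("holdout.csv")  # unreleased in the real competition

    # A good submission is not systematically closer to train than to holdout.
    print("median DCR to train:  ", np.median(dcr(syn, train)))
    print("median DCR to holdout:", np.median(dcr(syn, holdout)))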

    Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.

    Timeline

    • Submissions open: May 14, 2025, 15:30 UTC
    • Submission credits: 3 per calendar week (+bonus)
    • Submissions close: July 3, 2025, 23:59 UTC
    • Evaluation of Leaders: July 3 - July 9
    • Winners announced: on July 9 🏆

    Datasets

    • Flat Data: 100,000 records; 80 data columns (60 numeric, 20 categorical)
    • Sequential Data: 20,000 groups of 5-10 records each; 10 data columns (7 numeric, 3 categorical)

    Evaluation

    • CSV submissions are parsed using pandas.read_csv() and checked for expected structure & size (see the sketch after this list)
    • Evaluated using the Synthetic Data Quality Assurance toolkit
    • Compared against the released training set and a hidden holdout set (same size, non-overlapping, from the same source)
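
    A minimal sketch of that structural check, assuming invented file names and the published Flat Data shape (100,000 rows, 80 columns):

    import pandas as pd

    EXPECTED_ROWS = 100_000   # from the Flat Data spec above
    EXPECTED_COLS = 80

    submission = pd.read_csv("flat_submission.csv")  # invented file name
    assert submission.shape == (EXPECTED_ROWS, EXPECTED_COLS), (
        f"expected {(EXPECTED_ROWS, EXPECTED_COLS)}, got {submission.shape}")

    # Column names should match the released training data exactly.
    train = pd.read_csv("flat_train.csv")
    assert list(submission.columns) == list(train.columns), "column mismatch"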

    Submission

    MOSTLY AI Prize

    Citation

    If you use this dataset in your research, please cite:

    @dataset{mostlyaiprize,
      author = {MOSTLY AI},
      title  = {MOSTLY AI Prize Dataset},
      year   = {2025},
      url    = {https://www.mostlyaiprize.com/},
    }
    
  3. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Available download formats: bin, csv, zip, pdf
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test or draw inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
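
    The actual pipeline is the R code in the Rcode/synthcode subfolder of the GitHub repo; purely for intuition, the logic of the second step can be sketched in Python, with every column name and coefficient below invented rather than taken from the fitted Cox model:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # Invented stand-in for the synthetic wide-format data (one row per subject).
    n = 1000
    cohort = pd.DataFrame({
        "age": rng.normal(56, 8, n),
        "pm25": rng.normal(10, 2, n),   # annual-average PM2.5 exposure
    })

    # Invented log-hazard coefficients standing in for the fitted Cox model.
    beta = {"age": 0.08, "pm25": 0.05}
    baseline_yearly_hazard = 0.002

    lin_pred = sum(beta[c] * (cohort[c] - cohort[c].mean()) for c in beta)
    hazard = baseline_yearly_hazard * np.exp(lin_pred.to_numpy())

    # Simulate death events year by year over the follow-up.
    years = 10
    alive = np.ones(n, dtype=bool)
    death_year = np.full(n, -1)
    for year in range(years):
        dies = alive & (rng.random(n) < hazard)
        death_year[dies] = year
        alive &= ~dies

    print(f"{(death_year >= 0).sum()} simulated deaths over {years} years")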

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.

  4. Nominal and adversarial synthetic PMU data for standard IEEE test systems

    • osti.gov
    Updated Jun 15, 2021
    Cite
    Pacific Northwest National Laboratory 2 (2021). Nominal and adversarial synthetic PMU data for standard IEEE test systems [Dataset]. http://doi.org/10.25584/DataHub/1788186
    Dataset updated
    Jun 15, 2021
    Dataset provided by
    Pacific Northwest National Laboratory (PNNL), US
    Description

    GridSTAGE (Spatio-Temporal Adversarial scenario GEneration) is a framework for the simulation of adversarial scenarios and the generation of multivariate spatio-temporal data in cyber-physical systems. GridSTAGE is developed in Matlab and leverages the Power System Toolbox (PST), where the evolution of the power network is governed by nonlinear differential equations. Using GridSTAGE, one can create several event scenarios that correspond to different operating states of the power network by enabling or disabling any of the following: faults, AGC control, PSS control, exciter control, load changes, generation changes, and different types of cyber-attacks. Standard IEEE bus system data is used to define the power system environment. GridSTAGE emulates the data from PMU and SCADA sensors; the sampling rate and the location of the sensors can be adjusted as well. Detailed instructions on generating data scenarios with different system topologies, attack characteristics, load characteristics, sensor configurations, and control parameters are available in the GitHub repository: https://github.com/pnnl/GridSTAGE.

    There is no existing adversarial data-generation framework that can incorporate several attack characteristics and yield adversarial PMU data. The GridSTAGE framework currently supports simulation of false data injection attacks (such as ramp, step, random, trapezoidal, multiplicative, replay, and freezing attacks) and denial-of-service attacks (such as time-delay and packet-loss) on PMU data. Furthermore, it supports generating spatio-temporal time-series data corresponding to several random load changes across the network or to several generation changes.

    A Koopman mode decomposition (KMD) based algorithm to detect and identify false data attacks in real time is proposed in https://ieeexplore.ieee.org/document/9303022. Machine learning-based predictive models are developed to capture the dynamics of the underlying power system with a high level of accuracy under various operating conditions for the IEEE 68-bus system. The corresponding machine learning models are available at https://github.com/pnnl/grid_prediction.
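
    For intuition, a ramp-style false data injection on a PMU measurement stream can be sketched in a few lines. GridSTAGE itself is Matlab/PST-based; this Python fragment only illustrates the attack shape, with all signal and attack parameters invented:

    import numpy as np

    fs = 30                                      # PMU reporting rate, samples/s
    t = np.arange(0, 10, 1 / fs)                 # 10 s window
    freq = 60 + 0.005 * np.random.randn(t.size)  # nominal 60 Hz measurement

    attacked = freq.copy()
    start, end = 4.0, 7.0                        # attack window, seconds
    mask = (t >= start) & (t < end)
    ramp_rate = 0.02                             # Hz/s, attacker-chosen
    attacked[mask] += ramp_rate * (t[mask] - start)

    print("max injected bias:", np.max(attacked - freq))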

  5. MatSim Dataset and benchmark for one-shot visual materials and textures...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jun 25, 2025
    Cite
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik (2025). MatSim Dataset and benchmark for one-shot visual materials and textures recognition [Dataset]. http://doi.org/10.5281/zenodo.7390166
    Available download formats: zip, pdf
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The MatSim Dataset and benchmark

    Latest version

    Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.

    MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).

    Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
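
    As a rough illustration of the one-shot setup the benchmark targets, recognition reduces to nearest-neighbour matching in an embedding space; the embed function below is a placeholder for the paper's contrastively trained network, not the released model:

    import numpy as np

    def embed(image: np.ndarray) -> np.ndarray:
        """Placeholder for a trained descriptor net; returns a unit vector."""
        v = image.mean(axis=(0, 1))          # trivial stand-in featurisation
        return v / np.linalg.norm(v)

    def classify(query, references):
        """Assign the query to the reference material with highest cosine similarity."""
        q = embed(query)
        sims = {name: float(q @ embed(img)) for name, img in references.items()}
        return max(sims, key=sims.get)

    refs = {"copper": np.random.rand(64, 64, 3),   # one example per material
            "fabric": np.random.rand(64, 64, 3)}
    print(classify(np.random.rand(64, 64, 3), refs))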

    Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.

    MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.

    MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.

    *Note: these are subsets of the dataset; the full dataset can be found at:
    https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX

    or
    https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF

    Code:

    Up-to-date code for generating the dataset, data readers, evaluation scripts, and trained nets can be found at: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net

    Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This copy might be old; up-to-date code can be found at the URL above.
    Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note: the code in this ZIP file is not up to date and contains some bugs; for the latest version see the URL above.

    Further documentation can be found inside the zip files or in the paper.

  6. Credit_Card_Frauds(Synthetic Dataset)

    • kaggle.com
    zip
    Updated Apr 18, 2023
    + more versions
    Cite
    Mahesh Yadav (2023). Credit_Card_Frauds(Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/maheshyaadav/credit-card-fraudssynthetic-dataset
    Available download formats: zip (211766720 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Mahesh Yadav
    Description

    About the Dataset

    This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering the period 1 Jan 2019 - 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.

    Source of Simulation

    This was generated using the Sparkov Data Generation tool (available on GitHub), created by Brandon Harris. The simulation was run for the period 1 Jan 2019 to 31 Dec 2020. The files were combined and converted into a standard format.

    Information about the Simulator

    I do not own the simulator. I used the one built by Brandon Harris, and to understand how it works I went through a few portions of the code. This is what I understood from what I read:

    The simulator has pre-defined lists of merchants, customers, and transaction categories. Using the Python library "faker", together with the numbers of customers and merchants you specify for the simulation, an intermediate list is created.

    After this, the transactions are created according to the profile you choose, e.g. "adults_2550_female_rural.json" (which simulates the properties of adult females, aged 25-50, from rural areas). For each profile (see "adults_2550_female_rural.json" in the Sparkov GitHub repository), parameter ranges are defined in terms of minimum and maximum transactions per day, the distribution of transactions across days of the week, and normal-distribution parameters (mean, standard deviation) for amounts in various categories. Using these distribution measures, the transactions are generated with faker.

    What I did was generate transactions across all profiles and then merged them together to create a more realistic representation of simulated transactions.
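
    A toy Python sketch of that profile-driven idea (not Sparkov's actual code; the profile values below are invented):

    import random
    from faker import Faker

    fake = Faker()
    profile = {                      # invented stand-in for a profile JSON
        "txns_per_day": (1, 4),      # min, max
        "categories": {"grocery": (55.0, 20.0), "travel": (300.0, 120.0)},
    }

    def simulate_day(customer):
        n = random.randint(*profile["txns_per_day"])
        for _ in range(n):
            category = random.choice(list(profile["categories"]))
            mean, sd = profile["categories"][category]
            yield {
                "customer": customer,
                "merchant": fake.company(),
                "category": category,
                "amount": round(max(0.5, random.gauss(mean, sd)), 2),
            }

    for txn in simulate_day(fake.name()):
        print(txn)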

    Acknowledgements - Brandon Harris for his amazing work in creating this easy-to-use simulation tool for creating fraud transaction datasets.

  7. synpat-dataset

    • huggingface.co
    Updated May 28, 2025
    Cite
    Karan Srivastava (2025). synpat-dataset [Dataset]. https://huggingface.co/datasets/Karan0901/synpat-dataset
    Dataset updated
    May 28, 2025
    Authors
    Karan Srivastava
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynPAT: Generating Synthetic Physical Theories with Data

    This is the Hugging Face dataset entry for SynPAT, a synthetic theory and data generation system developed for the paper "SynPAT: Generating Synthetic Physical Theories with Data" (GitHub: https://github.com/jlenchner/theorizer). SynPAT generates symbolic physical systems and corresponding synthetic data to benchmark symbolic regression and scientific discovery algorithms. Each synthetic system includes symbolic equations… See the full description on the dataset page: https://huggingface.co/datasets/Karan0901/synpat-dataset.
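
    For intuition, generating benchmark data from a symbolic system looks roughly like the sketch below; the equation and noise level are invented examples, not SynPAT's actual generated theories:

    import numpy as np
    import sympy as sp

    m, a = sp.symbols("m a")
    force = m * a                     # invented example system: F = m*a
    f = sp.lambdify((m, a), force, "numpy")

    rng = np.random.default_rng(0)
    M = rng.uniform(0.5, 5.0, 200)    # sampled variable values
    A = rng.uniform(-2.0, 2.0, 200)
    F = f(M, A) * (1 + 0.01 * rng.standard_normal(200))  # 1% noise

    data = np.column_stack([M, A, F])  # (m, a, F) rows for a regression benchmark
    print(data[:3])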

  8. internal-datasets

    • huggingface.co
    Updated Jun 1, 2023
    + more versions
    Cite
    Ivan Rivaldo Marbun (2023). internal-datasets [Dataset]. https://huggingface.co/datasets/Marbyun/internal-datasets
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2023
    Authors
    Ivan Rivaldo Marbun
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

    In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, and then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset into a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 points and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
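
    The pipeline's shape (passage -> candidate answer -> generated question -> filter) can be sketched generically; the checkpoint names and the round-trip filtering rule below are placeholder choices, not the paper's actual components:

    from transformers import pipeline

    generator = pipeline("text2text-generation",
                         model="valhalla/t5-small-qg-hl")  # placeholder QG model
    qa_filter = pipeline("question-answering")             # round-trip filter

    passage = ("Normandy is a region in France. The Normans gave their "
               "name to Normandy in the 10th century.")
    answer = "Normandy"  # in practice, candidate answers are extracted first

    # Highlight the answer span, as T5 question-generation checkpoints expect.
    highlighted = passage.replace(answer, f"<hl> {answer} <hl>", 1)
    question = generator(f"generate question: {highlighted}")[0]["generated_text"]

    # Keep the pair only if a QA model recovers the intended answer.
    pred = qa_filter(question=question, context=passage)
    if answer.lower() in pred["answer"].lower():
        print({"question": question, "answer": answer})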

    For full details on how the dataset was created, kindly refer to the paper.

  9. Bearings with Varying Degradation Behaviors

    • kaggle.com
    zip
    Updated Jun 13, 2025
    Cite
    Prognostics @ HSE (2025). Bearings with Varying Degradation Behaviors [Dataset]. https://www.kaggle.com/datasets/prognosticshse/bearings-with-varying-degradation-behaviors
    Available download formats: zip (297945986 bytes)
    Dataset updated
    Jun 13, 2025
    Authors
    Prognostics @ HSE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context: The Bearings with Varying Degradation Behaviors data set is a synthetic data set representing the run-to-failure degradation data of rolling bearings. This data set is designed to facilitate the development and evaluation of diagnostic and prognostic methods in the context of Prognostics and Health Management (PHM). For the generation of the data set, the simulation model presented by Mauthe, Hagmeyer, and Zeiler (2025) was used. The simulation model is publicly available on GitHub.

    Simulation Model: Mauthe, Hagmeyer, and Zeiler (2025) introduce a generic simulation model for generating representative run-to-failure data of rolling bearings. It is designed to address challenges in the development of data-driven diagnostic and prognostic methods, such as unbalanced or limited data availability. The model consists of three modular components: the life and fault modeling, the degradation progression simulation, and the vibration signal generation. Each module incorporates random processes to reproduce real-world variations, such as differences in bearing lives and degradation progressions under similar operating conditions. The model simulates vibration signals throughout a bearing's life, reflecting both operating and degradation conditions. As such, the versatile model enables its users to create synthetic data sets of rolling bearings tailored to specific scenarios. A more detailed description of the model can be found in the corresponding paper (see Data Set Citation).

    Given Data Scenario and Specification: See the provided description file Bearings_with_Varying_Degradation_Behaviors.pdf

    Task: The data set contains training and test data, consisting of run-to-failure data from 28 and 12 simulated bearings, respectively. The objective is to predict the remaining useful life (RUL) of the rolling bearings in the test data. All runs proceed up to the identical failure threshold, which means that RUL=0 applies to the last point in time and the last vibration measurement, respectively.
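
    Because every run ends exactly at the failure threshold, ground-truth RUL labels follow directly from position in the sequence; a minimal sketch with an invented run length and placeholder features:

    import numpy as np

    n_measurements = 500                    # invented length of one run
    timestamps = np.arange(n_measurements)  # measurement index over the run

    # RUL counts down to zero at the final (failure) measurement.
    rul = timestamps[-1] - timestamps
    assert rul[-1] == 0 and rul[0] == n_measurements - 1

    # Typical supervised setup: per-measurement features -> RUL regression.
    features = np.random.rand(n_measurements, 16)  # placeholder features
    training_pairs = list(zip(features, rul))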

    Data Set Creator: Hochschule Esslingen – University of Applied Sciences, Institute for Technical Reliability and Prognostics (IZP), Robert-Bosch-Straße 1, 73037 Göppingen, Germany

    Data Set Citation: Mauthe, F.; Hagmeyer, S.; Zeiler, P. (2025). Holistic simulation model of the temporal degradation of rolling bearings. Proceedings of the 35th European Safety and Reliability Conference and the 33rd Society for Risk Analysis Europe Conference, 15.06. – 19.06.2025, Stavanger, Norway, pp. 953–960, DOI: 10.3850/978-981-94-3281-3_ESREL-SRA-E2025-P8028-cd

    https://rpsonline.com.sg/proceedings/esrel-sra-e2025/html/ESREL-SRA-E2025-P8028.html

  10. synthetic-multiturn-multimodal

    • huggingface.co
    Updated Jan 28, 2024
    Cite
    Mesolitica (2024). synthetic-multiturn-multimodal [Dataset]. https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2024
    Dataset authored and provided by
    Mesolitica
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multiturn Multimodal

    We want to generate synthetic data that is able to understand position and relationships between multiple images and multiple audio clips; an example is below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal

    Multi-images

    synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main

    Example data

    {'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.

  11. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Available download formats: zip (172818297 bytes)
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset
    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data. Generated essays leveraged a combination of the following (see the decoding sketch after this list):

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill-in-the-blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
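
    A hedged sketch of what such decoding settings look like with the Hugging Face generate API; the model and all parameter values are illustrative, not the team's exact configs:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Write an essay about the benefits of school uniforms."
    inputs = tok(prompt, return_tensors="pt")

    # Contrastive search: penalty_alpha together with a small top_k.
    contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                                 max_new_tokens=200)

    # High-temperature sampling with large top-k and typical_p; suppressing
    # the EOS token forces long continuations.
    sampled = model.generate(**inputs, do_sample=True, temperature=1.4,
                             top_k=500, typical_p=0.9,
                             suppress_tokens=[tok.eos_token_id],
                             max_new_tokens=200)

    print(tok.decode(sampled[0], skip_special_tokens=True))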

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back-translation
    • Random capitalization
    • Sentence swapping

  12. Unmet Risk Index Dataset

    • data.niaid.nih.gov
    Updated Aug 15, 2023
    Cite
    Jeanson, Francis; Farkouh, Michael E.; Godoy, Lucas C.; Minha, Sa'ar; Tzuman, Oran; Marcus, Gil (2023). Unmet Risk Index Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8241871
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Department of Cardiology, Shamir Medical Center, Zeriffin, Israel; and Sackler School of Medicine, Tel-Aviv University, Ramat-Aviv, Israel
    Datadex Inc., Toronto, Canada
    Peter Munk Cardiac Centre and Heart and Stroke Richard Lewar Centre, University of Toronto, Toronto, Ontario, Canada
    Authors
    Jeanson, Francis; Farkouh, Michael E.; Godoy, Lucas C.; Minha, Sa'ar; Tzuman, Oran; Marcus, Gil
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the generated profiles from the combination of the ASCVD and SMART risk calculators. In addition, the Unmet Risk Index value is included at the end of each data row. This data was used in the research paper titled "Medical calculators derived synthetic patients: a novel method for generation of synthetic patient data", currently available as a preprint.

    The code used to generate these profiles is available on GitHub at: https://github.com/FrancisJMR/unmet-risk-index
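
    Purely for intuition, calculator-derived synthetic patients can be sketched as a sweep over plausible risk-factor combinations with a score attached to each profile; the ranges and the toy_risk function below are invented placeholders, not the ASCVD or SMART equations:

    import itertools
    import math

    ages = range(40, 80, 10)
    sbp = range(100, 180, 20)        # systolic blood pressure, mmHg
    smoker = (0, 1)

    def toy_risk(age, bp, smoke):
        """Invented logistic-style score standing in for a real calculator."""
        z = -8.0 + 0.08 * age + 0.02 * bp + 0.7 * smoke
        return 1 / (1 + math.exp(-z))

    profiles = [
        {"age": a, "sbp": b, "smoker": s, "risk": round(toy_risk(a, b, s), 4)}
        for a, b, s in itertools.product(ages, sbp, smoker)
    ]
    print(len(profiles), profiles[0])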

  13. aditi-syn-v1

    • huggingface.co
    Updated Mar 28, 2024
    Cite
    Manish Prakash (2024). aditi-syn-v1 [Dataset]. https://huggingface.co/datasets/manishiitg/aditi-syn-v1
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2024
    Authors
    Manish Prakash
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    v1 of the synthetic dataset generated for the aditi model. Generation scripts are located at https://github.com/manishiitg/aditi_dataset/tree/main/gen

  14. SOD Synthetic Forecast Generation Dataset

    • hydroshare.org
    zip
    Updated Oct 31, 2025
    Cite
    Zachary Paul Brodeur (2025). SOD Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.833b01b4c0ee47378fd1eac7ba17ace4
    Available download formats: zip (223.3 MB)
    Dataset updated
    Oct 31, 2025
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 2, 1912 - Sep 30, 2024
    Description

    Pre-processed subset of raw HEFS hindcast data for Seven Oaks Dam (SOD), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow .csv files for the following site:

    • SRWC1 - main reservoir inflow to Seven Oaks Dam

    Note: The zipped file contains some R scripts that were used to pre-process the raw data. They do not interact with the GitHub scripts referenced above and can be discarded. All the information in the raw data is contained within the zipped files; it has simply been converted to a standardized format for compatibility with the synthetic forecast generation codebase.

  15. Data from: GHTraffic: A Dataset for Reproducible Research in...

    • zenodo.org
    zip
    Updated Aug 29, 2020
    Cite
    Thilini Bhagya; Jens Dietrich; Hans Guesgen (2020). GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing [Dataset]. http://doi.org/10.5281/zenodo.3748921
    Available download formats: zip
    Dataset updated
    Aug 29, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thilini Bhagya; Jens Dietrich; Hans Guesgen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.

    This version consists of a single edition collected from the google/guava repository.

    The entire data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes: after a resource is successfully posted, a random later date is used to construct the request and response for all of the HTTP methods, and yet another subset of unsuccessful transactions is added by issuing requests before resource creation has succeeded.

    This results in a far more dynamic series of transactions to named resources.
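
    A toy sketch of that kind of record (field names invented, not GHTraffic's actual schema): requests dated before the resource's creation produce unsuccessful (404) transactions, while requests after it succeed:

    import json
    import random
    from datetime import datetime, timedelta

    created_at = datetime(2020, 1, 1) + timedelta(days=random.randint(0, 365))

    def transaction(method, when):
        exists = when >= created_at
        ok_status = {"GET": 200, "PUT": 200, "DELETE": 204}[method]
        return {
            "timestamp": when.isoformat(),
            "request": {"method": method, "uri": "/issues/42"},
            "response": {"status": ok_status if exists else 404},
        }

    log = [transaction(random.choice(["GET", "PUT", "DELETE"]),
                       created_at + timedelta(days=random.randint(-30, 30)))
           for _ in range(5)]
    print(json.dumps(log, indent=2))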

    Scripts used for dataset construction are accessible from the repository.

  16. replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Apr 21, 2023
    + more versions
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849416
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    The Pocket Dimension, Munich
    Imperial College London
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data were generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

    The field datasets consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm to 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

    Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.
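
    The constant-size, border-adjusted boxes described above amount to a simple clamp; a minimal sketch (box and frame sizes invented):

    def make_box(cx, cy, size, w, h):
        """Constant-size box centred on (cx, cy), truncated at image edges."""
        x0 = max(0, int(cx - size / 2))
        y0 = max(0, int(cy - size / 2))
        x1 = min(w, int(cx + size / 2))
        y1 = min(h, int(cy + size / 2))
        return x0, y0, x1, y1

    # Thorax centre near the right border of a 1024 x 1024 frame:
    print(make_box(cx=1010, cy=500, size=64, w=1024, h=1024))
    # -> (978, 468, 1024, 532): the box is clipped at x = 1024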

    Synthetic data generation

    Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A "group" population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A "single" population was generated using the major model only, with 90% scale variation, but equal material variation settings.

    A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

    Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

    Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied between 10/1 to 1/100.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  17. YRS Synthetic Forecast Generation Dataset

    • search.dataone.org
    • hydroshare.org
    Updated Jun 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zachary Paul Brodeur (2025). YRS Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.29a7c696ee4e4766883078ca0d681884
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Hydroshare
    Authors
    Zachary Paul Brodeur
    Time period covered
    Oct 2, 1985 - Sep 30, 2019
    Description

    Pre-processed subset of raw HEFS hindcast data for the Feather-Yuba system (YRS), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models contained here: https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and here: https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow files for the following sites:

    • ORDC1 - main reservoir inflow to Oroville Lake
    • NBBC1 - main reservoir inflow to New Bullards Bar
    • MRYC1L - downstream local flows at Marysville junction

    The data also contain the R scripts used to preprocess the raw HEFS data. These raw data are too large for easy storage in a public repository (YRS has 30+ modeled sites) but are available upon reasonable request from: Zach Brodeur, zpb4@cornell.edu

  18. Data from: eCARLA-scenes: A synthetically generated dataset for event-based...

    • zenodo.org
    zip
    Updated Dec 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jad Mansour; Hayat Rajani; Rafael Garcia; Nuno Gracias (2024). eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction [Dataset]. http://doi.org/10.5281/zenodo.14412251
    Available download formats: zip
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jad Mansour; Hayat Rajani; Rafael Garcia; Nuno Gracias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository hosts a synthetic event-based optical flow dataset, meticulously designed to simulate diverse environments under varying weather conditions using the CARLA simulator. The dataset is specifically tailored for autonomous field vehicles, featuring event streams, grayscale images, and corresponding ground truth optical flow.

    In addition to the dataset, the accompanying repository provides a user-friendly pipeline for generating custom datasets, including optical flow displacements and grayscale images. The generated data leverages the optimized eWiz framework, ensuring efficient storage, access, and processing.

    The data generation pipeline can be utilized by cloning the eCARLA-scenes repository. Whether you're a researcher or developer, this resource is an ideal starting point for advancing event-based vision systems in real-world autonomous applications.

  19. 2D high-resolution synthetic MR images of Alzheimer's patients and healthy...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv
    Updated Dec 13, 2023
    Cite
    Matteo Lai; Chiara Marzi; Luca Citi; Stefano Diciotti (2023). 2D high-resolution synthetic MR images of Alzheimer's patients and healthy subjects using PACGAN [Dataset]. http://doi.org/10.5281/zenodo.8276786
    Available download formats: application/gzip, csv
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo Lai; Chiara Marzi; Luca Citi; Stefano Diciotti
    Description

    This dataset encompasses a NIfTI file containing a collection of 500 images, each capturing the central axial slice of a synthetic brain MRI.

    Accompanying this file is a CSV dataset with the corresponding label for each image (a loading sketch follows the list below):

    • Label 0: Healthy Controls (HC)
    • Label 1: Alzheimer's Disease (AD)
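
    A minimal loading sketch; the file names and the CSV column name are assumptions, not the repository's documented layout:

    import nibabel as nib
    import pandas as pd

    volume = nib.load("synthetic_slices.nii.gz")   # assumed file name
    images = volume.get_fdata()                    # e.g. an (H, W, 500) array

    labels = pd.read_csv("labels.csv")             # assumed file and column names
    print(images.shape, labels["label"].value_counts().to_dict())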

    Each image within this dataset has been generated by PACGAN (Progressive Auxiliary Classifier Generative Adversarial Network), a framework designed and implemented by the AI for Medicine Research Group at the University of Bologna.

    PACGAN is a generative adversarial network trained to generate high-resolution images belonging to different classes. In our work, we trained this framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, which contains brain MRI images of AD patients and HC.

    The implementation of the training algorithm can be found within our GitHub repository, with Docker containerization.

    For further exploration, the pre-trained models are available within the Code Ocean capsule. These models can facilitate the generation of synthetic images for both classes and also aid in classifying new brain MRI images.

  20. GARD: Gustavo’s Awesome Runway Dataset (2025)

    • kaggle.com
    zip
    Updated Mar 30, 2025
    Cite
    Gustavo de Paula (2025). GARD: Gustavo’s Awesome Runway Dataset (2025) [Dataset]. https://www.kaggle.com/datasets/depaulagu/gard2025
    Available download formats: zip (55376320216 bytes)
    Dataset updated
    Mar 30, 2025
    Authors
    Gustavo de Paula
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GARD (Gustavo’s Awesome Runway Dataset) is the largest publicly available synthetic runway image dataset, built to support machine learning tasks in vision-based aircraft landing systems. It contains over 45,000 high-resolution (1024×1024) labeled images.

    This dataset was created using Canny2Concrete, a modular open-source data augmentation pipeline leveraging ControlNet and Stable Diffusion XL. The generation process conditions on edge maps extracted from real-world template images and applies multiple stages of variation including weather, lighting, and occlusion effects.
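
    A hedged sketch of the edge-conditioned generation step that pipeline describes, using the diffusers ControlNet + SDXL API; the checkpoints, file names, and prompt are illustrative, not Canny2Concrete's actual configuration:

    import cv2
    import numpy as np
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from PIL import Image

    # 1) Edge map from a real-world template image.
    template = cv2.imread("runway_template.png")          # placeholder file
    edges = cv2.Canny(template, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # 2) Generate a variant conditioned on the edge map.
    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    image = pipe("airport runway at dusk, light rain, photorealistic",
                 image=control, num_inference_steps=30).images[0]
    image.save("runway_variant.png")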

    Models trained with GARD have been shown to outperform or match those trained on existing synthetic datasets like LARD, especially in challenging segmentation tasks.

    🚀 What’s Inside:

    • BaseImages: Direct and diverse generations from runway edge maps (Canny).
    • VariantImages: Geometric augmentations (rotations, translations, etc).
    • VariantImagesWithOcclusion: Added weather occlusion effects (rain, fog, snow, night).

    Each image includes:

    • 📷 .png image file
    • 🏷 .txt YOLO-format label
    • 🧩 .mask.png segmentation mask
    • 📄 .json full metadata, designed for full reproducibility (prompt, seed, label points, effects applied)


    🏁 Built For:

    • Runway segmentation and detection
    • Computer vision research in aviation
    • Synthetic dataset generation at scale
    • Researchers working on UAV and autonomous landing