100+ datasets found
  1. Synthetic Harmful & Safe Prompts – 4500 samples

    • kaggle.com
    zip
    Updated Nov 21, 2025
    Cite
    ümit (2025). Synthetic Harmful & Safe Prompts – 4500 samples [Dataset]. https://www.kaggle.com/datasets/umitka/synthetic-harmful-and-safe-prompts-4500-samples
    Explore at:
    Available download formats: zip (137472 bytes)
    Dataset updated
    Nov 21, 2025
    Authors
    ümit
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 4,500 synthetic English prompts designed for research on AI safety, content moderation, and ethical machine learning. It includes prompts from multiple categories, such as Hate, Fraud, Drugs, Sexual, Cybercrime, Harassment, Copyright, Disinformation, and Safe. Each example is labeled as either harmful or safe, enabling researchers and developers to train, evaluate, and benchmark language models for responsible behavior.

    The dataset is entirely synthetic, ensuring no real individuals are targeted or harmed. It is split into training (70%), validation (15%), and test (15%) sets to facilitate model development and evaluation. It can be used for tasks like prompt classification, model safety evaluation, and ethical AI research.

    Languages: English. Columns: category, prompt, prompt_clean, label, source.
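
    Given the column list above, loading and inspecting the data is straightforward. A minimal sketch, assuming the Kaggle archive unpacks to a single CSV file (the filename below is hypothetical):

    ```python
    import pandas as pd

    # Hypothetical filename; use whatever CSV the downloaded zip actually contains.
    df = pd.read_csv("synthetic_harmful_safe_prompts.csv")

    # Documented columns: category, prompt, prompt_clean, label, source.
    print(df[["category", "prompt", "prompt_clean", "label", "source"]].head())

    # Harmful vs. safe label balance across the 4,500 prompts.
    print(df["label"].value_counts(normalize=True))
    ```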

  2. Supporting Shellfish Aquaculture in the Chesapeake Bay using Artificial...

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Sep 18, 2025
    + more versions
    Cite
    NASA/GSFC/SED/ESD/GCDC/OB.DAAC;NASA/GSFC/SED/ESD/GCDC/SeaBASS (2025). Supporting Shellfish Aquaculture in the Chesapeake Bay using Artificial Intelligence to Detect Poor Water Quality through Sampling and Remote Sensing [Dataset]. https://catalog.data.gov/dataset/supporting-shellfish-aquaculture-in-the-chesapeake-bay-using-artificial-intelligence-to-de-f0c6a
    Explore at:
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Area covered
    Chesapeake Bay
    Description

    This use-inspired NASA AIST project collects biological, chemical, and physical variables in and above the water at Chesapeake Bay sites for analysis in the lab. These ground-truth data are then used for data labeling, in combination with remotely sensed data, within a machine learning model trained to identify water quality problems faced by resource managers, such as conditions that could lead to shellfish bed closures.

  3. Health Metrics Dataset

    • kaggle.com
    zip
    Updated Jul 22, 2024
    Cite
    Abhay Ayare (2024). Health Metrics Dataset [Dataset]. https://www.kaggle.com/datasets/abhayayare/health-metrics-dataset
    Explore at:
    Available download formats: zip (46175 bytes)
    Dataset updated
    Jul 22, 2024
    Authors
    Abhay Ayare
    License

    CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was generated using synthetic data created with the Python faker library. It simulates health metrics for 1,000 individuals, including information on blood pressure, cholesterol levels, BMI, smoking status, and diabetes status. The data was generated randomly, with certain constraints to mimic real-world distributions.

    Data Generation Date: July 22, 2024
    Generated by: Abhay Ayare
    Data Source: Synthetic data generated using Python scripts.
    Purpose: The dataset is intended for educational and research purposes, allowing users to perform health-related data analysis and machine learning experiments without concerns about privacy and ethical issues related to real patient data.

    Columns Description:

    • Name: Randomly generated names of individuals.
    • Gender: Gender of the individuals (Male/Female).
    • Age: Age of the individuals (18-80 years).
    • Systolic BP: Systolic blood pressure.
    • Diastolic BP: Diastolic blood pressure.
    • Cholesterol: Cholesterol levels.
    • Height (cm): Height of the individuals in centimeters.
    • Weight (kg): Weight of the individuals in kilograms.
    • BMI: Body Mass Index calculated from height and weight.
    • Smoker: Smoking status (True/False).
    • Diabetes: Diabetes status (True/False).
    • Health: Overall health assessment based on combined metrics (Good/Fair/Bad).
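
    Since the data was produced with the Python faker library, a short generation sketch clarifies the schema. A minimal example; the numeric ranges for blood pressure and cholesterol are illustrative assumptions, not the author's exact script, and the derived Health column is omitted:

    ```python
    import random
    from faker import Faker

    fake = Faker()
    rows = []
    for _ in range(1000):
        height_cm = random.uniform(150, 200)
        weight_kg = random.uniform(45, 120)
        rows.append({
            "Name": fake.name(),                      # random synthetic name
            "Gender": random.choice(["Male", "Female"]),
            "Age": random.randint(18, 80),
            "Systolic BP": random.randint(90, 180),   # assumed range
            "Diastolic BP": random.randint(60, 120),  # assumed range
            "Cholesterol": random.randint(120, 300),  # assumed range
            "Height (cm)": round(height_cm, 1),
            "Weight (kg)": round(weight_kg, 1),
            "BMI": round(weight_kg / (height_cm / 100) ** 2, 1),
            "Smoker": random.choice([True, False]),
            "Diabetes": random.choice([True, False]),
        })
    ```
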
  4. Synthetic Financial Datasets For Fraud Detection

    • kaggle.com
    zip
    Updated Apr 3, 2017
    Cite
    Edgar Lopez-Rojas (2017). Synthetic Financial Datasets For Fraud Detection [Dataset]. https://www.kaggle.com/datasets/ealaxi/paysim1
    Explore at:
    Available download formats: zip (186385561 bytes)
    Dataset updated
    Apr 3, 2017
    Authors
    Edgar Lopez-Rojas
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

    We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

    Content

    PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that is the provider of the mobile financial service, which is currently running in more than 14 countries around the world.

    This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.

    NOTE: Transactions which are detected as fraud are cancelled, so for fraud detection these columns (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest) must not be used.

    Headers

    This is a sample of 1 row with headers explanation:

    1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

    step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

    type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    amount - amount of the transaction in local currency.

    nameOrig - customer who started the transaction

    oldbalanceOrg - initial balance before the transaction

    newbalanceOrig - new balance after the transaction.

    nameDest - customer who is the recipient of the transaction

    oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

    newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

    isFraud - transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behaviour of the agents aims to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

    isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
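
    A minimal loading sketch based on the column notes above; the CSV filename is hypothetical, and the balance columns are dropped per the NOTE:

    ```python
    import pandas as pd

    df = pd.read_csv("paysim.csv")  # hypothetical filename for the downloaded CSV

    # Per the NOTE above, the four balance columns are unusable for fraud detection
    # because transactions detected as fraud are cancelled; drop them and the labels.
    features = df.drop(columns=["oldbalanceOrg", "newbalanceOrig",
                                "oldbalanceDest", "newbalanceDest",
                                "isFraud", "isFlaggedFraud"])
    labels = df["isFraud"]

    # The isFlaggedFraud rule concerns transfers above 200,000 in local currency.
    big_transfers = df[(df["type"] == "TRANSFER") & (df["amount"] > 200_000)]
    print(len(big_transfers), "transfers exceed the 200,000 threshold")
    ```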

    Past Research

    There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available at http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

    We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    Acknowledgements

    This work is part of the research project “Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

    Please refer to this dataset using the following citations:

    PaySim first paper of the simulator:

    E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016

  5. Dataset for SAR Remote Sensing for Monitoring Harmful Algal Blooms Using...

    • data.niaid.nih.gov
    Updated Apr 12, 2025
    Cite
    Phetanan, Kritnipit (2025). Dataset for SAR Remote Sensing for Monitoring Harmful Algal Blooms Using Deep Learning Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14862787
    Explore at:
    Dataset updated
    Apr 12, 2025
    Authors
    Phetanan, Kritnipit
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is designed to facilitate the monitoring and detection of Harmful Algal Blooms (HABs) using Synthetic Aperture Radar (SAR) remote sensing and deep learning models. It includes Sentinel-1 SAR C-band (TIF), Sentinel-2 MSI (TIF), and water indices (TIF) that were used as input data for the deep learning model. The dataset used in this study originates from external sources and is not the property of the authors. If reused, proper attribution to the original sources is required in accordance with their respective citation guidelines. The authors have modified the dataset for research purposes.

  6. fdata-02-00032_AI for Not Bad.xml

    • frontiersin.figshare.com
    bin
    Updated Jun 1, 2023
    + more versions
    Cite
    Jared Moore (2023). fdata-02-00032_AI for Not Bad.xml [Dataset]. http://doi.org/10.3389/fdata.2019.00032.s002
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Jared Moore
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hype surrounds the promotions, aspirations, and notions of “artificial intelligence (AI) for social good” and its related permutations. These terms, as used in data science and particularly in public discourse, are vague. Far from being irrelevant to data scientists or practitioners of AI, the terms create the public notion of the systems built. Through a critical reflection, I explore how notions of AI for social good are vague, offer insufficient criteria for judgement, and elide the externalities and structural interdependence of AI systems. Instead, the field known as “AI for social good” is best understood and referred to as “AI for not bad.”

  7. Artificial Intelligence for Robust Integration of AMI and Synchrophasor Data...

    • catalog.data.gov
    • data.openei.org
    Updated Apr 16, 2025
    + more versions
    Cite
    Arizona State University (2025). Artificial Intelligence for Robust Integration of AMI and Synchrophasor Data to Significantly Boost Solar Adoption [Dataset]. https://catalog.data.gov/dataset/artificial-intelligence-for-robust-integration-of-ami-and-synchrophasor-data-to-significan
    Explore at:
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Arizona State University
    Description

    The overarching goal of the project is to create a highly efficient framework of machine learning (ML) methods that provide consistent and accurate real-time knowledge of system states from diverse advanced metering infrastructure (AMI) devices and phasor measurement units (PMUs) in order to accommodate extreme levels of PV. To this end, we aim to create a highly efficient AI framework of ML methods that provide consistent and accurate real-time knowledge of system states from diverse AMI devices and PMUs. The files contain bad data detection integrated with a pre-trained Deep Neural Network-based State Estimation (DNN-SE) model and a voltage regulation control algorithm to manage over-voltage issues in the J-1 Feeder with high PV penetration.

  8. Supporting Shellfish Aquaculture in the Chesapeake Bay using Artificial...

    • data.nasa.gov
    • cmr.earthdata.nasa.gov
    • +1more
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Supporting Shellfish Aquaculture in the Chesapeake Bay using Artificial Intelligence to Detect Poor Water Quality through Field Sampling and Remote Sensing [Dataset]. https://data.nasa.gov/dataset/supporting-shellfish-aquaculture-in-the-chesapeake-bay-using-artificial-intelligence-to-de-b5ba2
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Area covered
    Chesapeake Bay
    Description

    We are collecting and analyzing biological, chemical, and physical variables in and above the water at target sites and in the lab, looking for hyperspectral proxies that covary with pollutants. This project applies an AI model to water quality, combining datasets collected around the Bay with remotely sensed data from targeted field work, to address the need to sort through disparate data sets more effectively and identify areas of poor water quality that result in shellfish bed closures.

  9. deny-harmful-behaviour

    • huggingface.co
    Updated May 2, 2025
    Cite
    Nishith Jain (2025). deny-harmful-behaviour [Dataset]. https://huggingface.co/datasets/KingNish/deny-harmful-behaviour
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    Nishith Jain
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    deny-harmful-behaviour is a synthetic dataset designed to help language models recognize and gracefully refuse requests that involve unethical, illegal, or dangerous behaviors. Using humorous, empathetic, and non-cooperative reasoning, each sample demonstrates how a model might respond to harmful prompts without engaging with the request. This dataset was generated using Curator and inspired by prompts found in the mlabonne/harmful_behaviors dataset.… See the full description on the dataset page: https://huggingface.co/datasets/KingNish/deny-harmful-behaviour.
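
    A minimal loading sketch using the standard Hugging Face datasets workflow; the split name "train" is an assumption about how the dataset is published:

    ```python
    from datasets import load_dataset

    ds = load_dataset("KingNish/deny-harmful-behaviour", split="train")
    print(ds)      # column names and number of rows
    print(ds[0])   # one harmful prompt with its graceful refusal
    ```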

  10. FruitNet: Indian Fruits Dataset with quality (Good, Bad & Mixed quality)

    • data.mendeley.com
    Updated Mar 8, 2022
    + more versions
    Cite
    Kailas PATIL (2022). FruitNet: Indian Fruits Dataset with quality (Good, Bad & Mixed quality) [Dataset]. http://doi.org/10.17632/b6fftwbr2v.3
    Explore at:
    Dataset updated
    Mar 8, 2022
    Authors
    Kailas PATIL
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-quality images of fruits are required to solve fruit classification and recognition problems. To build machine learning models, a neat and clean dataset is the elementary requirement. With this objective we have created a dataset of six popular Indian fruits, named “FruitNet”. This dataset consists of 14,700+ high-quality images of 6 different classes of fruits in processed format. The images are divided into 3 sub-folders: 1) Good quality fruits, 2) Bad quality fruits, and 3) Mixed quality fruits. Each sub-folder contains images of the 6 fruits, i.e. apple, banana, guava, lime, orange, and pomegranate. A mobile phone with a high-resolution camera was used to capture the images. The images were taken against different backgrounds and in different lighting conditions. The proposed dataset can be used for training, testing and validation of fruit classification or recognition models.
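
    A minimal loading sketch assuming the three-folder layout described above; the directory names are taken from the description and may differ slightly in the actual download:

    ```python
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    # ImageFolder treats each fruit sub-directory (apple, banana, ...) as one class.
    good = datasets.ImageFolder("FruitNet/Good quality fruits", transform=transform)
    bad = datasets.ImageFolder("FruitNet/Bad quality fruits", transform=transform)
    print(good.classes)  # expected: the six fruit names
    ```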

    [The related article is available at https://www.sciencedirect.com/science/article/pii/S2352340921009616. Cite the article as: V. Meshram, K. Patil, FruitNet: Indian fruits image dataset with quality for machine learning applications, Data in Brief, Volume 40, 2022, 107686, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2021.107686]

  11. Good/Bad data set

    • zenodo.org
    Updated May 1, 2022
    Cite
    Zhenxing Zhang; Lambert Schomaker; Zhenxing Zhang; Lambert Schomaker (2022). Good/Bad data set [Dataset]. http://doi.org/10.5281/zenodo.5850224
    Explore at:
    Dataset updated
    May 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhenxing Zhang; Lambert Schomaker; Zhenxing Zhang; Lambert Schomaker
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Good/Bad data set is used for image-quality research; it contains unsuccessfully and successfully synthesized samples.

  12. E-Commerce Dataset for Practice

    • kaggle.com
    zip
    Updated Nov 9, 2024
    Cite
    SHIVRAJ_SHARMA (2024). E-Commerce Dataset for Practice [Dataset]. https://www.kaggle.com/datasets/shivrajguvi/e-commerce-dataset-for-practice/data
    Explore at:
    Available download formats: zip (4236155 bytes)
    Dataset updated
    Nov 9, 2024
    Authors
    SHIVRAJ_SHARMA
    License

    CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    E-Commerce Synthetic Dataset

    This synthetic dataset simulates a large-scale e-commerce platform with 100,000 records, ideal for data analysis, machine learning, and visualization projects. It includes various data types and reflects real-world e-commerce operations, making it suitable for portfolio projects focused on user behavior analysis, sales trends, and product performance.

    Dataset Overview

    This dataset contains 100,000 rows with details on users, products, and transactions, as well as user engagement and transaction attributes. It is crafted to resemble actual e-commerce data, providing insights into customer demographics, purchasing patterns, and engagement.

    Columns Description

    1. UserID: Unique identifier for each user.
    2. UserName: Simulated username for each user.
    3. Age: Age of the user (ranging from 18 to 70).
    4. Gender: Gender of the user, with possible values: Male, Female, and Non-Binary.
    5. Country: User's country, chosen from USA, Canada, UK, Australia, India, and Germany.
    6. SignUpDate: The date when the user signed up for the platform.

    Product Information

    1. ProductID: Unique identifier for each product.
    2. ProductName: Name of the product purchased (Laptop, Smartphone, Headphones, Shoes, T-shirt, Book, Watch).
    3. Category: Category of the product, including Electronics, Apparel, Books, and Accessories.
    4. Price: Price of the product (randomly set between $10 and $1,000).

    Transaction Details

    1. PurchaseDate: Date of purchase.
    2. Quantity: Number of units purchased in the transaction.
    3. TotalAmount: Total amount spent on the transaction (Price * Quantity).

    User Engagement Metrics

    1. HasDiscountApplied: Indicates whether a discount was applied (True or False).
    2. DiscountRate: Discount rate applied to the transaction (ranging from 0 to 0.5).
    3. ReviewScore: User's review score for the product, ranging from 1 to 5.
    4. ReviewText: Text-based review (Excellent, Good, Average, Poor).

    User Behavior Metrics

    1. LastLogin: Date of the user’s last login.
    2. SessionDuration: Duration of the user’s session in minutes (ranging from 5 to 120 minutes).
    3. DeviceType: Device type used by the user, including Mobile, Desktop, and Tablet.
    4. ReferralSource: Source of referral, which could be Organic Search, Ad Campaign, Email Marketing, or Social Media.
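
    Given the schema above, a quick consistency check is a natural first step. A minimal sketch, assuming the archive contains a single CSV (the filename is hypothetical); note that the description does not say whether TotalAmount reflects the discount:

    ```python
    import pandas as pd

    df = pd.read_csv("ecommerce_synthetic.csv")  # hypothetical filename

    # Check the documented relationship TotalAmount = Price * Quantity.
    mismatch = (df["TotalAmount"] - df["Price"] * df["Quantity"]).abs() > 0.01
    print(mismatch.sum(), "rows deviate from Price * Quantity")

    # Simple engagement summary by device type.
    print(df.groupby("DeviceType")["SessionDuration"].mean())
    ```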

    Usage

    This dataset is intended for:

    • Exploratory Data Analysis (EDA): Understanding customer demographics, popular products, and sales distribution.
    • Data Visualization: Visualizing user engagement, sales trends, and product category performance.
    • Machine Learning Models: Training models on customer segmentation, purchase prediction, and review rating analysis.

    Notes

    • Synthetic Data: This dataset is entirely synthetic and generated for educational purposes.
    • No Personally Identifiable Information (PII): All names, IDs, and records are fictional.

    License

    This dataset is freely available for use in projects and portfolios. When sharing results derived from this dataset, please credit it as a synthetic data source.

  13. Reliability verification of synthetic data.

    • plos.figshare.com
    xls
    Updated Jun 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jianwei Dong; Ruishuang Sun; Zhipeng Yan; Meilun Shi; Xinyu Bi (2025). Reliability verification of synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0325713.t012
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jianwei Dong; Ruishuang Sun; Zhipeng Yan; Meilun Shi; Xinyu Bi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Academic achievement is an important index for measuring the quality of education and students’ learning outcomes. Reasonable and accurate prediction of academic achievement can help improve teachers’ educational methods, and it also provides data support for the formulation of education policies. However, traditional methods for classifying academic performance have many problems, such as low accuracy, limited ability to handle nonlinear relationships, and poor handling of data sparsity. Based on this, our study analyzes various characteristics of students, including personal information, academic performance, attendance rate, family background, extracurricular activities, etc. Our work offers a comprehensive view of the various factors affecting students’ academic performance. In order to improve the accuracy and robustness of student performance classification, we adopted a Gaussian Distribution based Data Augmentation technique (GDO), combined with multiple Deep Learning (DL) and Machine Learning (ML) models. We explored the application of different ML and DL models to classifying student grades, and different feature combinations and data augmentation techniques were used to evaluate the performance of multiple models on the classification task. In addition, we checked the synthetic data’s effectiveness with variance homogeneity tests and p-values, and studied how the oversampling rate affects actual classification results. The results show that the RBFN model based on educational habit features performs best after GDO data augmentation, with an accuracy of 94.12% and an F1 score of 94.46%. These results provide valuable references for the classification of student grades and the development of intervention strategies. Our study proposes new methods and perspectives for educational data analysis and promotes innovation in intelligent education systems.
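
    A minimal sketch of Gaussian-noise oversampling in the spirit of the GDO technique mentioned above; the paper's exact procedure, noise scaling, and per-class handling may differ:

    ```python
    import numpy as np

    def gaussian_oversample(X, n_new, scale=0.1, seed=None):
        """Create n_new synthetic rows by adding per-feature Gaussian noise
        to randomly chosen real rows of X (a 2-D numeric array)."""
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X), size=n_new)
        noise = rng.normal(0.0, scale * X.std(axis=0), size=(n_new, X.shape[1]))
        return X[idx] + noise

    # Example: double a minority class before training a classifier.
    X_minority = np.random.rand(50, 8)              # stand-in for real feature rows
    X_aug = gaussian_oversample(X_minority, n_new=50, seed=42)
    print(X_aug.shape)                              # (50, 8)
    ```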

  14. C4 200M Grammar Error Correction dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Dario Cioni (2023). C4 200M Grammar Error Correction dataset [Dataset]. https://www.kaggle.com/datasets/dariocioni/c4200m/discussion
    Explore at:
    Available download formats: zip (15601869562 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Dario Cioni
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A synthetic Grammar Error Correction dataset consisting of 185 million sentence pairs, created using a Tagged Corruption model on Google's C4 dataset.

    This version of the dataset was extracted from Li Liwei's HuggingFace dataset (https://huggingface.co/datasets/liweili/c4_200m) and converted to TSV format.

    The corruption edits by Felix Stahlberg and Shankar Kumar are licensed under CC BY 4.0. The C4 dataset was released by AllenAI under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

    Format

    This dataset has been converted to Parquet format, but a TSV version is available in previous versions. The reason for the conversion was the poor performance of accessing each file. I'm open to requests and suggestions on how to better handle such a big dataset.

    The TSV version is split into 10 files of approximately 18M samples each. Each sample is a pair formed by the incorrect and the corrected sentence:

    | Incorrect | Corrected |
    | ------------- | ------------- |
    | Much many brands and sellers still in the market. | Many brands and sellers still in the market. |
    | She likes playing in park and come here every week | She likes playing in the park and comes here every week |

    Usage

    I'm planning to release a notebook showing Grammar Error Correction using a seq2seq architecture based on BERT and LSTM. Until then, you can try to build your own model!

    This dataset can be used to train sequence-to-sequence models based on the encoder-decoder approach.
    The task is quite similar to the NMT task; here are some tutorials:

    • NLP from scratch: translation with a seq2seq network and attention
    • Language Translation with nn.Transformer and TorchText
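
    A minimal sketch for reading one TSV shard into (source, target) pairs for such a model; the shard filename is hypothetical:

    ```python
    import pandas as pd

    pairs = pd.read_csv(
        "c4200m_shard_00.tsv", sep="\t",
        names=["incorrect", "corrected"],
        quoting=3,  # csv.QUOTE_NONE: treat quote characters as literal text
    )
    sources = pairs["incorrect"].tolist()  # model input: ungrammatical sentence
    targets = pairs["corrected"].tolist()  # model target: corrected sentence
    print(sources[0], "->", targets[0])
    ```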

    [Grammar Error Correction example image: https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png]

    Acknowledgments

    Thanks to the dataset creators Felix Stahlberg and Shankar Kumar and to Li Liwei for first giving access to the processed dataset.

  15. Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution...

    • zenodo.org
    bin, xz
    Updated Aug 30, 2023
    Cite
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Stefan Milz; Patrick Maeder; Patrick Maeder; Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner (2023). Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution Inpainting for Safety Critical Detect and Avoid [Dataset]. http://doi.org/10.5281/zenodo.8301120
    Explore at:
    Available download formats: bin, xz
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Stefan Milz; Patrick Maeder; Patrick Maeder; Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and especially large data sets, since detection itself is the corner case. Most models suffer from limited ground truth in raw data, e.g. recorded air traffic or frontal flight with a small aircraft. It often leads to poor and critical detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset and present it to an independent object detector that was fully trained on real data.

    This dataset is presented in the following repository and is structured as follows:

    # Synthetic Airborne Intruder Dataset

    This dataset was synthetically generated using an adapted [Pix2Pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) with different background images and object segmentations. Each image contains one object instance.

    The annotations are in the COCO annotation format.

    ## Data Structure
    Synthetic Dataset Root:
    --train
    |--images
    |--instances.json
    --val
    |--images
    |--instances.json
    --test
    |--images
    |--instances.json
    --Background_Sources
    |--sources_train.csv
    |--sources_val.csv
    |--sources_test.cs
    --README.md

    ## Categories

    | Id | Name | Instances over all splits |
    | ---| --- | --- |
    | 0 | large airplane | 1695 |
    | 1 | small airplane | 1255 |
    | 2 | very small airplane | 46 |
    | 3 | helicopter | 2201 |
    | 4 | drone | 961 |
    | 5 | hot air balloon | 315 |
    | 6 | paraglider | 565 |
    | 7 | airship | 42 |
    | 8 | UFO | 0 |

    ### Note:
    UFO is a placeholder for future expansion of the dataset.

    ## Splits
    The dataset consists of 3 splits: train 5900 images, val 590 images, test 590 images.
    The number of instances per class and per split can be seen in the table below:

    Class | train | val | test
    -------|-------|-----|--------
    large airplane | 1416 | 142 | 137
    small airplane | 1046 | 96 | 113
    very small airplane | 38 | 2 | 6
    helicopter | 1812 | 206 | 183
    drone | 800 | 86 | 75
    hot air balloon | 268 | 21 | 26
    paragliders | 492 | 32 | 41
    airship | 28 | 5 | 9
    UFO | 0 | 0 | 0

    ## Sources
    The sources of the background images can be found in the files [here](./Background_Sources/).
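
    A minimal sketch for reading the COCO-format annotations with the standard library only; the path follows the Data Structure section above, and standard COCO keys are assumed:

    ```python
    import json

    with open("train/instances.json") as f:
        coco = json.load(f)

    categories = {c["id"]: c["name"] for c in coco["categories"]}
    images = {im["id"]: im["file_name"] for im in coco["images"]}

    # Each annotation links one bounding box to an image and a category.
    for ann in coco["annotations"][:5]:
        print(images[ann["image_id"]], categories[ann["category_id"]], ann["bbox"])
    ```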

  16. Virtual Reality Dataset used for Proof of Concept in the Validation of the...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jan 1, 2024
    + more versions
    Cite
    Degas, Augustin; Hurter, Christophe (2024). Virtual Reality Dataset used for Proof of Concept in the Validation of the Conflict Detection and Resolution Use Case (ARTIMATION) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7437968
    Explore at:
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    Ecole Nationale de l'Aviation Civile
    Authors
    Degas, Augustin; Hurter, Christophe
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the data used in the Virtual Reality POC for the validation of the Conflict Detection and Resolution (CD&R) use case.

    It represents an extract of different candidate solutions (selected using K-means), either good or bad ones.

  17. Synthetic data using GaussianCopula.

    • plos.figshare.com
    csv
    Updated Jun 2, 2025
    + more versions
    Cite
    Mohammad Junayed Hasan; Jannat Sultana; Silvia Ahmed; Sifat Momen (2025). Synthetic data using GaussianCopula. [Dataset]. http://doi.org/10.1371/journal.pone.0323265.s003
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mohammad Junayed Hasan; Jannat Sultana; Silvia Ahmed; Sifat Momen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Occupational stress is a major concern for employers and organizations as it compromises decision-making and overall safety of workers. Studies indicate that work-stress contributes to severe mental strain, increased accident rates, and in extreme cases, even suicides. This study aims to enhance early detection of occupational stress through machine learning (ML) methods, providing stakeholders with better insights into the underlying causes of stress to improve occupational safety. Utilizing a newly published workplace survey dataset, we developed a novel feature selection pipeline identifying 39 key indicators of work-stress. An ensemble of three ML models achieved a state-of-the-art accuracy of 90.32%, surpassing existing studies. The framework’s generalizability was confirmed through a three-step validation technique: holdout-validation, 10-fold cross-validation, and external-validation with synthetic data generation, achieving an accuracy of 89% on unseen data. We also introduced a 1D-CNN to enable hierarchical and temporal learning from the data. Additionally, we created an algorithm to convert tabular data into texts with 100% information retention, facilitating domain analysis with large language models, revealing that occupational stress is more closely related to the biomedical domain than clinical or generalist domains. Ablation studies reinforced our feature selection pipeline, and revealed sociodemographic features as the most important. Explainable AI techniques identified excessive workload and ambiguity (27%), poor communication (17%), and a positive work environment (16%) as key stress factors. Unlike previous studies relying on clinical settings or biomarkers, our approach streamlines stress detection from simple survey questions, offering a real-time, deployable tool for periodic stress assessment in workplaces.
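
    A minimal sketch of Gaussian-copula sampling, the technique named in this dataset's title; this is a generic numpy/scipy illustration, not the authors' pipeline:

    ```python
    import numpy as np
    from scipy import stats

    def gaussian_copula_sample(X, n_samples, seed=0):
        """Fit a Gaussian copula to a 2-D numeric array and draw synthetic rows:
        map each column to normal scores, estimate their correlation, sample from
        the multivariate normal, and map back through each empirical marginal."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1   # 1..n per column
        z = stats.norm.ppf(ranks / (n + 1))                     # normal scores
        corr = np.corrcoef(z, rowvar=False)
        z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
        u_new = stats.norm.cdf(z_new)
        # Back-transform through the empirical quantiles of each original column.
        return np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])

    real = np.random.rand(200, 5)              # stand-in for numeric survey features
    synthetic = gaussian_copula_sample(real, 100)
    print(synthetic.shape)                     # (100, 5)
    ```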

  18. Fraudulent Account Creation Detection AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Fraudulent Account Creation Detection AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/fraudulent-account-creation-detection-ai-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Fraudulent Account Creation Detection AI Market Outlook

    According to our latest research, the Fraudulent Account Creation Detection AI market size reached USD 2.37 billion globally in 2024, demonstrating robust momentum fueled by the increasing sophistication of digital fraud schemes. The market is set to expand at a CAGR of 23.8% during the forecast period, propelling the market value to USD 18.86 billion by 2033. This remarkable growth is primarily driven by the rapid proliferation of online services and transactions, which has, in turn, heightened the need for advanced fraud prevention technologies leveraging artificial intelligence. As organizations across sectors face mounting threats from fraudulent account creation, investments in AI-powered detection solutions are accelerating worldwide.

    One of the most significant growth factors for the Fraudulent Account Creation Detection AI market is the escalating volume and complexity of cyberattacks targeting digital onboarding processes. As digital transformation initiatives accelerate globally, businesses are onboarding customers, users, and employees remotely at an unprecedented scale. This shift has created exploitable vulnerabilities, with bad actors using synthetic identities, stolen credentials, and automated bots to create fraudulent accounts. AI-driven solutions are uniquely positioned to address these challenges by analyzing vast datasets in real time, detecting subtle anomalies, and adapting to evolving fraud tactics. The ability of AI to learn from new patterns and continuously enhance detection accuracy is compelling organizations to adopt these technologies as a core component of their cybersecurity strategy.

    Another pivotal driver is the tightening of regulatory frameworks across industries, particularly in sectors such as banking, financial services, healthcare, and government. Regulators worldwide now mandate rigorous Know Your Customer (KYC), Anti-Money Laundering (AML), and identity verification processes, increasing the demand for automated, reliable, and scalable solutions. Fraudulent Account Creation Detection AI platforms enable organizations to achieve compliance while reducing manual review burdens and operational costs. Furthermore, the integration of AI with biometric authentication, behavioral analytics, and device fingerprinting is enhancing the efficacy of fraud detection, helping organizations stay ahead of regulatory requirements and minimizing reputational risks associated with data breaches and identity theft.

    The surge in digital commerce and the expansion of online platforms have also contributed significantly to market growth. E-commerce, social media, and telecommunications companies are particularly vulnerable to fake account creation, which can lead to financial losses, reputational damage, and erosion of user trust. AI-powered detection tools are being integrated into customer onboarding workflows, leveraging machine learning, natural language processing, and network analysis to identify suspicious activity with high precision. As user expectations for seamless digital experiences rise, these organizations are prioritizing AI-driven fraud prevention to ensure security without introducing friction or delays in legitimate user journeys.

    Regionally, North America currently leads the Fraudulent Account Creation Detection AI market due to its advanced digital infrastructure, high adoption of online services, and stringent regulatory environment. However, Asia Pacific is rapidly emerging as a high-growth region, propelled by the digitalization of financial services, expanding e-commerce markets, and increasing investments in cybersecurity. Europe maintains a strong presence, driven by GDPR compliance and the adoption of advanced identity verification technologies. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth as organizations in these regions recognize the critical importance of fraud prevention in the digital economy. The interplay of regional dynamics, regulatory landscapes, and technological advancements will continue to shape the competitive landscape and growth trajectory of the market in the coming years.

    Component Analysis

    The Fraudulent Account Creation Detection AI market is segmented by component into software, hardware, and services, each playing a vital role in the deployment and effectiveness of AI-driven fraud prevention solutions. Software solutions dominate the market

  19. Generative AI concerns among U.S. users and non-users 2024

    • statista.com
    Updated May 20, 2025
    Cite
    Statista (2025). Generative AI concerns among U.S. users and non-users 2024 [Dataset]. https://www.statista.com/statistics/1610210/generative-ai-concerns-united-states-users-non-users/
    Explore at:
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    United States
    Description

    Privacy or data security issues were mentioned as the biggest concern about generative artificial intelligence usage in 2024, according to responding adults in the United States. Around 37 percent of current users of this technology mentioned this as a matter of concern, while 45 percent of non-users stated this. Further issues related to this technology, like unauthorized use of one's own original work, lack of transparency on how it works, and potential effects of this technology on the environment, were much higher among current users. On the other hand, non-users were considerably more worried about the potential usage to create and spread harmful content.

  20. synthetic-sugar-quill

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    Daniel Gaderbauer (2025). synthetic-sugar-quill [Dataset]. https://huggingface.co/datasets/Nelathan/synthetic-sugar-quill
    Explore at:
    Dataset updated
    Apr 6, 2025
    Authors
    Daniel Gaderbauer
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Sugarquill with author profiles

    This is a complete literary editing of the original Sugarquill 10k dataset: https://huggingface.co/datasets/allura-org/sugarquill-10k

    • the id references the index of the original dataset
    • filtered out 206 bad rows
    • used primarily gemini-2.0-flash and gemini-2.5-pro-exp-03-25 to rewrite the original short story using the following system prompt. It is inspired by the evaluation system from eqbench creative writing.

    You are an expert literary… See the full description on the dataset page: https://huggingface.co/datasets/Nelathan/synthetic-sugar-quill.
