31 datasets found
  1. Stack Overflow Chat Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    + more versions
    Cite
    Vanroy, Bram (2023). Stack Overflow Chat Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052379
    Dataset updated
    Jun 20, 2023
    Authors
    Vanroy, Bram
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset contains 56,964 conversations in Dutch between an AI assistant and a (fake, machine-generated) "Human", specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

    Data Instances

    {'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

    Data Fields

    id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]

    input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].

    topic: the topic description
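
    Since every input follows the fixed turn format above, it can be split back into (speaker, text) pairs. The following is a minimal sketch (not part of the dataset card), assuming the markers appear literally as [|Human|] and [|AI|]:

    import re

    def parse_turns(conversation: str):
        """Split one `input` string into (speaker, text) pairs."""
        # Drop the fixed Dutch preamble that every conversation starts with.
        body = conversation.replace("Het gesprek tussen de mens en de AI-assistent.", "", 1)
        # Split on the speaker markers while keeping them as delimiters.
        parts = re.split(r"(\[\|Human\|\]|\[\|AI\|\])", body)
        turns, speaker = [], None
        for part in parts:
            if part in ("[|Human|]", "[|AI|]"):
                speaker = part.strip("[|]")          # "Human" or "AI"
            elif speaker is not None and part.strip():
                turns.append((speaker, part.strip()))
        return turns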

    Dataset Creation

    Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to:
    1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
    2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
    3. ensure accurate translation and keep the correctness of the conversation;
    4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    5. translate the human's text using informal, but standard, language;
    6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
    8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
    9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = (
        "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
        " summarizing manner. Translate accurately and formally. Do not provide any explanation"
        " about the translation and do not include the original title."
    )

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
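
    The dataset card does not include the translation code itself; the following is a hedged sketch of how one translation request could have been issued with the parameters stated above (gpt-3.5-turbo, max_tokens=1024, temperature=0), assuming the pre-1.0 openai Python client:

    import openai  # assumes the pre-1.0 `openai` package with ChatCompletion

    SYSTEM_MESSAGE = ("You are a helpful assistant that translates English to Dutch "
                      "to the requirements that are given to you.")

    def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
        # CONVERSATION_TRANSLATION_PROMPT is the template shown earlier in this card.
        prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            max_tokens=1024,
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": prompt + conversation},
            ],
        )
        return response["choices"][0]["message"]["content"]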

    Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

  2. Fake News data set

    • kaggle.com
    zip
    Updated Dec 17, 2021
    Cite
    Bjørn-Jostein (2021). Fake News data set [Dataset]. https://www.kaggle.com/datasets/bjoernjostein/fake-news-data-set
    Explore at:
    zip (56,446,259 bytes)
    Dataset updated
    Dec 17, 2021
    Authors
    Bjørn-Jostein
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Today, we are producing more information than ever before, but not all of it is true. Some of it is actually malicious and harmful, which makes it harder for us to trust any piece of information we come across. On top of that, bad actors can now use language modelling tools like OpenAI's GPT-2 to generate fake news as well. Ever since its initial release, there has been discussion of how it could be misused to generate misleading news articles, automate the production of abusive or fake content for social media, and automate the creation of spam and phishing content.

    How do we figure out what is true and what is fake? Can we do something about it?

    Content

    The dataset consists of around 387,000 pieces of text, sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.

    The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
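
    As an illustration only, a simple baseline for this train/validation/test setup could look as follows; the file names ("train.csv", "validation.csv") and the "text"/"label" column names are assumptions, not documented parts of the download:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Assumed file and column names; adjust to the actual files in the zip.
    train = pd.read_csv("train.csv")
    val = pd.read_csv("validation.csv")

    model = make_pipeline(TfidfVectorizer(max_features=50_000),
                          LogisticRegression(max_iter=1000))
    model.fit(train["text"], train["label"])
    print("validation accuracy:", model.score(val["text"], val["label"]))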

    Acknowledgements

    This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.

  3. CIFAKE: Real and AI-Generated Synthetic Images

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jordan J. Bird (2023). CIFAKE: Real and AI-Generated Synthetic Images [Dataset]. https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Jordan J. Bird
    Description

    CIFAKE: Real and AI-Generated Synthetic Images

    The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.

    CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?

    Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Dataset details

    The dataset contains two classes - REAL and FAKE.

    For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset

    For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4

    There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
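
    A hedged loading sketch: the exact folder layout is not spelled out above, so this assumes the images are unpacked into train/ and test/ directories, each with REAL/ and FAKE/ subfolders:

    import torch
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.ToTensor()])  # CIFAR-10-sized 32x32 images

    # Assumed directory names; adjust to the layout of the actual download.
    train_ds = datasets.ImageFolder("train", transform=tfm)
    test_ds = datasets.ImageFolder("test", transform=tfm)
    train_loader = torch.utils.data.DataLoader(train_ds, batch_size=128, shuffle=True)

    print(train_ds.classes, len(train_ds), len(test_ds))  # expected: 100,000 train / 20,000 test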

    Papers with Code

    The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images

    References

    If you use this dataset, you must cite the following sources

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

    Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Real images are from Krizhevsky & Hinton (2009); fake images are from Bird & Lotfi (2024). The Bird & Lotfi study was published in IEEE Access.

    Notes

    The updates to the dataset on 28 March 2023 did not change the images themselves; the file extensions were renamed from ".jpeg" to ".jpg" and the root folder was re-uploaded to meet Kaggle's usability requirements.

    License

    This dataset is published under the same MIT license as CIFAR-10:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  4. AI Hallucination Cases Database

    • damiencharlotin.com
    Updated Nov 17, 2025
    Cite
    Damien Charlotin (2025). AI Hallucination Cases Database [Dataset]. https://www.damiencharlotin.com/hallucinations/
    Dataset updated
    Nov 17, 2025
    Authors
    Damien Charlotin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A curated database of legal cases where generative AI produced hallucinated citations submitted in court filings.

  5. Real and Fake (AI-Generated) Art Images Dataset

    • kaggle.com
    zip
    Updated Jan 19, 2025
    Cite
    Kaleemullah Younas (2025). Real and Fake (AI-Generated) Art Images Dataset [Dataset]. https://www.kaggle.com/datasets/doctorstrange420/real-and-fake-ai-generated-art-images-dataset
    Explore at:
    zip (283,918,573 bytes)
    Dataset updated
    Jan 19, 2025
    Authors
    Kaleemullah Younas
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset has been meticulously curated to aid in the classification of real and AI-generated (fake) art images. It is part of the "FauxFinder" project, which aims to build advanced models capable of distinguishing between authentic artworks and those generated by AI, GANs, and other digital methods. The dataset is ideal for training, testing, and benchmarking machine learning models, particularly in the domains of computer vision and generative adversarial networks (GANs).

    Dataset Structure

    The dataset is organized into two folders:

    • Real: contains 10,821 images of authentic artworks obtained from reliable art repositories. Source: images were scraped from WikiArt, a comprehensive database of artworks spanning various art styles, genres, and periods.
    • Fake: contains 10,821 AI-generated images created using GANs and other generative models.

    These images were synthesized using advanced techniques to replicate the visual characteristics of real artworks. Additional AI-generated resources were also included to ensure diversity. Each folder contains high-quality images that have been resized to 256x256 pixels for consistency and ease of use in machine learning pipelines.

    Key Features

    • Balanced Classes: both the "Real" and "Fake" folders contain an equal number of images (10,821 each), ensuring a balanced dataset for binary classification tasks.
    • High Quality: all images are resized to a uniform resolution of 256x256 pixels, making them ready for use in CNNs and other deep learning architectures.
    • Diverse Sources: real images encompass a wide range of art styles and genres, while fake images include outputs from various GAN architectures and AI tools, offering a rich diversity for model training.

    Folder Structure:

    Data/
    ├── Real/
    │   ├── image1.jpg
    │   ├── image2.jpg
    │   └── ...
    └── Fake/
        ├── image1.jpg
        ├── image2.jpg
        └── ...

    Get Started
    1. Download the dataset from Kaggle.
    2. Use the folder structure for training and validation splits (e.g., an 80-20 split); see the sketch below.
    3. Build, train, and evaluate your machine/deep learning models for real-vs-fake art detection.
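
    A minimal sketch of step 2, using the Data/Real and Data/Fake layout shown above and an 80-20 train/validation split (torchvision is only one possible choice here):

    import torch
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
    full_ds = datasets.ImageFolder("Data", transform=tfm)   # classes inferred from folder names: ['Fake', 'Real']

    n_train = int(0.8 * len(full_ds))                        # 80-20 split from the example above
    train_ds, val_ds = torch.utils.data.random_split(
        full_ds, [n_train, len(full_ds) - n_train],
        generator=torch.Generator().manual_seed(0),          # fixed seed for a reproducible split
    )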

  6. DFBench_Speech25

    • huggingface.co
    Updated Sep 14, 2025
    Cite
    Deepfake Leaderboard (2025). DFBench_Speech25 [Dataset]. https://huggingface.co/datasets/DFBench/DFBench_Speech25
    Dataset updated
    Sep 14, 2025
    Authors
    Deepfake Leaderboard
    Description

    Speech Deepfake Detection Challenge Dataset (Version 2025)

    This repository contains 4,524 audio files which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format:

    Each line corresponds to one audio file.
    The CSV must contain two columns:
    • filename – the file name (e.g. 983.mp3).
    • label – real for authentic speech, fake for AI-generated speech.
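
    A minimal sketch of producing a CSV in that format; the classifier below is a placeholder, the audio folder name is an assumption, and whether a header row is expected should be checked against the challenge instructions:

    import csv
    from pathlib import Path

    def classify(path: Path) -> str:
        # Placeholder prediction; replace with a real detector returning "real" or "fake".
        return "real"

    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])               # header row (verify it is expected)
        for audio in sorted(Path("audio").glob("*.mp3")):    # assumed folder name and extension
            writer.writerow([audio.name, classify(audio)])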

    Data

    The audio files are provided in… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Speech25.

  7. AI-generated fake podcasts (mostly via NotebookLM)

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Listen Notes (2025). AI-generated fake podcasts (mostly via NotebookLM) [Dataset]. https://www.kaggle.com/datasets/listennotes/ai-generated-fake-podcasts-spams/code
    Explore at:
    zip (1,788,541 bytes)
    Dataset updated
    Nov 5, 2025
    Dataset authored and provided by
    Listen Notes
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Podcasting is a unique space where people can share their voices, ideas, and stories freely. Unlike platforms controlled by a single company (like YouTube or Instagram), podcasting supports true freedom of expression. However, this openness is now being threatened by AI tools, such as Notebook LM, which make it easy to produce fake, low-quality podcasts. Unfortunately, many of these AI-generated shows are created by spammers, scammers, or blackhat SEOs, and they are harming both listeners and genuine podcast creators.

    At Listen Notes, the leading podcast search engine and podcast API, we believe that creating a quality podcast takes real effort. Listeners can tell when a show has been crafted with care, and that’s why we are committed to stopping the spread of fake, AI-generated podcasts on our platform.

    This dataset represents a small subset of AI-generated fake podcasts that were flagged during attempts to add them to the Listen Notes podcast database. These "podcasts" were predominantly created using Notebook LM and are not designed for human consumption.

    The goal of sharing this dataset is to support the AI community in developing more effective tools to combat spam. While it may not be possible to eliminate spam entirely, we can work together to minimize its impact and contribute to making the digital world a better place.

    If you're building a podcast app for discovering human-made shows, PodcastAPI.com is your best bet. Apple Podcasts and Spotify are increasingly flooded with AI-generated fakes.

  8. Stable Diffusion Face Dataset

    • kaggle.com
    Updated Apr 23, 2024
    Cite
    Mohannad Ayman Salah (2024). Stable Diffusion Face Dataset [Dataset]. https://www.kaggle.com/datasets/mohannadaymansalah/stable-diffusion-dataaaaaaaaa
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Mohannad Ayman Salah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the images:

    Fake, AI-generated human faces created with the Stable Diffusion 1.5, 2.1, and SDXL 1.0 checkpoints. The main objective was to generate photos that were as realistic as possible, without any specific style, focusing mainly on the face.

    Fake AI-generated human faces:

    • Images in 512x512px resolution were generated using SD 1.5;
    • Images in 768x768px resolution were generated using SD 2.1;
    • Images in 1024x1024px resolution were generated using SD XL 1.0.

    More details on the images and on how they were created can be found in the readme file.

    The data is not mine; it is taken from a GitHub repository by the user tobecwb. Repo link: https://github.com/tobecwb/stable-diffusion-face-dataset

  9. Placeholder Image Generator Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Aug 5, 2025
    + more versions
    Cite
    Data Insights Market (2025). Placeholder Image Generator Report [Dataset]. https://www.datainsightsmarket.com/reports/placeholder-image-generator-509333
    Explore at:
    doc, pdf, ppt
    Dataset updated
    Aug 5, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The placeholder image generator market is experiencing robust growth, driven by the increasing demand for visually appealing websites and applications without relying on heavy, slow-to-load original images. The market's expansion is fueled by the rise of web design and development, the growing popularity of content creation tools, and the need for efficient prototyping and wireframing. While precise market sizing data is not provided, based on general market trends for similar software-as-a-service (SaaS) offerings and the large number of active players, we can reasonably estimate the 2025 market size to be around $50 million. Considering a conservative Compound Annual Growth Rate (CAGR) of 15%, we project the market to reach approximately $100 million by 2033.

    Key trends include increasing integration with other design tools, the development of AI-powered image generation features, and a shift towards more customizable and versatile placeholder options. The market faces some restraints, such as the availability of free, basic alternatives and the potential for users to overlook the value proposition of dedicated placeholder generators, especially in smaller-scale projects.

    The competitive landscape is highly fragmented, with numerous players ranging from established companies to individual developers offering a variety of features and pricing models. The key success factors for vendors include offering a diverse library of placeholder images, seamless integration with popular design tools, and providing a user-friendly interface. Future growth will likely be influenced by the adoption of advanced technologies like AI and machine learning to enhance image quality, personalization, and efficiency. Furthermore, strategic partnerships with other software providers and expanding the available image types and styles will be vital for sustained market leadership. The potential for growth is significant, especially as developers and designers increasingly prioritize efficiency and high-quality visuals in their projects.

  10. DFBench_Image25

    • huggingface.co
    Cite
    Deepfake Leaderboard, DFBench_Image25 [Dataset]. https://huggingface.co/datasets/DFBench/DFBench_Image25
    Authors
    Deepfake Leaderboard
    Description

    Image Deepfake Detection Challenge Dataset (Version 2025)

    This repository contains 2,920 images which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format:

    Each line corresponds to one image in the dataset.
    The CSV must contain two columns:
    • filename – the image file name (e.g. 983.jpg)
    • label – real for authentic images, fake for AI-generated images

    Data

    The images are provided… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Image25.

  11. Results of the evaluation of fact-checking tools

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 19, 2025
    Cite
    Nosov, Vitalii; Tovstyk, Vadym (2025). Results of the evaluation of fact-checking tools [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14991091
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    Kharkiv National University of Internal Affairs
    Authors
    Nosov, Vitalii; Tovstyk, Vadym
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Scenario for evaluating fact-checking tools in messages and news

    Objective: to evaluate the ability of fact-checking tools to determine the truthfulness of news articles using keywords generated by artificial intelligence.

    Input: real (TRUE) and fake news (FALSE):
    1. Attack on Kyiv region: police showed photos of destruction - TRUE;
    2. No 'red lines' in Ukraine support, French foreign minister tells BBC - TRUE;
    3. Kyiv is in shock. Residents of Pokrovsk say they are waiting for Russian liberators - FALSE;
    4. Russian troops grind towards Pokrovsk, take second Ukrainian town in a week - FALSE.

    Steps:
    1. Select the news.
    2. Create a set of relevant keywords from the news text using the ChatGPT dialogue template: "Hi. Could you please create a list of the most relevant keywords for the article <link to the article>?"
    3. Enter the keywords into the fact-checking tool.
    4. Record the result of the tool's evaluation.

  12. ai-generated-images-vs-real-images

    • kaggle.com
    zip
    Updated May 23, 2024
    Cite
    Tristan Zhang (2024). ai-generated-images-vs-real-images [Dataset]. https://www.kaggle.com/datasets/tristanzhang32/ai-generated-images-vs-real-images
    Explore at:
    zip (52,001,947,102 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Tristan Zhang
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset consists of 30,000 AI-generated images and 30,000 real images.

    Fake:
    • 10,000 from Stable Diffusion
    • 10,000 from MidJourney
    • 10,000 from DALL-E

    Real:
    • 22,500 from Pexels and Unsplash
    • 7,500 from WikiArt

  13. Generative AI aids the publication of fake articles: Methods and materials...

    • zenodo.org
    zip
    Updated Sep 24, 2024
    Cite
    Diomidis Spinellis (2024). Generative AI aids the publication of fake articles: Methods and materials package [Dataset]. http://doi.org/10.5281/zenodo.13832537
    Explore at:
    zip
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Diomidis Spinellis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Measurement technique
    Data were gathered on 10 and 11 September 2024 on a host running an Anaconda Python environment version 1.12.3 and Cygwin Bash version 5.2.15(3). The journal site global-us.mellbaou.com was completely crawled with the wget command in order to obtain the article PDFs. Article metadata were retrieved separately using the get-metadata.sh shell script. Citations and contact emails were extracted from the article PDF files with the extract-citations-emails.py and apply-to-pdfs.sh scripts. Article DOIs and author affiliations were extracted from the article metadata HTML files with the extract-doi-affiliations.py and extract-all-doi-affiliations.sh scripts. The two results were joined based on the article number key and used to create the initial version of the article-details.xlsx Excel file. A list of contact author emails and URLs was created with emails-to-csv.awk, and then used to inform article authors regarding the findings. Articles with a low citation count heuristic (measured through the number of brackets and braces appearing before the article's Reference section) were manually inspected for signs of entirely AI authorship (mainly formulaic content, lack of citations, tables, and figures). A subset of those was also submitted to Turnitin for AI scoring on 2024-09-24.

    The provided Microsoft Excel document, based on the automatically generated article-details.tsv file, was hand-curated as follows.

    • Four duplicate entries with wrongly extracted multiple contact emails were removed (articles 172 and 248).
    • Contact emails were obfuscated to comply with personal data protection regulations.
    • Documents ranked 50 or lower with a low citation count and the ten highest ranked ones were hand-verified regarding their AI content.
    • Turnitin AI generation scores were added for one in every ten documents among the low citation count documents above and one in every two among the high count documents above. Turnitin AI scores were obtained using the web-based service on 2024-09-24.
    • Email domains were extracted from emails and listed in a separate column.
    • A column with undeliverable emails was added and hand-filled based on failed delivery reports regarding the sent notification emails.
    • Affiliations of authors of publications that were unlikely to have been submitted by them (mainly evidenced by wrong contact emails) were marked in bold.
    • Notes with email communications and other provenance details were added to substantiate the preceding actions.
    Description

    This package contains Python, shell, awk scripts, and data used to obtain the curated table associated with the above named article. It also contains (in this file) a description of the methods employed to obtain the curated table with details regarding the published articles.

    Contents

    The following items are included.

    • README.md: This file
    • article-details.xlsx: Curated table with details of published articles in Microsoft Excel file format
    • index.html: HTML document with
      • links to GIJIR materials saved in the Internet Archive
      • a list of all the GIJIR articles’ citation data according to Crossref and links to each article’s locally available landing page, full-text PDF, plus links to Crossref metadata and the article via DOI and original journal URL. (Note that non-local, non-archived links may rot over time.)
    • Makefile: Commands that orchestrate the articles’ analysis
    • get-metadata.sh: Obtain article metadata pages from the journal’s web site
    • apply-to-pdfs.sh: Apply the specified Python script to all article PDFs
    • extract-citations-emails.py: Extract number of probable in-text citations and corresponding author email from article PDF
    • extract-doi-affiliations.py: Extract article DOI and affiliations from an article’s metadata
    • extract-all-doi-affiliations.sh: Extract article DOI and affiliations from all articles’ metadata
    • emails-to-csv.awk: Convert emails and article numbers to CSV with URL for sending emails
    • ybs-works.json: Results of Crossref query to obtain all the publisher’s works made on 2024-09-22
    • ChatGPT: Prompts and responses associated with the generation of a fake article in one of the journal’s topics.
    • global-us/metadata/: Article metadata as HTML files collected on 2024-09-10
    • global-us/global-us.mellbaou.com/index.php/global/article/download/: A copy of the journal’s article PDFs as crawled on 2024-09-10

  14. real-vs-fake-human-voice-deepfake-audio

    • huggingface.co
    Updated Nov 17, 2025
    + more versions
    Cite
    Unidata NLP (2025). real-vs-fake-human-voice-deepfake-audio [Dataset]. https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio
    Dataset updated
    Nov 17, 2025
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Deepfake Audio Dataset - 5,000 Audio

    Dataset contains 5,000 audio files, comprising both authentic human recordings and synthetic AI-generated voice samples. It is designed for advanced research in deepfake detection, focusing on detecting fake voices and analysing generated speech. Specifically engineered to challenge voice authentication systems, it supports the development of robust models for real vs fake human voice recognition.

    Dataset characteristics:… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio.
  15. S2 Data -

    • plos.figshare.com
    txt
    Updated Mar 13, 2024
    + more versions
    Cite
    Wataru Zaitsu; Mingzhe Jin; Shunichi Ishihara; Satoru Tsuge; Mitsuyuki Inaba (2024). S2 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0299031.s002
    Explore at:
    txt
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Wataru Zaitsu; Mingzhe Jin; Shunichi Ishihara; Satoru Tsuge; Mitsuyuki Inaba
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public comments are an important channel for civic opinion when the government establishes rules. However, recent AI can easily generate large quantities of disinformation, including fake public comments. We attempted to distinguish between human public comments and ChatGPT-generated public comments (including ChatGPT emulating those of humans) using Japanese stylometric analysis. Study 1 used multidimensional scaling (MDS) to compare 500 texts across five classes: human public comments; GPT-3.5 and GPT-4 public comments generated only from the titles of human public comments (i.e., zero-shot learning, GPTzero); and GPT-3.5 and GPT-4 comments generated by presenting sentences of human public comments and instructing the model to emulate them (i.e., one-shot learning, GPTone). The MDS results showed that the Japanese stylometric features of the human public comments were completely different from those of the GPTzero-generated texts. Moreover, GPTone-generated public comments were closer to human comments than those generated by GPTzero. Study 2 examined the performance of a random forest (RF) classifier in distinguishing three classes (human, GPTzero, and GPTone texts). The RF classifier showed the best precision for human public comments at approximately 90%, and the best precision for the fake public comments generated by GPT (GPTzero and GPTone) was 99.5%, achieved by focusing on integrated writing-style features: phrase patterns, parts-of-speech (POS) bigrams and trigrams, and function words. Therefore, the current study concluded that GPT-generated fake public comments can, at present, be discriminated from those written by humans.

  16. DeepFake Videos Dataset

    • kaggle.com
    Updated Jun 16, 2025
    + more versions
    Cite
    Unidata (2025). DeepFake Videos Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/deepfake-videos-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    DeepFake Videos for detection tasks

    Dataset consists of 10,000+ files featuring 7,000+ people, providing a comprehensive resource for research in deepfake detection and deepfake technology. It includes real videos of individuals with AI-generated faces overlaid, specifically designed to enhance liveness detection systems.

    By utilizing this dataset, researchers can advance their understanding of deepfake generation and improve the performance of detection methods.

    Metadata for the dataset

    The dataset was created by generating fake faces and overlaying them onto authentic video clips sourced from platforms such as aisaver.io, faceswapvideo.ai, and magichour.ai. The videos feature different individuals, backgrounds, and scenarios, making the dataset suitable for various research applications.

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Researchers can leverage this dataset to enhance their understanding of deepfake detection and contribute to the development of more robust detection methods that can effectively combat the challenges posed by deepfake technology.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  17. Bangla Audio Dataset: Original and DeepFake Voices for AI-Based Voice...

    • data.mendeley.com
    Updated Jun 24, 2024
    + more versions
    Cite
    Md Akteruzzaman Dipto (2024). Bangla Audio Dataset: Original and DeepFake Voices for AI-Based Voice Analysis and Detection [Dataset]. http://doi.org/10.17632/4ftmwt86vr.4
    Dataset updated
    Jun 24, 2024
    Authors
    Md Akteruzzaman Dipto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • Total number of audio files: 4500 (Original: 2250; Fake: 2250)
    • Total number of individuals: 75
    • Total number of distinct sets: 15
    • Number of distinct sentences per set: 30
    • Number of individuals speaking each set: 5
    • Duration of each sentence recording: 2 to 6 seconds
    • Male-female ratio: nearly balanced

    The data has a parent folder named 'DATASET', under which there are 75 folders with unique names identifying the speakers' characteristics; each of these contains two folders named 'Real' and 'Fake', each holding 30 numerically named voice recordings.
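
    A small sketch of indexing that layout into (speaker, label, path) records, assuming the folder names are exactly 'DATASET', 'Real', and 'Fake' as described:

    from pathlib import Path

    index = []
    for speaker_dir in sorted(Path("DATASET").iterdir()):
        if not speaker_dir.is_dir():
            continue
        for label in ("Real", "Fake"):
            for audio in sorted((speaker_dir / label).glob("*")):
                index.append({"speaker": speaker_dir.name, "label": label, "path": str(audio)})

    print(len(index))  # expected 75 folders x 2 labels x 30 recordings = 4500 files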

    Dataset Usage Agreement (DUA)

    1. Grant of Use 1.1 The Author grants the user a non-exclusive, non-transferable license to use the dataset solely for academic and research purposes. 1.2 The User agrees to use the dataset only for purposes consistent with this Agreement and will not use the dataset for any commercial purposes without the prior written consent of the Provider.

    2. Data Security and Privacy 2.1 The User shall ensure that the dataset is stored securely and is not accessible to unauthorized individuals. 2.2 The User agrees to comply with all applicable data protection and privacy laws in relation to the use of the dataset.

    3. Attribution 3.1 The User agrees to provide appropriate acknowledgment to the Provider in any publications, presentations, or other outputs that utilize the dataset. The acknowledgment should include citation.

    4. Restrictions on Use 4.1 The User shall not distribute, share, sell, or sublicense the dataset to any third party without the prior written consent of the Provider. 4.2 The User shall not attempt to re-identify any individuals from the dataset.

    4.3 The User shall not use the dataset to develop any technologies or applications that are intended to be used for malicious purposes, including but not limited to, creating or distributing deepfakes.

    5. Intellectual Property 5.1 The Author retains all rights, title, and interest in and to the dataset, including any intellectual property rights. 5.2 The User agrees not to claim ownership of the dataset or any derivative works based on the dataset.

    6. Liability 6.1 The User acknowledges that the dataset is a research tool provided for academic purposes and assumes full responsibility for the use of the dataset. 6.2 The Provider shall not be liable for any damages arising from the use of the dataset.

    7. Termination 7.1 The Author reserves the right to terminate this Agreement at any time if the User breaches any terms of this Agreement. 7.2 Upon termination, the User agrees to destroy all copies of the dataset in their possession.

    By using the dataset, the User acknowledges that they have read, understood, and agreed to be bound by the terms of this Agreement.

    Contact info: Sumaiya Akhtar Mitu mitu.ict12@gmail.com

  18. REAL/AI VIDEO DATASET

    • kaggle.com
    zip
    Updated Jan 21, 2025
    Cite
    MRKANZET (2025). REAL/AI VIDEO DATASET [Dataset]. https://www.kaggle.com/datasets/kanzeus/realai-video-dataset
    Explore at:
    zip (2,855,297,695 bytes)
    Dataset updated
    Jan 21, 2025
    Authors
    MRKANZET
    Description

    Purpose:
    • Tagging AI-generated videos on social media platforms.
    • Auto-marking videos such as fake news, fake events, unreal situations, etc.
    • Predicting whether a video was promoted/edited.
    • Ethics.

    Contribute to the growth of this dataset with AI videos: kanz.misystems@gmail.com

  19. synthetic-legal-contracts-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    + more versions
    Cite
    Syncora_ai (2025). synthetic-legal-contracts-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-legal-contracts-dataset
    Explore at:
    zip (109,408 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Legal Contract Dataset — Powered by Syncora

    High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research

    About This Dataset

    This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.

    This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.

    Dataset Context & Features

    • Structured JSONL Format: includes system, user, and assistant roles for conversational Q&A.
    • Contract & Compliance Questions: modeled on SEC filings and legal disclosure scenarios.
    • Statistically Realistic Fake Data: fully synthetic, mirrors real-world patterns without privacy risks.
    • NLP-Ready: optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines.
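
    Since the exact key names inside each JSONL record are not given here, the following inspection sketch only loads the lines and prints the top-level keys; the file name is an assumption:

    import json

    with open("legal_contracts.jsonl", encoding="utf-8") as f:   # assumed file name
        records = [json.loads(line) for line in f if line.strip()]

    print(len(records), "records; keys of the first record:", sorted(records[0].keys()))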

    🚨 Simulated Regulatory Scenarios

    This synthetic legal dataset is not just for LLM training — it enables developers and researchers to create simulated regulatory scenarios. Examples include:

    • Detecting high-risk clauses in contracts before real-world deployment
    • Testing AI models on rare or edge-case compliance situations
    • Simulating SEC filings and corporate disclosures to evaluate NLP models
    • Benchmarking contract analysis tools safely without exposing sensitive data

    This section gives the dataset a practical, standout value, showing it can be used for stress-testing AI in legal environments.

    Why Syncora?

    Syncora.ai creates synthetic datasets optimized for LLM training with:

    • High similarity to real-world distributions
    • Free dataset access for research and open innovation
    • 0% privacy leakage — fully synthetic fake data
    • Robust benchmarking potential for AI & legal NLP tasks

    🔗 Generate Your Own Synthetic Data

    Take your AI projects further with Syncora.ai:
    → Generate your own synthetic datasets now

    📜 License

    This dataset is released under the MIT License.

    It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.

  20. synthetic-medical-records-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Syncora_ai (2025). synthetic-medical-records-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-medical-records-dataset
    Explore at:
    zip (1,582,643 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset — Powered by Syncora

    High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research

    About This Dataset

    This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.

    It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.

    This free dataset is designed for:

    • Healthcare AI research
    • Predictive analytics (disease risk, treatment outcomes)
    • LLM training on structured tabular healthcare data
    • Medical data science education & experimentation

    Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.

    Dataset Context & Features

    The dataset captures patient-level hospital information, including:

    • Demographics: Age, Gender, Blood Type
    • Medical Details: Diagnosed medical condition, prescribed medication, test results
    • Hospital Records: Admission type (emergency, planned, outpatient), billing amount
    • Target Applications: Predictive modeling, anomaly detection, cost optimization

    All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
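
    As an illustration of the predictive-modeling use case, here is a hedged sketch of an admission-type classifier; the CSV file name and the column names are assumptions based on the fields listed above, not the documented schema:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("synthetic_medical_records.csv")                # assumed file name
    features = ["Age", "Gender", "Blood Type", "Medical Condition"]  # assumed column names
    X = pd.get_dummies(df[features])
    y = df["Admission Type"]                                         # assumed column name

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))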

    LLM Training & Generative AI Applications 🧠

    Unlike most healthcare datasets, this one is tailored for LLM training:

    • Fine-tune LLMs on tabular + medical data for reasoning tasks
    • Create medical report generators from structured fields (e.g., convert demographics + condition + test results into natural language summaries)
    • Use as fake data for prompt engineering, synthetic QA pairs, or generative simulations
    • Safely train LLMs to understand healthcare schemas without exposing private patient data

    Machine Learning & AI Use Cases

    • Predictive Modeling: Forecast patient outcomes or readmission likelihood
    • Classification: Disease diagnosis prediction using demographic and medical variables
    • Clustering: Patient segmentation by condition, treatment, or billing pattern
    • Healthcare Cost Prediction: Estimate and optimize billing amounts
    • Bias & Fairness Testing: Study algorithmic bias without exposing sensitive patient data

    Why Syncora?

    Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.

    Key benefits:

    • Privacy-first: 100% synthetic, zero risk of re-identification
    • Statistical accuracy: Feature relationships preserved for ML & LLM training
    • Regulatory compliance: HIPAA, GDPR, DPDP safe
    • Scalability: Generate millions of synthetic patient records with agentic AI

    Ideas for Exploration

    • Which medical conditions correlate with higher billing amounts?
    • Can test results predict hospitalization type?
    • How do demographics influence treatment or billing trends?
    • Can synthetic datasets reduce bias in healthcare AI & LLMs?

    🔗 Generate Your Own Synthetic Data

    Take your AI projects to the next level with Syncora.ai:
    → Generate your own synthetic datasets now

    Licensing & Compliance

    This is a free dataset, 100% synthetic, and contains no real patient information.
    It is safe for public use in education, research, open-source contributions, LLM training, and AI development.
