Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between the AI and the "Human". It always starts with Het gesprek tussen de mens en de AI-assistent. and contains at least one occurrence of both [|AI|] and [|Human|]; a parsing sketch follows this field list.
topic: the topic description
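Because each input is stored as one flat string, downstream use usually starts by splitting it back into turns. A minimal sketch of such a parser is shown below; the helper name split_turns and the exact regular expression are illustrative choices, not part of the dataset.

```python
import re

def split_turns(conversation: str):
    """Split a flat `input` string into (speaker, text) pairs.

    Assumes the format described above: an introductory sentence followed by
    alternating [|Human|] and [|AI|] markers.
    """
    # Capture each speaker marker and the text up to the next marker (or the end).
    pattern = re.compile(r"\[\|(Human|AI)\|\](.*?)(?=\[\|(?:Human|AI)\|\]|$)", re.DOTALL)
    return [(speaker, text.strip()) for speaker, text in pattern.findall(conversation)]

# Example, using the instance shown above:
# turns = split_turns(example["input"])
# -> [("Human", "Ik heb wat hulp nodig met Spark SQL union ..."), ("AI", "..."), ...]
```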
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters; a call sketch is given after the prompts and system message below.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
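Putting the parameters, prompt template, and system message together, a translation call could look roughly like the sketch below. This is a reconstruction, not the original script: it uses the current openai Python client (the original work may have used an older interface), and the way the conversation text is appended to the prompt is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = ("You are a helpful assistant that translates English to Dutch "
                  "to the requirements that are given to you.")

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    """Translate one conversation using the CONVERSATION_TRANSLATION_PROMPT defined above."""
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            # Appending the conversation after the instructions is an assumption.
            {"role": "user", "content": prompt + "\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```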
Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3.5 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all of it is true. Some of it is actually malicious and harmful, and that makes it harder to trust any piece of information we come across. Not only that: bad actors can now use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there has been discussion of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text that have been sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test sets such that each set has an equal split of the two classes.
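As a quick sanity check of that split, something like the following could be used to inspect the class balance per split; the file names (train.csv, val.csv, test.csv) and the label column name are assumptions, since the exact file layout is not described here.

```python
import pandas as pd

# Hypothetical file names and column name; adjust to the actual download layout.
for split in ("train", "val", "test"):
    df = pd.read_csv(f"{split}.csv")
    # `label` is assumed to hold the real/fake class for each text.
    print(split, df["label"].value_counts(normalize=True).to_dict())
```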
This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd, with the aim of making it really easy for anyone to get started in the world of AI. This AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.
The quality of AI-generated images has rapidly increased, leading to concerns about authenticity and trustworthiness.
CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?
Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
The dataset contains two classes - REAL and FAKE.
For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset
For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4
There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
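For reference, a minimal loading sketch with torchvision is given below. It assumes the images are unpacked into train/ and test/ directories with REAL/ and FAKE/ subfolders; that layout is an assumption, so adjust the paths to the actual download.

```python
import torch
from torchvision import datasets, transforms

# CIFAR-10-sized images; the resize is defensive in case of mixed sizes.
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# ImageFolder maps each subdirectory (REAL, FAKE) to a class index.
train_set = datasets.ImageFolder("cifake/train", transform=transform)
test_set = datasets.ImageFolder("cifake/test", transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)

print(train_set.classes)               # e.g. ['FAKE', 'REAL']
print(len(train_set), len(test_set))   # expected: 100000, 20000
```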
The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images
If you use this dataset, you must cite the following sources
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
Real images are from Krizhevsky & Hinton (2009), fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is available here.
The updates to the dataset on the 28th of March 2023 did not change the images themselves; files with the ".jpeg" extension were renamed to ".jpg" and the root folder was re-uploaded to meet Kaggle's usability requirements.
This dataset is published under the same MIT license as CIFAR-10:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated database of legal cases where generative AI produced hallucinated citations submitted in court filings.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset has been meticulously curated to aid in the classification of real and AI-generated (fake) art images. It is part of the "FauxFinder" project, which aims to build advanced models capable of distinguishing between authentic artworks and those generated by AI, GANs, and other digital methods. The dataset is ideal for training, testing, and benchmarking machine learning models, particularly in the domains of computer vision and generative adversarial networks (GANs).
Dataset Structure
The dataset is organized into two folders:
- Real: Contains 10,821 images of authentic artworks obtained from reliable art repositories. Source: images were scraped from WikiArt, a comprehensive database of artworks spanning various art styles, genres, and periods.
- Fake: Contains 10,821 AI-generated images created using GANs and other generative models. These images were synthesized using advanced techniques to replicate the visual characteristics of real artworks, and additional AI-generated resources were included to ensure diversity.
Each folder contains high-quality images that have been resized to 256x256 pixels for consistency and ease of use in machine learning pipelines.
Key Features
- Balanced Classes: Both the "Real" and "Fake" folders contain an equal number of images (10,821 each), ensuring a balanced dataset for binary classification tasks.
- High Quality: All images are resized to a uniform resolution of 256x256 pixels, making them ready for use in CNNs and other deep learning architectures.
- Diverse Sources: Real images encompass a wide range of art styles and genres, while fake images include outputs from various GAN architectures and AI tools, offering a rich diversity for model training.
Folder Structure:
```
Data/
├── Real/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── Fake/
    ├── image1.jpg
    ├── image2.jpg
    └── ...
```
Get Started
1. Download the dataset from Kaggle.
2. Use the folder structure for training and validation splits (e.g., 80-20 split).
3. Build, train, and evaluate your machine/deep learning models for real-vs-fake art detection (a minimal split sketch follows these steps).
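A minimal sketch of step 2, assuming the Data/Real and Data/Fake layout shown above; the 80-20 ratio follows the example in the steps, and the helper name make_split is illustrative.

```python
import random
from pathlib import Path

def make_split(root: str = "Data", val_fraction: float = 0.2, seed: int = 42):
    """Return (train, val) lists of (image_path, label) pairs from Real/ and Fake/."""
    samples = []
    for label in ("Real", "Fake"):
        for path in sorted((Path(root) / label).glob("*.jpg")):
            samples.append((path, label))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]

train_samples, val_samples = make_split()
print(len(train_samples), len(val_samples))  # roughly 17,314 / 4,328 for 21,642 images
```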
Speech Deepfake Detection Challenge Dataset (Version 2025)
This repository contains 4,524 audio files which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format (a CSV-writing sketch follows the column description):
Each line corresponds to one audio file.
The CSV must contain two columns:
filename – the file name (e.g. 983.mp3).
label – real for authentic speech, fake for AI-generated speech.
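A minimal sketch of writing such a submission file is shown below; the predictions dict is a hypothetical stand-in for whatever classifier output you produce.

```python
import csv

# Hypothetical predictions: file name -> "real" or "fake".
predictions = {"983.mp3": "fake", "984.mp3": "real"}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])       # required two-column header
    for filename, label in sorted(predictions.items()):
        writer.writerow([filename, label])
```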
Data
The audio files are provided in… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Speech25.
GNU Affero General Public License v3.0: http://www.gnu.org/licenses/agpl-3.0.html
Podcasting is a unique space where people can share their voices, ideas, and stories freely. Unlike platforms controlled by a single company (like YouTube or Instagram), podcasting supports true freedom of expression. However, this openness is now being threatened by AI tools, such as Notebook LM, which make it easy to produce fake, low-quality podcasts. Unfortunately, many of these AI-generated shows are created by spammers, scammers, or blackhat SEOs, and they are harming both listeners and genuine podcast creators.
At Listen Notes, the leading podcast search engine and podcast API, we believe that creating a quality podcast takes real effort. Listeners can tell when a show has been crafted with care, and that’s why we are committed to stopping the spread of fake, AI-generated podcasts on our platform.
This dataset represents a small subset of AI-generated fake podcasts that were flagged during attempts to add them to the Listen Notes podcast database. These "podcasts" were predominantly created using Notebook LM and are not designed for human consumption.
The goal of sharing this dataset is to support the AI community in developing more effective tools to combat spam. While it may not be possible to eliminate spam entirely, we can work together to minimize its impact and contribute to making the digital world a better place.
If you're building a podcast app for discovering human-made shows, PodcastAPI.com is your best bet. Apple Podcasts and Spotify are increasingly flooded with AI-generated fakes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fake AI-generated human faces using Stable Diffusion 1.5, 2.1, and SDXL 1.0 checkpoints. The main objective was to generate photos that were as realistic as possible, without any specific style, focusing mainly on the face.
Fake AI-generated human faces
More details on the images and the process used to create them can be found in the README file.
The data is not mine; it is taken from a GitHub repository by the user tobecwb. Repo link: https://github.com/tobecwb/stable-diffusion-face-dataset
https://www.datainsightsmarket.com/privacy-policy
The placeholder image generator market is experiencing robust growth, driven by the increasing demand for visually appealing websites and applications without relying on heavy, slow-to-load original images. The market's expansion is fueled by the rise of web design and development, the growing popularity of content creation tools, and the need for efficient prototyping and wireframing. While precise market sizing data is not provided, based on general market trends for similar software-as-a-service (SaaS) offerings and the large number of active players, we can reasonably estimate the 2025 market size to be around $50 million. Considering a conservative Compound Annual Growth Rate (CAGR) of 15%, we project the market to reach approximately $100 million by 2033.

Key trends include increasing integration with other design tools, the development of AI-powered image generation features, and a shift towards more customizable and versatile placeholder options. The market faces some restraints, such as the availability of free, basic alternatives and the potential for users to overlook the value proposition of dedicated placeholder generators, especially in smaller-scale projects. The competitive landscape is highly fragmented, with numerous players ranging from established companies to individual developers offering a variety of features and pricing models.

The key success factors for vendors include offering a diverse library of placeholder images, seamless integration with popular design tools, and providing a user-friendly interface. Future growth will likely be influenced by the adoption of advanced technologies like AI and machine learning to enhance image quality, personalization, and efficiency. Furthermore, strategic partnerships with other software providers and expanding the available image types and styles will be vital for sustained market leadership. The potential for growth is significant, especially as developers and designers increasingly prioritize efficiency and high-quality visuals in their projects.
Image Deepfake Detection Challenge Dataset (Version 2025)
This repository contains 2,920 images which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format:
Each line corresponds to one image in the dataset.
The CSV must contain two columns:
filename – the image file name (e.g. 983.jpg)
label – real for authentic images, fake for AI-generated images
Data
The images are provided… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Image25.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Scenario for evaluating fact-checking tools on messages and news.
Objective: to evaluate the ability of fact-checking tools to determine the truthfulness of news articles using keywords generated by artificial intelligence.
Input: real (TRUE) and fake (FALSE) news items:
1. Attack on Kyiv region: police showed photos of destruction - TRUE;
2. No 'red lines' in Ukraine support, French foreign minister tells BBC - TRUE;
3. Kyiv is in shock. Residents of Pokrovsk say they are waiting for Russian liberators - FALSE;
4. Russian troops grind towards Pokrovsk, take second Ukrainian town in a week - FALSE.
Steps:
1. Select the news item.
2. Create a set of relevant keywords from the news text using the ChatGPT dialogue template: "Hi. Could you please create a list of the most relevant keywords for the article [link to the article]?"
3. Enter the keywords into the fact-checking tool.
4. Record the result of the tool's evaluation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset consists of 30,000 AI-generated images and 30,000 real images.
Fake:
- 10,000 from Stable Diffusion
- 10,000 from MidJourney
- 10,000 from DALL-E
Real:
- 22,500 from Pexels and Unsplash
- 7,500 from WikiArt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains Python, shell, and awk scripts, as well as the data used to obtain the curated table associated with the above-named article. It also contains (in this file) a description of the methods employed to obtain the curated table, with details regarding the published articles.
The following items are included.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Deepfake Audio Dataset - 5,000 Audio
The dataset contains 5,000 audio files, comprising both authentic human recordings and synthetic AI-generated voice samples. It is designed for advanced research in deepfake detection, focusing on fake-voice detection and generated-speech analysis. Specifically engineered to challenge voice authentication systems, it supports the development of robust models for real-vs-fake human voice recognition.
Dataset characteristics:… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public comments are an important channel for civic opinion when the government establishes rules. However, recent AI can easily generate large quantities of disinformation, including fake public comments. We attempted to distinguish between human public comments and ChatGPT-generated public comments (including ChatGPT output that emulates human comments) using Japanese stylometric analysis. Study 1 used multidimensional scaling (MDS) to compare 500 texts from five classes: human public comments; GPT-3.5 and GPT-4 comments generated only from the titles of human public comments (i.e., zero-shot learning, GPTzero); and GPT-3.5 and GPT-4 comments generated by presenting the sentences of human public comments and instructing the model to emulate them (i.e., one-shot learning, GPTone). The MDS results showed that the Japanese stylometric features of human public comments were completely different from those of the GPTzero-generated texts. Moreover, GPTone-generated public comments were closer to human comments than those generated by GPTzero.

Study 2 evaluated the performance of random forest (RF) classifiers for distinguishing the three classes (human, GPTzero, and GPTone texts). RF classifiers achieved the best precision for human public comments of approximately 90%, and the best precision for the fake public comments generated by GPT (GPTzero and GPTone) was 99.5%, obtained by focusing on integrated writing-style features: phrase patterns, parts-of-speech (POS) bigrams and trigrams, and function words. Therefore, the study concluded that GPT-generated fake public comments can currently be discriminated from those written by humans.
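The study's exact Japanese stylometric pipeline is not reproduced here, but the classification setup it describes corresponds roughly to the scikit-learn pattern sketched below; the character n-gram features are a simplified stand-in for the paper's phrase-pattern, POS n-gram, and function-word features, so this is an illustration of the approach rather than a reimplementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# texts: list of comment strings; labels: "human", "GPTzero", or "GPTone".
# Character n-grams are used here only as a simplified proxy for the
# POS bigram/trigram and function-word features described in the study.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),
    RandomForestClassifier(n_estimators=300, random_state=0),
)
# clf.fit(train_texts, train_labels)
# print(clf.score(test_texts, test_labels))
```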
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of 10,000+ files featuring 7,000+ people, providing a comprehensive resource for research in deepfake detection and deepfake technology. It includes real videos of individuals with AI-generated faces overlaid, specifically designed to enhance liveness detection systems.
By utilizing this dataset, researchers can advance their understanding of deepfake generation and improve the performance of detection methods.
The dataset was created by generating fake faces and overlaying them onto authentic video clips sourced from platforms such as aisaver.io, faceswapvideo.ai, and magichour.ai. The videos feature different individuals, backgrounds, and scenarios, making the dataset suitable for various research applications.
Researchers can leverage this dataset to enhance their understanding of deepfake detection and contribute to the development of more robust detection methods that can effectively combat the challenges posed by deepfake technology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Total number of audio files: 4,500 (Original: 2,250; Fake: 2,250)
Total number of individuals: 75
Total number of distinct sets: 15
Total number of distinct sentences per set: 30
Number of individuals speaking each set: 5
Duration of each spoken sentence: 2 to 6 seconds
Male-female ratio: nearly balanced
The data has a parent folder named ‘DATASET’, under which there are 75 folders with unique names that identify the speaker characteristics; under each of these there are two folders named ‘Real’ and ‘Fake’, each containing 30 numerically named voice recordings.
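Given that layout, a traversal along the following lines could build (file, label) pairs for training; the audio file extension is not stated above, so the glob pattern is deliberately generic.

```python
from pathlib import Path

def collect_samples(root: str = "DATASET"):
    """Yield (audio_path, label) pairs from DATASET/<set_folder>/{Real,Fake}/."""
    for set_folder in sorted(Path(root).iterdir()):
        if not set_folder.is_dir():
            continue
        for label in ("Real", "Fake"):
            # Extension is assumed unknown; the card does not state the audio format.
            for audio_path in sorted((set_folder / label).glob("*.*")):
                yield audio_path, label

samples = list(collect_samples())
print(len(samples))  # expected: 4,500 (75 folders x 2 classes x 30 files)
```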
Dataset Usage Agreement (DUA)
Grant of Use
1.1 The Author grants the user a non-exclusive, non-transferable license to use the dataset solely for academic and research purposes.
1.2 The User agrees to use the dataset only for purposes consistent with this Agreement and will not use the dataset for any commercial purposes without the prior written consent of the Provider.
Data Security and Privacy
2.1 The User shall ensure that the dataset is stored securely and is not accessible to unauthorized individuals.
2.2 The User agrees to comply with all applicable data protection and privacy laws in relation to the use of the dataset.
Attribution
3.1 The User agrees to provide appropriate acknowledgment to the Provider in any publications, presentations, or other outputs that utilize the dataset. The acknowledgment should include citation.
Restrictions on Use
4.1 The User shall not distribute, share, sell, or sublicense the dataset to any third party without the prior written consent of the Provider.
4.2 The User shall not attempt to re-identify any individuals from the dataset.
4.3 The User shall not use the dataset to develop any technologies or applications that are intended to be used for malicious purposes, including but not limited to, creating or distributing deepfakes.
Intellectual Property
5.1 The Author retains all rights, title, and interest in and to the dataset, including any intellectual property rights.
5.2 The User agrees not to claim ownership of the dataset or any derivative works based on the dataset.
Liability
6.1 The User acknowledges that the dataset is a research tool provided for academic purposes and assumes full responsibility for the use of the dataset.
6.2 The Provider shall not be liable for any damages arising from the use of the dataset.
Termination
7.1 The Author reserves the right to terminate this Agreement at any time if the User breaches any terms of this Agreement.
7.2 Upon termination, the User agrees to destroy all copies of the dataset in their possession.
By using the dataset, the User acknowledges that they have read, understood, and agreed to be bound by the terms of this Agreement.
Contact info: Sumaiya Akhtar Mitu mitu.ict12@gmail.com
Purpose:
- Tagging AI-generated videos on social media platforms.
- Auto-marking videos as fake news, fake events, unreal situations, etc.
- Predicting whether a video was promoted/edited.
- Ethics
Contribute to the growth of this dataset with AI videos: kanz.misystems@gmail.com
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research
This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.
This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.
| Feature | Description |
|---|---|
| Structured JSONL Format | Includes system, user, and assistant roles for conversational Q&A. |
| Contract & Compliance Questions | Modeled on SEC filings and legal disclosure scenarios. |
| Statistically Realistic Fake Data | Fully synthetic, mirrors real-world patterns without privacy risks. |
| NLP-Ready | Optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines. |
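A minimal sketch of reading such a JSONL file is shown below; the file name legal_contracts.jsonl and the messages key are assumptions based on the role structure described in the table, so adjust them to the actual files in the repository.

```python
import json

# Hypothetical file name; the actual file in the repository may differ.
with open("legal_contracts.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each record is assumed to hold a list of {"role", "content"} messages
        # covering the system, user, and assistant roles described above.
        for message in record.get("messages", []):
            print(message["role"], ":", message["content"][:80])
        break  # show only the first record
```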
This synthetic legal dataset is not just for LLM training — it enables developers and researchers to create simulated regulatory scenarios. Examples include:
This makes the dataset practical for stress-testing AI systems in simulated legal environments.
Syncora.ai creates synthetic datasets optimized for LLM training with:
Take your AI projects further with Syncora.ai:
→ Generate your own synthetic datasets now
This dataset is released under the MIT License.
It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for:
Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including:
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training:
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Key benefits:
Take your AI projects to the next level with Syncora.ai:
→ Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.