Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between the AI and the "Human". It always starts with Het gesprek tussen de mens en de AI-assistent. and contains at least one occurrence of both [|AI|] and [|Human|]; a parsing sketch follows this field list.
topic: the topic description
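Because each input is stored as one flat string, downstream use usually starts by splitting it back into turns. A minimal sketch of such a parser is shown below; the helper name split_turns and the exact regular expression are illustrative choices, not part of the dataset.

```python
import re

def split_turns(conversation: str):
    """Split a flat `input` string into (speaker, text) pairs.

    Assumes the format described above: an introductory sentence followed by
    alternating [|Human|] and [|AI|] markers.
    """
    # Capture each speaker marker and the text up to the next marker (or the end).
    pattern = re.compile(r"\[\|(Human|AI)\|\](.*?)(?=\[\|(?:Human|AI)\|\]|$)", re.DOTALL)
    return [(speaker, text.strip()) for speaker, text in pattern.findall(conversation)]

# Example, using the instance shown above:
# turns = split_turns(example["input"])
# -> [("Human", "Ik heb wat hulp nodig met Spark SQL union ..."), ("AI", "..."), ...]
```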
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters; a call sketch is given after the prompts and system message below.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
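Putting the parameters, prompt template, and system message together, a translation call could look roughly like the sketch below. This is a reconstruction, not the original script: it uses the current openai Python client (the original work may have used an older interface), and the way the conversation text is appended to the prompt is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = ("You are a helpful assistant that translates English to Dutch "
                  "to the requirements that are given to you.")

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    """Translate one conversation using the CONVERSATION_TRANSLATION_PROMPT defined above."""
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            # Appending the conversation after the instructions is an assumption.
            {"role": "user", "content": prompt + "\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```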
Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3.5 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all of it is true. Some of it is actually malicious and harmful, and that makes it harder to trust any piece of information we come across. Not only that: bad actors can now use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there has been discussion of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text that have been sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test sets such that each set has an equal split of the two classes.
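As a quick sanity check of that split, something like the following could be used to inspect the class balance per split; the file names (train.csv, val.csv, test.csv) and the label column name are assumptions, since the exact file layout is not described here.

```python
import pandas as pd

# Hypothetical file names and column name; adjust to the actual download layout.
for split in ("train", "val", "test"):
    df = pd.read_csv(f"{split}.csv")
    # `label` is assumed to hold the real/fake class for each text.
    print(split, df["label"].value_counts(normalize=True).to_dict())
```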
This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd, with the aim of making it really easy for anyone to get started in the world of AI. This AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.
The quality of AI-generated images has rapidly increased, leading to concerns about authenticity and trustworthiness.
CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?
Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
The dataset contains two classes - REAL and FAKE.
For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset
For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4
There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
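For reference, a minimal loading sketch with torchvision is given below. It assumes the images are unpacked into train/ and test/ directories with REAL/ and FAKE/ subfolders; that layout is an assumption, so adjust the paths to the actual download.

```python
import torch
from torchvision import datasets, transforms

# CIFAR-10-sized images; the resize is defensive in case of mixed sizes.
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# ImageFolder maps each subdirectory (REAL, FAKE) to a class index.
train_set = datasets.ImageFolder("cifake/train", transform=transform)
test_set = datasets.ImageFolder("cifake/test", transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)

print(train_set.classes)               # e.g. ['FAKE', 'REAL']
print(len(train_set), len(test_set))   # expected: 100000, 20000
```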
The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images
If you use this dataset, you must cite the following sources
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
Real images are from Krizhevsky & Hinton (2009), fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is available here.
The updates to the dataset on the 28th of March 2023 did not change the images themselves; files with the ".jpeg" extension were renamed to ".jpg" and the root folder was re-uploaded to meet Kaggle's usability requirements.
This dataset is published under the same MIT license as CIFAR-10:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated database of legal cases where generative AI produced hallucinated citations submitted in court filings.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset has been meticulously curated to aid in the classification of real and AI-generated (fake) art images. It is part of the "FauxFinder" project, which aims to build advanced models capable of distinguishing between authentic artworks and those generated by AI, GANs, and other digital methods. The dataset is ideal for training, testing, and benchmarking machine learning models, particularly in the domains of computer vision and generative adversarial networks (GANs).
Dataset Structure
The dataset is organized into two folders:
- Real: Contains 10,821 images of authentic artworks obtained from reliable art repositories. Source: images were scraped from WikiArt, a comprehensive database of artworks spanning various art styles, genres, and periods.
- Fake: Contains 10,821 AI-generated images created using GANs and other generative models. These images were synthesized using advanced techniques to replicate the visual characteristics of real artworks, and additional AI-generated resources were included to ensure diversity.
Each folder contains high-quality images that have been resized to 256x256 pixels for consistency and ease of use in machine learning pipelines.
Key Features
- Balanced Classes: Both the "Real" and "Fake" folders contain an equal number of images (10,821 each), ensuring a balanced dataset for binary classification tasks.
- High Quality: All images are resized to a uniform resolution of 256x256 pixels, making them ready for use in CNNs and other deep learning architectures.
- Diverse Sources: Real images encompass a wide range of art styles and genres, while fake images include outputs from various GAN architectures and AI tools, offering a rich diversity for model training.
Folder Structure:
```
Data/
├── Real/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── Fake/
    ├── image1.jpg
    ├── image2.jpg
    └── ...
```
Get Started
1. Download the dataset from Kaggle.
2. Use the folder structure for training and validation splits (e.g., 80-20 split).
3. Build, train, and evaluate your machine/deep learning models for real-vs-fake art detection (a minimal split sketch follows these steps).
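A minimal sketch of step 2, assuming the Data/Real and Data/Fake layout shown above; the 80-20 ratio follows the example in the steps, and the helper name make_split is illustrative.

```python
import random
from pathlib import Path

def make_split(root: str = "Data", val_fraction: float = 0.2, seed: int = 42):
    """Return (train, val) lists of (image_path, label) pairs from Real/ and Fake/."""
    samples = []
    for label in ("Real", "Fake"):
        for path in sorted((Path(root) / label).glob("*.jpg")):
            samples.append((path, label))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]

train_samples, val_samples = make_split()
print(len(train_samples), len(val_samples))  # roughly 17,314 / 4,328 for 21,642 images
```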
Speech Deepfake Detection Challenge Dataset (Version 2025)
This repository contains 4,524 audio files which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format (a CSV-writing sketch follows the column description):
Each line corresponds to one audio file.
The CSV must contain two columns:
filename – the file name (e.g. 983.mp3).
label – real for authentic speech, fake for AI-generated speech.
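A minimal sketch of writing such a submission file is shown below; the predictions dict is a hypothetical stand-in for whatever classifier output you produce.

```python
import csv

# Hypothetical predictions: file name -> "real" or "fake".
predictions = {"983.mp3": "fake", "984.mp3": "real"}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])       # required two-column header
    for filename, label in sorted(predictions.items()):
        writer.writerow([filename, label])
```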
Data
The audio files are provided in… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Speech25.
GNU Affero General Public License v3.0: http://www.gnu.org/licenses/agpl-3.0.html
Podcasting is a unique space where people can share their voices, ideas, and stories freely. Unlike platforms controlled by a single company (like YouTube or Instagram), podcasting supports true freedom of expression. However, this openness is now being threatened by AI tools, such as Notebook LM, which make it easy to produce fake, low-quality podcasts. Unfortunately, many of these AI-generated shows are created by spammers, scammers, or blackhat SEOs, and they are harming both listeners and genuine podcast creators.
At Listen Notes, the leading podcast search engine and podcast API, we believe that creating a quality podcast takes real effort. Listeners can tell when a show has been crafted with care, and that’s why we are committed to stopping the spread of fake, AI-generated podcasts on our platform.
This dataset represents a small subset of AI-generated fake podcasts that were flagged during attempts to add them to the Listen Notes podcast database. These "podcasts" were predominantly created using Notebook LM and are not designed for human consumption.
The goal of sharing this dataset is to support the AI community in developing more effective tools to combat spam. While it may not be possible to eliminate spam entirely, we can work together to minimize its impact and contribute to making the digital world a better place.
If you're building a podcast app for discovering human-made shows, PodcastAPI.com is your best bet. Apple Podcasts and Spotify are increasingly flooded with AI-generated fakes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fake AI-generated human faces using Stable Diffusion 1.5, 2.1, and SDXL 1.0 checkpoints. The main objective was to generate photos that were as realistic as possible, without any specific style, focusing mainly on the face.
Fake AI-generated human faces
More details on the images and the process used to create them can be found in the README file.
The data is not mine; it is taken from a GitHub repository by the user tobecwb. Repo link: https://github.com/tobecwb/stable-diffusion-face-dataset
https://www.datainsightsmarket.com/privacy-policy
The placeholder image generator market is experiencing robust growth, driven by the increasing demand for visually appealing websites and applications without relying on heavy, slow-to-load original images. The market's expansion is fueled by the rise of web design and development, the growing popularity of content creation tools, and the need for efficient prototyping and wireframing. While precise market sizing data is not provided, based on general market trends for similar software-as-a-service (SaaS) offerings and the large number of active players, we can reasonably estimate the 2025 market size to be around $50 million. Considering a conservative Compound Annual Growth Rate (CAGR) of 15%, we project the market to reach approximately $100 million by 2033.

Key trends include increasing integration with other design tools, the development of AI-powered image generation features, and a shift towards more customizable and versatile placeholder options. The market faces some restraints, such as the availability of free, basic alternatives and the potential for users to overlook the value proposition of dedicated placeholder generators, especially in smaller-scale projects. The competitive landscape is highly fragmented, with numerous players ranging from established companies to individual developers offering a variety of features and pricing models.

The key success factors for vendors include offering a diverse library of placeholder images, seamless integration with popular design tools, and providing a user-friendly interface. Future growth will likely be influenced by the adoption of advanced technologies like AI and machine learning to enhance image quality, personalization, and efficiency. Furthermore, strategic partnerships with other software providers and expanding the available image types and styles will be vital for sustained market leadership. The potential for growth is significant, especially as developers and designers increasingly prioritize efficiency and high-quality visuals in their projects.
Image Deepfake Detection Challenge Dataset (Version 2025)
This repository contains 2,920 images which are either real (authentic) or fake (AI-generated). Your task is to classify them and create an output CSV file in the following format:
Each line corresponds to one image in the dataset.
The CSV must contain two columns:
filename – the image file name (e.g. 983.jpg)
label – real for authentic images, fake for AI-generated images
Data
The images are provided… See the full description on the dataset page: https://huggingface.co/datasets/DFBench/DFBench_Image25.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Scenario for evaluating fact-checking tools on messages and news.
Objective: to evaluate the ability of fact-checking tools to determine the truthfulness of news articles using keywords generated by artificial intelligence.
Input: real (TRUE) and fake (FALSE) news items:
1. Attack on Kyiv region: police showed photos of destruction - TRUE;
2. No 'red lines' in Ukraine support, French foreign minister tells BBC - TRUE;
3. Kyiv is in shock. Residents of Pokrovsk say they are waiting for Russian liberators - FALSE;
4. Russian troops grind towards Pokrovsk, take second Ukrainian town in a week - FALSE.
Steps:
1. Select the news item.
2. Create a set of relevant keywords from the news text using the ChatGPT dialogue template: "Hi. Could you please create a list of the most relevant keywords for the article [link to the article]?"
3. Enter the keywords into the fact-checking tool.
4. Record the result of the tool's evaluation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset consists of 30,000 AI-generated images and 30,000 real images.
Fake:
- 10,000 from Stable Diffusion
- 10,000 from MidJourney
- 10,000 from DALL-E
Real:
- 22,500 from Pexels and Unsplash
- 7,500 from WikiArt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains Python, shell, and awk scripts, as well as the data used to obtain the curated table associated with the above-named article. It also contains (in this file) a description of the methods employed to obtain the curated table, with details regarding the published articles.
The following items are included.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Deepfake Audio Dataset - 5,000 Audio
The dataset contains 5,000 audio files, comprising both authentic human recordings and synthetic AI-generated voice samples. It is designed for advanced research in deepfake detection, focusing on fake-voice detection and generated-speech analysis. Specifically engineered to challenge voice authentication systems, it supports the development of robust models for real-vs-fake human voice recognition.
Dataset characteristics:… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public comments are an important channel for civic opinion when the government establishes rules. However, recent AI can easily generate large quantities of disinformation, including fake public comments. We attempted to distinguish between human public comments and ChatGPT-generated public comments (including ChatGPT output that emulates human comments) using Japanese stylometric analysis. Study 1 used multidimensional scaling (MDS) to compare 500 texts from five classes: human public comments; GPT-3.5 and GPT-4 comments generated only from the titles of human public comments (i.e., zero-shot learning, GPTzero); and GPT-3.5 and GPT-4 comments generated by presenting the sentences of human public comments and instructing the model to emulate them (i.e., one-shot learning, GPTone). The MDS results showed that the Japanese stylometric features of human public comments were completely different from those of the GPTzero-generated texts. Moreover, GPTone-generated public comments were closer to human comments than those generated by GPTzero.

Study 2 evaluated the performance of random forest (RF) classifiers for distinguishing the three classes (human, GPTzero, and GPTone texts). RF classifiers achieved the best precision for human public comments of approximately 90%, and the best precision for the fake public comments generated by GPT (GPTzero and GPTone) was 99.5%, obtained by focusing on integrated writing-style features: phrase patterns, parts-of-speech (POS) bigrams and trigrams, and function words. Therefore, the study concluded that GPT-generated fake public comments can currently be discriminated from those written by humans.
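The study's exact Japanese stylometric pipeline is not reproduced here, but the classification setup it describes corresponds roughly to the scikit-learn pattern sketched below; the character n-gram features are a simplified stand-in for the paper's phrase-pattern, POS n-gram, and function-word features, so this is an illustration of the approach rather than a reimplementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# texts: list of comment strings; labels: "human", "GPTzero", or "GPTone".
# Character n-grams are used here only as a simplified proxy for the
# POS bigram/trigram and function-word features described in the study.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),
    RandomForestClassifier(n_estimators=300, random_state=0),
)
# clf.fit(train_texts, train_labels)
# print(clf.score(test_texts, test_labels))
```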
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of 10,000+ files featuring 7,000+ people, providing a comprehensive resource for research in deepfake detection and deepfake technology. It includes real videos of individuals with AI-generated faces overlaid, specifically designed to enhance liveness detection systems.
By utilizing this dataset, researchers can advance their understanding of deepfake generation and improve the performance of detection methods.
The dataset was created by generating fake faces and overlaying them onto authentic video clips sourced from platforms such as aisaver.io, faceswapvideo.ai, and magichour.ai. The videos feature different individuals, backgrounds, and scenarios, making the dataset suitable for various research applications.
Researchers can leverage this dataset to enhance their understanding of deepfake detection and contribute to the development of more robust detection methods that can effectively combat the challenges posed by deepfake technology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Total number of audio files: 4,500 (Original: 2,250; Fake: 2,250)
Total number of individuals: 75
Total number of distinct sets: 15
Total number of distinct sentences per set: 30
Number of individuals speaking each set: 5
Duration of each spoken sentence: 2 to 6 seconds
Male-female ratio: nearly balanced
The data has a parent folder named ‘DATASET’, under which there are 75 folders with unique names that identify the speaker characteristics; under each of these there are two folders named ‘Real’ and ‘Fake’, each containing 30 numerically named voice recordings.
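Given that layout, a traversal along the following lines could build (file, label) pairs for training; the audio file extension is not stated above, so the glob pattern is deliberately generic.

```python
from pathlib import Path

def collect_samples(root: str = "DATASET"):
    """Yield (audio_path, label) pairs from DATASET/<set_folder>/{Real,Fake}/."""
    for set_folder in sorted(Path(root).iterdir()):
        if not set_folder.is_dir():
            continue
        for label in ("Real", "Fake"):
            # Extension is assumed unknown; the card does not state the audio format.
            for audio_path in sorted((set_folder / label).glob("*.*")):
                yield audio_path, label

samples = list(collect_samples())
print(len(samples))  # expected: 4,500 (75 folders x 2 classes x 30 files)
```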
Dataset Usage Agreement (DUA)
Grant of Use
1.1 The Author grants the user a non-exclusive, non-transferable license to use the dataset solely for academic and research purposes.
1.2 The User agrees to use the dataset only for purposes consistent with this Agreement and will not use the dataset for any commercial purposes without the prior written consent of the Provider.
Data Security and Privacy
2.1 The User shall ensure that the dataset is stored securely and is not accessible to unauthorized individuals.
2.2 The User agrees to comply with all applicable data protection and privacy laws in relation to the use of the dataset.
Attribution
3.1 The User agrees to provide appropriate acknowledgment to the Provider in any publications, presentations, or other outputs that utilize the dataset. The acknowledgment should include citation.
Restrictions on Use
4.1 The User shall not distribute, share, sell, or sublicense the dataset to any third party without the prior written consent of the Provider.
4.2 The User shall not attempt to re-identify any individuals from the dataset.
4.3 The User shall not use the dataset to develop any technologies or applications that are intended to be used for malicious purposes, including but not limited to, creating or distributing deepfakes.
Intellectual Property
5.1 The Author retains all rights, title, and interest in and to the dataset, including any intellectual property rights.
5.2 The User agrees not to claim ownership of the dataset or any derivative works based on the dataset.
Liability
6.1 The User acknowledges that the dataset is a research tool provided for academic purposes and assumes full responsibility for the use of the dataset.
6.2 The Provider shall not be liable for any damages arising from the use of the dataset.
Termination
7.1 The Author reserves the right to terminate this Agreement at any time if the User breaches any terms of this Agreement.
7.2 Upon termination, the User agrees to destroy all copies of the dataset in their possession.
By using the dataset, the User acknowledges that they have read, understood, and agreed to be bound by the terms of this Agreement.
Contact info: Sumaiya Akhtar Mitu mitu.ict12@gmail.com
Purpose:
- Tagging AI-generated videos on social media platforms.
- Auto-marking videos as fake news, fake events, unreal situations, etc.
- Predicting whether a video was promoted/edited.
- Ethics
Contribute to the growth of this dataset with AI videos: kanz.misystems@gmail.com
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Dataset for LLM Training, Legal NLP & AI Research
This repository provides a synthetic dataset of legal contract Q&A interactions, modeled after real-world corporate filings (e.g., SEC 8-K disclosures). The data is generated using Syncora.ai, ensuring privacy-safe, fake data that is safe to use in LLM training, benchmarking, and experimentation.
This free dataset captures the style and structure of legal exchanges without exposing any confidential or sensitive client information.
| Feature | Description |
|---|---|
| Structured JSONL Format | Includes system, user, and assistant roles for conversational Q&A. |
| Contract & Compliance Questions | Modeled on SEC filings and legal disclosure scenarios. |
| Statistically Realistic Fake Data | Fully synthetic, mirrors real-world patterns without privacy risks. |
| NLP-Ready | Optimized for direct fine-tuning, benchmarking, and evaluation in LLM pipelines. |
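A minimal sketch of reading such a JSONL file is shown below; the file name legal_contracts.jsonl and the messages key are assumptions based on the role structure described in the table, so adjust them to the actual files in the repository.

```python
import json

# Hypothetical file name; the actual file in the repository may differ.
with open("legal_contracts.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each record is assumed to hold a list of {"role", "content"} messages
        # covering the system, user, and assistant roles described above.
        for message in record.get("messages", []):
            print(message["role"], ":", message["content"][:80])
        break  # show only the first record
```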
This synthetic legal dataset is not just for LLM training — it enables developers and researchers to create simulated regulatory scenarios. Examples include:
This makes the dataset practical for stress-testing AI systems in simulated legal environments.
Syncora.ai creates synthetic datasets optimized for LLM training with:
Take your AI projects further with Syncora.ai:
→ Generate your own synthetic datasets now
This dataset is released under the MIT License.
It is 100% synthetic, safe for LLM training, and ideal for research, experimentation, and open-source projects.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for:
Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including:
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training:
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Key benefits:
Take your AI projects to the next level with Syncora.ai:
→ Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.