51 datasets found
  1. clinical-synthetic-text-llm

    • huggingface.co
    Updated Jul 5, 2024
    Cite
    Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Authors
    Ran Xu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.

  2. Data Sheet 1_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 1_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s001
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
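    For readers who want to run the same kind of fidelity checks on their own real/synthetic pairs, the sketch below shows how the reported tests (two-sample t-test, two-sample proportion test, 95% CI overlap) could be expressed in Python. The sample arrays, counts, and parameter names are illustrative assumptions, not values from the study.

    # Hedged sketch: fidelity checks analogous to those described above.
    # All numbers below are placeholders, not data from the study.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    real_vals = rng.normal(24.5, 3.8, 1000)    # continuous parameter, real cohort
    synth_vals = rng.normal(24.9, 4.0, 1000)   # continuous parameter, synthetic cohort

    # Two-sample (Welch) t-test for a continuous parameter
    t_stat, p_val = stats.ttest_ind(real_vals, synth_vals, equal_var=False)
    print(f"t-test: t={t_stat:.3f}, p={p_val:.3f}")

    # Two-sample proportion z-test for a binary parameter
    def two_prop_ztest(x1, n1, x2, n2):
        p1, p2 = x1 / n1, x2 / n2
        p_pool = (x1 + x2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        return z, 2 * stats.norm.sf(abs(z))

    z, p = two_prop_ztest(312, 1000, 298, 1000)
    print(f"proportion test: z={z:.3f}, p={p:.3f}")

    # 95% CI overlap for a continuous parameter
    def ci95(x):
        m, sem = x.mean(), stats.sem(x)
        return m - 1.96 * sem, m + 1.96 * sem

    lo1, hi1 = ci95(real_vals)
    lo2, hi2 = ci95(synth_vals)
    print("95% CIs overlap:", max(lo1, lo2) <= min(hi1, hi2))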

  3. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Available download formats: .json, .csv
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Jordan, Western Sahara, United Kingdom, India, Sint Maarten (Dutch part), Cook Islands, Dominican Republic, Norway, Barbados, Oman
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  4. Implications for future LLM research.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Implications for future LLM research. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t002
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.

  5. synthetic-from-unit-triple-tasks-danish

    • huggingface.co
    • sprogteknologi.dk
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
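    As a quick way to inspect the prompt/response columns described above, the dataset can be pulled with the Hugging Face datasets library; the "train" split name is an assumption about the repository layout rather than something stated on the card.

    # Hedged sketch: load and inspect the prompt/response pairs.
    from datasets import load_dataset

    ds = load_dataset("ThatsGroes/synthetic-from-unit-triple-tasks-danish", split="train")
    print(ds)                         # dataset size and column names
    example = ds[0]
    print(example["prompt"][:200])    # prompt given to gemma-2-27b-it
    print(example["response"][:200])  # LLM output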

  6. LLM Generated Synthetic Dataset of DoS Exposed Solidity Contracts

    • zenodo.org
    Updated May 14, 2025
    Cite
    Ibba Giacomo; Baralla Gavina; Destefanis Giuseppe (2025). LLM Generated Synthetic Dataset of DoS Exposed Solidity Contracts [Dataset]. http://doi.org/10.5281/zenodo.14262663
    Dataset updated
    May 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ibba Giacomo; Baralla Gavina; Destefanis Giuseppe
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 29, 2024
    Description

    This dataset provides the replication package for the paper 'Large Language Models for Synthetic Dataset Generation: A Case Study on Ethereum Smart Contract DoS Vulnerabilities', accepted for publication at the 8th International Workshop on Blockchain Oriented Software Engineering. The provided sources encompass:
    1) The synthetic contracts (Vulnerable, Exploit, and Patched contract for each use case) generated by Claude and GPT4.
    2) The configuration files of the hardhat-based testing environment.
    3) The test suite that showcases the vulnerabilities of the generated contracts (including mock contracts) (hardhat is required to run and test contracts).

  7. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raja Biswas
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.

    It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets: synthetic dataset made by T5, DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
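    A rough illustration of how such decoding settings map onto the Hugging Face transformers generate API is sketched below; the model name and all parameter values are placeholders, not the team's actual configuration.

    # Hedged sketch: decoding strategies similar to those listed above,
    # expressed with Hugging Face transformers. Values are illustrative only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-v0.1"  # placeholder generator
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Write an essay arguing that schools should adopt a four-day week."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Contrastive search: deterministic decoding controlled by penalty_alpha + top_k.
    contrastive = model.generate(**inputs, max_new_tokens=300,
                                 penalty_alpha=0.6, top_k=4)

    # Sampling with typical_p, high temperature and a large top-k,
    # plus suppress_tokens to ban specific token ids during generation.
    sampled = model.generate(**inputs, max_new_tokens=300, do_sample=True,
                             temperature=1.3, top_k=200, typical_p=0.9,
                             suppress_tokens=[tokenizer.eos_token_id])

    print(tokenizer.decode(contrastive[0], skip_special_tokens=True))
    print(tokenizer.decode(sampled[0], skip_special_tokens=True))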

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonym
    • Introduce obfuscations
    • Back translation
    • Random capitalization
    • Swap sentence

  8. synthetic-from-unit-triple-tasks-swedish

    • huggingface.co
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-swedish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-swedish
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-swedish.

  9. llm-detection-generation-contribution2-train

    • huggingface.co
    Updated Apr 21, 2024
    Cite
    Jiacheng Zhu (2024). llm-detection-generation-contribution2-train [Dataset]. https://huggingface.co/datasets/jjz5463/llm-detection-generation-contribution2-train
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2024
    Authors
    Jiacheng Zhu
    Description

    Dataset Card

    Add more information here

    This dataset was produced with DataDreamer. The synthetic dataset card can be found here.

  10. tiny-codes

    • huggingface.co
    Updated Jan 26, 2024
    Cite
    Nam Pham (2023). tiny-codes [Dataset]. http://doi.org/10.57967/hf/0937
    Dataset updated
    Jan 26, 2024
    Authors
    Nam Pham
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Reasoning with Language and Code

    This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLM models learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.

  11. synthetic_multilingual_llm_prompts

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_multilingual_llm_prompts [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_multilingual_llm_prompts
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description


      Synthetic Multilingual LLM Prompts
    

    Welcome to the "Synthetic Multilingual LLM Prompts" dataset! This comprehensive collection features 1,250 synthetic LLM prompts generated using Gretel Navigator, available in seven different languages. To ensure accuracy and diversity in prompts, and translation quality and consistency across the different languages, we employed Gretel Navigator both as a generation tool and as an… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_multilingual_llm_prompts.

  12. Data from: Messages for: "Artificial Intelligence for Health Message...

    • works.hcommons.org
    docx
    Updated Nov 14, 2024
    Cite
    Ralf Schmaelzle; Ralf Schmaelzle (2024). Messages for: "Artificial Intelligence for Health Message Generation: An Empirical Study Using a Large Language Model (LLM) and Prompt Engineering" [Dataset]. http://doi.org/10.17613/c9q1-1x32
    Available download formats: docx
    Dataset updated
    Nov 14, 2024
    Dataset provided by
    unknown
    Authors
    Ralf Schmaelzle; Ralf Schmaelzle
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This study introduces and examines the potential of an AI system to generate health awareness messages. The topic of folic acid, a vitamin that is critical during pregnancy, served as a test case. We used prompt engineering to generate awareness messages about folic acid and compared them to the most retweeted human-generated messages via human evaluation with university and young adult women samples. We also conducted computational text analysis to examine the similarities between the AI-generated messages and human-generated tweets in terms of content and semantic structure. The results showed that AI-generated messages ranked higher in message quality and clarity across both samples. The computational analyses revealed that the AI-generated messages were on par with human-generated ones in terms of sentiment, reading ease, and semantic content. Overall, these results demonstrate the potential of large language models for message generation. Theoretical, practical, and ethical implications are discussed.

  13. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • researchdata.tuwien.ac.at
    • test.researchdata.tuwien.ac.at
    bin, text/markdown
    Updated Feb 25, 2025
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
    Available download formats: bin, text/markdown
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"


    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name      train   validation   test
    split     66      14           15
    unsplit   95      n/a          n/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl
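    For a quick look at the records, the JSON Lines files listed above can be read with Python's standard library; the file name below follows the stated naming convention, and the label names mirror the field documentation.

    # Hedged sketch: read one of the JSON Lines files described above and decode
    # a few of the integer labels. File path follows the documented naming scheme.
    import json

    category_names = ["personal-information", "family", "health", "thoughts",
                      "values", "acquaintance", "appointment"]
    affected_names = ["care-worker", "care-recipient", "other", "both"]

    with open("split-train-en.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record["uid"],
                  category_names[record["category"]],
                  affected_names[record["affected_speaker"]],
                  record["text"][:60])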

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked by two experts using qualitative data analysis software. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another model, llama3.1:70b, was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).

  14. Speech Brown Dataset

    • paperswithcode.com
    Updated Feb 22, 2025
    Cite
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari (2025). Speech Brown Dataset [Dataset]. https://paperswithcode.com/dataset/speechbrown
    Dataset updated
    Feb 22, 2025
    Authors
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari
    Description

    Dataset Summary Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.

    To train the CLASP model, we created this dataset based on the Brown Corpus. The synthetic speech was generated using the NVIDIA Tacotron 2 text-to-speech model.

    For more information about our proposed model, please refer to this paper. The dataset generation pipeline, along with code and usage instructions, is available on this GitHub page.

    Dataset Statistics

    Total size: Approximately 30 GB.
    Number of samples: 55,173 pairs of speech and text.
    Average tokens per sample: 19.00.
    Maximum tokens in a sample: 48.
    Average characters per sample: 96.72.
    Number of unique tokens: 50,667.
    Categories: 15 categories consisting of adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction.

    Dataset Structure To ensure ease of use, the dataset is partitioned into 10 parts. Each part can be used independently if it meets the requirements of your task and model.

    Metadata Files

    global_metadata: A JSON file containing metadata for all 55,173 samples.
    localized_metadata: A JSON file containing metadata for all samples, categorized into the 10 dataset partitions.

    Metadata Fields

    id: The unique identifier for the sample.
    audio_file_path: The file path for the audio in the dataset.
    category: The category of the sample's text.
    text: The corresponding text of the audio file.

    Usage Instructions To use this dataset, download the parts and metadata files as follows:

    Option 1: Manual Download Visit the dataset repository and download all dataset_partX.zip files and the global_metadata.json file.

    Option 2: Programmatic Download Use the huggingface_hub library to download the files programmatically:

    from huggingface_hub import hf_hub_download
    from zipfile import ZipFile
    import os
    import json

    # Download dataset parts (each call returns the local cache path of the file)
    zip_file_path1 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part1.zip", repo_type="dataset")
    zip_file_path2 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part2.zip", repo_type="dataset")

    # Download other parts...
    # Download metadata
    metadata_file_path = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="global_metadata.json", repo_type="dataset")

    # Extract all ten parts, then remove the archives
    # (assumes the downloaded zip files are available in the working directory)
    for i in range(1, 11):
      with ZipFile(f'dataset_part{i}.zip', 'r') as zip_ref:
        zip_ref.extractall(f'dataset_part{i}')
      os.remove(f'dataset_part{i}.zip')

    # Load the metadata for all samples and list its keys
    with open('global_metadata.json', 'r') as f:
      metadata = json.load(f)
    metadata.keys()
    

    Citations: If you find our paper, code, data, or models useful, please cite the paper:
    @misc{abootorabi2024claspcontrastivelanguagespeechpretraining,
      title={CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
      author={Mohammad Mahdi Abootorabi and Ehsaneddin Asgari},
      year={2024},
      eprint={2412.13071},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13071},
    }

    Contact If you have questions, please email mahdi.abootorabi2@gmail.com or asgari@berkeley.edu.

  15. Synthetic from Classification Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Classification Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-classification-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/classification-tasks-processed

    The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  16. Synthetic from Retrieval Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Retrieval Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-retrieval-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish retrieval tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  17. AIGC Large Language Model (LLM) Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 12, 2025
    Cite
    Data Insights Market (2025). AIGC Large Language Model (LLM) Report [Dataset]. https://www.datainsightsmarket.com/reports/aigc-large-language-model-llm-1940562
    Available download formats: doc, pdf, ppt
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Artificial Intelligence Generated Content (AIGC) Large Language Model (LLM) market is experiencing explosive growth, projected to reach $1.3 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 141.7%. This phenomenal expansion is fueled by several key drivers. Firstly, the increasing demand for automated content creation across diverse sectors, including marketing, customer service, and education, is significantly boosting adoption. Secondly, advancements in deep learning techniques and the availability of massive datasets are enabling the development of increasingly sophisticated and accurate LLMs. Thirdly, the growing accessibility of cloud-based computing resources is making LLM development and deployment more cost-effective for businesses of all sizes. Finally, the emergence of specialized LLMs tailored to specific applications, such as medical diagnosis or code generation, further accelerates market penetration. However, the market also faces certain restraints. Data privacy concerns and ethical considerations surrounding the use of AI-generated content are significant hurdles. Furthermore, the high computational cost associated with training and deploying large LLMs can pose a barrier to entry for smaller companies. Despite these challenges, the market segmentation reveals significant opportunities. The "Above 100 Billion Parameters" segment is expected to dominate due to its superior performance capabilities, while applications like chatbots and virtual assistants are driving immediate adoption. Geographically, North America and Asia Pacific are expected to be the leading regions, fueled by strong technological innovation and high adoption rates. The competitive landscape is highly dynamic, with major technology companies like OpenAI, Google, and Meta leading the pack, alongside a growing number of specialized AI startups. The forecast period (2025-2033) promises continued market expansion, driven by ongoing innovation and wider industry adoption.

  18. llm-detection-generation-failcase-test

    • huggingface.co
    Updated Apr 21, 2024
    Cite
    Jiacheng Zhu (2024). llm-detection-generation-failcase-test [Dataset]. https://huggingface.co/datasets/jjz5463/llm-detection-generation-failcase-test
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2024
    Authors
    Jiacheng Zhu
    Description

    Dataset Card

    Add more information here

    This dataset was produced with DataDreamer. The synthetic dataset card can be found here.

  19. Synthetic from Text Matching Long Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Text Matching Long Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-text-matching-long-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  20. Large Language Model (LLM) Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Large Language Model (LLM) Report [Dataset]. https://www.marketreportanalytics.com/reports/large-language-model-llm-52461
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large Language Model (LLM) market is experiencing explosive growth, driven by advancements in artificial intelligence, increasing demand for natural language processing (NLP) applications, and the rising adoption of cloud computing. The market, estimated at $15 billion in 2025, is projected to exhibit a robust Compound Annual Growth Rate (CAGR) of 35% from 2025 to 2033, reaching approximately $120 billion by 2033. This growth is fueled by several key factors, including the development of more sophisticated and accurate LLMs, their integration into various business applications such as customer service chatbots, content generation tools, and personalized education platforms, and the increasing availability of large datasets for training these models. Furthermore, the ongoing research and development in areas like transfer learning and few-shot learning are contributing to improved efficiency and reduced training costs, making LLMs accessible to a wider range of businesses and developers. However, the market also faces certain challenges. High computational costs associated with training and deploying LLMs remain a significant hurdle, especially for smaller companies. Concerns regarding data privacy, bias in training data, and the ethical implications of using AI-generated content are also emerging as important considerations. Nevertheless, ongoing innovations in hardware, software, and algorithmic optimization are continuously mitigating these challenges. The segmentation of the market, based on application (e.g., chatbots, machine translation, text summarization) and type (e.g., transformer-based models, recurrent neural networks), reveals diverse growth opportunities. Geographical distribution shows strong growth across North America and Asia-Pacific, fueled by substantial investments in AI research and the presence of major technology companies. Continued technological advancements and increasing market adoption will continue to shape the future trajectory of the LLM market.
