51 datasets found
  1. clinical-synthetic-text-llm

    • huggingface.co
    Updated Jul 5, 2024
    Cite
    Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Authors
    Ran Xu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.

  2. Data Sheet 1_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 1_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s001
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
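    For readers who want to run the same kind of fidelity checks on their own real/synthetic pairs, the sketch below shows how the reported tests (two-sample t-test, two-sample proportion test, 95% CI overlap) could be expressed in Python. The sample arrays, counts, and parameter names are illustrative assumptions, not values from the study.

    # Hedged sketch: fidelity checks analogous to those described above.
    # All numbers below are placeholders, not data from the study.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    real_vals = rng.normal(24.5, 3.8, 1000)    # continuous parameter, real cohort
    synth_vals = rng.normal(24.9, 4.0, 1000)   # continuous parameter, synthetic cohort

    # Two-sample (Welch) t-test for a continuous parameter
    t_stat, p_val = stats.ttest_ind(real_vals, synth_vals, equal_var=False)
    print(f"t-test: t={t_stat:.3f}, p={p_val:.3f}")

    # Two-sample proportion z-test for a binary parameter
    def two_prop_ztest(x1, n1, x2, n2):
        p1, p2 = x1 / n1, x2 / n2
        p_pool = (x1 + x2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        return z, 2 * stats.norm.sf(abs(z))

    z, p = two_prop_ztest(312, 1000, 298, 1000)
    print(f"proportion test: z={z:.3f}, p={p:.3f}")

    # 95% CI overlap for a continuous parameter
    def ci95(x):
        m, sem = x.mean(), stats.sem(x)
        return m - 1.96 * sem, m + 1.96 * sem

    lo1, hi1 = ci95(real_vals)
    lo2, hi2 = ci95(synth_vals)
    print("95% CIs overlap:", max(lo1, lo2) <= min(hi1, hi2))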

  3. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Available download formats: .json, .csv
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Jordan, Western Sahara, United Kingdom, India, Sint Maarten (Dutch part), Cook Islands, Dominican Republic, Norway, Barbados, Oman
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  4. Implications for future LLM research.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Implications for future LLM research. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t002
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.

  5. synthetic-from-unit-triple-tasks-danish

    • huggingface.co
    • sprogteknologi.dk
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
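    As a quick way to inspect the prompt/response columns described above, the dataset can be pulled with the Hugging Face datasets library; the "train" split name is an assumption about the repository layout rather than something stated on the card.

    # Hedged sketch: load and inspect the prompt/response pairs.
    from datasets import load_dataset

    ds = load_dataset("ThatsGroes/synthetic-from-unit-triple-tasks-danish", split="train")
    print(ds)                         # dataset size and column names
    example = ds[0]
    print(example["prompt"][:200])    # prompt given to gemma-2-27b-it
    print(example["response"][:200])  # LLM output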

  6. LLM Generated Synthetic Dataset of DoS Exposed Solidity Contracts

    • zenodo.org
    Updated May 14, 2025
    Cite
    Ibba Giacomo; Baralla Gavina; Destefanis Giuseppe (2025). LLM Generated Synthetic Dataset of DoS Exposed Solidity Contracts [Dataset]. http://doi.org/10.5281/zenodo.14262663
    Dataset updated
    May 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ibba Giacomo; Baralla Gavina; Destefanis Giuseppe
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 29, 2024
    Description

    This dataset provides the replication package for the paper 'Large Language Models for Synthetic Dataset Generation: A Case Study on Ethereum Smart Contract DoS Vulnerabilities', accepted for publication at the 8th International Workshop on Blockchain Oriented Software Engineering. The provided sources encompass:
    1) The synthetic contracts (Vulnerable, Exploit, and Patched contract for each use case) generated by Claude and GPT4.
    2) The configuration files of the hardhat-based testing environment.
    3) The test suite that showcases the vulnerabilities of the generated contracts (including mock contracts) (hardhat is required to run and test contracts).

  7. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raja Biswas
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.

    It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets: synthetic dataset made by T5, DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
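    A rough illustration of how such decoding settings map onto the Hugging Face transformers generate API is sketched below; the model name and all parameter values are placeholders, not the team's actual configuration.

    # Hedged sketch: decoding strategies similar to those listed above,
    # expressed with Hugging Face transformers. Values are illustrative only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-v0.1"  # placeholder generator
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Write an essay arguing that schools should adopt a four-day week."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Contrastive search: deterministic decoding controlled by penalty_alpha + top_k.
    contrastive = model.generate(**inputs, max_new_tokens=300,
                                 penalty_alpha=0.6, top_k=4)

    # Sampling with typical_p, high temperature and a large top-k,
    # plus suppress_tokens to ban specific token ids during generation.
    sampled = model.generate(**inputs, max_new_tokens=300, do_sample=True,
                             temperature=1.3, top_k=200, typical_p=0.9,
                             suppress_tokens=[tokenizer.eos_token_id])

    print(tokenizer.decode(contrastive[0], skip_special_tokens=True))
    print(tokenizer.decode(sampled[0], skip_special_tokens=True))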

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonym
    • Introduce obfuscations
    • Back translation
    • Random capitalization
    • Swap sentence

  8. synthetic-from-unit-triple-tasks-swedish

    • huggingface.co
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-swedish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-swedish
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-swedish.

  9. llm-detection-generation-contribution2-train

    • huggingface.co
    Updated Apr 21, 2024
    Cite
    Jiacheng Zhu (2024). llm-detection-generation-contribution2-train [Dataset]. https://huggingface.co/datasets/jjz5463/llm-detection-generation-contribution2-train
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2024
    Authors
    Jiacheng Zhu
    Description

    Dataset Card

    Add more information here

    This dataset was produced with DataDreamer. The synthetic dataset card can be found here.

  10. tiny-codes

    • huggingface.co
    Updated Jan 26, 2024
    Cite
    Nam Pham (2023). tiny-codes [Dataset]. http://doi.org/10.57967/hf/0937
    Dataset updated
    Jan 26, 2024
    Authors
    Nam Pham
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Reasoning with Language and Code

    This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLM models learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.

  11. synthetic_multilingual_llm_prompts

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_multilingual_llm_prompts [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_multilingual_llm_prompts
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description


      Synthetic Multilingual LLM Prompts
    

    Welcome to the "Synthetic Multilingual LLM Prompts" dataset! This comprehensive collection features 1,250 synthetic LLM prompts generated using Gretel Navigator, available in seven different languages. To ensure accuracy and diversity in prompts, and translation quality and consistency across the different languages, we employed Gretel Navigator both as a generation tool and as an… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_multilingual_llm_prompts.

  12. Data from: Messages for: "Artificial Intelligence for Health Message...

    • works.hcommons.org
    docx
    Updated Nov 14, 2024
    Cite
    Ralf Schmaelzle; Ralf Schmaelzle (2024). Messages for: "Artificial Intelligence for Health Message Generation: An Empirical Study Using a Large Language Model (LLM) and Prompt Engineering" [Dataset]. http://doi.org/10.17613/c9q1-1x32
    Available download formats: docx
    Dataset updated
    Nov 14, 2024
    Dataset provided by
    unknown
    Authors
    Ralf Schmaelzle; Ralf Schmaelzle
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This study introduces and examines the potential of an AI system to generate health awareness messages. The topic of folic acid, a vitamin that is critical during pregnancy, served as a test case. We used prompt engineering to generate awareness messages about folic acid and compared them to the most retweeted human-generated messages via human evaluation with university and young adult women samples. We also conducted computational text analysis to examine the similarities between the AI-generated messages and human-generated tweets in terms of content and semantic structure. The results showed that AI-generated messages ranked higher in message quality and clarity across both samples. The computational analyses revealed that the AI-generated messages were on par with human-generated ones in terms of sentiment, reading ease, and semantic content. Overall, these results demonstrate the potential of large language models for message generation. Theoretical, practical, and ethical implications are discussed.

  13. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • researchdata.tuwien.ac.at
    • test.researchdata.tuwien.ac.at
    bin, text/markdown
    Updated Feb 25, 2025
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
    Available download formats: bin, text/markdown
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"


    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name      train   validation   test
    split     66      14           15
    unsplit   95      n/a          n/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl
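    For a quick look at the records, the JSON Lines files listed above can be read with Python's standard library; the file name below follows the stated naming convention, and the label names mirror the field documentation.

    # Hedged sketch: read one of the JSON Lines files described above and decode
    # a few of the integer labels. File path follows the documented naming scheme.
    import json

    category_names = ["personal-information", "family", "health", "thoughts",
                      "values", "acquaintance", "appointment"]
    affected_names = ["care-worker", "care-recipient", "other", "both"]

    with open("split-train-en.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record["uid"],
                  category_names[record["category"]],
                  affected_names[record["affected_speaker"]],
                  record["text"][:60])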

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked by two experts using qualitative data analysis software. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another model, llama3.1:70b, was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).

  14. Speech Brown Dataset

    • paperswithcode.com
    Updated Feb 22, 2025
    Cite
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari (2025). Speech Brown Dataset [Dataset]. https://paperswithcode.com/dataset/speechbrown
    Dataset updated
    Feb 22, 2025
    Authors
    Mohammad Mahdi Abootorabi; Ehsaneddin Asgari
    Description

    Dataset Summary Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.

    To train the CLASP model, we created this dataset based on the Brown Corpus. The synthetic speech was generated using the NVIDIA Tacotron 2 text-to-speech model.

    For more information about our proposed model, please refer to this paper. The dataset generation pipeline, along with code and usage instructions, is available on this GitHub page.

    Dataset Statistics

    Total size: Approximately 30 GB.
    Number of samples: 55,173 pairs of speech and text.
    Average tokens per sample: 19.00.
    Maximum tokens in a sample: 48.
    Average characters per sample: 96.72.
    Number of unique tokens: 50,667.
    Categories: 15 categories consisting of adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction.

    Dataset Structure To ensure ease of use, the dataset is partitioned into 10 parts. Each part can be used independently if it meets the requirements of your task and model.

    Metadata Files

    global_metadata: A JSON file containing metadata for all 55,173 samples.
    localized_metadata: A JSON file containing metadata for all samples, categorized into the 10 dataset partitions.

    Metadata Fields

    id: The unique identifier for the sample.
    audio_file_path: The file path for the audio in the dataset.
    category: The category of the sample's text.
    text: The corresponding text of the audio file.

    Usage Instructions To use this dataset, download the parts and metadata files as follows:

    Option 1: Manual Download Visit the dataset repository and download all dataset_partX.zip files and the global_metadata.json file.

    Option 2: Programmatic Download Use the huggingface_hub library to download the files programmatically:

    from huggingface_hub import hf_hub_download
    from zipfile import ZipFile
    import os
    import json

    # Download dataset parts (each call returns the local cache path of the file)
    zip_file_path1 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part1.zip", repo_type="dataset")
    zip_file_path2 = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="dataset_part2.zip", repo_type="dataset")

    # Download other parts...
    # Download metadata
    metadata_file_path = hf_hub_download(repo_id="llm-lab/SpeechBrown", filename="global_metadata.json", repo_type="dataset")

    # Extract all ten parts, then remove the archives
    # (assumes the downloaded zip files are available in the working directory)
    for i in range(1, 11):
      with ZipFile(f'dataset_part{i}.zip', 'r') as zip_ref:
        zip_ref.extractall(f'dataset_part{i}')
      os.remove(f'dataset_part{i}.zip')

    # Load the metadata for all samples and list its keys
    with open('global_metadata.json', 'r') as f:
      metadata = json.load(f)
    metadata.keys()
    

    Citations: If you find our paper, code, data, or models useful, please cite the paper:
    @misc{abootorabi2024claspcontrastivelanguagespeechpretraining,
      title={CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
      author={Mohammad Mahdi Abootorabi and Ehsaneddin Asgari},
      year={2024},
      eprint={2412.13071},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13071},
    }

    Contact If you have questions, please email mahdi.abootorabi2@gmail.com or asgari@berkeley.edu.

  15. Synthetic from Classification Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Classification Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-classification-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/classification-tasks-processed

    The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  16. Synthetic from Retrieval Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Retrieval Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-retrieval-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish retrieval tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  17. AIGC Large Language Model (LLM) Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 12, 2025
    Cite
    Data Insights Market (2025). AIGC Large Language Model (LLM) Report [Dataset]. https://www.datainsightsmarket.com/reports/aigc-large-language-model-llm-1940562
    Available download formats: doc, pdf, ppt
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Artificial Intelligence Generated Content (AIGC) Large Language Model (LLM) market is experiencing explosive growth, projected to reach $1.3 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 141.7%. This phenomenal expansion is fueled by several key drivers. Firstly, the increasing demand for automated content creation across diverse sectors, including marketing, customer service, and education, is significantly boosting adoption. Secondly, advancements in deep learning techniques and the availability of massive datasets are enabling the development of increasingly sophisticated and accurate LLMs. Thirdly, the growing accessibility of cloud-based computing resources is making LLM development and deployment more cost-effective for businesses of all sizes. Finally, the emergence of specialized LLMs tailored to specific applications, such as medical diagnosis or code generation, further accelerates market penetration. However, the market also faces certain restraints. Data privacy concerns and ethical considerations surrounding the use of AI-generated content are significant hurdles. Furthermore, the high computational cost associated with training and deploying large LLMs can pose a barrier to entry for smaller companies. Despite these challenges, the market segmentation reveals significant opportunities. The "Above 100 Billion Parameters" segment is expected to dominate due to its superior performance capabilities, while applications like chatbots and virtual assistants are driving immediate adoption. Geographically, North America and Asia Pacific are expected to be the leading regions, fueled by strong technological innovation and high adoption rates. The competitive landscape is highly dynamic, with major technology companies like OpenAI, Google, and Meta leading the pack, alongside a growing number of specialized AI startups. The forecast period (2025-2033) promises continued market expansion, driven by ongoing innovation and wider industry adoption.

  18. llm-detection-generation-failcase-test

    • huggingface.co
    Updated Apr 21, 2024
    Cite
    Jiacheng Zhu (2024). llm-detection-generation-failcase-test [Dataset]. https://huggingface.co/datasets/jjz5463/llm-detection-generation-failcase-test
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2024
    Authors
    Jiacheng Zhu
    Description

    Dataset Card

    Add more information here

    This dataset was produced with DataDreamer. The synthetic dataset card can be found here.

  19. Synthetic from Text Matching Long Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Text Matching Long Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-text-matching-long-tasks-danish
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  20. Large Language Model (LLM) Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Large Language Model (LLM) Report [Dataset]. https://www.marketreportanalytics.com/reports/large-language-model-llm-52461
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large Language Model (LLM) market is experiencing explosive growth, driven by advancements in artificial intelligence, increasing demand for natural language processing (NLP) applications, and the rising adoption of cloud computing. The market, estimated at $15 billion in 2025, is projected to exhibit a robust Compound Annual Growth Rate (CAGR) of 35% from 2025 to 2033, reaching approximately $120 billion by 2033. This growth is fueled by several key factors, including the development of more sophisticated and accurate LLMs, their integration into various business applications such as customer service chatbots, content generation tools, and personalized education platforms, and the increasing availability of large datasets for training these models. Furthermore, the ongoing research and development in areas like transfer learning and few-shot learning are contributing to improved efficiency and reduced training costs, making LLMs accessible to a wider range of businesses and developers. However, the market also faces certain challenges. High computational costs associated with training and deploying LLMs remain a significant hurdle, especially for smaller companies. Concerns regarding data privacy, bias in training data, and the ethical implications of using AI-generated content are also emerging as important considerations. Nevertheless, ongoing innovations in hardware, software, and algorithmic optimization are continuously mitigating these challenges. The segmentation of the market, based on application (e.g., chatbots, machine translation, text summarization) and type (e.g., transformer-based models, recurrent neural networks), reveals diverse growth opportunities. Geographical distribution shows strong growth across North America and Asia-Pacific, fueled by substantial investments in AI research and the presence of major technology companies. Continued technological advancements and increasing market adoption will continue to shape the future trajectory of the LLM market.
