100+ datasets found
  1. ai-job-embedding-finetuning

    • huggingface.co
    Cite
    Shawhin Talebi, ai-job-embedding-finetuning [Dataset]. https://huggingface.co/datasets/shawhin/ai-job-embedding-finetuning
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Authors
    Shawhin Talebi
    Description

    Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings and used to fine-tune shawhin/distilroberta-ai-job-embeddings for AI job search.

    Links: GitHub repo, video, and blog post.
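
    A quick way to check what the dataset looks like is to load it with the Hugging Face datasets library; a minimal sketch (the split name is an assumption, so check the dataset card):

    ```python
    # Hedged sketch: load the dataset from the Hub and inspect its splits and columns.
    from datasets import load_dataset

    ds = load_dataset("shawhin/ai-job-embedding-finetuning")
    print(ds)              # available splits and column names
    print(ds["train"][0])  # first example; "train" is an assumed split name
    ```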

  2. ai-wit-training-data

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Jay (2025). ai-wit-training-data [Dataset]. https://huggingface.co/datasets/artificialreply/ai-wit-training-data
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Jay
    Description

    AI Wit Training Dataset

    This dataset contains witty comeback and humor training data for fine-tuning language models.

      Dataset Structure
    

    Each sample contains:

    • messages: list of user/assistant conversation turns
    • source: data source (e.g., "reddit_jokes")
    • style: response style (e.g., "humorous", "witty")

      Usage
    

    This dataset is designed for fine-tuning conversational AI models to generate witty, humorous responses to offensive or provocative inputs.

      Example
    

    {… See the full description on the dataset page: https://huggingface.co/datasets/artificialreply/ai-wit-training-data.

  3. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    zip (1,214,166 bytes). Available download formats
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should start from a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering, and Math).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    • id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. Useful for filtering the dataset by topic or creating a topic-based split.
    • subtopic: a subtopic of the topic. For instance, within Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch below). A list of messages where each message is a dictionary with the following fields:
      • role: the role of the speaker, either user or assistant.
      • content: the message content. For the assistant, it is the answer to the user's question; for the user, it is the question asked to the assistant.
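
    Because messages is already a list of role/content dictionaries, it can be fed straight into a tokenizer chat template when preparing SFT examples. A minimal sketch (the split name and the tokenizer used for the template are assumptions, not part of the dataset):

    ```python
    # Hedged sketch: render one AstroChat conversation into a single SFT training string.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    ds = load_dataset("patrickfleith/AstroChat", split="train")          # assumed split name
    tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # any chat model with a template

    example = ds[0]
    text = tok.apply_chat_template(example["messages"], tokenize=False)
    print(example["topic"], "/", example["subtopic"])
    print(text[:500])
    ```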

    Important: see the full list of topics and subtopics covered below.

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, defined a set of subtopics to narrow the conversations down to more specific and niche exchanges (see the full list below)
    • For each subtopic, generated a set of opening questions that the user could ask to start a conversation (see the full list below)
    • Distilled the knowledge of a strong chat model (in our case ChatGPT, via the API with the gpt-4-turbo model) to generate answers to the opening questions
    • Simulated follow-up questions from the user and the assistant's answers to them, which build up the messages

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc...)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter out questions and conversations that are too similar
    • Ask topic and subtopic experts to validate the generated conversations to gauge how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have feedback or spotted an error?

    Use the ...

  4. Mental Health Conversational AI Training Dataset

    • kaggle.com
    zip
    Updated Jun 10, 2025
    Cite
    Nguyen Le Truong Thien (2025). Mental Health Conversational AI Training Dataset [Dataset]. https://www.kaggle.com/datasets/nguyenletruongthien/mental-health
    Explore at:
    zip (96,858,618 bytes). Available download formats
    Dataset updated
    Jun 10, 2025
    Authors
    Nguyen Le Truong Thien
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This comprehensive mental health conversational dataset contains more than 510,000 professionally curated conversations, therapeutic dialogues, and support interactions designed for training empathetic AI systems. The dataset combines real-world counseling scenarios, community discussions, and synthetic conversations covering the full spectrum of mental health topics including anxiety, depression, crisis intervention, and wellness support. All content has been carefully anonymized, ethically reviewed, and formatted for immediate compatibility with popular machine learning frameworks including Hugging Face Transformers, OpenAI APIs, and custom language models. The dataset includes multiple file formats (CSV, JSON) and ready-to-use training splits optimized for fine-tuning conversational AI models, making it an invaluable resource for researchers, developers, and organizations building mental health support technologies while maintaining the highest standards of privacy, safety, and therapeutic appropriateness.

  5. Whisper-Fine-Tune-One-Shot-Eval

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    Daniel Rosehill (2025). Whisper-Fine-Tune-One-Shot-Eval [Dataset]. https://huggingface.co/datasets/danielrosehill/Whisper-Fine-Tune-One-Shot-Eval
    Explore at:
    Dataset updated
    Nov 17, 2025
    Authors
    Daniel Rosehill
    Description

    Whisper Fine-Tuning Evaluation: Local vs Commercial ASR

    A "back of the envelope" evaluation comparing fine-tuned Whisper models running locally against commercial ASR APIs via Eden AI.

      The Question
    

    Can fine-tuning Whisper achieve measurable WER reductions, even when comparing local inference against cloud-based commercial models?

      TL;DR
    

    Yes. Fine-tuned Whisper Large Turbo running locally achieved 5.84% WER, beating the best commercial API (Assembly at… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Whisper-Fine-Tune-One-Shot-Eval.
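
    For context, WER numbers like the 5.84% above can be reproduced for any hypothesis/reference pairs with the evaluate library (backed by jiwer); the strings below are placeholders, not data from this dataset:

    ```python
    # Hedged sketch: compute word error rate (WER) for a list of transcripts.
    import evaluate

    wer = evaluate.load("wer")
    references = ["the quick brown fox jumps over the lazy dog"]
    predictions = ["the quick brown fox jumped over the lazy dog"]
    print(wer.compute(predictions=predictions, references=references))  # a fraction; multiply by 100 for %
    ```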

  6. AI-MATH-LLM-Package

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Johnson chong (2024). AI-MATH-LLM-Package [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/ai-math-llm-package
    Explore at:
    zip (3,330,554,065 bytes). Available download formats
    Dataset updated
    Jun 20, 2024
    Authors
    Johnson chong
    Description

    This is an install package of essential libraries for LLM RAG and fine-tuning (Hugging Face Hub, transformers, LangChain, evaluate, sentence-transformers, etc.). It is suitable for Kaggle competitions with offline requirements; the packages were downloaded from the Kaggle development environment.

    Supported packages: transformers, datasets, accelerate, bitsandbytes, langchain, langchain-community, sentence-transformers, chromadb, faiss-cpu, huggingface_hub, langchain-text-splitters, peft, trl, umap-learn, evaluate, deepeval, weave

    Suggested install commands in Kaggle:

    !pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
    !pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
    !pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
    !pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
    !pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
    !pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
    !pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
    !pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
    !pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
    !pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
    !pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
    !pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
    !pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
    !pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
    !pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
    !pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
    !pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
    !pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
    !pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
    !pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl

  7. nayjest/Phi-3-mini-4k-instruct

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Vitalii Stepanenko (2024). nayjest/Phi-3-mini-4k-instruct [Dataset]. https://www.kaggle.com/datasets/nayjest/phi-3-mini-4k-instruct
    Explore at:
    zip (6,067,852,377 bytes). Available download formats
    Dataset updated
    May 9, 2024
    Authors
    Vitalii Stepanenko
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

    Model Summary

    The Phi-3-Mini-4K-Instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, which is the context length (in tokens) it can support.

    The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.

    Resources and Technical Documentation:

    Intended Uses

    Primary use cases

    The model is intended for commercial and research use in English. The model is suited to applications that require:

    1) Memory/compute constrained environments 2) Latency bound scenarios 3) Strong reasoning (especially code, math and logic)

    Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

    Use case considerations

    Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

    Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

    How to Use

    Phi-3 Mini-4K-Instruct has been integrated in the development version (4.41.0.dev0) of transformers. Until the official version is released through pip, ensure that you are doing one of the following:

    • When loading the model, ensure that trust_remote_code=True is passed as an argument of the from_pretrained() function.

    • Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. The previous command is an alternative to cloning and installing from the source.

    The current transformers version can be verified with: pip list | grep transformers.

    Phi-3 Mini-4K-Instruct is also available in HuggingChat.
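
    A minimal generation sketch following the notes above (passing trust_remote_code=True and using the chat template); the dtype and device settings are assumptions:

    ```python
    # Hedged sketch: load Phi-3 Mini-4K-Instruct and generate a reply to a chat-format prompt.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # assumption; use float16/float32 as your hardware allows
        device_map="auto",
        trust_remote_code=True,       # required, as noted above
    )

    messages = [{"role": "user", "content": "How to explain Internet for a medieval knight?"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
    ```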

    Tokenizer

    Phi-3 Mini-4K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

    Chat Format

    Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format. You can provide the prompt as a question with a generic template as follows:

    <|user|>
    Question <|end|>
    <|assistant|>

    For example:

    <|user|>
    How to explain Internet for a medieval knight?<|end|>
    <|assistant|>

    where the model generates the text after <|assistant|>. In the case of a few-shot prompt, it can be formatted as follows:

    <|user|>
    I am going to Paris, what should I see?<|end|>
    <|assistant|>
    Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:
    
    1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
    2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
    3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic...
    
  8. Whisper-Small-Bengali

    • kaggle.com
    zip
    Updated Oct 3, 2023
    + more versions
    Cite
    Ivan Bondarenko (2023). Whisper-Small-Bengali [Dataset]. https://www.kaggle.com/datasets/bond005/whisper-small-bengali
    Explore at:
    zip (896,374,466 bytes). Available download formats
    Dataset updated
    Oct 3, 2023
    Authors
    Ivan Bondarenko
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

    This is a fine-tuned small Whisper model for Bangla. Fine-tuning started from openai/whisper-small and was carried out on the Bengali.AI Speech Recognition dataset.
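
    A minimal transcription sketch using the transformers ASR pipeline; the checkpoint path and audio filename are placeholders:

    ```python
    # Hedged sketch: transcribe a Bangla audio file with the fine-tuned Whisper checkpoint.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="/kaggle/input/whisper-small-bengali",  # placeholder path to the unzipped checkpoint
        chunk_length_s=30,
    )
    result = asr("sample_bn.wav", generate_kwargs={"language": "bengali", "task": "transcribe"})
    print(result["text"])
    ```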

  9. Contextual Input SFT Dataset

    • kaggle.com
    zip
    Updated May 29, 2025
    Cite
    Zeeshan-ul-hassan Usmani (2025). Contextual Input SFT Dataset [Dataset]. https://www.kaggle.com/datasets/zusmani/contextual-input-sft-dataset
    Explore at:
    zip (499,476 bytes). Available download formats
    Dataset updated
    May 29, 2025
    Authors
    Zeeshan-ul-hassan Usmani
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Instruction-Tuned Dataset with Contextual Inputs (10,000 Examples for SFT)

    🧠 What is Supervised Fine-Tuning (SFT)?

    Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction–input–output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.

    This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.

    About This Dataset

    This dataset contains 10,000 instruction–input–output examples spanning 10 practical domains:

    • Healthcare
    • Code
    • Finance
    • Education
    • Law
    • Productivity
    • Marketing
    • Psychology
    • Sports
    • Travel

    Each record is structured as:

    • id: Unique identifier
    • domain: Domain/topic of the task
    • instruction: A prompt asking the model to perform a task
    • input: Context or information needed to complete the task
    • output: Target response generated for the given instruction + input
    • source: Whether the entry is synthetic or human-curated
    • quality_score: A rating from 1–5 reflecting the response's quality
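
    A minimal sketch of turning one instruction–input–output row into a prompt/completion pair for SFT; the CSV filename and the prompt template are assumptions:

    ```python
    # Hedged sketch: build prompt/completion pairs from the instruction, input and output columns.
    import pandas as pd

    df = pd.read_csv("contextual_input_sft.csv")  # placeholder filename

    def to_example(row):
        prompt = (
            f"### Instruction:\n{row['instruction']}\n\n"
            f"### Input:\n{row['input']}\n\n"
            "### Response:\n"
        )
        return {"prompt": prompt, "completion": row["output"]}

    examples = [to_example(r) for _, r in df.iterrows()]
    print(examples[0]["prompt"] + examples[0]["completion"])
    ```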

    Example Entry

    1. Instruction: "Summarize the following article"
       Input: "Photosynthesis is the process by which plants..."
       Output: "Photosynthesis converts light into chemical energy."
    2. Instruction: "Fix the code below"
       Input: "def greet(name): print('Hello' name)"
       Output: "def greet(name): print('Hello', name)"
    3. Instruction: "Plan a 5-day trip"
       Input: "Destination: Japan. Interests: culture, tech."
       Output: "Day 1: Tokyo tour... Day 2: Kyoto temples..."

    🧪 What Can You Do With This Dataset?

    Beginners

    • Train a small transformer model using instruction + input → output
    • Experiment with prompt engineering and token analysis
    • Evaluate models on diverse domains and tasks

    Practitioners

    • Fine-tune LLaMA, Mistral, GPT-J, or Falcon on instruction tasks
    • Perform domain-based SFT (e.g., only legal or medical examples)
    • Use quality scores to train a filtering mechanism or reward model

    🧠 Researchers

    • Investigate performance variance across domains
    • Run evaluation benchmarks (BLEU, ROUGE, METEOR, GPT-4 eval)
    • Study model alignment and generalization with diverse instructions

    Suggested Projects

    • Fine-tune models using transformers and PEFT
    • Build a quality prediction model using the quality_score
    • Visualize attention distribution over instruction vs. input
    • Compare SFT vs. zero-shot/few-shot prompting using the same tasks

    Tools That Work Well

    • Hugging Face Transformers and Datasets
    • PEFT for parameter-efficient tuning
    • LoRA, QLoRA, or 8-bit training on Colab or local GPU
    • LangChain for interactive API wrappers
    • Weights & Biases for experiment tracking

    License

    Released under the MIT License. You may use, modify, and share with attribution.

    Acknowledgments

    Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.

  10. Recurv-Clinical-Dataset

    • huggingface.co
    Updated Feb 3, 2025
    + more versions
    Cite
    Recurv AI (2025). Recurv-Clinical-Dataset [Dataset]. https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2025
    Dataset authored and provided by
    Recurv AI
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🩺 Recurv-Clinical-Dataset:

    The Recurv Clinical Dataset is a comprehensive resource containing 12,631 high-quality question-answer pairs specifically designed for training and fine-tuning medical AI models. Curated from trusted medical sources, this dataset focuses on real-world scenarios, including patient history, diagnostics, and treatment recommendations. It sets a new benchmark for advancing conversational AI in the healthcare field.

      Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset.
    
  11. hr-policies-qa-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). hr-policies-qa-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/hr-policies-qa-dataset
    Explore at:
    zip (54,895 bytes). Available download formats
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    šŸ¢ HR Policies Q&A Synthetic Dataset

    This synthetic dataset for LLM training captures realistic employee–assistant interactions about HR and compliance policies.
    Generated using Syncora.ai's synthetic data generation engine, it provides privacy-safe, high-quality conversations for training Large Language Models (LLMs) to handle HR-related queries.

    Perfect for researchers, HR tech startups, and AI developers building chatbots, compliance assistants, or policy QA systems — without exposing sensitive employee data.

    🧠 Context & Applications

    HR departments handle countless queries on policies, compliance, and workplace practices.
    This dataset simulates those Q&A flows, making it a powerful dataset for LLM training and research.

    You can use it for:

    • HR chatbot prototyping
    • Policy compliance assistants
    • Internal knowledge base fine-tuning
    • Generative AI experimentation
    • Synthetic benchmarking in enterprise QA systems

    Dataset Features

    • role: Role of the message author (system, user, or assistant)
    • content: Actual text of the message
    • messages: Grouped sequence of role–content exchanges (conversation turns)

    Each entry represents a self-contained dialogue snippet designed to reflect natural HR conversations, ideal for synthetic data generation research.
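
    A minimal sketch of reading the JSON file and walking through one conversation; the filename and the exact nesting are assumptions, so inspect the file first:

    ```python
    # Hedged sketch: read the HR policies QA JSON and print the turns of one conversation.
    import json

    with open("hr_policies_qa.json", encoding="utf-8") as f:  # placeholder filename
        data = json.load(f)

    conversation = data[0]["messages"]  # assumed nesting: a list of {"role": ..., "content": ...} turns
    for turn in conversation:
        print(f"{turn['role']}: {turn['content'][:80]}")
    ```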

    This Repo Contains

    • HR Policies QA Dataset – JSON format, ready to use for LLM training or evaluation
    • Jupyter Notebook – Explore the dataset structure and basic preprocessing
    • Synthetic Data Tools – Generate your own datasets using Syncora.ai
    • ⚔ Generate Synthetic Data
      Need more? Use Syncora.ai’s synthetic data generation tool to create custom HR/compliance datasets. Our process is simple, reliable, and ensures privacy.

    🧪 ML & Research Use Cases

    • Policy Chatbots — Train assistants to answer compliance and HR questions
    • Knowledge Management — Fine-tune models for consistent responses
    • Synthetic Data Research — Explore structured dialogue datasets without legal risks
    • Evaluation Benchmarks — Test enterprise AI assistants on HR-related queries
    • Dataset Expansion — Combine this dataset with your own data using synthetic generation

    Why Syncora.ai Synthetic Data?

    • Zero real-user data → Zero privacy liability
    • High realism → Actionable insights for LLM training
    • Fully customizable → Generate synthetic data tailored to your domain
    • Ethically aligned → Safe and responsible dataset creation

    Whether you're building an HR assistant, compliance bot, or experimenting with enterprise LLMs, Syncora.ai synthetic datasets give you trustworthy, free datasets to start with — and scalable tools to grow further.

    Questions or Contributions?

    Got feedback, research use cases, or want to collaborate?
    Open an issue or reach out — we’re excited to work with AI researchers, HR tech builders, and compliance innovators.


    Disclaimer

    This dataset is 100% synthetic and does not represent real employees or organizations.
    It is intended solely for research, educational, and experimental use in HR analytics, compliance automation, and machine learning.

  12. Dataset and Models for Detection of News Agency Releases in Historical...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 24, 2025
    Cite
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring (2025). Dataset and Models for Detection of News Agency Releases in Historical Newspapers [Dataset]. http://doi.org/10.5281/zenodo.8333933
    Explore at:
    zip. Available download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This record contains the annotated datasets and models used and produced for the work reported in the Master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).

    Please cite this report if you are using the models/datasets or find it relevant to your research:

    @article{Marxen:305129,
       title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
       author = {Marxen, Lea},
       pages = {114p},
       year = {2023},
       url = {http://infoscience.epfl.ch/record/305129},
    }


    1. DATA

    The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, and into train, dev and test sets respectively. The data is annotated at the token level in the CoNLL format with IOB tagging.

    The distribution of articles in the different sets is as follows:

    Dataset Statistics
    Lg.DocsAgency Mentions
    Trainde333493
    fr9031,122
    Devde3226
    fr110114
    Testde3258
    fr120163

    Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
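
    A minimal sketch of reading one of the CoNLL files into sentences of (token, tag) pairs; the file path, tab separation and column order (token first, IOB tag last) are assumptions about the release layout:

    ```python
    # Hedged sketch: parse a CoNLL-style IOB file into sentences of (token, tag) pairs.
    def read_conll(path):
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                fields = line.split("\t")
                current.append((fields[0], fields[-1]))  # token and IOB tag (assumed columns)
        if current:
            sentences.append(current)
        return sentences

    sentences = read_conll("newsagency-dataset/fr/train.tsv")  # placeholder path
    print(sentences[0])
    ```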

    2. MODELS

    The two agency detection and classification models used for the inference on the impresso Corpus are released as well:

    • newsagency-model-de: based on German BERT (with maximum sequence length 128), fine-tuned with the German training set of the newsagency-dataset
    • newsagency-model-fr: based on French Europeana BERT (with maximum sequence length 128), fine-tuned with the French training set of the newsagency-dataset

    The models perform multitask classification with two prediction heads, one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.

    Please refer to the report for further information or contact us.

    3. CODE

    https://github.com/impresso/newsagency-classification

    4. CONTACT

    Maud Ehrmann (EPFL-DHLAB)
    Emanuela Boros (EPFL-DHLAB)

  13. hub-tldr-dataset-summaries-llama

    • huggingface.co
    Updated Feb 17, 2025
    + more versions
    Cite
    Daniel van Strien (2025). hub-tldr-dataset-summaries-llama [Dataset]. https://huggingface.co/datasets/davanstrien/hub-tldr-dataset-summaries-llama
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Feb 17, 2025
    Authors
    Daniel van Strien
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset card for dataset-summaries-llama

    This dataset contains AI-generated summaries of dataset cards from the Hugging Face Hub, generated using meta-llama/Llama-3.3-70B-Instruct. It is designed to be used in combination with a similar dataset of model card summaries for initial supervised fine-tuning (SFT) of language models specialized in generating tl;dr summaries of dataset and model cards from the Hugging Face Hub. This dataset was made with Curator.

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/hub-tldr-dataset-summaries-llama.
    
  14. HVULao_NLP: A Word-Segmented and POS-Tagged Lao Corpus

    • data.mendeley.com
    Updated Sep 1, 2025
    Cite
    Ha Nguyen (2025). HVULao_NLP: A Word-Segmented and POS-Tagged Lao Corpus [Dataset]. http://doi.org/10.17632/5zwym7kwn8.1
    Explore at:
    Dataset updated
    Sep 1, 2025
    Authors
    Ha Nguyen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HVULao_NLP project is dedicated to sharing datasets and tools for Lao Natural Language Processing (NLP), developed and maintained by the research team at Hung Vuong University (HVU), Phu Tho, Vietnam. This project is supported by Hung Vuong University with the aim of advancing research and applications in low-resource language processing, particularly for the Lao language.

    Datasets
    This release provides a semi-automatically constructed corpus consisting of Lao sentences that have been word-segmented and part-of-speech (POS) tagged. It is designed to support a wide range of NLP applications, including language modeling, sequence labeling, linguistic research, and the development of Lao language tools.

    • Datatest1k/ – Test set (1,000 Lao sentences)

      • testorgin1000.txt: Original raw sentences (UTF-8, one sentence per line).
      • testsegsent_1000.txt: Word-segmented version aligned 1-to-1 with the raw file (tokens separated by spaces).
      • testtag1k.json: Word-segmented and POS-tagged sentences, generated using large language models (LLMs) and manually reviewed by native linguists.
    • Datatrain10k/ – Training set (10,000 Lao sentences)

      • 10ktrainorin.txt: Original raw sentences (UTF-8, one sentence per line).
      • 10ksegmented.txt: Word-segmented version aligned 1-to-1 with the raw file.
      • 10ktraintag.json: Word-segmented and POS-tagged sentences, generated using the same method as the test set.
    • lao_finetuned_10k/ – A fine-tuned transformer-based model for Lao word segmentation, compatible with Hugging Face’s transformers library.

    All data files are encoded in UTF-8 (NFC) and prepared for direct use in NLP pipelines.

    The Lao sentence segmentation tool
    A command-line tool for Lao word segmentation built with a fine-tuned Hugging Face transformers model and PyTorch.

    Features
    - Accurate Lao word segmentation using a pre-trained model
    - Simple command-line usage
    - GPU support (if available)

    Example usage
    ```bash
    python3 segment_lao.py -i ./data/lao_raw.txt -o ./output/lao_segmented.txt
    ```

    The Lao sentence POS tagging tool
    A POS tagging tool for segmented Lao text, implemented with Python and CRF++.

    Example usage
    python3 Pos_tagging.py ./Test/lao_sentences_segmented.txt Test1

    Usage
    The HVULao_NLP dataset and tools are intended for:
    - Training and evaluating sequence labeling models (e.g., CRF, BiLSTM, mBERT)
    - Developing Lao NLP tools (e.g., POS taggers, tokenizers)
    - Conducting linguistic and computational research on Lao

  15. Training Data For building a chatbot

    • kaggle.com
    zip
    Updated Mar 5, 2025
    Cite
    IndraneelBakshiss (2025). Training Data For building a chatbot [Dataset]. https://www.kaggle.com/datasets/indraneelbakshiss/training-data-for-building-a-chatbot
    Explore at:
    zip (22,200 bytes). Available download formats
    Dataset updated
    Mar 5, 2025
    Authors
    IndraneelBakshiss
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset is designed to train and fine-tune chatbot models by mapping user queries (patterns) to predefined intents (tags) and generating contextually accurate responses. Each tag represents a unique conversational intent or topic (e.g., "climate_change," "crypto_regulation," "quantum_computing"), accompanied by multiple paraphrased user prompts (patterns) and a detailed, informative response. Ideal for building intent classification systems, dialogue management, or generative AI models.

    {
      "intents": [
        {
          "tag": "tag_name",
          "patterns": ["user query 1", "user query 2", ...],
          "responses": ["detailed answer"]
        },
        ...
      ]
    }
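
    A minimal sketch of flattening this structure into (pattern, tag) pairs for an intent classifier; the filename is a placeholder:

    ```python
    # Hedged sketch: flatten the intents JSON into (text, label) pairs plus a tag-to-responses lookup.
    import json

    with open("intents.json", encoding="utf-8") as f:  # placeholder filename
        intents = json.load(f)["intents"]

    pairs = [(pattern, item["tag"]) for item in intents for pattern in item["patterns"]]
    responses = {item["tag"]: item["responses"] for item in intents}
    print(pairs[:3])
    ```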

    Possible Uses

    Intent Classification: Train models to categorize user inputs into predefined tags.

    Response Generation: Fine-tune generative models (GPT, BERT) to produce context-aware answers.

    Educational Chatbots: Power QA systems for topics like science, history, or technology.

    Customer Support: Automate responses for FAQs or policy explanations.

    Compatibility

    Frameworks: TensorFlow, PyTorch, spaCy, Rasa, Hugging Face Transformers.

    Use Cases: Virtual assistants, customer service bots, trivia apps, educational tools.

  16. MELD-emotion-detection-preprocessed

    • kaggle.com
    zip
    Updated Nov 1, 2025
    + more versions
    Cite
    seniru epasinghe (2025). MELD-emotion-detection-preprocessed [Dataset]. https://www.kaggle.com/datasets/seniruepasinghe/meld-emotion-detection-preprocessed
    Explore at:
    zip (4,536,497,614 bytes). Available download formats
    Dataset updated
    Nov 1, 2025
    Authors
    seniru epasinghe
    License

    GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Hi, I'm Seniru Epasinghe 👋

    I’m an AI enthusiast, working on machine learning projects and open-source contributions.
    I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

    Connect with me:

    Hugging Face
    Medium
    LinkedIn
    GitHub

    Multimodal Emotion Recognition Dataset (Processed from MELD)

    This dataset is a preprocessed and balanced version of the MELD Dataset, designed for multimodal emotion recognition research.
    It combines text, audio, and video modalities, each represented by a set of emotion probability distributions predicted by pretrained or custom-trained models.

    Overview

    • Total Samples: 4,000 utterances
    • Modalities: Text, Audio, Video
    • Balanced Emotions: Each emotion class is approximately balanced
    • Cleaned Samples: Videos with unclear or no facial detection removed
    • Emotion Labels: ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

    Each row in the dataset corresponds to a single utterance, along with emotion label, file name, and predicted emotion probabilities per modality.

    Example Entry

    • Utterance: You are going to a clinic!
    • Emotion: disgust
    • File_Name: dia127_utt3.mp4
    • MultiModel Predictions: {"video": [0.7739, 0.0, 0.0, 0.0783, 0.1217, 0.0174, 0.0087], "audio": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], "text": [0.0005, 0.0, 0.0, 0.0007, 0.998, 0.0004, 0.0004]}

    Column Description:

    • Utterance — spoken text in the conversation.
    • Emotion — gold-standard emotion label.
    • File_Name — corresponding video file (utterance-level).
    • MultiModel Predictions — JSON object containing model-predicted emotion probability vectors for each modality.
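
    A minimal sketch of parsing the per-modality probability vectors and combining them with simple late fusion (averaging); the CSV filename is a placeholder, while the column names and label order follow the description above:

    ```python
    # Hedged sketch: average the per-modality probability vectors (late fusion) for one row.
    import json
    import numpy as np
    import pandas as pd

    labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
    df = pd.read_csv("meld_preprocessed.csv")  # placeholder filename

    preds = json.loads(df.loc[0, "MultiModel Predictions"])
    fused = np.mean([preds["video"], preds["audio"], preds["text"]], axis=0)
    print("fused prediction:", labels[int(fused.argmax())], "| gold label:", df.loc[0, "Emotion"])
    ```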

    Modality Emotion Extraction

    Each modality’s emotion vector was generated independently using specialized models:

    Modality | Model / Method | Description
    Video | python-fer | Facial expression recognition using a CNN-based FER library.
    Audio | Custom-trained CNN model | Trained on Mel spectrogram features for emotion classification.
    Text | arpanghoshal/EmoRoBERTa | Transformer-based text emotion model fine-tuned on the GoEmotions dataset.

    Format and Usage

    • File format: CSV
    • Recommended columns:
      • Utterance
      • Emotion
      • File_Name
      • Final_Emotion (JSON: { "video": [...], "audio": [...], "text": [...] })

    This dataset is ideal for:
    • Fusion model training
    • Fine-tuning multimodal emotion models
    • Benchmarking emotion fusion strategies
    • Ablation studies on modality importance

    Citation

    References for the original MELD Dataset:
    • S. Poria, D. Hazarika, N. Majumder, G. Naik, R. Mihalcea, E. Cambria. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation (2018).
    • Chen, S.Y., Hsu, C.C., Kuo, C.C. and Ku, L.W. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv preprint arXiv:1802.08379 (2018).

    License & Acknowledgments

    This dataset is a derivative work of MELD, used here for research and educational purposes.
    All credit for the original dataset goes to the MELD authors and contributors.

  17. toolverifier

    • huggingface.co
    Updated Mar 14, 2024
    Cite
    AI at Meta (2024). toolverifier [Dataset]. https://huggingface.co/datasets/facebook/toolverifier
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2024
    Dataset authored and provided by
    AI at Meta
    Description

    TOOLVERIFIER: Generalization to New Tools via Self-Verification

    This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.

      Data
    

    ToolSelect data is synthetic training data generated for tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.

  18. GSAFE-finetuned

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    AlbertoCosta (2025). GSAFE-finetuned [Dataset]. https://www.kaggle.com/datasets/noobsajbot/gsafe-finetuned
    Explore at:
    zip (85,872,842 bytes). Available download formats
    Dataset updated
    Aug 4, 2025
    Authors
    AlbertoCosta
    Description

    These are LoRA fine‑tuned adapter weights for Google Gemma 3n E2B IT, produced for the Google Gemma 3n Impact Challenge. Base model: https://huggingface.co/google/gemma-3n-e2b-it Usage subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms and the rules listed in the competition page: https://www.kaggle.com/competitions/google-gemma-3n-hackathon/rules
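
    A minimal sketch of attaching these adapters to the base model with peft; the adapter directory is a placeholder, loading Gemma requires accepting its terms on Hugging Face first, and the auto model class is an assumption (newer transformers versions may expose a dedicated Gemma 3n class):

    ```python
    # Hedged sketch: load the Gemma 3n base model and attach the LoRA adapter weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "google/gemma-3n-e2b-it"
    adapter_dir = "/kaggle/input/gsafe-finetuned"  # placeholder path to the unzipped adapters

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")  # model class is an assumption
    model = PeftModel.from_pretrained(base, adapter_dir)  # attaches the LoRA weights on top of the base
    ```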

  19. KokoroTTS-af-Synthetic-QA

    • huggingface.co
    Updated Jan 6, 2025
    Cite
    MI Remo (2025). KokoroTTS-af-Synthetic-QA [Dataset]. https://huggingface.co/datasets/rokeya71/KokoroTTS-af-Synthetic-QA
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2025
    Authors
    MI Remo
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Everyday Conversations Fine-Tuning Dataset (LLaMA 3.1 - 2K)

      Overview
    

    This repository hosts the Everyday Conversations - LLaMA 3.1 - 2K dataset, a carefully curated fine-tuning dataset designed for conversational AI models. The dataset was created using the Kokoro-82M model, featuring voice samples from the af Voicepack.

      Dataset Link
    

    Hugging Face Dataset - Everyday Conversations LLaMA 3.1 - 2K

      Features
    

    Voice Model: Kokoro-82M Voicepack: af… See the full description on the dataset page: https://huggingface.co/datasets/rokeya71/KokoroTTS-af-Synthetic-QA.

  20. GSCF_finetune

    • huggingface.co
    Cite
    Supply Chain AI Research at The Ohio State University, GSCF_finetune [Dataset]. https://huggingface.co/datasets/Supply-Chain-AI-Research/GSCF_finetune
    Explore at:
    Dataset authored and provided by
    Supply Chain AI Research at The Ohio State University
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GSCF Q&A Dataset for Fine-Tuning

      Dataset Summary
    

    This dataset contains a collection of question-and-answer pairs specifically designed for fine-tuning Large Language Models (LLMs) on the Global Supply Chain Forum (GSCF) framework, a leading process model developed at The Ohio State University. The data is structured to train a model to act as an expert supply chain consultant. The content covers two main types of interactions:

    Definitional Knowledge: Questions that… See the full description on the dataset page: https://huggingface.co/datasets/Supply-Chain-AI-Research/GSCF_finetune.
