100+ datasets found
  1. ai-job-embedding-finetuning

    • huggingface.co
    Cite
    Shawhin Talebi, ai-job-embedding-finetuning [Dataset]. https://huggingface.co/datasets/shawhin/ai-job-embedding-finetuning
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Authors
    Shawhin Talebi
    Description

    Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings and used to fine-tune shawhin/distilroberta-ai-job-embeddings for AI job search.

    Links: GitHub repo, video, and blog post.
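
    A quick way to check what the dataset looks like is to load it with the Hugging Face datasets library; a minimal sketch (the split name is an assumption, so check the dataset card):

    ```python
    # Hedged sketch: load the dataset from the Hub and inspect its splits and columns.
    from datasets import load_dataset

    ds = load_dataset("shawhin/ai-job-embedding-finetuning")
    print(ds)              # available splits and column names
    print(ds["train"][0])  # first example; "train" is an assumed split name
    ```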

  2. ai-wit-training-data

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Jay (2025). ai-wit-training-data [Dataset]. https://huggingface.co/datasets/artificialreply/ai-wit-training-data
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Jay
    Description

    AI Wit Training Dataset

    This dataset contains witty comeback and humor training data for fine-tuning language models.

      Dataset Structure
    

    Each sample contains:

    • messages: list of user/assistant conversation turns
    • source: data source (e.g., "reddit_jokes")
    • style: response style (e.g., "humorous", "witty")

      Usage
    

    This dataset is designed for fine-tuning conversational AI models to generate witty, humorous responses to offensive or provocative inputs.

      Example
    

    {… See the full description on the dataset page: https://huggingface.co/datasets/artificialreply/ai-wit-training-data.

  3. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    zip (1,214,166 bytes). Available download formats
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should start from a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering, and Math).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    • id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. Useful for filtering the dataset by topic or creating a topic-based split.
    • subtopic: a subtopic of the topic. For instance, within Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch below). A list of messages where each message is a dictionary with the following fields:
      • role: the role of the speaker, either user or assistant.
      • content: the message content. For the assistant, it is the answer to the user's question; for the user, it is the question asked to the assistant.
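
    Because messages is already a list of role/content dictionaries, it can be fed straight into a tokenizer chat template when preparing SFT examples. A minimal sketch (the split name and the tokenizer used for the template are assumptions, not part of the dataset):

    ```python
    # Hedged sketch: render one AstroChat conversation into a single SFT training string.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    ds = load_dataset("patrickfleith/AstroChat", split="train")          # assumed split name
    tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # any chat model with a template

    example = ds[0]
    text = tok.apply_chat_template(example["messages"], tokenize=False)
    print(example["topic"], "/", example["subtopic"])
    print(text[:500])
    ```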

    Important: see the full list of topics and subtopics covered below.

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, defined a set of subtopics to narrow the conversations down to more specific and niche exchanges (see the full list below)
    • For each subtopic, generated a set of opening questions that the user could ask to start a conversation (see the full list below)
    • Distilled the knowledge of a strong chat model (in our case ChatGPT, via the API with the gpt-4-turbo model) to generate answers to the opening questions
    • Simulated follow-up questions from the user and the assistant's answers to them, which build up the messages

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc...)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter out questions and conversations that are too similar
    • Ask topic and subtopic experts to validate the generated conversations to gauge how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have feedback or spotted an error?

    Use the ...

  4. Mental Health Conversational AI Training Dataset

    • kaggle.com
    zip
    Updated Jun 10, 2025
    Cite
    Nguyen Le Truong Thien (2025). Mental Health Conversational AI Training Dataset [Dataset]. https://www.kaggle.com/datasets/nguyenletruongthien/mental-health
    Explore at:
    zip (96,858,618 bytes). Available download formats
    Dataset updated
    Jun 10, 2025
    Authors
    Nguyen Le Truong Thien
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This comprehensive mental health conversational dataset contains more than 510,000 professionally curated conversations, therapeutic dialogues, and support interactions designed for training empathetic AI systems. The dataset combines real-world counseling scenarios, community discussions, and synthetic conversations covering the full spectrum of mental health topics including anxiety, depression, crisis intervention, and wellness support. All content has been carefully anonymized, ethically reviewed, and formatted for immediate compatibility with popular machine learning frameworks including Hugging Face Transformers, OpenAI APIs, and custom language models. The dataset includes multiple file formats (CSV, JSON) and ready-to-use training splits optimized for fine-tuning conversational AI models, making it an invaluable resource for researchers, developers, and organizations building mental health support technologies while maintaining the highest standards of privacy, safety, and therapeutic appropriateness.

  5. Whisper-Fine-Tune-One-Shot-Eval

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    Daniel Rosehill (2025). Whisper-Fine-Tune-One-Shot-Eval [Dataset]. https://huggingface.co/datasets/danielrosehill/Whisper-Fine-Tune-One-Shot-Eval
    Explore at:
    Dataset updated
    Nov 17, 2025
    Authors
    Daniel Rosehill
    Description

    Whisper Fine-Tuning Evaluation: Local vs Commercial ASR

    A "back of the envelope" evaluation comparing fine-tuned Whisper models running locally against commercial ASR APIs via Eden AI.

      The Question
    

    Can fine-tuning Whisper achieve measurable WER reductions, even when comparing local inference against cloud-based commercial models?

      TL;DR
    

    Yes. Fine-tuned Whisper Large Turbo running locally achieved 5.84% WER, beating the best commercial API (Assembly at… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Whisper-Fine-Tune-One-Shot-Eval.
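
    For context, WER numbers like the 5.84% above can be reproduced for any hypothesis/reference pairs with the evaluate library (backed by jiwer); the strings below are placeholders, not data from this dataset:

    ```python
    # Hedged sketch: compute word error rate (WER) for a list of transcripts.
    import evaluate

    wer = evaluate.load("wer")
    references = ["the quick brown fox jumps over the lazy dog"]
    predictions = ["the quick brown fox jumped over the lazy dog"]
    print(wer.compute(predictions=predictions, references=references))  # a fraction; multiply by 100 for %
    ```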

  6. AI-MATH-LLM-Package

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Johnson chong (2024). AI-MATH-LLM-Package [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/ai-math-llm-package
    Explore at:
    zip (3,330,554,065 bytes). Available download formats
    Dataset updated
    Jun 20, 2024
    Authors
    Johnson chong
    Description

    This is an install package of essential libraries for LLM RAG and fine-tuning (Hugging Face Hub, transformers, LangChain, evaluate, sentence-transformers, etc.). It is suitable for Kaggle competitions with offline requirements; the packages were downloaded from the Kaggle development environment.

    Supported packages: transformers, datasets, accelerate, bitsandbytes, langchain, langchain-community, sentence-transformers, chromadb, faiss-cpu, huggingface_hub, langchain-text-splitters, peft, trl, umap-learn, evaluate, deepeval, weave

    Suggested install commands in Kaggle:

    !pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
    !pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
    !pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
    !pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
    !pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
    !pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
    !pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
    !pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
    !pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
    !pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
    !pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
    !pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
    !pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
    !pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
    !pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
    !pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
    !pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
    !pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
    !pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
    !pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl

  7. nayjest/Phi-3-mini-4k-instruct

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Vitalii Stepanenko (2024). nayjest/Phi-3-mini-4k-instruct [Dataset]. https://www.kaggle.com/datasets/nayjest/phi-3-mini-4k-instruct
    Explore at:
    zip (6,067,852,377 bytes). Available download formats
    Dataset updated
    May 9, 2024
    Authors
    Vitalii Stepanenko
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

    Model Summary

    The Phi-3-Mini-4K-Instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, which is the context length (in tokens) it can support.

    The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.

    Resources and Technical Documentation:

    Intended Uses

    Primary use cases

    The model is intended for commercial and research use in English. The model is suited to applications that require:

    1) Memory/compute constrained environments 2) Latency bound scenarios 3) Strong reasoning (especially code, math and logic)

    Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

    Use case considerations

    Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

    Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

    How to Use

    Phi-3 Mini-4K-Instruct has been integrated in the development version (4.41.0.dev0) of transformers. Until the official version is released through pip, ensure that you are doing one of the following:

    • When loading the model, ensure that trust_remote_code=True is passed as an argument of the from_pretrained() function.

    • Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. The previous command is an alternative to cloning and installing from the source.

    The current transformers version can be verified with: pip list | grep transformers.

    Phi-3 Mini-4K-Instruct is also available in HuggingChat.
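
    A minimal generation sketch following the notes above (passing trust_remote_code=True and using the chat template); the dtype and device settings are assumptions:

    ```python
    # Hedged sketch: load Phi-3 Mini-4K-Instruct and generate a reply to a chat-format prompt.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # assumption; use float16/float32 as your hardware allows
        device_map="auto",
        trust_remote_code=True,       # required, as noted above
    )

    messages = [{"role": "user", "content": "How to explain Internet for a medieval knight?"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
    ```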

    Tokenizer

    Phi-3 Mini-4K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

    Chat Format

    Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format. You can provide the prompt as a question with a generic template as follows:

    <|user|>
    Question <|end|>
    <|assistant|>

    For example:

    <|user|>
    How to explain Internet for a medieval knight?<|end|>
    <|assistant|>

    where the model generates the text after <|assistant|>. In the case of a few-shot prompt, it can be formatted as follows:

    <|user|>
    I am going to Paris, what should I see?<|end|>
    <|assistant|>
    Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:
    
    1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
    2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
    3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic...
    
  8. Whisper-Small-Bengali

    • kaggle.com
    zip
    Updated Oct 3, 2023
    + more versions
    Cite
    Ivan Bondarenko (2023). Whisper-Small-Bengali [Dataset]. https://www.kaggle.com/datasets/bond005/whisper-small-bengali
    Explore at:
    zip (896,374,466 bytes). Available download formats
    Dataset updated
    Oct 3, 2023
    Authors
    Ivan Bondarenko
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

    This is a fine-tuned small Whisper model for Bangla. Fine-tuning started from openai/whisper-small and was carried out on the Bengali.AI Speech Recognition dataset.
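
    A minimal transcription sketch using the transformers ASR pipeline; the checkpoint path and audio filename are placeholders:

    ```python
    # Hedged sketch: transcribe a Bangla audio file with the fine-tuned Whisper checkpoint.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="/kaggle/input/whisper-small-bengali",  # placeholder path to the unzipped checkpoint
        chunk_length_s=30,
    )
    result = asr("sample_bn.wav", generate_kwargs={"language": "bengali", "task": "transcribe"})
    print(result["text"])
    ```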

  9. Contextual Input SFT Dataset

    • kaggle.com
    zip
    Updated May 29, 2025
    Cite
    Zeeshan-ul-hassan Usmani (2025). Contextual Input SFT Dataset [Dataset]. https://www.kaggle.com/datasets/zusmani/contextual-input-sft-dataset
    Explore at:
    zip (499,476 bytes). Available download formats
    Dataset updated
    May 29, 2025
    Authors
    Zeeshan-ul-hassan Usmani
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Instruction-Tuned Dataset with Contextual Inputs (10,000 Examples for SFT)

    🧠 What is Supervised Fine-Tuning (SFT)?

    Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction–input–output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.

    This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.

    About This Dataset

    This dataset contains 10,000 instruction–input–output examples spanning 10 practical domains:

    • Healthcare
    • Code
    • Finance
    • Education
    • Law
    • Productivity
    • Marketing
    • Psychology
    • Sports
    • Travel

    Each record is structured as:

    • id: Unique identifier
    • domain: Domain/topic of the task
    • instruction: A prompt asking the model to perform a task
    • input: Context or information needed to complete the task
    • output: Target response generated for the given instruction + input
    • source: Whether the entry is synthetic or human-curated
    • quality_score: A rating from 1–5 reflecting the response's quality
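
    A minimal sketch of turning one instruction–input–output row into a prompt/completion pair for SFT; the CSV filename and the prompt template are assumptions:

    ```python
    # Hedged sketch: build prompt/completion pairs from the instruction, input and output columns.
    import pandas as pd

    df = pd.read_csv("contextual_input_sft.csv")  # placeholder filename

    def to_example(row):
        prompt = (
            f"### Instruction:\n{row['instruction']}\n\n"
            f"### Input:\n{row['input']}\n\n"
            "### Response:\n"
        )
        return {"prompt": prompt, "completion": row["output"]}

    examples = [to_example(r) for _, r in df.iterrows()]
    print(examples[0]["prompt"] + examples[0]["completion"])
    ```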

    Example Entry

    1. Instruction: "Summarize the following article"
       Input: "Photosynthesis is the process by which plants..."
       Output: "Photosynthesis converts light into chemical energy."
    2. Instruction: "Fix the code below"
       Input: "def greet(name): print('Hello' name)"
       Output: "def greet(name): print('Hello', name)"
    3. Instruction: "Plan a 5-day trip"
       Input: "Destination: Japan. Interests: culture, tech."
       Output: "Day 1: Tokyo tour... Day 2: Kyoto temples..."

    🧪 What Can You Do With This Dataset?

    Beginners

    • Train a small transformer model using instruction + input → output
    • Experiment with prompt engineering and token analysis
    • Evaluate models on diverse domains and tasks

    Practitioners

    • Fine-tune LLaMA, Mistral, GPT-J, or Falcon on instruction tasks
    • Perform domain-based SFT (e.g., only legal or medical examples)
    • Use quality scores to train a filtering mechanism or reward model

    🧠 Researchers

    • Investigate performance variance across domains
    • Run evaluation benchmarks (BLEU, ROUGE, METEOR, GPT-4 eval)
    • Study model alignment and generalization with diverse instructions

    Suggested Projects

    • Fine-tune models using transformers and PEFT
    • Build a quality prediction model using the quality_score
    • Visualize attention distribution over instruction vs. input
    • Compare SFT vs. zero-shot/few-shot prompting using the same tasks

    Tools That Work Well

    • Hugging Face Transformers and Datasets
    • PEFT for parameter-efficient tuning
    • LoRA, QLoRA, or 8-bit training on Colab or local GPU
    • LangChain for interactive API wrappers
    • Weights & Biases for experiment tracking

    License

    Released under the MIT License. You may use, modify, and share with attribution.

    Acknowledgments

    Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.

  10. Recurv-Clinical-Dataset

    • huggingface.co
    Updated Feb 3, 2025
    + more versions
    Cite
    Recurv AI (2025). Recurv-Clinical-Dataset [Dataset]. https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2025
    Dataset authored and provided by
    Recurv AI
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🩺 Recurv-Clinical-Dataset:

    The Recurv Clinical Dataset is a comprehensive resource containing 12,631 high-quality question-answer pairs specifically designed for training and fine-tuning medical AI models. Curated from trusted medical sources, this dataset focuses on real-world scenarios, including patient history, diagnostics, and treatment recommendations. It sets a new benchmark for advancing conversational AI in the healthcare field.

      Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset.
    
  11. hr-policies-qa-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). hr-policies-qa-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/hr-policies-qa-dataset
    Explore at:
    zip (54,895 bytes). Available download formats
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    šŸ¢ HR Policies Q&A Synthetic Dataset

    This synthetic dataset for LLM training captures realistic employee–assistant interactions about HR and compliance policies.
    Generated using Syncora.ai's synthetic data generation engine, it provides privacy-safe, high-quality conversations for training Large Language Models (LLMs) to handle HR-related queries.

    Perfect for researchers, HR tech startups, and AI developers building chatbots, compliance assistants, or policy QA systems — without exposing sensitive employee data.

    🧠 Context & Applications

    HR departments handle countless queries on policies, compliance, and workplace practices.
    This dataset simulates those Q&A flows, making it a powerful dataset for LLM training and research.

    You can use it for:

    • HR chatbot prototyping
    • Policy compliance assistants
    • Internal knowledge base fine-tuning
    • Generative AI experimentation
    • Synthetic benchmarking in enterprise QA systems

    Dataset Features

    • role: Role of the message author (system, user, or assistant)
    • content: Actual text of the message
    • messages: Grouped sequence of role–content exchanges (conversation turns)

    Each entry represents a self-contained dialogue snippet designed to reflect natural HR conversations, ideal for synthetic data generation research.
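
    A minimal sketch of reading the JSON file and walking through one conversation; the filename and the exact nesting are assumptions, so inspect the file first:

    ```python
    # Hedged sketch: read the HR policies QA JSON and print the turns of one conversation.
    import json

    with open("hr_policies_qa.json", encoding="utf-8") as f:  # placeholder filename
        data = json.load(f)

    conversation = data[0]["messages"]  # assumed nesting: a list of {"role": ..., "content": ...} turns
    for turn in conversation:
        print(f"{turn['role']}: {turn['content'][:80]}")
    ```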

    This Repo Contains

    • HR Policies QA Dataset – JSON format, ready to use for LLM training or evaluation
    • Jupyter Notebook – Explore the dataset structure and basic preprocessing
    • Synthetic Data Tools – Generate your own datasets using Syncora.ai
    • ⚔ Generate Synthetic Data
      Need more? Use Syncora.ai’s synthetic data generation tool to create custom HR/compliance datasets. Our process is simple, reliable, and ensures privacy.

    🧪 ML & Research Use Cases

    • Policy Chatbots — Train assistants to answer compliance and HR questions
    • Knowledge Management — Fine-tune models for consistent responses
    • Synthetic Data Research — Explore structured dialogue datasets without legal risks
    • Evaluation Benchmarks — Test enterprise AI assistants on HR-related queries
    • Dataset Expansion — Combine this dataset with your own data using synthetic generation

    Why Syncora.ai Synthetic Data?

    • Zero real-user data → Zero privacy liability
    • High realism → Actionable insights for LLM training
    • Fully customizable → Generate synthetic data tailored to your domain
    • Ethically aligned → Safe and responsible dataset creation

    Whether you're building an HR assistant, compliance bot, or experimenting with enterprise LLMs, Syncora.ai synthetic datasets give you trustworthy, free datasets to start with — and scalable tools to grow further.

    Questions or Contributions?

    Got feedback, research use cases, or want to collaborate?
    Open an issue or reach out — we’re excited to work with AI researchers, HR tech builders, and compliance innovators.


    Disclaimer

    This dataset is 100% synthetic and does not represent real employees or organizations.
    It is intended solely for research, educational, and experimental use in HR analytics, compliance automation, and machine learning.

  12. Dataset and Models for Detection of News Agency Releases in Historical...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 24, 2025
    Cite
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring (2025). Dataset and Models for Detection of News Agency Releases in Historical Newspapers [Dataset]. http://doi.org/10.5281/zenodo.8333933
    Explore at:
    zip. Available download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This record contains the annotated datasets and models used and produced for the work reported in the Master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).

    Please cite this report if you are using the models/datasets or find it relevant to your research:

    @article{Marxen:305129,
       title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
       author = {Marxen, Lea},
       pages = {114p},
       year = {2023},
       url = {http://infoscience.epfl.ch/record/305129},
    }


    1. DATA

    The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, and into train, dev and test sets respectively. The data is annotated at the token level in the CoNLL format with IOB tagging.

    The distribution of articles in the different sets is as follows:

    Dataset Statistics
    Lg.DocsAgency Mentions
    Trainde333493
    fr9031,122
    Devde3226
    fr110114
    Testde3258
    fr120163

    Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
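
    A minimal sketch of reading one of the CoNLL files into sentences of (token, tag) pairs; the file path, tab separation and column order (token first, IOB tag last) are assumptions about the release layout:

    ```python
    # Hedged sketch: parse a CoNLL-style IOB file into sentences of (token, tag) pairs.
    def read_conll(path):
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                fields = line.split("\t")
                current.append((fields[0], fields[-1]))  # token and IOB tag (assumed columns)
        if current:
            sentences.append(current)
        return sentences

    sentences = read_conll("newsagency-dataset/fr/train.tsv")  # placeholder path
    print(sentences[0])
    ```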

    2. MODELS

    The two agency detection and classification models used for the inference on the impresso Corpus are released as well:

    • newsagency-model-de: based on German BERT (with maximum sequence length 128), fine-tuned with the German training set of the newsagency-dataset
    • newsagency-model-fr: based on French Europeana BERT (with maximum sequence length 128), fine-tuned with the French training set of the newsagency-dataset

    The models perform multitask classification with two prediction heads, one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.

    Please refer to the report for further information or contact us.

    3. CODE

    https://github.com/impresso/newsagency-classification

    4. CONTACT

    Maud Ehrmann (EPFL-DHLAB)
    Emanuela Boros (EPFL-DHLAB)

  13. hub-tldr-dataset-summaries-llama

    • huggingface.co
    Updated Feb 17, 2025
    + more versions
    Cite
    Daniel van Strien (2025). hub-tldr-dataset-summaries-llama [Dataset]. https://huggingface.co/datasets/davanstrien/hub-tldr-dataset-summaries-llama
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Feb 17, 2025
    Authors
    Daniel van Strien
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset card for dataset-summaries-llama

    This dataset contains AI-generated summaries of dataset cards from the Hugging Face Hub, generated using meta-llama/Llama-3.3-70B-Instruct. It is designed to be used in combination with a similar dataset of model card summaries for initial supervised fine-tuning (SFT) of language models specialized in generating tl;dr summaries of dataset and model cards from the Hugging Face Hub. This dataset was made with Curator.

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/hub-tldr-dataset-summaries-llama.
    
  14. HVULao_NLP: A Word-Segmented and POS-Tagged Lao Corpus

    • data.mendeley.com
    Updated Sep 1, 2025
    Cite
    Ha Nguyen (2025). HVULao_NLP: A Word-Segmented and POS-Tagged Lao Corpus [Dataset]. http://doi.org/10.17632/5zwym7kwn8.1
    Explore at:
    Dataset updated
    Sep 1, 2025
    Authors
    Ha Nguyen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HVULao_NLP project is dedicated to sharing datasets and tools for Lao Natural Language Processing (NLP), developed and maintained by the research team at Hung Vuong University (HVU), Phu Tho, Vietnam. This project is supported by Hung Vuong University with the aim of advancing research and applications in low-resource language processing, particularly for the Lao language.

    Datasets
    This release provides a semi-automatically constructed corpus consisting of Lao sentences that have been word-segmented and part-of-speech (POS) tagged. It is designed to support a wide range of NLP applications, including language modeling, sequence labeling, linguistic research, and the development of Lao language tools.

    • Datatest1k/ – Test set (1,000 Lao sentences)

      • testorgin1000.txt: Original raw sentences (UTF-8, one sentence per line).
      • testsegsent_1000.txt: Word-segmented version aligned 1-to-1 with the raw file (tokens separated by spaces).
      • testtag1k.json: Word-segmented and POS-tagged sentences, generated using large language models (LLMs) and manually reviewed by native linguists.
    • Datatrain10k/ – Training set (10,000 Lao sentences)

      • 10ktrainorin.txt: Original raw sentences (UTF-8, one sentence per line).
      • 10ksegmented.txt: Word-segmented version aligned 1-to-1 with the raw file.
      • 10ktraintag.json: Word-segmented and POS-tagged sentences, generated using the same method as the test set.
    • lao_finetuned_10k/ – A fine-tuned transformer-based model for Lao word segmentation, compatible with Hugging Face’s transformers library.

    All data files are encoded in UTF-8 (NFC) and prepared for direct use in NLP pipelines.

    The Lao sentence segmentation tool
    A command-line tool for Lao word segmentation built with a fine-tuned Hugging Face transformers model and PyTorch.

    Features
    - Accurate Lao word segmentation using a pre-trained model
    - Simple command-line usage
    - GPU support (if available)

    Example usage
    ```bash
    python3 segment_lao.py -i ./data/lao_raw.txt -o ./output/lao_segmented.txt
    ```

    The Lao sentence POS tagging tool
    A POS tagging tool for segmented Lao text, implemented with Python and CRF++.

    Example usage
    python3 Pos_tagging.py ./Test/lao_sentences_segmented.txt Test1

    Usage
    The HVULao_NLP dataset and tools are intended for:
    - Training and evaluating sequence labeling models (e.g., CRF, BiLSTM, mBERT)
    - Developing Lao NLP tools (e.g., POS taggers, tokenizers)
    - Conducting linguistic and computational research on Lao

  15. Training Data For building a chatbot

    • kaggle.com
    zip
    Updated Mar 5, 2025
    Cite
    IndraneelBakshiss (2025). Training Data For building a chatbot [Dataset]. https://www.kaggle.com/datasets/indraneelbakshiss/training-data-for-building-a-chatbot
    Explore at:
    zip (22,200 bytes). Available download formats
    Dataset updated
    Mar 5, 2025
    Authors
    IndraneelBakshiss
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset is designed to train and fine-tune chatbot models by mapping user queries (patterns) to predefined intents (tags) and generating contextually accurate responses. Each tag represents a unique conversational intent or topic (e.g., "climate_change," "crypto_regulation," "quantum_computing"), accompanied by multiple paraphrased user prompts (patterns) and a detailed, informative response. Ideal for building intent classification systems, dialogue management, or generative AI models.

    {
      "intents": [
        {
          "tag": "tag_name",
          "patterns": ["user query 1", "user query 2", ...],
          "responses": ["detailed answer"]
        },
        ...
      ]
    }
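
    A minimal sketch of flattening this structure into (pattern, tag) pairs for an intent classifier; the filename is a placeholder:

    ```python
    # Hedged sketch: flatten the intents JSON into (text, label) pairs plus a tag-to-responses lookup.
    import json

    with open("intents.json", encoding="utf-8") as f:  # placeholder filename
        intents = json.load(f)["intents"]

    pairs = [(pattern, item["tag"]) for item in intents for pattern in item["patterns"]]
    responses = {item["tag"]: item["responses"] for item in intents}
    print(pairs[:3])
    ```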

    Possible Uses

    Intent Classification: Train models to categorize user inputs into predefined tags.

    Response Generation: Fine-tune generative models (GPT, BERT) to produce context-aware answers.

    Educational Chatbots: Power QA systems for topics like science, history, or technology.

    Customer Support: Automate responses for FAQs or policy explanations.

    Compatibility

    Frameworks: TensorFlow, PyTorch, spaCy, Rasa, Hugging Face Transformers.

    Use Cases: Virtual assistants, customer service bots, trivia apps, educational tools.

  16. MELD-emotion-detection-preprocessed

    • kaggle.com
    zip
    Updated Nov 1, 2025
    + more versions
    Cite
    seniru epasinghe (2025). MELD-emotion-detection-preprocessed [Dataset]. https://www.kaggle.com/datasets/seniruepasinghe/meld-emotion-detection-preprocessed
    Explore at:
    zip (4,536,497,614 bytes). Available download formats
    Dataset updated
    Nov 1, 2025
    Authors
    seniru epasinghe
    License

    GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Hi, I'm Seniru Epasinghe 👋

    I’m an AI enthusiast, working on machine learning projects and open-source contributions.
    I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

    Connect with me:

    Hugging Face
    Medium
    LinkedIn
    GitHub

    Multimodal Emotion Recognition Dataset (Processed from MELD)

    This dataset is a preprocessed and balanced version of the MELD Dataset, designed for multimodal emotion recognition research.
    It combines text, audio, and video modalities, each represented by a set of emotion probability distributions predicted by pretrained or custom-trained models.

    Overview

    • Total Samples: 4,000 utterances
    • Modalities: Text, Audio, Video
    • Balanced Emotions: Each emotion class is approximately balanced
    • Cleaned Samples: Videos with unclear or no facial detection removed
    • Emotion Labels: ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

    Each row in the dataset corresponds to a single utterance, along with emotion label, file name, and predicted emotion probabilities per modality.

    Example Entry

    • Utterance: You are going to a clinic!
    • Emotion: disgust
    • File_Name: dia127_utt3.mp4
    • MultiModel Predictions: {"video": [0.7739, 0.0, 0.0, 0.0783, 0.1217, 0.0174, 0.0087], "audio": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], "text": [0.0005, 0.0, 0.0, 0.0007, 0.998, 0.0004, 0.0004]}

    Column Description:

    • Utterance — spoken text in the conversation.
    • Emotion — gold-standard emotion label.
    • File_Name — corresponding video file (utterance-level).
    • MultiModel Predictions — JSON object containing model-predicted emotion probability vectors for each modality.
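
    A minimal sketch of parsing the per-modality probability vectors and combining them with simple late fusion (averaging); the CSV filename is a placeholder, while the column names and label order follow the description above:

    ```python
    # Hedged sketch: average the per-modality probability vectors (late fusion) for one row.
    import json
    import numpy as np
    import pandas as pd

    labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
    df = pd.read_csv("meld_preprocessed.csv")  # placeholder filename

    preds = json.loads(df.loc[0, "MultiModel Predictions"])
    fused = np.mean([preds["video"], preds["audio"], preds["text"]], axis=0)
    print("fused prediction:", labels[int(fused.argmax())], "| gold label:", df.loc[0, "Emotion"])
    ```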

    Modality Emotion Extraction

    Each modality’s emotion vector was generated independently using specialized models:

    Modality | Model / Method | Description
    Video | python-fer | Facial expression recognition using a CNN-based FER library.
    Audio | Custom-trained CNN model | Trained on Mel spectrogram features for emotion classification.
    Text | arpanghoshal/EmoRoBERTa | Transformer-based text emotion model fine-tuned on the GoEmotions dataset.

    Format and Usage

    • File format: CSV
    • Recommended columns:
      • Utterance
      • Emotion
      • File_Name
      • Final_Emotion (JSON: { "video": [...], "audio": [...], "text": [...] })

    This dataset is ideal for:
    • Fusion model training
    • Fine-tuning multimodal emotion models
    • Benchmarking emotion fusion strategies
    • Ablation studies on modality importance

    Citation

    References for the original MELD Dataset:
    • S. Poria, D. Hazarika, N. Majumder, G. Naik, R. Mihalcea, E. Cambria. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation (2018).
    • Chen, S.Y., Hsu, C.C., Kuo, C.C. and Ku, L.W. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv preprint arXiv:1802.08379 (2018).

    License & Acknowledgments

    This dataset is a derivative work of MELD, used here for research and educational purposes.
    All credit for the original dataset goes to the MELD authors and contributors.

  17. toolverifier

    • huggingface.co
    Updated Mar 14, 2024
    Cite
    AI at Meta (2024). toolverifier [Dataset]. https://huggingface.co/datasets/facebook/toolverifier
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2024
    Dataset authored and provided by
    AI at Meta
    Description

    TOOLVERIFIER: Generalization to New Tools via Self-Verification

    This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.

      Data
    

    ToolSelect data is synthetic training data generated for tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.

  18. GSAFE-finetuned

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    AlbertoCosta (2025). GSAFE-finetuned [Dataset]. https://www.kaggle.com/datasets/noobsajbot/gsafe-finetuned
    Explore at:
    zip (85,872,842 bytes). Available download formats
    Dataset updated
    Aug 4, 2025
    Authors
    AlbertoCosta
    Description

    These are LoRA fine‑tuned adapter weights for Google Gemma 3n E2B IT, produced for the Google Gemma 3n Impact Challenge. Base model: https://huggingface.co/google/gemma-3n-e2b-it Usage subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms and the rules listed in the competition page: https://www.kaggle.com/competitions/google-gemma-3n-hackathon/rules
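
    A minimal sketch of attaching these adapters to the base model with peft; the adapter directory is a placeholder, loading Gemma requires accepting its terms on Hugging Face first, and the auto model class is an assumption (newer transformers versions may expose a dedicated Gemma 3n class):

    ```python
    # Hedged sketch: load the Gemma 3n base model and attach the LoRA adapter weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "google/gemma-3n-e2b-it"
    adapter_dir = "/kaggle/input/gsafe-finetuned"  # placeholder path to the unzipped adapters

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")  # model class is an assumption
    model = PeftModel.from_pretrained(base, adapter_dir)  # attaches the LoRA weights on top of the base
    ```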

  19. KokoroTTS-af-Synthetic-QA

    • huggingface.co
    Updated Jan 6, 2025
    Cite
    MI Remo (2025). KokoroTTS-af-Synthetic-QA [Dataset]. https://huggingface.co/datasets/rokeya71/KokoroTTS-af-Synthetic-QA
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2025
    Authors
    MI Remo
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Everyday Conversations Fine-Tuning Dataset (LLaMA 3.1 - 2K)

      Overview
    

    This repository hosts the Everyday Conversations - LLaMA 3.1 - 2K dataset, a carefully curated fine-tuning dataset designed for conversational AI models. The dataset was created using the Kokoro-82M model, featuring voice samples from the af Voicepack.

      Dataset Link
    

    Hugging Face Dataset - Everyday Conversations LLaMA 3.1 - 2K

      Features
    

    Voice Model: Kokoro-82M Voicepack: af… See the full description on the dataset page: https://huggingface.co/datasets/rokeya71/KokoroTTS-af-Synthetic-QA.

  20. GSCF_finetune

    • huggingface.co
    Cite
    Supply Chain AI Research at The Ohio State University, GSCF_finetune [Dataset]. https://huggingface.co/datasets/Supply-Chain-AI-Research/GSCF_finetune
    Explore at:
    Dataset authored and provided by
    Supply Chain AI Research at The Ohio State University
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GSCF Q&A Dataset for Fine-Tuning

      Dataset Summary
    

    This dataset contains a collection of question-and-answer pairs specifically designed for fine-tuning Large Language Models (LLMs) on the Global Supply Chain Forum (GSCF) framework, a leading process model developed at The Ohio State University. The data is structured to train a model to act as an expert supply chain consultant. The content covers two main types of interactions:

    Definitional Knowledge: Questions that… See the full description on the dataset page: https://huggingface.co/datasets/Supply-Chain-AI-Research/GSCF_finetune.
