https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.
This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.
The dataset contains intents. An "intent" is the intention behind a user's message. For instance, if a user says "I am sad" to the chatbot, the intent would be "sad". Each intent has an associated set of Patterns and Responses: Patterns are examples of user messages that align with the intent, while Responses are the replies the chatbot provides in accordance with the intent. Various intents are defined, and their patterns and responses serve as the model's training data for identifying a particular intent.
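A minimal sketch of the intent structure just described, with a naive keyword matcher. The field names ("tag", "patterns", "responses") and the example content are illustrative, not taken from the dataset itself:

```python
# One intent, in the Patterns/Responses shape described above
# (illustrative field names, not the dataset's actual schema).
intents = [
    {
        "tag": "sad",
        "patterns": ["I am sad", "I feel down", "I'm feeling low"],
        "responses": ["I'm sorry to hear that. Do you want to talk about it?"],
    },
]

def match_intent(message, intents):
    """Naive matcher: return the tag whose patterns share the most
    words with the user's message (a trained classifier would replace this)."""
    words = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & set(pattern.lower().split()))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag

print(match_intent("I am so sad today", intents))  # → sad
```

In practice the patterns are used to train an intent classifier rather than matched literally; the sketch only shows how the three fields relate.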
This dataset was created by Abhishek Srivastava
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to Incribo's synthetic chatbot dataset! Crafted with precision, this dataset offers a realistic representation of travel history, making it an ideal playground for various analytical tasks.
Use the chatbot dataset to help you assess user sentiments, locations/IPs, language, platform, and more!
Remember, this is just a sample! If you're intrigued and want access to the complete dataset or have specific requirements, don't hesitate to contact us(info@incribo.com). Happy building!
About us: Incribo is a synthetic data generation company focused on delivering high-quality training, testing, and fine-tuning data to engineering teams where accessing data is impossible, either because the data doesn't exist or because it's locked behind privacy regulations.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes: • Prompt: Defines the AI’s role/persona. • Response: A natural, immersive reply fitting the persona.
Example Entries:
```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
```
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
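For the fine-tuning use case above, prompt/response rows are typically converted into the chat-message format most fine-tuning pipelines expect. A minimal sketch; the conversion and role assignments are illustrative, not the dataset's official format:

```python
# Convert prompt/response rows (as in the example entries) into
# chat-format training examples. The system/assistant role mapping
# is an assumption about how a fine-tuning pipeline might use them.
rows = [
    {"prompt": "You are a celestial guardian.",
     "response": "The stars whisper secrets that only I can hear..."},
]

def to_chat_example(row):
    return {
        "messages": [
            {"role": "system", "content": row["prompt"]},       # persona definition
            {"role": "assistant", "content": row["response"]},  # in-character reply
        ]
    }

examples = [to_chat_example(r) for r in rows]
print(examples[0]["messages"][0]["role"])  # system
```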
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains SQuAD and NarrativeQA dataset files.
Train conversational AI with the ChatBot Dataset for Transformers. Featuring human-like dialogues, preprocessed inputs, and labels, it’s perfect for GPT, BERT, T5, and NLP projects
https://www.thebusinessresearchcompany.com/privacy-policy
The global AI Training Dataset market size is expected to reach $6.98 billion by 2029, growing at a compound annual growth rate of 21.5%. The text segment includes natural language processing (NLP) datasets, chatbot training datasets, sentiment analysis datasets, and language translation datasets.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This training dataset comprises more than 10,000 conversational text exchanges between two native Bahasa speakers in the general domain. We have a collection of chats on a variety of topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., which makes the dataset diverse.
These chats consist of language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to keep the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The license for this training dataset belongs to FutureBeeAI!
This dataset includes FAQ data and their categories to train a chatbot specialized for the e-learning system used at Tokyo Metropolitan University. We report the accuracy of the chatbot in the following papers.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "Supporting Creation of FAQ Dataset for E-learning Chatbot", Intelligent Decision Technologies, Smart Innovation, IDT'19, Springer, 2019, to appear.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "An FAQ Dataset for E-learning System Used on a Japanese University", Data in Brief, Elsevier, in press.
This dataset is based on real Q&A data about how to use the e-learning system, asked by the students and teachers who use it in practical classes. The Q&A data were collected from April 2015 to July 2018.
We attach an English version of the dataset, translated from the Japanese, to make it easier to understand what the dataset contains. Note that we did not perform any evaluations on the English version; there are no results on how accurately chatbots respond to its questions.
File contents:
Results of statistical analyses for the dataset. We used the Calinski-Harabasz method, mutual information, the Jaccard index, TF-IDF + KL divergence, and TF-IDF + JS divergence to measure the quality of the dataset. In the analyses, we regard each answer as a cluster of questions. We also perform the same analyses for categories by regarding them as clusters of answers.
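The TF-IDF + KL-divergence analysis above compares term distributions between clusters. A minimal sketch of discrete KL divergence; the example distributions and the smoothing constant are illustrative, not values from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.
    eps guards against log(0); in the analysis above, p and q would be
    TF-IDF-derived term distributions of two clusters."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]  # term distribution of cluster A (illustrative)
q = [0.4, 0.4, 0.2]  # term distribution of cluster B (illustrative)
print(round(kl_divergence(p, q), 4))  # → 0.0253
```

KL divergence is asymmetric; the JS divergence also mentioned above is its symmetrized, bounded variant.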
Grants: JSPS KAKENHI Grant Number 18H01057
https://creativecommons.org/publicdomain/zero/1.0/
In a toy project chatbot: - https://github.com/steve-levesque/Portfolio-NLP-ChatbotStoreInventory
Based on the structure in this article: - https://chatbotsmagazine.com/contextual-chat-bots-with-tensorflow-4391749d0077
https://data.macgence.com/terms-and-conditions
Access our chatbot image dataset designed for AI training. Ideal for boosting visual recognition, enhancing chatbot interfaces, and optimizing user experience.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset consists of 400 text-only, fine-tuned multi-turn conversations in English, based on 10 categories and 19 use cases. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback.
The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.
Key Features
- Prompts focused on user intent, devised using natural language processing techniques.
- Multi-turn prompts with up to 5 turns to enhance the responsive memory of large language models during pretraining.
- Conversational interactions covering queries related to varied aspects of writing, coding, knowledge assistance, data manipulation, reasoning, and classification.
Dataset Source
Subject-matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.
Structure & Fields
The dataset is organized into different columns, which are detailed below:
- P1, R1, P2, R2, P3, R3, P4, R4, P5, R5 (object): The sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs.
- Use Case (object): The primary application or scenario the interaction is designed for, such as "Q&A helper" or "Writing assistant." This classification helps identify the purpose of the dialogue.
- Type (object): The complexity of the interaction; entries in this dataset are labeled "Complex," denoting dialogues that involve intricate, multi-layered exchanges.
- Category (object): A broad categorization of the interaction type, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it generates creative content, provides detailed answers, or engages in complex problem-solving.

Intended Use Cases
The dataset can enhance query-assistance models for shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based queries. It is intended to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can pre-train large language models with supervised fine-tuned annotated data and support retrieval-augmented generative models. The dataset is free of violence-based interactions that could lead to harm, conflict, discrimination, brutality, or misinformation.

Potential Limitations & Biases
This is a static dataset, so the information is dated May 2024.
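The P1..R5 column layout described above can be flattened into ordered turns for training. A minimal sketch; the row contents and empty-slot convention are invented for illustration:

```python
# Flatten one row of the P1..R5 schema into ordered (prompt, response)
# turns, skipping unused slots. The sample row is illustrative only.
row = {
    "P1": "Draft a short apology email.", "R1": "Sure, here is a draft...",
    "P2": "Make it more formal.",         "R2": "Certainly, a formal version...",
    "P3": None, "R3": None, "P4": None, "R4": None, "P5": None, "R5": None,
}

turns = []
for i in range(1, 6):
    p, r = row.get(f"P{i}"), row.get(f"R{i}")
    if p and r:  # keep only completed prompt/response pairs
        turns.append((p, r))

print(len(turns))  # 2
```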
Note
If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact us at SoftAge Information Technology Limited at info@softage.ai.
I was developing a chatbot called Amdere Bot that identifies people suffering from depression and helps them cope with depression. There wasn't much data available online to train the bot to identify depression, so I decided to create a conversation dataset from scratch that can be used to train the bot.
I created a .yml file that contains questions and answers. It contains technical questions about depression and anxiety, as well as questions that a depressed person is most likely to ask a bot, and answers to those questions.
Depression is a common mental disorder. Globally, more than 264 million people of all ages suffer from depression. It is a leading cause of disability worldwide. Especially in the current pandemic, the rate of depression and anxiety has increased exponentially, and at present it is even difficult to attend physical therapy. This motivated me to create a bot that can not only answer trivia but is also trained to answer questions related to depression, helping people suffering from depression and anxiety. It could even aid suicide helplines by talking to a person until one of their workers is free to take the call. I would love for people to use this data to create a bot that helps people suffering from mental health issues.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of several columns that provide essential information for each entry. These columns include:
- instruction: The specific instruction given to the model for generating a response.
- responses: The model-generated responses to the given instruction.
- next_response: The subsequent response generated by the model after the previous response.
- answer: The correct answer to the question asked in the instruction.
- is_human_response: A boolean indicating whether a particular response was generated by a human or by an AI model.

By analyzing this rich and diverse dataset, researchers and practitioners can gain valuable insights into various aspects of question-answering tasks using AI models. It offers an opportunity for developers to train their models effectively while also facilitating rigorous evaluation methodologies.
Please note that specific dates are not included in this dataset description, which focuses solely on accurate, informative, descriptive details about its content and purpose.
How to use the dataset
Understanding the Columns: This dataset contains several columns that provide important information for each entry:
- instruction: The instruction given to the model for generating a response.
- responses: The model-generated responses to the given instruction.
- next_response: The next response generated by the model after the previous response.
- answer: The correct answer to the question asked in the instruction.
- is_human_response: Indicates whether a response was generated by a human or the model.

Training Data (train.csv): Use the train.csv file as training data. It contains a large number of examples that you can use to train your question-answering models or algorithms.
Testing Data (test.csv): Use test.csv file in this dataset as testing data. It allows you to evaluate how well your models or algorithms perform on unseen questions and instructions.
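The column layout above can be read with Python's standard library. A minimal sketch; the inlined sample rows stand in for train.csv and are not from the dataset:

```python
import csv
import io

# Inline sample standing in for a few rows of train.csv
# (column names follow the description above; values are invented).
sample = io.StringIO(
    "instruction,responses,next_response,answer,is_human_response\n"
    "What is 2+2?,It is 4.,Anything else?,4,False\n"
    "Capital of France?,Paris.,Need more?,Paris,True\n"
)

rows = list(csv.DictReader(sample))
# Split on the is_human_response flag, e.g. to compare human vs. model answers.
human = [r for r in rows if r["is_human_response"] == "True"]
model = [r for r in rows if r["is_human_response"] == "False"]
print(len(human), len(model))  # 1 1
```

For the real files, replace the `io.StringIO` sample with `open("train.csv")`.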
Create Machine Learning Models: You can use this dataset's components (instructions, responses, next responses, and human-generated answers), along with the is_human_response labels (True/False), to train machine learning models designed for question-answering tasks.
Evaluate Model Performance: After training your model using the provided training data, you can then test its performance on unseen questions from test.csv file by comparing its predicted responses with actual human-generated answers.
Data Augmentation: You can also augment this existing data in various ways such as paraphrasing existing instructions or generating alternative responses based on similar contexts within each example.
Build Conversational Agents: This dataset can be useful for training conversational agents or chatbots by leveraging the instruction-response pairs.
Remember, this dataset provides a valuable resource for building and evaluating question-answering models. Have fun exploring the data and discovering new insights!
Research Ideas
Language Understanding: This dataset can be used to train models for question-answering tasks. Models can learn to understand and generate responses based on given instructions and previous responses.
Chatbot Development: With this dataset, developers can create chatbots that provide accurate and relevant answers to user questions. The models can be trained on various topics and domains, allowing the chatbot to answer a wide range of questions.
Educational Materials: This dataset can be used to develop educational materials, such as interactive quizzes or study guides. The models trained on this dataset can provide instant feedback and answers to students' questions, enhancing their learning experience.
Information Retrieval Systems: By training models on this dataset, information retrieval systems can be developed that help users find specific answers or information from large datasets or knowledge bases.
Customer Support: This dataset can be used in training customer support chatbots or virtual assistants that can provide quick and accurate responses to customer inquiries.
Language Generation Research: Researchers studying natural language generation (NLG) techniques could use this dataset for developing novel algorithms for generating coherent and contextually appropriate responses in question-answering scenarios.
Automatic Summarization Systems: Using the instruction-response pairs, automatic summarization systems could be trained that generate concise summaries of lengthy texts by understanding the main content of the text through answering questions.
Dialogue Systems Evaluation: The instruction-response pairs in this dataset could serve as a benchmark for evaluating the performance of dialogue systems in terms of response quality, relevance, coherence, etc.
Machine Learning Training Data Augmentation: One clever idea…
We release the Douban Conversation Corpus, comprising a training set, a development set, and a test set for retrieval-based chatbots. The statistics of the Douban Conversation Corpus are shown in the following table.
| | Train | Val | Test |
|---|---|---|---|
| Session-response pairs | 1M | 50k | 10k |
| Avg. positive responses per session | 1 | 1 | 1.18 |
| Fleiss' kappa | N/A | N/A | 0.41 |
| Min. turns per session | 3 | 3 | 3 |
| Max. turns per session | 98 | 91 | 45 |
| Avg. turns per session | 6.69 | 6.75 | 5.95 |
| Avg. words per utterance | 18.56 | 18.50 | 20.74 |
The test data contain 1,000 dialogue contexts, and for each context we created 10 responses as candidates. We recruited three labelers to judge whether each candidate is a proper response to the session; a proper response means the response naturally replies to the message given the context. Each pair received three labels, and the majority label was taken as the final decision.
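The three-annotator majority vote described above can be sketched as follows; the 0/1 label encoding is an assumption for illustration:

```python
from collections import Counter

def majority_label(labels):
    """Return the label assigned by the majority of annotators.
    With three binary labels (1 = proper response, 0 = not), a strict
    majority always exists, so no tie-breaking is needed."""
    label, count = Counter(labels).most_common(1)[0]
    return label

print(majority_label([1, 1, 0]))  # 1
print(majority_label([0, 0, 0]))  # 0
```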
As far as we know, this is the first human-labeled test set for retrieval-based chatbots. The entire corpus is available at https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0
https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: context, knowledge, and response.
Here's a snippet from the dataset to give you an idea of its structure:
```json
[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard.",
      // ... (6 more lines of context)
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  },
  // ... (more data samples)
]
```
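A sample in the structure shown above can be loaded and flattened into a single model input. A minimal sketch; the `[SEP]` separator token is an assumption about downstream use, not part of the dataset:

```python
import json

# A strictly valid JSON sample in the dataset's structure
# (the snippet above uses // comments, which real JSON does not allow).
sample = json.loads("""
[
  {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  }
]
""")

dialogue = sample[0]
# Join the context turns into one input string for a response model.
model_input = " [SEP] ".join(dialogue["context"])
print(model_input)
print(dialogue["response"])
```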
The French Movie Subtitle Conversations dataset serves as a valuable resource for a range of dialogue-modeling applications.
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the materials and results used in the development and empirical validation of C4Q 2.0. The files are organized into two main directories: bakend_testing and empirical_validation. Below is a brief overview of the contents:
- app/: The full source code of C4Q as of the time this paper was submitted, allowing for reproducibility and further development. The frontend's node_modules directory is not included, but these dependencies can be generated by following the installation instructions in the README.md file.
README.md – A guide detailing how to locally set up and run C4Q.
- bakend_testing/: Data from the evaluation of C4Q's backend components:
  - reportBackendC4Q2.0.html: An HTML report generated by running 189 tests on the backend of C4Q.
  - classLLM_20241110101327.pth_training_metrics.csv: A CSV file documenting the classification LLM's training and validation metrics, including training loss, validation loss, training accuracy, and validation accuracy for each epoch.
  - qaLLM_evaluation_metrics.txt: A text file listing exact-match and F1 metrics per epoch for the QA LLM.
- create_data/: The scripts used to generate and curate training data for the classification LLM and the QA LLM.
- empirical_validation/: Data from the empirical evaluation of C4Q against other chatbots:
  - Directories named by model (e.g., openai-o1/, deepseek-coder_33b/, deepseek-r1/, etc.): Each contains the raw answers produced by the respective model in response to our set of quantum computing and software engineering questions.
  - prompts.txt: The full list of prompts used during the empirical evaluation.
  - requirements_qiskit0.46.3: A requirements file for a Python environment with Qiskit versions <1.0.0, enabling code-snippet testing under older Qiskit releases.
  - requirements_qiskit1.3.1: A requirements file for a Python environment with Qiskit versions >=1.0.0, ensuring reproducible tests under newer releases.
  - results.xlsx: An Excel spreadsheet containing the empirical evaluation outcomes, including correct, incomplete, and incorrect answer rates for each model under both Qiskit environments.
  - script_gates.sh: A shell script that automates prompting Ollama's deepseek-coder:33b and starcoder2:15b models with gate-related quantum questions.
  - script_SE.sh: A shell script that automates prompting Ollama's deepseek-coder:33b and starcoder2:15b models with software engineering problem questions.