License: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can easily be achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
License: https://cdla.io/sharing-1-0/
This dataset can be used to train Large Language Models such as GPT, Llama2 and Falcon, both for Fine Tuning and Domain Adaptation.
The dataset has the following specs:
The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:
For a full list of verticals and their intents, see https://www.bitext.com/chatbot-verticals/.
The question/answer pairs have been generated using a hybrid methodology that uses natural texts as source text, NLP technology to extract seeds from these texts, and NLG technology to expand the seed texts. All steps in the process are curated by computational linguists.
The dataset contains an extensive amount of text data across its 'instruction' and 'response' columns. After processing and tokenizing the dataset, we've identified a total of 3.57 million tokens. This rich set of tokens is essential for training advanced conversational AI, generative AI, and Question Answering (Q&A) models.
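As an illustration, a token count like the one above is obtained by tokenizing both columns and summing. The sketch below uses a whitespace split as a stand-in for a real model tokenizer (a real count would use the tokenizer of the model being fine-tuned), and the sample rows are invented:

```python
# Invented sample rows mirroring the 'instruction'/'response' layout.
rows = [
    {"instruction": "how do I cancel my booking",
     "response": "You can cancel it from your account page"},
    {"instruction": "change my flight date",
     "response": "Use the manage booking section"},
]

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: whitespace split. Swap in the real
    # tokenizer of the model you plan to fine-tune.
    return len(text.split())

# Sum token counts over both text columns.
total = sum(count_tokens(r["instruction"]) + count_tokens(r["response"])
            for r in rows)
print(total)  # → 23
```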
Each entry in the dataset contains the following fields:
The categories and intents covered by the dataset are:
The entities covered by the dataset are:
License: https://creativecommons.org/publicdomain/zero/1.0/
This JSON file contains a collection of conversational AI intents designed to motivate and interact with users. The intents cover various topics, including greetings, weather inquiries, hobbies, music, movies, farewells, informal and formal questions, math operations and formulas, prime numbers, geometry concepts, math puzzles, and even a Shakespearean poem.
The additional intents related to consolidating people and motivating them have been included to provide users with uplifting and encouraging responses. These intents aim to offer support during challenging times, foster teamwork, and provide words of motivation and inspiration to users seeking guidance and encouragement.
The JSON structure is organized into individual intent objects, each containing a tag to identify the intent, a set of patterns representing user inputs, and corresponding responses provided by the AI model. This dataset can be used to train a conversational AI system to engage in positive interactions with users and offer motivational messages.
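The structure described above can be sketched as follows; the field names (`tag`, `patterns`, `responses`) follow the common intents-JSON convention, and the two intents shown are invented examples, not entries from the actual file:

```python
import json

# Minimal intents JSON in the shape described: each intent object has
# a tag, user-input patterns, and candidate responses.
intents_json = """
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["hi", "hello", "hey there"],
      "responses": ["Hello! How can I help you today?"]
    },
    {
      "tag": "motivation",
      "patterns": ["I feel stuck", "cheer me up"],
      "responses": ["You've got this. One small step at a time."]
    }
  ]
}
"""

data = json.loads(intents_json)
tags = [intent["tag"] for intent in data["intents"]]
print(tags)  # → ['greeting', 'motivation']
```

A training pipeline would typically flatten each (pattern, tag) pair into classifier training examples.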
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The University Chatbot Dataset contains 38 intents covering general university-related inquiries, designed to train, fine-tune, and evaluate conversational AI models in the education sector.
License: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can easily be achieved using our two-step approach to LLM Fine-Tuning.… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.
We’ve developed another annotated dataset designed specifically for conversational AI and companion AI model training.
What you have here on Kaggle is our free sample: think Salon Kitty meets AI.
The 'Time Waster Identification & Retreat Model Dataset' enables AI handler agents to detect when users are likely to churn, saving valuable tokens and preventing wasted compute cycles in conversational models.
This batch has 167 entries annotated for sentiment, intent, user risk flagging (via behavioural tracking), and per-statement user Recovery Potential, among others. It is designed as a niche micro dataset for a specific use case: Time Waster Identification and Retreat.
👉 Buy the updated version: https://lifebricksglobal.gumroad.com/l/Time-WasterDetection-Dataset
This dataset is perfect for:
It is designed for AI researchers and developers building:
Use case:
👉 Good for teams working on conversational AI, companion AI, fraud detectors and those integrating routing logic for voice/chat agents
Contact us on LinkedIn: Life Bricks Global.
License:
This dataset is provided under a custom license. By using the dataset, you agree to the following terms:
Usage: You are allowed to use the dataset for non-commercial purposes, including research, development, and machine learning model training.
Modification: You may modify the dataset for your own use.
Redistribution: Redistribution of the dataset in its original or modified form is not allowed without permission.
Attribution: Proper attribution must be given when using or referencing this dataset.
No Warranty: The dataset is provided "as-is" without any warranties, express or implied, regarding its accuracy, completeness, or fitness for a particular purpose.
License: https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can easily be achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.
License: https://creativecommons.org/publicdomain/zero/1.0/
Chatbots are used by almost every tech company and have become very popular these days. I decided to build a chatbot, and this dataset is a good way to get hands-on experience doing so.
Contribute to this dataset and enjoy Kaggling!
This dataset contains example utterances and their corresponding intents from the Customer Support domain. The data can be used to train intent recognition models for Natural Language Understanding (NLU) platforms.
The dataset covers the "Customer Support" domain and includes 27 intents grouped in 11 categories. These intents have been selected from Bitext's collection of 20 domain-specific datasets (banking, retail, utilities...), keeping the intents that are common across domains. See below for a full list of categories and intents.
The dataset contains over 20,000 utterances, with a varying number of utterances per intent. These utterances have been extracted from a larger dataset of 288,000 utterances (approx. 10,000 per intent), including language register variations such as politeness, colloquial, swearing, indirect style... To select the utterances, we use stratified sampling to generate a dataset with a general user language register profile.
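Stratified sampling of the kind described above can be sketched as follows: group utterances by intent, then draw a fixed number from each stratum so every intent stays represented. The corpus and per-stratum size here are invented for illustration:

```python
import random
from collections import defaultdict

random.seed(42)

# Invented corpus: 100 utterance variants for each of two intents.
corpus = (
    [{"intent": "cancel_order", "utterance": f"cancel variant {i}"} for i in range(100)]
    + [{"intent": "track_refund", "utterance": f"refund variant {i}"} for i in range(100)]
)

# Group by intent (the stratification key).
by_intent = defaultdict(list)
for row in corpus:
    by_intent[row["intent"]].append(row)

# Draw the same number of utterances from each stratum.
per_intent = 10
sample = [r for rows in by_intent.values() for r in random.sample(rows, per_intent)]
print(len(sample))  # → 20
```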
The dataset also reflects commonly occurring linguistic phenomena of real-life chatbots, such as:
- spelling mistakes
- run-on words
- missing punctuation
Each entry in the dataset contains an example utterance from the Customer Support domain, along with its corresponding intent, category and additional linguistic information. Each line contains the following four fields:
- flags: the applicable linguistic flags
- utterance: an example user utterance
- category: the high-level intent category
- intent: the intent corresponding to the user utterance
The dataset contains annotations for linguistic phenomena, which can be used to adapt bot training to different user language profiles. These flags are:
- B: Basic syntactic structure
- S: Syntactic structure
- L: Lexical variation (synonyms)
- M: Morphological variation (plurals, tenses…)
- I: Interrogative structure
- C: Complex/Coordinated syntactic structure
- P: Politeness variation
- Q: Colloquial variation
- W: Offensive language
- E: Expanded abbreviations (I'm -> I am, I'd -> I would…)
- D: Indirect speech (ask an agent to…)
- Z: Noise (spelling, punctuation…)
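Since each flag is a single letter, the flags field can be filtered with simple substring checks. The sketch below assumes the four-column CSV layout described above (flags, utterance, category, intent); the sample rows are invented:

```python
import csv
import io

# Invented rows in the documented layout: flags, utterance, category, intent.
sample = """flags,utterance,category,intent
BL,i want to cancel my order,ORDER,cancel_order
BQ,gotta kill my order asap,ORDER,cancel_order
BP,could you please help me track my refund,REFUNDS,track_refund
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Keep only utterances carrying the politeness flag (P).
polite = [r["utterance"] for r in rows if "P" in r["flags"]]
print(polite)  # → ['could you please help me track my refund']
```

The same pattern selects colloquial (Q), offensive (W), or noisy (Z) subsets when adapting training data to a target user profile.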
These phenomena make the training dataset more effective and make bots more accurate and robust.
The intent categories covered by the dataset are: ACCOUNT, CANCELLATION_FEE, CONTACT, DELIVERY, FEEDBACK, INVOICES, NEWSLETTER, ORDER, PAYMENT, REFUNDS, SHIPPING.
The intents covered by the dataset are: cancel_order, complaint, contact_customer_service, contact_human_agent, create_account, change_order, change_shipping_address, check_cancellation_fee, check_invoices, check_payment_methods, check_refund_policy, delete_account, delivery_options, delivery_period, edit_account, get_invoice, get_refund, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, review, set_up_shipping_address, switch_account, track_order, track_refund.
(c) Bitext Innovations, 2020
License: https://creativecommons.org/publicdomain/zero/1.0/
📌 Overview This dataset is designed to support research in AI-driven language learning, specifically for chatbot-based English tutoring. It includes intent classification for chatbot interactions and grammatical error correction to assist users in improving their English proficiency.
📊 Dataset Structure The dataset consists of 200 rows with the following columns:
- Sentence → User queries for intent classification (e.g., "Can you check my grammar?")
- Intent → Categorized chatbot responses (e.g., Grammar_Check, Vocabulary_Assistance)
- Incorrect_Sentence → Common grammatical errors in English writing
- Corrected_Sentence → AI-corrected versions of the incorrect sentences
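The four columns above support two tasks at once: intent classification and grammatical error correction. A minimal sketch of building correction training pairs from them, using an invented row in the documented column layout:

```python
# Invented row following the documented columns.
rows = [
    {
        "Sentence": "Can you check my grammar?",
        "Intent": "Grammar_Check",
        "Incorrect_Sentence": "She go to school every day.",
        "Corrected_Sentence": "She goes to school every day.",
    },
]

# Error-correction pairs: incorrect text in, corrected text out.
correction_pairs = [
    {"input": r["Incorrect_Sentence"], "target": r["Corrected_Sentence"]}
    for r in rows
]

# Intent-classification pairs: user query in, intent label out.
intent_pairs = [(r["Sentence"], r["Intent"]) for r in rows]

print(correction_pairs[0]["target"])  # → She goes to school every day.
```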
License: https://www.licenses.ai/ai-licenses
A question-answer dataset for training LLMs or Text2Text models as an AI-based girlfriend; it can also be used for other analyses. Taken from multiple sources and combined. Please ensure that if you train a model, it is a responsible one.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level English usage with:
Every chat instance is accompanied by structured metadata, which includes:
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
License: https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset of conversations in JSON format. You can use these intents to train your chatbot for different types of conversation, and you can also modify them to train your chatbot for new conversations.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset provides detailed, synthetic healthcare chatbot conversations with annotated intent labels, message sequencing, and extracted entities. Designed for training and evaluating conversational AI, it supports intent classification, dialogue modeling, and entity recognition in healthcare virtual assistants. The dataset enables robust analysis of user-bot interactions for improved patient engagement and automation.
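A record combining the three annotation layers described above (intent label, message sequencing, extracted entities) might look like the sketch below. The field names and the sample conversation are assumptions for illustration only, not the dataset's actual schema:

```python
# Hypothetical annotated conversation: each turn carries a sequence
# index, a speaker role, an intent label, and extracted entities.
conversation = [
    {"turn": 1, "speaker": "user",
     "text": "I need to book an appointment with a cardiologist",
     "intent": "book_appointment",
     "entities": {"specialty": "cardiologist"}},
    {"turn": 2, "speaker": "bot",
     "text": "Sure, which day works best for you?",
     "intent": None, "entities": {}},
]

# Intent classification typically trains on the user turns only.
user_intents = [t["intent"] for t in conversation if t["speaker"] == "user"]
print(user_intents)  # → ['book_appointment']
```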
License: https://data.macgence.com/terms-and-conditions
Get a high-quality chatbot dataset for AI/ML models in Hospitality Sector. Ideal for NLP training, improving chatbot responses, and enhancing conversational AI.
License: https://data.macgence.com/terms-and-conditions
Get a high-quality chatbot dataset for AI/ML models in BFSI Sector. Train with diverse conversational data for accurate, efficient machine learning applications
License: https://data.macgence.com/terms-and-conditions
High-quality chatbot dataset for AI/ML models in Ecommerce Sector. Train NLP algorithms with diverse conversational data to enhance chatbot accuracy.
This dataset was created by Abhishek Srivastava.
Victorano/Bitext-customer-support-llm-chatbot-training-dataset-4k-seed42, a dataset hosted on Hugging Face and contributed by the HF Datasets community.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of Vietnamese healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes: