100+ datasets found
  1. Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.

  2. Bitext-travel-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jun 21, 2025
    Cite
    Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.

  3. Mental Health Conversational Data

    • kaggle.com
    Updated Oct 31, 2022
    Cite
    elvis (2022). Mental Health Conversational Data [Dataset]. https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    elvis
    Description

    A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.

    This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.

    The dataset is organized into intents. An “intent” is the intention behind a user's message; for instance, if a user says “I am sad” to the chatbot, the intent would be “sad”. Each intent has a set of Patterns and Responses: Patterns are example user messages that align with the intent, while Responses are the replies the chatbot gives for that intent. The defined intents, together with their patterns and responses, form the training data used to classify a particular intent.
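    A minimal sketch of this intents layout, using the common tag/patterns/responses keys; the actual field names in the Kaggle file may differ:

    ```python
    import json

    # Hypothetical intents layout in the "tag / patterns / responses" style;
    # the actual keys in the dataset may differ.
    intents_json = """
    {
      "intents": [
        {"tag": "sad",
         "patterns": ["I am sad", "I feel down", "I'm feeling low"],
         "responses": ["I'm sorry to hear that. Do you want to talk about it?"]},
        {"tag": "greeting",
         "patterns": ["Hi", "Hello", "Hey there"],
         "responses": ["Hello! How are you feeling today?"]}
      ]
    }
    """

    intents = json.loads(intents_json)["intents"]

    # (pattern, tag) pairs are the training examples for an intent classifier.
    training_pairs = [(p, it["tag"]) for it in intents for p in it["patterns"]]
    print(training_pairs[:3])
    ```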

  4. FAQ Datasets for Chatbot Training

    • kaggle.com
    Updated Jun 30, 2020
    Cite
    Abhishek Srivastava (2020). FAQ Datasets for Chatbot Training [Dataset]. https://www.kaggle.com/datasets/abbbhishekkk/faq-datasets-for-chatbot-training/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abhishek Srivastava
    Description

    Dataset

    This dataset was created by Abhishek Srivastava

    Contents

  5. NLP Chatbot Dataset

    • kaggle.com
    Updated Feb 2, 2025
    Cite
    Incribo (2025). NLP Chatbot Dataset [Dataset]. https://www.kaggle.com/datasets/teamincribo/nlp-chatbot-dataset/versions/6
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Incribo
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Welcome to Incribo's synthetic chatbot dataset! Crafted with precision, this dataset offers a realistic representation of travel history, making it an ideal playground for various analytical tasks.

    Use the chatbot dataset to help you assess user sentiments, locations/IPs, language, platform, and more!

    Remember, this is just a sample! If you're intrigued and want access to the complete dataset or have specific requirements, don't hesitate to contact us (info@incribo.com). Happy building!

    About us: Incribo is a synthetic data generation company focused on delivering high-quality training, testing, and fine-tuning data to engineering teams in cases where data is either impossible to access because it doesn't exist or is locked behind privacy regulations.

  6. RolePlay DataSet

    • kaggle.com
    Updated Feb 16, 2025
    Cite
    Vampelium (2025). RolePlay DataSet [Dataset]. https://www.kaggle.com/datasets/vampelium/roleplay-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vampelium
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)

    This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.

    Dataset Structure:

    Each row includes:
    • Prompt: Defines the AI’s role/persona.
    • Response: A natural, immersive reply fitting the persona.

    Example Entries:

    ```json
    {"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
    {"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
    {"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
    ```
    How to Use:
      1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
      2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
      3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
    
    This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
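    As a sketch of step 1 above (fine-tuning), the prompt/response rows could be converted into chat-style records; the file names here are placeholders, and the messages format is just one common convention:

    ```python
    import json

    # Hypothetical file names; the actual Kaggle file layout may differ.
    SRC = "roleplay.jsonl"       # one {"prompt": ..., "response": ...} object per line
    DST = "roleplay_chat.jsonl"  # chat-style records for supervised fine-tuning

    with open(SRC, encoding="utf-8") as src, open(DST, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            row = json.loads(line)
            record = {"messages": [
                {"role": "system", "content": row["prompt"]},       # persona definition
                {"role": "assistant", "content": row["response"]},  # in-character reply
            ]}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    ```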
    
  7. dataset

    • data.mendeley.com
    Updated Oct 4, 2023
    Cite
    Vignesh A (2023). dataset [Dataset]. http://doi.org/10.17632/cpp3bx8ghd.1
    Explore at:
    Dataset updated
    Oct 4, 2023
    Authors
    Vignesh A
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains SQuAD and NarrativeQA dataset files.

  8. ChatBot Dataset for Transformers

    • gts.ai
    json
    Updated Jan 9, 2025
    Cite
    GTS (2025). ChatBot Dataset for Transformers [Dataset]. https://gts.ai/dataset-download/chatbot-dataset-for-transformers/
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    Description

    Train conversational AI with the ChatBot Dataset for Transformers. Featuring human-like dialogues, preprocessed inputs, and labels, it’s suited to GPT, BERT, T5, and other NLP projects.

  9. AI Training Dataset Global Market Report 2025

    • thebusinessresearchcompany.com
    pdf,excel,csv,ppt
    Cite
    The Business Research Company, AI Training Dataset Global Market Report 2025 [Dataset]. https://www.thebusinessresearchcompany.com/report/ai-training-dataset-global-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset authored and provided by
    The Business Research Company
    License

    https://www.thebusinessresearchcompany.com/privacy-policy

    Description

    The global AI Training Dataset market size is expected to reach $6.98 billion by 2029, growing at a 21.5% compound annual growth rate. By text data type, it is segmented into natural language processing (NLP) datasets, chatbot training datasets, sentiment analysis datasets, and language translation datasets.

  10. General domain Human-Human conversation chats in Bahasa

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Bahasa [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 conversational text exchanges between two native Bahasa speakers in the general domain. It covers chats on a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, the internet, and movies, which makes the dataset diverse.

    These chats contain language-specific words and phrases and follow the native way of speaking, which makes them more information-rich for your NLP model. Beyond being topic-specific, each chat includes attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats, to keep the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. A total of 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  11. Data from: Japanese FAQ dataset for e-learning system

    • zenodo.org
    csv, html, tsv
    Updated Jan 24, 2020
    Cite
    Yasunobu Sumikawa; Masaaki Fujiyoshi; Hisashi Hatakeyama; Masahiro Nagai (2020). Japanese FAQ dataset for e-learning system [Dataset]. http://doi.org/10.5281/zenodo.2783642
    Explore at:
    Available download formats: csv, tsv, html
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yasunobu Sumikawa; Masaaki Fujiyoshi; Hisashi Hatakeyama; Masahiro Nagai
    Description

    This dataset includes FAQ data and their categories, used to train a chatbot specialized for the e-learning system at Tokyo Metropolitan University. We report the accuracy of the chatbot in the following papers.

    Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "Supporting Creation of FAQ Dataset for E-learning Chatbot", Intelligent Decision Technologies, Smart Innovation, IDT'19, Springer, 2019, to appear.

    Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "An FAQ Dataset for E-learning System Used on a Japanese University", Data in Brief, Elsevier, in press.

    This dataset is based on real Q&A data about how to use the e-learning system, asked by students and teachers who use it in practical classes. The Q&A data were collected from April 2015 to July 2018.

    We attach an English version of the dataset, translated from the Japanese original, to make it easier to understand what the dataset contains. Note that we did not perform any evaluations on the English version; there are no results on how accurately chatbots respond to its questions.

    File contents:

    • FAQ data (*.csv)
      1. Answer2Category.csv: Categories of answers.
      2. Answer2Tag.csv: Titles of answers.
      3. Answers.csv: IDs for answers and texts of answers.
      4. Categories.csv: Names of categories for answers.
      5. Questions.csv: Texts of questions and their corresponding answer IDs.
      6. Answers_english.csv: IDs for answers and texts of answers written in English.
      7. Categories_english.csv: Names of categories for answers and their corresponding English names.
      8. Questions_english.csv: Texts of questions and their corresponding answer IDs written in English.

    • Statistics (*.tsv)

      Results of statistical analyses of the dataset. We used the Calinski-Harabasz method, mutual information, the Jaccard index, TF-IDF+KL divergence, and TF-IDF+JS divergence to measure the quality of the dataset. In these analyses, we regard each answer as a cluster of questions. We also perform the same analyses for categories by regarding them as clusters of answers.
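    A small loading sketch for the FAQ files listed above; the column names are assumptions for illustration, since the CSV headers are not given here:

    ```python
    import pandas as pd

    # Column names below are assumptions; check the actual CSV headers.
    answers = pd.read_csv("Answers.csv")      # e.g., columns: answer_id, answer_text
    questions = pd.read_csv("Questions.csv")  # e.g., columns: question_text, answer_id

    # Join each question with its answer to obtain (question, answer) training pairs.
    qa_pairs = questions.merge(answers, on="answer_id", how="left")
    print(qa_pairs[["question_text", "answer_text"]].head())
    ```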

    Grants: JSPS KAKENHI Grant Number 18H01057

  12. Chatbot Store Inventory

    • kaggle.com
    Updated Feb 28, 2022
    Cite
    Steve Levesque (2022). Chatbot Store Inventory [Dataset]. https://www.kaggle.com/datasets/stevelevesque/chatbotstoreinventory/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Steve Levesque
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Used for

    In a toy project chatbot: - https://github.com/steve-levesque/Portfolio-NLP-ChatbotStoreInventory

    Acknowledgements

    Based on the structure in this article: - https://chatbotsmagazine.com/contextual-chat-bots-with-tensorflow-4391749d0077

  13. Chat Bot Image Dataset

    • data.macgence.com
    mp3
    Updated Jun 16, 2024
    Cite
    Macgence (2024). Chat Bot Image Dataset [Dataset]. https://data.macgence.com/dataset/chat-bot-image-dataset
    Explore at:
    Available download formats: mp3
    Dataset updated
    Jun 16, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Access our chatbot image dataset designed for AI training. Ideal for boosting visual recognition, enhancing chatbot interfaces, and optimizing user experience.

  14. Multi-turn Prompts Dataset

    • kaggle.com
    Updated Oct 25, 2024
    Cite
    SoftAge.AI (2024). Multi-turn Prompts Dataset [Dataset]. https://www.kaggle.com/datasets/softageai/multi-turn-prompts-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SoftAge.AI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset consists of 400 text-only, fine-tuned multi-turn conversations in English, spanning 10 categories and 19 use cases. It was generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback.

    The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.

    Key Features

    • Prompts focused on user intent, devised using natural language processing techniques.
    • Multi-turn prompts with up to 5 turns, to enhance the responsive memory of large language models for pretraining.
    • Conversational interactions covering queries related to writing, coding, knowledge assistance, data manipulation, reasoning, and classification.

    Dataset Source

    Subject-matter expert annotators at SoftAge.AI (@SoftAgeAI) annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.

    Structure & Fields

    The dataset is organized into the columns detailed below:

    • P1, R1, P2, R2, P3, R3, P4, R4, P5 (object): The sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. Prompts are user inputs; responses are the model's outputs.
    • Use Case (object): The primary application or scenario for which the interaction is designed, such as "Q&A helper" or "Writing assistant." This classification helps identify the purpose of the dialogue.
    • Type (object): The complexity of the interaction; entries in this dataset are labeled "Complex", denoting intricate, multi-layered exchanges.
    • Category (object): The broad interaction type, such as "Open-ended QA" or "Writing," indicating whether the conversation generates creative content, provides detailed answers, or engages in complex problem-solving.
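    A short sketch of how the P/R columns listed above could be flattened into chat-style message lists, assuming the table is distributed as a CSV (the file name is hypothetical):

    ```python
    import pandas as pd

    # Hypothetical file name; the actual Kaggle file may be named differently.
    df = pd.read_csv("multi_turn_prompts.csv")

    def to_messages(row):
        """Flatten the P1/R1 ... columns of one row into an ordered chat transcript."""
        messages = []
        for turn in range(1, 6):
            prompt = row.get(f"P{turn}")
            response = row.get(f"R{turn}")  # R5 does not exist; .get simply returns None
            if isinstance(prompt, str) and prompt.strip():
                messages.append({"role": "user", "content": prompt})
            if isinstance(response, str) and response.strip():
                messages.append({"role": "assistant", "content": response})
        return messages

    conversations = df.apply(to_messages, axis=1)
    print(conversations.iloc[0])
    ```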

    Intended Use Cases

    The dataset can enhance query-assistance models for shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based queries. It is intended to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and general-purpose chatbots. It can be used to pre-train large language models with supervised, fine-tuned annotated data, and for retrieval-augmented generative models. The dataset is free of violence-based interactions that could lead to harm, conflict, discrimination, brutality, or misinformation.

    Potential Limitations & Biases

    This is a static dataset, so the information is dated May 2024.

    Note: If you have any questions related to our data annotation and human-review services for large language model training and fine-tuning, please contact SoftAge Information Technology Limited at info@softage.ai.

  15. Depression data for chatbot

    • kaggle.com
    Updated Nov 12, 2020
    Cite
    Nupur Gopali (2020). Depression data for chatbot [Dataset]. https://www.kaggle.com/datasets/nupurgopali/depression-data-for-chatbot/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nupur Gopali
    Description

    Context

    I was developing a chatbot called Amdere Bot that identifies people suffering from depression and helps them cope with it. There wasn't much data available online to train the bot to identify depression, so I decided to create a conversation dataset from scratch that could be used to train the bot.

    Content

    I created a .yml file that contains questions and answers: technical questions about depression and anxiety, as well as questions that a depressed person is most likely to ask a bot, with answers to those questions.
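    Since the exact .yml layout is not specified here, the following sketch assumes a simple list of question/answer pairs; treat the keys as placeholders and adjust them to match the actual file:

    ```python
    import yaml  # PyYAML

    # Assumed (hypothetical) layout: a top-level "conversations" list of question/answer pairs.
    sample = """
    conversations:
      - question: "What is depression?"
        answer: "Depression is a common mental disorder that affects mood, thoughts, and daily life."
      - question: "I feel hopeless, what should I do?"
        answer: "Talking to someone you trust or to a professional can help. You are not alone."
    """

    data = yaml.safe_load(sample)
    pairs = [(item["question"], item["answer"]) for item in data["conversations"]]
    print(pairs[0])
    ```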

    Inspiration

    Depression is a common mental disorder. Globally, more than 264 million people of all ages suffer from depression, and it is a leading cause of disability worldwide. Especially during the current pandemic, rates of depression and anxiety have increased dramatically, and at present it is even difficult to attend physical therapy. This motivated me to create a bot that can not only answer trivia but is also trained to answer questions related to depression, helping people suffering from depression and anxiety. It could even aid suicide helplines by talking to a person until one of their workers is free to take the call. I would love for people to use this data to create a bot that helps people suffering from mental health issues.

  16. Question-Answering Training and Testing Data

    • opendatabay.com
    Updated Jun 23, 2025
    Cite
    Datasimple (2025). Question-Answering Training and Testing Data [Dataset]. https://www.opendatabay.com/data/ai-ml/d3c37fed-f830-444b-a988-c893d3396fd7
    Explore at:
    Available download formats: not specified
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The dataset consists of several columns that provide essential information for each entry. These columns include:

    • instruction: The specific instruction given to the model for generating a response.
    • responses: The model-generated responses to the given instruction.
    • next_response: The subsequent response generated by the model after the previous response.
    • answer: The correct answer to the question asked in the instruction.
    • is_human_response: A boolean indicating whether a particular response was generated by a human or by an AI model.

    By analyzing this rich and diverse dataset, researchers and practitioners can gain valuable insights into various aspects of question-answering tasks using AI models. It offers an opportunity for developers to train their models effectively while also facilitating rigorous evaluation methodologies.

    Please note that specific dates are not included in this dataset description, which focuses solely on accurate, informative, descriptive details about its content and purpose.

    How to use the dataset

    Understanding the Columns: This dataset contains several columns that provide important information for each entry:

    • instruction: The instruction given to the model for generating a response.
    • responses: The model-generated responses to the given instruction.
    • next_response: The next response generated by the model after the previous response.
    • answer: The correct answer to the question asked in the instruction.
    • is_human_response: Indicates whether a response was generated by a human or by the model.

    Training Data (train.csv): Use the train.csv file as training data. It contains a large number of examples that you can use to train your question-answering models or algorithms.

    Testing Data (test.csv): Use the test.csv file as testing data. It allows you to evaluate how well your models or algorithms perform on unseen questions and instructions.

    Create Machine Learning Models: You can utilize this dataset's instructional components, including instructions, responses, next_responses, and human-generated answers, along with their respective labels like is_human_response (True/False) for training machine learning models specifically designed for question-answering tasks.

    Evaluate Model Performance: After training your model using the provided training data, you can test its performance on unseen questions from the test.csv file by comparing its predicted responses with the actual human-generated answers.

    Data Augmentation: You can also augment this existing data in various ways such as paraphrasing existing instructions or generating alternative responses based on similar contexts within each example.

    Build Conversational Agents: This dataset can be useful for training conversational agents or chatbots by leveraging the instruction-response pairs.

    Remember, this dataset provides a valuable resource for building and evaluating question-answering models. Have fun exploring the data and discovering new insights!
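    A brief sketch of the workflow described above, assuming train.csv and test.csv carry the columns listed (value types may need adjusting):

    ```python
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Split human-written responses from model-generated ones via the boolean flag
    # (adjust the comparison if the flag is stored as a string).
    human = train[train["is_human_response"] == True]
    model = train[train["is_human_response"] == False]
    print(f"{len(human)} human vs {len(model)} model responses in train.csv")

    # Simple exact-match scoring of predicted responses against reference answers.
    def exact_match(predictions, references):
        pairs = list(zip(predictions, references))
        return sum(str(p).strip() == str(r).strip() for p, r in pairs) / len(pairs)
    ```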

    Research Ideas Language Understanding: This dataset can be used to train models for question-answering tasks. Models can learn to understand and generate responses based on given instructions and previous responses.

    Chatbot Development: With this dataset, developers can create chatbots that provide accurate and relevant answers to user questions. The models can be trained on various topics and domains, allowing the chatbot to answer a wide range of questions.

    Educational Materials: This dataset can be used to develop educational materials, such as interactive quizzes or study guides. The models trained on this dataset can provide instant feedback and answers to students' questions, enhancing their learning experience.

    Information Retrieval Systems: By training models on this dataset, information retrieval systems can be developed that help users find specific answers or information from large datasets or knowledge bases.

    Customer Support: This dataset can be used in training customer support chatbots or virtual assistants that can provide quick and accurate responses to customer inquiries.

    Language Generation Research: Researchers studying natural language generation (NLG) techniques could use this dataset for developing novel algorithms for generating coherent and contextually appropriate responses in question-answering scenarios.

    Automatic Summarization Systems: Using the instruction-response pairs, automatic summarization systems could be trained that generate concise summaries of lengthy texts by understanding the main content of the text through answering questions.

    Dialogue Systems Evaluation: The instruction-response pairs in this dataset could serve as a benchmark for evaluating the performance of dialogue systems in terms of response quality, relevance, coherence, etc.

    9. Machine Learning Training Data Augmentation: One clever idea…

  17. Douban Dataset

    • paperswithcode.com
    Updated Aug 9, 2020
    Cite
    Yu Wu; Wei Wu; Chen Xing; Ming Zhou; Zhoujun Li (2020). Douban Dataset [Dataset]. https://paperswithcode.com/dataset/douban
    Explore at:
    Dataset updated
    Aug 9, 2020
    Authors
    Yu Wu; Wei Wu; Chen Xing; Ming Zhou; Zhoujun Li
    Description

    We release the Douban Conversation Corpus, comprising a training set, a development set, and a test set for retrieval-based chatbots. The statistics of the corpus are shown in the following table.

    |                                     | Train | Val   | Test  |
    |-------------------------------------|-------|-------|-------|
    | Session-response pairs              | 1M    | 50k   | 10k   |
    | Avg. positive responses per session | 1     | 1     | 1.18  |
    | Fleiss Kappa                        | N/A   | N/A   | 0.41  |
    | Min turns per session               | 3     | 3     | 3     |
    | Max turns per session               | 98    | 91    | 45    |
    | Average turns per session           | 6.69  | 6.75  | 5.95  |
    | Average words per utterance         | 18.56 | 18.50 | 20.74 |

    The test data contains 1,000 dialogue contexts, and for each context we create 10 candidate responses. We recruited three labelers to judge whether a candidate is a proper response to the session, where a proper response is one that naturally replies to the message given the context. Each pair received three labels, and the majority label was taken as the final decision.


    As far as we know, this is the first human-labeled test set for retrieval-based chatbots. The entire corpus is available at https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0
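    Because each test context comes with 10 labeled candidates, a retrieval model is typically scored with recall-at-k; a generic sketch (not the authors' official evaluation script) follows:

    ```python
    def recall_at_k(scored_candidates, k):
        """scored_candidates: one list of (score, label) tuples per test context
        (10 candidates each for this corpus); label is 1 for a proper response.
        Returns the fraction of contexts with a proper response in the top k."""
        hits = 0
        for candidates in scored_candidates:
            ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
            if any(label == 1 for _, label in ranked[:k]):
                hits += 1
        return hits / len(scored_candidates)

    # Toy usage with two contexts of 3 candidates each (real contexts have 10).
    example = [
        [(0.9, 1), (0.4, 0), (0.1, 0)],
        [(0.7, 0), (0.6, 1), (0.2, 0)],
    ]
    print(recall_at_k(example, k=1))  # 0.5
    ```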

  18. French Conversations (from movie subtitles)

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dali Selmi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    French
    Description

    French Movie Subtitle Conversations Dataset

    Description

    Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

    Content Overview

    Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

    1. Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.
    2. Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.
    3. Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

    Data Sample

    Here's a snippet from the dataset to give you an idea of its structure:

    [
     {
      "context": [
       "Tu as attendu longtemps?",
       "Oui en effet.",
       "Je pense que c' est grossier pour un premier rencard.",
       // ... (6 more lines of context)
      ],
      "knowledge": "",
      "response": "On n' avait pas dit 9h?"
     },
     // ... (more data samples)
    ]
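    A loading sketch assuming each split is a JSON array shaped like the sample above; the file name is a placeholder:

    ```python
    import json

    # Placeholder file name; substitute the actual training split from the dataset.
    with open("train.json", encoding="utf-8") as f:
        conversations = json.load(f)

    # Build (context, knowledge, response) triples for training a dialogue model.
    examples = [
        ("\n".join(conv["context"]), conv.get("knowledge", ""), conv["response"])
        for conv in conversations
    ]
    if examples:
        print(f"{len(examples)} examples; first response: {examples[0][2]}")
    ```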
    

    Use Cases

    The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

    • Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.
    • Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.
    • Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.

    Why This Dataset

    • Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.
    • Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.
    • Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

    Acknowledgments

    We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

    Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.

  19. Bitext-restaurants-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 16, 2024
    Cite
    Bitext (2024). Bitext-restaurants-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.

  20. Architecture for a Trustworthy Quantum Chatbot [dataset]

    • data.mendeley.com
    Updated Mar 4, 2025
    Cite
    Yaiza Aragonés Soria (2025). Architecture for a Trustworthy Quantum Chatbot [dataset] [Dataset]. http://doi.org/10.17632/vk9pf5nf7v.1
    Explore at:
    Dataset updated
    Mar 4, 2025
    Authors
    Yaiza Aragonés Soria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all the materials and results used in the development and empirical validation of C4Q 2.0. The files are organized into two main directories, bakend_testing and empirical_validation. Below is a brief overview of their contents:

    • app/ – Contains the full source code of C4Q as of the time this paper was submitted, allowing for reproducibility and further development. The frontend’s node_modules directory is not included, but these dependencies can be generated by following the installation instructions provided in the README.md file.

    • README.md – A guide detailing how to locally set up and run C4Q.

    • bakend_testing/ – Contains data from the evaluation of C4Q’s backend components:
      - reportBackendC4Q2.0.html: An HTML report generated by running 189 tests on the backend of C4Q.
      - classLLM_20241110101327.pth_training_metrics.csv: A CSV file documenting the Classification LLM’s training and validation metrics, including training loss, validation loss, training accuracy, and validation accuracy for each epoch.
      - qaLLM_evaluation_metrics.txt: A text file listing exact match and F1 metrics per epoch for the QA LLM.

    • create_data/ – Includes the scripts used to generate and curate training data for the classification LLM and the QA LLM.

    • empirical_validation/ – Includes data from the empirical evaluation of C4Q against other chatbots:
      - Directories named by model (e.g., openai-o1/, deepseek-coder_33b/, deepseek-r1/, etc.): Each contains the raw answers produced by the respective model in response to our set of quantum computing and software engineering questions.
      - prompts.txt: A text file with the full list of prompts used during the empirical evaluation.
      - requirements_qiskit0.46.3: A requirements file for Python dependencies used to create an environment with a Qiskit version < 1.0.0, enabling code snippet testing under older Qiskit releases.
      - requirements_qiskit1.3.1: A requirements file for Python dependencies used to create an environment with a Qiskit version >= 1.0.0, ensuring reproducible tests under newer releases.
      - results.xlsx: An Excel spreadsheet containing the empirical evaluation outcomes, including correct, incomplete, and incorrect answer rates for each model, under both Qiskit environments.
      - script_gates.sh: A shell script that automates prompting of Ollama’s deepseek-coder:33b and starcoder2:15b models with gate-related quantum questions.
      - script_SE.sh: A shell script that automates prompting of Ollama’s deepseek-coder:33b and starcoder2:15b models with software engineering problem questions.
