https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.
This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.
The dataset contains intents. An "intent" is the intention behind a user's message. For instance, if a user says "I am sad" to the chatbot, the intent would be "sad". Each intent has an associated set of Patterns and Responses: Patterns are examples of user messages that align with the intent, while Responses are the replies the chatbot provides in accordance with the intent. Various intents are defined, and their patterns and responses serve as the model's training data for identifying a particular intent.
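A minimal sketch of the intent structure just described, with a naive keyword matcher. The field names ("tag", "patterns", "responses") and the example content are illustrative, not taken from the dataset itself:

```python
# One intent, in the Patterns/Responses shape described above
# (illustrative field names, not the dataset's actual schema).
intents = [
    {
        "tag": "sad",
        "patterns": ["I am sad", "I feel down", "I'm feeling low"],
        "responses": ["I'm sorry to hear that. Do you want to talk about it?"],
    },
]

def match_intent(message, intents):
    """Naive matcher: return the tag whose patterns share the most
    words with the user's message (a trained classifier would replace this)."""
    words = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & set(pattern.lower().split()))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag

print(match_intent("I am so sad today", intents))  # → sad
```

In practice the patterns are used to train an intent classifier rather than matched literally; the sketch only shows how the three fields relate.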
This dataset was created by Abhishek Srivastava
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to Incribo's synthetic chatbot dataset! Crafted with precision, this dataset offers a realistic representation of travel history, making it an ideal playground for various analytical tasks.
Use the chatbot dataset to help you assess user sentiments, locations/IPs, language, platform, and more!
Remember, this is just a sample! If you're intrigued and want access to the complete dataset or have specific requirements, don't hesitate to contact us(info@incribo.com). Happy building!
About us: Incribo is a synthetic data generation company focused on delivering high-quality training, testing, and fine-tuning data to engineering teams where accessing data is impossible, either because the data doesn't exist or because it's locked behind privacy regulations.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes: • Prompt: Defines the AI’s role/persona. • Response: A natural, immersive reply fitting the persona.
Example Entries:
```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
```
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
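For the fine-tuning use case above, prompt/response rows are typically converted into the chat-message format most fine-tuning pipelines expect. A minimal sketch; the conversion and role assignments are illustrative, not the dataset's official format:

```python
# Convert prompt/response rows (as in the example entries) into
# chat-format training examples. The system/assistant role mapping
# is an assumption about how a fine-tuning pipeline might use them.
rows = [
    {"prompt": "You are a celestial guardian.",
     "response": "The stars whisper secrets that only I can hear..."},
]

def to_chat_example(row):
    return {
        "messages": [
            {"role": "system", "content": row["prompt"]},       # persona definition
            {"role": "assistant", "content": row["response"]},  # in-character reply
        ]
    }

examples = [to_chat_example(r) for r in rows]
print(examples[0]["messages"][0]["role"])  # system
```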
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains SQuAD and NarrativeQA dataset files.
Train conversational AI with the ChatBot Dataset for Transformers. Featuring human-like dialogues, preprocessed inputs, and labels, it’s perfect for GPT, BERT, T5, and NLP projects
https://www.thebusinessresearchcompany.com/privacy-policy
The global AI Training Dataset market size is expected to reach $6.98 billion by 2029, growing at a compound annual growth rate of 21.5%. The text segment includes natural language processing (NLP) datasets, chatbot training datasets, sentiment analysis datasets, and language translation datasets.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This training dataset comprises more than 10,000 conversational text exchanges between two native Bahasa speakers in the general domain. We have a collection of chats on a variety of topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., which makes the dataset diverse.
These chats consist of language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to keep the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The license for this training dataset belongs to FutureBeeAI!
This dataset includes FAQ data and their categories to train a chatbot specialized for the e-learning system used at Tokyo Metropolitan University. We report the accuracy of the chatbot in the following papers.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "Supporting Creation of FAQ Dataset for E-learning Chatbot", Intelligent Decision Technologies, Smart Innovation, IDT'19, Springer, 2019, to appear.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "An FAQ Dataset for E-learning System Used on a Japanese University", Data in Brief, Elsevier, in press.
This dataset is based on real Q&A data about how to use the e-learning system, asked by the students and teachers who use it in practical classes. The Q&A data were collected from April 2015 to July 2018.
We attach an English version of the dataset, translated from the Japanese, to make it easier to understand what the dataset contains. Note that we did not perform any evaluations on the English version; there are no results on how accurately chatbots respond to its questions.
File contents:
Results of statistical analyses for the dataset. We used the Calinski-Harabasz method, mutual information, the Jaccard index, TF-IDF + KL divergence, and TF-IDF + JS divergence to measure the quality of the dataset. In the analyses, we regard each answer as a cluster of questions. We also perform the same analyses for categories by regarding them as clusters of answers.
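The TF-IDF + KL-divergence analysis above compares term distributions between clusters. A minimal sketch of discrete KL divergence; the example distributions and the smoothing constant are illustrative, not values from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.
    eps guards against log(0); in the analysis above, p and q would be
    TF-IDF-derived term distributions of two clusters."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]  # term distribution of cluster A (illustrative)
q = [0.4, 0.4, 0.2]  # term distribution of cluster B (illustrative)
print(round(kl_divergence(p, q), 4))  # → 0.0253
```

KL divergence is asymmetric; the JS divergence also mentioned above is its symmetrized, bounded variant.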
Grants: JSPS KAKENHI Grant Number 18H01057
https://creativecommons.org/publicdomain/zero/1.0/
In a toy project chatbot: - https://github.com/steve-levesque/Portfolio-NLP-ChatbotStoreInventory
Based on the structure in this article: - https://chatbotsmagazine.com/contextual-chat-bots-with-tensorflow-4391749d0077
https://data.macgence.com/terms-and-conditions
Access our chatbot image dataset designed for AI training. Ideal for boosting visual recognition, enhancing chatbot interfaces, and optimizing user experience.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset consists of 400 text-only, fine-tuned multi-turn conversations in English, based on 10 categories and 19 use cases. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback.
The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.
Key Features
- Prompts focused on user intent, devised using natural language processing techniques.
- Multi-turn prompts with up to 5 turns to enhance the responsive memory of large language models during pretraining.
- Conversational interactions covering queries related to varied aspects of writing, coding, knowledge assistance, data manipulation, reasoning, and classification.
Dataset Source
Subject-matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.
Structure & Fields
The dataset is organized into different columns, which are detailed below:
- P1, R1, P2, R2, P3, R3, P4, R4, P5, R5 (object): The sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs.
- Use Case (object): The primary application or scenario the interaction is designed for, such as "Q&A helper" or "Writing assistant." This classification helps identify the purpose of the dialogue.
- Type (object): The complexity of the interaction; entries in this dataset are labeled "Complex," denoting dialogues that involve intricate, multi-layered exchanges.
- Category (object): A broad categorization of the interaction type, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it generates creative content, provides detailed answers, or engages in complex problem-solving.

Intended Use Cases
The dataset can enhance query-assistance models for shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based queries. It is intended to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can pre-train large language models with supervised fine-tuned annotated data and support retrieval-augmented generative models. The dataset is free of violence-based interactions that could lead to harm, conflict, discrimination, brutality, or misinformation.

Potential Limitations & Biases
This is a static dataset, so the information is dated May 2024.
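The P1..R5 column layout described above can be flattened into ordered turns for training. A minimal sketch; the row contents and empty-slot convention are invented for illustration:

```python
# Flatten one row of the P1..R5 schema into ordered (prompt, response)
# turns, skipping unused slots. The sample row is illustrative only.
row = {
    "P1": "Draft a short apology email.", "R1": "Sure, here is a draft...",
    "P2": "Make it more formal.",         "R2": "Certainly, a formal version...",
    "P3": None, "R3": None, "P4": None, "R4": None, "P5": None, "R5": None,
}

turns = []
for i in range(1, 6):
    p, r = row.get(f"P{i}"), row.get(f"R{i}")
    if p and r:  # keep only completed prompt/response pairs
        turns.append((p, r))

print(len(turns))  # 2
```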
Note
If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact us at SoftAge Information Technology Limited at info@softage.ai.
I was developing a chatbot called Amdere Bot that identifies people suffering from depression and helps them cope with depression. There wasn't much data available online to train the bot to identify depression, so I decided to create a conversation dataset from scratch that can be used to train the bot.
I created a .yml file that contains questions and answers. It contains technical questions about depression and anxiety, as well as questions that a depressed person is most likely to ask a bot, and answers to those questions.
Depression is a common mental disorder. Globally, more than 264 million people of all ages suffer from depression. It is a leading cause of disability worldwide. Especially in the current pandemic, the rate of depression and anxiety has increased exponentially, and at present it is even difficult to attend physical therapy. This motivated me to create a bot that can not only answer trivia but is also trained to answer questions related to depression, helping people suffering from depression and anxiety. It could even aid suicide helplines by talking to a person until one of their workers is free to take the call. I would love for people to use this data to create a bot that helps people suffering from mental health issues.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of several columns that provide essential information for each entry. These columns include:
- instruction: The specific instruction given to the model for generating a response.
- responses: The model-generated responses to the given instruction.
- next_response: The subsequent response generated by the model after the previous response.
- answer: The correct answer to the question asked in the instruction.
- is_human_response: A boolean indicating whether a particular response was generated by a human or by an AI model.

By analyzing this rich and diverse dataset, researchers and practitioners can gain valuable insights into various aspects of question-answering tasks using AI models. It offers an opportunity for developers to train their models effectively while also facilitating rigorous evaluation methodologies.
Please note that specific dates are not included in this dataset description, which focuses solely on accurate, informative, descriptive details about its content and purpose.
How to use the dataset
Understanding the Columns: This dataset contains several columns that provide important information for each entry:
- instruction: The instruction given to the model for generating a response.
- responses: The model-generated responses to the given instruction.
- next_response: The next response generated by the model after the previous response.
- answer: The correct answer to the question asked in the instruction.
- is_human_response: Indicates whether a response was generated by a human or the model.

Training Data (train.csv): Use the train.csv file as training data. It contains a large number of examples that you can use to train your question-answering models or algorithms.
Testing Data (test.csv): Use test.csv file in this dataset as testing data. It allows you to evaluate how well your models or algorithms perform on unseen questions and instructions.
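The column layout above can be read with Python's standard library. A minimal sketch; the inlined sample rows stand in for train.csv and are not from the dataset:

```python
import csv
import io

# Inline sample standing in for a few rows of train.csv
# (column names follow the description above; values are invented).
sample = io.StringIO(
    "instruction,responses,next_response,answer,is_human_response\n"
    "What is 2+2?,It is 4.,Anything else?,4,False\n"
    "Capital of France?,Paris.,Need more?,Paris,True\n"
)

rows = list(csv.DictReader(sample))
# Split on the is_human_response flag, e.g. to compare human vs. model answers.
human = [r for r in rows if r["is_human_response"] == "True"]
model = [r for r in rows if r["is_human_response"] == "False"]
print(len(human), len(model))  # 1 1
```

For the real files, replace the `io.StringIO` sample with `open("train.csv")`.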
Create Machine Learning Models: You can use this dataset's components (instructions, responses, next responses, and human-generated answers), along with the is_human_response labels (True/False), to train machine learning models designed for question-answering tasks.
Evaluate Model Performance: After training your model using the provided training data, you can then test its performance on unseen questions from test.csv file by comparing its predicted responses with actual human-generated answers.
Data Augmentation: You can also augment this existing data in various ways such as paraphrasing existing instructions or generating alternative responses based on similar contexts within each example.
Build Conversational Agents: This dataset can be useful for training conversational agents or chatbots by leveraging the instruction-response pairs.
Remember, this dataset provides a valuable resource for building and evaluating question-answering models. Have fun exploring the data and discovering new insights!
Research Ideas
Language Understanding: This dataset can be used to train models for question-answering tasks. Models can learn to understand and generate responses based on given instructions and previous responses.
Chatbot Development: With this dataset, developers can create chatbots that provide accurate and relevant answers to user questions. The models can be trained on various topics and domains, allowing the chatbot to answer a wide range of questions.
Educational Materials: This dataset can be used to develop educational materials, such as interactive quizzes or study guides. The models trained on this dataset can provide instant feedback and answers to students' questions, enhancing their learning experience.
Information Retrieval Systems: By training models on this dataset, information retrieval systems can be developed that help users find specific answers or information from large datasets or knowledge bases.
Customer Support: This dataset can be used in training customer support chatbots or virtual assistants that can provide quick and accurate responses to customer inquiries.
Language Generation Research: Researchers studying natural language generation (NLG) techniques could use this dataset for developing novel algorithms for generating coherent and contextually appropriate responses in question-answering scenarios.
Automatic Summarization Systems: Using the instruction-response pairs, automatic summarization systems could be trained that generate concise summaries of lengthy texts by understanding the main content of the text through answering questions.
Dialogue Systems Evaluation: The instruction-response pairs in this dataset could serve as a benchmark for evaluating the performance of dialogue systems in terms of response quality, relevance, coherence, etc.
Machine Learning Training Data Augmentation: One clever idea…
We release the Douban Conversation Corpus, comprising a training set, a development set, and a test set for retrieval-based chatbots. The statistics of the Douban Conversation Corpus are shown in the following table.
| | Train | Val | Test |
|---|---|---|---|
| Session-response pairs | 1M | 50k | 10k |
| Avg. positive responses per session | 1 | 1 | 1.18 |
| Fleiss' kappa | N/A | N/A | 0.41 |
| Min. turns per session | 3 | 3 | 3 |
| Max. turns per session | 98 | 91 | 45 |
| Avg. turns per session | 6.69 | 6.75 | 5.95 |
| Avg. words per utterance | 18.56 | 18.50 | 20.74 |
The test data contain 1,000 dialogue contexts, and for each context we created 10 responses as candidates. We recruited three labelers to judge whether each candidate is a proper response to the session; a proper response means the response naturally replies to the message given the context. Each pair received three labels, and the majority label was taken as the final decision.
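The three-annotator majority vote described above can be sketched as follows; the 0/1 label encoding is an assumption for illustration:

```python
from collections import Counter

def majority_label(labels):
    """Return the label assigned by the majority of annotators.
    With three binary labels (1 = proper response, 0 = not), a strict
    majority always exists, so no tie-breaking is needed."""
    label, count = Counter(labels).most_common(1)[0]
    return label

print(majority_label([1, 1, 0]))  # 1
print(majority_label([0, 0, 0]))  # 0
```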
As far as we know, this is the first human-labeled test set for retrieval-based chatbots. The entire corpus is available at https://www.dropbox.com/s/90t0qtji9ow20ca/DoubanConversaionCorpus.zip?dl=0
https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: context, knowledge, and response.
Here's a snippet from the dataset to give you an idea of its structure:
```json
[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard.",
      // ... (6 more lines of context)
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  },
  // ... (more data samples)
]
```
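A sample in the structure shown above can be loaded and flattened into a single model input. A minimal sketch; the `[SEP]` separator token is an assumption about downstream use, not part of the dataset:

```python
import json

# A strictly valid JSON sample in the dataset's structure
# (the snippet above uses // comments, which real JSON does not allow).
sample = json.loads("""
[
  {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  }
]
""")

dialogue = sample[0]
# Join the context turns into one input string for a response model.
model_input = " [SEP] ".join(dialogue["context"])
print(model_input)
print(dialogue["response"])
```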
The French Movie Subtitle Conversations dataset serves as a valuable resource for a range of dialogue-modeling applications.
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the materials and results used in the development and empirical validation of C4Q 2.0. The files are organized into two main directories: bakend_testing and empirical_validation. Below is a brief overview of the contents:
- app/: The full source code of C4Q as of the time this paper was submitted, allowing for reproducibility and further development. The frontend's node_modules directory is not included, but these dependencies can be generated by following the installation instructions in the README.md file.
README.md – A guide detailing how to locally set up and run C4Q.
- bakend_testing/: Data from the evaluation of C4Q's backend components:
  - reportBackendC4Q2.0.html: An HTML report generated by running 189 tests on the backend of C4Q.
  - classLLM_20241110101327.pth_training_metrics.csv: A CSV file documenting the classification LLM's training and validation metrics, including training loss, validation loss, training accuracy, and validation accuracy for each epoch.
  - qaLLM_evaluation_metrics.txt: A text file listing exact-match and F1 metrics per epoch for the QA LLM.
- create_data/: The scripts used to generate and curate training data for the classification LLM and the QA LLM.
- empirical_validation/: Data from the empirical evaluation of C4Q against other chatbots:
  - Directories named by model (e.g., openai-o1/, deepseek-coder_33b/, deepseek-r1/, etc.): Each contains the raw answers produced by the respective model in response to our set of quantum computing and software engineering questions.
  - prompts.txt: The full list of prompts used during the empirical evaluation.
  - requirements_qiskit0.46.3: A requirements file for a Python environment with Qiskit versions <1.0.0, enabling code-snippet testing under older Qiskit releases.
  - requirements_qiskit1.3.1: A requirements file for a Python environment with Qiskit versions >=1.0.0, ensuring reproducible tests under newer releases.
  - results.xlsx: An Excel spreadsheet containing the empirical evaluation outcomes, including correct, incomplete, and incorrect answer rates for each model under both Qiskit environments.
  - script_gates.sh: A shell script that automates prompting Ollama's deepseek-coder:33b and starcoder2:15b models with gate-related quantum questions.
  - script_SE.sh: A shell script that automates prompting Ollama's deepseek-coder:33b and starcoder2:15b models with software engineering problem questions.