https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Chatbot Arena Conversations Dataset
This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/1-800-SHARED-TASKS/lmsys-chat-1m.
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains a compilation of carefully-crafted Q&A pairs which are designed to provide AI-based tailored support for mental health. These carefully chosen questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on their input. This comprehensive dataset is crafted by experts in the mental health field, providing insightful content that will further research in this growing area. These data points will be invaluable for developing the next generation of personalized AI-based mental health chatbots capable of truly understanding what people need
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it represents an excellent starting point in building a conversational model which can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:
Understand your data: Spend time getting to know the text of the conversation between the user and the chatbot and familiarize yourself with what type of questions and answers are included in this specific dataset. This will help you better formulate queries for your own conversational model or develop new ones you can add yourself.
Refine your language processing models: By studying the patterns in syntax, grammar, tone, voice, etc., within this conversational data set you can hone your natural language processing capabilities - such as keyword extractions or entity extraction – prior to implementing them into a larger bot system .
Test assumptions: Have an idea of what you think may work best with a particular audience or context? See if these assumptions pan out by applying different variations of text to this dataset to see if it works before rolling out changes across other channels or programs that utilize AI/chatbot services
Research & Analyze Results : After testing out different scenarios on real-world users by using various forms of q&a within this chatbot pair data set , analyze & record any relevant results pertaining towards understanding user behavior better through further analysis after being exposed to tailored texted conversations about Mental Health topics both passively & actively . The more information you collect here , leads us closer towards creating effective AI powered conversations that bring our desired outcomes from our customer base .
- Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.
- Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.
- Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:--------------|:------------------------------------------------------------------------| | text | The text of the conversation between the user and the chatbot. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
AI Medical Chatbot Dataset
This is an experimental Dataset designed to run a Medical Chatbot It contains at least 250k dialogues between a Patient and a Doctor.
Playground ChatBot
ruslanmv/AI-Medical-Chatbot For furter information visit the project here: https://github.com/ruslanmv/ai-medical-chatbot
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains valuable records of conversations between humans and AI-driven chatbots in real-world scenarios. This is a great opportunity to explore the nuances and intricacies of conversations between humans and machines, opening the door to interesting research directions for machine learning, artificial intelligence, natural language processing (NLP), and beyond. With this data, researchers can determine how well machines are able to simulate real conversation behavior such as nonverbal exchanges, intonations, humorous insights or even sarcasm. The data also provides an avenue for comparative studies between human behavior and AI capabilities in carrying out meaningful dialogues with humans. This knowledge base is invaluable for those who aim to create more astounding AI systems that can closely imitate comprehensible speech patterns through their trained technology models
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
How to Use this Dataset
This dataset contains conversations between humans and AI-driven chatbots in real-world scenarios. With this dataset, you will be able to use the data to build an AI system that can respond intelligently in natural language conversations. For example, you can build a system with the ability to further engage users by replying with meaningful responses as the conversation progresses.
In order to get started, first familiarize yourself with the columns included in this dataset: 'chat' and 'system'. The column 'chat' contains conversations between humans and chatbot systems while the column 'system' contains responses from AI-driven chatbots.
Once you understand what is included in the data set, it's time for you to start building your AI system! Depending on how complex or advanced your goal is, there are several different approaches that could be used when working with this data set such as supervised learning models like seq2seq network or unsupervised methods like autoencoders etc. To get more detailed information regarding those methods refer to external materials available online.
After having trained your model, now it's time for testing out its performance! Enter some sample text into your model using either a web form or command line interface – then observe how it responds against what’s already stored within training datasets column ‘System’ which indicates expected chatsbot response (see above). You should find that once trained correctly; potential outcomes of such tests explores very closely resembling instances from learning sources (the training dataset) leading evidence of advanced Artificial intelligence applications are possible with sufficient analysis inputs! As always if extra accuracy is needed afterwards tweak any parameters until desired results are achieved - Congratulations!
- AI-driven natural language generation: Using this dataset, developers can train AI systems to automatically generate natural conversations between humans and machines.
- Automatic response selection: The data in the dataset could be used to train AI algorithms which select the most appropriate response in any given conversation.
- Evaluating human-machine interaction: Researchers can use this data to identify areas of improvement in conversational interactions between humans and machines, as well as evaluate various techniques for creating effective dialogue systems
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:--------------|:--------------------------------------------------------| | chat | Contains dialogues uttered by the human. (String) | | system | Contains responses from the AI-driven chatbot. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
huggingface-projects/bot-fight-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Chatbot Instruction Prompts Datasets
Dataset Summary
This dataset has been generated from the following ones:
tatsu-lab/alpaca Dahoas/instruct-human-assistant-prompt allenai/prosocial-dialog
The datasets has been cleaned up of spurious entries and artifacts. It contains ~500k of prompt and expected resposne. This DB is intended to train an instruct-type model
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Update
[01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!
Content
This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
SPML Chatbot Prompt Injection Dataset
Arxiv Paper Introducing the SPML Chatbot Prompt Injection Dataset: a robust collection of system prompts designed to create realistic chatbot interactions, coupled with a diverse array of annotated user prompts that attempt to carry out prompt injection attacks. While other datasets in this domain have centered on less practical chatbot scenarios or have limited themselves to "jailbreaking" – just one aspect of prompt injection – our dataset… See the full description on the dataset page: https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection.
bot-remains/student-assistance-chatbot dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
CREDIT: Dataset Card for "heliosbrahma/mental_health_chatbot_dataset"
Dataset Description
Dataset Summary
This dataset contains conversational pair of questions and answers in a single text related to Mental Health. Dataset was curated from popular healthcare blogs like WebMD, Mayo Clinic and HeatlhLine, online FAQs etc. All questions and answers have been anonymized to remove any PII data and pre-processed to remove any unwanted characters.
Languages… See the full description on the dataset page: https://huggingface.co/datasets/ZahrizhalAli/mental_health_conversational_dataset.
Manav5461/Chatbot-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
myLlama03/chatbot-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
tejas1206/medical-chatbot dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LMSYS Chatbot Arena ELO Scores
This dataset is a datasets-friendly version of Chatbot Arena ELO scores, updated daily from the leaderboard API at https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard. Updated: 20250717
Loading Data
from datasets import load_dataset
dataset = load_dataset("mathewhe/chatbot-arena-elo", split="train")
The main branch of this dataset will always be updated to the latest ELO and leaderboard version. If you need a fixed dataset… See the full description on the dataset page: https://huggingface.co/datasets/mathewhe/chatbot-arena-elo.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
silviaiaia/chatbot dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
DrBenjamin/ai-medical-chatbot dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Events and Ticketing Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [events and ticketing] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-events-ticketing-llm-chatbot-training-dataset.
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Chatbot Arena Conversations Dataset
This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.