Facebook
TwitterAlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SPC: Synthetic-Persona-Chat Dataset
Abstract from the paper introducing this dataset:
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.
Facebook
TwitterDataset Description
This persona chat dataset consists of 20,000 conversations. This dataset is crafted to enhance personalized conversational text generation models that consistently reflect a character's persona in the generated response across many conversation turns. Each dialogue in the dataset is structured to reflect a back-and-forth exchange between two personas, offering a window into how individual characteristics, backgrounds, and personal narratives can influence… See the full description on the dataset page: https://huggingface.co/datasets/Cynaptics/persona-chat.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for PersonaChat
Dataset Description
PersonaChat is a multi-turn dialogue dataset introduced by Zhang et al. (2018) for training and evaluating persona-grounded conversational agents. Each conversation is between two crowdworkers, each assigned a randomly selected persona consisting of several simple facts. The dataset aims to assess whether models can maintain consistent character traits throughout a conversation.
Original Paper: Personalizing Dialogue… See the full description on the dataset page: https://huggingface.co/datasets/awsaf49/persona-chat.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present the PERSONA-CHAT dataset, a new dialogue dataset consisting of 162,064 utterances between crowdworkers who were randomly paired and each asked to act the part of a given provided persona (randomly assigned, and created by another set of crowdworkers). The paired workers were asked to chat naturally and to get to know each other during the conversation. This produces interesting and engaging conversations that our agents can try to learn to mimic.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Kawindu Wijewardhane
Released under MIT
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A chit-chat dataset where paired Turkers are given assigned personas and chat to try to get to know each other.
Chit-chat models are known to have several problems: they lack specificity, do not display a consistent personality and are often not very captivating. In this work we present the task of making chit-chat more engaging by conditioning on profile information. We collect data and train models to (i) condition on their given profile information; and (ii) information about the person they are talking to, resulting in improved dialogues, as measured by next utterance prediction. Since (ii) is initially unknown our model is trained to engage its partner with personal topics, and we show the resulting dialogue can be used to predict profile information about the interlocutors.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Persona-Chat
Dataset Summary
Persona-Chat is a high-quality multi-turn dialogue dataset designed to train conversational AI systems with consistent personality and style. Each participant in the dataset is assigned a persona—a short description or set of traits—which guides their responses throughout the conversation. This dataset enables AI models to learn to maintain coherent personas across dialogue turns and produce responses that reflect consistent characteristics… See the full description on the dataset page: https://huggingface.co/datasets/anezatra/persona-chat.
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset of 10,000 dialogues will help researchers of dialogue systems to develop approaches for training chat bots. Prepared in collaboration with MIPT’s Neural Networks and Deep Learning Lab, the dataset contains profiles with a description of each individual's personality and dialogues between the research participants. A chatbot that is trained on the dataset will be able to communicate on behalf of a certain persona and get to know people by chatting with them on general topics.
Facebook
TwitterPMPC (Persona Match on Persona-Chat) is a dataset for Speaker Persona Detection (SPD) which aims to detect speaker personas based on the plain conversational text.
Facebook
TwitterANTEGRAL/korean-persona-chat-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterPERSONA-CHAT 数据集,这是一个新的对话数据集,由随机配对的众包工作人员之间的 162,064 个话语组成 并且每个人都要求扮演给定的角色(随机分配,由另一组众包创建)。配对的工人被要求自然地聊天,并在谈话中相互了解。这会产生有趣且引人入胜的对话,我们的代理可以尝试学习模仿。
Facebook
Twittersuchievement/rp-chat-persona-sharegpt dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterkaitchup/qa-chat-persona-education dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is reformated nazlicanto/persona-based-chat Original dataset Synthetic Persona Chat
Changes
Added system column, which is reformated persona_b, unified into one string, replaced "I", "my"... on "You", "your"... and corrected capital letter usage (now after dot goes capital letter) Added messages column, which is dialogue reformated to be in conversational format + system message Splitted on train and test
More about reformating
You can find all the… See the full description on the dataset page: https://huggingface.co/datasets/Kkordik/persona-based-chat-messages.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of (Science Technology, Engineering and Math).
To be completed
python
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")901 generated conversations between a simulated user and AI-assistant (more on the generation method below). Each instance is made of the following field (column):
- id: a unique identifier to refer to this specific conversation. Useeful for traceability purposes, especially for further processing task or merge with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation messages between the user and the AI assistant in already nicely formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
Important See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired from Ultrachat dataset. Especially, we implemented our own version of Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
gpt-4-turbo model) to generate the answers to the opening questionsAll instances in the dataset are in english
901 synthetically-generated dialogue
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedbacks. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Facebook
Twitterhttps://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of Spanish healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes:
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A curated prompt template for AI language models: Create a persona using a template very useful
Facebook
Twitterdattheshshenoy/genz-persona-chat-style dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterThis dataset is a translation of the Persona Chat dataset into informal Indonesian, reflecting the language commonly used by Indonesian teenagers in instant messaging conversations. It is derived from the repository psetialana/multi_session_chat-informal_indonesian-transformed, which serves as a translated version of gonced8/multi-session_chat. The conversations in the first session of the multi-session chat dataset originate from the Persona Chat dataset.
Facebook
TwitterAlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community