A version of the PersonaChat dataset that has been true-cased, and also has been given more normalized punctuation. The original PersonaChat dataset is in all lower case, and has extra space around each clause/sentence separating punctuation mark. This version of the dataset has more of a natural language look, with sentence capitalization, proper noun capitalization, and normalized whitespace. Also, each dialogue turn includes a pool of distractor candidate responses, which can be used by a multiple choice regularization loss during training.
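As a rough illustration of the whitespace and sentence-capitalization cleanup described above, here is a minimal sketch. The function name `normalize_utterance` and the regular expressions are assumptions made for this example, not the procedure actually used to build the dataset. Proper-noun capitalization, which needs a true-casing model or lexicon, is not attempted.

```python
import re


def normalize_utterance(text: str) -> str:
    """Remove spaces before punctuation and capitalize sentence starts.

    Hypothetical helper for illustration only; it does not handle
    proper nouns, abbreviations, or quotation marks.
    """
    # Drop whitespace that precedes a clause- or sentence-separating mark.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text.strip())
    # Capitalize the first letter of the string and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # Capitalize the standalone pronoun "i".
    text = re.sub(r"\bi\b", "I", text)
    return text


if __name__ == "__main__":
    print(normalize_utterance("hello , how are you today ? i am well ."))
```

A rule-based pass like this only covers the punctuation spacing and sentence-initial capitals; anything closer to the "natural language look" described above would need a dedicated true-caser.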
Persona-Chat is sourced from authentic conversations between human annotators who are randomly matched and assigned persona information.
AlekseyKorshuk/persona-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SPC: Synthetic-Persona-Chat Dataset
Abstract from the paper introducing this dataset:
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.
The PersonaChat dataset is a large persona-conditioned chit-chat style dialogue dataset.
Dataset Description
This persona chat dataset consists of 20,000 conversations. This dataset is crafted to enhance personalized conversational text generation models that consistently reflect a character's persona in the generated response across many conversation turns. Each dialogue in the dataset is structured to reflect a back-and-forth exchange between two personas, offering a window into how individual characteristics, backgrounds, and personal narratives can influence… See the full description on the dataset page: https://huggingface.co/datasets/Cynaptics/persona-chat.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
A chit-chat dataset where paired Turkers are given assigned personas and chat to try to get to know each other.
Chit-chat models are known to have several problems: they lack specificity, do not display a consistent personality and are often not very captivating. In this work we present the task of making chit-chat more engaging by conditioning on profile information. We collect data and train models to (i) condition on their given profile information; and (ii) information about the person they are talking to, resulting in improved dialogues, as measured by next utterance prediction. Since (ii) is initially unknown our model is trained to engage its partner with personal topics, and we show the resulting dialogue can be used to predict profile information about the interlocutors.
MIT License: https://opensource.org/licenses/MIT
Dataset Card for PersonaChat
Dataset Description
PersonaChat is a multi-turn dialogue dataset introduced by Zhang et al. (2018) for training and evaluating persona-grounded conversational agents. Each conversation is between two crowdworkers, each assigned a randomly selected persona consisting of several simple facts. The dataset aims to assess whether models can maintain consistent character traits throughout a conversation.
Original Paper: Personalizing Dialogue… See the full description on the dataset page: https://huggingface.co/datasets/awsaf49/persona-chat.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We present the PERSONA-CHAT dataset, a new dialogue dataset consisting of 162,064 utterances between crowdworkers who were randomly paired and each asked to act the part of a given provided persona (randomly assigned, and created by another set of crowdworkers). The paired workers were asked to chat naturally and to get to know each other during the conversation. This produces interesting and engaging conversations that our agents can try to learn to mimic.
anassaleh218/personachat dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "personachat_safe"
More Information needed
ANTEGRAL/korean-persona-chat-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open domain conversation. The ConvAI2 dataset for training models is based on the PERSONA-CHAT dataset. The speaker pairs each have assigned profiles coming from a set of 1,155 possible personas (at training time), each consisting of at least 5 profile sentences, setting aside 100 never seen before personas for validation. As the original PERSONA-CHAT test set was released, a new hidden test set consisting of 100 new personas and over 1,015 dialogs was created by crowdsourced workers. To avoid modeling that takes advantage of trivial word overlap, additional rewritten sets of the same train and test personas were crowdsourced, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging. For example “I just got my nails done” is revised as “I love to pamper myself on a regular basis” and “I am on a diet now” is revised as “I need to lose weight.” The training, validation and hidden test sets consist of 17,878, 1,000 and 1,015 dialogues, respectively.
PMPC (Persona Match on Persona-Chat) is a dataset for Speaker Persona Detection (SPD) which aims to detect speaker personas based on the plain conversational text.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
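Prerequisites
Python 3.x and a CSV file that must contain the following columns: user 1 personas, user 2 personas, Best Generated Conversation
How to use
Prepare the CSV file: make sure the CSV file contains the three columns listed above and name it input.csv (or change the file name in the script to match your setup).
Run the script: on the command line, execute: python extract_conversations.py
After it runs, an output.json file is generated containing the converted JSON data.
How to change the role mapping
By default, within each conversation the script maps:
User 1's messages to gpt, and User 2's messages to human
If you need to swap the roles, for example mapping User 1 to human and User 2 to gpt, modify the corresponding part of the script as follows:
Find the following code snippet (located in the pairing logic for each group of dialogue turns): if first[0] == "1" and second[0] == "2":… See the full description on the dataset page: https://huggingface.co/datasets/tw-llama/Synthetic-Persona-Chat-Reversal-Role.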
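The role mapping described above can be sketched as follows. This is not the dataset's extract_conversations.py, whose contents are truncated here. The `ROLE_MAP` dictionary, the `convert_conversation` function, the assumed "User 1: ..." / "User 2: ..." line format, and the `from`/`value` output keys are all hypothetical choices made for illustration.

```python
import re

# Hypothetical default mapping: User 1 -> gpt, User 2 -> human.
# Swap the values to reverse the roles.
ROLE_MAP = {"1": "gpt", "2": "human"}


def convert_conversation(raw: str, role_map: dict = ROLE_MAP) -> list:
    """Turn 'User N: text' lines into a list of role-tagged messages.

    Assumes each turn sits on its own line and starts with
    'User 1:' or 'User 2:'; other lines are skipped.
    """
    messages = []
    for line in raw.splitlines():
        match = re.match(r"\s*User\s*([12])\s*:\s*(.*)", line)
        if match is None:
            continue  # Skip blank or unrecognized lines.
        speaker, text = match.groups()
        messages.append({"from": role_map[speaker], "value": text.strip()})
    return messages


if __name__ == "__main__":
    sample = "User 1: Hi there!\nUser 2: Hello, how are you?"
    print(convert_conversation(sample))
    # Reversed roles: User 1 becomes human, User 2 becomes gpt.
    print(convert_conversation(sample, {"1": "human", "2": "gpt"}))
```

Keeping the mapping in a single dictionary means the roles can be reversed by passing a different mapping rather than editing the branching logic.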
The ConvAI2 dataset, derived from Persona-Chat, contains dialogues between crowdworkers who role-play as assigned personas, enabling the development of conversational agents that can mimic engaging interactions.
These are the open dialogue datasets collected by TextBox, including:
PersonaChat (pc), DailyDialog (dd), DSTC7-AVSD (da), SGD (sgd), Topical-Chat (tc), Wizard of Wikipedia (wow), Movie Dialog (md), Cleaned OpenSubtitles Dialogs (cos), Empathetic Dialogues (ed), Curiosity (curio), CMU Document Grounded Conversations (cmudog), MuTual (mutual), OpenDialKG (odkg), DREAM (dream).
Details and the leaderboard for each dataset can be found on the TextBox page.
MIT License: https://opensource.org/licenses/MIT
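Chat-persona pair dataset
This data was built from AI Hub's Korean multi-session dialogue dataset.
The text was converted from polite speech (jondaetmal) to casual speech (banmal) using the Korean style conversion model korean-style-converter-6b.
Then 10,328 (chat, persona) pairs were extracted from the portion of the dataset consisting of Sessions 1-2.
A refined version of the dataset is planned for later release.
The refined version of the dataset has been released! NLPBada/korean-persona-chat-dataset-v2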
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
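This dataset was created by extracting dialogue_id, utterances, and speaker information (personas) from the following dataset and converting them into a format intended for role-play: https://github.com/nu-dialogue/real-persona-chat
Reference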
@inproceedings{yamashita-etal-2023-realpersonachat, title = "{R}eal{P}ersona{C}hat: A Realistic Persona Chat Corpus with Interlocutors{'} Own Personalities", author = "Yamashita, Sanae and Inoue, Koji and Guo, Ao and Mochizuki, Shota and Kawahara, Tatsuya and Higashinaka, Ryuichiro", booktitle = "Proceedings of… See the full description on the dataset page: https://huggingface.co/datasets/JINIAC/real-persona-chat.