The ConvAI2 NeurIPS competition aimed to find approaches for creating high-quality dialogue agents capable of meaningful open-domain conversation. The ConvAI2 training dataset is based on the PERSONA-CHAT dataset. Each speaker pair is assigned profiles drawn from a set of 1,155 possible personas (at training time), each consisting of at least 5 profile sentences; 100 never-before-seen personas are set aside for validation. Because the original PERSONA-CHAT test set had already been released, a new hidden test set consisting of 100 new personas and over 1,015 dialogues was created by crowdsourced workers.
To discourage models from exploiting trivial word overlap, additional rewritten sets of the same train and test personas were crowdsourced. The rewritten sentences are rephrases, generalizations or specializations of the originals, which makes the task much more challenging. For example, “I just got my nails done” is revised as “I love to pamper myself on a regular basis” and “I am on a diet now” is revised as “I need to lose weight.”
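The effect of the rewriting can be illustrated with a simple overlap measure. The following is a minimal sketch, assuming Jaccard similarity over lowercase word sets as the notion of "word overlap"; the function name `word_overlap` and the tokenization are illustrative choices, not part of the ConvAI2 tooling.

```python
import re


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two sentences.

    Assumption: a word is a run of letters or apostrophes; this is an
    illustrative tokenization, not the one used by the dataset authors.
    """
    words_a = set(re.findall(r"[a-z']+", a.lower()))
    words_b = set(re.findall(r"[a-z']+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# The two original/revised persona pairs quoted in the description above.
pairs = [
    ("I just got my nails done", "I love to pamper myself on a regular basis"),
    ("I am on a diet now", "I need to lose weight"),
]

for original, revised in pairs:
    score = word_overlap(original, revised)
    print(f"{score:.2f}  {original!r} -> {revised!r}")
```

In both pairs the only shared word is "I", so a model that matches personas to utterances by surface overlap alone gets little signal from the revised sentences.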
The training, validation and hidden test sets consist of 17,878, 1,000 and 1,015 dialogues, respectively.
BPersona-chat is an evaluation dataset based on the English multiturn chat corpus Persona-chat and the Japanese multiturn chat corpus JPersona-chat.
Each chat was performed between two crowd workers assuming artificial personas. The speakers discuss a given personality trait, including but not limited to self-introductions and hobbies. (Note that the two corpora are not translations of each other.)
Chats are translated into Japanese/English by professional translators, a low-quality machine translation model A and a high-quality machine translation model B.
Translations are evaluated by crowdworkers as either good or bad, depending on the correctness and coherence.
Each chat is included in one .xlsx file with the following structure:
person - the speaker of the current utterance
source - the utterance in the source language
translation - the translation in the target language
evaluation: is this a good translation? - the evaluation of the translation's quality:
y - the current translation is a correct translation of the source utterance
n - the current translation is an erroneous translation of the source utterance
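As a minimal sketch of how rows with this structure might be summarized, the snippet below tallies the y/n evaluations. It assumes the rows have already been loaded from the .xlsx file into dictionaries keyed by the column names listed above; the loading step is omitted, and `summarize` and the sample rows are hypothetical placeholders rather than real BPersona-chat data.

```python
from collections import Counter

# Column header as given in the file structure description.
EVAL_COLUMN = "evaluation: is this a good translation?"


def summarize(rows):
    """Count good (y) and bad (n) translations in a list of row dicts."""
    counts = Counter(str(row[EVAL_COLUMN]).strip().lower() for row in rows)
    total = counts["y"] + counts["n"]
    return {
        "good": counts["y"],
        "bad": counts["n"],
        "good_ratio": counts["y"] / total if total else 0.0,
    }


# Hypothetical placeholder rows, invented only to show the expected shape.
sample_rows = [
    {"person": "A", "source": "<utterance 1>", "translation": "<translation 1>", EVAL_COLUMN: "y"},
    {"person": "B", "source": "<utterance 2>", "translation": "<translation 2>", EVAL_COLUMN: "n"},
    {"person": "A", "source": "<utterance 3>", "translation": "<translation 3>", EVAL_COLUMN: "y"},
]

print(summarize(sample_rows))
```

Keeping the summary independent of the file reader means the same function works whichever spreadsheet library is used to load the chats.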
PMPC (Persona Match on Persona-Chat) is a dataset for Speaker Persona Detection (SPD) which aims to detect speaker personas based on the plain conversational text.
intelsense/persona-chat-en2bn-azure dataset hosted on Hugging Face and contributed by the HF Datasets community
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to persona.chat (Domain). Get insights into ownership history and changes over time.
intelsense/persona-chat-en2bn dataset hosted on Hugging Face and contributed by the HF Datasets community
PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.
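The reported totals imply some rough averages. The sketch below derives them from the three figures quoted above; the variable names are illustrative, and the results are only as precise as the rounded totals.

```python
# Totals as reported in the PersonalDialog description (rounded, in units).
sessions = 20.83e6
utterances = 56.25e6
speakers = 8.47e6

# Derived averages; these are simple ratios of the reported totals.
utterances_per_session = utterances / sessions
utterances_per_speaker = utterances / speakers

print(f"{utterances_per_session:.2f} utterances per session on average")
print(f"{utterances_per_speaker:.2f} utterances per speaker on average")
```

The low per-session average suggests that many sessions are short exchanges, although the description does not give the actual distribution of session lengths.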
As of January 2024, 2 billion users accessed WhatsApp chat every month. Usage of the app is particularly strong in markets in the United States, although it is worth noting that it is one of the most popular mobile social apps worldwide. In February 2014, the social network Facebook acquired the mobile app for 19 billion US dollars.