DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1,000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
The Mini Daily Dialog Dataset is a smaller, preprocessed version of the Daily Dialog dataset intended for NLU tasks. The mini version contains 700 dialogs in the train.csv file and 100 dialogs in the test.csv file, along with the corresponding dialog acts.
There are four dialog acts in the data, encoded as follows: inform = 1, question = 2, directive = 3, commissive = 4.
This dataset could be used for class assignments and mini-project demos.
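As a minimal loading sketch, the numeric act codes above can be mapped back to their names with pandas. The column names ("dialog", "act") are assumptions, since the exact CSV schema is not documented here; inspect the header row of train.csv to confirm.

```python
import pandas as pd

# Map the numeric dialog-act codes (documented above) back to their names.
ACT_LABELS = {1: "inform", 2: "question", 3: "directive", 4: "commissive"}

# Hypothetical column names -- adjust after inspecting train.csv.
train = pd.read_csv("train.csv")
train["act_name"] = train["act"].map(ACT_LABELS)
print(train[["dialog", "act", "act_name"]].head())
```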
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosive growth of dialogue data has aroused significant interest among scholars in abstractive dialogue summarization. In this paper, we propose a novel sequence-to-sequence framework called DS-SS (Dialogue Summarization with Factual-Statement Fusion and Dialogue Segmentation) for summarizing dialogues. The novelty of the DS-SS framework lies mainly in two aspects: 1) factual statements are extracted from the source dialogue and combined with it for the subsequent dialogue encoding; and 2) a dialogue segmenter is trained and used to split a dialogue into several topic-coherent segments before encoding. Thanks to these two aspects, the proposed framework can better encode dialogues, thereby generating summaries with higher factual consistency and informativeness. Experimental results on two large-scale datasets, SAMSum and DialogSum, demonstrate the superiority of our framework over strong baselines, as evidenced by both automatic evaluation metrics and human evaluation.
The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. State tracking, sometimes called belief tracking, refers to accurately estimating the user's goal as a dialog progresses. Accurate state tracking is desirable because it provides robustness to errors in speech recognition and helps reduce the ambiguity inherent in language within a temporal process like dialog. In these challenges, participants were given labelled corpora of dialogs with which to develop state tracking algorithms. The trackers were then evaluated on a common set of held-out dialogs, which were released, unlabelled, during a one-week period.
The corpus was collected using Amazon Mechanical Turk and consists of dialogs in two domains: restaurant information and tourist information. Tourist information subsumes restaurant information, and includes bars, cafés, etc., as well as multiple new slots. There were two rounds of evaluation using this data:
DSTC 2 released a large number of training dialogs related to restaurant search. Compared to the first DSTC (which was in the bus timetables domain), DSTC 2 introduced changing user goals, tracking of 'requested slots', and the new restaurants domain. Results from DSTC 2 were presented at SIGDIAL 2014. DSTC 3 addressed the problem of adaptation to a new domain: tourist information. DSTC 3 released a small amount of labelled data in the tourist information domain; participants used this data plus the restaurant data from DSTC 2 for training. Dialogs used for training are fully labelled; user transcriptions, user dialog-act semantics, and dialog state are all annotated. (This corpus is therefore also suitable for studies in Spoken Language Understanding.)
The MMD (MultiModal Dialogs) dataset is a dataset for multimodal domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated, manually intensive, iterative process.
License unknown: https://choosealicense.com/licenses/unknown/
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing, and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.
Multi-domain Customer Service Dialogue Text Data, 90,000 sets in total, spanning multiple domains including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set of data consists of a single- or multi-turn conversation. This dataset can be used for tasks such as LLM training and ChatGPT-style fine-tuning.
This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints described in the linked paper.
This database is an interactive text corpus of real users on mobile phones. The database has been anonymized to ensure it contains no private user information (A and B are codes that replace the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '* * *'). This database can be used for tasks such as natural language understanding.
The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:
1) selecting business scenes, 2) writing monolingual conversation scenarios according to the selected scenes, and 3) translating the scenarios into the other language.
Half of the monolingual scenarios were written in Japanese and the other half were written in English. The whole construction process was supervised by a person who satisfies the following conditions to guarantee the conversations to be natural:
• has experience of being engaged in language-learning programs, especially for business conversations
• is able to communicate smoothly with others in various business scenes, both in Japanese and English
• has experience of being involved in business
The BSD corpus is split into balanced training, development and evaluation sets. The documents in these sets are balanced in terms of scenes and original languages. In this repository we publicly share the full development and evaluation sets and a part of the training data set.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
I decided, for a little hobby project I'm working on, that I needed a dialog dataset, which Cornell University kindly provided here. However, as a database programmer, I'm used to working with structured data, not parsing and building lists from text-based files, and I decided that life would be much easier for me if I had this data in an SQL-type database, so I wrote me a little Python script to chunk the whole thing into SQLite.
Original Data Set: https://www.kaggle.com/Cornell-University/movie-dialog-corpus
As of this writing, the original dataset was updated 7 months ago.
The data is normalized, with all of the code-breaking artifacts I ran into hand-corrected. If you're familiar with SQL, and have a language/library that supports SQLite, I hope you'll find this fairly easy to work with. All of the data from the original dataset is, I believe, present, though I did remove some redundancies. For example, in the original dataset, movie_lines.tsv lists the character name along with the character id, which is redundant, because the name is listed in the movie_characters.tsv file. While this is a convenience when you have to process the file directly, it can easily be obtained by a JOIN in a structured database. The raw_script_urls are included in the movies table.
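For illustration, the JOIN mentioned above might look like the following. The database filename and the table/column names are assumptions, so check the actual schema (e.g. with `.schema` in the sqlite3 shell) before running.

```python
import sqlite3

# Hypothetical file and schema names -- verify against the actual database.
conn = sqlite3.connect("movie_dialogs.db")

# Recover each line's character name by joining on the character id,
# instead of storing the name redundantly in the lines table.
rows = conn.execute("""
    SELECT c.name, l.line_text
    FROM movie_lines AS l
    JOIN movie_characters AS c ON c.character_id = l.character_id
    LIMIT 10
""").fetchall()

for name, text in rows:
    print(f"{name}: {text}")
conn.close()
```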
Thank you to Cornell University for providing the original Corpus. Photo by Tobias Fischer on Unsplash
Do let me know if you find this useful. I will probably do similar conversions for other datasets as I need them, and would happily upload them if anyone else finds them useful in that form.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the following JSON files:
1. data.json: the WOZ dialogue dataset, containing the conversations between users and wizards, as well as a set of coarse labels for each user turn.
2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge, UK area and a set of attributes.
3. attraction_db.json: the Cambridge attraction database file, containing attractions in the Cambridge, UK area and a set of attributes.
4. hotel_db.json: the Cambridge hotel database file, containing hotels in the Cambridge, UK area and a set of attributes.
5. train_db.json: the Cambridge train database file (with artificial connections), containing trains in the Cambridge, UK area and a set of attributes.
6. hospital_db.json: the Cambridge hospital database file, containing information about departments.
7. police_db.json: the Cambridge police station information.
8. taxi_db.json: slot-value list for the taxi domain.
9. valListFile.json: list of dialogues for validation.
10. testListFile.json: list of dialogues for testing.
11. system_acts.json: system act annotations.
12. ontology.json: data-based ontology.
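A minimal sketch of splitting data.json using the validation and test lists might look like this. It assumes data.json maps dialogue IDs to dialogues and that valListFile.json and testListFile.json hold one dialogue ID per line; both assumptions should be verified against the release.

```python
import json

# Load the full dialogue collection (assumed: dialogue ID -> dialogue).
with open("data.json") as f:
    dialogues = json.load(f)

# Assumed format: one dialogue ID per line in each list file.
with open("valListFile.json") as f:
    val_ids = set(f.read().split())
with open("testListFile.json") as f:
    test_ids = set(f.read().split())

train = {k: v for k, v in dialogues.items() if k not in val_ids | test_ids}
val = {k: v for k, v in dialogues.items() if k in val_ids}
test = {k: v for k, v in dialogues.items() if k in test_ids}
print(len(train), len(val), len(test))
```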
Important note: This dataset was previously entitled 'Research data supporting "MultiWOZ 2.1 - Multi-Domain Dialogue State Corrections and State Tracking Baselines"'. The change to the current title of this dataset was made at the request of the authors in July 2019.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This multi-targeted dataset contains several sub-datasets for training goal-oriented dialogue systems for the student-service domain in Latvian: a manually annotated dataset of domain-specific dialog intents; a manually created and annotated dataset of generalised and formalised dialog scenarios based on corpus evidence; and a dataset for FAQ module training.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
The experimental AI Dialog software application is able to build a mental model from a plain English dialogue using speech or textual input. Subsequently, comprehension and logical capability may be tested using plain English queries.
AI Dialog is used to experimentally test the validity and utility of a novel software application design, as a path to Artificial General Intelligence, sometimes referred to as Strong Artificial Intelligence or Conversational Artificial Intelligence.
The theory behind AI Dialog is fully described in the book: Belch, Francis R. (2023) Artificial General Intelligence – A New Approach, available from: Amazon.com (from September 2023).
A short demonstration of the AI Dialog software application is available from YouTube®, and entitled:
Artificial General Intelligence – A New Approach – AI Dialog Demonstration.
There are also two YouTube® lectures each of about 1 hour duration describing the radical new approach to Artificial General Intelligence used to implement the AI Dialog software application. These are:
Artificial General Intelligence – A New Approach – Part I Sentence Semantics.
Artificial General Intelligence – A New Approach – Part II Dialogues and Mental Models.
This is a free download of the executable of the AI Dialog Software Application Version 4.1 Alpha release. This version supersedes Version 3.2 to allow speech as well as textual user input.
The AI Dialog software is protected by international copyright, but is made available to use for non-commercial personal study purposes.
The application will run on Windows 10® PC, laptop, and tablet systems, and requires about 1 MB of memory. The download file is zipped and needs to be unzipped. After this, the content of the folder AI Dialog 4.1 Alpha Release is:
• Application Files (Folder)
• Documentation (Folder)
• NLP2016Autumn (Manifest)
• Setup (Application)
In the Documentation folder are two PDF files:
• Copy Of Tuition Lessons (PDF)
• Specification (PDF)
The first is a hard copy of the tuition lessons. The second is a specification of a subset of English for use with the AI Dialog system. However, there is no need to consult either of these initially, as AI Dialog incorporates a quick start tuition module.
To install AI Dialog, double click the Setup file. This starts AI Dialog immediately after installation, but places an application icon on the Windows 10® Start list for restarting later. After AI Dialog starts, just follow the speech or pop-up message instructions, which lead to quick start interactive tuition modules, fully describing how to use the application.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for ProsocialDialog Dataset
Dataset Summary
ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative… See the full description on the dataset page: https://huggingface.co/datasets/allenai/prosocial-dialog.
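Given the dataset page linked above, a minimal loading sketch with the Hugging Face datasets library might look like this; the split names and field layout are not guaranteed here, so print one example to inspect the schema.

```python
from datasets import load_dataset

# Dataset ID taken from the page linked above.
ds = load_dataset("allenai/prosocial-dialog")
print(ds)              # show the available splits
print(ds["train"][0])  # inspect one example's fields
```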
English (the United States) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. Our dataset was collected from an extensive and diverse pool of speakers (1,416 Americans) across geographic regions, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR-, CCPA-, and PIPL-compliant.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The DailyDialog dataset is a curated collection of multi-turn dialogues that reflects everyday communication. It covers a variety of topics relevant to daily experiences. This dataset features human-written conversations, ensuring natural and realistic language, which contributes to higher quality data with less noise. Each dialogue involves two or more participants and is provided in a textual format. A key feature of this dataset is the inclusion of corresponding labels for communication intention and emotion attached to each utterance. These labels offer valuable insights into how participants express their intentions and emotional states through speech. The dataset is an invaluable resource for developing robust dialogue systems capable of understanding human interactions on a deeper level, identifying diverse intentions, and recognising various emotional states encountered in daily exchanges.
The dataset is organised into three separate CSV files: validation.csv, train.csv, and test.csv. These files facilitate different stages of model development, including validation, training, and testing. The dataset focuses on multi-turn dialogues. Specific numbers for rows or records are not provided within the available information.
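A minimal sketch for loading the three splits follows; the label column name ("emotion") is an assumption based on the description above and should be checked against the CSV headers.

```python
import pandas as pd

# The three split files are named in the description above.
splits = {name: pd.read_csv(f"{name}.csv")
          for name in ("train", "validation", "test")}

for name, df in splits.items():
    print(name, len(df), list(df.columns))

# Hypothetical label column -- adjust once the real header is known.
print(splits["train"]["emotion"].value_counts())
```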
This dataset offers excellent opportunities for various applications:
* Natural Language Processing (NLP): Ideal for training NLP models to understand and generate more realistic and human-like dialogues. Communication intention labels help identify the purpose of utterances, while emotion labels add emotional context.
* Sentiment Analysis: With the emotion labels, the dataset can be used for sentiment analysis tasks, allowing classification of the overall sentiment of a conversation or individual utterances. This is useful for understanding customer feedback or social media discussions.
* Dialogue Generation: One can train dialogue generation models capable of creating engaging conversations on various daily life topics. Communication intention labels can guide the model in generating appropriate responses based on different expressed intents.
The dataset is designed to accurately represent daily life conversations, covering a wide range of everyday topics. It consists of human-written conversations, ensuring natural language use. No specific geographic, time range, or demographic scope beyond "daily life" is detailed.
CC0
Original Data Source: DailyDialog: Multi-Turn Dialog+Intention+Emotion
Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers: specifically, short fragments extracted from spontaneous conversations and close re-enactments in the other language by the original speakers, for 3,816 pairs of matching utterances.

Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso, which is located on the US-Mexico border. All participants were bilingual speakers of General American English and of Mexico-Texas Border Spanish. Their self-described dialects were El Paso for English and, for Spanish, mostly "El Paso/Juarez." Each speaker pair had a ten-minute conversation in one language. From these conversations, various fragments were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. After recording, fragments were mapped to the translated re-enactments using ELAN, an annotation tool for audio and video recordings. Metadata about conversations, participants, re-enactments, and utterances are included in this release. The audio data is presented as FLAC-compressed, single-channel, 16 kHz, 16-bit linear PCM.
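As a small sketch for working with the audio (the filename is hypothetical; the format details are from the description above):

```python
import soundfile as sf

# Hypothetical filename; the release documents single-channel,
# 16 kHz, 16-bit linear PCM stored as FLAC.
audio, sr = sf.read("fragment_0001.flac", dtype="int16")
assert sr == 16000, "expected a 16 kHz sample rate"
print(f"{len(audio) / sr:.2f} s of audio, dtype={audio.dtype}")
```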
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
This work was accepted at Findings of ACL 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.
If you want to use our dataset, please cite our paper.
Dialogue Data

We collect dialogue data for DialogSum from three public dialogue corpora, namely DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019), and MuTual (Cui et al., 2020), as well as an English-speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends or colleagues, or between service providers and customers.
Compared with previous datasets, dialogues from DialogSum have distinct characteristics:
* They occur in rich real-life scenarios, including more diverse task-oriented scenarios;
* They have clear communication patterns and intents, which makes them valuable summarization sources;
* They have a reasonable length, which suits the purpose of automatic summarization.
Summaries

We ask annotators to summarize each dialogue based on the following criteria:
* Convey the most salient information;
* Be brief;
* Preserve important named entities within the conversation;
* Be written from an observer's perspective;
* Be written in formal language.
Topics

In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.
Dataset Card for Open Assistant Cornell Movies Dialog
Dataset Summary
The dataset was created using the Cornell Movies Dialog Corpus, which contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts. Dialogs and metadata from the underlying corpus were used to design a dataset that can be used to train InstructGPT-based models to learn movie scripts. Example : User: Assume RICK and ALICE are characters from a fantasy-horror movie, continue… See the full description on the dataset page: https://huggingface.co/datasets/shahules786/OA-cornell-movies-dialog.