DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1,000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
The Mini Daily Dialog Dataset is a smaller, preprocessed version of the Daily Dialog dataset intended for NLU tasks. The mini version contains 700 dialogs in the train.csv file and 100 dialogs in the test.csv file, along with the corresponding dialog acts.
There are four dialog acts in the data, encoded as follows: inform = 1, question = 2, directive = 3, commissive = 4.
This dataset could be used for class assignments and mini-project demos.
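As a minimal loading sketch, the numeric act codes above can be mapped back to their names with pandas. The column names ("dialog", "act") are assumptions, since the exact CSV schema is not documented here; inspect the header row of train.csv to confirm.

```python
import pandas as pd

# Map the numeric dialog-act codes (documented above) back to their names.
ACT_LABELS = {1: "inform", 2: "question", 3: "directive", 4: "commissive"}

# Hypothetical column names -- adjust after inspecting train.csv.
train = pd.read_csv("train.csv")
train["act_name"] = train["act"].map(ACT_LABELS)
print(train[["dialog", "act", "act_name"]].head())
```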
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosive growth of dialogue data has aroused significant interest among scholars in abstractive dialogue summarization. In this paper, we propose a novel sequence-to-sequence framework called DS-SS (Dialogue Summarization with Factual-Statement Fusion and Dialogue Segmentation) for summarizing dialogues. The novelty of the DS-SS framework lies mainly in two aspects: 1) factual statements are extracted from the source dialogue and combined with it for the subsequent dialogue encoding; and 2) a dialogue segmenter is trained and used to split a dialogue into several topic-coherent segments before encoding. Thanks to these two aspects, the proposed framework can better encode dialogues, thereby generating summaries with higher factual consistency and informativeness. Experimental results on two large-scale datasets, SAMSum and DialogSum, demonstrate the superiority of our framework over strong baselines, as evidenced by both automatic evaluation metrics and human evaluation.
The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. State tracking, sometimes called belief tracking, refers to accurately estimating the user's goal as a dialog progresses. Accurate state tracking is desirable because it provides robustness to errors in speech recognition and helps reduce the ambiguity inherent in language within a temporal process like dialog. In these challenges, participants were given labelled corpora of dialogs with which to develop state tracking algorithms. The trackers were then evaluated on a common set of held-out dialogs, which were released, unlabelled, during a one-week period.
The corpus was collected using Amazon Mechanical Turk and consists of dialogs in two domains: restaurant information and tourist information. Tourist information subsumes restaurant information, and includes bars, cafés, etc., as well as multiple new slots. There were two rounds of evaluation using this data:
DSTC 2 released a large number of training dialogs related to restaurant search. Compared to the first DSTC (which was in the bus timetables domain), DSTC 2 introduced changing user goals, tracking of 'requested slots', and the new restaurants domain. Results from DSTC 2 were presented at SIGDIAL 2014. DSTC 3 addressed the problem of adaptation to a new domain: tourist information. DSTC 3 released a small amount of labelled data in the tourist information domain; participants used this data plus the restaurant data from DSTC 2 for training. Dialogs used for training are fully labelled; user transcriptions, user dialog-act semantics, and dialog state are all annotated. (This corpus is therefore also suitable for studies in Spoken Language Understanding.)
The MMD (MultiModal Dialogs) dataset is a dataset for multimodal domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated, manually intensive, iterative process.
License unknown: https://choosealicense.com/licenses/unknown/
The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing, and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.
Multi-domain Customer Service Dialogue Text Data, 90,000 sets in total, spanning multiple domains including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set of data consists of a single- or multi-turn conversation. This dataset can be used for tasks such as LLM training and ChatGPT-style fine-tuning.
This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints described in the linked paper.
This database is an interactive text corpus of real users on mobile phones. The database has been anonymized to ensure it contains no private user information (A and B are codes that replace the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '* * *'). This database can be used for tasks such as natural language understanding.
The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:
1) selecting business scenes, 2) writing monolingual conversation scenarios according to the selected scenes, and 3) translating the scenarios into the other language.
Half of the monolingual scenarios were written in Japanese and the other half were written in English. The whole construction process was supervised by a person who satisfies the following conditions to guarantee the conversations to be natural:
• has experience of being engaged in language-learning programs, especially for business conversations
• is able to communicate smoothly with others in various business scenes, both in Japanese and English
• has experience of being involved in business
The BSD corpus is split into balanced training, development and evaluation sets. The documents in these sets are balanced in terms of scenes and original languages. In this repository we publicly share the full development and evaluation sets and a part of the training data set.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
I decided, for a little hobby project I'm working on, that I needed a dialog dataset, which Cornell University kindly provided here. However, as a database programmer, I'm used to working with structured data, not parsing and building lists from text-based files, and I decided that life would be much easier for me if I had this data in an SQL-type database, so I wrote me a little Python script to chunk the whole thing into SQLite.
Original Data Set: https://www.kaggle.com/Cornell-University/movie-dialog-corpus
As of this writing, the original dataset was updated 7 months ago.
The data is normalized, with all of the code-breaking artifacts I ran into hand-corrected. If you're familiar with SQL, and have a language/library that supports SQLite, I hope you'll find this fairly easy to work with. All of the data from the original dataset is, I believe, present, though I did remove some redundancies. For example, in the original dataset, movie_lines.tsv lists the character name along with the character id, which is redundant, because the name is listed in the movie_characters.tsv file. While this is a convenience when you have to process the file directly, it can easily be obtained by a JOIN in a structured database. The raw_script_urls are included in the movies table.
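For illustration, the JOIN mentioned above might look like the following. The database filename and the table/column names are assumptions, so check the actual schema (e.g. with `.schema` in the sqlite3 shell) before running.

```python
import sqlite3

# Hypothetical file and schema names -- verify against the actual database.
conn = sqlite3.connect("movie_dialogs.db")

# Recover each line's character name by joining on the character id,
# instead of storing the name redundantly in the lines table.
rows = conn.execute("""
    SELECT c.name, l.line_text
    FROM movie_lines AS l
    JOIN movie_characters AS c ON c.character_id = l.character_id
    LIMIT 10
""").fetchall()

for name, text in rows:
    print(f"{name}: {text}")
conn.close()
```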
Thank you to Cornell University for providing the original Corpus. Photo by Tobias Fischer on Unsplash
Do let me know if you find this useful. I will probably do similar conversions for other datasets as I need them, and would happily upload them if anyone else finds them useful in that form.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the following JSON files:
1. data.json: the WOZ dialogue dataset, containing the conversations between users and wizards, as well as a set of coarse labels for each user turn.
2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge, UK area and a set of attributes.
3. attraction_db.json: the Cambridge attraction database file, containing attractions in the Cambridge, UK area and a set of attributes.
4. hotel_db.json: the Cambridge hotel database file, containing hotels in the Cambridge, UK area and a set of attributes.
5. train_db.json: the Cambridge train database file (with artificial connections), containing trains in the Cambridge, UK area and a set of attributes.
6. hospital_db.json: the Cambridge hospital database file, containing information about departments.
7. police_db.json: the Cambridge police station information.
8. taxi_db.json: slot-value list for the taxi domain.
9. valListFile.json: list of dialogues for validation.
10. testListFile.json: list of dialogues for testing.
11. system_acts.json: system act annotations.
12. ontology.json: data-based ontology.
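A minimal sketch of splitting data.json using the validation and test lists might look like this. It assumes data.json maps dialogue IDs to dialogues and that valListFile.json and testListFile.json hold one dialogue ID per line; both assumptions should be verified against the release.

```python
import json

# Load the full dialogue collection (assumed: dialogue ID -> dialogue).
with open("data.json") as f:
    dialogues = json.load(f)

# Assumed format: one dialogue ID per line in each list file.
with open("valListFile.json") as f:
    val_ids = set(f.read().split())
with open("testListFile.json") as f:
    test_ids = set(f.read().split())

train = {k: v for k, v in dialogues.items() if k not in val_ids | test_ids}
val = {k: v for k, v in dialogues.items() if k in val_ids}
test = {k: v for k, v in dialogues.items() if k in test_ids}
print(len(train), len(val), len(test))
```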
Important note: This dataset was previously entitled 'Research data supporting "MultiWOZ 2.1 - Multi-Domain Dialogue State Corrections and State Tracking Baselines"'. The change to the current title of this dataset was made at the request of the authors in July 2019.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This multi-targeted dataset contains several sub-datasets for training goal-oriented dialogue systems for the student-service domain in Latvian: a manually annotated dataset of domain-specific dialog intents; a manually created and annotated dataset of generalised and formalised dialog scenarios based on corpus evidence; and a dataset for FAQ module training.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
The experimental AI Dialog software application is able to build a mental model from a plain English dialogue using speech or textual input. Subsequently, comprehension and logical capability may be tested using plain English queries.
AI Dialog is used to experimentally test the validity and utility of a novel software application design, as a path to Artificial General Intelligence, sometimes referred to as Strong Artificial Intelligence or Conversational Artificial Intelligence.
The theory behind AI Dialog is fully described in the book: Belch, Francis R. (2023) Artificial General Intelligence – A New Approach, available from: Amazon.com (from September 2023).
A short demonstration of the AI Dialog software application is available from YouTube®, and entitled:
Artificial General Intelligence – A New Approach – AI Dialog Demonstration.
There are also two YouTube® lectures each of about 1 hour duration describing the radical new approach to Artificial General Intelligence used to implement the AI Dialog software application. These are:
Artificial General Intelligence – A New Approach – Part I Sentence Semantics.
Artificial General Intelligence – A New Approach – Part II Dialogues and Mental Models.
This is a free download of the executable of the AI Dialog Software Application Version 4.1 Alpha release. This version supersedes Version 3.2 to allow speech as well as textual user input.
The AI Dialog software is protected by international copyright, but is made available to use for non-commercial personal study purposes.
The application will run on Windows 10® PC, laptop, and tablet systems, and requires about 1 MB of memory. The download file is zipped and needs to be unzipped. After this, the content of the folder AI Dialog 4.1 Alpha Release is:
• Application Files (Folder)
• Documentation (Folder)
• NLP2016Autumn (Manifest)
• Setup (Application)
In the Documentation folder are two PDF files:
• Copy Of Tuition Lessons (PDF)
• Specification (PDF)
The first is a hard copy of the tuition lessons. The second is a specification of a subset of English for use with the AI Dialog system. However, there is no need to consult either of these initially, as AI Dialog incorporates a quick start tuition module.
To install AI Dialog, double click the Setup file. This starts AI Dialog immediately after installation, but places an application icon on the Windows 10® Start list for restarting later. After AI Dialog starts, just follow the speech or pop-up message instructions, which lead to quick start interactive tuition modules, fully describing how to use the application.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for ProsocialDialog Dataset
Dataset Summary
ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative… See the full description on the dataset page: https://huggingface.co/datasets/allenai/prosocial-dialog.
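Given the dataset page linked above, a minimal loading sketch with the Hugging Face datasets library might look like this; the split names and field layout are not guaranteed here, so print one example to inspect the schema.

```python
from datasets import load_dataset

# Dataset ID taken from the page linked above.
ds = load_dataset("allenai/prosocial-dialog")
print(ds)              # show the available splits
print(ds["train"][0])  # inspect one example's fields
```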
English (the United States) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. Our dataset was collected from an extensive and diverse pool of speakers (1,416 Americans) across geographic regions, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR-, CCPA-, and PIPL-compliant.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The DailyDialog dataset is a curated collection of multi-turn dialogues that reflects everyday communication. It covers a variety of topics relevant to daily experiences. This dataset features human-written conversations, ensuring natural and realistic language, which contributes to higher quality data with less noise. Each dialogue involves two or more participants and is provided in a textual format. A key feature of this dataset is the inclusion of corresponding labels for communication intention and emotion attached to each utterance. These labels offer valuable insights into how participants express their intentions and emotional states through speech. The dataset is an invaluable resource for developing robust dialogue systems capable of understanding human interactions on a deeper level, identifying diverse intentions, and recognising various emotional states encountered in daily exchanges.
The dataset is organised into three separate CSV files: validation.csv, train.csv, and test.csv. These files facilitate different stages of model development, including validation, training, and testing. The dataset focuses on multi-turn dialogues. Specific numbers for rows or records are not provided within the available information.
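A minimal sketch for loading the three splits follows; the label column name ("emotion") is an assumption based on the description above and should be checked against the CSV headers.

```python
import pandas as pd

# The three split files are named in the description above.
splits = {name: pd.read_csv(f"{name}.csv")
          for name in ("train", "validation", "test")}

for name, df in splits.items():
    print(name, len(df), list(df.columns))

# Hypothetical label column -- adjust once the real header is known.
print(splits["train"]["emotion"].value_counts())
```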
This dataset offers excellent opportunities for various applications:
* Natural Language Processing (NLP): Ideal for training NLP models to understand and generate more realistic and human-like dialogues. Communication intention labels help identify the purpose of utterances, while emotion labels add emotional context.
* Sentiment Analysis: With the emotion labels, the dataset can be used for sentiment analysis tasks, allowing classification of the overall sentiment of a conversation or individual utterances. This is useful for understanding customer feedback or social media discussions.
* Dialogue Generation: One can train dialogue generation models capable of creating engaging conversations on various daily life topics. Communication intention labels can guide the model in generating appropriate responses based on different expressed intents.
The dataset is designed to accurately represent daily life conversations, covering a wide range of everyday topics. It consists of human-written conversations, ensuring natural language use. No specific geographic, time range, or demographic scope beyond "daily life" is detailed.
CC0
Original Data Source: DailyDialog: Multi-Turn Dialog+Intention+Emotion
Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers: specifically, short fragments extracted from spontaneous conversations and close re-enactments in the other language by the original speakers, for 3,816 pairs of matching utterances.

Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso, which is located on the US-Mexico border. All participants were bilingual speakers of General American English and of Mexico-Texas Border Spanish. Their self-described dialects were El Paso for English and, for Spanish, mostly "El Paso/Juarez." Each speaker pair had a ten-minute conversation in one language. From these conversations, various fragments were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. After recording, fragments were mapped to the translated re-enactments using ELAN, an annotation tool for audio and video recordings. Metadata about conversations, participants, re-enactments, and utterances are included in this release. The audio data is presented as FLAC-compressed, single-channel, 16 kHz, 16-bit linear PCM.
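As a small sketch for working with the audio (the filename is hypothetical; the format details are from the description above):

```python
import soundfile as sf

# Hypothetical filename; the release documents single-channel,
# 16 kHz, 16-bit linear PCM stored as FLAC.
audio, sr = sf.read("fragment_0001.flac", dtype="int16")
assert sr == 16000, "expected a 16 kHz sample rate"
print(f"{len(audio) / sr:.2f} s of audio, dtype={audio.dtype}")
```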
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
This work was accepted at Findings of ACL 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.
If you want to use our dataset, please cite our paper.
Dialogue Data

We collect dialogue data for DialogSum from three public dialogue corpora, namely DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019), and MuTual (Cui et al., 2020), as well as an English-speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends or colleagues, or between service providers and customers.
Compared with previous datasets, dialogues from DialogSum have distinct characteristics:
* They occur in rich real-life scenarios, including more diverse task-oriented scenarios;
* They have clear communication patterns and intents, which makes them valuable summarization sources;
* They have a reasonable length, which suits the purpose of automatic summarization.
Summaries

We ask annotators to summarize each dialogue based on the following criteria:
* Convey the most salient information;
* Be brief;
* Preserve important named entities within the conversation;
* Be written from an observer's perspective;
* Be written in formal language.
Topics

In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.
Dataset Card for Open Assistant Cornell Movies Dialog
Dataset Summary
The dataset was created using the Cornell Movies Dialog Corpus, which contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts. Dialogs and metadata from the underlying corpus were used to design a dataset that can be used to train InstructGPT-based models to learn movie scripts. Example : User: Assume RICK and ALICE are characters from a fantasy-horror movie, continue… See the full description on the dataset page: https://huggingface.co/datasets/shahules786/OA-cornell-movies-dialog.