100+ datasets found
  1. DailyDialog Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Feb 2, 2021
    Cite
    Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu (2021). DailyDialog Dataset [Dataset]. https://paperswithcode.com/dataset/dailydialog
    Explore at:
    Dataset updated
    Feb 2, 2021
    Authors
    Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu
    Description

    DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
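    For quick experimentation, DailyDialog is also mirrored on the Hugging Face Hub. Below is a minimal loading sketch; the "daily_dialog" dataset ID and the dialog/act/emotion field names are assumptions based on that mirror, not something stated in this listing.

    ```python
    # Minimal sketch: load DailyDialog from the Hugging Face Hub.
    # Assumes the dataset is published under the "daily_dialog" ID; verify before use.
    from datasets import load_dataset

    ds = load_dataset("daily_dialog")
    print({split: len(ds[split]) for split in ds})  # expect roughly 11118 / 1000 / 1000

    example = ds["train"][0]
    print(example["dialog"][:2])   # first two utterances of the first dialogue
    print(example["act"][:2])      # per-utterance dialog-act labels
    print(example["emotion"][:2])  # per-utterance emotion labels
    ```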

  2. Mini Daily Dialog Act

    • kaggle.com
    Updated Mar 23, 2022
    Cite
    Aseem Srivastava (2022). Mini Daily Dialog Act [Dataset]. https://www.kaggle.com/datasets/as3eem/mini-daily-dialog-act/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aseem Srivastava
    Description

    About the Data

    The Mini Daily Dialog Dataset is a smaller, more processed version of the DailyDialog dataset, intended for NLU tasks. The mini version contains 700 dialogs in the train.csv file and 100 dialogs in the test.csv file, along with the corresponding dialog acts.

    About Dialog Acts

    There are 4 dialog acts in the data, encoded as follows:

    • inform: 1
    • question: 2
    • directive: 3
    • commissive: 4

    Use of the Dataset

    This dataset could be used for class assignments and mini-project demos.
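    Given the act encoding above, here is a minimal pandas sketch for reading the splits and decoding the labels. The column names ("dialog", "act") are assumptions, since the listing does not give the CSV schema.

    ```python
    # Minimal sketch: load the mini splits and decode the dialog-act codes.
    # Column names "dialog" and "act" are assumptions; check the actual CSV headers first.
    import pandas as pd

    ACT_LABELS = {1: "inform", 2: "question", 3: "directive", 4: "commissive"}

    train = pd.read_csv("train.csv")  # 700 dialogs
    test = pd.read_csv("test.csv")    # 100 dialogs

    train["act_label"] = train["act"].map(ACT_LABELS)
    print(train[["dialog", "act", "act_label"]].head())
    ```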

  3. Datasets used in experiments.

    • plos.figshare.com
    zip
    Updated Apr 16, 2024
    Cite
    Mingkai Zhang; Dan You; Shouguang Wang (2024). Datasets used in experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0302104.s001
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mingkai Zhang; Dan You; Shouguang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The explosive growth of dialogue data has aroused significant interest among scholars in abstractive dialogue summarization. In this paper, we propose a novel sequence-to-sequence framework called DS-SS (Dialogue Summarization with Factual-Statement Fusion and Dialogue Segmentation) for summarizing dialogues. The novelty of the DS-SS framework mainly lies in two aspects: 1) Factual statements are extracted from the source dialogue and combined with the source dialogue to perform the further dialogue encoding; and 2) A dialogue segmenter is trained and used to separate a dialogue to be encoded into several topic-coherent segments. Thanks to these two aspects, the proposed framework may better encode dialogues, thereby generating summaries exhibiting higher factual consistency and informativeness. Experimental results on two large-scale datasets SAMSum and DialogSum demonstrate the superiority of our framework over strong baselines, as evidenced by both automatic evaluation metrics and human evaluation.

  4. Dialogue State Tracking Challenge Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 1, 2020
    Cite
    Jason Williams; Antoine Raux; Deepak Ramachandran; Alan Black (2020). Dialogue State Tracking Challenge Dataset [Dataset]. https://paperswithcode.com/dataset/dialogue-state-tracking-challenge
    Explore at:
    Dataset updated
    May 1, 2020
    Authors
    Jason Williams; Antoine Raux; Deepak Ramachandran; Alan Black
    Description

    The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. State tracking, sometimes called belief tracking, refers to accurately estimating the user's goal as a dialog progresses. Accurate state tracking is desirable because it provides robustness to errors in speech recognition and helps reduce the ambiguity inherent in language within a temporal process like dialog. In these challenges, participants were given labelled corpora of dialogs to develop state tracking algorithms. The trackers were then evaluated on a common set of held-out dialogs, which were released, unlabelled, during a one-week period.

    The corpus was collected using Amazon Mechanical Turk, and consists of dialogs in two domains: restaurant information, and tourist information. Tourist information subsumes restaurant information, and includes bars, cafés etc. as well as multiple new slots. There were two rounds of evaluation using this data:

    DSTC 2 released a large number of training dialogs related to restaurant search. Compared to DSTC (which was in the bus timetables domain), DSTC 2 introduced changing user goals and tracking of 'requested slots', as well as the new restaurants domain. Results from DSTC 2 were presented at SIGDIAL 2014. DSTC 3 addressed the problem of adaptation to a new domain: tourist information. DSTC 3 released a small amount of labelled data in the tourist information domain; participants used this data plus the restaurant data from DSTC 2 for training. Dialogs used for training are fully labelled: user transcriptions, user dialog-act semantics and dialog state are all annotated. (This corpus is therefore also suitable for studies in Spoken Language Understanding.)

  5. Data from: MMD Dataset

    • paperswithcode.com
    Updated May 26, 2023
    + more versions
    Cite
    Amrita Saha; Mitesh Khapra; Karthik Sankaranarayanan (2023). MMD Dataset [Dataset]. https://paperswithcode.com/dataset/mmd
    Explore at:
    Dataset updated
    May 26, 2023
    Authors
    Amrita Saha; Mitesh Khapra; Karthik Sankaranarayanan
    Description

    The MMD (MultiModal Dialogs) dataset is a dataset for multimodal, domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated, manually intensive, iterative process.

  6. meddialog

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Updated Apr 22, 2023
    + more versions
    Cite
    BigScience Biomedical Datasets (2023). meddialog [Dataset]. https://huggingface.co/datasets/bigbio/meddialog
    Explore at:
    Dataset updated
    Apr 22, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.

  7. 90,000 sets – Multi-domain Customer Service Dialogue Text Data

    • m.nexdata.ai
    Updated Nov 21, 2023
    Cite
    Nexdata (2023). 90,000 sets – Multi-domain Customer Service Dialogue Text Data [Dataset]. https://m.nexdata.ai/datasets/llm/1396
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Language, Data Size, Data content, Storage Format, Data collection method
    Description

    Multi-domain Customer Service Dialogue Text Data, 90,000 sets in total, spanning multiple domains including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set of data consists of single- or multi-turn conversations. This dataset can be used for tasks such as LLM training and building ChatGPT-style assistants.

  8. Data from: DialoGLUE: A Natural Language Understanding Benchmark for...

    • registry.opendata.aws
    Updated Oct 11, 2020
    Cite
    Amazon (2020). DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [Dataset]. https://registry.opendata.aws/dialoglue/
    Explore at:
    Dataset updated
    Oct 11, 2020
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints described in the linked paper.
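    The checkpoints are served from a public S3 bucket, so they can be listed without credentials. A minimal boto3 sketch follows; the bucket name "dialoglue" is an assumption inferred from the registry URL, not confirmed by this listing.

    ```python
    # Minimal sketch: anonymously list the public checkpoint bucket with boto3.
    # The bucket name "dialoglue" is an assumption inferred from the registry URL.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(Bucket="dialoglue", MaxKeys=25)
    for obj in response.get("Contents", []):
        print(f'{obj["Size"]:>12}  {obj["Key"]}')
    ```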

  9. 830,276 groups - Multi-Round Interpersonal Dialogues Text Data

    • m.nexdata.ai
    • nexdata.ai
    Updated Oct 4, 2023
    Cite
    Nexdata (2023). 830,276 groups - Multi-Round Interpersonal Dialogues Text Data [Dataset]. https://m.nexdata.ai/datasets/llm/150
    Explore at:
    Dataset updated
    Oct 4, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Language, Data size, Applications, Data content, Storage format, Collecting period
    Description

    This database is the interactive text corpus of real users on mobile phones. The database has been desensitized to ensure it contains no private user information (A and B are codes replacing the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '***'). This database can be used for tasks such as natural language understanding.

  10. Business Scene Dialogue Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 10, 2021
    Cite
    Matīss Rikters; Ryokan Ri; Tong Li; Toshiaki Nakazawa (2021). Business Scene Dialogue Dataset [Dataset]. https://paperswithcode.com/dataset/business-scene-dialogue
    Explore at:
    Dataset updated
    Jan 10, 2021
    Authors
    Matīss Rikters; Ryokan Ri; Tong Li; Toshiaki Nakazawa
    Description

    The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:

    1. selecting business scenes,
    2. writing monolingual conversation scenarios according to the selected scenes, and
    3. translating the scenarios into the other language.

    Half of the monolingual scenarios were written in Japanese and the other half were written in English. The whole construction process was supervised by a person who satisfies the following conditions, to guarantee that the conversations are natural:

    • has the experience of being engaged in language learning programs, especially for business conversations
    • is able to smoothly communicate with others in various business scenes both in Japanese and English
    • has the experience of being involved in business

    The BSD corpus is split into balanced training, development and evaluation sets. The documents in these sets are balanced in terms of scenes and original languages. In this repository we publicly share the full development and evaluation sets and a part of the training data set.

  11. Cornell Movie Dialogs Corpus SQLite

    • kaggle.com
    Updated Feb 13, 2018
    Cite
    Lee Richards (2018). Cornell Movie Dialogs Corpus SQLite [Dataset]. https://www.kaggle.com/mrlarichards/cornell-movie-dialogs-corpus-sqlite/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lee Richards
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    I decided, for a little hobby project I'm working on, that I needed a dialog dataset, which Cornell University kindly provided here. However, as a database programmer, I'm used to working with structured data, not parsing and building lists from text-based files, and I decided that life would be much easier for me if I had this data in an SQL-type database, so I wrote me a little python script to chunk the whole thing into SQLite.

    Content

    Original Data Set: https://www.kaggle.com/Cornell-University/movie-dialog-corpus

    As of this writing, the original dataset was updated 7 months ago.

    The data is normalized, with all of the code-breaking artifacts I ran into hand-corrected. If you're familiar with SQL, and have a language/library that supports SQLite, I hope you'll find this fairly easy to work with. All of the data from the original dataset is, I believe, present, though I did remove some redundancies. For example, in the original dataset, movie_lines.tsv lists the character name along with the character id, which is redundant, because the name is listed in the movie_characters.tsv file. While this is a convenience when you have to process the file directly, it can easily be obtained by a JOIN in a structured database. The raw_script_urls are included in the movies table.
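    As an illustration of that kind of JOIN, here is a minimal sqlite3 sketch. The database path and the table/column names (movie_lines, movie_characters, character_id, name, text) are assumptions, since the listing does not spell out the schema; inspect it with .schema first.

    ```python
    # Minimal sketch: recover character names next to their lines via a JOIN.
    # Database path and table/column names are assumptions; check the real schema.
    import sqlite3

    conn = sqlite3.connect("cornell_movie_dialogs.sqlite")
    rows = conn.execute(
        """
        SELECT c.name, l.text
        FROM movie_lines AS l
        JOIN movie_characters AS c ON c.character_id = l.character_id
        LIMIT 5
        """
    ).fetchall()
    for name, text in rows:
        print(f"{name}: {text}")
    conn.close()
    ```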

    Acknowledgements

    Thank you to Cornell University for providing the original Corpus. Photo by Tobias Fischer on Unsplash

    Inspiration

    Do let me know if you find this useful. I will probably do similar conversions for other datasets as I need them, and would happily upload them if anyone else finds them useful in that form.

  12. Research data supporting "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz...

    • repository.cam.ac.uk
    zip
    Updated Jul 10, 2019
    + more versions
    Cite
    Budzianowski, Paweł; Mihail, Eric; Rahul, Goel; Shachi, Paul; Sethi, Abhishek; Agarwal, Sanchit; Gao, Shuyang; Hakkani-Tur, Dilek (2019). Research data supporting "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling" [Dataset]. http://doi.org/10.17863/CAM.41572
    Explore at:
    zip (13,794,372 bytes); available download formats
    Dataset updated
    Jul 10, 2019
    Dataset provided by
    Apollo
    University of Cambridge
    Authors
    Budzianowski, Paweł; Mihail, Eric; Rahul, Goel; Shachi, Paul; Sethi, Abhishek; Agarwal, Sanchit; Gao, Shuyang; Hakkani-Tur, Dilek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains the following JSON files:

    1. data.json: the WOZ dialogue dataset, which contains the conversations between users and wizards, as well as a set of coarse labels for each user turn.
    2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge UK area and a set of attributes.
    3. attraction_db.json: the Cambridge attraction database file, containing attractions in the Cambridge UK area and a set of attributes.
    4. hotel_db.json: the Cambridge hotel database file, containing hotels in the Cambridge UK area and a set of attributes.
    5. train_db.json: the Cambridge train (with artificial connections) database file, containing trains in the Cambridge UK area and a set of attributes.
    6. hospital_db.json: the Cambridge hospital database file, containing information about departments.
    7. police_db.json: the Cambridge police station information.
    8. taxi_db.json: slot-value list for the taxi domain.
    9. valListFile.json: list of dialogues for validation.
    10. testListFile.json: list of dialogues for testing.
    11. system_acts.json: system act annotations.
    12. ontology.json: data-based ontology.
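    A minimal sketch for splitting data.json with the validation and test lists follows. It assumes data.json is a dict keyed by dialogue ID and that valListFile.json / testListFile.json hold lists of those IDs; neither assumption is confirmed by this listing, so verify against the files.

    ```python
    # Minimal sketch: split the dialogues into train/val/test using the ID lists.
    # Assumes data.json maps dialogue IDs to dialogues and the list files hold IDs.
    import json

    with open("data.json") as f:
        dialogues = json.load(f)
    with open("valListFile.json") as f:
        val_ids = set(json.load(f))
    with open("testListFile.json") as f:
        test_ids = set(json.load(f))

    splits = {"train": {}, "val": {}, "test": {}}
    for dial_id, dial in dialogues.items():
        if dial_id in val_ids:
            splits["val"][dial_id] = dial
        elif dial_id in test_ids:
            splits["test"][dial_id] = dial
        else:
            splits["train"][dial_id] = dial

    print({name: len(part) for name, part in splits.items()})
    ```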

    Important note: This dataset was previously entitled 'Research data supporting "MultiWOZ 2.1 - Multi-Domain Dialogue State Corrections and State Tracking Baselines"'. The change to the current title of this dataset was made at the request of the authors in July 2019.

  13. Data from: LUIS: data collection for task oriented dialogue system creation

    • repository.clarin.lv
    Updated Jul 6, 2021
    Cite
    Nešpore-Bērzkalne Gunta; Inguna Skadiņa; Normunds Grūzītis; Artūrs Znotiņš; Didzis Goško (2021). LUIS: data collection for task oriented dialogue system creation [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/47
    Explore at:
    Dataset updated
    Jul 6, 2021
    Authors
    Nešpore-Bērzkalne Gunta; Inguna Skadiņa; Normunds Grūzītis; Artūrs Znotiņš; Didzis Goško
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This multi-targeted dataset contains several datasets that allow training goal-oriented dialogue systems for the student-service domain in Latvian. It includes a manually annotated dataset of domain-specific dialog intents, a manually created and annotated dataset of generalised and formalised dialog scenarios based on corpus evidence, and a dataset for FAQ module training.

  14. AI Dialog Software Application

    • data.mendeley.com
    Updated Aug 31, 2023
    + more versions
    Cite
    Francis R Belch (2023). AI Dialog Software Application [Dataset]. http://doi.org/10.17632/zhv2wfnprv.4
    Explore at:
    Dataset updated
    Aug 31, 2023
    Authors
    Francis R Belch
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description


    The experimental AI Dialog software application is able to build a mental model from a plain English dialogue using speech or textual input. Subsequently, comprehension and logical capability may be tested using plain English queries.

    AI Dialog is used to experimentally test the validity and utility of a novel software application design, as a path to Artificial General Intelligence, sometimes referred to as Strong Artificial Intelligence or Conversational Artificial Intelligence.

    The theory behind AI Dialog is fully described in the book: Belch, Francis R. (2023) Artificial General Intelligence – A New Approach, available from: Amazon.com (from September 2023).

    A short demonstration of the AI Dialog software application is available from YouTube®, and entitled:

    Artificial General Intelligence – A New Approach – AI Dialog Demonstration.

    There are also two YouTube® lectures each of about 1 hour duration describing the radical new approach to Artificial General Intelligence used to implement the AI Dialog software application. These are:

    Artificial General Intelligence – A New Approach – Part I Sentence Semantics.

    Artificial General Intelligence – A New Approach – Part II Dialogues and Mental Models.

    This is a free download of the executable of the AI Dialog Software Application Version 4.1 Alpha release. This version supersedes Version 3.2 to allow speech as well as textual user input.

    The AI Dialog software is protected by international copyright, but is made available to use for non-commercial personal study purposes.

    The application will run on Windows 10® PC, laptop, and tablet systems, and requires about 1 MB of memory. The download file is zipped and needs to be unzipped. After this, the content of the folder AI Dialog 4.1 Alpha Release is:

    • Application Files (Folder)
    • Documentation (Folder)
    • NLP2016Autumn (Manifest)
    • Setup (Application)

    In the Documentation folder are two PDF files:

    • Copy Of Tuition Lessons (PDF)
    • Specification (PDF)

    The first is a hard copy of the tuition lessons. The second is a specification of a subset of English for use with the AI Dialog system. However, there is no need to consult either of these initially, as AI Dialog incorporates a quick start tuition module.

    To install AI Dialog, double click the Setup file. This starts AI Dialog immediately after installation, but places an application icon on the Windows 10® Start list for restarting later. After AI Dialog starts, just follow the speech or pop-up message instructions, which lead to quick start interactive tuition modules, fully describing how to use the application.

  15. prosocial-dialog

    • huggingface.co
    • opendatalab.com
    Updated Feb 22, 2023
    Cite
    Ai2 (2023). prosocial-dialog [Dataset]. https://huggingface.co/datasets/allenai/prosocial-dialog
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for ProsocialDialog Dataset

      Dataset Summary
    

    ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative… See the full description on the dataset page: https://huggingface.co/datasets/allenai/prosocial-dialog.

  16. 1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone...

    • nexdata.ai
    • m.nexdata.ai
    Updated Nov 8, 2023
    Cite
    Nexdata (2023). 1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1004
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Area covered
    United States
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    English (the United States) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (1,416 Americans), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.

  17. Multi-Turn Dialogues with Emotion & Intent Labels

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Multi-Turn Dialogues with Emotion & Intent Labels [Dataset]. https://www.opendatabay.com/data/ai-ml/c2640303-2aa1-4d38-a323-4e674bf07b5b
    Explore at:
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The DailyDialog dataset is a curated collection of multi-turn dialogues that reflects everyday communication. It covers a variety of topics relevant to daily experiences. This dataset features human-written conversations, ensuring natural and realistic language, which contributes to higher quality data with less noise. Each dialogue involves two or more participants and is provided in a textual format. A key feature of this dataset is the inclusion of corresponding labels for communication intention and emotion attached to each utterance. These labels offer valuable insights into how participants express their intentions and emotional states through speech. The dataset is an invaluable resource for developing robust dialogue systems capable of understanding human interactions on a deeper level, identifying diverse intentions, and recognising various emotional states encountered in daily exchanges.

    Columns

    • dialog: This column contains the actual conversation between two or more participants. It is presented in text format.
    • act: The act column provides the communication intention labels for each utterance within the dialogue. These labels categorise the purpose behind a participant's speech, such as asking a question, making a statement, or making a request.
    • emotion: This column holds categorical labels that represent the emotions expressed by each participant during their utterances, including examples like anger, happiness, or sadness.

    Distribution

    The dataset is organised into three separate CSV files: validation.csv, train.csv, and test.csv. These files facilitate different stages of model development, including validation, training, and testing. The dataset focuses on multi-turn dialogues. Specific numbers for rows or records are not provided within the available information.

    Usage

    This dataset offers excellent opportunities for various applications:

    • Natural Language Processing (NLP): Ideal for training NLP models to understand and generate more realistic and human-like dialogues. Communication intention labels help identify the purpose of utterances, while emotion labels add emotional context.
    • Sentiment Analysis: With the emotion labels, the dataset can be used for sentiment analysis tasks, allowing classification of the overall sentiment of a conversation or individual utterances. This is useful for understanding customer feedback or social media discussions.
    • Dialogue Generation: One can train dialogue generation models capable of creating engaging conversations on various daily life topics. Communication intention labels can guide the model in generating appropriate responses based on different expressed intents.

    Coverage

    The dataset is designed to accurately represent daily life conversations, covering a wide range of everyday topics. It consists of human-written conversations, ensuring natural language use. No specific geographic, time range, or demographic scope beyond "daily life" is detailed.

    License

    CC0

    Who Can Use It

    • AI/ML Developers: Especially those working on dialogue systems, conversational AI, and natural language understanding.
    • NLP Researchers: Individuals focused on advancing NLP models for dialogue, intention recognition, and emotion detection.
    • Data Scientists: Those interested in sentiment analysis, language modelling, and human communication patterns.
    • Academics: Researchers and students studying human interaction, linguistics, and machine learning applications in text analysis.

    Dataset Name Suggestions

    • DailyDialog: Intent & Emotion Conversations
    • Multi-Turn Dialogues with Emotion & Intent Labels
    • Everyday Conversation Dataset
    • Human-Written Dialogues: Intent & Sentiment
    • DailyTalk: Annotated Conversations

    Attributes

    Original Data Source: DailyDialog: Multi-Turn Dialog+Intention+Emotion

  18. Data from: Dialogs Re-Enacted Across Languages

    • abacus.library.ubc.ca
    iso, txt
    Updated Sep 17, 2024
    Cite
    Abacus Data Network (2024). Dialogs Re-Enacted Across Languages [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=d64940b1987f7d56125a2d289c5c?persistentId=hdl%3A11272.1%2FAB2%2FXRMWND&version=&q=&fileTypeGroupFacet=&fileAccess=&fileSortField=size
    Explore at:
    iso (938,545,152 bytes), txt (1,308 bytes); available download formats
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    Abacus Data Network
    Description

    Abstract

    Introduction: Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers; specifically, short fragments extracted from spontaneous conversations and close re-enactments in the other language by the original speakers, for 3,816 pairs of matching utterances.

    Data: Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso, which is located on the US-Mexico border. All participants were bilingual speakers of General American English and of Mexico-Texas Border Spanish. Their self-described dialects were El Paso for English and, for Spanish, mostly "El Paso/Juarez." Each speaker pair had a ten-minute conversation in one language. From these conversations, various fragments were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. After recording, fragments were mapped to the translated re-enactments using ELAN, an annotation tool for audio and video recordings. Metadata about conversations, participants, re-enactments and utterances are included in this release. The audio data is presented as FLAC-compressed, single-channel, 16 kHz, 16-bit linear PCM.
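    Given the stated audio format, here is a minimal sketch for reading one fragment with the soundfile library; the file path is hypothetical, since the release's directory layout is not described here.

    ```python
    # Minimal sketch: read one FLAC fragment and confirm the stated format.
    # The path below is hypothetical; adapt it to the actual release layout.
    import soundfile as sf

    audio, sample_rate = sf.read("fragments/example_fragment.flac")
    print(sample_rate)   # expected: 16000 Hz
    print(audio.shape)   # expected: (num_samples,) for single-channel audio
    print(audio.dtype)   # float64 by default; pass dtype="int16" for raw PCM values
    ```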

  19. DialogSum Dataset

    • paperswithcode.com
    Updated Jun 2, 2021
    Cite
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang (2021). DialogSum Dataset [Dataset]. https://paperswithcode.com/dataset/dialogsum
    Explore at:
    Dataset updated
    Jun 2, 2021
    Authors
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang
    Description

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

    This work was accepted at Findings of ACL 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.

    If you want to use our dataset, please cite our paper.

    Dialogue Data: We collect dialogue data for DialogSum from three public dialogue corpora, namely DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends, colleagues, and between service providers and customers.

    Compared with previous datasets, dialogues from DialogSum have distinct characteristics:

    • they are set in rich real-life scenarios, including more diverse task-oriented scenarios;
    • they have clear communication patterns and intents, which makes them valuable as summarization sources;
    • they have a reasonable length, which suits the purpose of automatic summarization.

    Summaries: We ask annotators to summarize each dialogue based on the following criteria:

    • convey the most salient information;
    • be brief;
    • preserve important named entities within the conversation;
    • be written from an observer perspective;
    • be written in formal language.

    Topics: In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.

  20. OA-cornell-movies-dialog

    • huggingface.co
    Updated Feb 21, 2023
    Cite
    Shahul Es (2023). OA-cornell-movies-dialog [Dataset]. https://huggingface.co/datasets/shahules786/OA-cornell-movies-dialog
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 21, 2023
    Authors
    Shahul Es
    Description

    Dataset Card for Open Assistant Cornell Movies Dialog

      Dataset Summary
    

    The dataset was created using the Cornell Movies Dialog Corpus, which contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. Dialogs and meta-data from the underlying corpus were used to design a dataset that InstructGPT-based models can use to learn movie scripts. Example : User: Assume RICK and ALICE are characters from a fantasy-horror movie, continue… See the full description on the dataset page: https://huggingface.co/datasets/shahules786/OA-cornell-movies-dialog.
