100+ datasets found
  1. DailyDialog Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 2, 2021
    Cite
    Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu (2021). DailyDialog Dataset [Dataset]. https://paperswithcode.com/dataset/dailydialog
    Explore at:
    Dataset updated
    Feb 2, 2021
    Authors
    Yan-ran Li; Hui Su; Xiaoyu Shen; Wenjie Li; Ziqiang Cao; Shuzi Niu
    Description

    DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.

  2. Statcan Dialogue Dataset

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Lu, Xing Han; Reddy, Siva; de Vries, Harm (2023). Statcan Dialogue Dataset [Dataset]. http://doi.org/10.5683/SP3/NR0BMY
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Lu, Xing Han; Reddy, Siva; de Vries, Harm
    Description

    Welcome to the data repository for requesting access to the Statcan Dialogue Dataset! Before requesting access, you can visit our website or read our EACL 2023 paper.

    Requesting Access

    In order to use our dataset, you must agree to the terms of use and restrictions before requesting access (see below). We will manually review each request and grant access or reach out to you for further information. To facilitate the process, make sure that:

    • Your Dataverse account is linked to your professional/research website, which we may review to ensure the dataset will be used for the intended purpose.
    • Your request is made with an academic (e.g. .edu) or professional email (e.g. @servicenow.com). To do this, you have to set your primary email to your academic/professional email, or create a new Dataverse account.

    If your academic institution does not end with .edu, or you are part of a professional group that does not have an email address, please contact us (see email in paper).

    Abstract: We introduce the StatCan Dialogue Dataset, consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5,000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.

  3. Datasets used in experiments.

    • plos.figshare.com
    zip
    Updated Apr 16, 2024
    Cite
    Mingkai Zhang; Dan You; Shouguang Wang (2024). Datasets used in experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0302104.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mingkai Zhang; Dan You; Shouguang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The explosive growth of dialogue data has aroused significant interest among scholars in abstractive dialogue summarization. In this paper, we propose a novel sequence-to-sequence framework called DS-SS (Dialogue Summarization with Factual-Statement Fusion and Dialogue Segmentation) for summarizing dialogues. The novelty of the DS-SS framework mainly lies in two aspects: 1) Factual statements are extracted from the source dialogue and combined with the source dialogue to perform the further dialogue encoding; and 2) A dialogue segmenter is trained and used to separate a dialogue to be encoded into several topic-coherent segments. Thanks to these two aspects, the proposed framework may better encode dialogues, thereby generating summaries exhibiting higher factual consistency and informativeness. Experimental results on two large-scale datasets SAMSum and DialogSum demonstrate the superiority of our framework over strong baselines, as evidenced by both automatic evaluation metrics and human evaluation.

  4. Mini Daily Dialog Act

    • kaggle.com
    Updated Mar 23, 2022
    Cite
    Aseem Srivastava (2022). Mini Daily Dialog Act [Dataset]. https://www.kaggle.com/datasets/as3eem/mini-daily-dialog-act/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aseem Srivastava
    Description

    About the Data

    The Mini Daily Dialog Dataset is a smaller, more processed version of the Daily Dialog dataset intended for NLU tasks. The mini version contains 700 dialogs in the train.csv file and 100 dialogs in the test.csv file, along with the corresponding dialog acts.

    About the Dialog Acts

    There are 4 dialog acts in the data, encoded as follows: inform: 1, question: 2, directive: 3, commissive: 4.
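    The integer coding above amounts to a one-line lookup table. A minimal sketch, assuming per-turn integer labels as described (the helper name and example sequence are ours, not part of the dataset):

```python
# Hypothetical sketch: decoding the integer-coded dialog acts listed above.
# The id-to-label mapping comes from the dataset description; the helper
# function is our own illustration, not part of the dataset's tooling.
DIALOG_ACTS = {1: "inform", 2: "question", 3: "directive", 4: "commissive"}

def decode_acts(act_ids):
    """Map a sequence of integer act codes to their string labels."""
    return [DIALOG_ACTS[i] for i in act_ids]

# A four-turn dialog labeled question -> inform -> directive -> commissive
print(decode_acts([2, 1, 3, 4]))  # ['question', 'inform', 'directive', 'commissive']
```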

    Use of the Dataset

    This dataset could be used for class assignments and mini-project demos.

  5. Data from: MMD Dataset

    • paperswithcode.com
    Updated May 26, 2023
    Cite
    Amrita Saha; Mitesh Khapra; Karthik Sankaranarayanan (2023). MMD Dataset [Dataset]. https://paperswithcode.com/dataset/mmd
    Explore at:
    Dataset updated
    May 26, 2023
    Authors
    Amrita Saha; Mitesh Khapra; Karthik Sankaranarayanan
    Description

    The MMD (MultiModal Dialogs) dataset is a dataset for multimodal domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated manually intense iterative process.

  6. prosocial-dialog

    • huggingface.co
    • opendatalab.com
    Updated Feb 22, 2023
    Cite
    Ai2 (2023). prosocial-dialog [Dataset]. https://huggingface.co/datasets/allenai/prosocial-dialog
    Explore at:
    Croissant
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for ProsocialDialog Dataset

      Dataset Summary
    

    ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative… See the full description on the dataset page: https://huggingface.co/datasets/allenai/prosocial-dialog.

  7. Dialogue State Tracking Challenge Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Aug 30, 2021
    Cite
    Jason Williams; Antoine Raux; Deepak Ramachandran; Alan Black (2021). Dialogue State Tracking Challenge Dataset [Dataset]. https://paperswithcode.com/dataset/dialogue-state-tracking-challenge
    Explore at:
    Dataset updated
    Aug 30, 2021
    Authors
    Jason Williams; Antoine Raux; Deepak Ramachandran; Alan Black
    Description

    The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. State tracking, sometimes called belief tracking, refers to accurately estimating the user's goal as a dialog progresses. Accurate state tracking is desirable because it provides robustness to errors in speech recognition, and helps reduce the ambiguity inherent in language within a temporal process like dialog. In these challenges, participants were given labelled corpora of dialogs to develop state tracking algorithms. The trackers were then evaluated on a common set of held-out dialogs, which were released, un-labelled, during a one-week period.

    The corpus was collected using Amazon Mechanical Turk, and consists of dialogs in two domains: restaurant information, and tourist information. Tourist information subsumes restaurant information, and includes bars, cafés etc. as well as multiple new slots. There were two rounds of evaluation using this data:

    DSTC 2 released a large number of training dialogs related to restaurant search. Compared to the first DSTC (which was in the bus timetables domain), DSTC 2 introduced changing user goals and the tracking of 'requested slots', as well as the new restaurants domain. Results from DSTC 2 were presented at SIGDIAL 2014.

    DSTC 3 addressed the problem of adaptation to a new domain: tourist information. DSTC 3 released a small amount of labelled data in the tourist information domain; participants used this data plus the restaurant data from DSTC 2 for training. Dialogs used for training are fully labelled; user transcriptions, user dialog-act semantics and dialog state are all annotated. (This corpus is therefore also suitable for studies in Spoken Language Understanding.)

  8. 830,276 groups - Multi-Round Interpersonal Dialogues Text Data

    • m.nexdata.ai
    • nexdata.ai
    Updated Oct 4, 2023
    Cite
    Nexdata (2023). 830,276 groups - Multi-Round Interpersonal Dialogues Text Data [Dataset]. https://m.nexdata.ai/datasets/llm/150
    Explore at:
    Dataset updated
    Oct 4, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Language, Data size, Applications, Data content, Storage format, Collecting period
    Description

    This database is an interactive text corpus of real users on mobile phones. The database has been desensitized to ensure that it contains no private user information (A and B are codes that replace the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '* * *'). This database can be used for tasks such as natural language understanding.
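    The kind of desensitization described above can be sketched in a few lines. This is purely illustrative: the masking patterns, function, and example turns are our assumptions, not Nexdata's actual pipeline (which, per the description, also masks other sensitive fields):

```python
import re

# Illustrative desensitization sketch: speaker names become A/B codes and
# phone numbers become '***'. The regex and helper are our own assumptions,
# not the provider's actual anonymization pipeline.
PHONE = re.compile(r"\b\d{3}[-\s]?\d{3,4}[-\s]?\d{4}\b")

def desensitize(turns, speakers):
    """Replace speaker names with A/B codes and mask phone numbers."""
    codes = {name: code for name, code in zip(speakers, ("A", "B"))}
    out = []
    for name, text in turns:
        text = PHONE.sub("***", text)          # mask phone-number-like spans
        for real, code in codes.items():       # mask names inside the text too
            text = text.replace(real, code)
        out.append((codes[name], text))
    return out

turns = [("Alice", "Hi Bob, call me at 555-123-4567"), ("Bob", "Sure, Alice!")]
print(desensitize(turns, ["Alice", "Bob"]))
```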

  9. Cornell Movie Dialogs Corpus SQLite

    • kaggle.com
    Updated Feb 13, 2018
    Cite
    Lee Richards (2018). Cornell Movie Dialogs Corpus SQLite [Dataset]. https://www.kaggle.com/mrlarichards/cornell-movie-dialogs-corpus-sqlite/discussion
    Explore at:
    Croissant
    Dataset updated
    Feb 13, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lee Richards
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    I decided, for a little hobby project I'm working on, that I needed a dialog dataset, which Cornell University kindly provided here. However, as a database programmer, I'm used to working with structured data, not parsing and building lists from text-based files, and I decided that life would be much easier for me if I had this data in an SQL-type database, so I wrote me a little python script to chunk the whole thing into SQLite.

    Content

    Original Data Set: https://www.kaggle.com/Cornell-University/movie-dialog-corpus

    As of this writing, the original dataset was updated 7 months ago.

    The data is normalized, with all of the code-breaking artifacts I ran into hand-corrected. If you're familiar with SQL, and have a language/library that supports SQLite, I hope you'll find this fairly easy to work with. All of the data from the original dataset is, I believe, present, though I did remove some redundancies. For example, in the original dataset, movie_lines.tsv lists the character name along with the character id, which is redundant, because the name is listed in the movie_characters.tsv file. While this is a convenience when you have to process the file directly, it can easily be obtained by a JOIN in a structured database. The raw_script_urls are included in the movies table.
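    The JOIN the author alludes to can be sketched against an assumed schema. The table and column names below are guesses based on the original .tsv files (movie_lines, movie_characters); the actual names in this SQLite upload may differ:

```python
import sqlite3

# Assumed minimal schema mirroring the original .tsv files; the real table
# and column names in this particular SQLite upload may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE movie_characters (character_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE movie_lines (line_id TEXT PRIMARY KEY, character_id TEXT, text TEXT);
INSERT INTO movie_characters VALUES ('u0', 'BIANCA'), ('u2', 'CAMERON');
INSERT INTO movie_lines VALUES
  ('L1045', 'u0', 'They do not!'),
  ('L1044', 'u2', 'They do to!');
""")

# Recover the per-line character name via a JOIN instead of storing it twice.
rows = con.execute("""
  SELECT l.line_id, c.name, l.text
  FROM movie_lines AS l
  JOIN movie_characters AS c USING (character_id)
  ORDER BY l.line_id
""").fetchall()
print(rows)
```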

    Acknowledgements

    Thank you to Cornell University for providing the original Corpus. Photo by Tobias Fischer on Unsplash

    Inspiration

    Do let me know if you find this useful. I will probably do similar conversions for other datasets as I need them, and would happily upload them if anyone else finds them useful in that form.

  10. Research data supporting "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz...

    • repository.cam.ac.uk
    zip
    Updated Jul 10, 2019
    Cite
    Budzianowski, Paweł; Eric, Mihail; Goel, Rahul; Paul, Shachi; Sethi, Abhishek; Agarwal, Sanchit; Gao, Shuyang; Hakkani-Tur, Dilek (2019). Research data supporting "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling" [Dataset]. http://doi.org/10.17863/CAM.41572
    Explore at:
    Available download formats: zip (13,794,372 bytes)
    Dataset updated
    Jul 10, 2019
    Dataset provided by
    Apollo
    University of Cambridge
    Authors
    Budzianowski, Paweł; Eric, Mihail; Goel, Rahul; Paul, Shachi; Sethi, Abhishek; Agarwal, Sanchit; Gao, Shuyang; Hakkani-Tur, Dilek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains the following json files:

    1. data.json: the woz dialogue dataset, which contains the conversations between users and wizards, as well as a set of coarse labels for each user turn.
    2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge UK area and a set of attributes.
    3. attraction_db.json: the Cambridge attraction database file, containing attractions in the Cambridge UK area and a set of attributes.
    4. hotel_db.json: the Cambridge hotel database file, containing hotels in the Cambridge UK area and a set of attributes.
    5. train_db.json: the Cambridge train (with artificial connections) database file, containing trains in the Cambridge UK area and a set of attributes.
    6. hospital_db.json: the Cambridge hospital database file, containing information about departments.
    7. police_db.json: the Cambridge police station information.
    8. taxi_db.json: slot-value list for the taxi domain.
    9. valListFile.json: list of dialogues for validation.
    10. testListFile.json: list of dialogues for testing.
    11. system_acts.json: system act annotations.
    12. ontology.json: data-based ontology.
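    A dialogue in data.json can be walked turn by turn. The sketch below uses a mock record; the "goal"/"log"/"metadata" field names follow the public MultiWOZ release and should be treated as assumptions for this particular archive:

```python
import json

# Mock record shaped like a public-MultiWOZ data.json entry. The field names
# ("goal", "log", "metadata") are assumptions based on the public release.
mock = json.loads("""
{"MUL0001.json": {
   "goal": {"restaurant": {"info": {"food": "italian"}}},
   "log": [
     {"text": "I need an italian restaurant.", "metadata": {}},
     {"text": "Prezzo is a nice italian place.",
      "metadata": {"restaurant": {"semi": {"food": "italian"}}}}
   ]}}
""")

turns = []
for dial_id, dial in mock.items():
    # In the Wizard-of-Oz setup, even turns are the user and odd turns the
    # wizard; the wizard turns' "metadata" carries the coarse state labels.
    for i, turn in enumerate(dial["log"]):
        speaker = "user" if i % 2 == 0 else "wizard"
        turns.append((dial_id, speaker, turn["text"]))
print(turns)
```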

    Important note: This dataset was previously entitled 'Research data supporting "MultiWOZ 2.1 - Multi-Domain Dialogue State Corrections and State Tracking Baselines"'. The change to the current title of this dataset was made at the request of the authors in July 2019.

  11. Data from: DialoGLUE: A Natural Language Understanding Benchmark for...

    • registry.opendata.aws
    Updated Oct 11, 2020
    Cite
    Amazon (2020). DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [Dataset]. https://registry.opendata.aws/dialoglue/
    Explore at:
    Dataset updated
    Oct 11, 2020
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints described in the linked paper.

  12. 90,000 sets – Multi-domain Customer Service Dialogue Text Data

    • m.nexdata.ai
    Updated Nov 21, 2023
    Cite
    Nexdata (2023). 90,000 sets – Multi-domain Customer Service Dialogue Text Data [Dataset]. https://m.nexdata.ai/datasets/llm/1396
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Language, Data Size, Data content, Storage Format, Data collection method
    Description

    Multi-domain Customer Service Dialogue Text Data, 90,000 sets in total, spanning multiple domains including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set of data consists of single or multi-turn conversations. This dataset can be used for tasks such as LLM and ChatGPT-style model training.

  13. Z

    Data from: esCorpiusDialog: A Large-Scale Multilingual Dialogue Dataset in...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    Cite
    Gutiérrez-Hernando, Javier (2025). esCorpiusDialog: A Large-Scale Multilingual Dialogue Dataset in Spanish, Catalan, Basque, and Galician [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15017668
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Griol, David
    Callejas, Zoraida
    Pérez-Fernández, David
    Gutiérrez-Fandiño, Asier
    Kharitonova, Ksenia
    Gutiérrez-Hernando, Javier
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In response to the growing need for comprehensive conversational datasets, we introduce esCorpiusDialog, a large-scale, diverse dialogue corpus designed to facilitate the development of conversational models in Spanish, Catalan, Basque, and Galician. Addressing the critical shortage of high-quality conversational data for these languages, esCorpiusDialog encompasses an extensive 15.1 GiB corpus containing 26,900,187 dialogues with a total of 116,524,110 conversational turns and 1,382,496,364 tokens. On average, each dialogue includes 4.3 turns, enabling effective modeling of multi-turn interactions.
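    The quoted per-dialogue average follows directly from the corpus totals; a quick sanity check (the tokens-per-turn figure is derived by us, not quoted in the description):

```python
# Sanity-check the per-dialogue average quoted above from the corpus totals.
dialogues = 26_900_187
turns = 116_524_110
tokens = 1_382_496_364

print(round(turns / dialogues, 1))  # 4.3 turns per dialogue, as stated
print(round(tokens / turns))        # ~12 tokens per turn (derived, not quoted)
```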

    esCorpiusDialog aggregates data from multiple rich sources: movie subtitles (OpenSubtitles), newsgroups (Usenet), online forums (Mediavida, Reddit), and literature (Project Gutenberg). Specifically, the dataset comprises approximately 26.6 million dialogues in Spanish, 116 thousand in Basque, over 92 thousand in Catalan, and 63 thousand in Galician, making it the most extensive multilingual conversational dataset currently available for these languages.

    The dataset has undergone meticulous processing to clearly define conversational turns and accurately segment dialogues, ensuring its suitability for training robust, open-domain conversational systems. With its breadth of topics and varied dialogue styles, esCorpiusDialog represents an invaluable resource for researchers and practitioners aiming to enhance the dialogue capabilities and generalization of fine-tuned large language models (LLMs) across diverse conversational applications.

  14. 1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone...

    • nexdata.ai
    • m.nexdata.ai
    Updated Nov 8, 2023
    Cite
    Nexdata (2023). 1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1004
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Area covered
    United States
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    English (the United States) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. Our dataset was collected from an extensive and geographically diverse pool of speakers (1,416 Americans), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.

  15. Data from: Dialogs Re-Enacted Across Languages

    • abacus.library.ubc.ca
    iso, txt
    Updated Sep 17, 2024
    Cite
    Abacus Data Network (2024). Dialogs Re-Enacted Across Languages [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl%3A11272.1%2FAB2%2FXRMWND
    Explore at:
    Available download formats: iso (938,545,152 bytes), txt (1,308 bytes)
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    Abacus Data Network
    Description

    Abstract

    Introduction: Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains approximately 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers; specifically, short fragments extracted from spontaneous conversations and close re-enactments in the other language by the original speakers, for 3816 pairs of matching utterances.

    Data: Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso, which is located on the US-Mexico border. All participants were bilingual speakers of General American English and of Mexico-Texas Border Spanish. Their self-described dialects for English were El Paso and, for Spanish, mostly "El Paso/Juarez." Each speaker pair had a ten-minute conversation in one language. From these conversations, various fragments were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. After recording, fragments were mapped to the translated re-enactments using ELAN, an annotation tool for audio and video recordings. Metadata about conversations, participants, re-enactments and utterances are included in this release. The audio data is presented as flac-compressed, single-channel, 16 kHz, 16-bit linear PCM.

  16. DialogSum Dataset

    • paperswithcode.com
    Updated Dec 18, 2024
    Cite
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang (2024). DialogSum Dataset [Dataset]. https://paperswithcode.com/dataset/dialogsum
    Explore at:
    Dataset updated
    Dec 18, 2024
    Authors
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang
    Description

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

    This work is accepted by ACL findings 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.

    If you want to use our dataset, please cite our paper.

    Dialogue Data

    We collect dialogue data for DialogSum from three public dialogue corpora, namely Dailydialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends, colleagues, and between service providers and customers.

    Compared with previous datasets, dialogues from DialogSum have distinct characteristics:

    • Under rich real-life scenarios, including more diverse task-oriented scenarios;
    • Have clear communication patterns and intents, which is valuable to serve as summarization sources;
    • Have a reasonable length, which suits the purpose of automatic summarization.

    Summaries

    We ask annotators to summarize each dialogue based on the following criteria:

    • Convey the most salient information;
    • Be brief;
    • Preserve important named entities within the conversation;
    • Be written from an observer perspective;
    • Be written in formal language.

    Topics

    In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.

  17. meddialog

    • huggingface.co
    • paperswithcode.com
    Updated Apr 22, 2023
    Cite
    BigScience Biomedical Datasets (2023). meddialog [Dataset]. https://huggingface.co/datasets/bigbio/meddialog
    Explore at:
    Dataset updated
    Apr 22, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Unknown: https://choosealicense.com/licenses/unknown/

    Description

    The MedDialog dataset (English) contains conversations (in English) between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.

  18. AI Dialog Software Application

    • data.mendeley.com
    Updated Aug 31, 2023
    Cite
    Francis R Belch (2023). AI Dialog Software Application [Dataset]. http://doi.org/10.17632/zhv2wfnprv.4
    Explore at:
    Dataset updated
    Aug 31, 2023
    Authors
    Francis R Belch
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description


    The experimental AI Dialog software application is able to build a mental model from a plain English dialogue using speech or textual input. Subsequently, comprehension and logical capability may be tested using plain English queries.

    AI Dialog is used to experimentally test the validity and utility of a novel software application design, as a path to Artificial General Intelligence, sometimes referred to as Strong Artificial Intelligence or Conversational Artificial Intelligence.

    The theory behind AI Dialog is fully described in the book: Belch, Francis R. (2023) Artificial General Intelligence – A New Approach, available from: Amazon.com (from September 2023).

    A short demonstration of the AI Dialog software application is available from YouTube®, and entitled:

    Artificial General Intelligence – A New Approach – AI Dialog Demonstration.

    There are also two YouTube® lectures each of about 1 hour duration describing the radical new approach to Artificial General Intelligence used to implement the AI Dialog software application. These are:

    Artificial General Intelligence – A New Approach – Part I Sentence Semantics.

    Artificial General Intelligence – A New Approach – Part II Dialogues and Mental Models.

    This is a free download of the executable of the AI Dialog Software Application Version 4.1 Alpha release. This version supersedes Version 3.2 to allow speech as well as textual user input.

    The AI Dialog software is protected by international copyright, but is made available to use for non-commercial personal study purposes.

    The application will run on Windows 10® PC, Laptop and Tablet systems, and requires about 1 MB of memory. The download file is zipped and needs to be unzipped. After this, the content of the folder AI Dialog 4.1 Alpha Release is:

    • Application Files (Folder)
    • Documentation (Folder)
    • NLP2016Autumn (Manifest)
    • Setup (Application)

    In the Documentation folder are two PDF files:

    • Copy Of Tuition Lessons (PDF)
    • Specification (PDF)

    The first is a hard copy of the tuition lessons. The second is a specification of a subset of English for use with the AI Dialog system. However, there is no need to consult either of these initially, as AI Dialog incorporates a quick start tuition module.

    To install AI Dialog, double click the Setup file. This starts AI Dialog immediately after installation, but places an application icon on the Windows 10® Start list for restarting later. After AI Dialog starts, just follow the speech or pop-up message instructions, which lead to quick start interactive tuition modules, fully describing how to use the application.

  19. Friends TV Show Dialog Sequences

    • kaggle.com
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Friends TV Show Dialog Sequences [Dataset]. https://www.kaggle.com/datasets/thedevastator/friends-tv-show-dialog-sequences/code
    Explore at:
    Croissant
    Dataset updated
    Dec 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Friends TV Show Dialog Sequences

    By Suriyadeepan R [source]

    About this dataset

    The dataset sequences.csv provides a comprehensive collection of dialog sequences retrieved from the popular sitcom Friends. This dataset has been curated to offer researchers, data analysts, and machine learning enthusiasts an extensive resource for studying linguistic patterns and analyzing conversational structures in a highly regarded television series.

    Each row of the dataset corresponds to a specific sequence of dialogues exchanged between the characters in the Friends TV show. The sequences are arranged consecutively, ensuring continuity within each set of conversations. This dataset captures moments encompassing different scenarios, emotions, and relationships depicted throughout all ten seasons of the series.

    By exploring this dataset, individuals can gain insights into aspects such as character interactions, humor, socio-cultural references, sentimental expressions, and the conflict-resolution approaches used by the characters. Additionally, this resource facilitates language-modeling tasks and offers opportunities for sentiment analysis or dialogue generation using natural language processing techniques.

    The original sources of these dialog transcripts have been meticulously collated to ensure accuracy and fidelity to the original aired episodes. Researchers interested in studying language use across different contexts can utilize this dataset as a valuable tool for training models or devising creative algorithms based on real-life conversations between fictional characters.

    Please note that while every effort has been made to capture these sequences accurately and consistently across all ten seasons of the Friends TV show, inadvertent discrepancies may still exist due to factors such as dialogue delivery speed or overlapping speech.

    How to use the dataset

    Dataset Overview

    The dataset consists of a single file named sequences.csv. It contains multiple columns that provide different information about the dialogues in each sequence. The columns available in the dataset are as follows:

    • Sequence ID: A unique identifier for each dialogue sequence.
    • Season: The season number in which the dialogue sequence belongs.
    • Episode: The episode number within the season where the dialogue sequence appears.
    • Sequence Index: The index of each dialogue within a particular sequence.
    • Character: The name of the character speaking in a specific line of dialogue.
    • Dialogue Text: The actual spoken words by a character.

    Please note that there are no date-related columns included in this dataset.
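    As a minimal sketch of loading the file with pandas, assuming the column headers match the names listed above (the real sequences.csv may differ):

```python
import io
import pandas as pd

# In practice you would read the real file: df = pd.read_csv("sequences.csv").
# Here a tiny inline sample stands in, using the columns described above
# (the rows are hypothetical, not actual dataset content).
sample_csv = io.StringIO(
    "Sequence ID,Season,Episode,Sequence Index,Character,Dialogue Text\n"
    "1,1,1,0,Monica,There's nothing to tell!\n"
    "1,1,1,1,Joey,He's just some guy I work with!\n"
)
df = pd.read_csv(sample_csv)

print(list(df.columns))          # the six columns described above
print(df["Character"].tolist())  # speakers in sequence order
```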

    Analyzing and Exploring Data

    Once you have loaded or imported the sequences.csv file into your preferred data analysis tool, you can begin exploring and analyzing its contents using various techniques:

    • Descriptive Statistics: You can compute basic descriptive statistics on different columns, such as counting unique values, calculating frequencies, or finding patterns across seasons or episodes.
    • Character Analysis: By examining data related to characters' names and dialogues, you can analyze their speaking patterns, most frequent speakers, word count distribution per character, etc.
    • Episode Analysis: You may explore specific episodes by filtering data based on season and episode numbers to examine particular events or recurring themes within them.
    • Dialogue Sentiment Analysis: Applying sentiment analysis techniques to the text content might reveal interesting insights about emotions expressed by different characters across seasons or episodes.

    Be sure to use appropriate data visualization techniques to present your findings, such as bar charts, line plots, word clouds, or heatmaps.
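    For instance, a basic character analysis (speaker frequency and average words per line) could be sketched as follows; the rows here are hypothetical stand-ins for sequences.csv content:

```python
import pandas as pd

# Hypothetical rows standing in for the Character and Dialogue Text columns.
df = pd.DataFrame({
    "Character": ["Monica", "Joey", "Monica", "Chandler"],
    "Dialogue Text": [
        "There's nothing to tell!",
        "You're going out with the guy!",
        "Okay everybody relax.",
        "Could I be wearing any more clothes?",
    ],
})

# Most frequent speakers.
line_counts = df["Character"].value_counts()
print(line_counts)

# Average word count per character.
df["word_count"] = df["Dialogue Text"].str.split().str.len()
avg_words = df.groupby("Character")["word_count"].mean()
print(avg_words)
```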

    Potential Use Cases

    • Natural Language Processing (NLP) and Sentiment Analysis: Analyzing the sentiment of characters' dialogues over time or identifying specific emotions expressed during crucial moments in the show.
    • Character Interaction Analysis: Identifying character pairs who frequently engage in conversations or analyzing how relationships between characters evolve throughout different seasons.
    • Dialogue Generation Models: Training language models to generate dialogue in the style of the show's characters.

    Research Ideas

    • Sentiment Analysis: The dataset can be used to analyze the sentiment of each dialogue sequence in order to understand the overall mood or tone of specific episodes or characters.
    • Dialogue Generation: By training a language model...
  20. dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

      Dataset Description

      Links

    Homepage: https://aclanthology.org/2021.findings-acl.449
    Repository: https://github.com/cylnlp/dialogsum
    Paper: https://aclanthology.org/2021.findings-acl.449
    Point of Contact: https://huggingface.co/knkarthick

      Dataset Summary

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
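    A hedged sketch of working with a DialogSum-style record follows; the load_dataset call is the standard Hugging Face datasets API, and the sample record mirrors the dialogue/summary/topic fields named on the dataset card (its values are illustrative, not actual corpus content):

```python
# In practice, load the corpus with the Hugging Face `datasets` library
# (network access required):
#
#   from datasets import load_dataset
#   ds = load_dataset("knkarthick/dialogsum")
#
# The record below is a hypothetical stand-in following the dataset's
# field layout, with DialogSum's #Person1#/#Person2# speaker tags.
sample = {
    "id": "train_0",
    "dialogue": "#Person1#: Hi, how are you?\n#Person2#: Fine, thanks.",
    "summary": "#Person1# greets #Person2# and asks how they are.",
    "topic": "greeting",
}

# A typical preprocessing step: count speaker turns in a dialogue,
# where each newline-separated line is one turn.
turns = len(sample["dialogue"].split("\n"))
print(turns)  # 2
```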
