100+ datasets found
  1. 478 Hours - Spanish Conversational Speech Data by Mobile Phone

    • nexdata.ai
    • m.nexdata.ai
    Updated Dec 5, 2023
    Cite
    Nexdata (2023). 478 Hours - Spanish Conversational Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1147
    Dataset updated
    Dec 5, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Spanish (Spain) spontaneous-dialogue smartphone speech dataset, collected from dialogues on given topics covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (596 native speakers), which helps model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  2. General domain Human-Human conversation chats in Bahasa

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Bahasa [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native Bahasa speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  3. General domain Human-Human conversation chats in German

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in German [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/german-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native German speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  4. General domain Human-Human conversation chats in Swedish

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Swedish [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/swedish-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native Swedish speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  5. General domain Human-Human conversation chats in Urdu

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Urdu [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/urdu-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native Urdu speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  6. English Conversation and Monologue speech dataset

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    English (America) real-world casual conversation and monologue speech dataset covering self-media, conversation, live, lecture, variety-show, etc., mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, which helps model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

    Format

    16kHz, 16 bit, wav, mono channel;

    Content category

    Including self-media, conversation, live, lecture, variety-show, etc;

    Recording environment

    Low background noise;

    Country

    America(USA);

    Language(Region) Code

    en-US;

    Language

    English;

    Features of annotation

    Transcription text, timestamp, speaker ID, gender.

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%

    Licensing Information

    Commercial License
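
    The Format field above (16kHz, 16 bit, wav, mono channel) can be checked against a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module; the function names are hypothetical, and the silent test clip is generated in memory only to exercise the check, so none of this comes from the dataset provider.

```python
import io
import wave

# Expected values taken from the listing's "Format" field.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit
EXPECTED_CHANNELS = 1            # mono


def matches_listed_format(wav_source) -> bool:
    """Return True if the WAV header reports 16 kHz, 16-bit, mono audio.

    `wav_source` may be a file path string or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def make_silent_wav(rate_hz: int, channels: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short in-memory WAV of 16-bit silence (illustration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        # One zero-valued 16-bit sample per channel per frame.
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds) * channels)
    buffer.seek(0)
    return buffer
```

    In practice `matches_listed_format` would be called on each file path in the download; a file that fails the check would need resampling or downmixing before it matches the listed format.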

  7. Mandarin Mobile Telephony Conversational Speech Collection Data - 2,657...

    • catalog.elra.info
    Updated Oct 6, 2022
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Mandarin Mobile Telephony Conversational Speech Collection Data - 2,657 Hours [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0421/
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    4,491 speakers participated in the recording and conducted face-to-face communication in a natural way. No topics are specified, so the dialogues cover a wide range of fields; the speech is natural and fluent, in line with actual dialogue scenes. Text is transcribed manually, with high accuracy.

    Format: 16kHz, 16bit, uncompressed wav, mono channel
    Environments: quiet indoor environment, without echo
    Recording content: no topic is specified, and the speakers make dialogue while the recording is performed
    Demographics: 4,491 speakers, 63% of which are female
    Annotations: transcription text, speaker identification, and gender
    Device: Android mobile phone, iPhone
    Language: Mandarin
    Applications: speech recognition; voiceprint recognition
    Accuracy rate: 97%

  8. 1,503 Hours - Arabic(UAE) Real-world Casual Conversation and Monologue...

    • nexdata.ai
    • m.nexdata.ai
    Updated Jun 28, 2025
    Cite
    Nexdata (2025). 1,503 Hours - Arabic(UAE) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1710
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    United Arab Emirates
    Variables measured
    Format, Accuracy, Language, Annotation, Application scenarios
    Description

    Arabic (UAE) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, which helps model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  9. Implicature dataset

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Elizabeth Jasmi George (2023). Implicature dataset [Dataset]. http://doi.org/10.6084/m9.figshare.10315505.v7
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elizabeth Jasmi George
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of conversational implicatures of utterances. A conversational implicature is the meaning an utterance conveys beyond what it literally states. The data consist of 1001 utterances that come as responses in a specific context, together with their implicatures. The written representations of the utterances were collected manually by scraping and transcribing from relevant sources between August 2019 and August 2020. The sources of the dialogues include TOEFL listening comprehension short conversations, movie dialogues from IMSDb, and websites explaining idioms, similes, metaphors, and hyperboles. The implicatures are annotated manually.

    Formatting

    The dataset file (Conversational Implicature Dataset 1-1001 - implicature data 1-1001.csv) is a comma-separated values file. Columns that contain commas (,) are escaped using double quotes ("). The dataset is also available as an Excel sheet (Conversational Implicature Dataset 1-1001.xlsx).

    Content

    The dataset is available in Conversational Implicature Dataset 1-1001 - implicature data 1-1001.csv. Each entry in the dataset consists of a context utterance, a response utterance, and an implicature.

    Context Utterance

    The written representation of an utterance that serves as the context in which the response utterance can implicate a meaning different from its literal meaning.

    Response Utterance

    The written representation of an utterance whose meaning differs from the literal meaning of the sentences used in it.

    Implicature

    The implicated meaning of the response utterance.
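
    The quoting rule described above (fields containing commas are wrapped in double quotes) is what Python's standard `csv` module handles by default. The sketch below parses a tiny in-memory sample; the header names and both sample rows are invented for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical two-row sample. The header names and the utterances are
# assumptions made up for this sketch, not rows from the real dataset.
SAMPLE_CSV = '''Context utterance,Response utterance,Implicature
"Are you coming to the party, or staying in?","I have to work, sadly.",The speaker is not coming to the party.
Did you like the movie?,"Well, the popcorn was good.",The speaker did not like the movie.
'''


def read_entries(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per entry, honoring double-quoted fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


entries = read_entries(SAMPLE_CSV)
for entry in entries:
    # Commas inside quoted fields stay part of the field value.
    print(entry["Response utterance"], "->", entry["Implicature"])
```

    For the real file, the same `csv.DictReader` call would be given an open file handle instead of the in-memory sample, and the keys would be whatever header names the file actually uses.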

  10. General domain Human-Human conversation chats in Spanish

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Spanish [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between two native Spanish speakers in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  11. 330 Hours - Dari Conversational Speech Data by Telephone

    • m.nexdata.ai
    • nexdata.ai
    Updated Feb 6, 2024
    Cite
    Nexdata (2024). 330 Hours - Dari Conversational Speech Data by Telephone [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1240?source=Github
    Dataset updated
    Feb 6, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    Dari (Afghanistan) spontaneous-dialogue telephony speech dataset, collected from dialogues on given topics covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (452 native speakers), which helps model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  12. Conversational Skills in Language Learning Games: A Speech Recognition...

    • figshare.com
    • data.mendeley.com
    bin
    Updated Dec 8, 2023
    Cite
    Murat Kuvvetli (2023). Conversational Skills in Language Learning Games: A Speech Recognition Technology Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24769470.v1
    Available download formats: bin
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    figshare
    Authors
    Murat Kuvvetli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "SpeechRec_LanguageLearning_ConversationalSkills" dataset is a collection of data generated in a game-based language learning environment, aiming to explore the impact of Speech Recognition Technology (SRT) on the development of conversational skills. The dataset encompasses speaking test results conducted within the context of language learning games utilizing SRT.

  13. Conversational AI in Healthcare Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 5, 2025
    Cite
    Archive Market Research (2025). Conversational AI in Healthcare Report [Dataset]. https://www.archivemarketresearch.com/reports/conversational-ai-in-healthcare-12100
    Available download formats: doc, pdf, ppt
    Dataset updated
    Feb 5, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Market Analysis

    The market for Conversational AI in Healthcare is projected to reach a value of XX million by 2033, growing at a CAGR of 5%. The growth is driven by the increasing adoption of AI in healthcare, the rising need for efficient patient care, and the growing prevalence of chronic diseases. Key market trends include the integration of natural language processing (NLP) and machine learning (ML) for improved communication and analysis, and the emergence of cloud-based solutions for cost-effective scalability. The major segments of the market are NLP- and ML-based solutions, with applications in medical record mining, medical imaging analysis, medicine development, and emergency assistance.

    Value Chain Analysis

    The Conversational AI in Healthcare market value chain consists of several players, including hardware manufacturers, software developers, solution providers, and healthcare providers. Hardware manufacturers provide the devices and sensors used for data collection and processing. Software developers create the AI algorithms and software, enabling healthcare providers to interact with patients through conversational interfaces. Solution providers integrate hardware and software to provide end-to-end solutions. Healthcare providers, including hospitals, clinics, and nursing homes, are the end users who use Conversational AI solutions to enhance patient care. Key market players include Google Health, IBM Watson Health, Oncora Medical, and CloudMedX Health.

  14. Ethical Dialogue Dataset

    • opendatabay.com
    Updated Jul 4, 2025
    Cite
    Datasimple (2025). Ethical Dialogue Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/92d27a42-d8ec-46f5-acba-f415f82cdf52
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    ProsocialDialog is a large-scale, multi-turn English dialogue dataset designed to teach conversational agents how to respond to problematic content in line with social norms. It addresses a variety of unethical, biased, toxic, and generally problematic situations. The dataset is notable for its focus on encouraging prosocial behaviour, which is guided by commonsense social rules, referred to as Rules-of-Thumb (RoTs). Developed through a human-AI collaborative framework, the dataset consists of 58,000 dialogues, comprising 331,000 utterances, 160,000 unique RoTs, and 497,000 dialogue safety labels, each accompanied by free-form rationales. The test.csv file within the ProsocialDialog dataset contains data specifically for evaluating the accuracy of a model in predicting conversation safety.

    Columns

    The dataset includes the following columns:

    • context: The context of the conversation. (String)
    • response: The response to the conversation. (String)
    • rots: Rules of thumb associated with the conversation. (String)
    • safety_label: The safety label associated with the conversation. (String)
    • safety_annotations: Annotations associated with the conversation. (String)
    • safety_annotation_reasons: Reasons for the safety annotations. (String)
    • source: The source of the conversation. (String)
    • etc: Any additional information associated with the conversation. (String)
    • dialogue_id: Unique identifier for each dialogue.
    • response_id: Unique identifier for each response.
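
    Because each row carries a dialogue_id and a response_id, rows can be regrouped into multi-turn dialogues. The following is a rough sketch with Python's standard `csv` module, using only a subset of the column names listed above; the sample rows, their placeholder label values, and the function name are all invented for illustration and do not reflect the real file's contents.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using a subset of the listed column names. The text and
# the label values are placeholders, not values from the real dataset.
SAMPLE_CSV = """context,response,safety_label,dialogue_id,response_id
first turn,first reply,placeholder_label_a,0,0
second turn,second reply,placeholder_label_b,0,1
another opening,another reply,placeholder_label_a,1,0
"""


def group_by_dialogue(csv_text: str) -> dict[str, list[dict[str, str]]]:
    """Group CSV rows by their dialogue_id, preserving file order."""
    dialogues = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        dialogues[row["dialogue_id"]].append(row)
    return dict(dialogues)


dialogues = group_by_dialogue(SAMPLE_CSV)
for dialogue_id, rows in dialogues.items():
    print(f"dialogue {dialogue_id}: {len(rows)} row(s)")
```

    Note that `csv.DictReader` yields every field as a string, so the IDs are string keys here; converting them to integers is a separate, optional step.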

    Distribution

    The dataset is typically provided in a CSV file format, such as test.csv. It contains 58,000 dialogues, encompassing 331,000 utterances. There are 24,972 unique dialogue IDs and 24,903 unique response IDs. The dataset includes 160,000 unique Rules-of-Thumb (RoTs) and 497,000 dialogue safety labels. Specific numbers for rows or records beyond these counts are not provided in the sources.

    Usage

    This dataset is ideally suited for several applications:

    • Designing Conversational Agents: It can be used to build Natural Language Processing (NLP) models capable of recognising and classifying problematic content. The safety labels, rationales, and RoTs can train conversational agents to respond in socially acceptable ways.
    • Benchmark Systems: ProsocialDialog serves as an effective benchmark for evaluating the performance of existing conversation datasets in identifying, responding to, and preventing problematic content interactions.
    • Automated Moderation: The dialogue safety labels and their associated free-form rationales are valuable for technology platforms implementing automated moderation tasks, such as flagging or banning offensive messages or users.

    Coverage

    The ProsocialDialog dataset is in English and has a global regional coverage. It addresses general conversational scenarios involving social norms and problematic content, but specific demographic scope details or the precise time range of data collection are not explicitly outlined in the sources. The dataset was listed on 11/06/2025.

    License

    CC0

    Who Can Use It

    This dataset is beneficial for a range of users, including:

    • Researchers and Developers in AI and Machine Learning: Particularly those focused on Natural Language Processing (NLP) and building sophisticated conversational AI systems.
    • Organisations and Platforms: Especially those in need of automated moderation tools or aiming to ensure their conversational agents adhere to social norms and promote prosocial behaviour.
    • Academics and Students: Engaged in studying dialogue safety, social psychology, or ethical AI, who can explore the safety labels, annotations, RoTs, and data sources to gain deeper insights into human conversation dynamics.

    Dataset Name Suggestions

    • ProsocialDialog - Problematic Content Dialogue
    • Conversational Safety Norms
    • Ethical Dialogue Dataset
    • Social Norms AI Conversations
    • Harmful Content Dialogue Dataset

    Attributes

    Original Data Source: ProsocialDialog - Problematic Content Dialogue

  15. StudyAbroadGPT Dataset Dataset

    • paperswithcode.com
    Updated Apr 21, 2025
    Cite
    Md Millat; Md Motiur (2025). StudyAbroadGPT Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/studyabroadgpt-dataset
    Dataset updated
    Apr 21, 2025
    Authors
    Md Millat; Md Motiur
    Description

    The StudyAbroadGPT-Dataset is a collection of conversational data focused on university application requirements for various programs, including MBA, MS in Computer Science, Data Science, and Bachelor of Medicine. The dataset includes interactions between humans asking questions about application processes (e.g., "How do I write a strong SOP for MS in Data Science at MIT?") and an assistant providing detailed responses. Covering prestigious institutions such as MIT, Oxford, Cambridge, and Stanford, this dataset serves as a valuable resource for understanding the informational needs of prospective students applying to study abroad.

    Dataset Structure

    The dataset is organized as a list of JSON objects, where each object represents a single conversation. Each conversation contains an array of turns, structured as follows:

    • "from": Specifies the speaker, either "human" or "assistant".
    • "value": Contains the text of the query or response.

    Example

    {
      "conversations": [
        {"from": "human", "value": "What documents do I need for applying to MBA?"},
        {"from": "assistant", "value": "## Introduction To embark on your MBA journey, it's crucial to gather the necessary documents..."}
      ]
    }
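
    The structure described above (a "conversations" array whose turns each have a "from" of "human" or "assistant" and a text "value") can be checked mechanically before training. This is a minimal sketch under that stated structure; the function name and the error-message wording are hypothetical, and the sample record reuses the example from this listing.

```python
import json

ALLOWED_SPEAKERS = {"human", "assistant"}


def validate_conversation(record: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the record is valid."""
    turns = record.get("conversations")
    if not isinstance(turns, list):
        return ["'conversations' is missing or not a list"]

    problems = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: not a JSON object")
            continue
        if turn.get("from") not in ALLOWED_SPEAKERS:
            problems.append(f"turn {index}: unexpected 'from' value {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {index}: 'value' is missing or not a string")
    return problems


# Sample record copied from the listing's example (response text truncated there).
sample = json.loads("""
{
  "conversations": [
    {"from": "human", "value": "What documents do I need for applying to MBA?"},
    {"from": "assistant", "value": "## Introduction To embark on your MBA journey..."}
  ]
}
""")

print(validate_conversation(sample))
```

    Returning a list of problems rather than raising on the first one makes it easier to report every malformed turn in a record in a single pass.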

    Data Collection

    This dataset was synthetically generated to simulate realistic conversations about study abroad applications. It is designed to reflect common queries and provide detailed, informative responses related to university application requirements.

    Preprocessing

    The dataset is provided in its raw form, consisting of unprocessed conversational text. Depending on their specific use case, such as natural language processing (NLP) tasks, users may need to perform additional preprocessing steps like tokenization or stopword removal.
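    As an illustration of the preprocessing steps mentioned here, the sketch below tokenizes a query and removes stopwords using only the standard library. The STOPWORDS set is a deliberately tiny, hypothetical list, and the regex tokenizer is an assumption rather than anything the dataset prescribes; note that it splits contractions such as "it's" into two tokens.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "do", "i", "for", "to", "what", "is", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token that appears in STOPWORDS, keeping the original order."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = tokenize("What documents do I need for applying to MBA?")
print(remove_stopwords(tokens))  # -> ['documents', 'need', 'applying', 'mba']
```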

    Usage

    You can load the dataset using the Hugging Face Datasets library with the following code:

    from datasets import load_dataset
    
    dataset = load_dataset("millat/StudyAbroadGPT-Dataset")
    

    Accessing Conversations

    To access and iterate through the conversations, use this example:

    for conversation in dataset["train"]:
        for turn in conversation["conversations"]:
            print(f"{turn['from']}: {turn['value']}")
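    For chatbot training it is often convenient to turn each conversation into (question, answer) pairs. The following is a sketch under the assumption that records follow the "from"/"value" layout described earlier; the helper name to_pairs is hypothetical, and the example record uses placeholder assistant text so the snippet runs without downloading anything.

```python
def to_pairs(conversation):
    """Pair each human turn with the assistant turn that immediately follows it."""
    turns = conversation["conversations"]
    pairs = []
    for current, following in zip(turns, turns[1:]):
        if current["from"] == "human" and following["from"] == "assistant":
            pairs.append((current["value"], following["value"]))
    return pairs


# Placeholder record in the documented shape; the assistant text is not real data.
example = {
    "conversations": [
        {"from": "human", "value": "What are the GMAT requirements for Oxford?"},
        {"from": "assistant", "value": "(response text)"},
    ]
}
print(to_pairs(example))
```

    Human turns without a following assistant turn are skipped, so an unanswered trailing question does not produce a pair.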

    Examples

    Here’s a sample conversation excerpt from the dataset:

    {
      "conversations": [
        {"from": "human", "value": "What are the GMAT requirements for Oxford?"},
        {"from": "assistant", "value": "## Introduction Oxford University, one of the world's most prestigious institutions, requires..."}
      ]
    }

    Use Cases

    Training Conversational Agents: Build chatbots to assist with university application queries.
    Analyzing Trends: Study application requirements across different programs and institutions.
    NLP Development: Create natural language understanding models tailored to educational domains.

    License

    This dataset is licensed under the MIT License.

    Citation

    If you use this dataset in your research, please cite it as follows:

    @misc{StudyAbroadGPT-Dataset,
      author = {MD MILLAT HOSEN},
      title = {StudyAbroadGPT-Dataset},
      year = {2025},
      publisher = {Hugging Face},
      howpublished = {\url{https://huggingface.co/datasets/millat/StudyAbroadGPT-Dataset}}
    }

  16. Data from: Spiritual conversation model for patients and loved ones in...

    • ssh.datastations.nl
    Updated Mar 11, 2024
    Cite
    Marc Haufe; Marc Haufe (2024). Spiritual conversation model for patients and loved ones in palliative care: a validation study [Dataset]. http://doi.org/10.17026/SS/K0NGCL
    Explore at:
    Available download formats: csv(675), pdf(57645), text/comma-separated-values(6011), pdf(88279), application/x-spss-syntax(26964), ods(36881), pdf(81688), pdf(38601), pdf(61582), text/comma-separated-values(112710)
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    Marc Haufe; Marc Haufe
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is for follow-up research purposes. It consists of all key documents from data collection to data analysis of the research project. See the readme document for the relevant procedure and document description.

  17. Data from: Conversational Transcripts of Truthful and Deceptive Speech...

    • icpsr.umich.edu
    Updated Aug 29, 2018
    Cite
    Duran, Nicholas; Paxton, Alexandra; Fusaroli, Riccardo (2018). Conversational Transcripts of Truthful and Deceptive Speech Involving Controversial Topics, Central California, 2012 [Dataset]. http://doi.org/10.3886/ICPSR37124.v1
    Explore at:
    Dataset updated
    Aug 29, 2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Duran, Nicholas; Paxton, Alexandra; Fusaroli, Riccardo
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/37124/terms

    Time period covered
    May 2012 - Nov 2012
    Area covered
    California, United States, California
    Description

    This study investigated the presence of dynamic patterns of interpersonal coordination in extended deceptive conversations across multi-modal channels of behavior. Using a "devil's advocate" paradigm, the researchers experimentally elicited deception and truth across controversial social and political topics in which conversational partners either agreed or disagreed, and where one partner was surreptitiously asked to argue an opinion opposite of what he or she really believed. The researchers focused on interpersonal coordination as an emergent behavioral signal that captured inter-dependencies between conversational partners, both as the coupling of head movements over the span of milliseconds, measured via a windowed lagged cross correlation (WLCC) technique, and more global temporal dependencies across speech rate, using cross recurrence quantification analysis (CRQA). Another focus that was considered was how interpersonal coordination might be shaped by strategic, adaptive conversational goals associated with deception. This collection includes both qualitative transcripts and a quantitative dataset including respondent demographics (including sex, age, and ethnicity). The qualitative dataset consists of 94 written transcripts of audio-recorded conversations, lasting eight minutes each in length. The quantitative dataset includes 5 variables for 102 cases.
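    To make the windowed lagged cross correlation (WLCC) idea concrete, here is a minimal pure-Python sketch. It is not the researchers' implementation: the function names, the window length, the lag range, the step size, and the synthetic sine signals are all assumptions chosen for illustration. For each window of the first signal, it computes the Pearson correlation with the second signal shifted by every lag in a fixed range.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def wlcc(x, y, window, max_lag, step):
    """For each window of x, correlate it with y shifted by every lag in [-max_lag, max_lag]."""
    rows = []
    for start in range(max_lag, len(x) - window - max_lag + 1, step):
        segment_x = x[start:start + window]
        row = {}
        for lag in range(-max_lag, max_lag + 1):
            segment_y = y[start + lag:start + lag + window]
            row[lag] = pearson(segment_x, segment_y)
        rows.append(row)
    return rows


# Synthetic signals (assumed for the demo): y repeats x two samples later.
x = [math.sin(0.3 * i) for i in range(200)]
y = [0.0, 0.0] + x[:-2]

rows = wlcc(x, y, window=30, max_lag=5, step=10)
print(max(rows[0], key=rows[0].get))  # lag with the highest correlation -> 2
```

    In this toy setup the peak sits at lag 2 in every window because the second signal is an exact delayed copy of the first; with real movement data the peak lag and its strength would vary from window to window.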

  18. AI and Ancient Languages Student Survey - 2023-2024 Comparative Data

    • figshare.com
    csv
    Updated Oct 2, 2024
    Cite
    Edward A. S. Ross; Jackie Baines (2024). AI and Ancient Languages Student Survey - 2023-2024 Comparative Data [Dataset]. http://doi.org/10.6084/m9.figshare.27146880.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Edward A. S. Ross; Jackie Baines
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This comparative dataset was collected during the data-gathering portion of the "ChatGPT: A Conversational Language Study Tool" project over the 2023-2024 academic year. These survey forms were completed by students in ancient language classes in the Department of Classics at the University of Reading. This project has been reviewed by the University of Reading University Research Ethics Committee and has been given a favourable ethical opinion for conduct.

  19. General domain Human-Human conversation chats in Bengali

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Bengali [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bengali-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations between pairs of native Bengali speakers in the general domain. The chats cover a variety of daily-life topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    These chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  20. AI Training Data | US Transcription Data| Unique Consumer Sentiment Data:...

    • datarade.ai
    Updated Jan 13, 2025
    Cite
    WiserBrand.com (2025). AI Training Data | US Transcription Data| Unique Consumer Sentiment Data: Transcription of the calls to the companies [Dataset]. https://datarade.ai/data-products/wiserbrand-ai-training-data-us-transcription-data-unique-wiserbrand-com
    Explore at:
    Available download formats: .csv, .xls, .txt, .json
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    WiserBrand.com
    Area covered
    United States
    Description

    WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights

    WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:

    User ID and Firm Name: Identify and categorize calls by unique user IDs and company names.
    Call Duration: Analyze engagement levels through call lengths.
    Geographical Information: Detailed data on city, state, and country for regional analysis.
    Call Timing: Track peak interaction times with precise timestamps.
    Call Reason and Group: Categorised reasons for calls, helping to identify common customer issues.
    Device and OS Types: Information on the devices and operating systems used for technical support analysis.
    Transcriptions: Full-text transcriptions of each call, enabling sentiment analysis, keyword extraction, and detailed interaction reviews.

    Our dataset is designed for businesses aiming to enhance customer service strategies, develop targeted marketing campaigns, and improve product support systems. Gain actionable insights into customer needs and behavior patterns with this comprehensive collection, particularly useful for Consumer Data, Consumer Behavior Data, Consumer Sentiment Data, Consumer Review Data, AI Training Data, Textual Data, and Transcription Data applications.

    WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.

    Cases:

    1. Training Speech Recognition (Speech-to-Text) and Speech Synthesis (Text-to-Speech) Models

    WiserBrand's Comprehensive Customer Call Transcription Dataset is an excellent resource for training and improving speech recognition models (Speech-to-Text, STT) and speech synthesis systems (Text-to-Speech, TTS). Here’s how this dataset can contribute to these tasks:

    Enriching STT Models: The dataset includes a wide variety of real-world customer service calls with diverse accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.

    Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model’s ability to transcribe in a more contextually relevant manner.

    Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.

    Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.

    2. Training AI Agents for Replacing Customer Service Representatives

    WiserBrand’s dataset can be incredibly valuable for businesses looking to develop AI-powered customer support agents that can replace or augment human customer service representatives. Here’s how this dataset supports AI agent training:

    Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.

    Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it’s providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.

    Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as ...
