100+ datasets found

478 Hours - Spanish Conversational Speech Data by Mobile Phone
nexdata.ai
m.nexdata.ai
Updated Dec 5, 2023
Cite
Nexdata (2023). 478 Hours - Spanish Conversational Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1147
Explore at:
Dataset updated
Dec 5, 2023
Dataset authored and provided by
Nexdata
Variables measured
Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
Description
Spanish (Spain) spontaneous dialogue smartphone speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (596 native speakers), with the aim of enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.
General domain Human-Human conversation chats in Bahasa
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Bahasa [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bahasa-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text chats between two native Bahasa speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually being expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
General domain Human-Human conversation chats in German
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in German [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/german-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text chats between two native German speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually being expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
General domain Human-Human conversation chats in Swedish
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Swedish [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/swedish-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text chats between two native Swedish speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually being expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
General domain Human-Human conversation chats in Urdu
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Urdu [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/urdu-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text chats between two native Urdu speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually being expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
English Conversation and Monologue speech dataset
kaggle.com
Updated Jun 7, 2024
Cite
Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
English (America) real-world casual conversation and monologue speech dataset covering self-media, conversation, live, lecture, variety-show, etc., and mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, with the aim of enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

Format

16kHz, 16 bit, wav, mono channel;

Content category

Including self-media, conversation, live, lecture, variety-show, etc;

Recording environment

Low background noise;

Country

America(USA);

Language(Region) Code

en-US;

Language

English;

Features of annotation

Transcription text, timestamp, speaker ID, gender.

Accuracy Rate

Sentence Accuracy Rate (SAR) 95%

Licensing Information

Commercial License
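
The "Format" field above (16kHz, 16 bit, wav, mono channel) can be checked mechanically before training. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it generates a short silent clip in memory so that it runs without any dataset files.

```python
import io
import wave

# Expected audio parameters, taken from the "Format" field above.
EXPECTED = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}


def wav_params(source):
    """Return sample rate, sample width and channel count of a WAV file or file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_spec(source, expected=EXPECTED):
    """True if the audio matches the 16 kHz, 16-bit, mono description."""
    return wav_params(source) == expected


def make_silent_wav(seconds=0.1, sample_rate=16000):
    """Build an in-memory 16-bit mono WAV (a stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


print(matches_spec(make_silent_wav()))
```

To check real recordings, pass a file path to `matches_spec` instead of the in-memory buffer.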
Mandarin Mobile Telephony Conversational Speech Collection Data - 2,657...
catalog.elra.info
Updated Oct 6, 2022
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Mandarin Mobile Telephony Conversational Speech Collection Data - 2,657 Hours [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0421/
Explore at:
Dataset updated
Oct 6, 2022
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
4,491 speakers participated in the recording and conducted face-to-face communication in a natural way. No topics are specified, so the conversations cover a wide range of fields; the speech is natural and fluent, in line with actual dialogue scenes. Text is transcribed manually, with high accuracy.
Format: 16kHz, 16bit, uncompressed wav, mono channel
Environments: quiet indoor environment, without echo
Recording content: no topic is specified, and the speakers make dialogue while the recording is performed
Demographics: 4,491 speakers, 63% of whom are female
Annotations: transcription text, speaker identification, and gender
Device: Android mobile phone, iPhone
Language: Mandarin
Applications: speech recognition; voiceprint recognition
Accuracy rate: 97%
1,503 Hours - Arabic(UAE) Real-world Casual Conversation and Monologue...
nexdata.ai
m.nexdata.ai
Updated Jun 28, 2025
Cite
Nexdata (2025). 1,503 Hours - Arabic(UAE) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1710
Explore at:
Dataset updated
Jun 28, 2025
Dataset authored and provided by
Nexdata
Area covered
United Arab Emirates
Variables measured
Format, Accuracy, Language, Annotation, Application scenarios
Description
Arabic (UAE) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, with the aim of enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.
Implicature dataset
figshare.com
pdf
Updated May 31, 2023
Cite
Elizabeth Jasmi George (2023). Implicature dataset [Dataset]. http://doi.org/10.6084/m9.figshare.10315505.v7
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.10315505.v7
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Elizabeth Jasmi George
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set consists of conversational implicatures of utterances. Conversational implicatures are the meanings an utterance conveys beyond what it literally states. The data consist of 1001 utterances that come as responses in a specific context, together with their implicatures. These written representations of the utterances were collected manually by scraping and transcribing from relevant sources from August 2019 to August 2020. The sources of dialogues in the data include TOEFL listening comprehension short conversations, movie dialogues from IMSDb, and websites explaining idioms, similes, metaphors, and hyperboles. The implicatures are annotated manually.

Formatting

The dataset file (Conversational Implicature Dataset 1-1001 - implicature data 1-1001.csv) is written as a comma-separated values file. Columns that contain commas (,) are escaped using double quotes ("). The dataset is also available as an Excel sheet (Conversational Implicature Dataset 1-1001.xlsx).

Content

The dataset is available in Conversational Implicature Dataset 1-1001 - implicature data 1-1001.csv. Each entry in the dataset consists of a context utterance, a response utterance, and an implicature.

Context Utterance

The written representation of an utterance which serves as the context in which the response utterance can implicate a meaning different from its literal meaning.

Response Utterance

The written representation of an utterance which has a different meaning than the meaning of the sentences used in it.

Implicature

The implicated meaning of the response utterance.
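
Because fields containing commas are wrapped in double quotes, the file should be read with a CSV parser rather than split on commas. Below is a minimal sketch using Python's standard `csv` module. The header names and both sample rows are invented for illustration and are not taken from the dataset; the real file's column headers may differ.

```python
import csv
import io

# Invented sample in the layout the description implies; real header names may differ.
SAMPLE_CSV = '''Context utterance,Response utterance,Implicature
"Are you coming to the party, or staying home?","I have to work.","I am not coming to the party."
Did you like the movie?,"Well, the popcorn was good.",I did not like the movie.
'''


def read_implicatures(text):
    """Parse CSV text into a list of dicts; the csv module handles the quoted commas."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_implicatures(SAMPLE_CSV)
for row in rows:
    print(row["Response utterance"], "->", row["Implicature"])
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `csv.DictReader` directly.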
General domain Human-Human conversation chats in Spanish
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Spanish [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 conversational text chats between two native Spanish speakers in the general domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually being expanded with new chats. We can produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
330 Hours - Dari Conversational Speech Data by Telephone
m.nexdata.ai
nexdata.ai
Updated Feb 6, 2024
Cite
Nexdata (2024). 330 Hours - Dari Conversational Speech Data by Telephone [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1240?source=Github
Explore at:
Dataset updated
Feb 6, 2024
Dataset authored and provided by
Nexdata
Variables measured
Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Features of annotation
Description
Dari (Afghanistan) spontaneous dialogue telephony speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (452 native speakers), with the aim of enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.
Conversational Skills in Language Learning Games: A Speech Recognition...
figshare.com
data.mendeley.com
bin
Updated Dec 8, 2023
Cite
Murat Kuvvetli (2023). Conversational Skills in Language Learning Games: A Speech Recognition Technology Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24769470.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.24769470.v1
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Murat Kuvvetli
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "SpeechRec_LanguageLearning_ConversationalSkills" dataset is a collection of data generated in a game-based language learning environment, aiming to explore the impact of Speech Recognition Technology (SRT) on the development of conversational skills. The dataset encompasses speaking test results conducted within the context of language learning games utilizing SRT.
Conversational AI in Healthcare Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 5, 2025
Cite
Archive Market Research (2025). Conversational AI in Healthcare Report [Dataset]. https://www.archivemarketresearch.com/reports/conversational-ai-in-healthcare-12100
Explore at:
Available download formats: doc, pdf, ppt
Dataset updated
Feb 5, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Market Analysis

The market for Conversational AI in Healthcare is projected to reach a value of XX million by 2033, growing at a CAGR of 5%. The growth is driven by the increasing adoption of AI in healthcare, the rising need for efficient patient care, and the growing prevalence of chronic diseases. Key market trends include the integration of natural language processing (NLP) and machine learning (ML) for improved communication and analysis, and the emergence of cloud-based solutions for cost-effective scalability. The major segments of the market are NLP and ML based solutions, with applications in medical record mining, medical imaging analysis, medicine development, and emergency assistance.

Value Chain Analysis

The Conversational AI in Healthcare market value chain consists of several players, including hardware manufacturers, software developers, solution providers, and healthcare providers. Hardware manufacturers provide the devices and sensors used for data collection and processing. Software developers create the AI algorithms and software, enabling healthcare providers to interact with patients through conversational interfaces. Solution providers integrate hardware and software to provide end-to-end solutions. Healthcare providers, including hospitals, clinics, and nursing homes, are the end-users who utilize Conversational AI solutions to enhance patient care. Key market players include Google Health, IBM Watson Health, Oncora Medical, and CloudMedX Health.
Ethical Dialogue Dataset
opendatabay.com
Updated Jul 4, 2025
Cite
Datasimple (2025). Ethical Dialogue Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/92d27a42-d8ec-46f5-acba-f415f82cdf52
Explore at:
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Education & Learning Analytics
Description
ProsocialDialog is a large-scale, multi-turn English dialogue dataset designed to teach conversational agents how to respond to problematic content in line with social norms. It addresses a variety of unethical, biased, toxic, and generally problematic situations. The dataset is notable for its focus on encouraging prosocial behaviour, which is guided by commonsense social rules, referred to as Rules-of-Thumb (RoTs). Developed through a human-AI collaborative framework, the dataset consists of 58,000 dialogues, comprising 331,000 utterances, 160,000 unique RoTs, and 497,000 dialogue safety labels, each accompanied by free-form rationales. The test.csv file within the ProsocialDialog dataset contains data specifically for evaluating the accuracy of a model in predicting conversation safety.

Columns

The dataset includes the following columns:
* context: The context of the conversation. (String)
* response: The response to the conversation. (String)
* rots: Rules of thumb associated with the conversation. (String)
* safety_label: The safety label associated with the conversation. (String)
* safety_annotations: Annotations associated with the conversation. (String)
* safety_annotation_reasons: Reasons for the safety annotations. (String)
* source: The source of the conversation. (String)
* etc: Any additional information associated with the conversation. (String)
* dialogue_id: Unique identifier for each dialogue.
* response_id: Unique identifier for each response.
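
As a rough illustration of working with these columns, the sketch below counts safety labels and groups rows by dialogue using Python's standard `csv` module. The three sample rows and the label values (`label_a`, `label_b`) are invented placeholders, not actual dataset content, and only a subset of the listed columns is included.

```python
import csv
import io
from collections import Counter

# Invented rows using a subset of the columns listed above; label values are placeholders.
SAMPLE = '''context,response,rots,safety_label,dialogue_id,response_id
"I skipped class again.","Why did you skip it?","It is good to attend class.",label_a,0,0
"It was boring.","Attending still helps you learn.","It is good to attend class.",label_b,0,1
"Nice weather today.","It is, enjoy it!","",label_a,1,0
'''


def label_counts(csv_text):
    """Count how often each safety_label value occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["safety_label"] for row in reader)


def group_by_dialogue(csv_text):
    """Collect rows per dialogue_id, preserving file order within each dialogue."""
    dialogues = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        dialogues.setdefault(row["dialogue_id"], []).append(row)
    return dialogues


print(label_counts(SAMPLE))
print({k: len(v) for k, v in group_by_dialogue(SAMPLE).items()})
```

Note that `csv.DictReader` yields every field as a string, so `dialogue_id` and `response_id` need an explicit `int()` conversion if numeric ordering matters.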

Distribution

The dataset is typically provided in a CSV file format, such as test.csv. It contains 58,000 dialogues, encompassing 331,000 utterances. There are 24,972 unique dialogue IDs and 24,903 unique response IDs. The dataset includes 160,000 unique Rules-of-Thumb (RoTs) and 497,000 dialogue safety labels. Specific numbers for rows or records beyond these counts are not provided in the sources.

Usage

This dataset is ideally suited for several applications:
* Designing Conversational Agents: It can be used to build Natural Language Processing (NLP) models capable of recognising and classifying problematic content. The safety labels, rationales, and RoTs can train conversational agents to respond in socially acceptable ways.
* Benchmark Systems: ProsocialDialog serves as an effective benchmark for evaluating the performance of existing conversation datasets in identifying, responding to, and preventing problematic content interactions.
* Automated Moderation: The dialogue safety labels and their associated free-form rationales are valuable for technology platforms implementing automated moderation tasks, such as flagging or banning offensive messages or users.

Coverage

The ProsocialDialog dataset is in English and has a global regional coverage. It addresses general conversational scenarios involving social norms and problematic content, but specific demographic scope details or the precise time range of data collection are not explicitly outlined in the sources. The dataset was listed on 11/06/2025.

License

CC0

Who Can Use It

This dataset is beneficial for a range of users, including:
* Researchers and Developers in AI and Machine Learning: Particularly those focused on Natural Language Processing (NLP) and building sophisticated conversational AI systems.
* Organisations and Platforms: Especially those in need of automated moderation tools or aiming to ensure their conversational agents adhere to social norms and promote prosocial behaviour.
* Academics and Students: Engaged in studying dialogue safety, social psychology, or ethical AI, who can explore the safety labels, annotations, RoTs, and data sources to gain deeper insights into human conversation dynamics.

Dataset Name Suggestions

ProsocialDialog - Problematic Content Dialogue

Conversational Safety Norms

Ethical Dialogue Dataset

Social Norms AI Conversations

Harmful Content Dialogue Dataset

Attributes

Original Data Source: ProsocialDialog - Problematic Content Dialogue
StudyAbroadGPT Dataset Dataset
paperswithcode.com
Updated Apr 21, 2025
Cite
Md Millat; Md Motiur (2025). StudyAbroadGPT Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/studyabroadgpt-dataset
Explore at:
Dataset updated
Apr 21, 2025
Authors
Md Millat; Md Motiur
Description
The StudyAbroadGPT-Dataset is a collection of conversational data focused on university application requirements for various programs, including MBA, MS in Computer Science, Data Science, and Bachelor of Medicine. The dataset includes interactions between humans asking questions about application processes (e.g., "How do I write a strong SOP for MS in Data Science at MIT?") and an assistant providing detailed responses. Covering prestigious institutions such as MIT, Oxford, Cambridge, and Stanford, this dataset serves as a valuable resource for understanding the informational needs of prospective students applying to study abroad.

Dataset Structure

The dataset is organized as a list of JSON objects, where each object represents a single conversation. Each conversation contains an array of turns, structured as follows:

"from": Specifies the speaker, either "human" or "assistant".
"value": Contains the text of the query or response.

Example

    {
      "conversations": [
        {"from": "human", "value": "What documents do I need for applying to MBA?"},
        {"from": "assistant", "value": "## Introduction To embark on your MBA journey, it's crucial to gather the necessary documents..."}
      ]
    }
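
The structure described above can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not part of the dataset's tooling: the `validate` helper is hypothetical, and the sample record is a shortened copy of the example above.

```python
import json

# Shortened copy of the example record above, wrapped in a list as the description states.
RAW = '''
[
  {"conversations": [
    {"from": "human", "value": "What documents do I need for applying to MBA?"},
    {"from": "assistant", "value": "## Introduction To embark on your MBA journey..."}
  ]}
]
'''

ALLOWED_SPEAKERS = {"human", "assistant"}


def validate(records):
    """Return a list of problems found; an empty list means every turn has the expected shape."""
    problems = []
    for i, record in enumerate(records):
        for j, turn in enumerate(record.get("conversations", [])):
            if turn.get("from") not in ALLOWED_SPEAKERS:
                problems.append(f"record {i}, turn {j}: unexpected speaker {turn.get('from')!r}")
            if not isinstance(turn.get("value"), str):
                problems.append(f"record {i}, turn {j}: 'value' is not a string")
    return problems


records = json.loads(RAW)
print(validate(records))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in a single pass.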

Data Collection

This dataset was synthetically generated to simulate realistic conversations about study abroad applications. It is designed to reflect common queries and provide detailed, informative responses related to university application requirements.

Preprocessing

The dataset is provided in its raw form, consisting of unprocessed conversational text. Depending on their specific use case, such as natural language processing (NLP) tasks, users may need to perform additional preprocessing steps like tokenization or stopword removal.
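
As one possible illustration of the preprocessing steps mentioned above, here is a minimal standard-library sketch of tokenization followed by stopword removal. The regular expression and the tiny stopword list are assumptions chosen for the example; real pipelines usually rely on a tokenizer and a curated stopword list suited to the task.

```python
import re

# A tiny illustrative stopword list (an assumption for this sketch, not a standard list).
STOPWORDS = {"a", "an", "the", "do", "i", "for", "to", "what", "is", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, keeping the original order."""
    return [t for t in tokens if t not in stopwords]


tokens = tokenize("What documents do I need for applying to MBA?")
print(remove_stopwords(tokens))
```

Models that bring their own subword tokenizer generally expect raw text, in which case this step can be skipped.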

Usage

You can load the dataset using the Hugging Face Datasets library with the following code:

    from datasets import load_dataset

    dataset = load_dataset("millat/StudyAbroadGPT-Dataset")

Accessing Conversations

To access and iterate through the conversations, use this example:

    for conversation in dataset["train"]:
        for turn in conversation["conversations"]:
            print(f"{turn['from']}: {turn['value']}")

Examples

Here's a sample conversation excerpt from the dataset:

    {
      "conversations": [
        {"from": "human", "value": "What are the GMAT requirements for Oxford?"},
        {"from": "assistant", "value": "## Introduction Oxford University, one of the world's most prestigious institutions, requires..."}
      ]
    }

Use Cases

Training Conversational Agents: Build chatbots to assist with university application queries.
Analyzing Trends: Study application requirements across different programs and institutions.
NLP Development: Create natural language understanding models tailored to educational domains.

License

This dataset is licensed under the MIT License.

Citation

If you use this dataset in your research, please cite it as follows:

    @misc{StudyAbroadGPT-Dataset,
      author = {MD MILLAT HOSEN},
      title = {StudyAbroadGPT-Dataset},
      year = {2025},
      publisher = {Hugging Face},
      howpublished = {\url{https://huggingface.co/datasets/millat/StudyAbroadGPT-Dataset}}
    }
Data from: Spiritual conversation model for patients and loved ones in...
ssh.datastations.nl
Updated Mar 11, 2024
Cite
Marc Haufe; Marc Haufe (2024). Spiritual conversation model for patients and loved ones in palliative care: a validation study [Dataset]. http://doi.org/10.17026/SS/K0NGCL
Explore at:
Available download formats: csv(675), pdf(57645), text/comma-separated-values(6011), pdf(88279), application/x-spss-syntax(26964), ods(36881), pdf(81688), pdf(38601), pdf(61582), text/comma-separated-values(112710)
Unique identifier
https://doi.org/10.17026/SS/K0NGCL
Dataset updated
Mar 11, 2024
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
Marc Haufe; Marc Haufe
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is for follow-up research purposes. It consists of all key documents from data collection to data analysis of the research project. See the readme document for the relevant procedure and document description.
Data from: Conversational Transcripts of Truthful and Deceptive Speech...
icpsr.umich.edu
Updated Aug 29, 2018
Cite
Duran, Nicholas; Paxton, Alexandra; Fusaroli, Riccardo (2018). Conversational Transcripts of Truthful and Deceptive Speech Involving Controversial Topics, Central California, 2012 [Dataset]. http://doi.org/10.3886/ICPSR37124.v1
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR37124.v1
Dataset updated
Aug 29, 2018
Dataset provided by
Inter-university Consortium for Political and Social Research https://www.icpsr.umich.edu/web/pages/
Authors
Duran, Nicholas; Paxton, Alexandra; Fusaroli, Riccardo
License
https://www.icpsr.umich.edu/web/ICPSR/studies/37124/terms
Time period covered
May 2012 - Nov 2012
Area covered
California, United States, California
Description
This study investigated dynamic patterns of interpersonal coordination in extended deceptive conversations across multi-modal channels of behavior. Using a "devil's advocate" paradigm, the researchers experimentally elicited deception and truth across controversial social and political topics on which conversational partners either agreed or disagreed, and where one partner was surreptitiously asked to argue an opinion opposite to what he or she really believed. The researchers focused on interpersonal coordination as an emergent behavioral signal that captured inter-dependencies between conversational partners, both as the coupling of head movements over the span of milliseconds, measured via a windowed lagged cross correlation (WLCC) technique, and as more global temporal dependencies across speech rate, measured using cross recurrence quantification analysis (CRQA). A further focus was how interpersonal coordination might be shaped by strategic, adaptive conversational goals associated with deception. This collection includes both qualitative transcripts and a quantitative dataset with respondent demographics (sex, age, and ethnicity). The qualitative dataset consists of 94 written transcripts of audio-recorded conversations, each lasting eight minutes. The quantitative dataset includes 5 variables for 102 cases.
AI and Ancient Languages Student Survey - 2023-2024 Comparative Data
figshare.com
csv
Updated Oct 2, 2024
Cite
Edward A. S. Ross; Jackie Baines (2024). AI and Ancient Languages Student Survey - 2023-2024 Comparative Data [Dataset]. http://doi.org/10.6084/m9.figshare.27146880.v1
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.27146880.v1
Dataset updated
Oct 2, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Edward A. S. Ross; Jackie Baines
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This comparative dataset was collected during the data-gathering portion of the "ChatGPT: A Conversational Language Study Tool" project over the 2023-2024 academic year. These survey forms were completed by students in ancient language classes in the Department of Classics at the University of Reading. This project has been reviewed by the University of Reading University Research Ethics Committee and has been given a favourable ethical opinion for conduct.
General domain Human-Human conversation chats in Bengali
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in Bengali [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/bengali-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 text conversations between two native Bengali speakers in the general domain. The chats cover a variety of daily-life topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats, to make the text data unbiased.
Each chat script has between 300 and 700 words and up to 50 turns. The dataset was contributed by 150 members of the FutureBeeAI crowd community. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
AI Training Data | US Transcription Data| Unique Consumer Sentiment Data:...
datarade.ai
Updated Jan 13, 2025
Cite
WiserBrand.com (2025). AI Training Data | US Transcription Data| Unique Consumer Sentiment Data: Transcription of the calls to the companies [Dataset]. https://datarade.ai/data-products/wiserbrand-ai-training-data-us-transcription-data-unique-wiserbrand-com
Explore at:
Available download formats: .csv, .xls, .txt, .json
Dataset updated
Jan 13, 2025
Dataset provided by
WiserBrand.com
Area covered
United States
Description
WiserBrand's Comprehensive Customer Call Transcription Dataset: Tailored Insights

WiserBrand offers a customizable dataset comprising transcribed customer call records, meticulously tailored to your specific requirements. This extensive dataset includes:

User ID and Firm Name: Identify and categorize calls by unique user IDs and company names.
Call Duration: Analyze engagement levels through call lengths.
Geographical Information: Detailed data on city, state, and country for regional analysis.
Call Timing: Track peak interaction times with precise timestamps.
Call Reason and Group: Categorized reasons for calls, helping to identify common customer issues.
Device and OS Types: Information on the devices and operating systems used, for technical support analysis.
Transcriptions: Full-text transcriptions of each call, enabling sentiment analysis, keyword extraction, and detailed interaction reviews.

Our dataset is designed for businesses aiming to enhance customer service strategies, develop targeted marketing campaigns, and improve product support systems. Gain actionable insights into customer needs and behavior patterns with this comprehensive collection, particularly useful for Consumer Data, Consumer Behavior Data, Consumer Sentiment Data, Consumer Review Data, AI Training Data, Textual Data, and Transcription Data applications.

WiserBrand's dataset is essential for companies looking to leverage Consumer Data and B2B Marketing Data to drive their strategic initiatives in the English-speaking markets of the USA, UK, and Australia. By accessing this rich dataset, businesses can uncover trends and insights critical for improving customer engagement and satisfaction.

Use cases:

Training Speech Recognition (Speech-to-Text) and Speech Synthesis (Text-to-Speech) Models
WiserBrand's Comprehensive Customer Call Transcription Dataset is an excellent resource for training and improving speech recognition models (Speech-to-Text, STT) and speech synthesis systems (Text-to-Speech, TTS). Here’s how this dataset can contribute to these tasks:

Enriching STT Models: The dataset includes a wide variety of real-world customer service calls with diverse accents, tones, and terminologies. This makes it highly valuable for training speech-to-text models to better recognize different dialects, regional speech patterns, and industry-specific jargon. It could help improve accuracy in transcribing conversations in customer service, sales, or technical support.

Contextualized Speech Recognition: Given the contextual information (e.g., reasons for calls, call categories, etc.), it can help models differentiate between various types of conversations (technical support vs. sales queries), which would improve the model’s ability to transcribe in a more contextually relevant manner.

Improving TTS Systems: The transcriptions, along with their associated metadata (such as call duration, timing, and call reason), can aid in training Text-to-Speech models that mimic natural conversation patterns, including pauses, tone variation, and proper intonation. This is especially beneficial for developing conversational agents that sound more natural and human-like in their responses.

Noise and Speech Quality Handling: Real-world customer service calls often contain background noise, overlapping speech, and interruptions, which are crucial elements for training speech models to handle real-life scenarios more effectively.

Training AI Agents for Replacing Customer Service Representatives
WiserBrand’s dataset can be incredibly valuable for businesses looking to develop AI-powered customer support agents that can replace or augment human customer service representatives. Here’s how this dataset supports AI agent training:

Customer Interaction Simulation: The transcriptions provide a comprehensive view of real customer interactions, including common queries, complaints, and support requests. By training AI models on this data, businesses can equip their virtual agents with the ability to understand customer concerns, follow up on issues, and provide meaningful solutions, all while mimicking human-like conversational flow.

Sentiment Analysis and Emotional Intelligence: The full-text transcriptions, along with associated call metadata (e.g., reason for the call, call duration, and geographical data), allow for sentiment analysis, enabling AI agents to gauge the emotional tone of customers. This helps the agents respond appropriately, whether it’s providing reassurance during frustrating technical issues or offering solutions in a polite, empathetic manner. Such capabilities are essential for improving customer satisfaction in automated systems.

Customizable Dialogue Systems: The dataset allows for categorizing and identifying recurring call patterns and issues. This means AI agents can be trained to recognize the types of queries that come up frequently, allowing them to automate routine tasks such as ...
