https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
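As a rough illustration of how such JSON transcriptions might be consumed, the sketch below loads one transcription file and prints its turns; the file path and field names (segments, speaker, text) are assumptions made for illustration, not the dataset's documented schema.

    import json
    from pathlib import Path

    # Minimal sketch: read one hypothetical transcription file and print its turns.
    # The path and the keys ("segments", "speaker", "text") are illustrative assumptions.
    transcript_path = Path("transcriptions/conversation_0001.json")
    with transcript_path.open(encoding="utf-8") as f:
        transcript = json.load(f)

    for segment in transcript.get("segments", []):
        speaker = segment.get("speaker", "unknown")
        text = segment.get("text", "")
        print(f"{speaker}: {text}")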
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
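For example, speaker-level metadata could drive use-case-specific filtering along the lines of the sketch below; the metadata file name and fields (gender, age, recording_id) are assumptions, since the exact schema is not listed here.

    import json

    # Illustrative sketch: select recordings by hypothetical speaker metadata fields.
    # The file name and keys ("speakers", "gender", "age", "recording_id") are assumptions.
    with open("metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)

    selected = [
        entry["recording_id"]
        for entry in metadata["speakers"]
        if entry.get("gender") == "female" and 18 <= entry.get("age", 0) <= 30
    ]
    print(f"{len(selected)} recordings match the filter")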
This dataset is a versatile resource for multiple Mandarin speech and language AI applications:
Languages: Percent Chinese Speakers: Basic demographics by census tract in King County, based on the current American Community Survey 5-Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This database contains the recordings of 500 Mandarin Chinese speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People's Republic of China. The demographics of the Northern Chinese native speakers are as follows:
- Beijing: 200 speakers (100 males, 100 females)
- North of Beijing: 101 speakers (50 males, 51 females)
- Shandong: 149 speakers (75 males, 74 females)
- Henan: 50 speakers (25 males, 25 females)
Each speaker profile includes a unique ID, place of birth, the place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place. Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz, mono, 16-bit linear PCM. The recording script consists of:
• Phoneme-balanced statements: 785 sentences
• Travel conversation: 1,618 sentences
• About 200 sentences per speaker, including 134 travel conversation sentences and 66 phoneme-balanced sentences
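A quick sanity check of the stated audio format (48 kHz, mono, 16-bit linear PCM) can be done with Python's standard wave module; the file path below is a placeholder, not an actual file name from the corpus.

    import wave

    # Verify one recording against the documented format: 48 kHz, mono, 16-bit linear PCM.
    with wave.open("speaker_0001_utt_001.wav", "rb") as wav:   # placeholder path
        assert wav.getframerate() == 48000, "expected 48 kHz sampling rate"
        assert wav.getnchannels() == 1, "expected mono audio"
        assert wav.getsampwidth() == 2, "expected 16-bit samples"
        duration_s = wav.getnframes() / wav.getframerate()
        print(f"duration: {duration_s:.1f} s")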
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database contains the recordings of 1,000 Mandarin Chinese speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People's Republic of China. The demographics of the Southern Chinese native speakers are as follows:
- Guangdong: 312 speakers (154 males, 158 females)
- Fujian: 155 speakers (95 males, 60 females)
- Jiangsu: 262 speakers (134 males, 128 females)
- Zhejiang: 160 speakers (84 males, 76 females)
- Taiwan: 105 speakers (31 males, 74 females)
- Other Southern regions: 6 speakers (2 males, 4 females)
Each speaker profile includes a unique ID, place of birth, the place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place. Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz, mono, 16-bit linear PCM. The recording script consists of:
• Phoneme-balanced statements: 785 sentences
• Travel conversation: 1,618 sentences
• About 200 sentences per speaker, including 134 travel conversation sentences and 66 phoneme-balanced sentences
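Since many ASR pipelines expect 16 kHz input, the 48 kHz studio recordings would typically be downsampled before training; a minimal sketch using soundfile and scipy is shown below (these libraries and the placeholder path are assumptions about your tooling, not part of the dataset release).

    import soundfile as sf
    from scipy.signal import resample_poly

    # Sketch: downsample a 48 kHz mono recording to 16 kHz for a typical ASR front end.
    audio, sr = sf.read("speaker_1001_utt_001.wav")    # placeholder path
    assert sr == 48000
    audio_16k = resample_poly(audio, up=1, down=3)      # 48000 / 3 = 16000
    sf.write("speaker_1001_utt_001_16k.wav", audio_16k, 16000)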
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Mandarin Chinese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Mandarin-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Mandarin Chinese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
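Dual-channel call recordings usually place the agent and the customer on separate channels; a hedged sketch of splitting them with soundfile is shown below (the file name and channel ordering are assumptions and should be checked against the dataset documentation).

    import soundfile as sf

    # Sketch: split a stereo (dual-channel) call into two mono files.
    # Which channel holds the agent vs. the customer is an assumption here.
    audio, sr = sf.read("call_0001.wav")        # placeholder path; shape (n_samples, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected dual-channel audio"
    sf.write("call_0001_agent.wav", audio[:, 0], sr)
    sf.write("call_0001_customer.wav", audio[:, 1], sr)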
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
Taiwanese Mandarin (China) Scripted Monologue Smartphone speech dataset, collected from monologues based on given prompts and covering economy, entertainment, news, oral language, numbers, the alphabet, and other domains. Transcribed with text content. The data was collected from an extensive and geographically diverse pool of 204 native speakers, enhancing model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Mandarin Chinese Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Mandarin speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.
Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.
The dataset features 30 hours of dual-channel call center conversations between native Mandarin Chinese speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.
The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).
These real-world interactions help build speech models that understand healthcare domain nuances and user intent.
Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.
Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.
This dataset can be used across a range of healthcare and voice AI use cases:
Taiwanese-accented Mandarin (China) Real-world Casual Conversation and Monologue speech dataset, covering self-media, conversation, livestreaming, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from an extensive and geographically diverse pool of speakers, enhancing model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Mandarin Chinese Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Mandarin-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Mandarin Chinese speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics, from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
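Time-coded transcriptions make it straightforward to cut a call into utterance-level training segments; the sketch below assumes hypothetical start/end fields expressed in seconds, which may differ from the actual JSON keys in the delivery.

    import json
    import soundfile as sf

    # Sketch: cut utterance-level clips out of a call using time-coded segments.
    # File names, keys ("segments", "start", "end") and units (seconds) are assumptions.
    audio, sr = sf.read("telecom_call_0001.wav")
    with open("telecom_call_0001.json", encoding="utf-8") as f:
        segments = json.load(f)["segments"]

    for i, seg in enumerate(segments):
        start, end = int(seg["start"] * sr), int(seg["end"] * sr)
        sf.write(f"telecom_call_0001_seg{i:03d}.wav", audio[start:end], sr)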
Rich metadata is available for each participant and conversation:
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Mandarin Chinese Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Mandarin-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Mandarin Chinese speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for Mandarin real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
Mandarin Chinese (China) Noisy Monologue Smartphone speech dataset (Guiding), collected from monologues based on given prompts and covering generic domains such as in-car, smart home, and voice assistant, recorded in noisy conditions. Transcribed with text content and other attributes. The data was collected from an extensive and geographically diverse pool of 205 speakers, enhancing model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recently, considerable attention has been given to the effect of the age of acquisition (AoA) on learning a second language (L2); however, the scarcity of L2 AoA ratings has limited advancements in this field. We presented the ratings of L2 AoA in late, unbalanced Chinese-English bilingual speakers and collected the familiarity of the L2 and the corresponding Chinese translations of English words. In addition, to promote the cross-language comparison and motivate the AoA research on Chinese two-character words, data on AoA, familiarity, and concreteness of the first language (L1) were also collected from Chinese native speakers. We first reported the reliability of each rated variable. Then, we described the validity by the following three steps: the distributions of each rated variable were described, the correlations between these variables were calculated, and regression analyses were run. The results showed that AoA, familiarity, and concreteness were all significant predictors of lexical decision times. The word database can be used by researchers who are interested in AoA, familiarity, and concreteness in both the L1 and L2 of late, unbalanced Chinese-English bilingual speakers. The full database is freely available for research purposes.
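To illustrate the kind of regression analysis described above, a minimal sketch with pandas and statsmodels is shown below; the file name and column names are assumptions about how the ratings might be organized, not the published structure of this database.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Sketch: predict lexical decision times from AoA, familiarity, and concreteness.
    # The file name and column names are illustrative assumptions.
    ratings = pd.read_csv("word_ratings.csv")
    model = smf.ols("decision_time ~ aoa + familiarity + concreteness", data=ratings).fit()
    print(model.summary())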
https://data.macgence.com/terms-and-conditions
This audio dataset contains general conversation recordings featuring Chinese speakers from China, with detailed metadata.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Real Estate-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Chinese Real Estate interactions. This diversity ensures the dataset accurately represents the language used by Chinese speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Chinese Real Estate interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Chinese Speech Separation Dataset (CSSD) is a dataset of audio recordings of people speaking Mandarin Chinese in a variety of noisy environments. The dataset consists of 10,000 audio recordings, each of which is a mixture of two speakers. The dataset is split into a training set of 8,000 recordings and a test set of 2,000 recordings. The audio recordings are in .wav format and have a sampling rate of 16 kHz. The audio recordings are labeled with the identities of the two speakers in the mixture. The CSSD dataset is a valuable resource for training speech separation models.
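For models trained on such two-speaker mixtures, scale-invariant SNR (SI-SNR) is a common evaluation metric; a small NumPy sketch is given below (the metric choice is ours for illustration, not something specified by the dataset).

    import numpy as np

    def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
        """Scale-invariant SNR in dB between an estimated and a reference source."""
        estimate = estimate - estimate.mean()
        reference = reference - reference.mean()
        # Project the estimate onto the reference to get the target component.
        target = np.dot(estimate, reference) / (np.dot(reference, reference) + eps) * reference
        noise = estimate - target
        return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

    # Example with a synthetic 1-second signal at 16 kHz.
    t = np.arange(16000) / 16000
    clean = np.sin(2 * np.pi * 220 * t)
    estimate = clean + 0.1 * np.random.randn(len(t))
    print(f"SI-SNR: {si_snr(estimate, clean):.1f} dB")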
We collect utterances from Chinese Artificial Intelligence Speakers (CAIS) and annotate them with slot tags and intent labels. The training, validation, and test sets are split by the distribution of intents; detailed statistics are provided in the supplementary material. Since the utterances are collected from speaker systems in the real world, the intent labels are skewed toward the PlayMusic intent. We adopt the BIOES tagging scheme for slots instead of the BIO2 scheme used in ATIS, since previous studies have highlighted meaningful improvements with this scheme in the sequence labeling field (Ratinov and Roth, 2009).
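The BIOES scheme referred to here marks single-token entities with S- tags and entity-final tokens with E- tags; a small conversion sketch from BIO2 to BIOES is shown below, written from the general definition of the schemes rather than from the dataset's release code (the example labels are hypothetical).

    def bio2_to_bioes(tags):
        """Convert a BIO2 tag sequence to BIOES (general scheme conversion sketch)."""
        bioes = []
        for i, tag in enumerate(tags):
            if tag == "O":
                bioes.append(tag)
                continue
            prefix, label = tag.split("-", 1)
            next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
            ends_here = next_tag != f"I-{label}"
            if prefix == "B":
                bioes.append(f"S-{label}" if ends_here else f"B-{label}")
            else:  # prefix == "I"
                bioes.append(f"E-{label}" if ends_here else f"I-{label}")
        return bioes

    print(bio2_to_bioes(["B-song", "I-song", "I-song", "O", "B-singer"]))
    # ['B-song', 'I-song', 'E-song', 'O', 'S-singer']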
https://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific BFSI-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on BFSI topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various BFSI use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Chinese BFSI interactions. This diversity ensures the dataset accurately represents the language used by Chinese speakers in BFSI contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Chinese BFSI interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of BFSI customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.
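For concreteness, a fine-tuning pipeline might represent each chat as a list of role-tagged turns plus direction and outcome labels, along the lines of the hypothetical record below; this structure is illustrative only, and the dataset's actual delivery format is not specified here.

    # Hypothetical chat record for supervised fine-tuning; field names are illustrative.
    chat_record = {
        "chat_id": "bfsi_000123",
        "direction": "inbound",            # inbound or outbound
        "outcome": "positive",             # positive, neutral, or negative
        "turns": [
            {"role": "customer", "text": "我想查询一下我的信用卡账单。"},
            {"role": "agent", "text": "好的，请提供您的卡号后四位。"},
        ],
    }

    # Flatten into a single training string for a dialogue model (one possible convention).
    training_text = "\n".join(f'{t["role"]}: {t["text"]}' for t in chat_record["turns"])
    print(training_text)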
VoxCeleb is a very popular dataset for speaker verification.
CnCeleb is similar to VoxCeleb, but mainly contains Mandarin speakers.
These datasets are too large for quick experiments, so I created a sampled version of them.
This dataset contains 100 speakers from VoxCeleb and 100 speakers from CnCeleb, which is enough to test out new ideas.
Specifically, the VoxCeleb part of this dataset is sampled from the dev set of VoxCeleb2, while the CnCeleb part is sampled directly from CnCeleb2.
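Speaker verification experiments on such a subset typically score trials by cosine similarity between speaker embeddings; a minimal sketch is shown below (the embedding extractor is left abstract, and the embedding size and threshold are illustrative assumptions).

    import numpy as np

    def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
        """Cosine similarity between two speaker embeddings (higher means more likely same speaker)."""
        return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

    # Sketch of a verification trial: embeddings would come from your own extractor
    # (e.g., an x-vector or ECAPA-style model); random vectors stand in here.
    enroll = np.random.randn(192)
    test = np.random.randn(192)
    decision = "accept" if cosine_score(enroll, test) > 0.5 else "reject"   # illustrative threshold
    print(decision)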
[1] Chung, J.S., Nagrani, A. and Zisserman, A., 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622. [2] Fan, Y., Kang, J.W., Li, L.T., Li, K.C., Chen, H.L., Cheng, S.T., Zhang, P.Y., Zhou, Z.Y., Cai, Y.Q. and Wang, D., 2020, May. CN-Celeb: A challenging Chinese speaker recognition dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7604-7608). IEEE.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Travel-related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Travel topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Travel use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Chinese Travel interactions. This diversity ensures the dataset accurately represents the language used by Chinese speakers in Travel contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Chinese Travel interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Travel customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Although Chinese speech affective computing has received increasing attention, existing datasets still have defects such as a lack of naturalness, a single pronunciation style, and unreliable annotation, which seriously hinder research in this field. To address these issues, this paper introduces the first Chinese Natural Speech Complex Emotion Dataset (CNSCED) to provide natural data resources for Chinese speech affective computing. CNSCED was collected from publicly broadcast civil dispute and interview television programs in China, reflecting the authentic emotional characteristics of Chinese people in daily life. The dataset includes 14 hours of speech data from 454 speakers of various ages, totaling 15,777 samples. Given the inherent complexity and ambiguity of natural emotions, this paper proposes an emotion vector annotation method. This method uses a vector composed of six meta-emotional dimensions (angry, sad, aroused, happy, surprise, and fear) of different intensities to describe any single or complex emotion. CNSCED defines two subtasks: complex emotion classification and complex emotion intensity regression. In the experimental section, we evaluated the CNSCED dataset using deep neural network models and provide baseline results. To the best of our knowledge, CNSCED is the first public Chinese natural speech complex emotion dataset, and it can be used for scientific research free of charge.
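The emotion-vector annotation described above can be represented directly as a six-dimensional intensity vector; the sketch below shows one plausible encoding and how the two subtasks (complex emotion classification and intensity regression) could read their targets from it. The intensity scale and thresholding are assumptions, not the dataset's documented annotation format.

    import numpy as np

    # Six meta-emotional dimensions, in the order listed by the dataset description.
    DIMENSIONS = ["angry", "sad", "aroused", "happy", "surprise", "fear"]

    # Hypothetical annotation for one utterance: per-dimension intensities
    # (the numeric scale used here is an assumption, not the dataset's documented scale).
    emotion_vector = np.array([2.0, 1.0, 0.0, 0.0, 0.5, 0.0])

    # Complex emotion classification target: dimensions with nonzero intensity.
    present = [d for d, v in zip(DIMENSIONS, emotion_vector) if v > 0]
    # Complex emotion intensity regression target: the vector itself.
    print("classes:", present)              # ['angry', 'sad', 'surprise']
    print("regression target:", emotion_vector)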