79 datasets found
  1. Mandarin General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
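As a quick integrity check after download, the stated audio specs (stereo WAV, 16-bit depth, 16 kHz) can be verified with Python's standard-library `wave` module. The snippet below builds a silent stand-in file in memory rather than assuming any real filename from the dataset:

```python
import io
import wave

def read_specs(wav_bytes: bytes) -> dict:
    """Read the header properties of a WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": wf.getframerate(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }

# One second of silent stereo audio as an in-memory stand-in for a real file.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(2)       # stereo
    wf.setsampwidth(2)       # 16-bit = 2 bytes per sample
    wf.setframerate(16000)   # 16 kHz
    wf.writeframes(b"\x00" * 2 * 2 * 16000)  # channels * bytes * frames

specs = read_specs(buf.getvalue())
assert specs["channels"] == 2 and specs["sample_width_bytes"] == 2
assert specs["sample_rate"] == 16000
```

The same `read_specs` check can be pointed at real files to confirm they match the advertised format before training.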

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
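The exact transcription schema is not published on this page, so the field names below (`segments`, `speaker`, `start`, `end`, `text`) are illustrative assumptions only; the sketch shows how speaker-segmented, time-coded utterances of this kind might be consumed in a pipeline:

```python
import json

# Hypothetical transcription record; the real schema and field names may differ.
sample = json.loads("""
{
  "segments": [
    {"speaker": "A", "start": 0.0, "end": 2.35, "text": "你好，最近怎么样？"},
    {"speaker": "B", "start": 2.4, "end": 3.1, "text": "[laughter] 挺好的。"}
  ]
}
""")

# Aggregate per-speaker talk time from the time-coded, speaker-segmented turns.
talk_time: dict[str, float] = {}
for seg in sample["segments"]:
    talk_time[seg["speaker"]] = talk_time.get(seg["speaker"], 0.0) + seg["end"] - seg["start"]
```

Non-speech markers such as `[laughter]` would typically be stripped or mapped to special tokens before ASR training.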

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
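As a sketch of such use-case-specific filtering and demographic analysis, the records below use hypothetical keys (the real metadata fields may be named differently):

```python
# Hypothetical speaker-metadata records; the actual keys may differ.
speakers = [
    {"id": "spk01", "age": 24, "gender": "female", "province": "Sichuan"},
    {"id": "spk02", "age": 61, "gender": "male", "province": "Beijing"},
    {"id": "spk03", "age": 35, "gender": "female", "province": "Guangdong"},
]

# Use-case-specific filtering, e.g. a younger-female training subset.
subset = [s for s in speakers if s["gender"] == "female" and s["age"] < 40]

# Simple demographic analysis: speaker counts per province.
by_province = {}
for s in speakers:
    by_province[s["province"]] = by_province.get(s["province"], 0) + 1
```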

    Usage and Applications

    This dataset is a versatile resource for multiple Mandarin speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
    Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.

  2. Percent Chinese Speakers

    • hub.arcgis.com
    • gis-kingcounty.opendata.arcgis.com
    Updated Aug 10, 2016
    Cite
    King County (2016). Percent Chinese Speakers [Dataset]. https://hub.arcgis.com/datasets/294c704e32734e8ebcd030e535b841ce
    Explore at:
    Dataset updated
    Aug 10, 2016
    Dataset authored and provided by
    King County
    Area covered
    Description

    Languages: Percent Chinese Speakers. Basic demographics by census tract in King County, based on the current American Community Survey 5-Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.
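A minimal sketch of the suggested join, using pandas and toy rows; GEO_ID_TRT is the documented key field, while the other column names here are invented for illustration:

```python
import pandas as pd

# Toy tract rows; only GEO_ID_TRT reflects the real schema.
pct_chinese = pd.DataFrame({
    "GEO_ID_TRT": ["53033000100", "53033000200"],
    "pct_chinese_speakers": [4.2, 11.7],
})
other_acs = pd.DataFrame({
    "GEO_ID_TRT": ["53033000100", "53033000200"],
    "median_hh_income": [85000, 62000],
})

# Join the two tract-level tables on the shared key field.
merged = pct_chinese.merge(other_acs, on="GEO_ID_TRT", how="left")
```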

  3. 98.9 Hours - Taiwanese Mandarin(China) Scripted Monologue Smartphone speech dataset

    • m.nexdata.ai
    • nexdata.ai
    Updated Nov 21, 2023
    + more versions
    Cite
    Nexdata (2023). 98.9 Hours - Taiwanese Mandarin(China) Scripted Monologue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/63
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Taiwan
    Variables measured
    Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Taiwanese Mandarin (China) scripted monologue smartphone speech dataset, collected from monologues based on given prompts, covering economy, entertainment, news, oral language, numbers, the alphabet, and other domains. Transcribed with text content. The dataset was collected from an extensive and geographically diverse pool of 204 native speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.

  4. Chinese Mandarin (North) database

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 31, 2018
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (North) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0398/
    Explore at:
    Dataset updated
    May 31, 2018
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People’s Republic of China. Demographics of the native speakers from Northern China are as follows:
    - Beijing: 200 speakers (100 males, 100 females)
    - North of Beijing: 101 speakers (50 males, 51 females)
    - Shandong: 149 speakers (75 males, 74 females)
    - Henan: 50 speakers (25 males, 25 females)

    The speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years the speaker lived there, age, gender, and recording place.

    Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz mono, 16-bit, linear PCM. The recording script consists of:
    - Phoneme balance statements: 785 sentences
    - Travel conversation: 1618 sentences
    - About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance

  5. Chinese Mandarin (South) database

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 31, 2018
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (South) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0397/
    Explore at:
    Dataset updated
    May 31, 2018
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People’s Republic of China. Demographics of the native speakers from Southern China are as follows:
    - Guangdong: 312 speakers (154 males, 158 females)
    - Fujian: 155 speakers (95 males, 60 females)
    - Jiangsu: 262 speakers (134 males, 128 females)
    - Zhejiang: 160 speakers (84 males, 76 females)
    - Taiwan: 105 speakers (31 males, 74 females)
    - Other-Southern: 6 speakers (2 males, 4 females)

    The speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years the speaker lived there, age, gender, and recording place.

    Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz mono, 16-bit, linear PCM. The recording script consists of:
    - Phoneme balance statements: 785 sentences
    - Travel conversation: 1618 sentences
    - About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance

  6. Mandarin Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-mandarin-china
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mandarin Chinese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Mandarin-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Mandarin Chinese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Mandarin Chinese contributors from our verified pool.
    Regions: Covering multiple Chinese provinces to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy: a dual-layered transcription review ensures a word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Mandarin speech-to-text engines for travel platforms.

  7. SpeechOcean

    • huggingface.co
    Updated Jul 11, 2025
    + more versions
    Cite
    Koel Labs (2025). SpeechOcean [Dataset]. https://huggingface.co/datasets/KoelLabs/SpeechOcean
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    speechocean762

    speechocean762 is a speech dataset of native Mandarin speakers (50% adults, 50% children) speaking English. It contains phonemic annotations using the sounds supported by ARPABet. It was developed by Junbo Zhang et al. Read more on their official GitHub, Hugging Face dataset, and paper.

    This Processed Version

    We have processed the dataset into an easily consumable Hugging Face dataset using this data processing script. This maps the phoneme annotations… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/SpeechOcean.

  8. Mandarin matrix sentence test recordings: Lombard and plain speech with different speakers

    • data.niaid.nih.gov
    Updated Mar 12, 2025
    Cite
    Chen, Fei (2025). Mandarin matrix sentence test recordings: Lombard and plain speech with different speakers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7063029
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Hochmuth, Sabine
    Scharf, Maximilian
    Kollmeier, Birger
    Chen, Fei
    Warzybok, Anna
    Hu, Hongmei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was recorded within the Deutsche Forschungsgemeinschaft (DFG) project: Experiments and models of speech recognition across tonal and non-tonal language systems (EMSATON, Projektnummer 415895050).

    The Lombard effect, or Lombard reflex, is the involuntary tendency of speakers to increase their vocal effort when speaking in loud noise to enhance the audibility of their voice. To date, the Lombard effect has been observed in a range of languages. The present database aims to provide recordings for studying the Lombard effect in Mandarin speech.

    Eleven native Mandarin talkers (6 female and 5 male) were recruited; both Lombard and plain speech were recorded from each talker on the same day.

    All speakers produced fluent standard Mandarin speech (North China). All listeners were normal-hearing with pure tone thresholds of 20 dB hearing level or better at audiometric octave frequencies between 125 and 8000 Hz. All listeners provided written informed consent, approved by the Ethics Committee of Carl von Ossietzky University of Oldenburg. Listeners received an hourly compensation for their participation.

    The recorded sentences were the same as in the official Mandarin Chinese matrix sentence test (CMNmatrix, Hu et al. 2018).

    One hundred sentences (ten base lists of ten sentences) of the CMNmatrix were recorded from each speaker in both plain and Lombard speaking styles (each base list containing all 50 words). The 100 sentences were divided into 10 blocks of 10 sentences each, and the plain and Lombard blocks were presented in alternating order. The recording took place in a double-walled, sound-attenuated booth fulfilling ISO 8253-3 (ISO 8253-3, 2012), using a Neumann 184 microphone with a cardioid characteristic (Georg Neumann GmbH, Berlin, Germany) and a Fireface UC soundcard (sampling rate 44100 Hz, resolution 16 bits). The recording procedure generally followed the procedures of Alghamdi et al. (2018). A native Mandarin speaker and a phonetician participated in the recording session and listened to the sentences to control pronunciation, intonation, and speaking rate.

    During the recording, the speaker was instructed to read the sentence presented on a frontal screen. In case of any mispronunciation or change in intonation, the speaker was asked via the screen to repeat the sentence; on average, each sentence was recorded twice. In the Lombard conditions the speaker was regularly asked via a prompt to repeat a sentence, to keep the speaker in the Lombard communication situation. For the plain-speech recording blocks, the speakers were asked to pronounce the sentences with natural intonation and accentuation, and at an intermediate speaking rate, which was facilitated by a progress bar on the screen. Furthermore, the speakers were asked to keep the speaking effort constant and to avoid any exaggerated pronunciations that could lead to unnatural speech cues. For the Lombard-speech recording blocks, speakers were instructed to imagine a conversation with another person in a pub-like situation. During the whole recording session, speakers wore headphones (Sennheiser HDA200) that provided the speaker's own audio signal.
    In the Lombard condition, the stationary speech-shaped noise ICRA1 (Dreschler et al., 2001) was mixed with the speaker’s audio signal at a level of 80 dB SPL (calibrated with a Brüel & Kjær (B&K) 4153 artificial ear, a B&K 4134 0.5-inch microphone, a B&K 2669 preamplifier, and a B&K 2610). Previous studies showed that this level induces robust Lombard speech without the danger of inducing hearing damage (Alghamdi et al., 2018).

    The sentences were cut from the recording, high-pass filtered (60 Hz cut-off frequency) and set to the average root-mean-square level of the original speech material of the Mandarin Matrix test (Hu et al., 2018). Then the best version of each sentence was chosen by native-Mandarin speakers regarding pronunciation, tempo, and intonation.

    For more detailed information, please contact hongmei.hu@uni-oldenburg.de or sabine.hochmuth@uni-oldenburg.de.

    Hu H, Xi X, Wong LLN, Hochmuth S, Warzybok A, Kollmeier B (2018). Construction and evaluation of the Mandarin Chinese matrix (CMNmatrix) sentence test for the assessment of speech recognition in noise. International Journal of Audiology 57:838-850. https://doi.org/10.1080/14992027.2018.1483083

  9. Motion Predicates in Taiwan Mandarin—A Linguistic Dataset

    • purr.purdue.edu
    Updated Jun 21, 2022
    Cite
    Pin-hsi Chen (2022). Motion Predicates in Taiwan Mandarin—A Linguistic Dataset [Dataset]. http://doi.org/10.4231/B2WX-QA39
    Explore at:
    Dataset updated
    Jun 21, 2022
    Dataset provided by
    PURR
    Authors
    Pin-hsi Chen
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Taiwan
    Description

    The dataset contains linguistic data elicited from three Taiwan Mandarin speakers using software that randomly displays motion events. The data are the utterances the speakers produced when describing the events.

  10. Table_1_A Database of Chinese-English Bilingual Speakers: Ratings of the Age of Acquisition and Familiarity.xlsx

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Jue Wang; Baoguo Chen (2023). Table_1_A Database of Chinese-English Bilingual Speakers: Ratings of the Age of Acquisition and Familiarity.xlsx [Dataset]. http://doi.org/10.3389/fpsyg.2020.554785.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Jue Wang; Baoguo Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recently, considerable attention has been given to the effect of the age of acquisition (AoA) on learning a second language (L2); however, the scarcity of L2 AoA ratings has limited advancements in this field. We presented the ratings of L2 AoA in late, unbalanced Chinese-English bilingual speakers and collected the familiarity of the L2 and the corresponding Chinese translations of English words. In addition, to promote the cross-language comparison and motivate the AoA research on Chinese two-character words, data on AoA, familiarity, and concreteness of the first language (L1) were also collected from Chinese native speakers. We first reported the reliability of each rated variable. Then, we described the validity by the following three steps: the distributions of each rated variable were described, the correlations between these variables were calculated, and regression analyses were run. The results showed that AoA, familiarity, and concreteness were all significant predictors of lexical decision times. The word database can be used by researchers who are interested in AoA, familiarity, and concreteness in both the L1 and L2 of late, unbalanced Chinese-English bilingual speakers. The full database is freely available for research purposes.

  11. speech_bundle

    • kaggle.com
    Updated Mar 20, 2022
    Cite
    _ChrisQ_ (2022). speech_bundle [Dataset]. https://www.kaggle.com/datasets/chrisqiu/speech-bundle
    Explore at:
    Available download formats: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 20, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    _ChrisQ_
    Description

    Speech Bundle

    VoxCeleb is a very popular dataset for speaker verification.
    CnCeleb is similar to VoxCeleb but mainly contains Mandarin speakers. These datasets are too large for quick experiments, so I created a sampled version of them.
    This dataset contains 100 speakers from VoxCeleb and 100 speakers from CnCeleb, which is enough to test out new ideas.
    Specifically, the VoxCeleb part of this dataset is sampled from the dev set of VoxCeleb2, while the CnCeleb part is sampled directly from CnCeleb2.

    References

    [1] Chung, J.S., Nagrani, A. and Zisserman, A., 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
    [2] Fan, Y., Kang, J.W., Li, L.T., Li, K.C., Chen, H.L., Cheng, S.T., Zhang, P.Y., Zhou, Z.Y., Cai, Y.Q. and Wang, D., 2020. CN-Celeb: A challenging Chinese speaker recognition dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7604-7608). IEEE.

  12. Mandarin Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-mandarin-china
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mandarin Chinese Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Mandarin speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 hours of dual-channel call center conversations between native Mandarin Chinese speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Mandarin Chinese speakers from our contributor community.
    Regions: Diverse provinces across China to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges between 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    Outbound Calls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:

  13. CSSD

    • datasets.activeloop.ai
    deeplake
    Updated Mar 11, 2022
    Cite
    Peking University (2022). CSSD [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/cssd-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Mar 11, 2022
    Dataset authored and provided by
    Peking University
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2016 - Jan 1, 2020
    Area covered
    Dataset funded by
    National Natural Science Foundation of China
    Description

    The Chinese Speech Separation Dataset (CSSD) is a dataset of audio recordings of people speaking Mandarin Chinese in a variety of noisy environments. The dataset consists of 10,000 audio recordings, each of which is a mixture of two speakers. The dataset is split into a training set of 8,000 recordings and a test set of 2,000 recordings. The audio recordings are in .wav format and have a sampling rate of 16 kHz. The audio recordings are labeled with the identities of the two speakers in the mixture. The CSSD dataset is a valuable resource for training speech separation models.
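A speech-separation training pair consists of a mixture recording plus its source signals. The toy sketch below illustrates the sample-wise mixing that produces such a pair; the short lists stand in for real 16 kHz waveforms:

```python
# Toy source signals standing in for two speakers' 16 kHz waveforms.
source_a = [0.1, 0.2, -0.1, 0.0]
source_b = [0.0, -0.1, 0.3, 0.2]

# A two-speaker mixture is the sample-wise sum of the source signals.
mixture = [a + b for a, b in zip(source_a, source_b)]

# A separation model is trained to recover (source_a, source_b) from mixture,
# using the speaker labels attached to each recording as supervision.
```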

  14. 1250 Hours - Taiwanese Accent Mandarin Spontaneous Speech Data

    • m.nexdata.ai
    Updated Jan 26, 2024
    + more versions
    Cite
    Nexdata (2024). 1250 Hours - Taiwanese Accent Mandarin Spontaneous Speech Data [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1295?source=Github
    Explore at:
    Dataset updated
    Jan 26, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Accuracy, Language, Data Content, Features of annotation
    Description

    Taiwanese-accent Mandarin (China) real-world casual conversation and monologue speech dataset. It covers self-media, conversation, live streaming, and other generic domains, mirroring real-world interactions. Recordings are transcribed with text content, speaker ID, gender, and other attributes. The data was collected from an extensive and geographically diverse pool of speakers, enhancing model performance on real and complex tasks, and has been quality-tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the protection of user privacy and legal rights throughout data collection, storage, and usage; our datasets comply with GDPR, CCPA, and PIPL.

  15. E

    Taiwan Mandarin Speecon database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). Taiwan Mandarin Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0212/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Taiwan Mandarin Speecon database is divided into 2 sets: 1) the first set comprises the recordings of 550 adult Taiwanese speakers (273 males, 277 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place); 2) the second set comprises the recordings of 50 child Taiwanese speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children's room). This database is partitioned into 56 DVDs (first set) and 3 DVDs (second set). The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16 bit, as uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items:
    Calibration data: 6 noise recordings; the “silence word” recording
    Free spontaneous items (adults only): 5 minutes (session time) of free spontaneous, rich-context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
    17 elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
    Read speech: 30 phonetically rich sentences uttered by adults and 60 uttered by children; 5 phonetically rich words (adults only); 4 isolated digits; 1 isolated digit sequence; 4 connected digit sequences; 1 telephone number; 3 natural numbers; 1 money amount; 2 time phrases (T1: analogue, T2: digital); 3 dates (D1: analogue, D2: relative and general date, D3: digital); 3 letter sequences; 1 proper name; 2 city or street names; 2 questions; 2 special keyboard characters; 1 Web address; 1 email address; 46 core word synonyms; 208 application-specific words and phrases per session (adults); 74 toy commands, 14 phone commands, and 34 general commands (children)

    The following age distribution has been obtained. Adults: 246 speakers are between 15 and 30, 235 between 31 and 45, 63 between 46 and 60, and 6 over 60. Children: 21 speakers are between 7 and 10, 29 between 11 and 14. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
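    A minimal sketch of decoding this raw sample format (16-bit unsigned little-endian PCM, as described above) into normalized floats. The helper name and the synthetic buffer are illustrative assumptions; real Speecon signal files are additionally paired with ASCII SAM label files.

```python
# Illustrative sketch: decoding Speecon-style raw audio, described above as
# 16 kHz, 16-bit, uncompressed unsigned integers in Intel (little-endian)
# byte order. The function name and sample buffer are invented for this demo.
import numpy as np

def decode_speecon_pcm(raw: bytes) -> np.ndarray:
    """Decode unsigned 16-bit little-endian PCM to float32 in [-1, 1)."""
    samples = np.frombuffer(raw, dtype="<u2").astype(np.float32)
    return (samples - 32768.0) / 32768.0  # re-center unsigned samples at zero

# Round-trip a tiny synthetic buffer for illustration
original = np.array([0, 32768, 65535], dtype="<u2")
decoded = decode_speecon_pcm(original.tobytes())
```

    Because the samples are unsigned, silence sits at 32768 rather than 0, hence the re-centering step before normalization.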

  16. h

    EDEN_ASR_Data

    • huggingface.co
    Updated Feb 13, 2025
    Cite
    Siyan LI (2025). EDEN_ASR_Data [Dataset]. https://huggingface.co/datasets/sylviali/EDEN_ASR_Data
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 13, 2025
    Authors
    Siyan LI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    EDEN ASR Dataset

    A subset of this data was used to support the development of empathetic feedback modules in EDEN and its prior work. The dataset contains audio clips of native Mandarin speakers. The speakers conversed with a chatbot hosted on an English practice platform. 3081 audio clips from 613 conversations and 163 users remained after filtering. The filtering process removes audio clips containing only Mandarin, duplicates, and a subset of self-introductions from the users.… See the full description on the dataset page: https://huggingface.co/datasets/sylviali/EDEN_ASR_Data.

  17. E

    GlobalPhone German

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone German [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0198/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary; the articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.
    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. The data is compressed by means of the "shorten" program written by Tony Robinson; alternatively, the data can be delivered unshortened. The German corpus was produced using the Frankfurter Allgemeine and Sueddeutsche Zeitung newspapers. It contains recordings of 77 speakers (70 males, 7 females) recorded in Karlsruhe, Germany. No age distribution is available.

  18. h

    The interpretation and prediction of event participants in Mandarin...

    • heidata.uni-heidelberg.de
    tsv, txt
    Updated Nov 8, 2019
    Cite
    Johannes Gerwien; Johannes Gerwien (2019). The interpretation and prediction of event participants in Mandarin verb-final active and passive sentences [Dataset] [Dataset]. http://doi.org/10.11588/DATA/L7QPUY
    Explore at:
    tsv (1022521), txt (1056)
    Available download formats
    Dataset updated
    Nov 8, 2019
    Dataset provided by
    heiDATA
    Authors
    Johannes Gerwien; Johannes Gerwien
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/L7QPUY

    Description

    This data set contains eye-tracking data collected with an SMI RED 500 eye-tracking system. The experimental design, elicitation method, coding, and criteria for excluding/including data are documented in the article: Gerwien, J. (2019) "The interpretation and prediction of event participants in Mandarin active and passive N-N-V sentences". The article's abstract is as follows: The role of the markers bèi and bǎ for thematic role assignment in Chinese NP1-marker-NP2-V sentences was investigated in adult native speakers using the visual world paradigm (eye tracking). While word order is identical, thematic roles are distributed reversely in these structures (patient-bèi-agent, passive; agent-bǎ-patient, active). If Mandarin speakers interpret NP1 as the agent of an event, viewing behavior was expected to differ between conditions for NP1-objects, indicating the revision of initial role assignment in the case of bèi. Given reliability differences between markers for role assignment, differences in anticipatory eye movements to NP2-objects were expected. 16 visual stimuli were combined with 16 sets of sentence pairs, one pair partner featuring a bèi-structure, the other a bǎ-structure. Growth curve analysis of 28 participants’ eye movements revealed no attention differences for NP1-objects. However, anticipatory eye movements to NP2-objects differed. This suggests that a stable event representation is constructed only after NP1 and the marker have been processed, but before NP2. As a control variable, syntactic/semantic complexity of NP1 was manipulated. The differences obtained indicate that the visual world paradigm is in principle sensitive enough to detect language-induced processing costs, which was taken to validate the null finding for NP1. Interestingly, NP1 complexity also modulated predictive processing. Findings are discussed with respect to a differentiation between interpretative and predictive aspects of incremental processing.

  19. MSEEG: An EEG-based Dataset for Mandarin Imagined Speech BCI

    • zenodo.org
    bin, txt
    Updated Mar 9, 2024
    Cite
    Zenodo (2024). MSEEG: An EEG-based Dataset for Mandarin Imagined Speech BCI [Dataset]. http://doi.org/10.5281/zenodo.10797879
    Explore at:
    bin, txt
    Available download formats
    Dataset updated
    Mar 9, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study constructed a Mandarin-Speech EEG (MSEEG) dataset including imagined, intended, and spoken speech modalities from 10 native speakers. The stimuli used in the experiment consist of monosyllabic Mandarin words, comprising four categories of vowels and four categories of tones.

  20. MandarinStutteredSpeech

    • huggingface.co
    Updated Jun 26, 2025
    Cite
    AImpower.org (2025). MandarinStutteredSpeech [Dataset]. https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech
    Explore at:
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    aimpower GmbH
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Mandarin Stuttered Speech Dataset (StammerTalk)

    The StammerTalk dataset contains 43 hours of spontaneous conversation and voice-command reading by 66 Mandarin Chinese speakers who stutter.

      Data Collection Process
    

    The StammerTalk dataset was created by StammerTalk (口吃说) community (http://stammertalk.net/), in partnership with AImpower.org. Speech data collection was conducted by two StammerTalk volunteers, who also stutter, with participants over videoconferencing… See the full description on the dataset page: https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech.

Cite
FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china

Mandarin General Conversation Speech Dataset for ASR

Mandarin General Conversation Speech Corpus

Explore at:
wav
Available download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.

Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.

Speech Data

The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

Participant Diversity:
Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
Recording Details:
Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
Duration: Each conversation ranges from 15 to 60 minutes.
Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
Environment: Quiet, echo-free settings with no background noise.

Topic Diversity

The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

Sample Topics Include:
Family & Relationships
Food & Recipes
Education & Career
Healthcare Discussions
Social Issues
Technology & Gadgets
Travel & Local Culture
Shopping & Marketplace Experiences, and many more.

Transcription

Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

Transcription Highlights:
Speaker-segmented dialogues
Time-coded utterances
Non-speech elements (pauses, laughter, etc.)
High transcription accuracy (average WER < 5%), achieved through a double QA pass

These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
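To illustrate how speaker-segmented, time-coded JSON transcriptions of this kind are typically consumed in a pipeline, here is a hypothetical sketch. The field names ("speaker", "start", "end", "text") are assumptions for illustration, not the dataset's documented schema.

```python
# Hypothetical sketch of consuming a speaker-segmented, time-coded JSON
# transcript. The schema below is an assumption; consult the dataset
# documentation for the actual field names.
import json

sample = json.dumps([
    {"speaker": "SPK1", "start": 0.00, "end": 2.35, "text": "..."},
    {"speaker": "SPK2", "start": 2.40, "end": 5.10, "text": "..."},
])

def utterances_by_speaker(transcript_json: str) -> dict:
    """Group utterances per speaker, preserving time order."""
    grouped = {}
    for utt in sorted(json.loads(transcript_json), key=lambda u: u["start"]):
        grouped.setdefault(utt["speaker"], []).append(
            (utt["start"], utt["end"], utt["text"])
        )
    return grouped

segments = utterances_by_speaker(sample)
```

Grouping by speaker while keeping start/end times is the usual first step for building per-speaker training segments or diarization references.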

Metadata

The dataset comes with granular metadata for both speakers and recordings:

Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
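As an illustration of such metadata-driven filtering, the sketch below selects speakers by demographic constraints. The sample records and helper function are invented for illustration; only the field names mirror the metadata categories listed above.

```python
# Illustrative sketch of filtering speakers by metadata fields like those
# listed above (age, gender, province). The records here are invented.
speakers = [
    {"id": "SPK001", "age": 24, "gender": "female", "province": "Sichuan"},
    {"id": "SPK002", "age": 57, "gender": "male", "province": "Guangdong"},
    {"id": "SPK003", "age": 35, "gender": "male", "province": "Beijing"},
]

def filter_speakers(records, min_age=None, max_age=None, gender=None):
    """Return speaker records matching the given demographic constraints."""
    out = []
    for r in records:
        if min_age is not None and r["age"] < min_age:
            continue
        if max_age is not None and r["age"] > max_age:
            continue
        if gender is not None and r["gender"] != gender:
            continue
        out.append(r)
    return out

adults_male = filter_speakers(speakers, min_age=30, gender="male")
```

A filter like this is how a team would carve out, say, an age-balanced evaluation split or a single-dialect fine-tuning subset.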

Usage and Applications

This dataset is a versatile resource for multiple Mandarin speech and language AI applications:

ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.