Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Language diversity and its driving foctors
Comprehensive dataset of 349 Chinese language schools in United States as of June, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
The synthetic data comes from 5 female voice actors in a professional recording studio (background noise <18dB(A)). Each of them makes 2-3 recordings per week as part of a total recording cycle of 1 month, and the recorded content covers Chinese marketing scripts.
http://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0
Original source of the data:
Norman, J. (2003): Chinese dialects. Phonology. In: Thurgood, G. & LaPolla, R.: The Sino-Tibetan Languages. Routledge: London and New York. 72-83.
Comprehensive dataset of 13 Chinese language instructors in Germany as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides raw data of citation tones of Beijing dialect extracted from recordings and experimental results since 1900.
1900 Azoulay, 1st take.csv
and 1900 Azoulay, 2nd take.csv
.1920 Wang.csv
.1922a Chao.csv
.1930 Shu.csv
.1933 Pai.csv
.1915 Karlgren.csv
(from pp. 253–259).1922c Chao.csv
.1925 Liu.csv
(from Pl. XI).1934 Obata & Tesima.csv
(from speaker Xiangyin Bao 包象寅 in Figures 2a and 2e).1934 Pai.csv
.1998 Lin et al..csv
.2003 Sanders & Shi.csv
(from Speakers 39, 41, 42 & 45).https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years’ old, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China. Demographics of native speakers from Northern China is as follows:- Beijing: 200 speakers (100 males, 100 females)- North of Beijing: 101 speakers (50 males, 51 females)- Shandong: 149 speakers (75 males, 74 females)- Henan: 50 speakers (25 males, 25 females)Speaker profile includes the following information: unique ID, place of birth, place where speaker lived the longest by the age of 16, and the number of years that the speaker lived there, age, gender, recording place.Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 KHz Mono, 16 bits, Linear PCM. Recording script consists of :• Phoneme balance statement: 785 sentences• Travel conversation: 1618 sentences• About 200 sentences per speaker including: 134 sentences of travel conversation, 66 sentences of phoneme balance
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cite the source of the dataset as:
Hóu, J. (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years’ old, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China. Demographics of native speakers from Southern China is as follows:- Guangdong: 312 speakers (154 males, 158 females)- Fujian: 155 speakers (95 males, 60 females)- Jiangsu: 262 speakers (134 males, 128 females)- Zhejiang: 160 speakers (84 males, 76 females)- Taiwan: 105 speakers (31 males, 74 females)- Other-Southern: 6 speakers (2 males, 4 females)Speaker profile includes the following information: unique ID, place of birth, place where speaker lived the longest by the age of 16, and the number of years that the speaker lived there, age, gender, recording place.Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 KHz Mono, 16 bits, Linear PCM. Recording script consists of :• Phoneme balance statement: 785 sentences• Travel conversation: 1618 sentences• About 200 sentences per speaker including: 134 sentences of travel conversation, 66 sentences of phoneme balance
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.
Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Mandarin speech and language AI applications:
Comprehensive dataset of 174 Chinese language schools in Indonesia as of June, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Comprehensive dataset of 1 Chinese language schools in → Ehime, Japan as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Comprehensive dataset of 4 Chinese language schools in Colombia as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Chinese-Shanghai corpus was produced using the Peoples Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
Comprehensive dataset of 1 Chinese language schools in Iwate, Japan as of June, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
AbstractIntroduction CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Taiwan Dialect (LDC96S56). The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. Data All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes. The data was recorded as 8kHz u-law SPH encoded stereo files, with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is described in the documentation. SPH files are not included in this second edition. The audio files were originally split into train, dev and test folders of 20 recordings each, but they are combined in this release. Completed calls passed through a human auditing process to verify that the target language was spoken by the participants, to check the quality of the recordings, and to record information about dialect, noise and distortion.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This TCST-UT dataset contains 58767 samples, 72.08 hours, and audio files from 147 different speakers. Each sample is a triplet consisting of Tibetan speech, corresponding Tibetan text, and Chinese text. Among them, the Tibetan language speech data comes from the M2ASR Tibetan dialect speech recognition dataset, which is published on the m2sr.cslt.org website. The audio files can be obtained through email requests, so this dataset does not directly provide voice audio files, only the audio paths of the samples contained in the dataset. The audio path of each sample, along with the corresponding Tibetan text and Chinese translated text, is stored in the output. json file, where the audio path refers to the path in the public dataset.The size of output. json is 22MB. The file puts the audio path, Tibetan text, and Chinese translation text of each sample into a dictionary, with the data format being:Abbreviation of Name - Audio Number: {“audio”: Audio file path,“text”: {“Tibetan”: The Tibetan text corresponding to the audio file“Chinese”: The Chinese text corresponding to the audio file}}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is divided into two parts: the first represents an evaluation of the data in the Linguistic Atlas of of Chinese Dialects (Cao 2008) for uniqueness and generality for 3 core Xiang varieties (Changsha, Shaoyang, Lianyuan) and 2 non-Xiang varieties (Changde, Chaling). 1 non-core Xiang variety was also included for reference (Hengyang). It is divided into three sections based on the type of linguistic features covered, corresponding to the three volumes of the Atlas: Phonology, Lexicon, and Grammar. Each dataset contains 8-10 columns: (1) Variety (which variety of the six is being considered); (2) Feature (the feature from the Atlas); (3) Value (the value for that location given in the Atlas); (4) Middle Chinese Interpretation (ONLY Phonological Dataset; the value of the feature in the Qieyun, using Baxter's transcription); (5) Dialectal Rendition (the form the feature takes in the published description of the variety); (6) Implied Sound Change (ONLY Phonological Dataset; the assumed sound change from Middle Chinese); (7) Unique; (8) General; (9) Map Number (relevant page and map number in the Linguistic Atlas of Chinese Dialects); (10) Notes.Uniqueness means that a feature does not occur in non-Xiang varieties, while Generality means a feature occurs in all Xiang varieties; this is indicated in the dataset with 'Yes' (positive value), 'No' (negative value), and N/A (irrelevant). Features deemed particularly relevant for subgrouping purposes, i.e. are particularly rare or unique, are indicated in yellow highlight. If a feature in the Atlas does not agree with the published description of a variety, this is indicated under Notes as 'Mismatch', with the relevant mismatched feature indicated in red lettering. In the lexical and grammatical datasets, the 'Dialectal Rendition' column (Column 4) focuses on features which are unique and deemed helpful to subgrouping purposes; that is, rows that are highlighted in yellow. If a row is highlighted but lacks a value for this column, it means a relevant form could not be identified in the published description for that variety. The second dataset ('Innovations') represents a list of 147 linguistic innovations evaluated for 13 Sinitic language varieties (9 Xiang, 4 non-Xiang), with '1' meaning 'possesses innovation' and '0' meaning 'does not possess innovation'. A value of 'NA' means that the value could not be determined for that variety for a lack of data.
AbstractIntroduction Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shannxi province. The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, these features included: A large number of fluently-read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns A relatively large number of speakers Time-aligned lexical and phonetic transcription of all utterances Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker Data Global TIMIT Mandarin Chinese-Guanzhong Dialect consists of 50 speakers reading 120 sentences selected from Chinese Gigaword Fifth Edition (LDC2011T13). Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types. The corpus was recorded at Xi'an Jiaotong University, Xi'an, China. Speakers (25 female, 25 male) were born in Weinan, Shannxi and spoke the Guanzhong dialect. All speech data are presented as 16kHz, 16-bit flac compressed wav files. Each file has accompanying phone, word, and tone segmentation files, as well as Praat TextGrid files.
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Chinese name components, accompanied by accurate pinyin readings, gender codes, and flags denoting whether name is a given name, surname, or both.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Language diversity and its driving foctors