100+ datasets found

Data--Chinese dialects & environmental factors.docx
figshare.com
docx
Updated Dec 18, 2024
Cite
Ruifeng Mo (2024). Data--Chinese dialects & environmental factors.docx [Dataset]. http://doi.org/10.6084/m9.figshare.28052444.v1
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.28052444.v1
Dataset updated
Dec 18, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ruifeng Mo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Language diversity and its driving factors
Chinese Language Schools in United States - 349 Verified Listings Database
poidata.io
csv, excel, json
Updated Jun 18, 2025
Cite
Poidata.io (2025). Chinese Language Schools in United States - 349 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/united-states
Available download formats: json, csv, excel
Dataset updated
Jun 18, 2025
Dataset provided by
Poidata.io
Area covered
United States
Description
Comprehensive dataset of 349 Chinese language schools in the United States as of June 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Chinese Multi-dialect TTS Database
infinityai.ai
Updated Jun 13, 2025
Cite
DataOceanAI (2025). Chinese Multi-dialect TTS Database [Dataset]. www.infinityai.ai
Dataset updated
Jun 13, 2025
Dataset provided by
DataOceanAI
Authors
DataOceanAI
Variables measured
Product name, Recording duration, Recording language, Recording parameters, Recording environment, Annotation Information, Product library number
Description
The synthetic data comes from 5 female voice actors recorded in a professional recording studio (background noise < 18 dB(A)). Each of them makes 2-3 recordings per week over a total recording cycle of 1 month, and the recorded content covers Chinese marketing scripts.
cldf-datasets/normansinitic: Structural data for the paper by Norman (2013)...
zenodo.org
zip
Updated Apr 15, 2020
Cite
Johann-Mattis List (2020). cldf-datasets/normansinitic: Structural data for the paper by Norman (2013) on Chinese dialect classification [Dataset]. http://doi.org/10.5281/zenodo.1405148
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.1405148
Dataset updated
Apr 15, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Johann-Mattis List
License
http://www.apache.org/licenses/LICENSE-2.0
Description
Original source of the data:

Norman, J. (2003): Chinese dialects. Phonology. In: Thurgood, G. & LaPolla, R.: The Sino-Tibetan Languages. Routledge: London and New York. 72-83.
Chinese Language Instructors in Germany - 13 Verified Listings Database
poidata.io
csv, excel, json
Updated Jul 3, 2025
Cite
Poidata.io (2025). Chinese Language Instructors in Germany - 13 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-instructor/germany
Available download formats: excel, csv, json
Dataset updated
Jul 3, 2025
Dataset provided by
Poidata.io
Area covered
Germany
Description
Comprehensive dataset of 13 Chinese language instructors in Germany as of July 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Supplementary data for "Tones of Beijing dialect since 1900 and their...
zenodo.org
zip
Updated Jun 5, 2025
Cite
Tianheng Wang (2025). Supplementary data for "Tones of Beijing dialect since 1900 and their evolution: Evidence from early recordings" [Dataset]. http://doi.org/10.5281/zenodo.15192724
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15192724
Dataset updated
Jun 5, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tianheng Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Beijing
Description
This repository provides raw data of citation tones of Beijing dialect extracted from recordings and experimental results since 1900.

Data sources

Early recordings

Azoulay, Léon (ed.). 1900. Exposition Universelle de Paris, 1900.

Unpublished recordings on wax cylinders.
The first take of citation tones of Beijing dialect is on item No. 78 (cylinder No. 115), and the second take is on items Nos. 76 & 77 (cylinders Nos. 109 & 110).

Digitized by the Centre de recherche en ethnomusicologie, Laboratoire d’ethnologie et de sociologie comparative. (Nos. 76, 77 & 78)

Extracted data: 1900 Azoulay, 1st take.csv and 1900 Azoulay, 2nd take.csv.

Wang, Pu 王璞. 1920. Zhonghua Guoyin liushengjipian 中華國音留聲機片 [Chinese National Phonetic record]. Shanghai: Zhonghua Book Company.

Six 11.5-inch 80 rpm vertical-cut phonograph records produced by Pathé Records in Paris.
The demonstration of citation tones (Lesson 6) is on Disc 3, Side B, catalog number 34001⁶.

Partially digitized by the National Taiwan University Library. (Disc 3)

Extracted data: 1920 Wang.csv.

Chao, Yuen Ren 趙元任. 1922a. Guoyu liushengji pian 國語留聲機片 [National Language record]. Shanghai: The Commercial Press.

Eight 10-inch 78 rpm phonograph records produced by Columbia Phonograph Company in New York, catalog number W1.
The demonstration of citation tones (Lesson 7) is on Disc 4, Side A, matrix number 93282.

Digitized by the National Taiwan University Library. (Disc 4)

Also digitized by the Chinese University of Hong Kong Library. (Discs 1–8) (not used in this study)

Extracted data: 1922a Chao.csv.

Shu, Chien Chun. 1930. Chinese (Linguaphone Language Courses). London: Linguaphone Institute.

Sixteen 10-inch 78 rpm phonograph records.
The demonstration of citation tones is on Disc 1, Side A, catalog number C.1.E. (titled “Chinese course: Lesson No. 1”), and Disc 16, Side A, catalog number C.S.1.E. (titled “Chinese sounds: No. 1”).

Digitized by Beijing Dianji yu Jingdian Lao Changpian Shuzihua Chuban Xiangmu 北京典籍与经典老唱片数字化出版项目 [Beijing Classics and Old Records Digital Publishing Project]. (Discs 1 & 16)

Also digitized by the Great 78 Project. (Discs 1 & 16) (not used in this study)

Extracted data: 1930 Shu.csv.

Pai, Ti-chou 白滌洲. 1933. Biaozhun Guoyin 標準國音 [Standard National Pronunciation]. Shanghai: Zhonghua Book Company.

Four 10-inch 78 rpm phonograph records produced by Great China Record Company in Shanghai.
The demonstration of citation tones (Side 5) is on Disc 3, Side A, catalog number 18-A.

Partially digitized by the National Taiwan University Library. (Disc 3)

Extracted data: 1933 Pai.csv.

Early experiments

Karlgren, Bernhard. 1915. Études sur la Phonologie chinoise (Archives d’Études Orientales 15), vol. 1. Leyde: E. J. Brill. (Brittle Books | Internet Archive)

Extracted data: 1915 Karlgren.csv (from pp. 253–259).

Chao, Yuen Ren 趙元任. 1922c. Zhongguo yanyu zidiao di shiyan yanjiufa 中國言語字調底實驗硏究法 [The methods for investigation of the intonation of Chinese language]. Kexue 科學 7(9). 871–882.

Extracted data: 1922c Chao.csv.

Liu, Fu. 1925. Étude expérimentale sur les tons du Chinois. Paris: Les Belles Lettres. (Gallica)

Extracted data: 1925 Liu.csv (from Pl. XI).

Obata, Jûichi 小幡重一 & Tesima, Takehiko 豊島武彦. 1934. Shina-go no butsuri onseigakuteki kenkyū: Shisei no seishitsu 支那語の物理音聲學的研究:四聲の性質 [Physicophonetic study of Chinese language: Properties of the four tones]. Nippon Sugaku-Buturigakkwaishi 日本数学物理学会誌 8(1). 1–10. (DOI: 10.11429/subutsukaishi1927.8.1)

Extracted data: 1934 Obata & Tesima.csv (from speaker Xiangyin Bao 包象寅 in Figures 2a and 2e).

Pai, Ti-chou 白滌洲. 1934. Beijingyu shengdiao ji bianhua 北京語声調及变化 [Tones and changes of Beijing dialect].

Manuscript.

Partially published in Luo, Changpei 羅常培 & Wang, Jun 王均. 1957. Putong yuyinxue gangyao 普通語音学綱要 [Outline of general phonetics], 125–127. Beijing: Science Press.

Extracted data: 1934 Pai.csv.

Modern data

Lin, Tao 林焘 & Zhou, Yimin 周一民 & Cai, Wenlan 蔡文兰. 1998. Beijinghua yindang 北京话音档 [Phonetic archive of Beijing dialect] (Xiandai Hanyu Fangyan Yinku 现代汉语方言音库 [Phonetic Database of Modern Chinese Dialects] 1). Shanghai: Shanghai Education Publishing House.

Includes sound recordings on the accompanying tape.

I directly cite extracted data (mean and standard deviation) from Shi, Shaowei 石少伟. 2007. Xiandai Hanyu Fangyan Yinku danzidiao shiyan yanjiu 《现代汉语方言音库》单字调实验研究 [Experimental study on monosyllabic tones in Phonetic Database of Modern Chinese Dialects]. Nanjing: Nanjing Normal University. (Master’s thesis.) (DOI: 10.7666/d.y1116983)

Data: 1998 Lin et al..csv.

Sanders, Robert & Shi, Feng 石锋. 2003. Hanyu Yuyin Shujuku 汉语语音数据库 [Chinese Phonetic Database].

Unpublished speech corpus.

Extracted data: 2003 Sanders & Shi.csv (from Speakers 39, 41, 42 & 45).

Non-citation tone recordings

Li, Deyang 李德鍚 & Zhang, Dequan 張德泉. 1909. Dengmi yinyu 燈謎隱語 [Lantern riddles]. Paris: Pathé Records.

11.5-inch 90 rpm center-start vertical-cut phonograph record, catalog number 32573.

Reissued in Shanghai in the 1920s on 11.5-inch 80 rpm outside-start vertical-cut discs, under the same catalog number.

The reissue was remastered in the 2007 documentary Xiangsheng Dashi 相声大师
Chinese Mandarin (North) database
catalog.elra.info
live.european-language-grid.eu
Updated May 31, 2018
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (North) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0398/
Dataset updated
May 31, 2018
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
This database contains recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People’s Republic of China.

Demographics of native speakers from Northern China are as follows:
- Beijing: 200 speakers (100 males, 100 females)
- North of Beijing: 101 speakers (50 males, 51 females)
- Shandong: 149 speakers (75 males, 74 females)
- Henan: 50 speakers (25 males, 25 females)

Speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place.

Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16-bit, linear PCM.

The recording script consists of:
• Phoneme balance statement: 785 sentences
• Travel conversation: 1618 sentences
• About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance
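As a quick sanity check, the per-region figures quoted in the description can be tallied against the stated totals. The sketch below only re-adds the numbers given above; the dictionary layout and the `tally` helper are illustrative assumptions, not part of the ELRA distribution.

```python
# Per-region speaker counts (males, females) as quoted in the description.
north_speakers = {
    "Beijing": (100, 100),
    "North of Beijing": (50, 51),
    "Shandong": (75, 74),
    "Henan": (25, 25),
}


def tally(regions):
    """Return (total, males, females) summed over all regions."""
    males = sum(m for m, _ in regions.values())
    females = sum(f for _, f in regions.values())
    return males + females, males, females


total, males, females = tally(north_speakers)
print(total, males, females)  # 500 250 250
```

The regional counts add up to the 500 speakers (250 males, 250 females) stated at the top of the description.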
CLDF dataset derived from Hóu's "Phonological Database of Chinese Dialects"...
data.niaid.nih.gov
zenodo.org
Updated Aug 7, 2024
Cite
Hóu, Jīngyī (2024). CLDF dataset derived from Hóu's "Phonological Database of Chinese Dialects" from 2004 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5126857
Dataset updated
Aug 7, 2024
Dataset authored and provided by
Hóu, Jīngyī
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cite the source of the dataset as:

Hóu, J. (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù.
Chinese Mandarin (South) database
catalog.elra.info
live.european-language-grid.eu
Updated May 31, 2018
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (South) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0397/
Dataset updated
May 31, 2018
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
This database contains recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in the Hong Kong Special Administrative Region, People’s Republic of China.

Demographics of native speakers from Southern China are as follows:
- Guangdong: 312 speakers (154 males, 158 females)
- Fujian: 155 speakers (95 males, 60 females)
- Jiangsu: 262 speakers (134 males, 128 females)
- Zhejiang: 160 speakers (84 males, 76 females)
- Taiwan: 105 speakers (31 males, 74 females)
- Other-Southern: 6 speakers (2 males, 4 females)

Speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place.

Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16-bit, linear PCM.

The recording script consists of:
• Phoneme balance statement: 785 sentences
• Travel conversation: 1618 sentences
• About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance
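For planning storage, the stated audio format gives a rough size for the raw audio payload. The sketch below is a back-of-the-envelope estimate under the assumption that all 341 hours are stored as uncompressed 48 kHz, 16-bit, mono linear PCM; it ignores WAV headers and any accompanying files, and the `pcm_bytes` helper is hypothetical.

```python
def pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels):
    """Raw linear-PCM payload size in bytes, ignoring WAV headers."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600) * bytes_per_second


size = pcm_bytes(hours=341, sample_rate_hz=48_000, bits_per_sample=16, channels=1)
print(f"{size / 1e9:.1f} GB")  # 117.8 GB
```

The same helper applies to other corpora in this listing by substituting their stated duration, sample rate, bit depth, and channel count.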
Mandarin General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
• Participant Diversity:
  • Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
  • Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
  • Demographics: A gender ratio of 60% male to 40% female, with participant ages ranging from 18 to 70 years.
• Recording Details:
  • Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  • Duration: Each conversation ranges from 15 to 60 minutes.
  • Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
  • Environment: Quiet, echo-free settings with no background noise.
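
The audio format listed above (stereo, 16-bit, 16 kHz) can be checked per file before training. The following is a minimal sketch using Python's standard `wave` module; the `EXPECTED` values come from the recording details, while the helper names are assumptions made for illustration and are not part of any FutureBeeAI tooling.

```python
import wave

# Expected header values, taken from the recording details above.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16_000}


def wav_params(path_or_file):
    """Read channel count, sample width, and sample rate from a WAV header."""
    with wave.open(path_or_file, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_spec(path_or_file):
    """True if the file is stereo, 16-bit, and sampled at 16 kHz."""
    return wav_params(path_or_file) == EXPECTED
```

`matches_spec` accepts either a file path or an open binary file object, so it can be run over a directory of downloaded recordings to flag any file that deviates from the stated format.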

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
• Sample Topics Include:
  • Family & Relationships
  • Food & Recipes
  • Education & Career
  • Healthcare Discussions
  • Social Issues
  • Technology & Gadgets
  • Travel & Local Culture
  • Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
• Transcription Highlights:
  • Speaker-segmented dialogues
  • Time-coded utterances
  • Non-speech elements (pauses, laughter, etc.)
  • High transcription accuracy, achieved through a double QA pass, with average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Mandarin speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
• Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.
Chinese Language Schools in Indonesia - 174 Verified Listings Database
poidata.io
csv, excel, json
Updated Jun 26, 2025
Cite
Poidata.io (2025). Chinese Language Schools in Indonesia - 174 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/indonesia
Available download formats: csv, excel, json
Dataset updated
Jun 26, 2025
Dataset provided by
Poidata.io
Area covered
Indonesia
Description
Comprehensive dataset of 174 Chinese language schools in Indonesia as of June 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Chinese Language Schools in Ehime, Japan - 1 Verified Listings Database
poidata.io
csv, excel, json
Updated Jul 5, 2025
Cite
Poidata.io (2025). Chinese Language Schools in Ehime, Japan - 1 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/japan/ehime
Available download formats: excel, json, csv
Dataset updated
Jul 5, 2025
Dataset provided by
Poidata.io
Area covered
Ehime, Japan
Description
Comprehensive dataset of 1 Chinese language school in Ehime, Japan as of July 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Chinese Language Schools in Colombia - 4 Verified Listings Database
poidata.io
csv, excel, json
Updated Jul 13, 2025
Cite
Poidata.io (2025). Chinese Language Schools in Colombia - 4 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/colombia
Available download formats: csv, excel, json
Dataset updated
Jul 13, 2025
Dataset provided by
Poidata.io
Area covered
Colombia
Description
Comprehensive dataset of 4 Chinese language schools in Colombia as of July 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
GlobalPhone Chinese-Shanghai
catalogue.elra.info
catalog.elra.info
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0194/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Shanghai
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
Chinese Language Schools in Iwate, Japan - 1 Verified Listings Database
poidata.io
csv, excel, json
Updated Jun 26, 2025
Cite
Poidata.io (2025). Chinese Language Schools in Iwate, Japan - 1 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/japan/iwate
Available download formats: excel, json, csv
Dataset updated
Jun 26, 2025
Dataset provided by
Poidata.io
Area covered
Iwate, Japan
Description
Comprehensive dataset of 1 Chinese language school in Iwate, Japan as of June 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
abacus.library.ubc.ca
iso, txt
Updated Mar 18, 2022
Cite
Abacus Data Network (2022). CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=8de5d62f9bfb012e7807cef02109?persistentId=hdl%3A11272.1%2FAB2%2FAT8NRM&version=&q=&fileAccess=&fileTag=%22Documentation%22&fileSortField=&fileSortOrder=
Available download formats: iso (3116947456), txt (1308)
Dataset updated
Mar 18, 2022
Dataset provided by
Abacus Data Network
Area covered
Taiwan
Description
Abstract

Introduction

CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Taiwan Dialect (LDC96S56).

The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese.

Data

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

The data was recorded as 8 kHz u-law SPH encoded stereo files, with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is described in the documentation. SPH files are not included in this second edition. The audio files were originally split into train, dev and test folders of 20 recordings each, but they are combined in this release.

Completed calls passed through a human auditing process to verify that the target language was spoken by the participants, to check the quality of the recordings, and to record information about dialect, noise and distortion.
TCST-UT: Tibetan-Chinese speech translation dataset of Ü-Tsang dialect
scidb.cn
Updated Jan 3, 2025
Cite
li xin; Liu Jialuo; Dorje Peng Mao; Kan Zhuocuo; Qi Xiaoke; Zhao Xiaobing (2025). TCST-UT: Tibetan-Chinese speech translation dataset of Ü-Tsang dialect [Dataset]. http://doi.org/10.57760/sciencedb.18807
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.18807
Dataset updated
Jan 3, 2025
Dataset provided by
Science Data Bank
Authors
li xin; Liu Jialuo; Dorje Peng Mao; Kan Zhuocuo; Qi Xiaoke; Zhao Xiaobing
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This TCST-UT dataset contains 58767 samples totaling 72.08 hours of audio from 147 different speakers. Each sample is a triplet consisting of Tibetan speech, the corresponding Tibetan text, and Chinese text. The Tibetan speech data comes from the M2ASR Tibetan dialect speech recognition dataset, which is published on the m2sr.cslt.org website. Those audio files can be obtained through email requests, so this dataset does not directly provide audio files, only the audio paths of the samples it contains.

The audio path of each sample, along with the corresponding Tibetan text and Chinese translated text, is stored in the output.json file (22 MB), where the audio path refers to the path in the public dataset. The file puts the audio path, Tibetan text, and Chinese translation text of each sample into a dictionary, with the data format being:

Abbreviation of Name - Audio Number: {
    "audio": Audio file path,
    "text": {
        "Tibetan": The Tibetan text corresponding to the audio file,
        "Chinese": The Chinese text corresponding to the audio file
    }
}
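
The dictionary layout described above can be walked with a few lines of Python. This is a minimal sketch: the sample key, audio path, and placeholder strings are invented for illustration, and only the nesting ("audio", then "text" with "Tibetan" and "Chinese") follows the description. In practice the records would be loaded from output.json rather than from an inline string.

```python
import json

# Hypothetical one-entry excerpt in the layout described above; the key,
# path, and placeholder texts are made up for illustration.
sample = """
{
  "spk01-0001": {
    "audio": "path/to/spk01/0001.wav",
    "text": {"Tibetan": "<Tibetan text>", "Chinese": "<Chinese text>"}
  }
}
"""


def iter_triplets(records):
    """Yield (sample_id, audio_path, tibetan, chinese) for each entry."""
    for sample_id, entry in records.items():
        text = entry["text"]
        yield sample_id, entry["audio"], text["Tibetan"], text["Chinese"]


records = json.loads(sample)
for row in iter_triplets(records):
    print(row)
```

Because the dataset ships only audio paths, each `audio_path` would still need to be resolved against a local copy of the M2ASR audio obtained separately.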
Dataset for Xiang Subgrouping
datahub.hku.hk
csv
Updated May 16, 2025
Cite
Robert Marcelo Sevilla; John Joseph Perry; Chu-Wen Chen (2025). Dataset for Xiang Subgrouping [Dataset]. http://doi.org/10.25442/hku.28935251.v1
Available download formats: csv
Unique identifier
https://doi.org/10.25442/hku.28935251.v1
Dataset updated
May 16, 2025
Dataset provided by
HKU Data Repository
Authors
Robert Marcelo Sevilla; John Joseph Perry; Chu-Wen Chen
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset is divided into two parts. The first is an evaluation of the data in the Linguistic Atlas of Chinese Dialects (Cao 2008) for uniqueness and generality across three core Xiang varieties (Changsha, Shaoyang, Lianyuan) and two non-Xiang varieties (Changde, Chaling); one non-core Xiang variety (Hengyang) was also included for reference. This part is divided into three sections based on the type of linguistic feature covered, corresponding to the three volumes of the Atlas: Phonology, Lexicon, and Grammar. Each dataset contains 8-10 columns: (1) Variety (which variety of the six is being considered); (2) Feature (the feature from the Atlas); (3) Value (the value for that location given in the Atlas); (4) Middle Chinese Interpretation (ONLY Phonological Dataset; the value of the feature in the Qieyun, using Baxter's transcription); (5) Dialectal Rendition (the form the feature takes in the published description of the variety); (6) Implied Sound Change (ONLY Phonological Dataset; the assumed sound change from Middle Chinese); (7) Unique; (8) General; (9) Map Number (relevant page and map number in the Linguistic Atlas of Chinese Dialects); (10) Notes.

Uniqueness means that a feature does not occur in non-Xiang varieties, while Generality means that a feature occurs in all Xiang varieties; these are indicated in the dataset with 'Yes' (positive value), 'No' (negative value), and 'N/A' (irrelevant). Features deemed particularly relevant for subgrouping purposes, i.e. those that are particularly rare or unique, are highlighted in yellow. If a feature in the Atlas does not agree with the published description of a variety, this is indicated under Notes as 'Mismatch', with the mismatched feature shown in red lettering. In the lexical and grammatical datasets, the 'Dialectal Rendition' column (Column 4) focuses on features that are unique and deemed helpful for subgrouping, that is, rows highlighted in yellow. If a row is highlighted but lacks a value for this column, a relevant form could not be identified in the published description of that variety.

The second dataset ('Innovations') is a list of 147 linguistic innovations evaluated for 13 Sinitic language varieties (9 Xiang, 4 non-Xiang), with '1' meaning 'possesses innovation' and '0' meaning 'does not possess innovation'. A value of 'NA' means that the value could not be determined for that variety for lack of data.
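The Uniqueness and Generality definitions in the description above can be sketched in a few lines of Python. This is only an illustration of the stated definitions, not the authors' procedure: the variety names (`xiang_1`, `other_1`, ...), the feature values (`"A"`, `"B"`, `"C"`), and the helper names `is_unique` and `is_general` are all invented placeholders, and how the dataset itself treats the non-core reference variety is not specified here.

```python
from typing import Dict, Set

def is_unique(value: str, observed: Dict[str, Set[str]], non_xiang: Set[str]) -> bool:
    """Uniqueness: the value occurs in no non-Xiang variety."""
    return all(value not in observed[variety] for variety in non_xiang)

def is_general(value: str, observed: Dict[str, Set[str]], xiang: Set[str]) -> bool:
    """Generality: the value occurs in every Xiang variety."""
    return all(value in observed[variety] for variety in xiang)

def label(flag: bool) -> str:
    """Render a flag the way the dataset description does ('Yes' / 'No')."""
    return "Yes" if flag else "No"

# Hypothetical toy data: variety -> values attested for one Atlas feature.
# None of these values come from the actual dataset.
observed = {
    "xiang_1": {"A"},
    "xiang_2": {"A"},
    "xiang_3": {"A", "B"},
    "other_1": {"B"},
    "other_2": {"C"},
}
xiang = {"xiang_1", "xiang_2", "xiang_3"}
non_xiang = {"other_1", "other_2"}

for value in ("A", "B"):
    unique = label(is_unique(value, observed, non_xiang))
    general = label(is_general(value, observed, xiang))
    print(f"value={value} Unique={unique} General={general}")
```

In this toy setup, value "A" is both unique and general, which is the combination the description singles out as most useful for subgrouping, while "B" is neither.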
Global TIMIT Mandarin Chinese-Guanzhong Dialect
abacus.library.ubc.ca
iso, txt
Updated Mar 18, 2022
Cite
Abacus Data Network (2022). Global TIMIT Mandarin Chinese-Guanzhong Dialect [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/MFTAUQ
Explore at:
Available download formats: txt(1308), iso(582139904)
Dataset updated
Mar 18, 2022
Dataset provided by
Abacus Data Network
Area covered
Guanzhong
Description
Abstract

Introduction

Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. The Global TIMIT project aimed to create a series of corpora in a variety of languages with a set of key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, these features included:

A large number of fluently read sentences containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Data

Global TIMIT Mandarin Chinese-Guanzhong Dialect consists of 50 speakers each reading 120 sentences selected from Chinese Gigaword Fifth Edition (LDC2011T13). Of the 120 sentences read by each speaker, 20 were read by all speakers, 40 were read by 10 speakers, and 60 were read by one speaker, for a total of 3220 sentence types. The corpus was recorded at Xi'an Jiaotong University, Xi'an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi, and spoke the Guanzhong dialect. All speech data are presented as 16 kHz, 16-bit FLAC-compressed WAV files. Each file has accompanying phone, word, and tone segmentation files, as well as Praat TextGrid files.
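The figure of 3220 sentence types quoted above can be checked with a little arithmetic. The sketch below assumes one reading of the description: every speaker reads the same 20 common sentences, plus 40 sentences that are each shared with 9 other speakers, plus 60 sentences read by nobody else. The function and parameter names are illustrative and do not come from the corpus documentation.

```python
def count_sentence_types(
    speakers: int,
    read_by_all: int,
    shared_per_speaker: int,
    speakers_per_shared: int,
    unique_per_speaker: int,
) -> int:
    """Count distinct sentences under the reading design described above."""
    # Each shared sentence is read by `speakers_per_shared` speakers, so the
    # pool of shared sentences is the total shared readings divided by that.
    shared_types = speakers * shared_per_speaker // speakers_per_shared
    # Sentences read by a single speaker are all distinct.
    unique_types = speakers * unique_per_speaker
    return read_by_all + shared_types + unique_types

speakers = 50
per_speaker = 20 + 40 + 60             # 120 sentences read by each speaker
total_utterances = speakers * per_speaker

sentence_types = count_sentence_types(
    speakers=speakers,
    read_by_all=20,
    shared_per_speaker=40,
    speakers_per_shared=10,
    unique_per_speaker=60,
)

print(f"utterances recorded: {total_utterances}")
print(f"distinct sentence types: {sentence_types}")
```

Under these assumptions the count is 20 + 200 + 3000 = 3220, matching the description, and the 50 speakers produce 6000 recorded utterances in total.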
Database of Chinese Names
live.european-language-grid.eu
catalog.elra.info
txt
Cite
Database of Chinese Names [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2443
Explore at:
Available download formats: txt
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
China
Description
Chinese name components, accompanied by accurate pinyin readings, gender codes, and flags denoting whether name is a given name, surname, or both.

