100+ datasets found
  1. Data--Chinese dialects & environmental factors.docx

    • figshare.com
    docx
    Updated Dec 18, 2024
    Cite
    Ruifeng Mo (2024). Data--Chinese dialects & environmental factors.docx [Dataset]. http://doi.org/10.6084/m9.figshare.28052444.v1
    Available download formats: docx
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ruifeng Mo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Language diversity and its driving factors

  2. Chinese Language Schools in United States - 349 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jun 18, 2025
    Cite
    Poidata.io (2025). Chinese Language Schools in United States - 349 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/united-states
    Available download formats: json, csv, excel
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Poidata.io
    Area covered
    United States
    Description

    Comprehensive dataset of 349 Chinese language schools in the United States as of June 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  3. Chinese Multi-dialect TTS Database

    • infinityai.ai
    Updated Jun 13, 2025
    Cite
    DataOceanAI (2025). Chinese Multi-dialect TTS Database [Dataset]. www.infinityai.ai
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    datatoceanai
    DataOceanAI
    Authors
    DataOceanAI
    Variables measured
    Product name, Recording duration, Recording language, Recording parameters, Recording environment, Annotation Information, Product library number
    Description

    The synthetic data comes from 5 female voice actors in a professional recording studio (background noise <18dB(A)). Each of them makes 2-3 recordings per week as part of a total recording cycle of 1 month, and the recorded content covers Chinese marketing scripts.

  4. cldf-datasets/normansinitic: Structural data for the paper by Norman (2013)...

    • zenodo.org
    zip
    Updated Apr 15, 2020
    + more versions
    Cite
    Johann-Mattis List (2020). cldf-datasets/normansinitic: Structural data for the paper by Norman (2013) on Chinese dialect classification [Dataset]. http://doi.org/10.5281/zenodo.1405148
    Available download formats: zip
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johann-Mattis List
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    Original source of the data:

    Norman, J. (2003): Chinese dialects. Phonology. In: Thurgood, G. & LaPolla, R.: The Sino-Tibetan Languages. Routledge: London and New York. 72-83.

  5. Chinese Language Instructors in Germany - 13 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jul 3, 2025
    Cite
    Poidata.io (2025). Chinese Language Instructors in Germany - 13 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-instructor/germany
    Available download formats: excel, csv, json
    Dataset updated
    Jul 3, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Germany
    Description

    Comprehensive dataset of 13 Chinese language instructors in Germany as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  6. Supplementary data for "Tones of Beijing dialect since 1900 and their...

    • zenodo.org
    zip
    Updated Jun 5, 2025
    Cite
    Tianheng Wang (2025). Supplementary data for "Tones of Beijing dialect since 1900 and their evolution: Evidence from early recordings" [Dataset]. http://doi.org/10.5281/zenodo.15192724
    Available download formats: zip
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tianheng Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Beijing
    Description

    This repository provides raw data of citation tones of Beijing dialect extracted from recordings and experimental results since 1900.

    Data sources

    Early recordings

    • Azoulay, Léon (ed.). 1900. Exposition Universelle de Paris, 1900.
      • Unpublished recordings on wax cylinders.
        The first take of citation tones of Beijing dialect is on item No. 78 (cylinder No. 115), and the second take is on items Nos. 76 & 77 (cylinders Nos. 109 & 110).
      • Digitized by the Centre de recherche en ethnomusicologie, Laboratoire d’ethnologie et de sociologie comparative. (Nos. 76, 77 & 78)
      • Extracted data: 1900 Azoulay, 1st take.csv and 1900 Azoulay, 2nd take.csv.
    • Wang, Pu 王璞. 1920. Zhonghua Guoyin liushengjipian 中華國音留聲機片 [Chinese National Phonetic record]. Shanghai: Zhonghua Book Company.
      • Six 11.5-inch 80 rpm vertical-cut phonograph records produced by Pathé Records in Paris.
        The demonstration of citation tones (Lesson 6) is on Disc 3, Side B, catalog number 34001⁶.
      • Partially digitized by the National Taiwan University Library. (Disc 3)
      • Extracted data: 1920 Wang.csv.
    • Chao, Yuen Ren 趙元任. 1922a. Guoyu liushengji pian 國語留聲機片 [National Language record]. Shanghai: The Commercial Press.
      • Eight 10-inch 78 rpm phonograph records produced by Columbia Phonograph Company in New York, catalog number W1.
        The demonstration of citation tones (Lesson 7) is on Disc 4, Side A, matrix number 93282.
      • Digitized by the National Taiwan University Library. (Disc 4)
      • Also digitized by the Chinese University of Hong Kong Library. (Discs 1–8) (not used in this study)
      • Extracted data: 1922a Chao.csv.
    • Shu, Chien Chun. 1930. Chinese (Linguaphone Language Courses). London: Linguaphone Institute.
      • Sixteen 10-inch 78 rpm phonograph records.
        The demonstration of citation tones is on Disc 1, Side A, catalog number C.1.E. (titled “Chinese course: Lesson No. 1”), and Disc 16, Side A, catalog number C.S.1.E. (titled “Chinese sounds: No. 1”).
      • Digitized by Beijing Dianji yu Jingdian Lao Changpian Shuzihua Chuban Xiangmu 北京典籍与经典老唱片数字化出版项目 [Beijing Classics and Old Records Digital Publishing Project]. (Discs 1 & 16)
      • Also digitized by the Great 78 Project. (Discs 1 & 16) (not used in this study)
      • Extracted data: 1930 Shu.csv.
    • Pai, Ti-chou 白滌洲. 1933. Biaozhun Guoyin 標準國音 [Standard National Pronunciation]. Shanghai: Zhonghua Book Company.
      • Four 10-inch 78 rpm phonograph records produced by Great China Record Company in Shanghai.
        The demonstration of citation tones (Side 5) is on Disc 3, Side A, catalog number 18-A.
      • Partially digitized by the National Taiwan University Library. (Disc 3)
      • Extracted data: 1933 Pai.csv.

    Early experiments

    • Karlgren, Bernhard. 1915. Études sur la Phonologie chinoise (Archives d’Études Orientales 15), vol. 1. Leyde: E. J. Brill. (Brittle Books | Internet Archive)
    • Chao, Yuen Ren 趙元任. 1922c. Zhongguo yanyu zidiao di shiyan yanjiufa 中國言語字調底實驗硏究法 [The methods for investigation of the intonation of Chinese language]. Kexue 科學 7(9). 871–882.
    • Liu, Fu. 1925. Étude expérimentale sur les tons du Chinois. Paris: Les Belles Lettres. (Gallica)
    • Obata, Jûichi 小幡重一 & Tesima, Takehiko 豊島武彦. 1934. Shina-go no butsuri onseigakuteki kenkyū: Shisei no seishitsu 支那語の物理音聲學的研究:四聲の性質 [Physicophonetic study of Chinese language: Properties of the four tones]. Nippon Sugaku-Buturigakkwaishi 日本数学物理学会誌 8(1). 1–10. (DOI: 10.11429/subutsukaishi1927.8.1)
    • Pai, Ti-chou 白滌洲. 1934. Beijingyu shengdiao ji bianhua 北京語声調及变化 [Tones and changes of Beijing dialect].
      • Manuscript.
      • Partially published in Luo, Changpei 羅常培 & Wang, Jun 王均. 1957. Putong yuyinxue gangyao 普通語音学綱要 [Outline of general phonetics], 125–127. Beijing: Science Press.
      • Extracted data: 1934 Pai.csv.

    Modern data

    • Lin, Tao 林焘 & Zhou, Yimin 周一民 & Cai, Wenlan 蔡文兰. 1998. Beijinghua yindang 北京话音档 [Phonetic archive of Beijing dialect] (Xiandai Hanyu Fangyan Yinku 现代汉语方言音库 [Phonetic Database of Modern Chinese Dialects] 1). Shanghai: Shanghai Education Publishing House.
      • Includes sound recordings on the accompanying tape.
      • I directly cite extracted data (mean and standard deviation) from Shi, Shaowei 石少伟. 2007. Xiandai Hanyu Fangyan Yinku danzidiao shiyan yanjiu 《现代汉语方言音库》单字调实验研究 [Experimental study on monosyllabic tones in Phonetic Database of Modern Chinese Dialects]. Nanjing: Nanjing Normal University. (Master’s thesis.) (DOI: 10.7666/d.y1116983)
      • Data: 1998 Lin et al..csv.
    • Sanders, Robert & Shi, Feng 石锋. 2003. Hanyu Yuyin Shujuku 汉语语音数据库 [Chinese Phonetic Database].

    Non-citation tones recordings

    • Li, Deyang 李德鍚 & Zhang, Dequan 張德泉. 1909. Dengmi yinyu 燈謎隱語 [Lantern riddles]. Paris: Pathé Records.
      • 11.5-inch 90 rpm center-start vertical-cut phonograph record, catalog number 32573.
      • Reissued in Shanghai in the 1920s on 11.5-inch 80 rpm outside-start vertical-cut discs, under the same catalog number.
      • The reissue was remastered in the 2007 documentary Xiangsheng Dashi 相声大师

  7. Chinese Mandarin (North) database

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 31, 2018
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (North) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0398/
    Dataset updated
    May 31, 2018
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This database contains recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China. The demographics of native speakers from Northern China are as follows:
    • Beijing: 200 speakers (100 males, 100 females)
    • North of Beijing: 101 speakers (50 males, 51 females)
    • Shandong: 149 speakers (75 males, 74 females)
    • Henan: 50 speakers (25 males, 25 females)
    The speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place.
    Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16-bit, linear PCM. The recording script consists of:
    • Phoneme balance statement: 785 sentences
    • Travel conversation: 1618 sentences
    • About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance
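
    As a rough illustration of what the stated format implies for storage, the sketch below estimates the raw size of 172 hours of 48 kHz, mono, 16-bit linear PCM. The function and constant names are illustrative, and the estimate ignores WAV headers and any packaging or compression used for delivery.

```python
# Rough storage estimate for uncompressed linear PCM audio.
# Figures come from the listing; names are illustrative assumptions.
SAMPLE_RATE_HZ = 48_000   # 48 kHz
BYTES_PER_SAMPLE = 2      # 16 bits
CHANNELS = 1              # mono
TOTAL_HOURS = 172


def pcm_bytes(hours, sample_rate_hz, bytes_per_sample, channels):
    """Return the number of raw PCM bytes for the given duration and format."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


total_bytes = pcm_bytes(TOTAL_HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"{total_bytes} bytes (~{total_bytes / 1e9:.1f} GB, headers excluded)")
```

    The same function can be reused for other listings by changing the constants, for example a 16 kHz corpus or a stereo recording.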

  8. CLDF dataset derived from Hóu's "Phonological Database of Chinese Dialects"...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 7, 2024
    Cite
    Hóu, Jīngyī (2024). CLDF dataset derived from Hóu's "Phonological Database of Chinese Dialects" from 2004 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5126857
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Hóu, Jīngyī
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cite the source of the dataset as:

    Hóu, J. (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù.

  9. Chinese Mandarin (South) database

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 31, 2018
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2018). Chinese Mandarin (South) database [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0397/
    Dataset updated
    May 31, 2018
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This database contains recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China. The demographics of native speakers from Southern China are as follows:
    • Guangdong: 312 speakers (154 males, 158 females)
    • Fujian: 155 speakers (95 males, 60 females)
    • Jiangsu: 262 speakers (134 males, 128 females)
    • Zhejiang: 160 speakers (84 males, 76 females)
    • Taiwan: 105 speakers (31 males, 74 females)
    • Other-Southern: 6 speakers (2 males, 4 females)
    The speaker profile includes the following information: unique ID, place of birth, place where the speaker lived the longest by the age of 16 and the number of years lived there, age, gender, and recording place.
    Recordings were made through microphone headsets (ATM73a / AUDIO TECHNICA) and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16-bit, linear PCM. The recording script consists of:
    • Phoneme balance statement: 785 sentences
    • Travel conversation: 1618 sentences
    • About 200 sentences per speaker, including 134 sentences of travel conversation and 66 sentences of phoneme balance

  10. Mandarin General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
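
    A minimal sketch of how the recording format listed above (stereo, 16-bit, 16 kHz WAV) could be checked with Python's standard `wave` module. The function names and the `EXPECTED` dictionary are illustrative assumptions, not part of any tooling shipped with the dataset, and a real file path would need to be supplied.

```python
import wave

# Expected header values, taken from the listing's "Audio Format" line.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16_000}


def describe_wav(path):
    """Read channel count, sample width, and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_listing(path):
    """Return True when the file header matches the format stated in the listing."""
    return describe_wav(path) == EXPECTED
```

    Calling `matches_listing("some_recording.wav")` (a placeholder path) on each delivered file is one way to catch resampled or down-mixed audio before training.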

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Mandarin speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
    Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.

  11. Chinese Language Schools in Indonesia - 174 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jun 26, 2025
    Cite
    Poidata.io (2025). Chinese Language Schools in Indonesia - 174 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/indonesia
    Available download formats: csv, excel, json
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Indonesia
    Description

    Comprehensive dataset of 174 Chinese language schools in Indonesia as of June, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  12. Chinese Language Schools in → Ehime, Japan - 1 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jul 5, 2025
    Cite
    Poidata.io (2025). Chinese Language Schools in → Ehime, Japan - 1 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/japan/ehime
    Available download formats: excel, json, csv
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Ehime, Japan
    Description

    Comprehensive dataset of 1 Chinese language school in Ehime, Japan as of July 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  13. Chinese Language Schools in Colombia - 4 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jul 13, 2025
    Cite
    Poidata.io (2025). Chinese Language Schools in Colombia - 4 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/colombia
    Available download formats: csv, excel, json
    Dataset updated
    Jul 13, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Colombia
    Description

    Comprehensive dataset of 4 Chinese language schools in Colombia as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  14. GlobalPhone Chinese-Shanghai

    • catalogue.elra.info
    • catalog.elra.info
    • +1 more
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0194/
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Shanghai
    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.

  15. Chinese Language Schools in Iwate, Japan - 1 Verified Listings Database

    • poidata.io
    csv, excel, json
    Updated Jun 26, 2025
    Cite
    Poidata.io (2025). Chinese Language Schools in Iwate, Japan - 1 Verified Listings Database [Dataset]. https://www.poidata.io/report/chinese-language-school/japan/iwate
    Available download formats: excel, json, csv
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Iwate, Japan
    Description

    Comprehensive dataset of 1 Chinese language school in Iwate, Japan as of June 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.

  16. CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition

    • abacus.library.ubc.ca
    iso, txt
    Updated Mar 18, 2022
    Cite
    Abacus Data Network (2022). CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=8de5d62f9bfb012e7807cef02109?persistentId=hdl%3A11272.1%2FAB2%2FAT8NRM&version=&q=&fileAccess=&fileTag=%22Documentation%22&fileSortField=&fileSortOrder=
    Available download formats: iso (3116947456), txt (1308)
    Dataset updated
    Mar 18, 2022
    Dataset provided by
    Abacus Data Network
    Area covered
    Taiwan
    Description

    Abstract

    Introduction

    CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by the Linguistic Data Consortium (LDC) and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Taiwan Dialect (LDC96S56). The CALLFRIEND series is a collection of telephone conversations in several languages conducted by LDC in support of language identification technology development. Languages covered in the collection include American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese.

    Data

    All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes. The data was recorded as 8kHz u-law SPH encoded stereo files, with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is described in the documentation. SPH files are not included in this second edition. The audio files were originally split into train, dev and test folders of 20 recordings each, but they are combined in this release. Completed calls passed through a human auditing process to verify that the target language was spoken by the participants, to check the quality of the recordings, and to record information about dialect, noise and distortion.

  17. TCST-UT: Tibetan-Chinese speech translation dataset of Ü-Tsang dialect

    • scidb.cn
    Updated Jan 3, 2025
    Cite
    li xin; Liu Jialuo; Dorje Peng Mao; Kan Zhuocuo; Qi Xiaoke; Zhao Xiaobing (2025). TCST-UT: Tibetan-Chinese speech translation dataset of Ü-Tsang dialect [Dataset]. http://doi.org/10.57760/sciencedb.18807
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Science Data Bank
    Authors
    li xin; Liu Jialuo; Dorje Peng Mao; Kan Zhuocuo; Qi Xiaoke; Zhao Xiaobing
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This TCST-UT dataset contains 58767 samples totaling 72.08 hours of audio from 147 different speakers. Each sample is a triplet consisting of Tibetan speech, the corresponding Tibetan text, and Chinese text. The Tibetan speech data comes from the M2ASR Tibetan dialect speech recognition dataset, which is published on the m2sr.cslt.org website. The audio files can be obtained through email requests, so this dataset does not directly provide audio files, only the audio paths of the samples it contains. The audio path of each sample, along with the corresponding Tibetan text and Chinese translated text, is stored in the output.json file (22 MB), where the audio path refers to the path in the public dataset. The file stores each sample as a dictionary entry in the following format:
    Abbreviation of Name - Audio Number: {“audio”: Audio file path, “text”: {“Tibetan”: The Tibetan text corresponding to the audio file, “Chinese”: The Chinese text corresponding to the audio file}}
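
    The record layout described above could be read along the following lines. This is a sketch under assumptions: it supposes the JSON file holds a single top-level dictionary keyed by sample ID, and the function names and the placeholder record are hypothetical, not taken from the dataset.

```python
import json


def iter_samples(records):
    """Yield (sample_id, audio_path, tibetan_text, chinese_text) for each record."""
    for sample_id, entry in records.items():
        text = entry["text"]
        yield sample_id, entry["audio"], text["Tibetan"], text["Chinese"]


def load_samples(path="output.json"):
    """Load all samples from the JSON file (assumed to be one top-level dict)."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_samples(json.load(handle)))


# Placeholder record for illustration only; not real data from the dataset.
example = {
    "spk01-0001": {
        "audio": "path/to/spk01_0001.wav",
        "text": {"Tibetan": "<Tibetan text>", "Chinese": "<Chinese text>"},
    }
}
for row in iter_samples(example):
    print(row)
```

    Because the audio itself is not distributed with the dataset, the `audio` field only becomes usable once the referenced public corpus has been obtained separately.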

  18.

    Dataset for Xiang Subgrouping

    • datahub.hku.hk
    csv
    Updated May 16, 2025
    Cite
    Robert Marcelo Sevilla; John Joseph Perry; Chu-Wen Chen (2025). Dataset for Xiang Subgrouping [Dataset]. http://doi.org/10.25442/hku.28935251.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    May 16, 2025
    Dataset provided by
    HKU Data Repository
    Authors
    Robert Marcelo Sevilla; John Joseph Perry; Chu-Wen Chen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset has two parts. The first evaluates the data in the Linguistic Atlas of Chinese Dialects (Cao 2008) for uniqueness and generality across 3 core Xiang varieties (Changsha, Shaoyang, Lianyuan) and 2 non-Xiang varieties (Changde, Chaling); 1 non-core Xiang variety (Hengyang) is included for reference. It is divided into three sections by type of linguistic feature, corresponding to the three volumes of the Atlas: Phonology, Lexicon, and Grammar. Each section contains 8-10 columns: (1) Variety (which of the six varieties is being considered); (2) Feature (the feature from the Atlas); (3) Value (the value given for that location in the Atlas); (4) Middle Chinese Interpretation (phonological dataset only; the value of the feature in the Qieyun, using Baxter's transcription); (5) Dialectal Rendition (the form the feature takes in the published description of the variety); (6) Implied Sound Change (phonological dataset only; the assumed sound change from Middle Chinese); (7) Unique; (8) General; (9) Map Number (the relevant page and map number in the Linguistic Atlas of Chinese Dialects); (10) Notes. Uniqueness means that a feature does not occur in non-Xiang varieties, while Generality means that a feature occurs in all Xiang varieties; both are recorded as 'Yes' (positive value), 'No' (negative value), or 'N/A' (irrelevant). Features deemed particularly relevant for subgrouping, i.e. those that are particularly rare or unique, are highlighted in yellow. If a feature in the Atlas does not agree with the published description of a variety, this is marked under Notes as 'Mismatch', with the mismatched feature shown in red lettering. In the lexical and grammatical datasets, the 'Dialectal Rendition' column (Column 4 there) is filled in only for features that are unique and deemed helpful for subgrouping, that is, for rows highlighted in yellow. If a row is highlighted but lacks a value in this column, no relevant form could be identified in the published description of that variety.

    The second dataset ('Innovations') lists 147 linguistic innovations evaluated for 13 Sinitic language varieties (9 Xiang, 4 non-Xiang), with '1' meaning 'possesses innovation' and '0' meaning 'does not possess innovation'. A value of 'NA' means that the value could not be determined for that variety because of a lack of data.
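
    The '1' / '0' / 'NA' coding of the 'Innovations' table lends itself to a simple pairwise comparison. The sketch below is illustrative only: the CSV header, the variety names, and the helper shared_innovations are hypothetical, the four toy rows are invented, and counting shared innovations is one possible way to use the table, not necessarily the authors' subgrouping method.

```python
import csv
import io
from itertools import combinations

# Toy stand-in for the 'Innovations' table: hypothetical header and values.
TOY_CSV = """Innovation,VarietyA,VarietyB,VarietyC
inn1,1,1,0
inn2,1,NA,1
inn3,0,1,1
inn4,1,1,NA
"""


def shared_innovations(csv_text: str) -> dict:
    """Return {(a, b): (shared, comparable)} for every pair of varieties.

    'comparable' counts innovations where neither variety is coded 'NA';
    'shared' counts those where both varieties are coded '1'.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    varieties = [name for name in rows[0] if name != "Innovation"]
    result = {}
    for a, b in combinations(varieties, 2):
        comparable = [r for r in rows if r[a] != "NA" and r[b] != "NA"]
        shared = sum(1 for r in comparable if r[a] == "1" and r[b] == "1")
        result[(a, b)] = (shared, len(comparable))
    return result


pairs = shared_innovations(TOY_CSV)
```

    Reporting the comparable count next to the shared count keeps 'NA' cells from silently lowering the apparent similarity of poorly documented varieties.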

  19.

    Global TIMIT Mandarin Chinese-Guanzhong Dialect

    • abacus.library.ubc.ca
    iso, txt
    Updated Mar 18, 2022
    Cite
    Abacus Data Network (2022). Global TIMIT Mandarin Chinese-Guanzhong Dialect [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/MFTAUQ
    Explore at:
    Available download formats: txt (1308), iso (582139904)
    Dataset updated
    Mar 18, 2022
    Dataset provided by
    Abacus Data Network
    Area covered
    Guanzhong
    Description

    Abstract

    Introduction

    Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by the Linguistic Data Consortium and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. The Global TIMIT project aimed to create a series of corpora in a variety of languages with a set of key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, these features included: a large number of fluently read sentences containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns; a relatively large number of speakers; time-aligned lexical and phonetic transcription of all utterances; and some sentences read by all speakers, others read by a few speakers, and others read by just one speaker.

    Data

    Global TIMIT Mandarin Chinese-Guanzhong Dialect consists of 50 speakers reading 120 sentences selected from Chinese Gigaword Fifth Edition (LDC2011T13). Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types. The corpus was recorded at Xi'an Jiaotong University, Xi'an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi and spoke the Guanzhong dialect. All speech data are presented as 16 kHz, 16-bit FLAC-compressed WAV files. Each file has accompanying phone, word, and tone segmentation files, as well as Praat TextGrid files.
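
    The total of 3220 sentence types can be checked with a short calculation. The sketch below assumes the reading under which the counts are per speaker: each speaker reads 20 sentences common to everyone, 40 sentences each shared by a group of 10 speakers, and 60 sentences read by nobody else. The variable names are illustrative, and the utterance total is derived from these assumptions rather than stated in the description.

```python
SPEAKERS = 50
COMMON = 20        # sentences read by all speakers
SHARED = 40        # per speaker; each such sentence is read by 10 speakers
SHARED_GROUP = 10  # number of speakers reading each shared sentence
UNIQUE = 60        # per speaker; each such sentence is read by one speaker

# Sentences read by each speaker.
per_speaker = COMMON + SHARED + UNIQUE

# Distinct sentences: the common set once, the shared readings divided by
# the group size, and every speaker's unique sentences.
sentence_types = COMMON + (SHARED * SPEAKERS) // SHARED_GROUP + UNIQUE * SPEAKERS

# Total recorded utterances under the same assumptions.
utterances = SPEAKERS * per_speaker

print(per_speaker, sentence_types, utterances)  # 120 3220 6000
```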

  20.

    Database of Chinese Names

    • live.european-language-grid.eu
    • catalog.elra.info
    txt
    Cite
    Database of Chinese Names [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2443
    Explore at:
    Available download formats: txt
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    China
    Description

    Chinese name components, accompanied by accurate pinyin readings, gender codes, and flags denoting whether each component is a given name, a surname, or both.
