In 2022, around 42.03 million people in the United States spoke Spanish at home. In comparison, approximately 974,829 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Introduction
ChildMandarin is a comprehensive, open-source Mandarin Chinese speech dataset specifically designed for research on young children aged 3 to 5. This dataset addresses the critical lack of publicly available resources for this age group, enabling advancements in automatic speech recognition (ASR), speaker verification (SV), and other related fields. The dataset is released… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/ChildMandarin.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for cantonese-mandarin-translations
Dataset Summary
This is a machine-translated parallel corpus between Cantonese (a Chinese dialect spoken mainly in Guangdong (a province of China), Hong Kong, Macau, and parts of Malaysia) and Chinese (written form, in Simplified Chinese).
Supported Tasks and Leaderboards
N/A
Languages
Cantonese (yue), Simplified Chinese (zh-CN)
Dataset Structure
JSON lines with yue… See the full description on the dataset page: https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
Data is compressed by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.
The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
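As a rough illustration of what the stated audio format implies for storage, the sketch below estimates the size of uncompressed PCM audio from duration, sample rate, bit depth, and channel count. The function name `pcm_bytes` is hypothetical and not part of any GlobalPhone tooling; the figures (16-bit, 16 kHz, mono, 450 hours) are taken from the description above, and since the corpus is described as containing "over 450 hours", the result is only a lower bound before shorten compression and without file headers.

```python
def pcm_bytes(hours: float,
              sample_rate_hz: int = 16_000,
              bits_per_sample: int = 16,
              channels: int = 1) -> int:
    """Estimate the size in bytes of raw (headerless, uncompressed) PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    seconds = hours * 3600
    return round(seconds * sample_rate_hz * channels * bytes_per_sample)


# Lower-bound estimate for 450 hours of 16-bit, 16 kHz mono speech.
size = pcm_bytes(450)
print(f"{size / 1e9:.2f} GB")  # 51.84 GB (decimal gigabytes)
```

The same helper applies to any corpus in this listing that states its sampling format; only the duration and channel count need to change.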
https://choosealicense.com/licenses/unknown/
Dataset Card for "clue"
Dataset Summary
CLUE, A Chinese Language Understanding Evaluation Benchmark (https://www.cluebenchmarks.com/) is a collection of resources for training, evaluating, and analyzing Chinese language understanding systems.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
afqmc
Size of downloaded… See the full description on the dataset page: https://huggingface.co/datasets/clue/clue.
https://choosealicense.com/licenses/other/
Dataset Card for United Nations Parallel Corpus
Dataset Summary
The United Nations Parallel Corpus is the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/un_pc.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, comprising 30 conversations uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via microphones, one for each speaker, as well as three far-field channels recorded by an iPhone, an Android phone, and a recorder, respectively. This corpus may be obtained as a complete set or by selecting specific channels (the two close-talking channels count as a single channel):
- MDT Mandarin Chinese Conversational Recognition Corpus - complete set (ELRA-S0409-01)
- MDT Mandarin Chinese Conversational Recognition Corpus - 1 channel (ELRA-S0409-02)
- MDT Mandarin Chinese Conversational Recognition Corpus - 2 channels (ELRA-S0409-03)
- MDT Mandarin Chinese Conversational Recognition Corpus - 3 channels (ELRA-S0409-04)
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Translated Chinese Medical Prompts
This repository contains medical prompts originally translated from Chinese, which can be used as training data for natural language processing (NLP) tasks related to the medical domain in English.
Dataset Description
The dataset consists of a collection of medical prompts originally in Chinese, which have been translated into English. These prompts cover various medical topics, including symptoms, diagnoses, treatments, medications… See the full description on the dataset page: https://huggingface.co/datasets/h2oai/h2o-translated-chinese-med-prompts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As reflected in the concept of Global Englishes, English mediates global communication, where English speakers include not merely those from English-speaking countries like the United Kingdom or the United States but also people from a wide range of linguistic backgrounds, who speak the language with diverse accents. Thus, to communicate internationally, cultivating strong listening proficiency for, and positive attitudes toward, Global Englishes speakers with diverse accents is ever more important. However, given their preference for American English and its popular culture, it is uncertain whether Swedish youth learners are developing these key linguistic qualities to be prepared for the globalized use of English. To address this, we randomly assigned 160 upper secondary students (mean age = 17.25) to six groups, where each group listened to one of six English speakers. The six speakers' first languages were Mandarin, Russian/Ukrainian, Tamil, Lusoga/Luganda, American English, and British English. By comparing the six student groups, we examined their listener intelligibility (actual understanding), listener comprehensibility (feeling of ease or difficulty), accentedness perception (perceiving an accent as native or foreign), and accentedness acceptance (showing a positive or negative attitude toward an accent) of diverse English accents. The results showed that the intelligibility scores and perception/attitude ratings of participants favored the two speakers with privileged accents: the American and British speakers. However, across all six groups, no correlation was detected between their actual understanding of the speakers and their perception/attitude ratings, which often had a strong correlation with their feelings of ease/difficulty regarding the speakers' accents. Taken together, our results suggest that current English education needs innovation to be more aligned with the national syllabus, which promotes a global perspective. That is, students need to be guided to improve their actual understanding of, and sense of familiarity with, Global English speakers beyond the native accents that they prefer. Moreover, innovative pedagogical work should be undertaken to change Swedish youths' perceptions and attitudes and prepare them to become open-minded toward diverse English speakers.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We recommend using the improved version, Fineweb-edu-chinese-v2.1!
Chinese Fineweb Edu Dataset V2
Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
Project Web: https://magpie-align.github.io/ Arxiv Technical Report: https://arxiv.org/abs/2406.08464 Codes: https://github.com/magpie-align/magpie
Abstract
High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent… See the full description on the dataset page: https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AISHELL-3
Identifier: SLR93
Summary: Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
Category: Speech
License: Apache License v.2.0
Downloads (use a mirror closer to you):
data_aishell3.tgz 19G Mirrors: [US] [EU] [CN]
About this resource: AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. It can be used to train… See the full description on the dataset page: https://huggingface.co/datasets/shenyunhang/AISHELL-3.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions including English, German, Dutch, Italian and Romanian. As an unofficial task, conventional bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
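Chinese-Dolly-15k is a Traditional Chinese translation of the Dolly instruction dataset (Databricks). The original dataset, 'databricks/databricks-dolly-15k', is an open-source dataset of instruction-following records generated by thousands of Databricks employees across several behavioral categories outlined in the InstructGPT paper. These categories include brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, this dataset can be used for any academic or commercial purpose. If you are also preparing datasets like these, feel free to contact us to avoid duplicated spending.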
Citation
Please cite the repo if you use the data or code in this repo. @misc{alpaca, author = {DavidLanz}, title = {An Instruction-following Chinese Language model, LoRA tuning on… See the full description on the dataset page: https://huggingface.co/datasets/DavidLanz/chinese-dolly-15k.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR, and for SLT in 2 other directions, English-to-Spanish (ELRA-E0014) and Chinese-to-English (ELRA-E0016), as well as for the EPPS task for Spanish-to-English (ELRA-E0015/02).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 25,000 words were selected from the CORTES verbatim transcriptions and from the CORTES Final Text Edition documents. The source texts were then translated into English by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
- Test data set: the same procedure as for the development set was followed to produce the test data, i.e., subsets of 25,000 words were selected from the test data set (CORTES sessions on 24 November 2005), both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into English by two independent agencies.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Chinese Fineweb Edu Dataset V2.1
The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.
https://choosealicense.com/licenses/cc/
This dataset is no longer updated; please go to https://huggingface.co/datasets/fzmnm/TinyStoriesAdv-zh instead.
TinyEncyclopediasChinese
Inspired by the papers TinyStories (https://arxiv.org/abs/2305.07759) and Textbooks Are All You Need (https://arxiv.org/abs/2306.11644), where a small language model exhibits strong capabilities when trained on high-quality, kid-friendly stories synthesized by AI, I present an AI-generated Encyclopedia suitable for kindergarten and grade school levels. This dataset follows my previous… See the full description on the dataset page: https://huggingface.co/datasets/fzmnm/TinyEncyclopedias-Chinese.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
ChineseEcomQA is a scalable question-answering benchmark focused on fundamental e-commerce concepts. Specifically, our benchmark is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality and E-commerce Expertise. Please visit our website or check our paper for more details.
💫 Introduction
With the increasing use of Large Language Models (LLMs) in fields such as e-commerce… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/Chinese-EcomQA.