18 datasets found

Ranking of languages spoken at home in the U.S. 2022
statista.com
Updated Dec 9, 2024
Statista (2024). Ranking of languages spoken at home in the U.S. 2022 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Dataset updated
Dec 9, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2022
Area covered
United States
Description
In 2022, around 42.03 million people in the United States spoke Spanish at home. In comparison, approximately 974,829 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
ChildMandarin
huggingface.co
Updated Mar 18, 2025
Beijing Academy of Artificial Intelligence (2025). ChildMandarin [Dataset]. https://huggingface.co/datasets/BAAI/ChildMandarin
Dataset updated
Mar 18, 2025
Dataset authored and provided by
Beijing Academy of Artificial Intelligence
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

Introduction

ChildMandarin is a comprehensive, open-source Mandarin Chinese speech dataset specifically designed for research on young children aged 3 to 5. This dataset addresses the critical lack of publicly available resources for this age group, enabling advancements in automatic speech recognition (ASR), speaker verification (SV), and other related fields. The dataset is released… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/ChildMandarin.
cantonese-mandarin-translations
huggingface.co
hf-proxy-cf.effarig.site
Updated Dec 14, 2021
Botisan AI (2021). cantonese-mandarin-translations [Dataset]. https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations
Dataset updated
Dec 14, 2021
Dataset authored and provided by
Botisan AI
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for cantonese-mandarin-translations

Dataset Summary

This is a machine-translated parallel corpus between Cantonese (a Chinese dialect spoken mainly in Guangdong (a province of China), Hong Kong, Macau, and parts of Malaysia) and Chinese (written form, in Simplified Chinese).

Supported Tasks and Leaderboards

N/A

Languages

Cantonese (yue), Simplified Chinese (zh-CN)

Dataset Structure

JSON lines with yue… See the full description on the dataset page: https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations.
GlobalPhone Chinese-Shanghai
catalog.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0194/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
Shanghai
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
The data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.
The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
clue
huggingface.co
CLUE benchmark (2020). clue [Dataset]. https://huggingface.co/datasets/clue/clue
Dataset authored and provided by
CLUE benchmark
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "clue"

Dataset Summary

CLUE, A Chinese Language Understanding Evaluation Benchmark (https://www.cluebenchmarks.com/) is a collection of resources for training, evaluating, and analyzing Chinese language understanding systems.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

afqmc

Size of downloaded… See the full description on the dataset page: https://huggingface.co/datasets/clue/clue.
un_pc
huggingface.co
un_pc [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/un_pc
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for United Nations Parallel Corpus

Dataset Summary

The United Nations Parallel Corpus is the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/un_pc.
MDT Mandarin Chinese Conversational Recognition Corpus – 2 channels
catalog.elra.info
live.european-language-grid.eu
Updated May 20, 2020
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2020). MDT Mandarin Chinese Conversational Recognition Corpus – 2 channels [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0409_03/
Dataset updated
May 20, 2020
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via microphones, one for each speaker, as well as three far-field channels recorded by an iPhone, an Android phone, and a recorder respectively. This corpus may be obtained as a complete set or by selecting specific channels (the two close-talking channels shall be understood as 1 single channel):
- MDT Mandarin Chinese Conversational Recognition Corpus - complete set (ELRA-S0409-01)
- MDT Mandarin Chinese Conversational Recognition Corpus - 1 channel (ELRA-S0409-02)
- MDT Mandarin Chinese Conversational Recognition Corpus - 2 channels (ELRA-S0409-03)
- MDT Mandarin Chinese Conversational Recognition Corpus - 3 channels (ELRA-S0409-04)
h2o-translated-chinese-med-prompts
huggingface.co
Updated Jul 1, 2023
H2O.ai (2023). h2o-translated-chinese-med-prompts [Dataset]. https://huggingface.co/datasets/h2oai/h2o-translated-chinese-med-prompts
Dataset updated
Jul 1, 2023
Dataset provided by
H2O.ai, Inc.
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Translated Chinese Medical Prompts

This repository contains medical prompts translated from the original Chinese, which can be used as training data for natural language processing (NLP) tasks in the medical domain in English.

Dataset Description

The dataset consists of a collection of medical prompts originally in Chinese, which have been translated into English. These prompts cover various medical topics, including symptoms, diagnoses, treatments, medications… See the full description on the dataset page: https://huggingface.co/datasets/h2oai/h2o-translated-chinese-med-prompts.
Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse...
figshare.com
frontiersin.figshare.com
xlsx
Updated Jun 10, 2023
Hyeseung Jeong; Anna Elgemark; Bosse Thorén (2023). Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse Accents: Listener Intelligibility, Listener Comprehensibility, Accentedness Perception, and Accentedness Acceptance.XLSX [Dataset]. http://doi.org/10.3389/feduc.2021.651908.s003
Available download formats
xlsx
Unique identifier
https://doi.org/10.3389/feduc.2021.651908.s003
Dataset updated
Jun 10, 2023
Dataset provided by
Frontiers
Authors
Hyeseung Jeong; Anna Elgemark; Bosse Thorén
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As reflected in the concept of Global Englishes, English mediates global communication, where English speakers represent not merely those from English-speaking countries like the United Kingdom or the United States but also global people from a wide range of linguistic backgrounds, who speak the language with diverse accents. Thus, to communicate internationally, cultivating a maximized listening proficiency for and positive attitudes toward Global Englishes speakers with diverse accents is ever more important. However, with their preference for American English and its popular culture, it is uncertain whether Swedish youth learners are developing these key linguistic qualities to be prepared for the globalized use of English. To address this, we randomly assigned 160 upper secondary students (mean age = 17.25) to six groups, where each group listened to one of six English speakers. The six speakers' first languages were Mandarin, Russian/Ukrainian, Tamil, Lusoga/Luganda, American English, and British English. By comparing the six student groups, we examined their listener intelligibility (actual understanding), listener comprehensibility (feeling of ease or difficulty), accentedness perception (perceiving an accent as native or foreign), and accentedness acceptance (showing a positive or negative attitude toward an accent) of diverse English accents. The results showed that the intelligibility scores and perception/attitude ratings of participants favored the two speakers with privileged accents, the American and British speakers. However, across all six groups, no correlation was detected between their actual understanding of the speakers and their perception/attitude ratings, which often had a strong correlation with their feelings of ease/difficulty regarding the speakers' accents. Taken together, our results suggest that current English education needs innovation to be more aligned with the national syllabus, which promotes a global perspective. That is, students need to be guided to improve their actual understanding of and sense of familiarity with Global Englishes speakers beyond the native accents that they prefer. Moreover, innovative pedagogical work should be undertaken to change Swedish youths' perceptions and attitudes and prepare them to become open-minded toward diverse English speakers.
chinese-fineweb-edu-v2
huggingface.co
Updated Mar 11, 2025
opencsg (2025). chinese-fineweb-edu-v2 [Dataset]. https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2
Dataset updated
Mar 11, 2025
Dataset authored and provided by
opencsg
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
We recommend using the improved version, Fineweb-edu-chinese-v2.1!

Chinese Fineweb Edu Dataset V2

Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
Magpie-Qwen2-Pro-200K-Chinese
huggingface.co
Updated Jun 27, 2024
Magpie Alignment (2024). Magpie-Qwen2-Pro-200K-Chinese [Dataset]. https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese
Dataset updated
Jun 27, 2024
Dataset authored and provided by
Magpie Alignment
Description
Project Web: https://magpie-align.github.io/
Arxiv Technical Report: https://arxiv.org/abs/2406.08464
Codes: https://github.com/magpie-align/magpie

Abstract

High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent… See the full description on the dataset page: https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese.
AISHELL-3
huggingface.co
Updated Feb 20, 2025
沈云航 Yunhang Shen (2025). AISHELL-3 [Dataset]. https://huggingface.co/datasets/shenyunhang/AISHELL-3
Dataset updated
Feb 20, 2025
Authors
沈云航 Yunhang Shen
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
AISHELL-3

Identifier: SLR93

Summary: Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.

Category: Speech

License: Apache License v.2.0

Downloads (use a mirror closer to you): data_aishell3.tgz (19G). Mirrors: [US], [EU], [CN]

About this resource: AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. It can be used to train… See the full description on the dataset page: https://huggingface.co/datasets/shenyunhang/AISHELL-3.
iwslt2017
huggingface.co
opendatalab.com
International Conference on Spoken Language Translation, iwslt2017 [Dataset]. https://huggingface.co/datasets/IWSLT/iwslt2017
Dataset authored and provided by
International Conference on Spoken Language Translation
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions including English, German, Dutch, Italian and Romanian. As an unofficial task, conventional bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.
chinese-dolly-15k
huggingface.co
chinese-dolly-15k [Dataset]. https://huggingface.co/datasets/DavidLanz/chinese-dolly-15k
Authors
David Lanz
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
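Chinese-Dolly-15k is a Traditional Chinese translation of the Dolly instruction dataset (Databricks). The original dataset, 'databricks/databricks-dolly-15k', is an open-source dataset of instruction-following records generated by thousands of Databricks employees across several behavioral categories outlined in the InstructGPT paper. These categories include brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, this dataset may be used for any academic or commercial purpose. If you are also preparing datasets of this kind, feel free to contact us to avoid duplicated spending.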

Citation

Please cite the repo if you use the data or code in this repo. @misc{alpaca, author = {DavidLanz}, title = {An Instruction-following Chinese Language model, LoRA tuning on… See the full description on the dataset page: https://huggingface.co/datasets/DavidLanz/chinese-dolly-15k.
TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
catalog.elra.info
catalogue.elra.info
Updated Sep 7, 2007
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-E0015_01/
Dataset updated
Sep 7, 2007
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf
Description
TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.
The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign:
• Automatic Speech Recognition (ASR),
• Spoken Language Translation (SLT),
• Text to Speech (TTS).
Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.
This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR, and for SLT in 2 other directions, English-to-Spanish (ELRA-E0014) and Chinese-to-English (ELRA-E0016), as well as for the EPPS task for Spanish-to-English (ELRA-E0015/02).
To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: the EPPS (European Parliament Plenary Sessions) task, the CORTES (Spanish Parliament Sessions) task and the VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.
This package was used within the CORTES task and consists of 2 data sets:
- Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 25,000 words were selected from the CORTES verbatim transcriptions and from the CORTES Final Text Edition documents. The source texts were then translated into English by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
- Test data set: the same procedure as for the development set was followed to produce the test data, i.e., subsets of 25,000 words were selected from the test data set (CORTES sessions on 24 November 2005), both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into English by two independent agencies.
Fineweb-Edu-Chinese-V2.1
huggingface.co
Updated Mar 11, 2025
opencsg (2025). Fineweb-Edu-Chinese-V2.1 [Dataset]. https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1
Dataset updated
Mar 11, 2025
Dataset authored and provided by
opencsg
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Chinese Fineweb Edu Dataset V2.1

The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.
TinyEncyclopedias-Chinese
huggingface.co
Updated Jun 14, 2024
fangzhangmnm (2024). TinyEncyclopedias-Chinese [Dataset]. https://huggingface.co/datasets/fzmnm/TinyEncyclopedias-Chinese
Dataset updated
Jun 14, 2024
Authors
fangzhangmnm
License
https://choosealicense.com/licenses/cc/
Area covered
China
Description
This dataset is no longer updated; please go to https://huggingface.co/datasets/fzmnm/TinyStoriesAdv-zh

TinyEncyclopediasChinese

Inspired by the papers TinyStories (https://arxiv.org/abs/2305.07759) and Textbooks Are All You Need (https://arxiv.org/abs/2306.11644), where a small language model exhibits strong capabilities when trained on high-quality, kid-friendly stories synthesized by AI, I present an AI-generated Encyclopedia suitable for kindergarten and grade school levels. This dataset follows my previous… See the full description on the dataset page: https://huggingface.co/datasets/fzmnm/TinyEncyclopedias-Chinese.
Chinese-EcomQA
huggingface.co
Updated Mar 18, 2025
Chinese-EcomQA [Dataset]. https://huggingface.co/datasets/OpenStellarTeam/Chinese-EcomQA
Dataset updated
Mar 18, 2025
Authors
OpenStellarTeam
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Overview

ChineseEcomQA is a scalable question-answering benchmark focused on fundamental e-commerce concepts. Specifically, our benchmark is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality and E-commerce Expertise. Please visit our website or check our paper for more details.

💫 Introduction

With the increasing use of Large Language Models (LLMs) in fields such as e-commerce… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/Chinese-EcomQA.
Ranking of languages spoken at home in the U.S. 2022

15 scholarly articles cite this dataset (View in Google Scholar)
