18 datasets found
  1. Ranking of languages spoken at home in the U.S. 2022

    • statista.com
    Updated Dec 9, 2024
    Statista (2024). Ranking of languages spoken at home in the U.S. 2022 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Dataset updated
    Dec 9, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    United States
    Description

    In 2022, around 42.03 million people in the United States spoke Spanish at home. In comparison, approximately 974,829 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  2. ChildMandarin

    • huggingface.co
    Updated Mar 18, 2025
    Beijing Academy of Artificial Intelligence (2025). ChildMandarin [Dataset]. https://huggingface.co/datasets/BAAI/ChildMandarin
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    Beijing Academy of Artificial Intelligence
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

      Introduction
    

    ChildMandarin is a comprehensive, open-source Mandarin Chinese speech dataset specifically designed for research on young children aged 3 to 5. This dataset addresses the critical lack of publicly available resources for this age group, enabling advancements in automatic speech recognition (ASR), speaker verification (SV), and other related fields. The dataset is released… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/ChildMandarin.

  3. cantonese-mandarin-translations

    • huggingface.co
    • hf-proxy-cf.effarig.site
    Updated Dec 14, 2021
    Botisan AI (2021). cantonese-mandarin-translations [Dataset]. https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations
    Dataset updated
    Dec 14, 2021
    Dataset authored and provided by
    Botisan AI
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for cantonese-mandarin-translations

      Dataset Summary
    

    This is a machine-translated parallel corpus between Cantonese (a Chinese dialect mainly spoken in Guangdong (a province of China), Hong Kong, Macau, and parts of Malaysia) and Chinese (written form, in Simplified Chinese).

      Supported Tasks and Leaderboards
    

    N/A

      Languages
    

    Cantonese (yue), Simplified Chinese (zh-CN)

      Dataset Structure
    

    JSON lines with yue… See the full description on the dataset page: https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations.

  4. GlobalPhone Chinese-Shanghai

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0194/
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    Shanghai
    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshortened.

    The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.

  5. clue

    • huggingface.co
    CLUE benchmark (2020). clue [Dataset]. https://huggingface.co/datasets/clue/clue
    Dataset authored and provided by
    CLUE benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "clue"

      Dataset Summary
    

    CLUE, A Chinese Language Understanding Evaluation Benchmark (https://www.cluebenchmarks.com/) is a collection of resources for training, evaluating, and analyzing Chinese language understanding systems.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure

      Data Instances

      afqmc

    Size of downloaded… See the full description on the dataset page: https://huggingface.co/datasets/clue/clue.

  6. un_pc

    • huggingface.co
    + more versions
    un_pc [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/un_pc
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for United Nations Parallel Corpus

      Dataset Summary
    

    The United Nations Parallel Corpus is the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/un_pc.

  7. MDT Mandarin Chinese Conversational Recognition Corpus – 2 channels

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 20, 2020
    + more versions
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2020). MDT Mandarin Chinese Conversational Recognition Corpus – 2 channels [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0409_03/
    Dataset updated
    May 20, 2020
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via microphones, one for each speaker, as well as three far-field channels recorded by an iPhone, an Android phone, and a recorder respectively. This corpus may be obtained as a complete set or by selecting specific channels (the two close-talking channels count as a single channel):
    - MDT Mandarin Chinese Conversational Recognition Corpus - complete set (ELRA-S0409-01)
    - MDT Mandarin Chinese Conversational Recognition Corpus - 1 channel (ELRA-S0409-02)
    - MDT Mandarin Chinese Conversational Recognition Corpus - 2 channels (ELRA-S0409-03)
    - MDT Mandarin Chinese Conversational Recognition Corpus - 3 channels (ELRA-S0409-04)
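
As a rough sanity check on the figures in this description, the uncompressed size of one channel can be estimated from the duration, sampling rate, and bit depth. This is only a sketch: the function name `pcm_bytes` is hypothetical, and it assumes plain linear PCM with no headers or compression and that the 4.98 hours applies to each channel, neither of which the listing confirms.

```python
def pcm_bytes(hours: float, sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Estimate the size in bytes of uncompressed linear PCM audio (assumed format)."""
    # Total number of samples in one channel over the whole duration.
    samples = round(hours * 3600 * sample_rate_hz)
    # Each sample occupies bits_per_sample / 8 bytes; multiply by the channel count.
    return samples * (bits_per_sample // 8) * channels


# Figures taken from the description: 4.98 hours, 16 kHz, 16 bits.
one_channel = pcm_bytes(4.98, 16_000, 16)
print(f"{one_channel / 1e6:.1f} MB per channel")
```

The actual download size will differ if the distribution uses a container format, compression, or a different duration per channel.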

  8. h2o-translated-chinese-med-prompts

    • huggingface.co
    Updated Jul 1, 2023
    H2O.ai (2023). h2o-translated-chinese-med-prompts [Dataset]. https://huggingface.co/datasets/h2oai/h2o-translated-chinese-med-prompts
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    H2O.ai, Inc.
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Translated Chinese Medical Prompts

    This repository contains medical prompts originally written in Chinese and translated into English, which can be used as training data for English-language natural language processing (NLP) tasks in the medical domain.

      Dataset Description

    The dataset consists of a collection of medical prompts originally in Chinese, which have been translated into English. These prompts cover various medical topics, including symptoms, diagnoses, treatments, medications… See the full description on the dataset page: https://huggingface.co/datasets/h2oai/h2o-translated-chinese-med-prompts.

  9. Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 10, 2023
    + more versions
    Hyeseung Jeong; Anna Elgemark; Bosse Thorén (2023). Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse Accents: Listener Intelligibility, Listener Comprehensibility, Accentedness Perception, and Accentedness Acceptance.XLSX [Dataset]. http://doi.org/10.3389/feduc.2021.651908.s003
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Hyeseung Jeong; Anna Elgemark; Bosse Thorén
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As reflected in the concept of Global Englishes, English mediates global communication, where English speakers represent not merely those from English-speaking countries like the United Kingdom or the United States but also global people from a wide range of linguistic backgrounds, who speak the language with diverse accents. Thus, to communicate internationally, cultivating a maximized listening proficiency for and positive attitudes toward Global Englishes speakers with diverse accents is ever more important. However, with their preference for American English and its popular culture, it is uncertain whether Swedish youth learners are developing these key linguistic qualities to be prepared for the globalized use of English. To address this, we randomly assigned 160 upper secondary students (mean age = 17.25) into six groups, where each group listened to one of six English speakers. The six speakers' first languages were Mandarin, Russian/Ukrainian, Tamil, Lusoga/Luganda, American English, and British English. Through comparing the six student groups, we examined their listener intelligibility (actual understanding), listener comprehensibility (feeling of ease or difficulty), accentedness perception (perceiving an accent as native or foreign), and accentedness acceptance (showing a positive or negative attitude toward an accent) of diverse English accents. The results showed that the intelligibility scores and perception/attitude ratings of participants favored the two speakers with privileged accents–the American and British speakers. However, across all six groups, no correlation was detected between their actual understanding of the speakers and their perception/attitude ratings, which often had a strong correlation with their feelings of ease/difficulty regarding the speakers' accents. Taken together, our results suggest that the current English education needs innovation to be more aligned with the national syllabus that promotes a global perspective. That is, students need to be guided to improve their actual understanding and sense of familiarity with Global English speakers besides the native accents that they prefer. Moreover, innovative pedagogical work should be undertaken to change Swedish youths’ perceptions and attitudes and prepare them to become open-minded toward diverse English speakers.

  10. chinese-fineweb-edu-v2

    • huggingface.co
    Updated Mar 11, 2025
    + more versions
    opencsg (2025). chinese-fineweb-edu-v2 [Dataset]. https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2
    Dataset updated
    Mar 11, 2025
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We recommend using the improved version, Fineweb-edu-chinese-v2.1!

      Chinese Fineweb Edu Dataset V2

    Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.

  11. Magpie-Qwen2-Pro-200K-Chinese

    • huggingface.co
    Updated Jun 27, 2024
    + more versions
    Magpie Alignment (2024). Magpie-Qwen2-Pro-200K-Chinese [Dataset]. https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese
    Dataset updated
    Jun 27, 2024
    Dataset authored and provided by
    Magpie Alignment
    Description

    Project Web: https://magpie-align.github.io/
    Arxiv Technical Report: https://arxiv.org/abs/2406.08464
    Codes: https://github.com/magpie-align/magpie

      Abstract
    

    High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent… See the full description on the dataset page: https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese.

  12. AISHELL-3

    • huggingface.co
    Updated Feb 20, 2025
    + more versions
    沈云航 Yunhang Shen (2025). AISHELL-3 [Dataset]. https://huggingface.co/datasets/shenyunhang/AISHELL-3
    Dataset updated
    Feb 20, 2025
    Authors
    沈云航 Yunhang Shen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    AISHELL-3

    Identifier: SLR93

    Summary: Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.

    Category: Speech

    License: Apache License v.2.0

    Downloads (use a mirror closer to you): data_aishell3.tgz 19G Mirrors: [US] [EU] [CN]

    About this resource: AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. It can be used to train… See the full description on the dataset page: https://huggingface.co/datasets/shenyunhang/AISHELL-3.

  13. iwslt2017

    • huggingface.co
    • opendatalab.com
    International Conference on Spoken Language Translation, iwslt2017 [Dataset]. https://huggingface.co/datasets/IWSLT/iwslt2017
    Dataset authored and provided by
    International Conference on Spoken Language Translation
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The IWSLT 2017 Multilingual Task addresses text translation, including zero-shot translation, with a single MT system across all directions including English, German, Dutch, Italian and Romanian. As an unofficial task, conventional bilingual text translation is offered between English and Arabic, French, Japanese, Chinese, German and Korean.

  14. chinese-dolly-15k

    • huggingface.co
    + more versions
    chinese-dolly-15k [Dataset]. https://huggingface.co/datasets/DavidLanz/chinese-dolly-15k
    Authors
    David Lanz
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

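    Chinese-Dolly-15k is a Traditional Chinese translation of the Dolly instruction dataset (Databricks). The original dataset, 'databricks/databricks-dolly-15k', is an open-source dataset of instruction-following records generated by thousands of Databricks employees across several behavioral categories outlined in the InstructGPT paper. These categories include brainstorming, classification, closed question answering, generation, information extraction, open question answering, and summarization. Under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, this dataset can be used for any academic or commercial purpose. If you are also preparing datasets like these, feel free to contact us to avoid duplicating costs.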

      Citation
    

    Please cite the repo if you use the data or code in this repo. @misc{alpaca, author = {DavidLanz}, title = {An Instruction-following Chinese Language model, LoRA tuning on… See the full description on the dataset page: https://huggingface.co/datasets/DavidLanz/chinese-dolly-15k.

  15. TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES

    • catalog.elra.info
    • catalogue.elra.info
    • +1 more
    Updated Sep 7, 2007
    + more versions
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-E0015_01/
    Dataset updated
    Sep 7, 2007
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf

    Description

    TC-STAR is a European integrated project focusing on Speech-to-Speech Translation (SST). To encourage significant breakthroughs in all SST technologies, annual open competitive evaluations are organized. Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-To-Speech (TTS) are evaluated independently and within an end-to-end system.

    The second TC-STAR evaluation campaign took place in March 2006. Three core technologies were evaluated during the campaign:
    • Automatic Speech Recognition (ASR),
    • Spoken Language Translation (SLT),
    • Text to Speech (TTS).

    Each evaluation package includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the second evaluation campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    The speech databases made within the TC-STAR project were validated by SPEX, in the Netherlands, to assess their compliance with the TC-STAR format and content specifications.

    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task. The same packages are available for English (ELRA-E0011), Spanish (ELRA-E0012) and Mandarin Chinese (ELRA-E0013) for ASR, and for SLT in 2 other directions, English-to-Spanish (ELRA-E0014) and Chinese-to-English (ELRA-E0016), as well as for the EPPS task for Spanish-to-English (ELRA-E0015/02).

    To be able to chain the components, ASR, SLT and TTS evaluation tasks were designed to use common sets of raw data and conditions. Three evaluation tasks, common to ASR, SLT and TTS, were selected: EPPS (European Parliament Plenary Sessions) task, CORTES (Spanish Parliament Sessions) task and VOA (Voice of America) task. The CORTES data were used in addition to the EPPS data to evaluate ASR in Spanish and SLT from Spanish into English.

    This package was used within the CORTES task and consists of 2 data sets:
    - Development data set: built upon the ASR development data set, in order to enable end-to-end evaluation. Subsets of 25,000 words were selected from the CORTES verbatim transcriptions and from the CORTES Final Text Edition documents. The source texts were then translated into English by two independent translation agencies. All source text sets and reference translations were formatted using the same SGML DTD that has been used for the NIST Machine Translation evaluations.
    - Test data set: as for the development set, the same procedure was followed to produce the test data, i.e., subsets of 25,000 words were selected from the test data set (CORTES sessions on 24 November 2005) both from the manual transcriptions and from the Final Text Edition documents. The source data were then translated into English by two independent agencies.

  16. Fineweb-Edu-Chinese-V2.1

    • huggingface.co
    Updated Mar 11, 2025
    opencsg (2025). Fineweb-Edu-Chinese-V2.1 [Dataset]. https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1
    Dataset updated
    Mar 11, 2025
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Chinese Fineweb Edu Dataset V2.1

    The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.

  17. TinyEncyclopedias-Chinese

    • huggingface.co
    Updated Jun 14, 2024
    + more versions
    fangzhangmnm (2024). TinyEncyclopedias-Chinese [Dataset]. https://huggingface.co/datasets/fzmnm/TinyEncyclopedias-Chinese
    Dataset updated
    Jun 14, 2024
    Authors
    fangzhangmnm
    License

    https://choosealicense.com/licenses/cc/

    Area covered
    China
    Description

    This dataset is no longer being updated; please go to https://huggingface.co/datasets/fzmnm/TinyStoriesAdv-zh

      TinyEncyclopediasChinese
    

    Inspired by the papers TinyStories (https://arxiv.org/abs/2305.07759) and Textbooks Are All You Need (https://arxiv.org/abs/2306.11644), where a small language model exhibits strong capabilities when trained on high-quality, kid-friendly stories synthesized by AI, I present an AI-generated Encyclopedia suitable for kindergarten and grade school levels. This dataset follows my previous… See the full description on the dataset page: https://huggingface.co/datasets/fzmnm/TinyEncyclopedias-Chinese.

  18. Chinese-EcomQA

    • huggingface.co
    Updated Mar 18, 2025
    Chinese-EcomQA [Dataset]. https://huggingface.co/datasets/OpenStellarTeam/Chinese-EcomQA
    Dataset updated
    Mar 18, 2025
    Authors
    OpenStellarTeam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Overview

    ChineseEcomQA is a scalable question-answering benchmark focused on fundamental e-commerce concepts. Specifically, our benchmark is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality and E-commerce Expertise. Please visit our website or check our paper for more details.

      💫 Introduction
    

    With the increasing use of Large Language Models (LLMs) in fields such as e-commerce… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/Chinese-EcomQA.


Ranking of languages spoken at home in the U.S. 2022

15 scholarly articles cite this dataset (View in Google Scholar)