15 datasets found
  1. hplt_monolingual_v1_2

    • huggingface.co
    Updated Mar 18, 2024
    Cite
    HPLT (2024). hplt_monolingual_v1_2 [Dataset]. https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2
    Dataset updated
    Mar 18, 2024
    Dataset authored and provided by
    HPLT
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Data release 1.2 of the monolingual portion of HPLT (December 2023)

    There are 75 languages in this release (22 TB of raw files, 11 TB of deduplicated files, and 8.4 TB of clean files), provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few GB each. The number of shards per language depends on the size of the specific corpus.
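
    The shard layout lends itself to streaming. Below is a minimal Python sketch, not an official HPLT loader: it assumes a locally downloaded shard with a hypothetical name like nl_1.jsonl.zst and assumes each line is a JSON document with fields such as "url" and "text".

```python
# Minimal sketch, not an official HPLT loader. Assumes a locally downloaded
# shard (hypothetical filename "nl_1.jsonl.zst") and that each line holds one
# JSON document with fields like "url" and "text".
import io
import json

import zstandard as zstd  # pip install zstandard


def iter_documents(path):
    """Stream-decode a zstd-compressed JSONL shard, one document at a time."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)


for i, doc in enumerate(iter_documents("nl_1.jsonl.zst")):
    print(doc.get("url"), len(doc.get("text", "")))
    if i >= 2:
        break
```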

  2. HPLT2.0_cleaned

    • huggingface.co
    Updated Mar 7, 2025
    Cite
    James Guana (2025). HPLT2.0_cleaned [Dataset]. https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned
    Dataset updated
    Mar 7, 2025
    Authors
    James Guana
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly the Internet Archive, with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0.

      The Cleaned variant of HPLT Datasets v2.0

    This is the cleaned variant of the HPLT Datasets v2.0, converted to the Parquet format semi-automatically when being uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
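
    Since the cleaned variant is stored as Parquet, it can be streamed with the Hugging Face datasets library. The sketch below is a rough illustration; the config name "eng_Latn" is an assumption about the naming scheme, so check the dataset page for the actual configurations.

```python
# Minimal sketch, assuming per-language configurations named like "eng_Latn"
# (an assumption; see the dataset page for the actual config names).
# Streaming avoids downloading the full multi-terabyte corpus.
from datasets import load_dataset

ds = load_dataset(
    "HPLT/HPLT2.0_cleaned",  # or the jobs-git/HPLT2.0_cleaned mirror above
    name="eng_Latn",
    split="train",
    streaming=True,
)

for doc in ds.take(3):
    print(sorted(doc.keys()))
```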

  3. hplt-vi

    • huggingface.co
    Updated Oct 1, 2024
    Cite
    Symato Team (2024). hplt-vi [Dataset]. https://huggingface.co/datasets/Symato/hplt-vi
    Dataset updated
    Oct 1, 2024
    Dataset authored and provided by
    Symato Team
    Description

    NOTE: Newer (not yet filtered) data is already available at

    https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0

    Vietnamese-language data from https://hplt-project.org/datasets/v1, with documents originating from Common Crawl (CC) removed. Statistics by domain:

    SIZE       DOCS     DOMAIN
    40855.5mb  3586.6k  http://dongtrieu.edu.vn
    30012.1mb  112.8k   http://hamtruyentranh.net…

    See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.

  4. hplt-v1.2_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). hplt-v1.2_urls [Dataset]. http://doi.org/10.57967/hf/5499
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for hplt-v1.2_urls

    This dataset provides the URLs and top-level domains associated with training records in HPLT v1.2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description
    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers. In doing so, it allows… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/hplt-v1.2_urls.
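
    As a rough illustration of the kind of reduction described above (not the author's actual pipeline), the sketch below extracts the registered domain and public suffix from a URL using the third-party tldextract package.

```python
# Rough illustration only, not the pipeline used to build this dataset.
# tldextract is a third-party package (pip install tldextract).
import tldextract


def url_to_record(url: str) -> dict:
    parts = tldextract.extract(url)
    return {
        "url": url,
        "domain": f"{parts.domain}.{parts.suffix}",  # e.g. "example.co.uk"
        "suffix": parts.suffix,                      # public-suffix "TLD", e.g. "co.uk"
    }


print(url_to_record("https://blog.example.co.uk/some/page"))
```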

  5. DocHPLT

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    HPLT (2025). DocHPLT [Dataset]. https://huggingface.co/datasets/HPLT/DocHPLT
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    HPLT
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    At the moment the data has gated access. We plan to make it openly accessible in late September.

      DocHPLT: A Massively Multilingual Document-Level Translation Dataset
    

    Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/DocHPLT.

  6. hplt-reduced

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Lynn Alis (2025). hplt-reduced [Dataset]. https://huggingface.co/datasets/NiamaLynn/hplt-reduced
    Dataset updated
    Jul 8, 2025
    Authors
    Lynn Alis
    Description

    The NiamaLynn/hplt-reduced dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  7. HPLT-zh

    • huggingface.co
    Updated Jun 30, 2025
    Cite
    Ziyin Zhang (2025). HPLT-zh [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/HPLT-zh
    Dataset updated
    Jun 30, 2025
    Authors
    Ziyin Zhang
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Geralt-Targaryen/HPLT-zh dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  8. ua-squad

    • huggingface.co
    Cite
    HPLT, ua-squad [Dataset]. https://huggingface.co/datasets/HPLT/ua-squad
    Dataset authored and provided by
    HPLT
    Description

    Dataset Card for UAQuAD

    This is a revised version of the Ukrainian SQuAD dataset intended for internal use in the HPLT project. The dataset is constructed as follows:

    Examples whose answer appears in the passage more than once are discarded, to prevent potential generation of frequent spans. Examples whose answer occurs more than once across the whole dataset are filtered out, to prevent span-frequency bias in few-shot regimes. The answer spans are… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-squad.
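
    A minimal sketch of the two filters described above follows; the field names "context" and "answer" are assumptions, and this is not the official preparation script.

```python
# Minimal sketch of the two filters described above; "context" and "answer"
# are assumed field names, and this is not the official preparation script.
from collections import Counter


def filter_examples(examples):
    answer_counts = Counter(ex["answer"] for ex in examples)
    kept = []
    for ex in examples:
        if ex["context"].count(ex["answer"]) > 1:
            continue  # answer appears more than once in the passage
        if answer_counts[ex["answer"]] > 1:
            continue  # answer string repeats across the dataset
        kept.append(ex)
    return kept
```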

  9. ua-gec

    • huggingface.co
    Updated Jun 20, 2025
    Cite
    HPLT (2025). ua-gec [Dataset]. https://huggingface.co/datasets/HPLT/ua-gec
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    HPLT
    Description

    Dataset Card for UA-GEC

    This is a revised version of the document-level Ukrainian Grammatical Error Correction (UA-GEC) dataset intended for internal use in the HPLT project. The dataset is constructed as follows:

    The examples are extracted using the library. The source document can appear more than once. Examples where the source is the same as the target are removed. The remaining training and test examples are combined into one split. Note: no clear evaluation script is… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-gec.
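
    A minimal sketch of the construction steps listed above; "source" and "target" are assumed field names, and this is not the project's actual script.

```python
# Minimal sketch; "source" and "target" are assumed field names and this is
# not the project's actual script. Drops no-op pairs and merges the original
# train and test examples into a single split.
def build_single_split(train_examples, test_examples):
    merged = list(train_examples) + list(test_examples)
    return [ex for ex in merged if ex["source"] != ex["target"]]
```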

  10. HPLT-Dutch-cleaned-v1.2

    • huggingface.co
    Cite
    Bram Vanroy, HPLT-Dutch-cleaned-v1.2 [Dataset]. https://huggingface.co/datasets/BramVanroy/HPLT-Dutch-cleaned-v1.2
    Authors
    Bram Vanroy
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    HPLT Dutch cleaned v1.2

    Data creator: High Performance Language Technologies
    Data URL: https://hplt-project.org/datasets/v1.2
    Technical data description: https://hplt-project.org/HPLT_D2_1_Initial_release_of_monolingual_and_parallel_data_sets-1.pdf

      Fields
    

    id: Document ID
    document_lang: Document language identified by CLD2 during the WARC extraction process.
    scores: Language identification scores for each paragraph in the document.
    langs: Language with highest… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/HPLT-Dutch-cleaned-v1.2.
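
    A minimal sketch for streaming a few records and inspecting the fields listed above; the split name and the exact field types are assumptions, so defer to the dataset card.

```python
# Minimal sketch; the split name and field types are assumptions based on the
# field list above, so defer to the dataset card for specifics.
from datasets import load_dataset

ds = load_dataset(
    "BramVanroy/HPLT-Dutch-cleaned-v1.2",
    split="train",
    streaming=True,
)

for doc in ds.take(2):
    print(doc["id"], doc["document_lang"], doc["scores"][:3], doc["langs"][:3])
```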

  11. hpltv2-llama33-edu-annotation

    • huggingface.co
    Updated Dec 6, 2024
    Cite
    LumiOpen (2024). hpltv2-llama33-edu-annotation [Dataset]. https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation
    Dataset updated
    Dec 6, 2024
    Dataset authored and provided by
    LumiOpen
    License

    https://choosealicense.com/licenses/other/

    Description

    HPLT version 2.0 educational annotations

    This dataset contains annotations derived from HPLT v2 cleaned samples. There are 500,000 annotations for each language if the source contains at least 500,000 samples. We prompt Llama-3.3-70B-Instruct to score web pages based on their educational value, following the FineWeb-Edu classifier. Note 1: The dataset contains the prompt (using the first 1500 characters of the text sample), the scores, and the full Llama 3 generation. The column "idx"… See the full description on the dataset page: https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation.
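
    As a rough, hypothetical illustration of the setup described above (the project's actual prompt wording and scoring rubric are not reproduced here), a prompt builder that truncates a sample to its first 1500 characters might look like this:

```python
# Hypothetical illustration only; the project's actual prompt wording and
# rubric are not reproduced here. Mirrors the described setup of scoring the
# first 1500 characters of each text sample.
def build_annotation_prompt(text: str, max_chars: int = 1500) -> str:
    excerpt = text[:max_chars]
    return (
        "Rate the educational value of the following web page extract "
        "on a scale from 0 (none) to 5 (exceptional). "
        "Reply with a single integer.\n\n"
        f"Extract:\n{excerpt}"
    )
```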

  12. macedonian-corpus-cleaned

    • huggingface.co
    Updated Jan 20, 2025
    Cite
    LVSTCK (2025). macedonian-corpus-cleaned [Dataset]. https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned
    Dataset updated
    Jan 20, 2025
    Dataset authored and provided by
    LVSTCK
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Macedonian Corpus - Cleaned

    A raw version of this corpus and an accompanying paper are linked from the dataset page.

      🌟 Key Highlights
    

    Size: 35.5 GB; word count: 3.31 billion
    Filtered for irrelevant and low-quality content using C4 and Gopher filtering.
    Includes text from 10+ sources such as fineweb-2, HPLT-2, Wikipedia, and more.

      📋 Overview
    

    Macedonian is widely recognized as a low-resource language in the field of NLP. Publicly available resources in Macedonian are extremely limited, and as far as we know, no consolidated… See the full description on the dataset page: https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned.

  13. hindi-corpus-v2

    • huggingface.co
    Cite
    Polyglot, hindi-corpus-v2 [Dataset]. https://huggingface.co/datasets/Polygl0t/hindi-corpus-v2
    Dataset authored and provided by
    Polyglot
    License

    https://choosealicense.com/licenses/other/

    Description

    Hindi Corpus

      Initial Crawl (1.6 TB)
    

    Name             Metadata                                               Number of Samples  Token Count  License
    Wikipedia        https://huggingface.co/datasets/wikimedia/wikipedia   163093             70298707     CC-BY-SA-3.0
    HPLT2.0_cleaned  https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned  13651945           11133627510  cc0-1.0
    C4               https://huggingface.co/datasets/allenai/c4            18500000           51485714124  ODC-By
    CC100            https://huggingface.co/datasets/statmt/cc100          103537752          2350198697   Common Crawl terms of use
    Davlan… See the full description on the dataset page: https://huggingface.co/datasets/Polygl0t/hindi-corpus-v2.

  14. fineweb-edu-translated

    • huggingface.co
    Cite
    Language Technology Research Group at the University of Helsinki, fineweb-edu-translated [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/fineweb-edu-translated
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Helsinki-NLP/fineweb-edu-translated

    Automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models.

  15. en_be_mt_datasets_evaluation

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    somerandomguyontheweb (2025). en_be_mt_datasets_evaluation [Dataset]. https://huggingface.co/datasets/somerandomguyontheweb/en_be_mt_datasets_evaluation
    Dataset updated
    Jun 1, 2025
    Authors
    somerandomguyontheweb
    License

    https://choosealicense.com/licenses/pddl/

    Description

    Overview

    This is a small dataset of English-Belarusian sentence pairs sampled from the largest parallel corpora in OPUS (100 random instances from each of the following: NLLB, HPLT, CCMatrix, CCAligned) and manually labeled for correctness by a speaker of Belarusian. The taxonomy of labels follows Kreutzer et al. 2022:

    CC: correct translation, natural sentence
    CB: correct translation, boilerplate or low quality
    CS: correct translation, short
    X: incorrect translation
    WL: wrong… See the full description on the dataset page: https://huggingface.co/datasets/somerandomguyontheweb/en_be_mt_datasets_evaluation.

