https://choosealicense.com/licenses/cc0-1.0/
Data release 1.2 of the monolingual portion of HPLT (December 2023)
There are 75 languages in this release (22 TB of raw files, 11 TB of deduplicated files, and 8.4 TB of clean files), provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few GB each; the number of shards per language depends on the size of that language's corpus.
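As a rough sketch of how such a shard can be consumed, the snippet below streams records from one zstd-compressed JSONL file without decompressing it fully; the shard name is a placeholder, and the record fields are whatever the release actually provides.

```python
import io
import json

import zstandard  # pip install zstandard


def iter_jsonl_zst(path):
    """Stream records from a zstd-compressed JSONL shard one at a time."""
    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)


# "vie_1.jsonl.zst" is a placeholder shard name, not an actual file from the release.
for record in iter_jsonl_zst("vie_1.jsonl.zst"):
    print(sorted(record))
    break
```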
https://choosealicense.com/licenses/cc0-1.0/
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly the Internet Archive, with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0
The Cleaned variant of HPLT Datasets v2.0
This is the cleaned variant of the HPLT Datasets v2.0, converted semi-automatically to the Parquet format when uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
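A minimal sketch for working with the Parquet variant through the Hugging Face datasets library is shown below; the configuration name "eng_Latn" and the "text" field are assumptions to be checked against the dataset card.

```python
from datasets import load_dataset  # pip install datasets

# Streaming avoids downloading the multi-terabyte corpus up front.
# The config name "eng_Latn" and the "text" field are assumed examples;
# check the dataset card for the exact per-language configuration names.
ds = load_dataset("HPLT/HPLT2.0_cleaned", "eng_Latn", split="train", streaming=True)
for doc in ds:
    print(doc["text"][:200])
    break
```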
NOTE: Newer (unfiltered) data is already available at
https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0
Vietnamese data from https://hplt-project.org/datasets/v1, with documents sourced from Common Crawl (CC) removed.
Statistics by domain:
SIZE        DOCS     DOMAIN
40855.5 MB  3586.6k  http://dongtrieu.edu.vn
30012.1 MB  112.8k   http://hamtruyentranh.net
… See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.
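Per-domain statistics like the table above can be reproduced with a simple aggregation pass; in the sketch below the input file name and the "url"/"text" record fields are assumptions, not details confirmed by this card.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

size_bytes = defaultdict(int)
doc_count = defaultdict(int)

# "hplt_vi_sample.jsonl" and the "url"/"text" fields are assumed names for illustration.
with open("hplt_vi_sample.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        domain = urlparse(rec["url"]).netloc
        size_bytes[domain] += len(rec["text"].encode("utf-8"))
        doc_count[domain] += 1

for domain in sorted(size_bytes, key=size_bytes.get, reverse=True)[:10]:
    print(f"{size_bytes[domain] / 2**20:10.1f} MB {doc_count[domain] / 1000:8.1f}k {domain}")
```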
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for hplt-v1.2_urls
This dataset provides the URLs and top-level domains associated with training records in HPLT v1.2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers. In doing so, it allows… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/hplt-v1.2_urls.
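A minimal sketch of the kind of URL and top-level-domain extraction described above, using only the standard library; the actual pipeline may rely on a public-suffix-aware tool instead.

```python
from urllib.parse import urlparse


def url_and_tld(url: str) -> tuple[str, str]:
    """Return the URL together with a naive top-level domain.

    The suffix split below is deliberately simple; a production pipeline would
    more likely use a public-suffix-aware library such as tldextract.
    """
    host = urlparse(url).netloc.lower()
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    return url, tld


print(url_and_tld("http://dongtrieu.edu.vn/some/page"))  # ('http://dongtrieu.edu.vn/some/page', 'vn')
```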
https://choosealicense.com/licenses/cc0-1.0/
At the moment, the data has gated access. We plan to make it openly accessible in late September.
DocHPLT: A Massively Multilingual Document-Level Translation Dataset
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/DocHPLT.
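While access is gated, loading requires an authenticated Hugging Face session; the sketch below assumes the datasets library and a plain train split, both of which should be verified against the card once access is granted.

```python
from datasets import load_dataset
from huggingface_hub import login

login()  # paste a token for an account that has been granted access to the gated repo

# The split name and the absence of a config are assumptions; check the card once access opens.
docs = load_dataset("HPLT/DocHPLT", split="train", streaming=True)
```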
NiamaLynn/hplt-reduced dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc0-1.0/
Geralt-Targaryen/HPLT-zh dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for UAQuAD
This is a revised version of the Ukrainian SQuAD dataset intended for internal use in the HPLT project. The dataset is constructed as follows:
Examples in which the answer appears in the passage more than once are discarded to prevent potential generation of frequent spans. Examples whose answer occurs more than once across the dataset are filtered out to prevent potential span-frequency bias in few-shot regimes. The answer spans are… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-squad.
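The two filters described above can be expressed roughly as follows; the "context" and "answer" field names are assumptions about the record layout.

```python
from collections import Counter


def filter_uaquad(examples):
    """Apply the two filters described above to a list of QA examples.

    Assumed schema: each example is a dict with "context" and "answer" keys;
    the real record layout may differ.
    """
    # 1. Keep only examples whose answer occurs exactly once in the passage.
    once_in_passage = [ex for ex in examples if ex["context"].count(ex["answer"]) == 1]
    # 2. Drop examples whose answer string occurs more than once across the dataset.
    answer_freq = Counter(ex["answer"] for ex in once_in_passage)
    return [ex for ex in once_in_passage if answer_freq[ex["answer"]] == 1]
```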
Dataset Card for UA-GEC
This is a revised version of the document-level Ukrainian Grammatical Error Correction (UA-GEC) dataset intended for internal use in the HPLT project. The dataset is constructed as follows:
The examples are extracted using the library. The source document can appear more than once. Examples where the source is the same as the target are removed. The remaining training and test examples are combined into one split. Note: no clear evaluation script is… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-gec.
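A rough sketch of the construction steps described above, operating on already-extracted (source, target) pairs since the extraction library itself is not named here.

```python
def build_combined_split(pairs):
    """Merge already-extracted (source, target) document pairs into one split.

    `pairs` stands in for whatever the extraction library yields; a source
    document may legitimately appear more than once, and pairs where the
    source equals the target are dropped, as described above.
    """
    return [(src, tgt) for src, tgt in pairs if src != tgt]


train_pairs = [("речення з помилкою", "речення без помилки"), ("без змін", "без змін")]
test_pairs = [("ще один документ", "ще один виправлений документ")]
combined = build_combined_split(train_pairs + test_pairs)  # -> two pairs survive
```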
https://choosealicense.com/licenses/cc0-1.0/
HPLT Dutch cleaned v1.2
Data creator: High Performance Language Technologies
Data URL: https://hplt-project.org/datasets/v1.2
Technical data description: https://hplt-project.org/HPLT_D2_1_Initial_release_of_monolingual_and_parallel_data_sets-1.pdf
Fields
id: Document ID
document_lang: Document language identified by CLD2 during the WARC extraction process.
scores: Language identification scores for each paragraph in the document.
langs: Language with highest… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/HPLT-Dutch-cleaned-v1.2.
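As an illustration of how the per-paragraph scores and langs fields might be used to keep only confidently Dutch paragraphs; the paragraph alignment, language label, and threshold below are all assumptions rather than facts from the card.

```python
def keep_dutch_paragraphs(doc, target_lang="nld", min_score=0.5):
    """Keep only paragraphs identified as Dutch with a sufficient score.

    Assumes doc["text"] holds newline-separated paragraphs aligned with the
    per-paragraph "scores" and "langs" lists, and that "nld" / 0.5 are a
    plausible label and threshold; all of these are assumptions.
    """
    paragraphs = doc["text"].split("\n")
    kept = [
        par
        for par, score, lang in zip(paragraphs, doc["scores"], doc["langs"])
        if lang == target_lang and score >= min_score
    ]
    return "\n".join(kept)
```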
https://choosealicense.com/licenses/other/
HPLT version 2.0 educational annotations
This dataset contains annotations derived from HPLT v2 cleaned samples. There are 500,000 annotations for each language if the source contains at least 500,000 samples. We prompt Llama-3.3-70B-Instruct to score web pages based on their educational value, following the FineWeb-Edu classifier. Note 1: The dataset contains the prompt (using the first 1500 characters of the text sample), the scores, and the full Llama 3 generation. The column "idx"… See the full description on the dataset page: https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation.
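A hedged sketch of the annotation setup: only the 1500-character truncation is taken from the description above, while the prompt wording and score parsing are illustrative stand-ins, not the released prompt.

```python
import re


def build_edu_prompt(text: str) -> str:
    """Build an illustrative educational-value prompt from a text sample.

    Only the 1500-character truncation follows the card above; the prompt
    wording itself is a stand-in, not the released prompt.
    """
    return (
        "Rate the educational value of the following web page extract on a "
        "scale from 0 to 5 and briefly justify the score.\n\n"
        f"Extract:\n{text[:1500]}\n\nScore:"
    )


def parse_score(generation: str) -> int | None:
    """Pull the first standalone digit 0-5 out of a model generation, if any."""
    match = re.search(r"\b([0-5])\b", generation)
    return int(match.group(1)) if match else None
```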
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Macedonian Corpus - Cleaned
raw version here Paper
🌟 Key Highlights
Size: 35.5 GB; word count: 3.31 billion
Filtered for irrelevant and low-quality content using C4 and Gopher filtering (see the sketch after this list).
Includes text from 10+ sources such as fineweb-2, HPLT-2, Wikipedia, and more.
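As a rough illustration of the kind of Gopher-style heuristics such filtering applies (the exact rules and thresholds used for this corpus are not stated here, so the values below are assumptions):

```python
def passes_gopher_style_filters(text: str) -> bool:
    """A very small subset of Gopher-style quality heuristics.

    The thresholds are illustrative defaults, not the values used to build
    the corpus above.
    """
    words = text.split()
    if not 50 <= len(words) <= 100_000:                            # document length bounds
        return False
    if sum(len(w) for w in words) / len(words) > 10:               # implausibly long average word
        return False
    if sum(text.count(s) for s in ("#", "…")) / len(words) > 0.1:  # symbol-to-word ratio
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    bulleted = sum(ln.lstrip().startswith(("-", "*", "•")) for ln in lines)
    if lines and bulleted / len(lines) > 0.9:                      # almost entirely bullet points
        return False
    return True
```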
📋 Overview
Macedonian is widely recognized as a low-resource language in the field of NLP. Publicly available resources in Macedonian are extremely limited, and as far as we know, no consolidated… See the full description on the dataset page: https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned.
https://choosealicense.com/licenses/other/
Hindi Corpus
Initial Crawl (1.6 TB)
Name             Metadata                                              Number of Samples  Token Count     License
Wikipedia        https://huggingface.co/datasets/wikimedia/wikipedia   163,093            70,298,707      CC-BY-SA-3.0
HPLT2.0_cleaned  https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned  13,651,945         11,133,627,510  cc0-1.0
C4               https://huggingface.co/datasets/allenai/c4            18,500,000         51,485,714,124  ODC-By
CC100            https://huggingface.co/datasets/statmt/cc100          103,537,752        2,350,198,697   Common Crawl terms of use
Davlan… See the full description on the dataset page: https://huggingface.co/datasets/Polygl0t/hindi-corpus-v2.
https://choosealicense.com/licenses/odc-by/
Helsinki-NLP/fineweb-edu-translated
Automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models.
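A minimal sketch of running one OPUS-MT checkpoint through the transformers pipeline; the model name is an illustrative example (English to Finnish), not necessarily one of the models used to build this corpus.

```python
from transformers import pipeline  # pip install transformers sentencepiece

# "Helsinki-NLP/opus-mt-en-fi" is an illustrative OPUS-MT checkpoint, chosen
# here only as an example of the model family named above.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
result = translator("Photosynthesis converts light energy into chemical energy.")
print(result[0]["translation_text"])
```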
https://choosealicense.com/licenses/pddl/
Overview
This is a small dataset of English-Belarusian sentence pairs sampled from the largest parallel corpora in OPUS (100 random instances from each of the following: NLLB, HPLT, CCMatrix, CCAligned) and manually labeled for correctness by a speaker of Belarusian. The taxonomy of labels follows Kreutzer et al. 2022:
CC: correct translation, natural sentence
CB: correct translation, boilerplate or low quality
CS: correct translation, short
X: incorrect translation
WL: wrong… See the full description on the dataset page: https://huggingface.co/datasets/somerandomguyontheweb/en_be_mt_datasets_evaluation.
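A small sketch of how 100 random pairs per corpus might be drawn for annotation, assuming Moses-style parallel text files; the file names and sampling details are assumptions, not the procedure actually used.

```python
import random


def sample_pairs(src_path, tgt_path, k=100, seed=0):
    """Sample k aligned sentence pairs from Moses-style parallel text files.

    Loading both sides into memory is fine for annotation-sized samples, but
    the full NLLB/CCMatrix-scale corpora would call for reservoir sampling.
    """
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        pairs = list(zip(fs, ft))
    random.seed(seed)
    return [(s.strip(), t.strip()) for s, t in random.sample(pairs, k)]


# Hypothetical file names for one OPUS corpus download:
# sample = sample_pairs("NLLB.en-be.en", "NLLB.en-be.be")
```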