https://choosealicense.com/licenses/cc0-1.0/
Data release 1.2 of the monolingual portion of HPLT (December 2023)
There are 75 languages in this release (22 TB of raw files, 11 TB of deduplicated files, and 8.4 TB of clean files), provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few GB each; the number of shards per language depends on the size of that language's corpus.
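As a rough sketch of how such a shard can be consumed, the snippet below streams records from one zstd-compressed JSONL file without decompressing it fully; the shard name is a placeholder, and the record fields are whatever the release actually provides.

```python
import io
import json

import zstandard  # pip install zstandard


def iter_jsonl_zst(path):
    """Stream records from a zstd-compressed JSONL shard one at a time."""
    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)


# "vie_1.jsonl.zst" is a placeholder shard name, not an actual file from the release.
for record in iter_jsonl_zst("vie_1.jsonl.zst"):
    print(sorted(record))
    break
```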
https://choosealicense.com/licenses/cc0-1.0/
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly the Internet Archive, with some additions from Common Crawl. For a detailed description of the dataset, please refer to https://hplt-project.org/datasets/v2.0
The Cleaned variant of HPLT Datasets v2.0
This is the cleaned variant of the HPLT Datasets v2.0, converted semi-automatically to the Parquet format when uploaded here. The original JSONL files… See the full description on the dataset page: https://huggingface.co/datasets/jobs-git/HPLT2.0_cleaned.
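A minimal sketch for working with the Parquet variant through the Hugging Face datasets library is shown below; the configuration name "eng_Latn" and the "text" field are assumptions to be checked against the dataset card.

```python
from datasets import load_dataset  # pip install datasets

# Streaming avoids downloading the multi-terabyte corpus up front.
# The config name "eng_Latn" and the "text" field are assumed examples;
# check the dataset card for the exact per-language configuration names.
ds = load_dataset("HPLT/HPLT2.0_cleaned", "eng_Latn", split="train", streaming=True)
for doc in ds:
    print(doc["text"][:200])
    break
```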
NOTE: Newer (unfiltered) data is already available at
https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0
Vietnamese data from https://hplt-project.org/datasets/v1, with documents sourced from Common Crawl (CC) removed.
Statistics by domain:
SIZE        DOCS     DOMAIN
40855.5 MB  3586.6k  http://dongtrieu.edu.vn
30012.1 MB  112.8k   http://hamtruyentranh.net
… See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.
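Per-domain statistics like the table above can be reproduced with a simple aggregation pass; in the sketch below the input file name and the "url"/"text" record fields are assumptions, not details confirmed by this card.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

size_bytes = defaultdict(int)
doc_count = defaultdict(int)

# "hplt_vi_sample.jsonl" and the "url"/"text" fields are assumed names for illustration.
with open("hplt_vi_sample.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        domain = urlparse(rec["url"]).netloc
        size_bytes[domain] += len(rec["text"].encode("utf-8"))
        doc_count[domain] += 1

for domain in sorted(size_bytes, key=size_bytes.get, reverse=True)[:10]:
    print(f"{size_bytes[domain] / 2**20:10.1f} MB {doc_count[domain] / 1000:8.1f}k {domain}")
```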
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for hplt-v1.2_urls
This dataset provides the URLs and top-level domains associated with training records in HPLT v1.2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers. In doing so, it allows… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/hplt-v1.2_urls.
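A minimal sketch of the kind of URL and top-level-domain extraction described above, using only the standard library; the actual pipeline may rely on a public-suffix-aware tool instead.

```python
from urllib.parse import urlparse


def url_and_tld(url: str) -> tuple[str, str]:
    """Return the URL together with a naive top-level domain.

    The suffix split below is deliberately simple; a production pipeline would
    more likely use a public-suffix-aware library such as tldextract.
    """
    host = urlparse(url).netloc.lower()
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    return url, tld


print(url_and_tld("http://dongtrieu.edu.vn/some/page"))  # ('http://dongtrieu.edu.vn/some/page', 'vn')
```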
https://choosealicense.com/licenses/cc0-1.0/
At the moment, the data has gated access. We plan to make it openly accessible in late September.
DocHPLT: A Massively Multilingual Document-Level Translation Dataset
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/DocHPLT.
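While access is gated, loading requires an authenticated Hugging Face session; the sketch below assumes the datasets library and a plain train split, both of which should be verified against the card once access is granted.

```python
from datasets import load_dataset
from huggingface_hub import login

login()  # paste a token for an account that has been granted access to the gated repo

# The split name and the absence of a config are assumptions; check the card once access opens.
docs = load_dataset("HPLT/DocHPLT", split="train", streaming=True)
```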
NiamaLynn/hplt-reduced dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc0-1.0/
Geralt-Targaryen/HPLT-zh dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for UAQuAD
This is a revised version of the Ukrainian SQuAD dataset intended for internal use in the HPLT project. The dataset is constructed as follows:
Examples in which the answer appears in the passage more than once are discarded to prevent potential generation of frequent spans. Examples whose answer occurs more than once across the dataset are filtered out to prevent potential span-frequency bias in few-shot regimes. The answer spans are… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-squad.
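The two filters described above can be expressed roughly as follows; the "context" and "answer" field names are assumptions about the record layout.

```python
from collections import Counter


def filter_uaquad(examples):
    """Apply the two filters described above to a list of QA examples.

    Assumed schema: each example is a dict with "context" and "answer" keys;
    the real record layout may differ.
    """
    # 1. Keep only examples whose answer occurs exactly once in the passage.
    once_in_passage = [ex for ex in examples if ex["context"].count(ex["answer"]) == 1]
    # 2. Drop examples whose answer string occurs more than once across the dataset.
    answer_freq = Counter(ex["answer"] for ex in once_in_passage)
    return [ex for ex in once_in_passage if answer_freq[ex["answer"]] == 1]
```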
Dataset Card for UA-GEC
This is a revised version of the document-level Ukrainian Grammatical Error Correction (UA-GEC) dataset intended for internal use in the HPLT project. The dataset is constructed as follows:
The examples are extracted using the library. The source document can appear more than once. Examples where the source is the same as the target are removed. The remaining training and test examples are combined into one split. Note: no clear evaluation script is… See the full description on the dataset page: https://huggingface.co/datasets/HPLT/ua-gec.
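A rough sketch of the construction steps described above, operating on already-extracted (source, target) pairs since the extraction library itself is not named here.

```python
def build_combined_split(pairs):
    """Merge already-extracted (source, target) document pairs into one split.

    `pairs` stands in for whatever the extraction library yields; a source
    document may legitimately appear more than once, and pairs where the
    source equals the target are dropped, as described above.
    """
    return [(src, tgt) for src, tgt in pairs if src != tgt]


train_pairs = [("речення з помилкою", "речення без помилки"), ("без змін", "без змін")]
test_pairs = [("ще один документ", "ще один виправлений документ")]
combined = build_combined_split(train_pairs + test_pairs)  # -> two pairs survive
```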
https://choosealicense.com/licenses/cc0-1.0/
HPLT Dutch cleaned v1.2
Data creator: High Performance Language Technologies
Data URL: https://hplt-project.org/datasets/v1.2
Technical data description: https://hplt-project.org/HPLT_D2_1_Initial_release_of_monolingual_and_parallel_data_sets-1.pdf
Fields
id: Document ID
document_lang: Document language identified by CLD2 during the WARC extraction process.
scores: Language identification scores for each paragraph in the document.
langs: Language with highest… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/HPLT-Dutch-cleaned-v1.2.
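As an illustration of how the per-paragraph scores and langs fields might be used to keep only confidently Dutch paragraphs; the paragraph alignment, language label, and threshold below are all assumptions rather than facts from the card.

```python
def keep_dutch_paragraphs(doc, target_lang="nld", min_score=0.5):
    """Keep only paragraphs identified as Dutch with a sufficient score.

    Assumes doc["text"] holds newline-separated paragraphs aligned with the
    per-paragraph "scores" and "langs" lists, and that "nld" / 0.5 are a
    plausible label and threshold; all of these are assumptions.
    """
    paragraphs = doc["text"].split("\n")
    kept = [
        par
        for par, score, lang in zip(paragraphs, doc["scores"], doc["langs"])
        if lang == target_lang and score >= min_score
    ]
    return "\n".join(kept)
```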
https://choosealicense.com/licenses/other/
HPLT version 2.0 educational annotations
This dataset contains annotations derived from HPLT v2 cleaned samples. There are 500,000 annotations for each language if the source contains at least 500,000 samples. We prompt Llama-3.3-70B-Instruct to score web pages based on their educational value, following the FineWeb-Edu classifier. Note 1: The dataset contains the prompt (using the first 1500 characters of the text sample), the scores, and the full Llama 3 generation. The column "idx"… See the full description on the dataset page: https://huggingface.co/datasets/LumiOpen/hpltv2-llama33-edu-annotation.
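A hedged sketch of the annotation setup: only the 1500-character truncation is taken from the description above, while the prompt wording and score parsing are illustrative stand-ins, not the released prompt.

```python
import re


def build_edu_prompt(text: str) -> str:
    """Build an illustrative educational-value prompt from a text sample.

    Only the 1500-character truncation follows the card above; the prompt
    wording itself is a stand-in, not the released prompt.
    """
    return (
        "Rate the educational value of the following web page extract on a "
        "scale from 0 to 5 and briefly justify the score.\n\n"
        f"Extract:\n{text[:1500]}\n\nScore:"
    )


def parse_score(generation: str) -> int | None:
    """Pull the first standalone digit 0-5 out of a model generation, if any."""
    match = re.search(r"\b([0-5])\b", generation)
    return int(match.group(1)) if match else None
```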
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Macedonian Corpus - Cleaned
raw version here Paper
🌟 Key Highlights
Size: 35.5 GB; word count: 3.31 billion
Filtered for irrelevant and low-quality content using C4 and Gopher filtering (see the sketch after this list).
Includes text from 10+ sources such as fineweb-2, HPLT-2, Wikipedia, and more.
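As a rough illustration of the kind of Gopher-style heuristics such filtering applies (the exact rules and thresholds used for this corpus are not stated here, so the values below are assumptions):

```python
def passes_gopher_style_filters(text: str) -> bool:
    """A very small subset of Gopher-style quality heuristics.

    The thresholds are illustrative defaults, not the values used to build
    the corpus above.
    """
    words = text.split()
    if not 50 <= len(words) <= 100_000:                            # document length bounds
        return False
    if sum(len(w) for w in words) / len(words) > 10:               # implausibly long average word
        return False
    if sum(text.count(s) for s in ("#", "…")) / len(words) > 0.1:  # symbol-to-word ratio
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    bulleted = sum(ln.lstrip().startswith(("-", "*", "•")) for ln in lines)
    if lines and bulleted / len(lines) > 0.9:                      # almost entirely bullet points
        return False
    return True
```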
📋 Overview
Macedonian is widely recognized as a low-resource language in the field of NLP. Publicly available resources in Macedonian are extremely limited, and as far as we know, no consolidated… See the full description on the dataset page: https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned.
https://choosealicense.com/licenses/other/
Hindi Corpus
Initial Crawl (1.6 TB)
Name             Metadata                                              Number of Samples  Token Count     License
Wikipedia        https://huggingface.co/datasets/wikimedia/wikipedia   163,093            70,298,707      CC-BY-SA-3.0
HPLT2.0_cleaned  https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned  13,651,945         11,133,627,510  cc0-1.0
C4               https://huggingface.co/datasets/allenai/c4            18,500,000         51,485,714,124  ODC-By
CC100            https://huggingface.co/datasets/statmt/cc100          103,537,752        2,350,198,697   Common Crawl terms of use
Davlan… See the full description on the dataset page: https://huggingface.co/datasets/Polygl0t/hindi-corpus-v2.
https://choosealicense.com/licenses/odc-by/
Helsinki-NLP/fineweb-edu-translated
Automatically translated documents from fineweb-edu. Translations are based on OPUS-MT and HPLT-MT models.
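A minimal sketch of running one OPUS-MT checkpoint through the transformers pipeline; the model name is an illustrative example (English to Finnish), not necessarily one of the models used to build this corpus.

```python
from transformers import pipeline  # pip install transformers sentencepiece

# "Helsinki-NLP/opus-mt-en-fi" is an illustrative OPUS-MT checkpoint, chosen
# here only as an example of the model family named above.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
result = translator("Photosynthesis converts light energy into chemical energy.")
print(result[0]["translation_text"])
```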
https://choosealicense.com/licenses/pddl/
Overview
This is a small dataset of English-Belarusian sentence pairs sampled from the largest parallel corpora in OPUS (100 random instances from each of the following: NLLB, HPLT, CCMatrix, CCAligned) and manually labeled for correctness by a speaker of Belarusian. The taxonomy of labels follows Kreutzer et al. 2022:
CC: correct translation, natural sentence
CB: correct translation, boilerplate or low quality
CS: correct translation, short
X: incorrect translation
WL: wrong… See the full description on the dataset page: https://huggingface.co/datasets/somerandomguyontheweb/en_be_mt_datasets_evaluation.
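A small sketch of how 100 random pairs per corpus might be drawn for annotation, assuming Moses-style parallel text files; the file names and sampling details are assumptions, not the procedure actually used.

```python
import random


def sample_pairs(src_path, tgt_path, k=100, seed=0):
    """Sample k aligned sentence pairs from Moses-style parallel text files.

    Loading both sides into memory is fine for annotation-sized samples, but
    the full NLLB/CCMatrix-scale corpora would call for reservoir sampling.
    """
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        pairs = list(zip(fs, ft))
    random.seed(seed)
    return [(s.strip(), t.strip()) for s, t in random.sample(pairs, k)]


# Hypothetical file names for one OPUS corpus download:
# sample = sample_pairs("NLLB.en-be.en", "NLLB.en-be.be")
```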