4 datasets found

common_corpus
huggingface.co
Updated Nov 13, 2024
Cite
PleIAs (2024). common_corpus [Dataset]. https://huggingface.co/datasets/PleIAs/common_corpus
Dataset authored and provided by
PleIAs
Description
Common Corpus

Full data paper

Common Corpus is the largest open and permissively licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners and contributed in-kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
Post-OCR-Correction
huggingface.co
opendatalab.com
Updated Apr 26, 2024
Cite
PleIAs (2024). Post-OCR-Correction [Dataset]. https://huggingface.co/datasets/PleIAs/Post-OCR-Correction
Dataset authored and provided by
PleIAs
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Post-OCR correction is a large corpus of 1 billion words containing original texts with a varying number of OCR mistakes and an experimental multilingual post-OCR correction output created by Pleias. Generation of Post-OCR correction was performed using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

Description

All the texts come from collections integrated into Common Corpus, the largest open corpus for pretraining previously released by Pleias on HuggingFace.… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Post-OCR-Correction.
YouTube-Commons
huggingface.co
Updated Apr 17, 2024
Cite
PleIAs (2024). YouTube-Commons [Dataset]. https://huggingface.co/datasets/PleIAs/YouTube-Commons
Dataset authored and provided by
PleIAs
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
📺 YouTube-Commons 📺

YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-BY license.

Content

The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.
The-Obsidian
huggingface.co
Cite
Yuchen Xie, The-Obsidian [Dataset]. https://huggingface.co/datasets/yuchenxie/The-Obsidian
Authors
Yuchen Xie
License
https://choosealicense.com/licenses/odc-by/
Description
Pretraining set used to pretrain Arlow. Partially uploaded.

This dataset is a mixture of datasets coming from:

Huggingface FineWeb, Huggingface FineWeb 2, PleIAs Common Corpus, Open Math Text, TinyStories, AutoMathText
PleIAs/common_corpus: 5 scholarly articles cite this dataset.
