4 datasets found

  1. common_corpus

     • huggingface.co
     Updated Nov 13, 2024
     5 scholarly articles cite this dataset (View in Google Scholar)
     Cite
     PleIAs (2024). common_corpus [Dataset]. https://huggingface.co/datasets/PleIAs/common_corpus
     Dataset authored and provided by PleIAs
     Description

     Common Corpus

     Full data paper

     Common Corpus is the largest open and permissively licensed text dataset, comprising roughly 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners and contributed in-kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
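As a quick arithmetic sanity check (a sketch added here, not part of the original listing), the exact token count quoted above does round to the stated 2 trillion:

```python
# Exact token count quoted in the Common Corpus listing above.
tokens = 1_998_647_168_282

# Express it in trillions to confirm the "2 trillion tokens" headline figure.
trillions = tokens / 1e12
print(f"{trillions:.2f} trillion tokens")  # 2.00 trillion tokens
```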

  2. Post-OCR-Correction

     • huggingface.co
     • opendatalab.com
     Updated Apr 26, 2024
     Cite
     PleIAs (2024). Post-OCR-Correction [Dataset]. https://huggingface.co/datasets/PleIAs/Post-OCR-Correction
     Dataset authored and provided by PleIAs
     License

     CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

     Description

     Post-OCR-Correction is a large corpus of 1 billion words containing original texts with varying numbers of OCR mistakes, together with an experimental multilingual post-OCR correction output created by Pleias. Generation of the post-OCR correction was performed using HPC resources from GENCI-IDRIS (Grant 2023-AD011014736) on Jean Zay.

     All the texts come from collections integrated into Common Corpus, the largest open corpus for pretraining, previously released by Pleias on Hugging Face.… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Post-OCR-Correction.

  3. YouTube-Commons

     • huggingface.co
     Updated Apr 17, 2024
     Cite
     PleIAs (2024). YouTube-Commons [Dataset]. https://huggingface.co/datasets/PleIAs/YouTube-Commons
     Dataset authored and provided by PleIAs
     License

     Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
     License information was derived automatically

     Area covered
     YouTube
     Description

     📺 YouTube-Commons 📺

     YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC BY license.

     Content

     The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC BY license; the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.
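A similar back-of-the-envelope check (a sketch using only the figures quoted above, not part of the listing) confirms the "nearly 45 billion words" claim and shows the average transcript length it implies:

```python
# Figures quoted in the YouTube-Commons listing above.
transcripts = 22_709_724
videos = 3_156_703
words = 44_811_518_375

# Total word count in billions, and the average words per video it implies.
print(f"{words / 1e9:.1f} billion words")      # 44.8 billion words
print(f"{words // videos:,} words per video")  # 14,195 words per video
```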

  4. The-Obsidian

     • huggingface.co
     Cite
     Yuchen Xie. The-Obsidian [Dataset]. https://huggingface.co/datasets/yuchenxie/The-Obsidian
     Authors: Yuchen Xie
     License

     ODC-By (https://choosealicense.com/licenses/odc-by/)

     Description

     Pretraining set used to pretrain Arlow. Partially uploaded.

     This dataset is a mixture of datasets coming from:

     • Huggingface FineWeb
     • Huggingface FineWeb 2
     • PleIAs Common Corpus
     • Open Math Text
     • TinyStories
     • AutoMathText

