5 datasets found
  1. finepdfs

    • huggingface.co
    FineData, finepdfs [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/finepdfs
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Liberating 3T of the finest tokens from PDFs

      What is this?
    

    As we run out of web pages to process, the natural question has always been: what to do next? Only a few knew about a data source that everyone avoided for ages, due to its incredible extraction cost and complexity: PDFs. 📄 FinePDFs is exactly that. It is the largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages. Compared to HTML… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finepdfs.

  2. finepdfs-1B

    • huggingface.co
    Updated Sep 8, 2025
    Asankhaya Sharma (2025). finepdfs-1B [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-1B
    Authors
    Asankhaya Sharma
    Description

    codelion/finepdfs-1B dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. finepdfs-100M

    • huggingface.co
    Updated Sep 8, 2025
    Asankhaya Sharma (2025). finepdfs-100M [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-100M
    Authors
    Asankhaya Sharma
    Description

    codelion/finepdfs-100M dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. finepdfs-10M

    • huggingface.co
    Updated Sep 8, 2025
    Asankhaya Sharma (2025). finepdfs-10M [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-10M
    Authors
    Asankhaya Sharma
    Description

    codelion/finepdfs-10M dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. finepdfs-traditional-chinese

    • huggingface.co
    Jed Cheng, finepdfs-traditional-chinese [Dataset]. https://huggingface.co/datasets/jed351/finepdfs-traditional-chinese
    Authors
    Jed Cheng
    Description

    I downloaded and filtered the FinePDFs dataset to extract Traditional Chinese content.
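
The FinePDFs description in the first result quotes three headline figures: about 3 trillion tokens, 475 million documents, and 1733 languages. As a rough sanity check on scale, the sketch below derives the implied averages from those rounded figures. The constant and function names are illustrative only, and the per-language value is a plain mean; the listing says nothing about how documents are actually distributed across languages.

```python
# Rounded figures quoted in the FinePDFs description (approximate by nature).
TOTAL_TOKENS = 3e12        # "about 3 trillion tokens"
TOTAL_DOCUMENTS = 475e6    # "475 million documents"
TOTAL_LANGUAGES = 1733     # "1733 languages"


def mean_per_unit(total: float, units: float) -> float:
    """Return the arithmetic mean of `total` spread over `units`."""
    return total / units


# Implied average document length, in tokens.
tokens_per_document = mean_per_unit(TOTAL_TOKENS, TOTAL_DOCUMENTS)

# Implied mean number of documents per language (not a real distribution).
documents_per_language = mean_per_unit(TOTAL_DOCUMENTS, TOTAL_LANGUAGES)

print(f"~{tokens_per_document:,.0f} tokens per document on average")
print(f"~{documents_per_language:,.0f} documents per language on average")
```

The first figure comes out at a little over six thousand tokens per document, which fits PDFs being long-form sources. Both numbers inherit the rounding of the quoted totals.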
