5 datasets found

finepdfs
huggingface.co
Cite
FineData, finepdfs [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/finepdfs
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/
Description
Liberating 3T of the finest tokens from PDFs

What is this?

As we run out of web pages to process, the natural question has always been: what to do next? Only a few knew about a data source that everyone avoided for ages, due to its incredible extraction cost and complexity: PDFs. 📄 FinePDFs is exactly that. It is the largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages. Compared to HTML… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finepdfs.
finepdfs-1B
huggingface.co
Updated Sep 8, 2025
Cite
Asankhaya Sharma (2025). finepdfs-1B [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-1B
Dataset updated
Sep 8, 2025
Authors
Asankhaya Sharma
Description
codelion/finepdfs-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
finepdfs-100M
huggingface.co
Updated Sep 8, 2025
Cite
Asankhaya Sharma (2025). finepdfs-100M [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-100M
Dataset updated
Sep 8, 2025
Authors
Asankhaya Sharma
Description
codelion/finepdfs-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
finepdfs-10M
huggingface.co
Updated Sep 8, 2025
Cite
Asankhaya Sharma (2025). finepdfs-10M [Dataset]. https://huggingface.co/datasets/codelion/finepdfs-10M
Dataset updated
Sep 8, 2025
Authors
Asankhaya Sharma
Description
codelion/finepdfs-10M dataset hosted on Hugging Face and contributed by the HF Datasets community
finepdfs-traditional-chinese
huggingface.co
Cite
Jed Cheng, finepdfs-traditional-chinese [Dataset]. https://huggingface.co/datasets/jed351/finepdfs-traditional-chinese
Authors
Jed Cheng
Description
I downloaded and filtered the finepdfs dataset to extract Traditional Chinese content.