https://choosealicense.com/licenses/odc-by/
Liberating 3T of the finest tokens from PDFs
What is this?
As we run out of web pages to process, the natural question has always been: what to do next? Only a few knew about a data source that everyone avoided for ages, due to its incredible extraction cost and complexity: PDFs. 📄 FinePDFs is exactly that. It is the largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages. Compared to HTML… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finepdfs.
codelion/finepdfs-1B dataset hosted on Hugging Face and contributed by the HF Datasets community
codelion/finepdfs-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
codelion/finepdfs-10M dataset hosted on Hugging Face and contributed by the HF Datasets community