2 datasets found
  1. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  2. fineweb-edu-score-2

    • huggingface.co
    Updated Jun 12, 2024
    Cite
    FineData (2024). fineweb-edu-score-2 [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu-score-2

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens (FineWeb-Edu) and 5.4T tokens of educational web pages filtered from the 🍷 FineWeb dataset. This is the 5.4 trillion version.

      Note: this version uses a lower educational score threshold = 2, which results in more documents, but lower quality compared to the 1.3T version. For more details check the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2.
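
The trade-off in the note above can be illustrated with a minimal sketch. The records, the `int_score` field name, and the stricter threshold of 3 are assumptions made for illustration only; the listing itself states only that this version uses a threshold of 2.

```python
# Hypothetical records: field names and scores are illustrative, not taken from the dataset.
docs = [
    {"text": "intro to photosynthesis", "int_score": 4},
    {"text": "celebrity gossip roundup", "int_score": 1},
    {"text": "forum thread on fractions", "int_score": 2},
    {"text": "lecture notes on sorting", "int_score": 3},
]


def filter_by_score(records, threshold):
    """Keep records whose educational score is at least `threshold`."""
    return [r for r in records if r["int_score"] >= threshold]


# Threshold 2 mirrors the note; threshold 3 is an assumed stricter setting.
lenient = filter_by_score(docs, 2)
strict = filter_by_score(docs, 3)

print(f"threshold 2 keeps {len(lenient)} documents")
print(f"threshold 3 keeps {len(strict)} documents")
```

Every document kept by the stricter threshold is also kept by the lower one, so lowering the threshold can only add documents, and the added ones are those with lower scores.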
    
fineweb-edu (FineWeb-Edu, HuggingFaceFW/fineweb-edu)

55 scholarly articles cite this dataset (View in Google Scholar)