6 datasets found
  1. pile-arxiv-slm-l1sae420

    • huggingface.co
    Updated Mar 15, 2025
    Cite
    Timaeus (2025). pile-arxiv-slm-l1sae420 [Dataset]. https://huggingface.co/datasets/timaeus/pile-arxiv-slm-l1sae420
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Timaeus
    Description

    timaeus/pile-arxiv-slm-l1sae420 dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. pile-arxiv-elimination-disjoint-slm-l1sae1568

    • huggingface.co
    Updated Mar 19, 2025
    Cite
    Timaeus (2025). pile-arxiv-elimination-disjoint-slm-l1sae1568 [Dataset]. https://huggingface.co/datasets/timaeus/pile-arxiv-elimination-disjoint-slm-l1sae1568
    Dataset updated
    Mar 19, 2025
    Dataset authored and provided by
    Timaeus
    Description

    timaeus/pile-arxiv-elimination-disjoint-slm-l1sae1568 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. scaling_mia_the_pile_00_arxiv

    • huggingface.co
    Updated Feb 3, 2025
    Cite
    Parameter Lab (2025). scaling_mia_the_pile_00_arxiv [Dataset]. https://huggingface.co/datasets/parameterlab/scaling_mia_the_pile_00_arxiv
    Explore at:
    Croissant
    Dataset updated
    Feb 3, 2025
    Dataset authored and provided by
    Parameter Lab
    Description

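    This dataset includes all arXiv documents from the 00.jsonl.zst partition of The Pile. It was created with this script:

        import json
        from tqdm import tqdm
        # `zstd` must be a module exposing open(); the source does not say
        # which one (`import zstandard as zstd` is an assumption).

        pile_path = "data/the_pile/train/00.jsonl.zst"

        with zstd.open(pile_path, 'r') as fr:
            with open("/tmp/arxiv.jsonl", "w") as fw:
                for i, line in enumerate(tqdm(fr)):
                    doc = json.loads(line)
                    source = doc['meta']['pile_set_name']
                    # Keep only documents from the ArXiv subset.
                    if source == "ArXiv":
                        fw.write(json.dumps(doc) + "\n")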

    The validation and test sets are… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/scaling_mia_the_pile_00_arxiv.

  4. proof-pile

    • huggingface.co
    Updated Dec 25, 2022
    Cite
    Hoskinson Center for Formal Mathematics (2022). proof-pile [Dataset]. https://huggingface.co/datasets/hoskinson-center/proof-pile
    Explore at:
    Croissant
    Dataset updated
    Dec 25, 2022
    Dataset authored and provided by
    Hoskinson Center for Formal Mathematics
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A dataset of high quality mathematical text.

  5. pile-eval

    • huggingface.co
    Cite
    Wilson Wu, pile-eval [Dataset]. https://huggingface.co/datasets/wiwu2390/pile-eval
    Authors
    Wilson Wu
    Description

    First 100 rows of each of timaeus/pile-github, timaeus/pile-wikipedia_en, timaeus/pile-arxiv, timaeus/pile-pile-cc in that order.
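
The construction described above (take the first 100 rows from each source, then concatenate in the listed order) can be sketched as follows. This is a minimal illustration, not the author's actual build script: `first_rows_in_order` and `fake_rows` are hypothetical helpers, and the rows are stand-in data rather than real dataset contents.

```python
from itertools import chain, islice
from typing import Iterable, Iterator, Sequence

# Source names as listed in the description; the order matters.
SOURCES = [
    "timaeus/pile-github",
    "timaeus/pile-wikipedia_en",
    "timaeus/pile-arxiv",
    "timaeus/pile-pile-cc",
]


def first_rows_in_order(
    row_streams: Sequence[Iterable[dict]], n: int = 100
) -> Iterator[dict]:
    """Yield the first n rows of each stream, one stream after another."""
    return chain.from_iterable(islice(stream, n) for stream in row_streams)


def fake_rows(name: str, count: int) -> Iterator[dict]:
    """Hypothetical stand-in for iterating over one source dataset."""
    for i in range(count):
        yield {"source": name, "row": i}


streams = [fake_rows(name, 250) for name in SOURCES]
subset = list(first_rows_in_order(streams, n=100))
```

Because `islice` stops after `n` rows, each source is read only as far as needed, which would also suit streamed datasets.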

  6. gptneo-pubmed-abstracts

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    rachel lai (2024). gptneo-pubmed-abstracts [Dataset]. https://huggingface.co/datasets/rachel6603/gptneo-pubmed-abstracts
    Explore at:
    Croissant
    Dataset updated
    Jun 22, 2024
    Authors
    rachel lai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset consists of 10000 PubMed abstracts from The Pile (arXiv:2101.00027), along with completions (both human and LLM-generated), for use in calculating Heaps' Law, in the manner described in the preliminary paper, Heaps' Law in GPT-Neo Large Language… See the full description on the dataset page: https://huggingface.co/datasets/rachel6603/gptneo-pubmed-abstracts.
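
For orientation, Heaps' Law models vocabulary size as V(n) = K * n^beta, where n is the number of tokens read so far. The sketch below shows one common way to estimate K and beta from a text; it is not the method of the paper referenced above, which the truncated description does not spell out. Whitespace tokenization, the helper names `vocab_growth` and `fit_heaps`, and the sample sentence are all assumptions made for illustration.

```python
import math


def vocab_growth(tokens):
    """Return (n, V(n)) pairs: tokens read so far vs. distinct tokens seen."""
    seen = set()
    points = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        points.append((n, len(seen)))
    return points


def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Toy input; a real analysis would use the abstracts and completions.
text = "the cat sat on the mat and the dog sat on the log"
points = vocab_growth(text.split())
K, beta = fit_heaps(points)
```

On real corpora the fit is usually done over many documents and a proper tokenizer; the toy sentence only demonstrates the mechanics.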
