6 datasets found

pile-arxiv-slm-l1sae420
huggingface.co
Updated Mar 15, 2025
Cite
Timaeus (2025). pile-arxiv-slm-l1sae420 [Dataset]. https://huggingface.co/datasets/timaeus/pile-arxiv-slm-l1sae420
Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset authored and provided by
Timaeus
Description
timaeus/pile-arxiv-slm-l1sae420 dataset hosted on Hugging Face and contributed by the HF Datasets community
pile-arxiv-elimination-disjoint-slm-l1sae1568
huggingface.co
Updated Mar 19, 2025
Cite
Timaeus (2025). pile-arxiv-elimination-disjoint-slm-l1sae1568 [Dataset]. https://huggingface.co/datasets/timaeus/pile-arxiv-elimination-disjoint-slm-l1sae1568
Dataset authored and provided by
Timaeus
Description
timaeus/pile-arxiv-elimination-disjoint-slm-l1sae1568 dataset hosted on Hugging Face and contributed by the HF Datasets community
scaling_mia_the_pile_00_arxiv
huggingface.co
Updated Feb 3, 2025
Cite
Parameter Lab (2025). scaling_mia_the_pile_00_arxiv [Dataset]. https://huggingface.co/datasets/parameterlab/scaling_mia_the_pile_00_arxiv
Explore at: Croissant
Dataset authored and provided by
Parameter Lab
Description
This dataset includes all arXiv documents from the 00.jsonl.zst partition of The Pile. It was created with this script:

    import json
    import zstandard as zstd  # assumed source of zstd.open; the original omits its imports
    from tqdm import tqdm

    pile_path = "data/the_pile/train/00.jsonl.zst"

    with zstd.open(pile_path, 'r') as fr:
        with open("/tmp/arxiv.jsonl", "w") as fw:
            for i, line in enumerate(tqdm(fr)):
                doc = json.loads(line)
                source = doc['meta']['pile_set_name']
                # Keep only documents from the ArXiv subset, one JSON object per line.
                if source == "ArXiv":
                    fw.write(json.dumps(doc) + "\n")

The validation and test sets are… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/scaling_mia_the_pile_00_arxiv.
proof-pile
huggingface.co
Updated Dec 25, 2022
Cite
Hoskinson Center for Formal Mathematics (2022). proof-pile [Dataset]. https://huggingface.co/datasets/hoskinson-center/proof-pile
Explore at: Croissant
Dataset authored and provided by
Hoskinson Center for Formal Mathematics
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
A dataset of high-quality mathematical text.
pile-eval
huggingface.co
Cite
Wilson Wu, pile-eval [Dataset]. https://huggingface.co/datasets/wiwu2390/pile-eval
Authors
Wilson Wu
Description
First 100 rows of each of timaeus/pile-github, timaeus/pile-wikipedia_en, timaeus/pile-arxiv, timaeus/pile-pile-cc in that order.
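The construction described above (take the first 100 rows of each source, in the listed order) can be sketched as follows. This is only an illustration under stated assumptions, not the author's actual script: `first_rows` is a hypothetical helper, and the sources here are stand-in iterables rather than the four timaeus/pile-* datasets.

```python
from itertools import chain, islice

def first_rows(sources, n=100):
    """Concatenate the first n rows of each source, preserving source order."""
    return list(chain.from_iterable(islice(src, n) for src in sources))

# Stand-in iterables; the real sources would be the four datasets named above.
demo = first_rows([range(0, 500), range(1000, 1050)], n=100)
```

A source with fewer than `n` rows simply contributes all of its rows, since `islice` stops early without raising.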
gptneo-pubmed-abstracts
huggingface.co
Updated Jun 22, 2024
Cite
rachel lai (2024). gptneo-pubmed-abstracts [Dataset]. https://huggingface.co/datasets/rachel6603/gptneo-pubmed-abstracts
Explore at: Croissant
Authors
rachel lai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This is a dataset consisting of 10,000 PubMed abstracts from The Pile (arXiv:2101.00027), along with completions (both human and LLM-generated), to be used to calculate Heaps' Law in the manner described in the preliminary paper, Heaps' Law in GPT-Neo Large Language… See the full description on the dataset page: https://huggingface.co/datasets/rachel6603/gptneo-pubmed-abstracts.
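For readers unfamiliar with Heaps' Law, it models vocabulary size as V(n) = K * n^beta, where n is the number of tokens seen so far. The sketch below shows one common way to estimate K and beta, a least-squares fit in log-log space. It is not necessarily the procedure used in the cited paper, and the helper names `vocabulary_growth` and `fit_heaps` are hypothetical.

```python
import math

def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: the distinct-token count after each of the first n tokens."""
    seen = set()
    points = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        points.append((n, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Assumption: simple whitespace tokenization, chosen only for illustration.
growth = vocabulary_growth("the cat saw the other cat".split())
```

The fit needs at least two points with different n. How the text is tokenized changes the estimated parameters, so the choice should match whatever the analysis being reproduced used.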