timaeus/pile-arxiv-slm-l1sae420 dataset hosted on Hugging Face and contributed by the HF Datasets community
timaeus/pile-arxiv-elimination-disjoint-slm-l1sae1568 dataset hosted on Hugging Face and contributed by the HF Datasets community
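This dataset includes all arXiv documents from the 00.jsonl.zst partition of The Pile. It was created with this script:

    import json
    import zstandard as zstd  # assumption: the source omits its imports; zstandard provides open()
    from tqdm import tqdm

    pile_path = "data/the_pile/train/00.jsonl.zst"

    # Stream the compressed partition and keep only the ArXiv documents.
    with zstd.open(pile_path, 'r') as fr:
        with open("/tmp/arxiv.jsonl", "w") as fw:
            for i, line in enumerate(tqdm(fr)):
                doc = json.loads(line)
                source = doc['meta']['pile_set_name']
                if source == "ArXiv":
                    # One JSON document per line (JSONL).
                    fw.write(json.dumps(doc) + "\n")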
The validation and test sets are… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/scaling_mia_the_pile_00_arxiv.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset of high quality mathematical text.
First 100 rows of each of timaeus/pile-github, timaeus/pile-wikipedia_en, timaeus/pile-arxiv, timaeus/pile-pile-cc in that order.
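The construction described above (the first 100 rows of each source, concatenated in a fixed order) can be sketched as follows. This is a minimal illustration and not the script used to build the dataset: `head_of_each` and `fake_rows` are hypothetical helpers, and the in-memory generators stand in for the four timaeus/pile-* datasets.

```python
from itertools import chain, islice


def head_of_each(sources, n=100):
    """Concatenate the first n rows of each source, preserving source order."""
    return list(chain.from_iterable(islice(iter(src), n) for src in sources))


def fake_rows(name, count):
    """Hypothetical stand-in for a real dataset: yields `count` labelled rows."""
    for i in range(count):
        yield {"source": name, "row": i}


# Source order as listed in the description.
names = ["pile-github", "pile-wikipedia_en", "pile-arxiv", "pile-pile-cc"]
sources = [fake_rows(name, 250) for name in names]

combined = head_of_each(sources, n=100)
```

Using `islice` means each source is read only as far as its first 100 rows, so the same approach would also work with streamed datasets.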
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Details
Dataset Description
This dataset consists of 10000 PubMed abstracts from The Pile (arXiv:2101.00027), along with completions (both human and LLM-generated), for calculating Heaps' Law in the manner described in the preliminary paper, Heaps' Law in GPT-Neo Large Language… See the full description on the dataset page: https://huggingface.co/datasets/rachel6603/gptneo-pubmed-abstracts.
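Heaps' Law states that the vocabulary size V grows with the number of tokens n roughly as V(n) ≈ K · n^β. As a rough sketch of how the two parameters might be estimated from text, the code below tracks vocabulary growth and fits a straight line in log-log space. The whitespace tokenisation, the function names, and the least-squares fit are assumptions made for illustration; the method used in the paper is not described here and may differ.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: distinct tokens seen after the first n tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def fit_heaps(growth):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Toy input (assumption): whitespace tokenisation of a made-up sentence.
text = "the cell binds the receptor and the receptor signals the cell"
growth = vocabulary_growth(text.split())
k, beta = fit_heaps(growth)
```

On real abstracts, the fit would normally be run over the concatenated token stream of many documents, since a single short text gives too few points for a stable estimate.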