DISCLAIMER
This represents only a subset of the final dataset (taking into account the new HuggingFace LFS storage limits). The complete dataset will be released following the camera-ready submission of our paper.
The Heap Dataset
We develop The Heap, a new contamination-free multilingual code dataset comprising 57 languages, which facilitates LLM evaluation reproducibility. The reproduction package can be found here.
Is your code in The Heap?
An opt-out… See the full description on the dataset page: https://huggingface.co/datasets/WizzF/Heap-Forge.
Appendix (for AAAI 2026)
Due to the storage limits for HuggingFace users, we have uploaded 1 million image-text pairs anonymously for review purposes. The complete dataset has already been uploaded to a cloud storage server and will be fully disclosed if the paper is accepted.
🧠 1 About FaceCaption-15M Construction
⚡ 1.1 Details of Attribute Designs:
To illustrate the data distribution better, we categorized the 40 facial appearance attributes into five… See the full description on the dataset page: https://huggingface.co/datasets/anonymous-user-2025/FaceCaption-15M.
This dataset card contains data from the original Basenji project. The original Basenji dataset has two main limitations:
Format: Data is stored in TensorFlow format, which is not directly compatible with PyTorch workflows.
Cost: Users need to pay Google Cloud storage fees to download the data.
To facilitate PyTorch-based training, we have downloaded and converted the data to H5 format for our research usage (https://huggingface.co/papers/2506.01833). With permission from the original Basenji… See the full description on the dataset page: https://huggingface.co/datasets/yangyz1230/space.
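As a rough illustration of how converted H5 files can be consumed in a PyTorch workflow, the sketch below reads one file with h5py. The file name and the dataset keys ("sequence", "target") are assumptions for illustration only, not the actual layout of this release.

import h5py
import torch

# Minimal sketch: read one converted H5 shard into PyTorch tensors.
# The file name and the keys "sequence"/"target" are assumed; check the
# actual files in the repository for the real layout.
with h5py.File("basenji_train_part0.h5", "r") as f:
    sequences = torch.from_numpy(f["sequence"][:])  # e.g. one-hot encoded DNA sequences
    targets = torch.from_numpy(f["target"][:])      # e.g. regression targets
    print(sequences.shape, targets.shape)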
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Note: To address the size limitations on Hugging Face, only 200 of the 63,792 rows were uploaded. The full dataset is available upon request for interested parties. The HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC dataset is part of the study "Targeting neurodegeneration: three machine learning methods for G9a inhibitors discovery using PubChem and scikit-learn" (https://doi.org/10.1007/s10822-025-00642-z). This dataset contains 63,792 rows, each representing a unique small… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Annotation
We resized the dataset to 1080p for easier uploading; therefore, the original annotation file might not match the video names. Please refer to this issue comment: https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/312#issuecomment-2197312973
Pexels
Pexels consists of multiple folders, but each folder exceeds the size limit for Hugging Face uploads. Therefore, we divided each folder into 5 parts. You need to merge the 5 parts of each folder first, and then extract each… See the full description on the dataset page: https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0.
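A minimal sketch of the merge step in Python, assuming the parts of one folder archive are named like pexels_folder_0.tar.part0 … part4 (the actual file names in the repository may differ):

import glob
import shutil

# Concatenate the 5 parts of one folder archive back into a single file,
# which can then be extracted as usual. File names here are illustrative;
# check the repository for the real ones.
parts = sorted(glob.glob("pexels_folder_0.tar.part*"))
with open("pexels_folder_0.tar", "wb") as merged:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, merged)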
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
GoEmotions
GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.
Number of examples: 58,009. Number of labels: 27 + Neutral. Maximum sequence length in training and evaluation datasets: 30.
On top of the raw data, we also include a version filtered based on rater agreement, which contains a train/test/validation split:
Size of training dataset: 43,410. Size of test dataset: 5,427. Size of… See the full description on the dataset page: https://huggingface.co/datasets/mrm8488/goemotions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Note: To address the size limitations on Hugging Face, only 200 of the 59,609 rows were uploaded. The full dataset is available upon request for interested parties. The human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular_features dataset is part of the study "Comparative analysis of computational approaches for predicting Transthyretin (TTR) transcription activators and human dopamine D1 receptor antagonists" (https://doi.org/10.48550/arXiv.2506.01137). The… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular.
https://choosealicense.com/licenses/unlicense/
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .wav format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import soundfile as sf

def map_to_array(batch):
    # Read the audio referenced in the "file" column and store the waveform in a new "speech" column
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
https://choosealicense.com/licenses/cc0-1.0/
Due to storage limits some files had to be split into multiple parts. They can be merged like this: cat file.* > file.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Gigaspeech Part 3
This is Part 3 of 8 of a large-scale speech dataset, split to accommodate HuggingFace's repository size limits.
Multi-Part Dataset
This dataset is split across multiple repositories:
Part 1: shahdsaf/gigaspeech-part-1
Part 2: shahdsaf/gigaspeech-part-2
Part 3 (current): shahdsaf/gigaspeech-part-3
Part 4: shahdsaf/gigaspeech-part-4
Part 5: shahdsaf/gigaspeech-part-5
Part 6: shahdsaf/gigaspeech-part-6
Part 7: shahdsaf/gigaspeech-part-7
Part 8:… See the full description on the dataset page: https://huggingface.co/datasets/shahdsaf/gigaspeech-part-3.
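If all eight parts are needed together, one possible way to recombine them is sketched below; this assumes each repository loads as a standard datasets split named "train" and that the repository names follow the pattern above.

from datasets import load_dataset, concatenate_datasets

# Load the eight part repositories and concatenate them into one dataset.
# The split name "train" is an assumption; check each repository for the actual splits.
parts = [load_dataset(f"shahdsaf/gigaspeech-part-{i}", split="train") for i in range(1, 9)]
full = concatenate_datasets(parts)
print(full)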
🌀 Navier-Stokes Simulation Dataset (Re=500, T=300)
This dataset contains 300 time steps of a high-resolution 3D Navier-Stokes simulation at Reynolds number 500. The full array was split into three parts to comply with file size limitations on the Hugging Face Hub. Each file is a .npy file in NumPy binary format and contains a contiguous slice along the time dimension.
📁 File Structure
ns_split_3/
├── ns_part_01.npy  # Samples 0–99
├── ns_part_02.npy  # Samples 100–199…
See the full description on the dataset page: https://huggingface.co/datasets/LDA1020/ns-dataset.
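To restore the full time axis, the .npy parts can be loaded and concatenated along the first dimension. A minimal sketch follows; the name of the third file is assumed to follow the same pattern as the first two.

import numpy as np

# Load the time slices and stitch them back together along the time axis.
# "ns_part_03.npy" is assumed to follow the naming pattern shown above.
parts = [np.load(f"ns_split_3/ns_part_{i:02d}.npy") for i in range(1, 4)]
field = np.concatenate(parts, axis=0)
print(field.shape)  # 300 time steps along the first dimension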
📃 Paper | 🤗 Hugging Face | ⭐ Github
Dataset Overview
In the table below, we provide a brief summary of the dataset statistics.
Category                Size
Total Samples           2,019,163
Total Images            2,019,163
Average Answer Length   84
Maximum Answer Length   5,851
JSON Overview
Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
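A small sketch of iterating over the JSON file and reading the three keys described above; the file name is hypothetical, and 'conversations' is assumed to be a list of turns.

import json

# Iterate over the annotation file and read the three documented keys.
# "wikipedia_knowledge_2m.json" is an illustrative file name; the file is
# assumed to be a single JSON array of sample dictionaries.
with open("wikipedia_knowledge_2m.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for sample in data[:3]:
    print(sample["id"], sample["image"], len(sample["conversations"]))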
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla’s open source Common Voice database of crowdsourced voice recordings.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .mp3 format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import torchaudio

def map_to_array(batch):
    # Decode the .mp3 referenced in the "file" column and store the waveform as a NumPy array
    speech_array, _ = torchaudio.load(batch["file"])
    batch["speech"] = speech_array.numpy()
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
LibriSpeech is a corpus of approximately 1000 hours of read English speech with a sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .flac format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import soundfile as sf

def map_to_array(batch):
    # Read the .flac file referenced in the "file" column and store the waveform in a "speech" column
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning.
Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2, but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset, but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
Dream2Image Dataset - Optimized Version
This is an optimized version of the original dataset opsecsystems/Dream2Image-ZhangTWC129-enriched that has been split into smaller chunks to be compatible with the Hugging Face dataset viewer.
Original Dataset
Repository: opsecsystems/Dream2Image-ZhangTWC129-enriched
Issue: Files too large for dataset viewer (>286 MB limit)
Optimized Version
Total examples: 129
Number of chunks: 1
Max chunk size: ~250 MB
All chunks… See the full description on the dataset page: https://huggingface.co/datasets/opsecsystems/Dream2Image-ZhangTWC129-enriched-optimized.