DISCLAIMER
This represents only a subset of the final dataset (taking into account the new HuggingFace LFS storage limits). The complete dataset will be released following the camera-ready submission of our paper.
The Heap Dataset
We develop The Heap, a new contamination-free multilingual code dataset comprising 57 languages, which facilitates LLM evaluation reproducibility. The reproduction package can be found here.
Is your code in The Heap?
An opt-out… See the full description on the dataset page: https://huggingface.co/datasets/WizzF/Heap-Forge.
Appendix (for AAAI 2026)
Due to the storage limits for HuggingFace users, we have uploaded 1 million image-text pairs anonymously for review purposes. The complete dataset has already been uploaded to a cloud storage server and will be fully disclosed if the paper is accepted.
🧠 1 About FaceCaption-15M Construction
⚡ 1.1 Details of Attribute Designs:
To illustrate the data distribution better, we categorized the 40 facial appearance attributes into five… See the full description on the dataset page: https://huggingface.co/datasets/anonymous-user-2025/FaceCaption-15M.
This dataset card contains data from the original Basenji project. The original Basenji dataset has two main limitations:
Format: Data is stored in TensorFlow format, which is not directly compatible with PyTorch workflows.
Cost: Users need to pay Google Cloud storage fees to download the data.
To facilitate PyTorch-based training, we have downloaded and converted the data to H5 format for our research usage (https://huggingface.co/papers/2506.01833). With permission from the original Basenji… See the full description on the dataset page: https://huggingface.co/datasets/yangyz1230/space.
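As a rough illustration of how converted H5 files can be consumed in a PyTorch workflow, the sketch below reads one file with h5py. The file name and the dataset keys ("sequence", "target") are assumptions for illustration only, not the actual layout of this release.

import h5py
import torch

# Minimal sketch: read one converted H5 shard into PyTorch tensors.
# The file name and the keys "sequence"/"target" are assumed; check the
# actual files in the repository for the real layout.
with h5py.File("basenji_train_part0.h5", "r") as f:
    sequences = torch.from_numpy(f["sequence"][:])  # e.g. one-hot encoded DNA sequences
    targets = torch.from_numpy(f["target"][:])      # e.g. regression targets
    print(sequences.shape, targets.shape)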
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Note: To address the size limitations on Hugging Face, only 200 of the 63,792 rows were uploaded. The full dataset is available upon request for interested parties. The HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC dataset is part of the study "Targeting neurodegeneration: three machine learning methods for G9a inhibitors discovery using PubChem and scikit-learn" (https://doi.org/10.1007/s10822-025-00642-z). This dataset contains 63,792 rows, each representing a unique small… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Annotation
We resized the dataset to 1080p for easier uploading; therefore, the original annotation file might not match the video names. Please refer to this issue comment: https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/312#issuecomment-2197312973
Pexels
Pexels consists of multiple folders, but each folder exceeds the size limit for Hugging Face uploads. Therefore, we divided each folder into 5 parts. You need to merge the 5 parts of each folder first, and then extract each… See the full description on the dataset page: https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0.
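A minimal sketch of the merge step in Python, assuming the parts of one folder archive are named like pexels_folder_0.tar.part0 … part4 (the actual file names in the repository may differ):

import glob
import shutil

# Concatenate the 5 parts of one folder archive back into a single file,
# which can then be extracted as usual. File names here are illustrative;
# check the repository for the real ones.
parts = sorted(glob.glob("pexels_folder_0.tar.part*"))
with open("pexels_folder_0.tar", "wb") as merged:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, merged)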
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
GoEmotions
GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.
Number of examples: 58,009. Number of labels: 27 + Neutral. Maximum sequence length in training and evaluation datasets: 30.
On top of the raw data, we also include a version filtered based on rater agreement, which contains a train/test/validation split:
Size of training dataset: 43,410. Size of test dataset: 5,427. Size of… See the full description on the dataset page: https://huggingface.co/datasets/mrm8488/goemotions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Note: To address the size limitations on Hugging Face, only 200 of the 59,609 rows were uploaded. The full dataset is available upon request for interested parties. The human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular_features dataset is part of the study "Comparative analysis of computational approaches for predicting Transthyretin (TTR) transcription activators and human dopamine D1 receptor antagonists" (https://doi.org/10.48550/arXiv.2506.01137). The… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular.
https://choosealicense.com/licenses/unlicense/
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .wav format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import soundfile as sf

def map_to_array(batch):
    # Read the audio referenced in the "file" column and store the waveform in a new "speech" column
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
https://choosealicense.com/licenses/cc0-1.0/
Due to storage limits some files had to be split into multiple parts. They can be merged like this: cat file.* > file.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Gigaspeech Part 3
This is Part 3 of 8 of a large-scale speech dataset, split to accommodate HuggingFace's repository size limits.
Multi-Part Dataset
This dataset is split across multiple repositories:
Part 1: shahdsaf/gigaspeech-part-1
Part 2: shahdsaf/gigaspeech-part-2
Part 3 (current): shahdsaf/gigaspeech-part-3
Part 4: shahdsaf/gigaspeech-part-4
Part 5: shahdsaf/gigaspeech-part-5
Part 6: shahdsaf/gigaspeech-part-6
Part 7: shahdsaf/gigaspeech-part-7
Part 8:… See the full description on the dataset page: https://huggingface.co/datasets/shahdsaf/gigaspeech-part-3.
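If all eight parts are needed together, one possible way to recombine them is sketched below; this assumes each repository loads as a standard datasets split named "train" and that the repository names follow the pattern above.

from datasets import load_dataset, concatenate_datasets

# Load the eight part repositories and concatenate them into one dataset.
# The split name "train" is an assumption; check each repository for the actual splits.
parts = [load_dataset(f"shahdsaf/gigaspeech-part-{i}", split="train") for i in range(1, 9)]
full = concatenate_datasets(parts)
print(full)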
🌀 Navier-Stokes Simulation Dataset (Re=500, T=300)
This dataset contains 300 time steps of a high-resolution 3D Navier-Stokes simulation at Reynolds number 500. The full array was split into three parts to comply with file size limitations on the Hugging Face Hub. Each file is a .npy file in NumPy binary format and contains a contiguous slice along the time dimension.
📁 File Structure
ns_split_3/
├── ns_part_01.npy  # Samples 0–99
├── ns_part_02.npy  # Samples 100–199…
See the full description on the dataset page: https://huggingface.co/datasets/LDA1020/ns-dataset.
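To restore the full time axis, the .npy parts can be loaded and concatenated along the first dimension. A minimal sketch follows; the name of the third file is assumed to follow the same pattern as the first two.

import numpy as np

# Load the time slices and stitch them back together along the time axis.
# "ns_part_03.npy" is assumed to follow the naming pattern shown above.
parts = [np.load(f"ns_split_3/ns_part_{i:02d}.npy") for i in range(1, 4)]
field = np.concatenate(parts, axis=0)
print(field.shape)  # 300 time steps along the first dimension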
📃 Paper | 🤗 Hugging Face | ⭐ Github
Dataset Overview
In the table below, we provide a brief summary of the dataset statistics.
Category                Size
Total Samples           2,019,163
Total Images            2,019,163
Average Answer Length   84
Maximum Answer Length   5,851
JSON Overview
Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
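A small sketch of iterating over the JSON file and reading the three keys described above; the file name is hypothetical, and 'conversations' is assumed to be a list of turns.

import json

# Iterate over the annotation file and read the three documented keys.
# "wikipedia_knowledge_2m.json" is an illustrative file name; the file is
# assumed to be a single JSON array of sample dictionaries.
with open("wikipedia_knowledge_2m.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for sample in data[:3]:
    print(sample["id"], sample["image"], len(sample["conversations"]))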
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla’s open source Common Voice database of crowdsourced voice recordings.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .mp3 format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import torchaudio

def map_to_array(batch):
    # Decode the .mp3 referenced in the "file" column and store the waveform as a NumPy array
    speech_array, _ = torchaudio.load(batch["file"])
    batch["speech"] = speech_array.numpy()
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
LibriSpeech is a corpus of approximately 1000 hours of read English speech with a sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .flac format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map()
function as follows:
import soundfile as sf

def map_to_array(batch):
    # Read the .flac file referenced in the "file" column and store the waveform in a "speech" column
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning.
Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2, but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset, but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
Dream2Image Dataset - Optimized Version
This is an optimized version of the original dataset opsecsystems/Dream2Image-ZhangTWC129-enriched that has been split into smaller chunks to be compatible with the Hugging Face dataset viewer.
Original Dataset
Repository: opsecsystems/Dream2Image-ZhangTWC129-enriched
Issue: Files too large for dataset viewer (>286 MB limit)
Optimized Version
Total examples: 129
Number of chunks: 1
Max chunk size: ~250 MB
All chunks… See the full description on the dataset page: https://huggingface.co/datasets/opsecsystems/Dream2Image-ZhangTWC129-enriched-optimized.