60 datasets found
  1. Heap-Forge

    • huggingface.co
    Cite
    Forgery Wizzard, Heap-Forge [Dataset]. https://huggingface.co/datasets/WizzF/Heap-Forge
    Authors
    Forgery Wizzard
    Description

    DISCLAIMER

    This represents only a subset of the final dataset (taking into account the new HuggingFace LFS storage limits). The complete dataset will be released following the camera-ready submission of our paper.

      The Heap Dataset
    

    We develop The Heap, a new contamination-free multilingual code dataset comprising 57 languages, which facilitates LLM evaluation reproducibility. The reproduction package can be found here.

      Is your code in The Heap?
    

    An opt-out… See the full description on the dataset page: https://huggingface.co/datasets/WizzF/Heap-Forge.

  2. FaceCaption-15M

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Researcher (2025). FaceCaption-15M [Dataset]. https://huggingface.co/datasets/anonymous-user-2025/FaceCaption-15M
    Dataset updated
    Jun 1, 2025
    Authors
    Researcher
    Description

    Appendix (for AAAI 2026)

    Due to the storage limitations of HuggingFace users, we have uploaded 1 million image-text pairs anonymously for review purposes. The complete dataset has already been uploaded to a cloud storage server and will be fully disclosed if the paper is accepted.

      🧠 1 About FaceCaption-15M Construction
    
    
    
    
    
      ⚡ 1.1 Details of Attribute Designs:
    

    To illustrate the data distribution better, we categorized the 40 facial appearance attributes into five… See the full description on the dataset page: https://huggingface.co/datasets/anonymous-user-2025/FaceCaption-15M.

  3. space

    • huggingface.co
    Updated Jul 16, 2025
    Cite
    Zhao Yang (2025). space [Dataset]. https://huggingface.co/datasets/yangyz1230/space
    Dataset updated
    Jul 16, 2025
    Authors
    Zhao Yang
    Description

    This dataset card contains data from the original Basenji project. The original Basenji dataset has two main limitations:

    • Format: Data is stored in TensorFlow format, which is not directly compatible with PyTorch workflows.
    • Cost: Users need to pay Google Cloud storage fees to download the data.

    To facilitate PyTorch-based training, we have downloaded and converted the data to H5 format for our research usage (https://huggingface.co/papers/2506.01833). With permission from the original Basenji… See the full description on the dataset page: https://huggingface.co/datasets/yangyz1230/space.

  4. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  5. HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC

    • huggingface.co
    Updated Aug 26, 2025
    Cite
    Mariya L. Ivanova (2025). HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC [Dataset]. http://doi.org/10.57967/hf/6264
    Dataset updated
    Aug 26, 2025
    Authors
    Mariya L. Ivanova
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Note: To address the size limitations on Hugging Face, only 200 of the 63,792 rows were uploaded. The full dataset is available upon request. The HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC dataset is part of the study "Targeting neurodegeneration: three machine learning methods for G9a inhibitors discovery using PubChem and scikit-learn" (https://doi.org/10.1007/s10822-025-00642-z). This dataset contains 63,792 rows, each representing a unique small… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/HF_histone_lysine_N_methyltransferase_2_G9a_inhibitors_IUPAC.

  6. Open-Sora-Plan-v1.1.0

    • huggingface.co
    Updated Nov 3, 2024
    Cite
    linbin (2024). Open-Sora-Plan-v1.1.0 [Dataset]. https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0
    Dataset updated
    Nov 3, 2024
    Authors
    linbin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Annotation

    We resized the dataset to 1080p for easier uploading. As a result, the original annotation file might not match the video names. For details, see https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/312#issuecomment-2197312973

      Pexels
    

    Pexels consists of multiple folders, but each folder exceeds the size limit for Huggingface uploads. Therefore, we divided each folder into 5 parts. You need to merge the 5 parts of each folder first, and then extract each… See the full description on the dataset page: https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0.

  7. mmlu

    • huggingface.co
    Updated May 10, 2023
    + more versions
    Cite
    Center for AI Safety (2022). mmlu [Dataset]. https://huggingface.co/datasets/cais/mmlu
    Dataset updated
    May 10, 2023
    Dataset authored and provided by
    Center for AI Safety (https://safe.ai/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for MMLU

      Dataset Summary
    

    Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

  8. goemotions

    • huggingface.co
    Updated Aug 12, 2023
    + more versions
    Cite
    Manuel Romero (2023). goemotions [Dataset]. https://huggingface.co/datasets/mrm8488/goemotions
    Dataset updated
    Aug 12, 2023
    Authors
    Manuel Romero
    Description

    GoEmotions

    GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.

    Number of examples: 58,009. Number of labels: 27 + Neutral. Maximum sequence length in training and evaluation datasets: 30.

    On top of the raw data, we also include a version filtered based on rater agreement, which contains a train/test/validation split:

    Size of training dataset: 43,410. Size of test dataset: 5,427. Size of… See the full description on the dataset page: https://huggingface.co/datasets/mrm8488/goemotions.

  9. human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular...

    • huggingface.co
    Updated Aug 27, 2025
    Cite
    Mariya L. Ivanova (2025). human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular [Dataset]. http://doi.org/10.57967/hf/6274
    Dataset updated
    Aug 27, 2025
    Authors
    Mariya L. Ivanova
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Note: To address the size limitations on Hugging Face, only 200 of the 59,609 rows were uploaded. The full dataset is available upon request. The human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular_features dataset is part of the study "Comparative analysis of computational approaches for predicting Transthyretin (TTR) transcription activators and human dopamine D1 receptor antagonists" (https://doi.org/10.48550/arXiv.2506.01137). The… See the full description on the dataset page: https://huggingface.co/datasets/ivanovaml/human_dopamine_D1_receptor_antagonists_SMILES_13C_NMR_spectrosopy_with_molecular.

  10. lj_speech

    • huggingface.co
    • tensorflow.org
    Updated May 17, 2024
    Cite
    Keith Ito (2024). lj_speech [Dataset]. https://huggingface.co/datasets/keithito/lj_speech
    Dataset updated
    May 17, 2024
    Authors
    Keith Ito
    License

    https://choosealicense.com/licenses/unlicense/

    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .wav format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

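    import soundfile as sf
    
    def map_to_array(batch):
      # Decode the .wav file; request float32 explicitly, since soundfile returns float64 by default.
      speech_array, _ = sf.read(batch["file"], dtype="float32")
      batch["speech"] = speech_array
      return batch
    
    # `dataset` is the already-loaded dataset; the "file" column is dropped after decoding.
    dataset = dataset.map(map_to_array, remove_columns=["file"])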
    
  11. common_voice_21_0

    • huggingface.co
    Updated Jun 15, 2025
    Cite
    2Jyq (2025). common_voice_21_0 [Dataset]. https://huggingface.co/datasets/2Jyq/common_voice_21_0
    Dataset updated
    Jun 15, 2025
    Authors
    2Jyq
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Due to storage limits some files had to be split into multiple parts. They can be merged like this: cat file.* > file.
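
    For readers without a POSIX shell, the `cat file.* > file` merge above can be approximated in Python. This is a minimal sketch, not the dataset author's tooling: the helper name `merge_parts` is hypothetical, and it assumes the part suffixes sort correctly as plain strings (for example, zero-padded numbers or `aa`, `ab`, ...), which the description does not state.

```python
import glob
import shutil


def merge_parts(prefix: str, output_path: str) -> list[str]:
    """Concatenate every file matching `prefix.*` into `output_path`."""
    # Assumption: sorting the part names as strings reproduces the intended order.
    parts = sorted(
        p for p in glob.glob(glob.escape(prefix) + ".*") if p != output_path
    )
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never held fully in memory.
                shutil.copyfileobj(src, out)
    return parts
```

    Calling `merge_parts("file", "file")` mirrors the shell command: it writes the merged bytes to `file` and returns the list of parts it used, which is worth checking before deleting them.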

  12. gigaspeech-part-3

    • huggingface.co
    Updated Jul 2, 2025
    + more versions
    Cite
    Shahd Safarani (2025). gigaspeech-part-3 [Dataset]. https://huggingface.co/datasets/shahdsaf/gigaspeech-part-3
    Dataset updated
    Jul 2, 2025
    Authors
    Shahd Safarani
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Gigaspeech Part 3

    This is Part 3 of 8 of a large-scale speech dataset, split to accommodate HuggingFace's repository size limits.

      Multi-Part Dataset
    

    This dataset is split across multiple repositories:

    Part 1: shahdsaf/gigaspeech-part-1
    Part 2: shahdsaf/gigaspeech-part-2
    Part 3 (current): shahdsaf/gigaspeech-part-3
    Part 4: shahdsaf/gigaspeech-part-4
    Part 5: shahdsaf/gigaspeech-part-5
    Part 6: shahdsaf/gigaspeech-part-6
    Part 7: shahdsaf/gigaspeech-part-7
    Part 8:… See the full description on the dataset page: https://huggingface.co/datasets/shahdsaf/gigaspeech-part-3.

  13. ns-dataset

    • huggingface.co
    Updated Jul 29, 2025
    Cite
    Shanda Li (2025). ns-dataset [Dataset]. https://huggingface.co/datasets/LDA1020/ns-dataset
    Dataset updated
    Jul 29, 2025
    Authors
    Shanda Li
    Description

    🌀 Navier-Stokes Simulation Dataset (Re=500, T=300)

    This dataset contains 300 time steps of a high-resolution 3D Navier-Stokes simulation at Reynolds number 500. The full array was split into three parts to comply with file size limitations on the Hugging Face Hub. Each file is a .npy file in NumPy binary format and contains a contiguous slice along the time dimension.

      📁 File Structure
    

    ns_split_3/
    ├── ns_part_01.npy   # Samples 0–99
    ├── ns_part_02.npy   # Samples 100–199
    … See the full description on the dataset page: https://huggingface.co/datasets/LDA1020/ns-dataset.
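
    Because each part is a contiguous slice along the time dimension, the full array can be rebuilt by loading the parts in order and concatenating them. The sketch below is illustrative only: the function name `load_time_slices` is hypothetical, and it assumes time is the first axis (axis 0), which the description does not state. Pass the part paths explicitly, since the listing above is truncated.

```python
import numpy as np


def load_time_slices(part_paths):
    """Load .npy parts in the given order and join them along axis 0."""
    # Assumption: axis 0 is the time dimension the array was split along.
    parts = [np.load(path) for path in part_paths]
    return np.concatenate(parts, axis=0)
```

    If the parts hold 100 time steps each, three parts should yield an array whose first dimension is 300; checking `result.shape[0]` is a cheap sanity test that nothing was dropped or reordered.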

  14. Wikipedia-Knowledge-2M

    • huggingface.co
    Updated Jul 1, 2024
    Cite
    Yu (2024). Wikipedia-Knowledge-2M [Dataset]. https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M
    Dataset updated
    Jul 1, 2024
    Authors
    Yu
    Description

    📃 Paper | 🤗 Hugging Face | ⭐ Github

      Dataset Overview
    

    In the table below, we provide a brief summary of the dataset statistics.

    Category                 Size
    Total Sample             2019163
    Total Image              2019163
    Average Answer Length    84
    Maximum Answer Length    5851

      JSON Overview
    

    Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
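
    The three-key layout described above lends itself to a quick structural check before training. The following is a rough sketch under stated assumptions: the helper name `find_incomplete_records` is hypothetical, and it assumes the file holds a single JSON list of dictionaries. It checks only that the three documented keys are present, not what their values contain.

```python
import json

# The three keys named in the dataset description.
REQUIRED_KEYS = {"id", "image", "conversations"}


def find_incomplete_records(json_text: str) -> list[int]:
    """Return the positions of records missing any documented key."""
    records = json.loads(json_text)  # assumed: a JSON list of dictionaries
    return [
        index
        for index, record in enumerate(records)
        if not REQUIRED_KEYS.issubset(record)
    ]
```

    An empty result means every record carries all three keys; any returned index points at a record to inspect by hand.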

  15. covost2

    • huggingface.co
    Updated Jun 3, 2024
    + more versions
    Cite
    AI at Meta (2024). covost2 [Dataset]. https://huggingface.co/datasets/facebook/covost2
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla’s open source Common Voice database of crowdsourced voice recordings.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .mp3 format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

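    import torchaudio
    
    def map_to_array(batch):
      # torchaudio.load returns (waveform tensor, sample rate); the sample rate is discarded here.
      speech_array, _ = torchaudio.load(batch["file"])
      batch["speech"] = speech_array.numpy()
      return batch
    
    # `dataset` is the already-loaded dataset; the "file" column is dropped after decoding.
    dataset = dataset.map(map_to_array, remove_columns=["file"])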
    
  16. librispeech_asr_dummy

    • huggingface.co
    Updated Dec 22, 2022
    + more versions
    Cite
    Patrick von Platen (2022). librispeech_asr_dummy [Dataset]. https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy
    Dataset updated
    Dec 22, 2022
    Authors
    Patrick von Platen
    Description

    LibriSpeech is a corpus of approximately 1000 hours of read English speech with sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .flac format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

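    import soundfile as sf
    
    def map_to_array(batch):
      # Decode the .flac file; request float32 explicitly, since soundfile returns float64 by default.
      speech_array, _ = sf.read(batch["file"], dtype="float32")
      batch["speech"] = speech_array
      return batch
    
    # `dataset` is the already-loaded dataset; the "file" column is dropped after decoding.
    dataset = dataset.map(map_to_array, remove_columns=["file"])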
    
  17. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI (http://openai.com/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  18. ultrachat_200k

    • huggingface.co
    • opendatalab.com
    Updated Oct 29, 2023
    + more versions
    Cite
    Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    • Selection of a subset of data for faster supervised fine tuning.
    • Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

  19. the-stack-v2

    • huggingface.co
    Updated Mar 1, 2024
    + more versions
    Cite
    BigCode (2024). the-stack-v2 [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-v2
    Dataset updated
    Mar 1, 2024
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    The Stack v2

    The dataset consists of 4 versions:

    • bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
    • bigcode/the-stack-v2-dedup: based on the bigcode/the-stack-v2 but further near-deduplicated
    • bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
    • bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.

  20. Dream2Image-ZhangTWC129-enriched-optimized

    • huggingface.co
    Updated Sep 6, 2025
    Cite
    Opsec Systems (2025). Dream2Image-ZhangTWC129-enriched-optimized [Dataset]. https://huggingface.co/datasets/opsecsystems/Dream2Image-ZhangTWC129-enriched-optimized
    Dataset updated
    Sep 6, 2025
    Authors
    Opsec Systems
    Description

    Dream2Image Dataset - Optimized Version

    This is an optimized version of the original dataset opsecsystems/Dream2Image-ZhangTWC129-enriched that has been split into smaller chunks to be compatible with the Hugging Face dataset viewer.

      Original Dataset
    

    Repository: opsecsystems/Dream2Image-ZhangTWC129-enriched
    Issue: Files too large for dataset viewer (>286 MB limit)

      Optimized Version
    

    Total examples: 129
    Number of chunks: 1
    Max chunk size: ~250 MB
    All chunks… See the full description on the dataset page: https://huggingface.co/datasets/opsecsystems/Dream2Image-ZhangTWC129-enriched-optimized.

Heap-Forge (WizzF/Heap-Forge)

43 scholarly articles cite this dataset (View in Google Scholar)