https://choosealicense.com/licenses/other/
This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158). For more information, check out the project website: https://uimodeling.github.io/

To download this dataset, you need to install the huggingface-hub package:

```shell
pip install huggingface-hub
```

Use snapshot_download:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="biglab/webui-test", repo_type="dataset")
```
IMPORTANT
Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-test.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
You can load the dataset as follows:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="shc443/MaternKernel_compositionality", repo_type="dataset")
```
For more information about the data generation process, please refer to our paper or GitHub page.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Fastmap evaluation suite.
You only need the databases to run fastmap. Download the images if you want to produce a colored point cloud. Download the subset of data you want to your local directory:

```shell
huggingface-cli download whc/fastmap_sfm --repo-type dataset --local-dir ./ --include 'databases/tnt_*' 'ground_truths/tnt_*'
```

or use the Python interface:

```python
from huggingface_hub import hf_hub_download, snapshot_download

snapshot_download(
    repo_id="whc/fastmap_sfm",
    repo_type="dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/whc/fastmap_sfm.
https://choosealicense.com/licenses/gpl-3.0/
Download the SoccerNet Ball Action Spotting dataset in the OSL Action Spotting JSON format:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OpenSportsLab/SoccerNet-BallActionSpotting-Videos",
    repo_type="dataset",
    revision="main",
    local_dir="SoccerNet-BallActionSpotting-Videos",
)
```

Download specific subsets

Download 224p/720p versions

```python
from huggingface_hub import snapshot_download
```
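The card stops after the import. A minimal sketch of subset filtering via `snapshot_download`'s `allow_patterns`, assuming the resolution variants live in folders named `224p/` and `720p/` (the folder names are an assumption, not confirmed by the card; check the repo layout first):

```python
def resolution_patterns(resolution: str) -> list[str]:
    # Build allow_patterns for one resolution subset.
    # The "224p/"/"720p/" folder layout is an assumption.
    return [f"{resolution}/*"]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="OpenSportsLab/SoccerNet-BallActionSpotting-Videos",
        repo_type="dataset",
        revision="main",
        local_dir="SoccerNet-BallActionSpotting-Videos",
        allow_patterns=resolution_patterns("224p"),  # or "720p"
    )
```

Only files matching the patterns are fetched, which avoids downloading both resolutions when you need one.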
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Steps to download the data

Installing the necessary library:

```shell
poetry add huggingface_hub
```

Using the library to download the data into the desired directory:

```python
from huggingface_hub import snapshot_download

path = "./data"  # path where you want to store your data
snapshot_download("Pulsifi/job_embedding", repo_type="dataset", local_dir=path)
```
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A dataset of MIDI images designed for use with diffusion models for music generation, music classification, text-to-music and other purposes
🤗 Check out Imagen MIDI Images LIVE demo on Hugging Face Spaces 🤗
Installation
```python
from huggingface_hub import snapshot_download

repo_id = "asigalov61/MIDI-Images"
repo_type = "dataset"
local_dir = "./MIDI-Images"

snapshot_download(repo_id, repo_type=repo_type, local_dir=local_dir)
```
MIDI Images… See the full description on the dataset page: https://huggingface.co/datasets/asigalov61/MIDI-Images.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Vector Database
This is a ChromaDB-based vector database containing indexed embeddings from the Hugging Face dataset "Somnath01/Birds_Species".
Usage
```python
import chromadb
from huggingface_hub import snapshot_download

vector_db_path = snapshot_download(
    repo_id=vector_dataset.hf_dataset_path,
    repo_type="dataset",
)

client = chromadb.PersistentClient(
    path=os.path.join(vector_db_path
```

… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/bird-dataset-vector.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3o Pretrain JourneyDB Dataset
This collection contains 4 million JourneyDB images.
Download
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-Pretrain-JourneyDB", repo_type="dataset")
```
Load Dataset without Extracting
You don’t need to unpack the .tar archives; use the WebDataset support in 🤗 datasets instead:

```python
from datasets import load_dataset
import glob

data_files = glob.glob("/your/data/path/*.tar")
```

… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-JourneyDB.
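The loading snippet above is cut off by the card. A hedged sketch of the remaining step: collect the .tar shards with glob, then hand them to the `datasets` "webdataset" builder (the `/your/data/path` placeholder is from the card; `split="train"` and `streaming=True` are assumptions):

```python
import glob

def find_shards(root: str) -> list[str]:
    # Collect all WebDataset .tar shards under the downloaded snapshot,
    # sorted so shard order is deterministic across runs.
    return sorted(glob.glob(f"{root}/*.tar"))

if __name__ == "__main__":
    from datasets import load_dataset

    data_files = find_shards("/your/data/path")
    # streaming=True reads samples straight out of the tars, no extraction
    ds = load_dataset("webdataset", data_files=data_files,
                      split="train", streaming=True)
```

Streaming mode iterates over samples directly from the archives, so no disk space is spent on extracted copies.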
SOS-SFC-200K Dataset Splits
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. The SOS-SFC-200K collection contains 200k samples, each with:
Segmentation masks
Bounding boxes
All annotations are provided in COCO format.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="weikaih/SOS-SFC-200K"
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-SFC-200K.
SOS-FC-1M Dataset Splits
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. The SOS-FC-1M collection contains one million samples, each with:
Segmentation masks
Bounding boxes
Referring expressions
All annotations are provided in COCO format.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-FC-1M.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="wambugu71/crop_leaf_diseases_vit",
    local_dir="crop_leaf_model",
    local_dir_use_symlinks=False,
)
```
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3o Pretrain Long-Caption Dataset
This collection contains 27 million images, each paired with a long (~120 token) caption generated by Qwen/Qwen2.5-VL-7B-Instruct.
Download
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-Pretrain-Long-Caption", repo_type="dataset")
```
Load Dataset without Extracting
You don’t need to unpack the .tar archives; use the WebDataset support in 🤗 datasets instead: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ACDC Dataset
This is a pre-processed version of the original Automated Cardiac Diagnosis Challenge (ACDC) dataset. Short-axis view images have been resampled to 1mm x 1mm x 10mm. Images have also been center-cropped at the left ventricle mask center. The size of each slice is 192 x 192 pixels. Download the dataset using the following command:

```python
from huggingface_hub import snapshot_download

data_dir = snapshot_download(repo_id="mathpluscode/ACDC", allow_patterns=["*.nii.gz", "*.csv"]
```

… See the full description on the dataset page: https://huggingface.co/datasets/mathpluscode/ACDC.
SOS-FC-Object-Segments-10M
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. This dataset contains over 10M object segments in Frequency-Category (FC) splits.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="weikaih/SOS-FC-Object-Segments-10M",
    repo_type="dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-FC-Object-Segments-10M.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Matter 0.1
Curated top-quality records from 35 other datasets, extracted from prompt-perfect. This is a consolidation of all the score-5 records. We are fine-tuning models with various subsets and combinations to create a best-performing v1 dataset.
~1.4B Tokens, ~2.5M records
The dataset has been deduped and decontaminated with the bagel script from Jon Durbin. Download using the command below to avoid unnecessary files:

```python
from huggingface_hub import snapshot_download
```

… See the full description on the dataset page: https://huggingface.co/datasets/0-hero/Matter-0.1.
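The download command is cut off by the card. One way to "avoid unnecessary files" is `snapshot_download`'s `ignore_patterns` argument; a sketch (the patterns below are hypothetical examples, not the card's actual list):

```python
def extra_file_patterns() -> list[str]:
    # Hypothetical patterns for auxiliary files to skip; the real set of
    # "unnecessary files" depends on the repo layout.
    return ["*.md", ".gitattributes"]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="0-hero/Matter-0.1",
        repo_type="dataset",
        ignore_patterns=extra_file_patterns(),
    )
```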
The original dataset was acquired through the following link: https://zenodo.org/records/8370566. The dataset is not cleaned yet, and any contributions are welcome 🤗
Download instructions

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tunis-ai/TunSwitch", repo_type="dataset", local_dir=".")
```
Information
This repo contains the data used to develop and test the Tunisian Arabic Automatic Speech Recognition model developed in the following paper: A.… See the full description on the dataset page: https://huggingface.co/datasets/tunis-ai/TunSwitch.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the BLIP3o-60k text-to-image instruction tuning dataset distilled from GPT-4o, including the following categories:

JourneyDB
Human (including MSCOCO with human captions, human gestures, occupations)
Dalle3
Geneval (no overlap with test set)
Common objects
Simple text

Here we provide the code guidance to download the tar file:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-60k", repo_type="dataset")
```

And you can use Hugging Face datasets to read the tar… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-60k.
How to install?
```python
!pip install datasets -q
from huggingface_hub import snapshot_download
import pandas as pd
import matplotlib.pyplot as plt

snapshot_download(repo_id="Aborevsky01/CLEVR-BT-DB", repo_type="dataset", local_dir='path-to-your-local-dir')

!unzip [path-to-your-local-dir]/[type-of-task]/images.zip
```
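The unzip step can equivalently be done from Python with the standard-library zipfile module, which avoids depending on the `unzip` binary; a sketch (the `[type-of-task]` subdirectory name is a placeholder from the card):

```python
import os
import zipfile

def extract_task_images(local_dir: str, task: str) -> str:
    # Unpack <local_dir>/<task>/images.zip next to the archive and
    # return the directory the images were extracted into.
    archive = os.path.join(local_dir, task, "images.zip")
    out_dir = os.path.join(local_dir, task)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return out_dir
```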
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GenRef-CoT
We provide 227K high-quality CoT reflections that were used to train our Qwen-based reflection generation model in ReflectionFlow [1]. For details of the dataset creation pipeline, please refer to Section 3.2 of [1].
Dataset loading
We provide the dataset in the WebDataset format for fast dataloading and streaming. We recommend downloading the repository locally for faster I/O:

```python
from huggingface_hub import snapshot_download

local_dir =
```

… See the full description on the dataset page: https://huggingface.co/datasets/diffusion-cot/GenRef-CoT.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Annotated MIDI Dataset
Comprehensive annotated MIDI dataset with original lyrics, lyrics summaries, lyrics sentiments, music descriptions, illustrations, pre-trained MIDI classification model and helper Python code
Annotated MIDI Dataset LIVE demos
Music Sentence Transformer
Advanced MIDI Classifier
Descriptive Music Transformer
Installation
```python
from huggingface_hub import snapshot_download

repo_id = "asigalov61/Annotated-MIDI-Dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/asigalov61/Annotated-MIDI-Dataset.