https://choosealicense.com/licenses/other/
This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158). For more information, check out the project website: https://uimodeling.github.io/

To download this dataset, you need to install the huggingface-hub package:

```shell
pip install huggingface-hub
```

Use snapshot_download:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="biglab/webui-test", repo_type="dataset")
```
IMPORTANT
Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-test.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
You can load the dataset as follows:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="shc443/MaternKernel_compositionality", repo_type="dataset")
```
For more information about the data generation process, please refer to our paper or GitHub page.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Fastmap evaluation suite.
You only need the databases to run fastmap. Download the images if you want to produce a colored point cloud. Download the subset of data you want to your local directory:

```shell
huggingface-cli download whc/fastmap_sfm --repo-type dataset --local-dir ./ --include 'databases/tnt_*' 'ground_truths/tnt_*'
```

or use the Python interface:

```python
from huggingface_hub import hf_hub_download, snapshot_download

snapshot_download(
    repo_id="whc/fastmap_sfm",
    repo_type="dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/whc/fastmap_sfm.
https://choosealicense.com/licenses/gpl-3.0/
Download the SoccerNet Ball Action Spotting dataset in the OSL Action Spotting JSON format:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OpenSportsLab/SoccerNet-BallActionSpotting-Videos",
    repo_type="dataset",
    revision="main",
    local_dir="SoccerNet-BallActionSpotting-Videos",
)
```

Download specific subsets

Download 224p/720p versions

```python
from huggingface_hub import snapshot_download
```
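The card stops after the import. A minimal sketch of subset filtering via `snapshot_download`'s `allow_patterns`, assuming the resolution variants live in folders named `224p/` and `720p/` (the folder names are an assumption, not confirmed by the card; check the repo layout first):

```python
def resolution_patterns(resolution: str) -> list[str]:
    # Build allow_patterns for one resolution subset.
    # The "224p/"/"720p/" folder layout is an assumption.
    return [f"{resolution}/*"]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="OpenSportsLab/SoccerNet-BallActionSpotting-Videos",
        repo_type="dataset",
        revision="main",
        local_dir="SoccerNet-BallActionSpotting-Videos",
        allow_patterns=resolution_patterns("224p"),  # or "720p"
    )
```

Only files matching the patterns are fetched, which avoids downloading both resolutions when you need one.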
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Steps to download the data

Installing the necessary library:

```shell
poetry add huggingface_hub
```

Using the library to download the data into the desired directory:

```python
from huggingface_hub import snapshot_download

path = "./data"  # path where you want to store your data
snapshot_download("Pulsifi/job_embedding", repo_type="dataset", local_dir=path)
```
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A dataset of MIDI images designed for use with diffusion models for music generation, music classification, text-to-music and other purposes
🤗 Check out Imagen MIDI Images LIVE demo on Hugging Face Spaces 🤗
Installation
```python
from huggingface_hub import snapshot_download

repo_id = "asigalov61/MIDI-Images"
repo_type = "dataset"
local_dir = "./MIDI-Images"

snapshot_download(repo_id, repo_type=repo_type, local_dir=local_dir)
```
MIDI Images… See the full description on the dataset page: https://huggingface.co/datasets/asigalov61/MIDI-Images.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Vector Database
This is a ChromaDB-based vector database containing indexed embeddings from the Hugging Face dataset "Somnath01/Birds_Species".
Usage
```python
import chromadb
from huggingface_hub import snapshot_download

vector_db_path = snapshot_download(
    repo_id=vector_dataset.hf_dataset_path,
    repo_type="dataset",
)

client = chromadb.PersistentClient(
    path=os.path.join(vector_db_path
```

… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/bird-dataset-vector.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3o Pretrain JourneyDB Dataset
This collection contains 4 million JourneyDB images.
Download
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-Pretrain-JourneyDB", repo_type="dataset")
```
Load Dataset without Extracting
You don’t need to unpack the .tar archives; use the WebDataset support in 🤗 datasets instead:

```python
from datasets import load_dataset
import glob

data_files = glob.glob("/your/data/path/*.tar")
```

… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-JourneyDB.
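The loading snippet above is cut off by the card. A hedged sketch of the remaining step: collect the .tar shards with glob, then hand them to the `datasets` "webdataset" builder (the `/your/data/path` placeholder is from the card; `split="train"` and `streaming=True` are assumptions):

```python
import glob

def find_shards(root: str) -> list[str]:
    # Collect all WebDataset .tar shards under the downloaded snapshot,
    # sorted so shard order is deterministic across runs.
    return sorted(glob.glob(f"{root}/*.tar"))

if __name__ == "__main__":
    from datasets import load_dataset

    data_files = find_shards("/your/data/path")
    # streaming=True reads samples straight out of the tars, no extraction
    ds = load_dataset("webdataset", data_files=data_files,
                      split="train", streaming=True)
```

Streaming mode iterates over samples directly from the archives, so no disk space is spent on extracted copies.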
SOS-SFC-200K Dataset Splits
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. The SOS-SFC-200K collection contains 200k samples, each with:
Segmentation masks
Bounding boxes
All annotations are provided in COCO format.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="weikaih/SOS-SFC-200K"
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-SFC-200K.
SOS-FC-1M Dataset Splits
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. The SOS-FC-1M collection contains one million samples, each with:
Segmentation masks
Bounding boxes
Referring expressions
All annotations are provided in COCO format.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-FC-1M.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="wambugu71/crop_leaf_diseases_vit",
    local_dir="crop_leaf_model",
    local_dir_use_symlinks=False,
)
```
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BLIP3o Pretrain Long-Caption Dataset
This collection contains 27 million images, each paired with a long (~120 token) caption generated by Qwen/Qwen2.5-VL-7B-Instruct.
Download
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-Pretrain-Long-Caption", repo_type="dataset")
```
Load Dataset without Extracting
You don’t need to unpack the .tar archives; use the WebDataset support in 🤗 datasets instead: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ACDC Dataset
This is a pre-processed version of the original Automated Cardiac Diagnosis Challenge (ACDC) dataset. Short-axis view images have been resampled to 1mm x 1mm x 10mm. Images have also been center-cropped at the left ventricle mask center. The size of each slice is 192 x 192 pixels. Download the dataset using the following command:

```python
from huggingface_hub import snapshot_download

data_dir = snapshot_download(repo_id="mathpluscode/ACDC", allow_patterns=["*.nii.gz", "*.csv"]
```

… See the full description on the dataset page: https://huggingface.co/datasets/mathpluscode/ACDC.
SOS-FC-Object-Segments-10M
These are the dataset splits for the paper SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding. This dataset contains over 10M object segments in Frequency-Category (FC) splits.
Download & Extraction
Clone or download the entire repository.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="weikaih/SOS-FC-Object-Segments-10M",
    repo_type="dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/weikaih/SOS-FC-Object-Segments-10M.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Matter 0.1
Curated top-quality records from 35 other datasets, extracted from prompt-perfect. This is a consolidation of all the score-5 records. We are fine-tuning models with various subsets and combinations to create a best-performing v1 dataset.
~1.4B Tokens, ~2.5M records
The dataset has been deduped and decontaminated with the bagel script from Jon Durbin. Download using the command below to avoid unnecessary files:

```python
from huggingface_hub import snapshot_download
```

… See the full description on the dataset page: https://huggingface.co/datasets/0-hero/Matter-0.1.
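The download command is cut off by the card. One way to "avoid unnecessary files" is `snapshot_download`'s `ignore_patterns` argument; a sketch (the patterns below are hypothetical examples, not the card's actual list):

```python
def extra_file_patterns() -> list[str]:
    # Hypothetical patterns for auxiliary files to skip; the real set of
    # "unnecessary files" depends on the repo layout.
    return ["*.md", ".gitattributes"]

if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="0-hero/Matter-0.1",
        repo_type="dataset",
        ignore_patterns=extra_file_patterns(),
    )
```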
The original dataset was acquired through the following link: https://zenodo.org/records/8370566. The dataset is not cleaned yet, and any contributions are welcome 🤗
Download instructions

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tunis-ai/TunSwitch", repo_type="dataset", local_dir=".")
```
Information
This repo contains the data used to develop and test the Tunisian Arabic Automatic Speech Recognition model developed in the following paper: A.… See the full description on the dataset page: https://huggingface.co/datasets/tunis-ai/TunSwitch.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the BLIP3o-60k text-to-image instruction tuning dataset distilled from GPT-4o, including the following categories:

JourneyDB
Human (including MSCOCO with human captions, human gestures, occupations)
Dalle3
Geneval (no overlap with test set)
Common objects
Simple text

Here we provide the code guidance to download the tar file:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BLIP3o/BLIP3o-60k", repo_type="dataset")
```

And you can use Hugging Face datasets to read the tar… See the full description on the dataset page: https://huggingface.co/datasets/BLIP3o/BLIP3o-60k.
How to install?
```python
!pip install datasets -q
from huggingface_hub import snapshot_download
import pandas as pd
import matplotlib.pyplot as plt

snapshot_download(repo_id="Aborevsky01/CLEVR-BT-DB", repo_type="dataset", local_dir='path-to-your-local-dir')

!unzip [path-to-your-local-dir]/[type-of-task]/images.zip
```
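The unzip step can equivalently be done from Python with the standard-library zipfile module, which avoids depending on the `unzip` binary; a sketch (the `[type-of-task]` subdirectory name is a placeholder from the card):

```python
import os
import zipfile

def extract_task_images(local_dir: str, task: str) -> str:
    # Unpack <local_dir>/<task>/images.zip next to the archive and
    # return the directory the images were extracted into.
    archive = os.path.join(local_dir, task, "images.zip")
    out_dir = os.path.join(local_dir, task)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return out_dir
```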
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GenRef-CoT
We provide 227K high-quality CoT reflections that were used to train our Qwen-based reflection generation model in ReflectionFlow [1]. For details of the dataset creation pipeline, please refer to Section 3.2 of [1].
Dataset loading
We provide the dataset in the WebDataset format for fast dataloading and streaming. We recommend downloading the repository locally for faster I/O:

```python
from huggingface_hub import snapshot_download

local_dir =
```

… See the full description on the dataset page: https://huggingface.co/datasets/diffusion-cot/GenRef-CoT.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Annotated MIDI Dataset
Comprehensive annotated MIDI dataset with original lyrics, lyrics summaries, lyrics sentiments, music descriptions, illustrations, pre-trained MIDI classification model and helper Python code
Annotated MIDI Dataset LIVE demos
Music Sentence Transformer
Advanced MIDI Classifier
Descriptive Music Transformer
Installation
```python
from huggingface_hub import snapshot_download

repo_id = "asigalov61/Annotated-MIDI-Dataset"
```

… See the full description on the dataset page: https://huggingface.co/datasets/asigalov61/Annotated-MIDI-Dataset.