DL3DV-Dataset
This repo contains all the 2K frames with camera poses from the DL3DV-10K Dataset. We are working hard to review the entire dataset and remove sensitive information. Thank you for your patience.
Download
If you have enough space, you can use git to download the dataset from Hugging Face; see this link. The 480P/960P versions should satisfy most needs. If you do not have enough space, we also provide a download script here to download a subset. The usage:… See the full description on the dataset page: https://huggingface.co/datasets/DL3DV/DL3DV-ALL-2K.
MIT License: https://opensource.org/licenses/MIT
Fastmap evaluation suite.
You only need the databases to run fastmap. Download the images if you want to produce a colored point cloud. Download the subset of data you want to your local directory:
huggingface-cli download whc/fastmap_sfm --repo-type dataset --local-dir ./ --include 'databases/tnt_*' 'ground_truths/tnt_*'
Or use the Python interface:
from huggingface_hub import hf_hub_download, snapshot_download
snapshot_download( repo_id="whc/fastmap_sfm", repo_type='dataset'… See the full description on the dataset page: https://huggingface.co/datasets/whc/fastmap_sfm.
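For reference, a minimal sketch of what the completed Python call might look like, using the same include patterns as the CLI example above (the allow_patterns mapping is an assumption, not taken from the card):

from huggingface_hub import snapshot_download

# Mirror the CLI --include globs with allow_patterns (assumed equivalence).
snapshot_download(
    repo_id="whc/fastmap_sfm",
    repo_type="dataset",
    local_dir="./",
    allow_patterns=["databases/tnt_*", "ground_truths/tnt_*"],
)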
License: https://choosealicense.com/licenses/other/
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
License: https://choosealicense.com/licenses/cc/
Localized Audio Visual DeepFake Dataset (LAV-DF)
This repo is the dataset for the DICTA paper Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization (Best Award), and the journal paper "Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization submitted to CVIU.
LAV-DF Dataset
Download
To use this LAV-DF dataset, you should… See the full description on the dataset page: https://huggingface.co/datasets/ControlNet/LAV-DF.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Cloud-Adapter-Datasets
This dataset card aims to describe the datasets used in the Cloud-Adapter, a collection of high-resolution satellite images and semantic segmentation masks for cloud detection and related tasks.
Install
pip install huggingface-hub
Usage
huggingface-cli download --repo-type dataset XavierJiezou/cloud-adapter-datasets --local-dir data --include hrc_whu.zip
huggingface-cli download --repo-type dataset… See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/cloud-adapter-datasets.
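For a single subset, a Python sketch along these lines should work (only hrc_whu.zip comes from the example above; the extraction path is an assumption):

import zipfile
from huggingface_hub import hf_hub_download

# Fetch one archive from the dataset repo, then unpack it locally.
path = hf_hub_download(
    repo_id="XavierJiezou/cloud-adapter-datasets",
    filename="hrc_whu.zip",
    repo_type="dataset",
    local_dir="data",
)
with zipfile.ZipFile(path) as zf:
    zf.extractall("data/hrc_whu")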
License: https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
bigcode/the-stack-v2: the full "The Stack v2" dataset <-- you are here
bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2 but further near-deduplicated
bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Open X-Embodiment Dataset (unofficial)
RLDS dataset for training VLA models.
Use this dataset
Download the dataset via Hugging Face:
Prepare it yourself
The code is modified from rlds_dataset_mod. We upload the processed dataset in this repository ❤… See the full description on the dataset page: https://huggingface.co/datasets/WeiChow/VLATrainingDataset.
Data source
Downloaded via Andrej Karpathy's nanogpt repo from this link
Data Format
The entire dataset is split into train (90%) and test (10%). All rows are at most 1024 tokens, using the Llama 2 tokenizer. All rows are split cleanly so that sentences are whole and unbroken.
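To sanity-check the 1024-token bound, a sketch like the following can be used; the tokenizer checkpoint is a stand-in (hf-internal-testing/llama-tokenizer, an ungated copy of the Llama tokenizer family), not a name taken from this card:

from transformers import AutoTokenizer

# Stand-in for the Llama 2 tokenizer; swap in the gated original if you have access.
tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def within_limit(text: str, limit: int = 1024) -> bool:
    # Rows in this dataset are claimed to be at most `limit` tokens long.
    return len(tok(text)["input_ids"]) <= limit

print(within_limit("A short example sentence."))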
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Indonesian Youtube
Source code at https://github.com/mesolitica/malaysian-dataset/tree/master/speech/indonesian-youtube
how to download
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/indonesian-youtube
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
~/7zz x mp3-16k.zip -y -mmt40
Licensing
All the videos, songs, images, and graphics used in the video belong to their… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/indonesian-youtube.
MIT License: https://opensource.org/licenses/MIT
Download script to avoid the rate limit:
COMMAND="huggingface-cli download --repo-type dataset Melmaphother/crag-mm-image-search-images --local-dir crag-mm-image-search-images"
while true; do
  echo "Attempting to download/resume: $COMMAND"
  # Execute download command
  $COMMAND
  EXIT_STATUS=$?
  if [ $EXIT_STATUS -eq 0 ]; then
    echo "Download completed successfully."
    break
  else
    echo… See the full description on the dataset page: https://huggingface.co/datasets/Melmaphother/crag-mm-image-search-images.
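The same retry-until-success idea in Python (a sketch; the 30-second backoff is an arbitrary choice, not from the original script):

import time
from huggingface_hub import snapshot_download

while True:
    try:
        # Re-running the call resumes a partially completed download.
        snapshot_download(
            repo_id="Melmaphother/crag-mm-image-search-images",
            repo_type="dataset",
            local_dir="crag-mm-image-search-images",
        )
        print("Download completed successfully.")
        break
    except Exception as err:
        print(f"Download failed ({err}); retrying in 30 s")
        time.sleep(30)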
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Summary
Healthy CT scans of abdominal organs (liver, pancreas, and kidney), filtered from public datasets.
Downloading Instructions
1- Install the Hugging Face library:
pip install -U "huggingface_hub[cli]"
2- Download the dataset:
mkdir HealthyCT
cd HealthyCT
huggingface-cli download qicq1c/HealthyCT --repo-type dataset --local-dir . --cache-dir ./cache
[Optional] Resume downloading
In case you had a previous interrupted download… See the full description on the dataset page: https://huggingface.co/datasets/qicq1c/HealthyCT.
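A Python equivalent of the CLI command above (a sketch): an interrupted download resumes automatically when the same call is re-run, because files already completed are skipped.

from huggingface_hub import snapshot_download

# Mirrors the mkdir/cd/huggingface-cli sequence above, run from the parent directory.
snapshot_download(
    repo_id="qicq1c/HealthyCT",
    repo_type="dataset",
    local_dir="HealthyCT",
    cache_dir="./cache",
)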
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Description
This is the dataset repository used in the pyiqa toolbox. Please refer to Awesome Image Quality Assessment for details of each dataset. Example command-line script with huggingface-cli:
huggingface-cli download chaofengc/IQA-PyTorch-Datasets live.tgz --local-dir ./datasets --repo-type dataset
cd datasets
tar -xzvf live.tgz
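A rough Python equivalent of that command-line example (a sketch; the extraction directory mirrors the --local-dir value):

import tarfile
from huggingface_hub import hf_hub_download

# Download live.tgz from the dataset repo, then unpack it.
path = hf_hub_download(
    repo_id="chaofengc/IQA-PyTorch-Datasets",
    filename="live.tgz",
    repo_type="dataset",
    local_dir="./datasets",
)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./datasets")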
Disclaimer for This Dataset Collection
This collection of datasets is compiled and maintained for academic, research, and educational… See the full description on the dataset page: https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Tamil Youtube
Channels selected from https://www.youtube.com using the keyword 'tamil podcast'. In total: 121,347 audio files, 11,292.83 hours.
how to download
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/tamil-youtube
wget https://gist.githubusercontent.com/huseinzol05/2e26de4f3b29d99e993b349864ab6c10/raw/9b2251f3ff958770215d70c8d82d311f82791b78/unzip.py
python3 unzip.py
Licensing
All the videos, songs, images… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/tamil-youtube.
MIT License: https://opensource.org/licenses/MIT
To download the dataset, use:
huggingface-cli download --repo-type dataset --resume-download p1k0/OCRMT30K-refine --local-dir OCRMT30K-refine
original_data: original annotations
whole_image_v2.zip: image files
This repository is a collection of images from sbsfigures.
How to use this repo.
Download:
huggingface-cli download Ryoo72/sbsfigures_imgs --repo-type dataset
Unzip:
cat partial-imgs* > imgs.tar.gz
tar -zxvf imgs.tar.gz
Use it with the following datasets.
Ryoo72/sbsfigures_qa Ryoo72/sbsfigures_extract
How I uploaded this repo.
Split:
split -b 20G -d --suffix-length=2 imgs.tar.gz partial-imgs.
Upload:
from huggingface_hub import HfApi
import glob… See the full description on the dataset page: https://huggingface.co/datasets/Ryoo72/sbsfigures_imgs.
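The upload snippet above is truncated; a hedged sketch of what such a loop might look like follows (an assumption built from the split command above, not the author's actual script):

import glob
from huggingface_hub import HfApi

api = HfApi()
# Upload each 20G part produced by the split command above.
for part in sorted(glob.glob("partial-imgs.*")):
    api.upload_file(
        path_or_fileobj=part,
        path_in_repo=part,
        repo_id="Ryoo72/sbsfigures_imgs",
        repo_type="dataset",
    )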
MIT License: https://opensource.org/licenses/MIT
audiocaps
Hugging Face mirror of the official data repo.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Model Card for the TSpec-LLM Dataset
Dataset Description
Abstract
This dataset contains processed documentation files from the 3GPP (3rd Generation Partnership Project) standards, converted to markdown and docx formats. It is intended for use in telecommunications research, natural language processing, and machine learning applications, particularly those focusing on telecommunications standards and technologies.
🚀 Dataset Update: Now Up-to-Date… See the full description on the dataset page: https://huggingface.co/datasets/rasoul-nikbakht/TSpec-LLM.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
1X World Model Compression Challenge Dataset
This repository hosts the dataset for the 1X World Model Compression Challenge.
huggingface-cli download 1x-technologies/worldmodel --repo-type dataset --local-dir data
Updates Since v1.1
Train/Val v2.0 (~100 hours), replacing v1.1
Test v2.0 dataset for the Compression Challenge
Faces blurred for privacy
New raw video dataset (CC-BY-NC-SA 4.0) at worldmodel_raw_data
Example scripts now split into: cosmos_video_decoder.py —… See the full description on the dataset page: https://huggingface.co/datasets/1x-technologies/world_model_tokenized_data.
License: https://choosealicense.com/licenses/bsl-1.0/
pip install diffusers transformers para-attn numpy pandas hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download thanhhau097/farfetch_singapore_images --repo-type dataset --local-dir .
cat farfetch_singapore_images.zip.* > farfetch_singapore_images.zip
unzip -qq farfetch_singapore_images.zip
unzip -qq farfetch_masks_and_denseposes.zip
rm *.zip
pip install sentencepiece
HF_TOKEN= python create_farfetch_mask_free_data.py --k 1 --gpu_id 0 --root_folder ./
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Raw dataset for the 1X World Model Sampling Challenge. Download with:
huggingface-cli download 1x-technologies/worldmodel_raw_data --repo-type dataset --local-dir data
Train/Val v2.0
The training dataset is sharded into 100 independent shards. The definitions are as follows:
video_{shard}.mp4: Raw video with a resolution of 512x512.
segment_idx_{shard}.bin: Maps each frame i to its corresponding segment index. You may want to use this to separate non-contiguous frames from… See the full description on the dataset page: https://huggingface.co/datasets/1x-technologies/world_model_raw_data.
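A sketch of how segment_idx_{shard}.bin might be used to recover contiguous runs of frames; the dtype is an assumption (the card does not state it), so adjust it to whatever the release actually uses:

import numpy as np

# Assumed dtype; verify against the release before relying on this.
seg = np.fromfile("segment_idx_0.bin", dtype=np.int32)

# Positions where the segment index changes mark boundaries between
# non-contiguous recordings; split the frame indices at those points.
boundaries = np.flatnonzero(np.diff(seg)) + 1
segments = np.split(np.arange(len(seg)), boundaries)
print(f"{len(segments)} contiguous segments in shard 0")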