100+ datasets found

laion-occupation
huggingface.co
Updated Apr 3, 2023
Cite
Artificial Intelligence & Machine Learning Lab at TU Darmstadt (2023). laion-occupation [Dataset]. https://huggingface.co/datasets/AIML-TUDA/laion-occupation
Dataset updated
Apr 3, 2023
Dataset authored and provided by
Artificial Intelligence & Machine Learning Lab at TU Darmstadt
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
LAION Occupation

This dataset is a subset of LAION-2B-en containing 1.8M samples, each assigned to one of 153 occupations. This dataset was curated as part of our investigation into gender-occupation biases in LAION presented in Fair Diffusion. For downloading the images, check out img2dataset.

Data Collection

We identified relevant images in the dataset by computing their CLIP similarity to a textual description of the target occupation. All descriptions were in the… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/laion-occupation.
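
As a rough illustration of the CLIP-similarity matching described above, the sketch below assigns an image embedding to whichever occupation description it is most similar to. The full procedure is truncated in this listing, so the argmax rule, the optional `min_similarity` cutoff, and the toy 3-dimensional vectors (standing in for real CLIP embeddings) are all assumptions, not the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_occupation(image_embedding, occupation_embeddings, min_similarity=0.0):
    """Return (occupation, similarity) for the best-matching description.

    Returns (None, similarity) when the best match falls below
    min_similarity. Both the argmax rule and the cutoff are assumptions.
    """
    best_name, best_sim = None, -1.0
    for name, text_embedding in occupation_embeddings.items():
        sim = cosine_similarity(image_embedding, text_embedding)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < min_similarity:
        return None, best_sim
    return best_name, best_sim


# Toy vectors standing in for CLIP text embeddings of occupation descriptions.
occupations = {
    "doctor": [0.9, 0.1, 0.0],
    "carpenter": [0.0, 0.2, 0.9],
}

name, similarity = assign_occupation([0.8, 0.2, 0.1], occupations)
print(name, round(similarity, 3))
```

In practice the embeddings would come from a CLIP model rather than being written by hand, and the dataset page should be consulted for the selection rule actually used.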
LAION-5B Dataset
paperswithcode.com
Updated Apr 3, 2022
Cite
Christoph Schuhmann; Romain Beaumont; Richard Vencu; Cade Gordon; Ross Wightman; Mehdi Cherti; Theo Coombes; Aarush Katta; Clayton Mullis; Mitchell Wortsman; Patrick Schramowski; Srivatsa Kundurthy; Katherine Crowson; Ludwig Schmidt; Robert Kaczmarczyk; Jenia Jitsev (2022). LAION-5B Dataset [Dataset]. https://paperswithcode.com/dataset/laion-5b
Dataset updated
Apr 3, 2022
Authors
Christoph Schuhmann; Romain Beaumont; Richard Vencu; Cade Gordon; Ross Wightman; Mehdi Cherti; Theo Coombes; Aarush Katta; Clayton Mullis; Mitchell Wortsman; Patrick Schramowski; Srivatsa Kundurthy; Katherine Crowson; Ludwig Schmidt; Robert Kaczmarczyk; Jenia Jitsev
Description
LAION-5B is a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. Of these, 2.3B have English text, 2.2B come from 100+ other languages, and 1B have text that cannot be assigned to a particular language (e.g. names). Additionally, we provide several nearest-neighbor indices, an improved web interface for exploration and subset creation, and detection scores for watermarks and NSFW content.
LAION-400M Dataset
paperswithcode.com
Updated Nov 5, 2021
Cite
Christoph Schuhmann; Richard Vencu; Romain Beaumont; Robert Kaczmarczyk; Clayton Mullis; Aarush Katta; Theo Coombes; Jenia Jitsev; Aran Komatsuzaki (2021). LAION-400M Dataset [Dataset]. https://paperswithcode.com/dataset/laion-400m
Dataset updated
Nov 5, 2021
Authors
Christoph Schuhmann; Richard Vencu; Romain Beaumont; Robert Kaczmarczyk; Clayton Mullis; Aarush Katta; Theo Coombes; Jenia Jitsev; Aran Komatsuzaki
Description
LAION-400M is a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.

⚠️ Disclaimer & Content Warning (from the authors): Our filtering protocol only removed NSFW images detected as illegal, but the dataset still contains NSFW content, which is marked accordingly in the metadata. When freely navigating through the dataset, keep in mind that it is a large-scale, non-curated set crawled from the internet for research purposes, such that collected links may lead to discomforting and disturbing content. Therefore, please use the demo links with caution. You can extract a “safe” subset by filtering out samples marked as NSFW or by applying stricter CLIP filtering.

There is a certain degree of duplication because we used URL+text as deduplication criteria. The same image with the same caption may sit at different URLs, causing duplicates. The same image with other captions is not, however, considered duplicated.

Using KNN clustering should make it easy to further deduplicate by image content.
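
The URL+text deduplication described above can be sketched as follows, to show why mirrored images survive it. The record layout and the field names `url` and `text` are hypothetical, chosen for illustration rather than taken from the dataset's schema.

```python
def deduplicate_by_url_and_text(samples):
    """Keep the first sample seen for each (url, text) pair."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["url"], sample["text"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


samples = [
    {"url": "https://a.example/cat.jpg", "text": "a cat"},
    # Exact repeat of URL and caption: dropped.
    {"url": "https://a.example/cat.jpg", "text": "a cat"},
    # Same caption (and possibly the same image) at another URL: kept.
    {"url": "https://b.example/cat.jpg", "text": "a cat"},
    # Same URL with a different caption: kept, not considered a duplicate.
    {"url": "https://a.example/cat.jpg", "text": "my pet"},
]

unique = deduplicate_by_url_and_text(samples)
print(len(unique))
```

Because the key never looks at pixel content, removing the mirrored copy would need a content-based step such as the nearest-neighbor approach the description mentions.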
LAION-400-MILLION OPEN DATASET
academictorrents.com
bittorrent
Updated Sep 14, 2021
Cite
None (2021). LAION-400-MILLION OPEN DATASET [Dataset]. https://academictorrents.com/details/34b94abbcefef5a240358b9acd7920c8b675aacc
Available download formats: bittorrent (1211103363514)
Dataset updated
Sep 14, 2021
Authors
None
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LAION-400M

The world’s largest openly available image-text-pair dataset with 400 million samples.

# Concept and Content

The LAION-400M dataset is completely open and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI’s CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.

# Download Information

You can find:
The CLIP image embeddings (NumPy files)
The parquet files
KNN index of image embeddings

# LAION-400M Dataset Statistics

The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th
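
The similarity filter described under Concept and Content (drop pairs whose image-text cosine similarity is below 0.3) might look roughly like this. It is a minimal sketch: the record fields `image_emb` and `text_emb` are hypothetical names, and the short hand-written vectors stand in for real CLIP embeddings.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # threshold quoted in the dataset description


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_filter(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep pairs whose image/text similarity is at least the threshold."""
    return [
        pair
        for pair in pairs
        if cosine_similarity(pair["image_emb"], pair["text_emb"]) >= threshold
    ]


# Toy 2-d embeddings: the first pair is aligned, the second is orthogonal.
pairs = [
    {"id": "match", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"id": "mismatch", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]

kept = clip_filter(pairs)
print([pair["id"] for pair in kept])
```

Raising `threshold` gives the stricter CLIP filtering that other entries on this page mention as one way to obtain a cleaner subset.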
freesound-laion-640k-commercial-16khz-full
huggingface.co
Updated Sep 13, 2024
Cite
Benjamin Paine (2024). freesound-laion-640k-commercial-16khz-full [Dataset]. https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k-commercial-16khz-full
Dataset updated
Sep 13, 2024
Authors
Benjamin Paine
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About this Repository

This repository is the training split of the complete FreeSound LAION 640k dataset, limited to licenses that permit commercial works and resampled to 16 kHz using torchaudio.transforms.Resample. This is ideal for use cases where a variety of audio is desired but fidelity and labels are unnecessary, such as background audio for augmenting other datasets.

Dataset Versions

You are looking at the full dataset which contains 403,146 unique sounds… See the full description on the dataset page: https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k-commercial-16khz-full.
Laion-5b - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Laion-5b - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-5b
Dataset updated
Dec 2, 2024
Description
A large-scale dataset of text and images for training next-generation language models.
Zhangheng Li, Junyuan Hong, Bo Li, Zhangyang Wang (2024). Dataset: LAION-2B...
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Zhangheng Li, Junyuan Hong, Bo Li, Zhangyang Wang (2024). Dataset: LAION-2B dataset. https://doi.org/10.57702/1dclhs95 [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-2b-dataset
Dataset updated
Dec 2, 2024
Description
The LAION-2B dataset used in the pre-training of the diffusion model. The dataset consists of 2.17B images, including public and private domains.
LAION-2B
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). LAION-2B [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-2b
Dataset updated
Dec 2, 2024
Description
The dataset used in the paper is LAION-2B, which is a large-scale image-text dataset. The authors fine-tune a pre-trained diffusion model with a subset of LAION-2B with 10k randomly selected samples.
freesound-laion-640k
huggingface.co
Cite
Benjamin Paine, freesound-laion-640k [Dataset]. https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k
Authors
Benjamin Paine
License
https://choosealicense.com/licenses/cc/
Description
About this Repository

This repository is a re-upload of the FreeSound.org dataset as curated by LAION for the larger LAION-Audio-630k dataset, with the following changes:

Limited columns to only the audio and basic metadata.
Incorporated necessary information for licensing and attribution.
Removed ambiguously licensed samples, amounting to around 1,000 total samples.

What about download links?

Links were omitted for the sake of size, as they can be constructed from… See the full description on the dataset page: https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k.
Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu (2024). Dataset:...
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu (2024). Dataset: LAION-5B dataset. https://doi.org/10.57702/y2gme96b [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-5b-dataset
Dataset updated
Dec 2, 2024
Description
The dataset used in the paper is not explicitly described, but it is mentioned that the authors used the LAION-5B dataset for training.
LAION COCO Dataset
paperswithcode.com
Cite
LAION COCO Dataset [Dataset]. https://paperswithcode.com/dataset/laion-coco
Description
LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web images. The images are drawn from the English subset of LAION-5B, and the captions were generated with an ensemble of BLIP L/14 and two CLIP versions (L/14 and RN50x64). This dataset allows models to produce high-quality captions for images.
LION
catalog.data.gov
data.cityofnewyork.us
Updated May 31, 2025
Cite
data.cityofnewyork.us (2025). LION [Dataset]. https://catalog.data.gov/dataset/lion
Dataset updated
May 31, 2025
Dataset provided by
data.cityofnewyork.us
Description
A single line street base map representing the city's streets and other linear geographic features, along with feature names and address ranges for each addressable street segment. This dataset includes the Nodes file. The Nodes file contains a point feature and unique NodeID for each node that exists in the LION file. The Node_StreetName.txt file lists the street names associated with those nodes. Most nodes, representing intersections, will have at least 2 street names associated in the Node_StreetName.txt file. All previously released versions of this data are available at BYTES of the BIG APPLE - Archive.
Stable Diffusion and LAION - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Stable Diffusion and LAION - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/stable-diffusion-and-laion
Dataset updated
Dec 2, 2024
Description
Stable Diffusion and LAION are used as training datasets for the FakeInversion model.
LAION-Aesthetics V2 6.5+ Dataset
paperswithcode.com
Cite
LAION-Aesthetics V2 6.5+ Dataset [Dataset]. https://paperswithcode.com/dataset/laion-aesthetics-v2-6-5
Description
A subset of the LAION-5B samples with English captions, obtained using LAION-Aesthetics_Predictor V2: 625K image-text pairs with predicted aesthetics scores of 6.5 or higher, available at https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6.5plus
LAION HR
kaggle.com
Updated Sep 17, 2022
Cite
Nathan Smith (2022). LAION HR [Dataset]. https://www.kaggle.com/datasets/whatevermcsomething/laion-hr/code
Dataset updated
Sep 17, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nathan Smith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset

This dataset was created by Nathan Smith

Released under Attribution 4.0 International (CC BY 4.0)

Contents
Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu (2024). Dataset:...
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu (2024). Dataset: LAION-Aesthetics 6.5+. https://doi.org/10.57702/zvbnqhl9 [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-aesthetics-6-5-
Dataset updated
Dec 2, 2024
Description
LAION-Aesthetics 6.5+ dataset contains 625K image-text pairs.
LAION-Aesthetic - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). LAION-Aesthetic - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-aesthetic
Dataset updated
Dec 2, 2024
Description
The dataset used in the paper is LAION-Aesthetic, a large-scale image dataset.
LION Differences File
data.amerigeoss.org
catalog.data.gov
zip
Updated Jul 25, 2019
Cite
United States[old] (2019). LION Differences File [Dataset]. https://data.amerigeoss.org/dataset/lion-differences-file
Available download formats: zip
Dataset updated
Jul 25, 2019
Dataset provided by
United States[old]
Description
The LION Differences File (LDF) documents segment and node level changes that have occurred in the LION file between two subsequent releases. This file allows a user who “ties” organizational data to DCP’s Segment ID and/or Node ID to migrate their data appropriately when these changes occur.
laion-5kw-part0
kaggle.com
Updated May 14, 2023
Cite
卡皮巴拉 (2023). laion-5kw-part0 [Dataset]. https://www.kaggle.com/datasets/jiojioearth/laion-5kw-part0/suggestions
Dataset updated
May 14, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
卡皮巴拉
Description
Dataset

This dataset was created by 卡皮巴拉

Contents
laions_got_talent_enhanced_flash_annotations_and_long_captions
huggingface.co
Updated Mar 12, 2025
Cite
LAION eV (2025). laions_got_talent_enhanced_flash_annotations_and_long_captions [Dataset]. https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions
Dataset updated
Mar 12, 2025
Dataset provided by
LAION (https://laion.ai/)
Authors
LAION eV
Description
laion/laions_got_talent_enhanced_flash_annotations_and_long_captions dataset hosted on Hugging Face and contributed by the HF Datasets community
