LAION-400M is a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.
⚠️ Disclaimer & Content Warning (from the authors): Our filtering protocol only removed NSFW images detected as illegal; the dataset still contains NSFW content, which is marked accordingly in the metadata. When browsing the dataset, keep in mind that it is a large-scale, non-curated collection crawled from the internet for research purposes, so the collected links may lead to discomforting or disturbing content. Please use the demo links with caution. You can extract a "safe" subset by filtering out samples marked as NSFW or by applying stricter CLIP filtering.
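For example, a minimal Python sketch of extracting such a "safe" subset from one metadata parquet shard; the file name, column name, and tag values are assumptions based on the released metadata layout and should be checked against your local copy:

```python
# Hedged sketch: keep only rows not flagged NSFW in the metadata.
# Assumes each parquet shard exposes an "NSFW" tag column and a CLIP
# "similarity" column, as in the released metadata (verify locally).
import pandas as pd

df = pd.read_parquet("part-00000.parquet")        # one metadata shard (illustrative name)
safe = df[df["NSFW"] == "UNLIKELY"]               # drop flagged and uncertain rows
# Optionally apply stricter CLIP filtering as well, e.g.:
# safe = safe[safe["similarity"] >= 0.35]
safe.to_parquet("part-00000-safe.parquet")
print(f"kept {len(safe)} of {len(df)} rows")
```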
There is a certain degree of duplication because we used URL+text as the deduplication criterion: the same image with the same caption may sit at different URLs and therefore appear more than once. The same image with a different caption is, however, not considered a duplicate.
Using KNN clustering should make it easy to further deduplicate by image content.
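A hedged sketch of what such content-level deduplication could look like, using the released CLIP image embeddings with FAISS; the file name and similarity threshold below are illustrative, not part of the release:

```python
# Mark a row as a likely duplicate when its nearest neighbour (other than
# itself) exceeds a cosine-similarity threshold on the CLIP image embeddings.
import numpy as np
import faiss

emb = np.load("img_emb_0.npy").astype("float32")  # one embedding shard (illustrative name)
faiss.normalize_L2(emb)                           # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

sims, ids = index.search(emb, 2)                  # k=2: self + closest other image
dup_mask = sims[:, 1] > 0.96                      # tune this threshold on a labelled sample
print(f"{dup_mask.sum()} likely duplicates out of {len(emb)}")
```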
The jp1924/Laion400m-4 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

# Concept and Content
The LAION-400M dataset is entirely open and freely accessible. All images and texts in LAION-400M have been filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluation and appears to be a good heuristic for estimating semantic image-text matching. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

# Download Information
You can find: the CLIP image embeddings (NumPy files), the parquet metadata files, and the kNN index of the image embeddings.

# LAION-400M Dataset Statistics
LAION-400M, and future even larger releases, are in fact datasets of datasets. For instance, it can be filtered by image size into smaller subsets.
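As an illustration of the filtering rule described above, the snippet below recomputes the image-text cosine similarity with an open CLIP implementation and keeps a pair only if it reaches the 0.3 threshold; the model choice and helper function are assumptions for illustration, not the original pipeline:

```python
# Hedged sketch of CLIP-based image-text filtering at a 0.3 cosine threshold.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    """Return True if the CLIP image-text cosine similarity is >= threshold."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() >= threshold
```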
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Nathan Smith
Released under Attribution 4.0 International (CC BY 4.0)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.
The following table presents the definitions of categories used for classifying MMS images.
Table 1: Category definitions
Category | Description |
---|---|
Alcohol* | Content related to alcoholic beverages, including advertisements and consumption. |
Drugs* | Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine). |
Firearms* | Content involving guns, pistols, knives, or military weapons. |
Gambling* | Content related to gambling (casinos, poker, roulette, lotteries). |
Sexual | Content involving nudity, sexual acts, or sexually suggestive material. |
Tobacco* | Content related to tobacco use and advertisements. |
Violence | Content showing violent acts, self-harm, or injury. |
Safe | All other content, including neutral depictions, products, or harmless cultural symbols. |
Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.
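A minimal sketch of the 2-of-3 consensus rule, assuming each image comes with exactly three annotator labels; the function and example labels are illustrative:

```python
# Assign the label chosen by at least two of the three annotators, else no label.
from collections import Counter
from typing import Optional

def consensus(labels: list[str]) -> Optional[str]:
    """Return the majority label (>= 2 of 3 annotators), or None if there is no agreement."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

print(consensus(["Sexual", "Safe", "Sexual"]))  # -> "Sexual"
```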
The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.
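A hedged Python sketch of how such a per-category prompt could be assembled from the shared template; the wording below is illustrative, and the actual template is the one shown in gpt4o_imagenet_prompting_scheme.png:

```python
# Assemble a per-category GPT-4o prompt from a shared base template plus the
# category_definition, file_examples, and exceptions fields described above.
BASE_PROMPT = (
    "You are selecting ImageNet class labels for an MMS image-moderation taxonomy.\n"
    "Category definition: {category_definition}\n"
    "Example labels: {file_examples}\n"
    "Exceptions: {exceptions}\n"
    "Return only ImageNet labels that clearly match this category."
)

def build_prompt(category_definition: str, file_examples: str, exceptions: str = "-") -> str:
    """Fill the shared template with the per-category fields (a dash means no exceptions)."""
    return BASE_PROMPT.format(
        category_definition=category_definition,
        file_examples=file_examples,
        exceptions=exceptions,
    )

print(build_prompt(
    category_definition="Content related to gambling (casinos, poker, roulette, lotteries).",
    file_examples="slot machine, roulette wheel",
))
```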
Another 25.1% of images were sourced from Roboflow, using open datasets such as:
The NudeNet dataset contributes 11.4% of the images. We sampled 1,000 images from its "porn" category to provide visual coverage of explicit sexual content.
Another 11.0% of images were collected from Kaggle, including:
An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.
Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.
Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.
All images collected from the listed sources were manually reviewed by three independent annotators, and each image was assigned to a category when at least two annotators reached consensus.
Table 2: Distribution of images per public source and category in SIMAS dataset
Type | Category | LAION | Roboflow | NudeNet | Kaggle | Unsplash | UnsafeBench | Other | Total |
---|---|---|---|---|---|---|---|---|---|
Unsafe | Alcohol | 29 | 0 | 3 | 267 | 0 | 1 | 0 | 300 |
Unsafe | Drugs | 17 | 211 | 0 | 0 | 13 | 8 | 1 | 250 |
Unsafe | Firearms | 0 | 59 | 0 | 229 | 0 | 62 | 0 | 350 |
Unsafe | Gambling | 132 | 38 | 0 | 0 | 73 | 39 | 18 | 300 |
Unsafe | Sexual | 2 | 0 | 421 | 0 | 3 | 68 | 6 | 500 |
Unsafe | Tobacco | 0 | 446 | 0 | 0 | 43 | 11 | 0 | 500 |
Unsafe | Violence | 0 | 289 | 0 | 0 | 0 | 11 | 0 | 300 |
Safe | Alcohol | 140 | 35 | 0 | 0 | 16 | 13 | 96 | 300 |
Safe | Drugs | 67 | 49 | 0 | 15 | 72 | 17 | 30 | 250 |
Safe | Firearms | 173 | 15 | 0 | 3 | 144 | 8 | 7 | 350 |
Safe | Gambling | 164 | 2 | 0 | 1 | 121 | 12 | 0 | 300 |
Safe | Sexual | 235 | 22 | 139 | 2 | 0 | 94 | 8 | 500 |
Safe | Tobacco | 351 | 67 | 5 | 13 | 8 | 16 | 40 | 500 |
Safe | Violence | 212 | 20 | 3 | 21 | 0 | 42 | 2 | 300 |
All | All | 1,522 | 1,253 | 571 | 551 | 493 | 402 | 208 | 5,000 |
To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.
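A sketch of this undersampling step, assuming precomputed CLIP image embeddings per category; selecting the image closest to each k-means centroid is one reasonable reading of the strategy, not the authors' exact procedure:

```python
# Undersample an overrepresented category to `target_size` images while
# preserving semantic diversity: cluster the CLIP embeddings into
# `target_size` k-means clusters and keep one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def undersample(embeddings: np.ndarray, target_size: int) -> np.ndarray:
    """Return indices of `target_size` images, one per k-means cluster."""
    km = KMeans(n_clusters=target_size, n_init=10, random_state=0)
    km.fit(embeddings)
    keep = []
    for c, centroid in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        keep.append(members[np.argmin(dists)])   # image nearest the centroid
    return np.array(keep)
```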
Table 3: Distribution of images per category in SIMAS
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large-scale vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities, plus 183K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MIRAGE Pretraining/Finetuning Dataset Card
Dataset details
Dataset type: This dataset is designed to train the visual-RAG model MIRAGE-8.3B. It contains the files needed for (multi-stage) pre-training as well as fine-tuning.
Data Preparation:
Stage 1 pretraining: Q-Former and visual alignment layer (low-quality data)
Source: LAION-400M, CC12M, and MSCOCO from here. Put all these .tar files under the /datasets directory. stage1_pretraining.txt provides an example dataset. … See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/MIRAGE-training-set.