LAION-400M is a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.
⚠️ Disclaimer & Content Warning (from the authors): Our filtering protocol only removed NSFW images detected as illegal; the dataset still contains NSFW content, which is marked accordingly in the metadata. When browsing the dataset, keep in mind that it is a large-scale, non-curated collection crawled from the internet for research purposes, so the collected links may lead to discomforting or disturbing content. Please use the demo links with caution. You can extract a "safe" subset by filtering out samples marked as NSFW or by applying stricter CLIP filtering.
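For example, a minimal Python sketch of extracting such a "safe" subset from one metadata parquet shard; the file name, column name, and tag values are assumptions based on the released metadata layout and should be checked against your local copy:

```python
# Hedged sketch: keep only rows not flagged NSFW in the metadata.
# Assumes each parquet shard exposes an "NSFW" tag column and a CLIP
# "similarity" column, as in the released metadata (verify locally).
import pandas as pd

df = pd.read_parquet("part-00000.parquet")        # one metadata shard (illustrative name)
safe = df[df["NSFW"] == "UNLIKELY"]               # drop flagged and uncertain rows
# Optionally apply stricter CLIP filtering as well, e.g.:
# safe = safe[safe["similarity"] >= 0.35]
safe.to_parquet("part-00000-safe.parquet")
print(f"kept {len(safe)} of {len(df)} rows")
```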
There is a certain degree of duplication because we used URL+text as the deduplication criterion: the same image with the same caption may sit at different URLs and therefore appear more than once. The same image with a different caption is, however, not considered a duplicate.
Using KNN clustering should make it easy to further deduplicate by image content.
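A hedged sketch of what such content-level deduplication could look like, using the released CLIP image embeddings with FAISS; the file name and similarity threshold below are illustrative, not part of the release:

```python
# Mark a row as a likely duplicate when its nearest neighbour (other than
# itself) exceeds a cosine-similarity threshold on the CLIP image embeddings.
import numpy as np
import faiss

emb = np.load("img_emb_0.npy").astype("float32")  # one embedding shard (illustrative name)
faiss.normalize_L2(emb)                           # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

sims, ids = index.search(emb, 2)                  # k=2: self + closest other image
dup_mask = sims[:, 1] > 0.96                      # tune this threshold on a labelled sample
print(f"{dup_mask.sum()} likely duplicates out of {len(emb)}")
```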
The jp1924/Laion400m-4 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

# Concept and Content
The LAION-400M dataset is entirely open and freely accessible. All images and texts in LAION-400M have been filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluation and appears to be a good heuristic for estimating semantic image-text matching. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

# Download Information
You can find: the CLIP image embeddings (NumPy files), the parquet metadata files, and the kNN index of the image embeddings.

# LAION-400M Dataset Statistics
LAION-400M, and future even larger releases, are in fact datasets of datasets. For instance, it can be filtered by image size into smaller subsets.
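As an illustration of the filtering rule described above, the snippet below recomputes the image-text cosine similarity with an open CLIP implementation and keeps a pair only if it reaches the 0.3 threshold; the model choice and helper function are assumptions for illustration, not the original pipeline:

```python
# Hedged sketch of CLIP-based image-text filtering at a 0.3 cosine threshold.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    """Return True if the CLIP image-text cosine similarity is >= threshold."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() >= threshold
```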
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Nathan Smith
Released under Attribution 4.0 International (CC BY 4.0)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.
The following table presents the definitions of categories used for classifying MMS images.
Table 1: Category definitions
Category | Description |
---|---|
Alcohol* | Content related to alcoholic beverages, including advertisements and consumption. |
Drugs* | Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine). |
Firearms* | Content involving guns, pistols, knives, or military weapons. |
Gambling* | Content related to gambling (casinos, poker, roulette, lotteries). |
Sexual | Content involving nudity, sexual acts, or sexually suggestive material. |
Tobacco* | Content related to tobacco use and advertisements. |
Violence | Content showing violent acts, self-harm, or injury. |
Safe | All other content, including neutral depictions, products, or harmless cultural symbols. |
Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.
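A minimal sketch of the 2-of-3 consensus rule, assuming each image comes with exactly three annotator labels; the function and example labels are illustrative:

```python
# Assign the label chosen by at least two of the three annotators, else no label.
from collections import Counter
from typing import Optional

def consensus(labels: list[str]) -> Optional[str]:
    """Return the majority label (>= 2 of 3 annotators), or None if there is no agreement."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

print(consensus(["Sexual", "Safe", "Sexual"]))  # -> "Sexual"
```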
The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.
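A hedged Python sketch of how such a per-category prompt could be assembled from the shared template; the wording below is illustrative, and the actual template is the one shown in gpt4o_imagenet_prompting_scheme.png:

```python
# Assemble a per-category GPT-4o prompt from a shared base template plus the
# category_definition, file_examples, and exceptions fields described above.
BASE_PROMPT = (
    "You are selecting ImageNet class labels for an MMS image-moderation taxonomy.\n"
    "Category definition: {category_definition}\n"
    "Example labels: {file_examples}\n"
    "Exceptions: {exceptions}\n"
    "Return only ImageNet labels that clearly match this category."
)

def build_prompt(category_definition: str, file_examples: str, exceptions: str = "-") -> str:
    """Fill the shared template with the per-category fields (a dash means no exceptions)."""
    return BASE_PROMPT.format(
        category_definition=category_definition,
        file_examples=file_examples,
        exceptions=exceptions,
    )

print(build_prompt(
    category_definition="Content related to gambling (casinos, poker, roulette, lotteries).",
    file_examples="slot machine, roulette wheel",
))
```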
Another 25.1% of images were sourced from Roboflow, using open datasets such as:
The NudeNet dataset contributes 11.4% of the images. We sampled 1,000 images from its "porn" category to provide visual coverage of explicit sexual content.
Another 11.0% of images were collected from Kaggle, including:
An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.
Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.
Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.
All images collected from the listed sources were manually reviewed by three independent annotators, and each image was assigned to a category when at least two annotators reached consensus.
Table 2: Distribution of images per public source and category in SIMAS dataset
Type | Category | LAION | Roboflow | NudeNet | Kaggle | Unsplash | UnsafeBench | Other | Total |
---|---|---|---|---|---|---|---|---|---|
Unsafe | Alcohol | 29 | 0 | 3 | 267 | 0 | 1 | 0 | 300 |
Unsafe | Drugs | 17 | 211 | 0 | 0 | 13 | 8 | 1 | 250 |
Unsafe | Firearms | 0 | 59 | 0 | 229 | 0 | 62 | 0 | 350 |
Unsafe | Gambling | 132 | 38 | 0 | 0 | 73 | 39 | 18 | 300 |
Unsafe | Sexual | 2 | 0 | 421 | 0 | 3 | 68 | 6 | 500 |
Unsafe | Tobacco | 0 | 446 | 0 | 0 | 43 | 11 | 0 | 500 |
Unsafe | Violence | 0 | 289 | 0 | 0 | 0 | 11 | 0 | 300 |
Safe | Alcohol | 140 | 35 | 0 | 0 | 16 | 13 | 96 | 300 |
Safe | Drugs | 67 | 49 | 0 | 15 | 72 | 17 | 30 | 250 |
Safe | Firearms | 173 | 15 | 0 | 3 | 144 | 8 | 7 | 350 |
Safe | Gambling | 164 | 2 | 0 | 1 | 121 | 12 | 0 | 300 |
Safe | Sexual | 235 | 22 | 139 | 2 | 0 | 94 | 8 | 500 |
Safe | Tobacco | 351 | 67 | 5 | 13 | 8 | 16 | 40 | 500 |
Safe | Violence | 212 | 20 | 3 | 21 | 0 | 42 | 2 | 300 |
All | All | 1,522 | 1,253 | 571 | 551 | 493 | 402 | 208 | 5,000 |
To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.
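A sketch of this undersampling step, assuming precomputed CLIP image embeddings per category; selecting the image closest to each k-means centroid is one reasonable reading of the strategy, not the authors' exact procedure:

```python
# Undersample an overrepresented category to `target_size` images while
# preserving semantic diversity: cluster the CLIP embeddings into
# `target_size` k-means clusters and keep one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def undersample(embeddings: np.ndarray, target_size: int) -> np.ndarray:
    """Return indices of `target_size` images, one per k-means cluster."""
    km = KMeans(n_clusters=target_size, n_init=10, random_state=0)
    km.fit(embeddings)
    keep = []
    for c, centroid in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        keep.append(members[np.argmin(dists)])   # image nearest the centroid
    return np.array(keep)
```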
Table 3: Distribution of images per category in SIMAS
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large-scale vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities, plus 183K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MIRAGE Pretraining/Finetuning Dataset Card
Dataset details
Dataset type: This dataset is designed to train the visual-RAG model MIRAGE-8.3B. It contains the files needed for (multi-stage) pre-training as well as fine-tuning.
Data Preparation:
Stage 1 pretraining: Q-Former and visual alignment layer (low-quality data)
Source: LAION-400M, CC12M, and MSCOCO from here. Put all these .tar files under the /datasets directory. stage1_pretraining.txt provides an example dataset. … See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/MIRAGE-training-set.