The LAION-400M dataset is completely open and freely accessible.
Check https://laion.ai/laion-400-open-dataset/ for the full description of this dataset.
All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching.
The image-text pairs have been extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.
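To make the filtering rule concrete, here is a minimal sketch of the 0.3 cosine-similarity cutoff, assuming CLIP image and text embeddings are already available as NumPy vectors; passes_clip_filter is a hypothetical helper for illustration, not code from the LAION pipeline:

import numpy as np

def passes_clip_filter(image_emb, text_emb, threshold=0.3):
    """Keep a pair only if its CLIP cosine similarity reaches the threshold."""
    # Normalizing both vectors makes the dot product equal cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb) >= threshold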
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('laion400m', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
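Beyond printing a few examples, a typical training pipeline shuffles, batches, and prefetches the split. This is a standard tf.data sketch, not something the dataset card prescribes:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('laion400m', split='train')
# Shuffle with a modest buffer, batch, and overlap input prep with training.
ds = ds.shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)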
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAION-400M: The world's largest openly available image-text-pair dataset, with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely open and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs have been extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

# Download Information
You can find:
- The CLIP image embeddings (NumPy files)
- The parquet files
- A KNN index of the image embeddings

# LAION-400M Dataset Statistics
LAION-400M, and future even bigger releases, are in fact datasets of datasets. For instance, the collection can be filtered by image size into smaller datasets like th…
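As one example of the "datasets of datasets" idea, the metadata parquet shards can be loaded with pandas and filtered by image size. The shard filename below is a placeholder, and the column names (WIDTH, HEIGHT) are assumptions based on the public release notes rather than a guaranteed schema:

import pandas as pd

# Placeholder shard name: the release is split across many numbered parquet files.
df = pd.read_parquet("part-00000.parquet")

# Derive a smaller sub-dataset by keeping only reasonably large images.
large = df[(df["WIDTH"] >= 512) & (df["HEIGHT"] >= 512)]
print(len(large), "of", len(df), "rows survive the size filter")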
The jp1924/Laion400m-2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
LAION-Aesthetic is a large-scale dataset for training next-generation image-text models.
Text-to-image Latent Diffusion model, CLIP model, Blended Diffusion model, GLIDE model, GLIDE-filtered model
The dataset used for pre-training the MS-CLIP model, consisting of 20 million image-text pairs filtered from LAION-400M.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MIRAGE Pretraining/Finetuning Dataset Card
Dataset details
Dataset type: This dataset is designed to train the visual-RAG model, MIRAGE-8.3B. It contains files for (multi-stage) pre-training as well as fine-tuning.
Data Preparation:
Stage1 Pretraining: Q-Former and visual alignment layer (low-quality data)
Source: LAION-400M, CC12M, and MSCOCO from here. Put all these .tar files under the /datasets directory; a streaming sketch for such shards follows this card. stage1_pretraining.txt provides an example dataset.
Stage2… See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/MIRAGE-training-set.
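If the MIRAGE .tar shards follow the WebDataset convention (common for LAION-style dumps, though the card does not say so explicitly), they can be streamed roughly as below; the brace pattern is a placeholder for the actual shard names under /datasets:

import webdataset as wds

# Placeholder shard pattern; adjust to the actual .tar names under /datasets.
dataset = (
    wds.WebDataset("/datasets/{00000..00009}.tar")
    .decode("pil")            # decode images to PIL objects
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break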