7 datasets found
  1. laion400m

    • tensorflow.org
    • opendatalab.com
    Updated Sep 3, 2024
    Cite
    (2024). laion400m [Dataset]. https://www.tensorflow.org/datasets/catalog/laion400m
    Description

    The LAION-400M dataset is completely open and freely accessible.

    Check https://laion.ai/laion-400-open-dataset/ for the full description of this dataset.

    All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and appeared to be a good heuristic for estimating whether the text semantically matches the image content.
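The filtering rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not LAION's actual pipeline: the embeddings are tiny hand-made vectors rather than real CLIP outputs, the helper names `cosine_similarity` and `filter_pairs` are hypothetical, and all embeddings are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, threshold=0.3):
    """Keep (caption, similarity) for pairs whose similarity is at least `threshold`.

    `pairs` is a list of (caption, image_embedding, text_embedding) tuples.
    The default of 0.3 is the threshold quoted in the dataset description.
    """
    kept = []
    for caption, image_emb, text_emb in pairs:
        sim = cosine_similarity(image_emb, text_emb)
        if sim >= threshold:
            kept.append((caption, sim))
    return kept

# Toy 2-D stand-ins for CLIP embeddings (assumption: real ones are much larger).
toy_pairs = [
    ("a photo of a cat", [1.0, 0.0], [1.0, 0.0]),         # identical direction
    ("click here to subscribe", [1.0, 0.0], [0.0, 1.0]),  # orthogonal, dropped
    ("cat on a sofa", [0.0, 1.0], [1.0, 1.0]),            # partially aligned
]

kept = filter_pairs(toy_pairs)
for caption, sim in kept:
    print(f"{caption}: {sim:.3f}")
```

In this toy run the orthogonal pair falls below the threshold and is dropped, while the other two pairs are kept.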

    The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('laion400m', split='train')
    for ex in ds.take(4):
        print(ex)


    See the guide for more information on tensorflow_datasets.

  2. LAION-400-MILLION OPEN DATASET

    • academictorrents.com
    bittorrent
    Updated Sep 14, 2021
    Cite
    None (2021). LAION-400-MILLION OPEN DATASET [Dataset]. https://academictorrents.com/details/34b94abbcefef5a240358b9acd7920c8b675aacc
    Available download formats: bittorrent (1211103363514)
    Authors
    None
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAION-400M: The world's largest openly available image-text-pair dataset with 400 million samples.

    # Concept and Content

    The LAION-400M dataset is completely open and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating whether the text semantically matches the image content. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

    # Download Information

    You can find:
    • The CLIP image embeddings (NumPy files)
    • The parquet files
    • KNN index of image embeddings

    # LAION-400M Dataset Statistics

    The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th

  3. Laion400m-2

    • huggingface.co
    Updated Oct 17, 2024
    Cite
    jp1924 (2024). Laion400m-2 [Dataset]. https://huggingface.co/datasets/jp1924/Laion400m-2
    Authors
    jp1924
    Description

    jp1924/Laion400m-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Data from: LAION-400M: Open dataset of clip-filtered 400 million image-text...

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). LAION-400M: Open dataset of clip-filtered 400 million image-text pairs [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-400m--open-dataset-of-clip-filtered-400-million-image-text-pairs
    Description

    LAION-Aesthetic is a large-scale dataset for training next-generation image-text models.

  5. Laion-400M - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). Laion-400M - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-400m
    Description

    Text-to-image Latent Diffusion model, CLIP model, Blended Diffusion model, GLIDE model, GLIDE-filtered model

  6. Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu,...

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan (2024). Dataset: Laion-20M. https://doi.org/10.57702/070kx7rz [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-20m
    Description

    The dataset used for pre-training the MS-CLIP model, which consists of 20 million image-text pairs filtered from Laion-400M.

  7. MIRAGE-training-set

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    Patrick (Tsung-Han) Wu (2025). MIRAGE-training-set [Dataset]. https://huggingface.co/datasets/tsunghanwu/MIRAGE-training-set
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Patrick (Tsung-Han) Wu
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MIRAGE Pretraining/Finetuning Dataset Card

    Dataset details

    Dataset type: This dataset is designed to train the visual-RAG model MIRAGE-8.3B. It contains the files needed for (multi-stage) pre-training and fine-tuning.

    Data Preparation:

    Stage1 Pretraining: Q-Former and visual alignment layer (low-quality data)

    Source: LAION-400M, CC12M, and MSCOCO from here. Put all these .tar files under the /datasets directory. stage1_pretraining.txt provides an example dataset.

    Stage2… See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/MIRAGE-training-set.

laion400m (tensorflow.org): 179 scholarly articles cite this dataset.