100+ datasets found
  1. laion-occupation

    • huggingface.co
    Updated Apr 3, 2023
    Cite
    Artificial Intelligence & Machine Learning Lab at TU Darmstadt (2023). laion-occupation [Dataset]. https://huggingface.co/datasets/AIML-TUDA/laion-occupation
    Dataset updated
    Apr 3, 2023
    Dataset authored and provided by
    Artificial Intelligence & Machine Learning Lab at TU Darmstadt
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    LAION Occupation

    This dataset is a subset of LAION-2B-en containing 1.8M samples, each assigned to one of 153 occupations. This dataset was curated as part of our investigation into gender-occupation biases in LAION presented in Fair Diffusion. For downloading the images, check out img2dataset.

      Data Collection
    

    We identified relevant images in the dataset by computing their CLIP similarity to a textual description of the target occupation. All descriptions were in the… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/laion-occupation.
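
The similarity step described above can be illustrated with a minimal sketch. It assumes image and description embeddings have already been computed with CLIP; the toy 3-dimensional vectors, the occupation names, and the helper `best_matching_occupation` are hypothetical. Picking the highest-scoring description is an assumption made for illustration, since the truncated description does not state the exact selection rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_matching_occupation(image_embedding, occupation_embeddings):
    """Return (occupation, similarity) for the highest-scoring description.

    Assumption: argmax selection; the dataset authors' rule may differ.
    """
    scored = (
        (name, cosine_similarity(image_embedding, embedding))
        for name, embedding in occupation_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings standing in for real CLIP vectors.
occupation_embeddings = {
    "doctor": [1.0, 0.0, 0.0],
    "teacher": [0.0, 1.0, 0.0],
}
image_embedding = [0.9, 0.1, 0.0]

occupation, score = best_matching_occupation(image_embedding, occupation_embeddings)
print(occupation, round(score, 3))
```

With real data, the vectors would come from a CLIP image encoder and text encoder rather than being written by hand.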

  2. LAION-5B Dataset

    • paperswithcode.com
    Updated Apr 3, 2022
    + more versions
    Cite
    Christoph Schuhmann; Romain Beaumont; Richard Vencu; Cade Gordon; Ross Wightman; Mehdi Cherti; Theo Coombes; Aarush Katta; Clayton Mullis; Mitchell Wortsman; Patrick Schramowski; Srivatsa Kundurthy; Katherine Crowson; Ludwig Schmidt; Robert Kaczmarczyk; Jenia Jitsev (2022). LAION-5B Dataset [Dataset]. https://paperswithcode.com/dataset/laion-5b
    Dataset updated
    Apr 3, 2022
    Authors
    Christoph Schuhmann; Romain Beaumont; Richard Vencu; Cade Gordon; Ross Wightman; Mehdi Cherti; Theo Coombes; Aarush Katta; Clayton Mullis; Mitchell Wortsman; Patrick Schramowski; Srivatsa Kundurthy; Katherine Crowson; Ludwig Schmidt; Robert Kaczmarczyk; Jenia Jitsev
    Description

    LAION-5B is a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. 2.3B pairs contain English text, 2.2B come from 100+ other languages, and 1B have texts that cannot be assigned to a particular language (e.g. names). Additionally, we provide several nearest-neighbor indices, an improved web interface for exploration and subset creation, and detection scores for watermarks and NSFW content.

  3. LAION-400M Dataset

    • paperswithcode.com
    Updated Nov 5, 2021
    Cite
    Christoph Schuhmann; Richard Vencu; Romain Beaumont; Robert Kaczmarczyk; Clayton Mullis; Aarush Katta; Theo Coombes; Jenia Jitsev; Aran Komatsuzaki (2021). LAION-400M Dataset [Dataset]. https://paperswithcode.com/dataset/laion-400m
    Dataset updated
    Nov 5, 2021
    Authors
    Christoph Schuhmann; Richard Vencu; Romain Beaumont; Robert Kaczmarczyk; Clayton Mullis; Aarush Katta; Theo Coombes; Jenia Jitsev; Aran Komatsuzaki
    Description

    LAION-400M is a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.

    ⚠️ Disclaimer & Content Warning (from the authors): Our filtering protocol only removed NSFW images detected as illegal, but the dataset still has NSFW content, marked accordingly in the metadata. When freely navigating through the dataset, keep in mind that it is a large-scale, non-curated set crawled from the internet for research purposes, such that collected links may lead to discomforting and disturbing content. Therefore, please use the demo links with caution. You can extract a “safe” subset by filtering out samples flagged as NSFW or via stricter CLIP filtering.

    There is a certain degree of duplication because we used URL+text as deduplication criteria. The same image with the same caption may sit at different URLs, causing duplicates. The same image with other captions is not, however, considered duplicated.

    Using KNN clustering should make it easy to further deduplicate by image content.
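
As a rough illustration of the two points above (extracting a "safe" subset and deduplicating on URL+text), here is a minimal sketch. The record layout, the field names `url`, `text`, and `nsfw`, and the helper `safe_unique_samples` are hypothetical; the real metadata columns and flag values may differ.

```python
def safe_unique_samples(samples):
    """Drop flagged samples, then keep the first record per (url, text) pair."""
    seen = set()
    kept = []
    for sample in samples:
        if sample["nsfw"]:  # hypothetical boolean flag
            continue
        key = (sample["url"], sample["text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


samples = [
    {"url": "https://example.com/a.jpg", "text": "a cat", "nsfw": False},
    # Exact repeat of the first record: removed.
    {"url": "https://example.com/a.jpg", "text": "a cat", "nsfw": False},
    # Same caption at a different URL: kept, which is why duplicates remain.
    {"url": "https://example.com/b.jpg", "text": "a cat", "nsfw": False},
    # Flagged record: removed from the "safe" subset.
    {"url": "https://example.com/c.jpg", "text": "flagged", "nsfw": True},
]

result = safe_unique_samples(samples)
print([sample["url"] for sample in result])
```

The third record shows the limitation the authors describe: a URL+text key cannot detect the same image hosted at two addresses, so content-based deduplication would need the image embeddings.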

  4. LAION-400-MILLION OPEN DATASET

    • academictorrents.com
    bittorrent
    Updated Sep 14, 2021
    Cite
    None (2021). LAION-400-MILLION OPEN DATASET [Dataset]. https://academictorrents.com/details/34b94abbcefef5a240358b9acd7920c8b675aacc
    Available download formats: bittorrent (1211103363514)
    Dataset updated
    Sep 14, 2021
    Authors
    None
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

      Concept and Content

    The LAION-400M dataset is completely open and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

      Download Information

    You can find:
    • The CLIP image embeddings (NumPy files)
    • The parquet files
    • KNN index of image embeddings

      LAION-400M Dataset Statistics

    The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th
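
The 0.3 cosine-similarity filter described above can be sketched as follows. This is an illustration only: the 2-dimensional vectors are hypothetical stand-ins for real CLIP embeddings, and `keep_pair` is a helper name introduced here, not part of any LAION tooling.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # threshold stated in the description


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_pair(image_embedding, text_embedding, threshold=SIMILARITY_THRESHOLD):
    """Keep a pair unless its similarity falls below the threshold."""
    return cosine_similarity(image_embedding, text_embedding) >= threshold


# Hypothetical toy embeddings: (image vector, text vector).
pairs = {
    "well matched": ([1.0, 0.0], [0.8, 0.6]),      # similarity 0.8
    "poorly matched": ([1.0, 0.0], [0.1, 0.995]),  # similarity about 0.1
}

kept = [name for name, (image, text) in pairs.items() if keep_pair(image, text)]
print(kept)
```

Raising the threshold gives the stricter CLIP filtering mentioned in other LAION descriptions, at the cost of discarding more pairs.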

  5. freesound-laion-640k-commercial-16khz-full

    • huggingface.co
    Updated Sep 13, 2024
    + more versions
    Cite
    Benjamin Paine (2024). freesound-laion-640k-commercial-16khz-full [Dataset]. https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k-commercial-16khz-full
    Dataset updated
    Sep 13, 2024
    Authors
    Benjamin Paine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About this Repository

    This repository is the training split of the complete FreeSound LAION 640k dataset, limited to licenses that permit commercial works and resampled to 16 kHz using torchaudio.transforms.Resample. This is ideal for use cases where a variety of audio is desired but fidelity and labels are unnecessary, such as background audio for augmenting other datasets.

      Dataset Versions
    

    You are looking at the full dataset which contains 403,146 unique sounds… See the full description on the dataset page: https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k-commercial-16khz-full.

  6. Laion-5b - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Laion-5b - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-5b
    Dataset updated
    Dec 2, 2024
    Description

    A large-scale dataset of text and images for training next-generation language models.

  7. Zhangheng Li, Junyuan Hong, Bo Li, Zhangyang Wang (2024). Dataset: LAION-2B...

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Zhangheng Li, Junyuan Hong, Bo Li, Zhangyang Wang (2024). Dataset: LAION-2B dataset. https://doi.org/10.57702/1dclhs95 [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-2b-dataset
    Dataset updated
    Dec 2, 2024
    Description

    The LAION-2B dataset used in the pre-training of the diffusion model. The dataset consists of 2.17B images, including public and private domains.

  8. LAION-2B

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). LAION-2B [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-2b
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper is LAION-2B, which is a large-scale image-text dataset. The authors fine-tune a pre-trained diffusion model with a subset of LAION-2B with 10k randomly selected samples.

  9. freesound-laion-640k

    • huggingface.co
    Cite
    Benjamin Paine, freesound-laion-640k [Dataset]. https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k
    Authors
    Benjamin Paine
    License

    https://choosealicense.com/licenses/cc/

    Description

    About this Repository

    This repository is a re-upload of the FreeSound.org dataset as curated by LAION for the larger LAION-Audio-630k dataset, with the following changes:

    • Limited columns to only the audio and basic metadata.
    • Incorporated necessary information for licensing and attribution.
    • Removed ambiguously licensed samples, amounting to around 1,000 total samples.

      What about download links?
    

    Links were omitted for the sake of size, as they can be constructed from… See the full description on the dataset page: https://huggingface.co/datasets/benjamin-paine/freesound-laion-640k.

  10. Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu (2024). Dataset:...

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu (2024). Dataset: LAION-5B dataset. https://doi.org/10.57702/y2gme96b [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-5b-dataset
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper is not explicitly described, but it is mentioned that the authors used the LAION-5B dataset for training.

  11. LAION COCO Dataset

    • paperswithcode.com
    Cite
    LAION COCO Dataset [Dataset]. https://paperswithcode.com/dataset/laion-coco
    Description

    LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web images. The images are drawn from the English subset of LAION-5B, and the captions were generated with an ensemble of BLIP L/14 and 2 CLIP versions (L/14 and RN50x64). This dataset allows models to produce high-quality captions for images.

  12. LION

    • catalog.data.gov
    • data.cityofnewyork.us
    • +4more
    Updated May 31, 2025
    + more versions
    Cite
    data.cityofnewyork.us (2025). LION [Dataset]. https://catalog.data.gov/dataset/lion
    Dataset updated
    May 31, 2025
    Dataset provided by
    data.cityofnewyork.us
    Description

    A single line street base map representing the city's streets and other linear geographic features, along with feature names and address ranges for each addressable street segment. This dataset includes the Nodes file. The Nodes file contains a point feature and unique NodeID for each node that exists in the LION file. The Node_StreetName.txt file lists the street names associated with those nodes. Most nodes, representing intersections, will have at least 2 street names associated in the Node_StreetName.txt file. All previously released versions of this data are available at BYTES of the BIG APPLE - Archive.

  13. Stable Diffusion and LAION - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Stable Diffusion and LAION - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/stable-diffusion-and-laion
    Dataset updated
    Dec 2, 2024
    Description

    Stable Diffusion and LAION are used as training datasets for the FakeInversion model.

  14. LAION-Aesthetics V2 6.5+ Dataset

    • paperswithcode.com
    Cite
    LAION-Aesthetics V2 6.5+ Dataset [Dataset]. https://paperswithcode.com/dataset/laion-aesthetics-v2-6-5
    Description

    A subset of the LAION-5B samples with English captions, obtained using LAION-Aesthetics_Predictor V2: 625K image-text pairs with predicted aesthetics scores of 6.5 or higher, available at https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6.5plus

  15. LAION HR

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    Nathan Smith (2022). LAION HR [Dataset]. https://www.kaggle.com/datasets/whatevermcsomething/laion-hr/code
    Dataset updated
    Sep 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nathan Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Nathan Smith

    Released under Attribution 4.0 International (CC BY 4.0)

    Contents

  16. Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu (2024). Dataset:...

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu (2024). Dataset: LAION-Aesthetics 6.5+. https://doi.org/10.57702/zvbnqhl9 [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-aesthetics-6-5-
    Dataset updated
    Dec 2, 2024
    Description

    LAION-Aesthetics 6.5+ dataset contains 625K image-text pairs.

  17. LAION-Aesthetic - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). LAION-Aesthetic - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/laion-aesthetic
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper is LAION-Aesthetic, a large-scale image dataset.

  18. LION Differences File

    • data.amerigeoss.org
    • catalog.data.gov
    • +1more
    zip
    Updated Jul 25, 2019
    + more versions
    Cite
    United States[old] (2019). LION Differences File [Dataset]. https://data.amerigeoss.org/dataset/lion-differences-file
    Available download formats: zip
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    Description

    The LION Differences File (LDF) documents segment and node level changes that have occurred in the LION file between two subsequent releases. This file allows a user who “ties” organizational data to DCP’s Segment ID and/or Node ID to migrate their data appropriately when these changes occur.

  19. laion-5kw-part0

    • kaggle.com
    Updated May 14, 2023
    Cite
    卡皮巴拉 (2023). laion-5kw-part0 [Dataset]. https://www.kaggle.com/datasets/jiojioearth/laion-5kw-part0/suggestions
    Dataset updated
    May 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    卡皮巴拉
    Description

    Dataset

    This dataset was created by 卡皮巴拉

    Contents

  20. laions_got_talent_enhanced_flash_annotations_and_long_captions

    • huggingface.co
    Updated Mar 12, 2025
    + more versions
    Cite
    LAION eV (2025). laions_got_talent_enhanced_flash_annotations_and_long_captions [Dataset]. https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    LAION (https://laion.ai/)
    Authors
    LAION eV
    Description

    laion/laions_got_talent_enhanced_flash_annotations_and_long_captions dataset hosted on Hugging Face and contributed by the HF Datasets community
