100+ datasets found
  1. PASTA Data

    • kaggle.com
    Updated Dec 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Google Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

    We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

    At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" 'instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

    See our paper for a comprehensive description of the rater study.

    Citation

    Please cite our paper if you use it in your work.

  2. Predictive Maintenance Dataset

    • kaggle.com
    Updated Nov 7, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Himanshu Agarwal (2022). Predictive Maintenance Dataset [Dataset]. https://www.kaggle.com/datasets/hiimanshuagarwal/predictive-maintenance-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Himanshu Agarwal
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time based preventive maintenance, because tasks are performed only when warranted.

    The task is to build a predictive model using machine learning to predict the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column you are trying to Predict is called failure with binary value 0 for non-failure and 1 for failure.

  3. Fruits-360 dataset

    • kaggle.com
    • data.mendeley.com
    Updated Jun 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Mihai Oltean
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.06.07.0

    Content

    The following fruits, vegetables and nuts and are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has 5 major branches:

    -The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.

    -The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.

    -The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.

    -The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.

    -The _3_body_problem_ branch where the Training and Test folders contain different (varieties of) the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 138704.

    Training set size: 103993 images.

    Test set size: 34711 images.

    Number of classes: 206 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 58363.

    Training set size: 29222 images.

    Validation set size: 14614 images

    Test set size: 14527 images.

    Number of classes: 90 (fruits, vegetables, nuts and seeds).

    Image size: various (original, captured, size) pixels.

    For the 3-body-problem branch

    Total number of images: 47033.

    Training set size: 34800 images.

    Test set size: 12233 images.

    Number of classes: 3 (Apples, Cherries, Tomatoes).

    Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

    Image size: 100x100 pixels.

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.

    Filename format:

    For the 100x100 branch

    image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

    For the original-size branch

    r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

    For the multi branch

    The file's name is the concatenation of the names of the fruits inside that picture.

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here i...

  4. Student Performance Data Set

    • kaggle.com
    Updated Mar 27, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this Data Set is useful, and upvote is appreciated. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  5. Data from: Spam Email

    • kaggle.com
    Updated Feb 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rhitaza Jana (2022). Spam Email [Dataset]. https://www.kaggle.com/datasets/rhitazajana/spam-email
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Rhitaza Jana
    Description

    Dataset

    This dataset was created by Rhitaza Jana

    Contents

  6. 🫀 Heart Disease Dataset

    • kaggle.com
    Updated Apr 8, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    mexwell (2024). 🫀 Heart Disease Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/heart-disease-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

    • Cleveland
    • Hungarian
    • Switzerland
    • Long Beach VA
    • Statlog (Heart) Data Set.

    This dataset consists of 1190 instances with 11 features. These datasets were collected and combined at one place to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment.

    Acknowlegement

    Foto von Kenny Eliason auf Unsplash

  7. Iris Species

    • kaggle.com
    zip
    Updated Sep 27, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
    Explore at:
    zip(3687 bytes)Available download formats
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

    It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    The columns in this dataset are:

    • Id
    • SepalLengthCm
    • SepalWidthCm
    • PetalLengthCm
    • PetalWidthCm
    • Species

    Sepal Width vs. Sepal Length

  8. AI Vs Human Text

    • kaggle.com
    Updated Jan 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shayan Gerami (2024). AI Vs Human Text [Dataset]. https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Shayan Gerami
    Description

    Around 500K essays are available in this dataset, both created by AI and written by Human.

    I have gathered the data from multiple sources, added them together and removed the duplicates

  9. May 2015 Reddit Comments

    • kaggle.com
    zip
    Updated Jun 4, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaggle (2019). May 2015 Reddit Comments [Dataset]. https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015
    Explore at:
    zip(21429083286 bytes)Available download formats
    Dataset updated
    Jun 4, 2019
    Dataset authored and provided by
    Kagglehttp://kaggle.com/
    License

    https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api

    Description

    Recently Reddit released an enormous dataset containing all ~1.7 billion of their publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so we've decided to host a small portion of the comments here for Kagglers to explore. (You don't even need to leave your browser!)

    You can find all the comments from May 2015 on scripts for your natural language processing pleasure. What had redditors laughing, bickering, and NSFW-ing this spring?

    Who knows? Top visualizations may just end up on Reddit.

    Data Description

    The database has one table, May2015, with the following fields:

    • created_utc
    • ups
    • subreddit_id
    • link_id
    • name
    • score_hidden
    • author_flair_css_class
    • author_flair_text
    • subreddit
    • id
    • removal_reason
    • gilded
    • downs
    • archived
    • author
    • score
    • retrieved_on
    • body
    • distinguished
    • edited
    • controversiality
    • parent_id
  10. Iris dataset

    • kaggle.com
    Updated Jul 20, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Himanshu Nakrani (2022). Iris dataset [Dataset]. https://www.kaggle.com/datasets/himanshunakrani/iris-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Himanshu Nakrani
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    It includes three iris species with 50 samples each as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    FIle name: iris.csv

  11. NSL-KDD

    • kaggle.com
    Updated Mar 16, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kiran (2020). NSL-KDD [Dataset]. https://www.kaggle.com/datasets/kiranmahesh/nslkdd
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 16, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kiran
    Description

    Dataset

    This dataset was created by Kiran

    Contents

  12. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14247 LLM texts: 3004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "**The RDizzl3 Seven**"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1:This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
  13. dictionary

    • kaggle.com
    Updated Dec 10, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jylee (2018). dictionary [Dataset]. https://www.kaggle.com/datasets/jylee4/dictionary
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 10, 2018
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    jylee
    Description

    Dataset

    This dataset was created by jylee

    Contents

  14. Sentinel-1&2 Image Pairs (SAR & Optical)

    • kaggle.com
    Updated Mar 9, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Paritosh Tiwari (2021). Sentinel-1&2 Image Pairs (SAR & Optical) [Dataset]. https://www.kaggle.com/datasets/requiemonk/sentinel12-image-pairs-segregated-by-terrain
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Paritosh Tiwari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset consists of SAR and Optical (RGB) image pairs from Sentinel‑1 and Sentinel‑2 satellites, provided by the Technical University of Munich. Sentinel-1&2 Image Pairs, Michael Schmitt, Technical University of Munich (TUM)

    We searched through images captured during the fall season in the original dataset provided by TUM, and selected images which could belong to each of the four classes: barren land, grassland, agricultural land, and urban areas. Optical images shown in the following sections give an idea of the type of images belonging to each class. We have tried to introduced as much variation as possible when selecting images for a class.

    Data can be used to train a Conditional GAN. Since the images in this dataset are highly complex i.e. they are not regularized and they do not have a neat geometric pattern or orientation, it can also be used to check the robustness of a model, no matter the task.

  15. Medical Text Dataset -Cancer Doc Classification

    • kaggle.com
    Updated Aug 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Falgunipatel19 (2022). Medical Text Dataset -Cancer Doc Classification [Dataset]. https://www.kaggle.com/datasets/falgunipatel19/biomedical-text-publication-classification
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Falgunipatel19
    Description

    For Biomedical text document classification, abstract and full papers(whose length less than or equal to 6 pages) available and used. This dataset focused on long research paper whose page size more than 6 pages. Dataset includes cancer documents to be classified into 3 categories like 'Thyroid_Cancer','Colon_Cancer','Lung_Cancer'. Total publications=7569. it has 3 class labels in dataset. number of samples in each categories: colon cancer=2579, lung cancer=2180, thyroid cancer=2810

  16. Book-Crossing Dataset

    • kaggle.com
    zip
    Updated Sep 7, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    somnambWl (2019). Book-Crossing Dataset [Dataset]. https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset
    Explore at:
    zip(17632108 bytes)Available download formats
    Dataset updated
    Sep 7, 2019
    Authors
    somnambWl
    Description

    Book-Crossing dataset mined by Cai-Nicolas Ziegler

    Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

    • PDF

    • Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

    Further information and the original dataset can be found at the original webpage.

    Changes to the dataset:

    • Location removed as it comes in different formats not in default (city, state, country).
    • Transferred from ISO-8859-1 to UTF-8
    • Manually fixed a few rows with incorrect number of columns

    Note:

    • out of 278859 users:
      • only 99053 rated at least 1 book
      • only 43385 rated at least 2 books.
      • only 12306 rated at least 10 books.
    • out of 271379 books:
      • only 270171 are rated at least once.
      • only 124513 have at least 2 ratings.
      • only 17480 have at least 10 ratings.
  17. Kaggle LLMSE Dataset

    • kaggle.com
    Updated Oct 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Haoquan Fang (2023). Kaggle LLMSE Dataset [Dataset]. https://www.kaggle.com/datasets/hqfang/kaggle-llmse-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Haoquan Fang
    Description

    deberta-billy is trained locally by @hqfang primarily using @radek1's notebook.

    deberta-lora-lindsey is trained locally by @lindseywei using the LoRA technique.

    deberta-openbook-eric-088 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0897 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0916 comes from @yuekaixueirc's dataset.

    54k_with_context_v1.csv was created by dropping duplicates @cdeotte's 60k training data all_12_with_context2.csv in this dataset.

    54k.csv was created by dropping the context column from the 54k_with_context_v1.csv.

    val_with_context_v1.csv was created by adding a context column to @itsuki9180's validation dataset.

  18. Fruits-262

    • kaggle.com
    Updated Dec 15, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mihai Minut (2021). Fruits-262 [Dataset]. https://www.kaggle.com/datasets/aelchimminut/fruits262
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 15, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Mihai Minut
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Fruits-262 dataset: A dataset containing a vast majority of the popular and known fruits

    Versions and Updates:

    The dataset's final version is the one presented here. No expectations for further development of the dataset at this moment.

    If there seem to be any problems with the dataset, its descriptions, its properties or anything at all please contact me and I will help in what ways I can.

    15/12/2021 - SYNASC2021 Paper

    Based on the Bachelor's Thesis, a paper has been built and published at the SYNASC 2021 conference.

    25/05/2021 - Bachelor's Thesis

    The thesis, that had at its core building this fruit dataset and training CNN models on it, is finished and has been added. Information about the tasks and phases through which the dataset has been can also be found within the thesis.

    25/05/2021 - Resized Dataset and Useful Scripts.

    The resized dataset has been uploaded on 5 different dimensions along with some scripts that help in organizing and altering the dataset. The deployment took around 3-4 hours since some errors kept appearing when uploading. Sorry for any inconvenience.

    Content

    The following fruit types/labels/clades are included: abiu, acai, acerola, ackee, alligator apple, ambarella, apple, apricot, araza, avocado, bael, banana, barbadine, barberry, bayberry, beach plum, bearberry, bell pepper, betel nut, bignay, bilimbi, bitter gourd, black berry, black cherry, black currant, black mullberry, black sapote, blueberry, bolwarra, bottle gourd, brazil nut, bread fruit, buddha s hand, buffaloberry, burdekin plum, burmese grape, caimito, camu camu, canistel, cantaloupe, cape gooseberry, carambola, cardon, cashew, cedar bay cherry, cempedak, ceylon gooseberry, che, chenet, cherimoya, cherry, chico, chokeberry, clementine, cloudberry, cluster fig, cocoa bean, coconut, coffee, common buckthorn, corn kernel, cornelian cherry, crab apple, cranberry, crowberry, cupuacu, custard apple, damson, date, desert fig, desert lime, dewberry, dragonfruit, durian, eggplant, elderberry, elephant apple, emblic, entawak, etrog, feijoa, fibrous satinash, fig, finger lime, galia melon, gandaria, genipap, goji, gooseberry, goumi, grape, grapefruit, greengage, grenadilla, guanabana, guarana, guava, guavaberry, hackberry, hard kiwi, hawthorn, hog plum, honeyberry, honeysuckle, horned melon, illawarra plum, indian almond, indian strawberry, ita palm, jaboticaba, jackfruit, jalapeno, jamaica cherry, jambul, japanese raisin, jasmine, jatoba, jocote, jostaberry, jujube, juniper berry, kaffir lime, kahikatea, kakadu plum, keppel, kiwi, kumquat, kundong, kutjera, lablab, langsat, lapsi, lemon, lemon aspen, leucaena, lillipilli, lime, lingonberry, loganberry, longan, loquat, lucuma, lulo, lychee, mabolo, macadamia, malay apple, mamey apple, mandarine, mango, mangosteen, manila tamarind, marang, mayhaw, maypop, medlar, melinjo, melon pear, midyim, miracle fruit, mock strawberry, monkfruit, monstera deliciosa, morinda, mountain papaya, mountain soursop, mundu, muskmelon, myrtle, nance, nannyberry, naranjilla, native cherry, native gooseberry, nectarine, neem, nungu, nutmeg, oil palm, old world sycomore, olive, orange, oregon grape, otaheite apple, papaya, passion fruit, pawpaw, pea, peanut, pear, pequi, persimmon, pigeon plum, pigface, pili nut, pineapple, pineberry, pitomba, plumcot, podocarpus, pomegranate, pomelo, prikly pear, pulasan, pumpkin, pupunha, purple apple berry, quandong, quince, rambutan, rangpur, raspberry, red mulberry, redcurrant, riberry, ridged gourd, rimu, rose hip, rose myrtle, rose-leaf bramble, saguaro, salak, salal, salmonberry, sandpaper fig, santol, sapodilla, saskatoon, sea buckthorn, sea grape, snowberry, soncoya, strawberry, strawberry guava, sugar apple, surinam cherry, sycamore fig, tamarillo, tangelo, tanjong, taxus baccata, tayberry, texas persimmon, thimbleberry, tomato, toyon, ugli fruit, vanilla, velvet tamarind, watermelon, wax gourd, white aspen, white currant, white mulberry, white sapote, wineberry, wongi, yali pear, yellow plum, yuzu, zigzag vine, zucchini

    Dataset properties

    Total number of images: 225,640.

    Number of classes: 262 fruits.

    Number of images per label: Average: 861, Median: 1007, StDev: 276. (Initial target was 1,000 per label)

    Image Width: Average: 213, Median: 209, StDev: 19.

    Image Height: Average: 262, Median: 255, StDev: 30.

    Missing Images from the initial 1,000 target: Average: 580, Median: 567, StDev: 258.

    Format: a directory name represents a label and in each directory all the image data under the said label (the images are numbered but there might be missing numbers. The "renumber.py" script, if run, will fix the number gap problem).

    Different varieties of the same fruit are generally stored in the same directory (Example: green, yellow and red apple).

    The fruit images present in the dataset can contain the fruit in all the stages o...

  19. Breast Cancer Prediction Dataset

    • kaggle.com
    Updated Sep 26, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Merishna Singh Suwal (2018). Breast Cancer Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/merishnasuwal/breast-cancer-prediction-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 26, 2018
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Merishna Singh Suwal
    Description

    Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates.Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.

    This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

  20. ProteinSubcellularLocalization

    • kaggle.com
    Updated Dec 18, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yaoxiang Li (2017). ProteinSubcellularLocalization [Dataset]. https://www.kaggle.com/datasets/lzyacht/proteinsubcellularlocalization
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2017
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Yaoxiang Li
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The proteomics dataset was summarized by the SWISS-PROT database release 42 (2003–2004) by which obtained extracting all animal, fungal and plant protein sequences.

    Content

    The dataset contains 5959 proteins annotated to one of 11 different subcellular locations which are: chloroplast, cytoplasm, endoplasmic reticulum, extracellular space, Golgi apparatus, lysosomal, mitochondrion, nucleus, peroxisome, plasma membrane and vacuole which represented proteins of plants cell and fungal cell while animal cells shared all localizations with them, but have lysosomes instead of vacuoles. The only variable we intend to consider is protein sequence.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
Organization logo

PASTA Data

Data used in paper "Preference Adaptive and Sequential Text-to-Image Generation"

Explore at:
378 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 10, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Google Research
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" 'instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

See our paper for a comprehensive description of the rater study.

Citation

Please cite our paper if you use it in your work.

Search
Clear search
Close search
Google apps
Main menu