Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The following fruits, vegetables and nuts are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).
The dataset has 5 major branches:
-The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.
-The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.
-The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.
-The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.
-The _3_body_problem_ branch, where the Training and Test folders contain different varieties of 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.
Mihai Oltean, Fruits-360 dataset, 2017-
Total number of images: 138704.
Training set size: 103993 images.
Test set size: 34711 images.
Number of classes: 206 (fruits, vegetables, nuts and seeds).
Image size: 100x100 pixels.
Total number of images: 58363.
Training set size: 29222 images.
Validation set size: 14614 images.
Test set size: 14527 images.
Number of classes: 90 (fruits, vegetables, nuts and seeds).
Image size: various (the original size at which each image was captured).
Total number of images: 47033.
Training set size: 34800 images.
Test set size: 12233 images.
Number of classes: 3 (Apples, Cherries, Tomatoes).
Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.
Image size: 100x100 pixels.
Number of classes: 26 (fruits, vegetables, nuts and seeds).
Number of images: 150.
image_index_100.jpg (e.g. 31_100.jpg) or
r_image_index_100.jpg (e.g. r_31_100.jpg) or
r?_image_index_100.jpg (e.g. r2_31_100.jpg)
where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).
Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.
r?_image_index.jpg (e.g. r2_31.jpg)
where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.
The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.
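The naming patterns described above can be parsed mechanically. The following is a minimal sketch, assuming the rotation prefix is either `r` alone or `r` followed by digits, and that the `_100` suffix appears only in the 100x100 branch. The function name `parse_fruit_filename` and the returned field names are illustrative and not part of the dataset.

```python
import re
from typing import Optional

# Pattern inferred from the examples in the text (an assumption, not an
# official specification): optional rotation prefix, numeric image index,
# optional "_100" suffix, ".jpg" extension.
_FILENAME_RE = re.compile(r"^(?:(r\d*)_)?(\d+)(_100)?\.jpg$")


def parse_fruit_filename(name: str) -> Optional[dict]:
    """Split a Fruits-360 style file name into its parts.

    Returns None when the name does not follow the assumed pattern.
    """
    match = _FILENAME_RE.match(name)
    if match is None:
        return None
    rotation, index, suffix = match.groups()
    return {
        "rotation": rotation,  # None, "r", "r2", ...
        "index": int(index),
        # The "_100" suffix marks the 100x100 branch; its absence is
        # taken here to mean the original-size branch.
        "branch": "100x100" if suffix else "original-size",
    }


if __name__ == "__main__":
    for example in ("31_100.jpg", "r_31_100.jpg", "r2_31_100.jpg", "r2_31.jpg"):
        print(example, parse_fruit_filename(example))
```

Names in the multi branch are concatenations of fruit names, so they do not match this pattern and the function returns `None` for them.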
The file's name is the concatenation of the names of the fruits inside that picture.
The Fruits-360 dataset can be downloaded from:
Kaggle https://www.kaggle.com/moltean/fruits
GitHub https://github.com/fruits-360
Fruits and vegetables were mounted on the shaft of a low-speed motor (3 rpm), and a short 20-second video was recorded.
A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.
Behind the fruits, we placed a white sheet of paper as a background.
Here i...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
The task is to build a machine learning model that predicts the probability of a device failure. When building this model, be sure to minimize both false positives and false negatives. The column to predict is called failure and is binary: 0 for non-failure and 1 for failure.
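Since the task asks for both error types to be kept low, it helps to count them separately instead of relying on accuracy alone. The following is a minimal sketch in plain Python; the function names and the example labels are made up for illustration and do not come from the dataset.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def precision_recall(counts: Dict[str, int]) -> Tuple[float, float]:
    """Precision penalizes false positives; recall penalizes false negatives."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels: 1 = failure, 0 = non-failure.
    y_true = [0, 0, 1, 1, 0, 1]
    y_pred = [0, 1, 1, 0, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print(counts, precision_recall(counts))
```

When a model outputs probabilities, `y_pred` would be obtained by thresholding them, and the threshold then controls the trade-off between the two error types.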
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains the human rater trajectories used in the paper "Preference Adaptive and Sequential Text-to-Image Generation".
We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.
At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images they prefer most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.
See our paper for a comprehensive description of the rater study.
Please cite our paper if you use it in your work.
https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. This data addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:
This dataset consists of 1190 instances with 11 features. These datasets were collected and combined at one place to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment.
Photo by Kenny Eliason on Unsplash
https://cdla.io/sharing-1-0/
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247
LLM texts: 3004
See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts
Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv
Car-free cities
Does the electoral college work?
Exploring Venus
The Face on Mars
Facial action coding system
A Cowboy Who Rode the Waves
Driverless cars
How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"
(train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts, namely:
Car-free cities
Does the electoral college work?
Exploring Venus
The Face on Mars
Facial action coding system
Seeking multiple opinions
Phones and driving
This dataset is a derivative of the datasets
as well as the original competition training dataset
This dataset was created by Rhitaza Jana
https://www.reddit.com/wiki/api
Recently Reddit released an enormous dataset containing all ~1.7 billion of their publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so we've decided to host a small portion of the comments here for Kagglers to explore. (You don't even need to leave your browser!)
You can find all the comments from May 2015 on scripts for your natural language processing pleasure. What had redditors laughing, bickering, and NSFW-ing this spring?
Who knows? Top visualizations may just end up on Reddit.
The database has one table, May2015, with the following fields:
https://creativecommons.org/publicdomain/zero/1.0/
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
This dataset was created by jylee
https://creativecommons.org/publicdomain/zero/1.0/
It includes three iris species with 50 samples each as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
File name: iris.csv
This dataset was created by surajmishra
deberta-billy was trained locally by @hqfang, primarily using @radek1's notebook.
deberta-lora-lindsey was trained locally by @lindseywei using the LoRA technique.
deberta-openbook-eric-088 comes from @yuekaixueirc's dataset.
deberta-openbook-eric-0897 comes from @yuekaixueirc's dataset.
deberta-openbook-eric-0916 comes from @yuekaixueirc's dataset.
54k_with_context_v1.csv was created by dropping duplicates from @cdeotte's 60k training data all_12_with_context2.csv in this dataset.
54k.csv was created by dropping the context column from 54k_with_context_v1.csv.
val_with_context_v1.csv was created by adding a context column to @itsuki9180's validation dataset.
This dataset was created by Kiran
This dataset contains around 500K essays, both AI-generated and human-written.
I gathered the data from multiple sources, merged them and removed the duplicates.
https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains a diverse range of examples, including classification, regression, clustering, and dimensionality reduction problems, with varying levels of complexity and varying numbers of features. Each dataset comes with a detailed description of the problem and the corresponding features, making it easy to understand and work with. Additionally, the dataset provides an opportunity for machine learning enthusiasts to experiment with different SkLearn algorithms and evaluate their performance on different datasets. This dataset is perfect for both beginners and advanced practitioners looking to hone their skills in various machine learning techniques.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
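The two-level layout described above amounts to integer division of the version id. The following is a minimal sketch; the function name `kernel_version_folders` is hypothetical, and it returns the folder numbers as integers because the description does not say whether folder names are zero-padded.

```python
from typing import Tuple


def kernel_version_folders(version_id: int) -> Tuple[int, int]:
    """Return (top_level, sub_level) folder numbers for a KernelVersions id.

    The top-level folder groups ids by millions, the sub folder by
    thousands within that million.
    """
    if version_id < 0:
        raise ValueError("version_id must be non-negative")
    top_level = version_id // 1_000_000
    sub_level = (version_id // 1_000) % 1_000
    return top_level, sub_level


if __name__ == "__main__":
    # Matches the example in the text: 123/456 holds 123,456,000 to 123,456,999.
    print(kernel_version_folders(123_456_789))
```

How the two numbers are turned into an actual path (separator, padding, file extension) should be checked against the files themselves.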
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.
The dataset contains more than 28,000 essays, both student-written and AI-generated.
Features: 1. text: the essay text. 2. generated: the target label (0 = human-written essay, 1 = AI-generated essay).
For biomedical text document classification, abstracts and full papers of at most 6 pages are available and used. This dataset focuses on long research papers of more than 6 pages. It includes cancer documents to be classified into 3 categories: 'Thyroid_Cancer', 'Colon_Cancer' and 'Lung_Cancer'. Total publications: 7569. Number of samples in each category: colon cancer = 2579, lung cancer = 2180, thyroid cancer = 2810.
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.
This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.