https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. This data addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
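The note about G1 and G2 suggests two modeling setups: one that keeps the period grades as inputs and a harder one that drops them. The following is a minimal sketch of that split, assuming each student record is a plain dict keyed by attribute name. The helper names, the sample record, and the pass mark of 10 are illustrative assumptions; only G1, G2 and G3 come from the description above.

```python
# Hypothetical helper names; only G1, G2 and G3 come from the dataset description.
PERIOD_GRADES = ("G1", "G2")
TARGET = "G3"
PASS_MARK = 10  # assumption: grades run 0-20 and 10 is the pass threshold


def build_features(record, use_period_grades=True):
    """Return (features, target) for one student record given as a dict."""
    features = {
        key: value
        for key, value in record.items()
        if key != TARGET and (use_period_grades or key not in PERIOD_GRADES)
    }
    return features, record[TARGET]


def to_binary(g3, pass_mark=PASS_MARK):
    """Map a final grade to a binary pass/fail target (1 = pass)."""
    return int(g3 >= pass_mark)


# Made-up record for illustration only, not a row from the dataset.
student = {"age": 17, "studytime": 2, "G1": 11, "G2": 12, "G3": 12}
easy_x, y = build_features(student)                           # keeps G1 and G2
hard_x, _ = build_features(student, use_period_grades=False)  # drops G1 and G2
```

Training the same model on both feature sets is one way to see how much of the predictive signal comes from the earlier period grades.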
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The following fruits, vegetables and nuts are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Gooseberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).
The dataset has 5 major branches:
-The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.
-The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.
-The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.
-The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.
-The _3_body_problem_ branch where the Training and Test folders contain different (varieties of) the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.
Mihai Oltean, Fruits-360 dataset, 2017-
100x100 branch:
Total number of images: 138704.
Training set size: 103993 images.
Test set size: 34711 images.
Number of classes: 206 (fruits, vegetables, nuts and seeds).
Image size: 100x100 pixels.
Original-size branch:
Total number of images: 58363.
Training set size: 29222 images.
Validation set size: 14614 images.
Test set size: 14527 images.
Number of classes: 90 (fruits, vegetables, nuts and seeds).
Image size: various (original, captured, size) pixels.
3-body-problem branch:
Total number of images: 47033.
Training set size: 34800 images.
Test set size: 12233 images.
Number of classes: 3 (Apples, Cherries, Tomatoes).
Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.
Image size: 100x100 pixels.
Multi branch:
Number of classes: 26 (fruits, vegetables, nuts and seeds).
Number of images: 150.
In the 100x100 branch, image file names have one of the following formats:
image_index_100.jpg (e.g. 31_100.jpg) or
r_image_index_100.jpg (e.g. r_31_100.jpg) or
r?_image_index_100.jpg (e.g. r2_31_100.jpg)
where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).
Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.
In the original-size branch, image file names have the format:
r?_image_index.jpg (e.g. r2_31.jpg)
where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.
The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.
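The naming rules above can be checked mechanically. Below is a minimal sketch of a parser for both naming schemes, assuming the rotation prefix is either "r" or "r" followed by one digit and that the index is a plain integer. The function name and the returned keys are illustrative, not part of the dataset.

```python
import re

# Pattern inferred from the naming rules above: an optional "r" or "r<digit>"
# prefix, an image index, and an optional "_100" suffix (100x100 images only).
FILENAME_RE = re.compile(r"^(?:(r\d?)_)?(\d+)(_100)?\.jpg$")


def parse_image_name(name):
    """Split a Fruits-360 style file name into its parts (hypothetical helper)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    rotation, index, suffix = match.groups()
    return {
        "rotation": rotation,  # None, "r", "r2", ...
        "index": int(index),
        # The "_100" suffix marks a 100x100 image; its absence marks original size.
        "branch": "100x100" if suffix else "original-size",
    }
```

For example, `parse_image_name("r2_31_100.jpg")` reports rotation "r2", index 31 and the 100x100 branch, while `parse_image_name("r2_31.jpg")` reports the original-size branch.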
In the multi branch, the file's name is the concatenation of the names of the fruits inside that picture.
The Fruits-360 dataset can be downloaded from:
Kaggle https://www.kaggle.com/moltean/fruits
GitHub https://github.com/fruits-360
Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.
A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.
Behind the fruits, we placed a white sheet of paper as a background.
Here i...
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
The task is to build a machine learning model that predicts the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column to predict is called failure, with binary value 0 for non-failure and 1 for failure.
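Since the task asks to minimize both false positives and false negatives, it helps to count them explicitly rather than rely on accuracy alone. The following is a minimal sketch in plain Python, assuming predictions have already been thresholded to 0/1. The function names and the toy labels are illustrative, not taken from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for binary labels where 1 means failure."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1  # false alarm: maintenance scheduled needlessly
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1  # missed failure
    return tp, fp, tn, fn


def precision_recall(y_true, y_pred):
    """Precision penalizes false positives; recall penalizes false negatives."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
```

Reporting precision and recall together makes the trade-off between unnecessary maintenance and missed failures visible.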
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains human rater trajectories used in the paper "Preference Adaptive and Sequential Text-to-Image Generation".
We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.
At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images they prefer most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.
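The protocol constraints described above (five turns, four columns, 12-word limits) can be expressed as a small validity check. This is only a sketch: the field names `chosen_column` and `critique` are hypothetical and do not reflect the dataset's actual schema, which should be taken from the paper or the data files.

```python
MAX_WORDS = 12    # limit for the initial prompt and for each critique
NUM_TURNS = 5     # each rater interacts for five turns
NUM_COLUMNS = 4   # 16 images per turn, arranged in four columns


def within_word_limit(text, limit=MAX_WORDS):
    """True if the text has at most `limit` whitespace-separated words."""
    return len(text.split()) <= limit


def validate_session(initial_prompt, turns):
    """Check one session against the protocol described above.

    `turns` is a list of dicts with a 'chosen_column' (0-3) and an
    optional 'critique' string; both keys are assumed names.
    """
    if not within_word_limit(initial_prompt):
        return False
    if len(turns) != NUM_TURNS:
        return False
    for turn in turns:
        if turn["chosen_column"] not in range(NUM_COLUMNS):
            return False
        critique = turn.get("critique")
        if critique is not None and not within_word_limit(critique):
            return False
    return True
```

A check like this is mainly useful as a sanity filter before analysis, since sessions that violate the protocol may need separate handling.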
See our paper for a comprehensive description of the rater study.
Please cite our paper if you use it in your work.
https://cdla.io/sharing-1-0/
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247
LLM texts: 3004
See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts
Version 3: "The RDizzl3 Seven"
File: train_essays_RDizzl3_seven_v1.csv
Car-free cities
Does the electoral college work?
Exploring Venus
The Face on Mars
Facial action coding system
A Cowboy Who Rode the Waves
Driverless cars
How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"
train_essays_7_prompts_v2.csv
This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts, namely:
Car-free cities
Does the electoral college work?
Exploring Venus
The Face on Mars
Facial action coding system
Seeking multiple opinions
Phones and driving
This dataset is a derivative of the datasets
as well as the original competition training dataset
Around 500K essays are available in this dataset, both AI-generated and human-written.
I gathered the data from multiple sources, combined them, and removed the duplicates.
http://www.gnu.org/licenses/lgpl-3.0.html
The dataset was introduced by the following researchers: E. C. P. Neto, S. Dadkhah, R. Ferreira, A. Zohourian, R. Lu, A. A. Ghorbani. "CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment," Sensor (2023) (submitted to Journal of Sensors). The data contains different kinds of IoT intrusions, in the following categories: DDoS, Brute Force, Spoofing, DoS, Recon, Web-based, Mirai.
Each intrusion category has several subcategories in the data. The dataset contains 1191264 network intrusion instances, each with 47 features. It can be used to build a predictive model that detects different kinds of intrusive attacks. The data is also suitable for designing an IDS.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:
This dataset consists of 1190 instances with 11 features. These datasets were collected and combined at one place to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment.
Photo by Kenny Eliason on Unsplash
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
About
This dataset provides insights into user behavior and online advertising, specifically focusing on predicting whether a user will click on an online advertisement. It contains user demographic information, browsing habits, and details related to the display of the advertisement. This dataset is ideal for building binary classification models to predict user interactions with online ads.
Features
Goal
The objective of this dataset is to predict whether a user will click on an online ad based on their demographics, browsing behavior, the context of the ad's display, and the time of day. You will need to clean the data, understand it, and then apply machine learning models to make and evaluate predictions. This is a genuinely challenging task for this kind of data. The data can be used to improve ad targeting strategies, optimize ad placement, and better understand user interaction with online advertisements.
https://creativecommons.org/publicdomain/zero/1.0/
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
About 10000 texts rewritten using Gemma 7b-it. The original texts come from the column "Support" in the file train.csv of the SciQ (Scientific Question Answering) dataset.
If you find it useful, upvote it.
https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset for classifying different classes of traffic signs. There are around 58 classes, and each class has around 120 images. The labels.csv file has the description of each traffic sign class. You can change the assignment of these class IDs to descriptions. A basic CNN model is enough to get decent validation accuracy. There are around 2000 files for testing.
You can view the notebook named official in the code section to train and test a basic CNN model.
Please upvote the notebook and dataset if you like them.
Based on information released by the White House, with detailed figures on trade between the US and other countries. You will find the relevant information for each country, including Exports, Imports and Deficit (or surplus).
Version 2 includes population (if data is available). Figures gathered from https://datahub.io/core/population
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.
This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
https://creativecommons.org/publicdomain/zero/1.0/
It includes three iris species with 50 samples each as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
File name: iris.csv
This dataset was created by surajmishra
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset requires a lot of preprocessing and rewards it with useful EDA insights for a company. It consists of sales and profit data sorted by market segment and country/region.
Tips for pre-processing:
1. Check the column names first; some errors are already visible there.
2. Remove the '$' sign and '-' from all columns where they are present.
3. Change the datatype from object to int after the above two steps.
4. Challenge: try removing the ',' (comma) from all numeric values.
5. Try plotting sales and profit over time.
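The cleaning tips above can be sketched with plain string handling. This is a hedged example, not the dataset author's method: it assumes that column-name errors are stray spaces, that a lone '-' stands for an empty or zero amount, and that amounts may carry decimals (so it converts to float rather than int). The function names and sample strings are illustrative.

```python
def clean_column_name(name):
    """Strip stray whitespace from a column name (assumed to be the error)."""
    return name.strip()


def parse_amount(raw):
    """Convert a currency string such as ' $1,234.50 ' to a float."""
    text = raw.strip().replace("$", "").replace(",", "").strip()
    if text in ("", "-"):
        return 0.0  # assumption: a lone dash marks an empty or zero amount
    return float(text)


# Made-up values for illustration only.
header = [" Sales ", " Profit "]
row = [" $1,234.50 ", " $-   "]
cleaned = dict(zip(map(clean_column_name, header), map(parse_amount, row)))
```

If the real file uses another convention for missing or negative values, `parse_amount` is the place to adjust it.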
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.
We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have included several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.
For biomedical text document classification, abstracts and full papers of 6 pages or fewer are available and used. This dataset focuses on long research papers of more than 6 pages. It contains cancer documents to be classified into 3 categories: 'Thyroid_Cancer', 'Colon_Cancer' and 'Lung_Cancer'. Total publications: 7569. Number of samples in each category: colon cancer = 2579, lung cancer = 2180, thyroid cancer = 2810.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.
The dataset contains more than 28,000 essays, both student-written and AI-generated.
Features:
1. text: the essay text
2. generated: the target label (0 = human-written essay, 1 = AI-generated essay)