Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
If this Data Set is useful, and upvote is appreciated. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Facebook
TwitterThis dataset was created by Sukhmani Bedi
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset can be used to train an Open Book model for Kaggle's LLM Science Exam competition. This dataset was generated by searching and concatenating all publicly shared datasets on Sept 1 2023.
The context column was generated using Mgoksu's notebook here with NUM_TITLES=5 and NUM_SENTENCES=20
The source column indicates where the dataset originated. Below are the sources:
source = 1 & 2 * Radek's 6.5k dataset. Discussion here annd here, dataset here.
source = 3 & 4 * Radek's 15k + 5.9k. Discussion here and here, dataset here
source = 5 & 6 * Radek's 6k + 6k. Discussion here and here, dataset here
source = 7 * Leonid's 1k. Discussion here, dataset here
source = 8 * Gigkpeaeums 3k. Discussion here, dataset here
source = 9 * Anil 3.4k. Discussion here, dataset here
source = 10, 11, 12 * Mgoksu 13k. Discussion here, dataset here
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(WELFake) is a dataset of 72,134 news articles with 35,028 real and 37,106 fake news. For this, authors merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.
Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).
There are 78098 data entries in csv file out of which only 72134 entries are accessed as per the data frame.
Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
About this file The Kaggle Global Superstore dataset is a comprehensive dataset containing information about sales and orders in a global superstore. It is a valuable resource for data analysis and visualization tasks. This dataset has been processed and transformed from its original format (txt) to CSV using the R programming language. The original dataset is available here, and the transformed CSV file used in this analysis can be found here.
Here is a description of the columns in the dataset:
category: The category of products sold in the superstore.
city: The city where the order was placed.
country: The country in which the superstore is located.
customer_id: A unique identifier for each customer.
customer_name: The name of the customer who placed the order.
discount: The discount applied to the order.
market: The market or region where the superstore operates.
ji_lu_shu: An unknown or unspecified column.
order_date: The date when the order was placed.
order_id: A unique identifier for each order.
order_priority: The priority level of the order.
product_id: A unique identifier for each product.
product_name: The name of the product.
profit: The profit generated from the order.
quantity: The quantity of products ordered.
region: The region where the order was placed.
row_id: A unique identifier for each row in the dataset.
sales: The total sales amount for the order.
segment: The customer segment (e.g., consumer, corporate, or home office).
ship_date: The date when the order was shipped.
ship_mode: The shipping mode used for the order.
shipping_cost: The cost of shipping for the order.
state: The state or region within the country.
sub_category: The sub-category of products within the main category.
year: The year in which the order was placed.
market2: Another column related to market information.
weeknum: The week number when the order was placed.
This dataset can be used for various data analysis tasks, including understanding sales patterns, customer behavior, and profitability in the context of a global superstore.
Facebook
TwitterThis dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.
Fields include
Rank - Ranking of overall sales
Name - The games name
Platform - Platform of the games release (i.e. PC,PS4, etc.)
Year - Year of the game's release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales.
The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup using Python. There are 16,598 records. 2 records were dropped due to incomplete information.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
About Dataset
This dataset provides information about various medical conditions such as Cancer, Pneumonia, and Diabetic based on demographic, lifestyle, and health-related features. It contains randomly generated user data, including multiple missing values, making it suitable for handling imbalanced classification tasks and missing data problems.
Features
Goal
The objective of this dataset is to predict the medical condition (Cancer, Pneumonia, Diabetic) of a user based on their demographic, lifestyle, and health-related features. This dataset can be used to explore strategies for dealing with imbalanced classes and missing data in healthcare applications.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop a ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We employed Weka software to swiftly visualize the data and comprehend the relationships between the features, without requiring any programming expertise.
My Problem statement is Material Selection for EV Chassis. So, if you have any specific ideas, be sure to implement them and add the codes on Kaggle.
A Detailed Research Paper is available on https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The official Meta-Kaggle dataset contains the Users.csv file which contains Username, DisplayName, RegisterDate, and PerformanceTier fields but doesn't contain location data of the Kaggle Users. This dataset augments that data with additional country and region information.
I haven't included the username and displayname values on purpose, just the userid to be joined back to the Meta-Kaggle official Users.csv file.
It is possible that some users haven't inputted their details when the scraper went through their accounts and thus have missing data. Another possibility is that users may have updated their info after the scraper went through their accounts, thus resulting in inconsistencies.
Facebook
TwitterOpen Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the Project -
Pandas
Numpy
Matplotlib
Seaborn
Sci-kit Learn
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was not finished, had some incorrect content, just for test. This dataset has SQL injection, XSS and some bash script mixed in string, but not finish.
Facebook
TwitterThis dataset was created by Sanket Rai
Facebook
TwitterOpen Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was created by jruvika
Released under Database: Open Database, Contents: © Original Authors
Facebook
TwitterThis dataset was created by Dhia Znaidi
Facebook
TwitterTypically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Image from stocksnap.io.
Analyses for this dataset could include time series, clustering, classification and more.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset is about list of dataset provided by UCI ML , If you are a learner and want some data on the basis of year ,categories, profession or some other criteria you search it from here.
There are 8 rows in the dataset in which all details are given. --link --Data-Name --data type --default task --attribute-type --instances --attributes --year
Some missing values are present there also,
You can analyse the as per your requirement
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The Intelligent Manufacturing Dataset for Predictive Optimization is a dataset designed for research in smart manufacturing, AI-driven process optimization, and predictive maintenance. It simulates real-time sensor data from industrial machines, incorporating 6G network slicing for enhanced communication and resource allocation.
Key Features: ✔ Industrial IoT Sensor Data – Temperature, vibration, power consumption, etc. ✔ 6G Network Performance Metrics – Latency, packet loss, and communication efficiency. ✔ Production Efficiency Indicators – Defect rate, predictive maintenance score, error rate. ✔ Target Column (Efficiency_Status) – Classifies manufacturing efficiency as High, Medium, or Low based on performance metrics.
Applications: 🔹 AI-based predictive maintenance 🔹 Resource allocation optimization in 6G-enabled smart factories 🔹 Real-time anomaly detection in industrial production 🔹 Deep learning model training for intelligent manufacturing systems
This dataset serves as a benchmark for AI and deep learning applications in Industry 4.0 and 6G network-integrated manufacturing systems.
Facebook
TwitterThis dataset is a copy of the images in the dataset at the link: Alzheimer's Dataset (4 class of Images).
The original dataset contained MRI images of 32 horizontal slices of the brain divided into 4 classes: - Mild Dementia - Moderate Dementia - Non Dementia - Very Mild Dementia
For each classes there were a different number of subjects: - 28 subjects for the Mild Dementia Class - 2 subjects for the Moderate Dementia Class - 100 subjects for the Non Dementia Class - 70 subjects for the Very Mild Dementia Class
The problem of the original dataset was that the train and the test sets contained different slices of the brain because the images of the dataset were ordered by the position of the slice and the train/test set division was performed by putting the first percentage of images in the train set and the last ones in the test set.
In this dataset the original train and test set have been united and the images have been divided between train, test and validation set randomly.
Facebook
Twitterhttp://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
This supply chain analysis provides a comprehensive view of the company's order and distribution processes, allowing for in-depth analysis and optimization of various aspects of the supply chain, from procurement and inventory management to sales and customer satisfaction. It empowers the company to make data-driven decisions to improve efficiency, reduce costs, and enhance customer experiences. The provided supply chain analysis dataset contains various columns that capture important information related to the company's order and distribution processes:
• OrderNumber • Sales Channel • WarehouseCode • ProcuredDate • CurrencyCode • OrderDate • ShipDate • DeliveryDate • SalesTeamID • CustomerID • StoreID • ProductID • Order Quantity • Discount Applied • Unit Cost • Unit Price
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle