43 datasets found
  1. MoA Feature Importance with Rapids

    • kaggle.com
    zip
    Updated Nov 11, 2020
    Cite
    Loulou (2020). MoA Feature Importance with Rapids [Dataset]. https://www.kaggle.com/louise2001/moa-feat-importance-rapids
    Available download formats: zip (1630369 bytes)
    Dataset updated
    Nov 11, 2020
    Authors
    Loulou
    Description

    This dataset was created by Loulou.

  2. Zieni dataset for Phishing detection

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Sep 3, 2024
    + more versions
    Cite
    Rasha Zieni (2024). Zieni dataset for Phishing detection [Dataset]. https://www.kaggle.com/datasets/rashazieni/zieni-dataset/code
    Available download formats: zip (129009 bytes)
    Dataset updated
    Sep 3, 2024
    Authors
    Rasha Zieni
    Description

    This dataset was used for training machine learning models to detect phishing attacks and for studying the explainability of these models. It was published in 2024. The dataset refers to phishing and legitimate websites. Phishing samples were collected from two sources, namely PhishTank and Tranco, whereas legitimate samples were collected from Alexa. The dataset is balanced and contains 5,000 phishing and 5,000 legitimate samples, each described by 74 features extracted from the entire URL as well as from the Fully Qualified Domain Name, pathname, filename, and parameters. Of these features, 70 are numerical and four are binary. The target variable is also binary.
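Given the balanced 5,000/5,000 composition, a stratified split keeps both classes evenly represented in train and test sets. A minimal sketch with synthetic stand-in data (shapes follow the description above; the actual feature values are not reproduced here):

```python
# Sketch: stratified split for a balanced phishing dataset.
# The 10,000 x 74 shape (70 numerical + 4 binary features) comes from the
# description above; the values themselves are random stand-ins.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(10_000, 70)),          # 70 numerical features
               rng.integers(0, 2, size=(10_000, 4))])  # 4 binary features
y = np.array([1] * 5_000 + [0] * 5_000)                # balanced binary target

# Stratify so both classes stay balanced in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_tr.mean(), y_te.mean())  # both stay at 0.5
```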

    Reference

    Calzarossa, M., Giudici, P., Zieni, R.: Explainable machine learning for phishing feature detection. Quality and Reliability Engineering International 40, 362–373 (2024).

    Cite this dataset

    Zieni, Rasha (2024), “Zieni dataset for Phishing detection”, Mendeley Data, V1, doi: 10.17632/8mcz8jsgnb.1

  3. UCI Heart Disease - Explainable AI Project Assets

    • kaggle.com
    zip
    Updated Nov 18, 2025
    Cite
    Ariyan_Pro (2025). UCI Heart Disease - Explainable AI Project Assets [Dataset]. https://www.kaggle.com/datasets/ariyannadeem/uci-heart-disease-explainable-ai-project-assets
    Available download formats: zip (1051043 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Ariyan_Pro
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Medical-Grade Explainable AI Project Assets

    This dataset contains comprehensive assets for a production-ready Explainable AI (XAI) heart disease prediction system achieving 94.1% accuracy with full model transparency.

    📊 CONTEXT: Healthcare AI faces a critical "black box" problem where models make predictions without explanations. This project demonstrates how to build trustworthy medical AI using SHAP and LIME for real-time explainability.

    🎯 PROJECT GOAL: Create a clinically deployable AI system that not only predicts heart disease with high accuracy but also provides interpretable explanations for each prediction, enabling doctor-AI collaboration.

    🚀 KEY FEATURES:

    • 94.1% prediction accuracy (XGBoost + Optuna)
    • Real-time SHAP & LIME explanations
    • FastAPI backend with medical validation
    • Gradio clinical dashboard
    • Full MLOps pipeline (MLflow tracking)
    • 4-layer enterprise architecture

    📁 ASSETS INCLUDED:

    • heart_clean.csv: clinical dataset ready for analysis
    • SHAP summary plots for global explainability
    • Performance metrics and visualizations
    • Architecture diagrams
    • Model evaluation results

    🔗 COMPANION RESOURCES:

    • Live Demo: https://huggingface.co/spaces/Ariyan-Pro/HeartDisease-Predictor
    • Notebook: https://www.kaggle.com/code/ariyannadeem/heart-disease-prediction-with-explainable-ai
    • Source Code: https://github.com/Ariyan-Pro/ExplainableAI-HeartDisease

    Perfect for learning medical AI implementation, explainable AI techniques, and production deployment.
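As a lightweight illustration of the global-explainability idea behind the SHAP summary plots mentioned above, here is a sketch using scikit-learn's permutation importance on synthetic heart-like data. The feature names and the risk rule are assumptions for illustration, not the actual dataset or model:

```python
# Sketch: global feature importance via permutation importance, a lighter-weight
# stand-in for the SHAP summary plots described above.
# Feature names (age, chol, max_hr) and the risk rule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1500
age = rng.normal(55, 9, n)
chol = rng.normal(240, 40, n)      # pure noise in this toy setup
max_hr = rng.normal(150, 20, n)
# Assumed rule so that age and max_hr carry real signal.
risk = 0.05 * (age - 55) - 0.04 * (max_hr - 150) + rng.normal(0, 0.5, n)
y = (risk > 0).astype(int)
X = np.column_stack([age, chol, max_hr])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Permute each feature on held-out data and measure the accuracy drop.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, score in zip(["age", "chol", "max_hr"], imp.importances_mean):
    print(f"{name}: {score:.3f}")
```

The noise feature (chol here) should score near zero, while the informative features dominate, which is the same qualitative reading a SHAP summary plot provides.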

  4. Financial Transactions Dataset for Fraud Detection

    • kaggle.com
    zip
    Updated May 2, 2025
    Cite
    Aryan Kumar (2025). Financial Transactions Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection
    Available download formats: zip (290256858 bytes)
    Dataset updated
    May 2, 2025
    Authors
    Aryan Kumar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 5 million synthetically generated financial transactions designed to simulate real-world behavior for fraud detection research and machine learning applications. Each transaction record includes fields such as:

    Transaction Details: ID, timestamp, sender/receiver accounts, amount, type (deposit, transfer, etc.)

    Behavioral Features: time since last transaction, spending deviation score, velocity score, geo-anomaly score

    Metadata: location, device used, payment channel, IP address, device hash

    Fraud Indicators: binary fraud label (is_fraud) and type of fraud (e.g., money laundering, account takeover)

    The dataset follows realistic fraud patterns and behavioral anomalies, making it suitable for:

    Binary and multiclass classification models

    Fraud detection systems

    Time-series anomaly detection

    Feature engineering and model explainability
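As an illustration of how a behavioral feature like "time since last transaction" can be derived from raw rows, here is a pandas sketch; the column names are assumptions based on the description above:

```python
# Sketch: deriving the "time since last transaction" behavioral feature per
# account with pandas. Column names are assumptions, values are made up.
import pandas as pd

tx = pd.DataFrame({
    "sender_account": ["A", "A", "B", "A", "B"],
    "timestamp": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:05",
        "2025-01-01 11:00", "2025-01-02 09:00", "2025-01-01 11:30",
    ]),
})
tx = tx.sort_values(["sender_account", "timestamp"])
# Seconds since the same account's previous transaction (NaN for the first one).
tx["secs_since_last"] = (
    tx.groupby("sender_account")["timestamp"].diff().dt.total_seconds()
)
print(tx)
```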

  5. Student Performance Factors Dataset

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    Mosap Abdel-Ghany (2025). Student Performance Factors Dataset [Dataset]. https://www.kaggle.com/datasets/mosapabdelghany/student-performance-factors-dataset
    Available download formats: zip (96178 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    Mosap Abdel-Ghany
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset contains data on 6,607 students and the factors influencing their academic performance. It’s designed to help researchers, educators, and data scientists analyze how habits, environment, and background affect exam scores.

    You can use this dataset for:

    • Predictive modeling of student success
    • Feature importance and correlation studies
    • Machine learning projects on education analytics
    • Educational policy or intervention analysis

    The dataset includes demographic, behavioral, and academic variables such as study hours, attendance, parental involvement, and more.

    Target Variable: Exam_Score

    Example research ideas:

    • What is the most influential factor affecting student performance?
    • Can machine learning accurately predict academic success?
    • How do socioeconomic and behavioral factors interact in education?

    Data source: Synthetic data generated for research and educational purposes.
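A minimal predictive-modeling sketch for a dataset like this, using two assumed predictors and a synthetic generating rule (the real column names and relationships may differ):

```python
# Sketch: predicting an exam score from study habits, as suggested above.
# The columns (hours, attendance) and the generating rule are illustrative
# assumptions, not the dataset's actual schema.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500
hours = rng.uniform(0, 30, n)          # weekly study hours
attendance = rng.uniform(50, 100, n)   # attendance percentage
exam_score = 40 + 1.2 * hours + 0.3 * attendance + rng.normal(0, 3, n)

X = np.column_stack([hours, attendance])
model = LinearRegression().fit(X, exam_score)
print(model.coef_)                 # should recover roughly [1.2, 0.3]
print(model.score(X, exam_score))  # R² should be high, since noise is small
```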

  6. Credit Risk Benchmark Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Adil Shamim (2025). Credit Risk Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/credit-risk-benchmark-dataset
    Available download formats: zip (316073 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This dataset has been designed as a benchmark for AutoML and predictive modeling in the financial domain. It focuses on assessing credit risk by predicting whether a borrower will experience serious delinquency within two years. The data comprises a mix of financial metrics and personal attributes, which allow users to build and evaluate models for credit risk scoring.

    Dataset Characteristics:

    • Total Features: 10 predictors and 1 target variable.
    • Data Types: All predictors are numerical (real numbers) while the target variable is binary ({0, 1}).
    • Task: Binary classification focused on credit risk prediction.

    Column Descriptions:
    Below is a list of the available columns along with their abbreviated names for ease-of-use:

    • rev_util: Ratio of revolving credit utilization (balance/credit limit)
    • age: Age of the borrower
    • late_30_59: Number of times 30-59 days past due (worse than current)
    • debt_ratio: Debt to income (or assets) ratio
    • monthly_inc: Monthly income of the borrower
    • open_credit: Number of open credit lines and loans
    • late_90: Number of times 90 days or more late on a payment
    • real_estate: Number of real estate loans or credit lines
    • late_60_89: Number of times 60-89 days past due (worse than current)
    • dependents: Number of dependents
    • dlq_2yrs: Target variable indicating if a serious delinquency occurred in the next 2 years (0 = No, 1 = Yes)

    Use Cases and Applications:
    • Risk Management: Build and validate credit scoring models to forecast borrower default risks.
    • AutoML Benchmarking: Evaluate and compare the performance of various AutoML frameworks on a structured financial dataset.
    • Academic Research: Explore trends and relationships in credit behavior, along with the predictive power of financial indicators.
    • Model Interpretability: Given the regulated nature of financial models, this dataset provides an excellent context for testing feature importance and creating explainable AI solutions.

    Additional Information:
    - Preprocessing & Feature Engineering: Users are encouraged to perform exploratory data analysis, handle potential missing values or outliers, and experiment with scaling techniques and feature transformations.
    - Regulatory Considerations: Since credit scoring models often require transparency, it’s important to incorporate techniques that ensure model interpretability.
    - Benchmarking: Ideal for comparing traditional modeling techniques (like logistic regression) with modern approaches (such as gradient boosting and neural networks).

    This dataset is now available on Kaggle for anyone looking to experiment with or benchmark predictive models for credit risk analysis. Whether you're a data scientist, researcher, or financial analyst, the dataset provides a straightforward yet robust framework for exploring credit-related behavior and risk factors.
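A baseline sketch for the binary task described above, fitting a logistic regression over the listed columns. The data here is synthetic and the delinquency rule is an assumption; only the column names come from the description:

```python
# Sketch: baseline logistic-regression credit-risk model over the documented
# columns. Distributions and the delinquency rule are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "rev_util": rng.beta(2, 5, n),
    "age": rng.integers(21, 80, n),
    "late_30_59": rng.poisson(0.3, n),
    "debt_ratio": rng.gamma(2.0, 0.2, n),
    "monthly_inc": rng.lognormal(8.5, 0.5, n),
    "open_credit": rng.poisson(8, n),
    "late_90": rng.poisson(0.1, n),
    "real_estate": rng.poisson(1, n),
    "late_60_89": rng.poisson(0.2, n),
    "dependents": rng.poisson(0.8, n),
})
# Assumed rule: delinquency risk rises with utilization and past lateness.
logit = -2.5 + 2.0 * df["rev_util"] + 0.8 * df["late_30_59"] + 1.5 * df["late_90"]
df["dlq_2yrs"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X, y = df.drop(columns="dlq_2yrs"), df["dlq_2yrs"]
# Scaling matters for regularized logistic regression on mixed-scale features.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```

Logistic regression is a natural first benchmark here precisely because of the interpretability requirements the description emphasizes.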

  7. Spam Detection Dataset

    • kaggle.com
    zip
    Updated Apr 12, 2025
    Cite
    AJ (2025). Spam Detection Dataset [Dataset]. https://www.kaggle.com/datasets/smayanj/spam-detection-dataset
    Available download formats: zip (234723 bytes)
    Dataset updated
    Apr 12, 2025
    Authors
    AJ
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.

    Features:

    1. num_links

      • Type: Integer
      • Meaning: Number of links present in the email body
      • Generated using a Poisson distribution with an average (λ) of 1.5
      • Assumption: More links often mean higher chances of spam
    2. num_words

      • Type: Integer
      • Meaning: Total number of words in the email
      • Randomly picked between 20 and 200
      • Assumption: Short or overly long emails might look suspicious, but this is more of a neutral feature
    3. has_offer

      • Type: Binary (0 or 1)
      • Meaning: Whether the email contains the word “offer”
      • Simulated using a binomial distribution (30% chance of being 1)
      • Assumption: Marketing language like “offer” is common in spam
    4. sender_score

      • Type: Float between 0 and 1
      • Meaning: A simulated reputation score of the email sender
      • Normally distributed around 0.7, clipped to stay between 0 and 1
      • Assumption: A low sender score means the sender is less trustworthy (and more likely to send spam)
    5. all_caps

      • Type: Binary (0 or 1)
      • Meaning: Whether the subject line is written in ALL CAPS
      • Simulated with a 10% chance of being 1
      • Assumption: All-caps subject lines are usually attention-grabbing and common in spam

    Target:

    1. is_spam
      • Type: Binary (0 or 1)
      • Meaning: Whether the email is spam
      • Generated using a rule-based formula:
        • Spam probability increases if:
          • Links > 2
          • It contains an “offer”
          • Sender score < 0.4
          • Subject is in all caps
        • These factors are combined with different weights
        • A little noise is added using Gaussian randomness to simulate real-world uncertainty
        • Emails are labeled as spam if the final probability crosses 0.5

    Why this dataset is useful:

    • You can try binary classification algorithms like Logistic Regression, Decision Trees, Random Forests, or Neural Networks.
    • It's great for feature importance analysis—you can check which features most affect spam prediction.
    • You can test model robustness using noisy, rule-based labels.
    • Good for building and evaluating explainable AI models since the rules are known.
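The generation rules above can be sketched directly. The distributions are as documented; the combination weights, the sender-score standard deviation, and the noise scale are assumptions, since only the rule structure is given:

```python
# Sketch: regenerating data with the documented rules. Poisson λ=1.5, the 30%
# offer rate, sender score around 0.7 clipped to [0, 1], and the 10% all-caps
# rate come from the description; the weights, the 0.15 sender-score SD, and
# the 0.1 noise SD are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
num_links = rng.poisson(1.5, n)
num_words = rng.integers(20, 201, n)
has_offer = rng.binomial(1, 0.3, n)
sender_score = np.clip(rng.normal(0.7, 0.15, n), 0, 1)
all_caps = rng.binomial(1, 0.1, n)

# Weighted rule-based probability plus Gaussian noise, thresholded at 0.5.
prob = (0.30 * (num_links > 2)
        + 0.25 * has_offer
        + 0.30 * (sender_score < 0.4)
        + 0.15 * all_caps
        + rng.normal(0, 0.1, n))
is_spam = (prob > 0.5).astype(int)
print(f"spam rate: {is_spam.mean():.3f}")
```

Because the labels are rule-based, any model trained on this data can be checked against the known ground-truth rules, which is what makes it useful for explainability work.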
  8. Lifestyle and Health Risk Prediction

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction
    Available download formats: zip (61139 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Arif Miah
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description:

    This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

    The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

    🧾 Columns Description:

    • age: Age of the person (years). Numeric, e.g. 35
    • weight: Body weight in kilograms. Numeric, e.g. 70
    • height: Height in centimeters. Numeric, e.g. 172
    • exercise: Exercise frequency level. Categorical (none, low, medium, high), e.g. medium
    • sleep: Average hours of sleep per night. Numeric, e.g. 7
    • sugar_intake: Level of sugar consumption. Categorical (low, medium, high), e.g. high
    • smoking: Smoking habit. Categorical (yes, no), e.g. no
    • alcohol: Alcohol consumption habit. Categorical (yes, no), e.g. yes
    • married: Marital status. Categorical (yes, no), e.g. yes
    • profession: Type of work or profession. Categorical (office_worker, teacher, doctor, engineer, etc.), e.g. teacher
    • bmi: Body Mass Index calculated as weight / (height²). Numeric, e.g. 24.5
    • health_risk: Target label showing overall health risk. Categorical (low, high), e.g. high

    🧩 Use Cases:

    1. Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).

    2. Feature Importance Analysis: Identify which lifestyle factors most influence health risk.

    3. Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.

    4. Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.

    5. Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

    💡 Case Study Example:

    Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

    You could:

    • Perform EDA to find correlations between age, BMI, and health risk.
    • Train a model using Random Forest to predict health_risk.
    • Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.

    ⚙️ Technical Information:

    • Rows: 5,000 (adjustable, you can create more)
    • Columns: 12
    • Target variable: health_risk
    • Data type: Mixed (Numeric + Categorical)
    • Source: Fully synthetic, generated using Python (NumPy, Faker)
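A quick sketch of the bmi column described above (weight in kg, height in cm), plus one-hot encoding of a categorical column for the preprocessing practice the use cases mention; the values are made up:

```python
# Sketch: deriving bmi from weight (kg) and height (cm), then one-hot encoding
# a categorical column. Values are made up.
import pandas as pd

df = pd.DataFrame({
    "weight": [70, 85],            # kg
    "height": [172, 180],          # cm
    "exercise": ["medium", "none"],
})
# BMI uses height in metres, so convert from cm first.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df = pd.get_dummies(df, columns=["exercise"])
print(df.round(1))
```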

    📈 License:

    CC0: Public Domain. You are free to use this dataset for research, learning, or commercial projects.

    🌍 Author:

    Created by Arif Miah (Machine Learning Engineer | Kaggle Expert | Data Scientist). 📧 arifmiahcse@gmail.com

  9. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Available download formats: zip (2104444 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    • Transaction_ID: Unique identifier for each transaction
    • User_ID: Unique identifier for the user
    • Transaction_Amount: Amount of money involved in the transaction
    • Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    • Timestamp: Date and time of the transaction
    • Account_Balance: User's current account balance before the transaction
    • Device_Type: Type of device used (Mobile, Desktop, etc.)
    • Location: Geographical location of the transaction
    • Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    • IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    • Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    • Daily_Transaction_Count: Number of transactions made by the user that day
    • Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    • Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    • Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    • Card_Age: Age of the card in months
    • Transaction_Distance: Distance between the user's usual location and the transaction location
    • Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    • Risk_Score: Fraud risk score computed for the transaction
    • Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    • Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
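A minimal evaluation sketch for the Fraud_Label task, using scikit-learn's gradient boosting as a stand-in for the XGBoost/LightGBM models named above; the features and the fraud rule here are synthetic assumptions:

```python
# Sketch: train/evaluate a gradient-boosting fraud classifier with ROC AUC,
# the usual metric for imbalanced fraud labels. Features and the fraud rule
# are synthetic assumptions, not the actual dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 3000
amount = rng.lognormal(4, 1, n)
risk_score = rng.random(n)
distance = rng.exponential(10, n)
# Assumed rule: fraud when risk and amount are both high, plus 2% random noise.
y = ((risk_score > 0.8) & (amount > 80) | (rng.random(n) < 0.02)).astype(int)
X = np.column_stack([amount, risk_score, distance])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```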
  10. Predictive Maintenance Dataset (AI4I 2020)

    • kaggle.com
    zip
    Updated Nov 6, 2022
    Cite
    Stephan Matzka (2022). Predictive Maintenance Dataset (AI4I 2020) [Dataset]. https://www.kaggle.com/datasets/stephanmatzka/predictive-maintenance-dataset-ai4i-2020/data
    Available download formats: zip (138762 bytes)
    Dataset updated
    Nov 6, 2022
    Authors
    Stephan Matzka
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Please note that this is the original dataset with additional information and proper attribution. There is at least one other version of this dataset on Kaggle that was uploaded without permission. Please be fair and attribute the original author. This synthetic dataset is modeled after an existing milling machine and consists of 10,000 data points, stored as rows, with 14 features in columns:

    1. UID: unique identifier ranging from 1 to 10000
    2. product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
    3. type: just the product type L, M or H from column 2
    4. air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
    5. process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
    6. rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
    7. torque [Nm]: torque values are normally distributed around 40 Nm with a SD = 10 Nm and no negative values.
    8. tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
    9. a 'machine failure' label that indicates whether the machine has failed at this particular data point, i.e. whether any of the following failure modes are true.

    The machine failure consists of five independent failure modes:

    10. tool wear failure (TWF): the tool is replaced or fails at a randomly selected tool wear time between 200 and 240 minutes (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
    11. heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
    12. power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
    13. overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 data points.
    14. random failures (RNF): each process has a 0.1% chance to fail regardless of its process parameters. This is the case for only 5 data points, fewer than could be expected for 10,000 data points in our dataset.

    If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes has caused the process to fail.
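The three deterministic failure rules above (HDF, PWF, OSF) can be written as explicit checks; TWF and RNF involve random draws and are omitted. A sketch:

```python
# Sketch: the three deterministic AI4I failure-mode rules as explicit checks.
# TWF and RNF are random processes, so they are not encoded here.
import math

OSF_LIMIT = {"L": 11_000, "M": 12_000, "H": 13_000}  # minNm per product variant

def failure_modes(air_t, process_t, speed_rpm, torque_nm, tool_wear_min, variant):
    """Return which deterministic failure modes fire for one data point."""
    # HDF: small air/process temperature gap at low rotational speed.
    hdf = (process_t - air_t) < 8.6 and speed_rpm < 1380
    # PWF: mechanical power = torque * angular velocity (rpm -> rad/s).
    power_w = torque_nm * speed_rpm * 2 * math.pi / 60
    pwf = power_w < 3500 or power_w > 9000
    # OSF: tool wear * torque above the variant-specific limit.
    osf = tool_wear_min * torque_nm > OSF_LIMIT[variant]
    return {"HDF": hdf, "PWF": pwf, "OSF": osf, "failure": hdf or pwf or osf}

print(failure_modes(300.0, 308.0, 1300, 20.0, 100, "L"))
```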

    This dataset is part of the following publication, please cite when using this dataset: S. Matzka, "Explainable Artificial Intelligence for Predictive Maintenance Applications," 2020 Third International Conference on Artificial Intelligence for Industries (AI4I), 2020, pp. 69-74, doi: 10.1109/AI4I49448.2020.00023.

    The image of the milling process is the work of Daniel Smyth @ Pexels: https://www.pexels.com/de-de/foto/industrie-herstellung-maschine-werkzeug-10406128/

  11. UCI ML Parkinsons dataset

    • kaggle.com
    zip
    Updated Jul 8, 2025
    Cite
    Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
    Available download formats: zip (316796 bytes)
    Dataset updated
    Jul 8, 2025
    Authors
    Elnaz Alikarami
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford Parkinson's Disease Detection Dataset, UCI Machine Learning Repository

    Original dataset link: https://archive.ics.uci.edu/dataset/174/parkinsons

    Dataset Characteristics: Multivariate

    Subject Area: Health and Medicine

    Associated Tasks: Classification

    Feature Type: Real

    Instances: 197

    Features: 22

    Additional Information

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

    The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

    Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

    Has Missing Values?

    No

  12. Iris_Data

    • kaggle.com
    Updated May 1, 2025
    + more versions
    Cite
    Aniket Gaikwad (2025). Iris_Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/11634170
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aniket Gaikwad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📄 Dataset Description: Iris Flower Dataset

    The Iris dataset is one of the most famous and widely used datasets in the field of machine learning and pattern recognition. It was first introduced by the British biologist and statistician Ronald A. Fisher in 1936.

    🌸 Dataset Overview

    The dataset consists of 150 samples of iris flowers from three different species:

    • Setosa
    • Versicolor
    • Virginica

    Each sample contains four features (all numeric):

    1. Sepal Length (cm)
    2. Sepal Width (cm)
    3. Petal Length (cm)
    4. Petal Width (cm)

    The target variable is the species of the flower.

    📊 Dataset Characteristics

    • Sepal Length: Length of the sepal in cm (Float)
    • Sepal Width: Width of the sepal in cm (Float)
    • Petal Length: Length of the petal in cm (Float)
    • Petal Width: Width of the petal in cm (Float)
    • Species: Category — one of Setosa, Versicolor, Virginica (String)

    🔍 Applications

    This dataset is commonly used for:

    • Supervised learning (classification)
    • Data visualization and EDA
    • Algorithm comparison (e.g., Logistic Regression, SVM, KNN)
    • Dimensionality reduction (e.g., PCA)

    ✅ Why This Dataset?

    • Small and easy to understand
    • Contains both numeric features and categorical labels
    • Useful for demonstrating classification algorithms and metrics
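A minimal classification workflow of the kind described above, using the copy of Iris that ships with scikit-learn:

```python
# Sketch: the classic Iris classification workflow, using scikit-learn's
# built-in copy of the dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```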
  13. YouTube Likes Prediction AV HackLive

    • kaggle.com
    zip
    Updated Oct 2, 2020
    Cite
    Vishal Gupta (2020). YouTube Likes Prediction AV HackLive [Dataset]. https://www.kaggle.com/datasets/jinxzed/youtube-likes-prediction-av-hacklive/discussion
    Available download formats: zip (21795242 bytes)
    Dataset updated
    Oct 2, 2020
    Authors
    Vishal Gupta
    License

    GNU GPL v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    YouTube
    Description

    Context

    As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has developed into a new type of career in recent decades. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority for YouTubers. Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms. This raised our interest in predicting the performance of a video. If creators can have a preliminary prediction and understanding of their videos' performance, they can adjust their videos to gain the most attention from the public.

    You have been provided details on videos along with some features as well. Can you accurately predict the number of likes for each video using the set of input variables?

    Content

    Train Set

    video_id -> Identifier for each video

    title -> Name of the Video on Youtube

    channel_title -> Name of the Channel on Youtube

    category_id -> Category of the Video (anonymous)

    publish_date -> The date video was published

    tags -> Different tags for the video

    views -> Number of views received by the Video

    dislikes -> Number of dislikes on the Video

    comment_count -> Number of comments on the Video

    description -> Textual description of the Video

    country_code -> Country from which the Video was published

    likes -> Number of Likes on the video
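A baseline sketch for the likes-prediction task: engagement counts are heavy-tailed, so a log transform helps linear models. The data and relationships here are synthetic assumptions:

```python
# Sketch: baseline likes prediction with a log transform on heavy-tailed
# engagement counts. Data and relationships are synthetic assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 1000
views = rng.lognormal(10, 1.5, n)
comments = views * rng.uniform(0.001, 0.01, n)
likes = views * rng.uniform(0.01, 0.05, n)

# Log-transform so the linear model works on multiplicative relationships.
X = np.log1p(np.column_stack([views, comments]))
model = LinearRegression().fit(X, np.log1p(likes))
print(f"R² (log scale): {model.score(X, np.log1p(likes)):.3f}")
```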

    Acknowledgements

    Thank You Analytics Vidhya for providing this dataset.

  14. CIFAKE: Real and AI-Generated Synthetic Images

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jordan J. Bird (2023). CIFAKE: Real and AI-Generated Synthetic Images [Dataset]. https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Jordan J. Bird
    Description

    CIFAKE: Real and AI-Generated Synthetic Images

    The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.

    CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect whether an image is real or has been generated by AI?

    Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Dataset details

    The dataset contains two classes - REAL and FAKE.

    For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset

    For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4

    There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
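
    A minimal baseline for the real-vs-fake question is a linear classifier on flattened pixels. The sketch below uses synthetic stand-ins for 32x32 RGB images (the real dataset's folder layout, and a stronger CNN pipeline, are left out for brevity); the small mean shift between the two classes is an assumption so the baseline has something to learn.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for 32x32 RGB images, one array per class.
    rng = np.random.default_rng(0)
    real = rng.normal(0.45, 0.2, size=(200, 32, 32, 3))  # class 0: REAL
    fake = rng.normal(0.55, 0.2, size=(200, 32, 32, 3))  # class 1: FAKE
    X = np.vstack([real, fake]).reshape(400, -1)          # flatten pixels
    y = np.array([0] * 200 + [1] * 200)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)
    ```

    On the actual CIFAKE images a convolutional network is the more natural choice; this flattened-pixel baseline mainly gives a floor to compare against.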

    Papers with Code

    The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images

    References

    If you use this dataset, you must cite the following sources

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

    Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Real images are from Krizhevsky & Hinton (2009), fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is available here.

    Notes

    The updates to the dataset on the 28th of March 2023 did not change the data itself; files with the ".jpeg" extension were renamed to ".jpg", and the root folder was re-uploaded to meet Kaggle's usability requirements.

    License

    This dataset is published under the same MIT license as CIFAR-10:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  15. Phishing website Detector

    • kaggle.com
    zip
    Updated Feb 28, 2020
    Cite
    Eswar Chand (2020). Phishing website Detector [Dataset]. https://www.kaggle.com/eswarchandt/phishing-website-detector
    Authors
    Eswar Chand
    Description


    The data set is provided as both a text file and a CSV file, which supply the following resources that can be used as inputs for model building:

    1. A collection of URLs for 11,000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).

    2. A code template containing these code blocks: a. Import modules (Part 1) b. Load data function + input/output field descriptions

    The data set also serves as an input for project scoping, helping to specify the functional and non-functional requirements of the project.

    Background of Problem Statement :

    You are expected to write the code for a binary classification model (phishing website or not) using Python scikit-learn that trains on the data and calculates the accuracy score on the test data. You must use one or more classification algorithms to train a model on the phishing website data set.

    Dataset Description:

    1. The ".txt" version of the dataset has no headers and contains only the column values.
    2. The column-wise header is described below; if needed, add the header manually when using the ".txt" file. The ".csv" file already includes the column names.
    3. The header list (column names) is as follows: [ 'UsingIP', 'LongURL', 'ShortURL', 'Symbol@', 'Redirecting//', 'PrefixSuffix-', 'SubDomains', 'HTTPS', 'DomainRegLen', 'Favicon', 'NonStdPort', 'HTTPSDomainURL', 'RequestURL', 'AnchorURL', 'LinksInScriptTags', 'ServerFormHandler', 'InfoEmail', 'AbnormalURL', 'WebsiteForwarding', 'StatusBarCust', 'DisableRightClick', 'UsingPopupWindow', 'IframeRedirection', 'AgeofDomain', 'DNSRecording', 'WebsiteTraffic', 'PageRank', 'GoogleIndex', 'LinksPointingToPage', 'StatsReport', 'class' ]

    Brief description of the features in the data set:

    ● UsingIP (categorical - signed numeric): { -1,1 }
    ● LongURL (categorical - signed numeric): { 1,0,-1 }
    ● ShortURL (categorical - signed numeric): { 1,-1 }
    ● Symbol@ (categorical - signed numeric): { 1,-1 }
    ● Redirecting// (categorical - signed numeric): { -1,1 }
    ● PrefixSuffix- (categorical - signed numeric): { -1,1 }
    ● SubDomains (categorical - signed numeric): { -1,0,1 }
    ● HTTPS (categorical - signed numeric): { -1,1,0 }
    ● DomainRegLen (categorical - signed numeric): { -1,1 }
    ● Favicon (categorical - signed numeric): { 1,-1 }
    ● NonStdPort (categorical - signed numeric): { 1,-1 }
    ● HTTPSDomainURL (categorical - signed numeric): { -1,1 }
    ● RequestURL (categorical - signed numeric): { 1,-1 }
    ● AnchorURL (categorical - signed numeric):

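
    The requested scikit-learn workflow might look like the sketch below. Since the file is not bundled here, a synthetic frame with the same signed categorical coding ({-1, 0, 1}) and a toy labelling rule stands in for the real data; with the real file you would instead load the CSV and split off the 'class' column.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in using the signed categorical coding described above;
    # with the real CSV: df = pd.read_csv(...), X = df.drop(columns="class").
    rng = np.random.default_rng(0)
    n = 1000
    X = pd.DataFrame({
        "UsingIP": rng.choice([-1, 1], n),
        "LongURL": rng.choice([-1, 0, 1], n),
        "HTTPS": rng.choice([-1, 0, 1], n),
        "SubDomains": rng.choice([-1, 0, 1], n),
    })
    # Toy labelling rule: no HTTPS plus an IP-based URL leans phishing (-1).
    y = np.where(X["HTTPS"] + X["UsingIP"] < 0, -1, 1)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    ```

    Because all 30 real features are already numeric categorical codes, no encoding step is needed before fitting a tree-based model like this.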
  16. Insurance Premium Data

    • kaggle.com
    zip
    Updated Apr 22, 2021
    Cite
    Prachi Gopalani (2021). Insurance Premium Data [Dataset]. https://www.kaggle.com/datasets/prachi13/insurance13m-persistency/discussion?sort=undefined
    Authors
    Prachi Gopalani
    Description

    1. Problem Description:

    • Prepare a machine learning model to predict the Persistency 13M payment behaviour at the New Business stage.

    2. Objective:

    • Using machine learning techniques, provide a score for each policy at the New Business stage for its likelihood of paying the 13M premium.
    • Identify the segments where the maximum number of non-payers is captured.

    3. Dataset:

    • “Training” and “Test” datasets with the raw input attributes and the actual 13M paid/not-paid flag.
    • “Out of Time” datasets are provided with just the raw input attributes.

    4. Expected Steps:

      1. Conduct appropriate data treatments, e.g. missing-value imputation, outlier treatment, etc.
      2. Conduct the required feature engineering, e.g. binning, ratio, interaction, and polynomial features.
      3. Use any machine learning algorithm, or combination of algorithms, you deem fit.
      4. Prepare your model on the train data; you can evaluate its generalization capability using K-Fold Cross Validation, Leave-One-Out Cross Validation, or any other validation technique that you see appropriate.
      5. Score the Test and Out of Time data and share it back to us along with the scored Train data for evaluation. Also share all the model code and documentation.
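
    The cross-validation step above can be sketched as follows. The real persistency attributes are not public, so a synthetic binary task from make_classification (with an assumed 70/30 class split standing in for the paid/not-paid flag) illustrates only the validation mechanics.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in: a 70/30 binary task replaces the real paid/not-paid flag.
    X, y = make_classification(n_samples=600, n_features=10,
                               weights=[0.7, 0.3], random_state=0)

    # Stratified folds preserve the class ratio in every train/validation split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc = scores.mean()
    ```

    ROC AUC is a reasonable scoring choice here because the task asks for likelihood scores per policy rather than hard labels.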
  17. E-Commerce Fraud Detection Dataset

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    UmutUygurr (2025). E-Commerce Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/e-commerce-fraud-detection-dataset
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌍 Storyline: The Digital Bazaar

    In 2024, e-commerce platforms across Istanbul, Berlin, New York, London, and Paris began noticing strange transaction bursts. Some cards were tested with $1 purchases at midnight. Others shipped “gaming accessories” 5,000 km away. Promo codes were being reused by freshly created accounts.

    To investigate these global patterns safely, this synthetic dataset recreates realistic fraud behavior across countries, channels, and user profiles — allowing anyone to build, test, and compare fraud-detection models without exposing any real user data.

    💡 What makes it special

    🧍‍♀️ 6,000 unique users performing ≈300,000 transactions

    💳 Multiple transactions per user (40–60) → enables behavioral analysis

    🧩 Strong feature correlations — not random noise

    🌐 Cross-country dynamics (country, bin_country)

    💸 Natural imbalance (~2% fraud) just like real financial systems

    🕓 Time realism — night-time fraud spikes, daily rhythms

    🧠 Feature explainability — easy to visualize, model, and interpret
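
    The ~2% fraud rate is the main modeling difficulty. One common way to handle it, sketched here on synthetic data with the same imbalance (the real columns are not reproduced), is to reweight the rare class during fitting:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic task mirroring the ~2% fraud rate described above.
    X, y = make_classification(n_samples=5000, n_features=8,
                               weights=[0.98, 0.02], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    # class_weight="balanced" upweights the rare fraud class during fitting,
    # trading some precision for better recall on the minority class.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
    fraud_recall = recall_score(y_te, clf.predict(X_te))
    ```

    With imbalance like this, accuracy is a misleading metric (predicting "not fraud" everywhere already scores ~98%); recall, precision, and PR-AUC on the fraud class are more informative.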

  18. Best scored model

    • kaggle.com
    zip
    Updated Jan 23, 2021
    Cite
    Abid Ali Awan (2021). Best scored model [Dataset]. https://www.kaggle.com/kingabzpro/best-scored-model
    Authors
    Abid Ali Awan
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Context

    Hurricanes can cause upwards of 1,000 deaths and $50 billion in damages in a single event, and have been responsible for well over 160,000 deaths globally in recent history. During a tropical cyclone, humanitarian response efforts hinge on accurate risk approximation models that depend on wind speed measurements at different points in time throughout a storm’s life cycle.

    For several decades, forecasters have relied on visual pattern recognition of complex cloud features in visible and infrared imagery. While the longevity of this technique indicates the strong relationship between spatial patterns and cyclone intensity, visual inspection is manual, subjective, and often leads to inconsistent estimates between even well-trained analysts.

  19. Consumer Defensive Stock Predictions

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Alden Lin (2025). Consumer Defensive Stock Predictions [Dataset]. https://www.kaggle.com/datasets/aldenlin/consumer-defensive-stock-predictions
    Authors
    Alden Lin
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset

    Why This Dataset Matters

    This dataset is created for Kaggle users who want to explore, experiment, and innovate in the field of stock prediction and financial machine learning. It bridges real-world financial data with practical modeling challenges, enabling you to build, test, and showcase predictive models that simulate professional-grade investment analysis.

    Whether you're a beginner exploring quantitative finance or an experienced data scientist refining predictive strategies, this dataset offers a rich playground for uncovering insights and improving model performance.

    What Makes This Dataset Unique

    • ✅ Combines fundamentals + analyst sentiment + historical price data
    • ✅ Designed specifically for prediction tasks, not just visualization
    • ✅ Feature-ready structure to shorten your data preprocessing time
    • ✅ Ideal for classification problems (e.g. predicting 20% gain within 6 months)
    • ✅ Suitable for both EDA and end-to-end ML pipelines

    This dataset allows contributors to focus on what matters most: building impactful models and sharing innovative approaches with the Kaggle community.

    Data Sources

    Data is aggregated from widely-used financial intelligence platforms:

    • FinancialModelingPrep (FMP) Company profiles, financial fundamentals, financial ratios, S&P 500 performance indicators, and analyst ratings.

    • Alpha Vantage Daily historical OHLC and adjusted close stock price data.

    These sources ensure high relevance, broad coverage, and strong analytical value for market-based modeling.

    Dataset Structure

    The dataset is organized for intuitive exploration and modeling, with each record structured by stock ticker and time period. Feature categories include:

    • Company metadata: sector, industry, market capitalization
    • Fundamental indicators: valuation ratios, profitability, revenue growth
    • Analyst sentiment: ratings and consensus measures
    • Price behavior: OHLC and adjusted close data
    • Engineered predictors: derived metrics to improve model accuracy
    • Target variable: stock performance outcome (e.g. achieving a defined % gain within a future horizon)

    This design supports: ✔ Binary classification ✔ Regression modeling ✔ Time-series experimentation
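
    The target variable described above (e.g. a 20% gain within a ~6-month horizon) could be constructed as in this sketch. The column names ticker and adj_close and the 126-trading-day horizon are assumptions, and the deterministic toy price series is for illustration only.

    ```python
    import numpy as np
    import pandas as pd

    # Toy price series: one ticker growing 0.2% per trading day.
    prices = pd.DataFrame({
        "ticker": ["KO"] * 300,
        "adj_close": [50 * 1.002 ** i for i in range(300)],
    })

    horizon = 126  # ~6 months of trading days (assumed)
    # Future price `horizon` rows ahead, computed per ticker.
    future = prices.groupby("ticker")["adj_close"].shift(-horizon)
    ratio = future / prices["adj_close"]

    prices["target"] = (ratio >= 1.20).astype(float)  # 1.0 if >= 20% gain
    prices.loc[ratio.isna(), "target"] = np.nan       # last rows have no label
    ```

    Leaving the final `horizon` rows unlabeled (rather than filling them) avoids look-ahead leakage when the frame is later split for training.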

    Ideal For Kaggle Projects Like

    • 📈 "Can fundamentals predict market winners?"
    • 🤖 Stock prediction ML competitions
    • 🧮 Feature importance & model explainability studies
    • 📊 Financial dashboard prototypes
    • 🧠 Algorithm comparison challenges

    Data Processing Pipeline

    To ensure usability and consistency, the data underwent:

    • Cleaning and removal of anomalous values
    • Standardization of formats and units
    • Multi-source alignment and validation
    • Feature transformation for predictive readiness

    This ensures a smooth experience whether you're performing quick EDA or building production-ready models.

    Community Invitation

    You are encouraged to:

    • 🚀 Build and share predictive notebooks
    • 💬 Discuss modeling strategies
    • 🔍 Explore novel feature engineering ideas
    • ⭐ Fork, upvote, and contribute improvements

    If you find this dataset valuable, feel free to follow for future releases and updates as this project evolves with enhanced features, expanded stock coverage, and refined modeling strategies.

    Disclaimer

    This dataset is intended for research and educational purposes only and does not constitute financial or investment advice. Market conditions and external factors may significantly influence real-world outcomes.

    ✨ This dataset is part of an ongoing effort to build a transparent, reusable, and community-driven resource for advancing financial machine learning on Kaggle.

  20. Adverse Drug Effects (ADE) Detection

    • kaggle.com
    zip
    Updated Oct 8, 2025
    Cite
    Sai Kiran Udayana (2025). Adverse Drug Effects (ADE) Detection [Dataset]. https://www.kaggle.com/datasets/saikiranudayana/adverse-drug-effects-ade-detection
    Authors
    Sai Kiran Udayana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    💉 COVID-19 Vaccine Adverse Events (2020-2025): VAERS Real-World Surveillance Data

    This dataset offers a critical, large-scale look into the real-world safety surveillance of COVID-19 vaccines, sourced from the Vaccine Adverse Event Reporting System (VAERS). Maintained by the CDC and FDA, this collection spans the unprecedented period of mass vaccination from 2020 through 2025, providing an invaluable resource for pharmacovigilance, public health research, and regulatory decision-making.

    Key Features & Challenge

    The dataset is a rich blend of structured and unstructured information detailing reported Adverse Drug Events (ADEs), which range from mild local reactions to severe, life-threatening complications.

    Structured Data: Includes standardized symptom codes, offering a direct, quantitative view of reported reactions.

    Free-Text Notes: Contains verbose, real-world symptom descriptions provided by reporters. This text is a "treasure trove" of granular context, including details on duration, intensity, and location of symptoms.

    The Challenge: The structured entries are limited in scope. The free-text notes, while rich, are inherently noisy and lack standardized metadata such as clinical severity scores or age-specific pattern normalization.

    Value to Data Scientists

    This dataset presents a significant Natural Language Processing (NLP) and Machine Learning (ML) challenge:

    Extracting Context: Develop models to effectively extract critical clinical context (e.g., "headache lasting three days, severe") from the raw, non-standardized free-text notes.

    Standardizing Severity: Create predictive models to assign standardized severity and age-specific risk patterns to ADEs.

    Informed Decision Making: The ultimate goal is to generate actionable, timely insights for regulators, healthcare providers, and pharmaceutical companies, improving both vaccine safety monitoring and public trust.

    Dive into this dataset to apply your skills in advanced data cleaning, feature engineering, and state-of-the-art NLP to solve a crucial, high-impact public health challenge.
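
    Extracting context such as duration and severity from the free-text notes can start with simple pattern matching before moving to heavier NLP. The severity vocabulary and duration regex below are illustrative assumptions, not a clinical standard.

    ```python
    import re

    # Duration phrases like "lasting three days" or "for two weeks";
    # longer alternatives listed first so "days" is not cut to "day".
    DURATION = re.compile(
        r"(?:lasting|for)\s+(\w+)\s+(days|day|weeks|week|hours|hour)")

    def extract(note: str) -> dict:
        """Pull a coarse severity cue and a duration phrase from a free-text note."""
        note_l = note.lower()
        severity = next(
            (s for s in ("severe", "moderate", "mild") if s in note_l), None)
        m = DURATION.search(note_l)
        duration = f"{m.group(1)} {m.group(2)}" if m else None
        return {"severity": severity, "duration": duration}

    result = extract("Headache lasting three days, severe; mild fever.")
    ```

    A rule-based pass like this gives a cheap baseline and labeled seeds; named-entity models fine-tuned on clinical text would be the natural next step for the noisier notes.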
