43 datasets found
  1. MoA Feature Importance with Rapids

    • kaggle.com
    zip
    Updated Nov 11, 2020
    Cite
    Loulou (2020). MoA Feature Importance with Rapids [Dataset]. https://www.kaggle.com/louise2001/moa-feat-importance-rapids
    Available download formats: zip (1630369 bytes)
    Dataset updated
    Nov 11, 2020
    Authors
    Loulou
    Description

    This dataset was created by Loulou.

  2. Zieni dataset for Phishing detection

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Sep 3, 2024
    + more versions
    Cite
    Rasha Zieni (2024). Zieni dataset for Phishing detection [Dataset]. https://www.kaggle.com/datasets/rashazieni/zieni-dataset/code
    Available download formats: zip (129009 bytes)
    Dataset updated
    Sep 3, 2024
    Authors
    Rasha Zieni
    Description

    This dataset was used for training machine learning models to detect phishing attacks and for studying the explainability of these models. It was published in 2024. The dataset refers to phishing and legitimate websites. Phishing samples were collected from two sources, namely PhishTank and Tranco, whereas legitimate samples were collected from Alexa. The dataset is balanced and contains 5,000 phishing and 5,000 legitimate samples, each described by 74 features extracted from the entire URL as well as from the Fully Qualified Domain Name, pathname, filename, and parameters. Of these features, 70 are numerical and four are binary. The target variable is also binary.
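Given the balanced 5,000/5,000 composition, a stratified split keeps both classes evenly represented in train and test sets. A minimal sketch with synthetic stand-in data (shapes follow the description above; the actual feature values are not reproduced here):

```python
# Sketch: stratified split for a balanced phishing dataset.
# The 10,000 x 74 shape (70 numerical + 4 binary features) comes from the
# description above; the values themselves are random stand-ins.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(10_000, 70)),          # 70 numerical features
               rng.integers(0, 2, size=(10_000, 4))])  # 4 binary features
y = np.array([1] * 5_000 + [0] * 5_000)                # balanced binary target

# Stratify so both classes stay balanced in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_tr.mean(), y_te.mean())  # both stay at 0.5
```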

    Reference

    Calzarossa, M., Giudici, P., Zieni, R.: Explainable machine learning for phishing feature detection. Quality and Reliability Engineering International 40, 362–373 (2024).

    Cite this dataset

    Zieni, Rasha (2024), “Zieni dataset for Phishing detection”, Mendeley Data, V1, doi: 10.17632/8mcz8jsgnb.1

  3. UCI Heart Disease - Explainable AI Project Assets

    • kaggle.com
    zip
    Updated Nov 18, 2025
    Cite
    Ariyan_Pro (2025). UCI Heart Disease - Explainable AI Project Assets [Dataset]. https://www.kaggle.com/datasets/ariyannadeem/uci-heart-disease-explainable-ai-project-assets
    Available download formats: zip (1051043 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Ariyan_Pro
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Medical-Grade Explainable AI Project Assets

    This dataset contains comprehensive assets for a production-ready Explainable AI (XAI) heart disease prediction system achieving 94.1% accuracy with full model transparency.

    📊 CONTEXT: Healthcare AI faces a critical "black box" problem where models make predictions without explanations. This project demonstrates how to build trustworthy medical AI using SHAP and LIME for real-time explainability.

    🎯 PROJECT GOAL: Create a clinically deployable AI system that not only predicts heart disease with high accuracy but also provides interpretable explanations for each prediction, enabling doctor-AI collaboration.

    🚀 KEY FEATURES:

    • 94.1% prediction accuracy (XGBoost + Optuna)
    • Real-time SHAP & LIME explanations
    • FastAPI backend with medical validation
    • Gradio clinical dashboard
    • Full MLOps pipeline (MLflow tracking)
    • 4-layer enterprise architecture

    📁 ASSETS INCLUDED:

    • heart_clean.csv: clinical dataset ready for analysis
    • SHAP summary plots for global explainability
    • Performance metrics and visualizations
    • Architecture diagrams
    • Model evaluation results

    🔗 COMPANION RESOURCES:

    • Live Demo: https://huggingface.co/spaces/Ariyan-Pro/HeartDisease-Predictor
    • Notebook: https://www.kaggle.com/code/ariyannadeem/heart-disease-prediction-with-explainable-ai
    • Source Code: https://github.com/Ariyan-Pro/ExplainableAI-HeartDisease

    Perfect for learning medical AI implementation, explainable AI techniques, and production deployment.
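As a lightweight illustration of the global-explainability idea behind the SHAP summary plots mentioned above, here is a sketch using scikit-learn's permutation importance on synthetic heart-like data. The feature names and the risk rule are assumptions for illustration, not the actual dataset or model:

```python
# Sketch: global feature importance via permutation importance, a lighter-weight
# stand-in for the SHAP summary plots described above.
# Feature names (age, chol, max_hr) and the risk rule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1500
age = rng.normal(55, 9, n)
chol = rng.normal(240, 40, n)      # pure noise in this toy setup
max_hr = rng.normal(150, 20, n)
# Assumed rule so that age and max_hr carry real signal.
risk = 0.05 * (age - 55) - 0.04 * (max_hr - 150) + rng.normal(0, 0.5, n)
y = (risk > 0).astype(int)
X = np.column_stack([age, chol, max_hr])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Permute each feature on held-out data and measure the accuracy drop.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, score in zip(["age", "chol", "max_hr"], imp.importances_mean):
    print(f"{name}: {score:.3f}")
```

The noise feature (chol here) should score near zero, while the informative features dominate, which is the same qualitative reading a SHAP summary plot provides.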

  4. Financial Transactions Dataset for Fraud Detection

    • kaggle.com
    zip
    Updated May 2, 2025
    Cite
    Aryan Kumar (2025). Financial Transactions Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection
    Available download formats: zip (290256858 bytes)
    Dataset updated
    May 2, 2025
    Authors
    Aryan Kumar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 5 million synthetically generated financial transactions designed to simulate real-world behavior for fraud detection research and machine learning applications. Each transaction record includes fields such as:

    Transaction Details: ID, timestamp, sender/receiver accounts, amount, type (deposit, transfer, etc.)

    Behavioral Features: time since last transaction, spending deviation score, velocity score, geo-anomaly score

    Metadata: location, device used, payment channel, IP address, device hash

    Fraud Indicators: binary fraud label (is_fraud) and type of fraud (e.g., money laundering, account takeover)

    The dataset follows realistic fraud patterns and behavioral anomalies, making it suitable for:

    Binary and multiclass classification models

    Fraud detection systems

    Time-series anomaly detection

    Feature engineering and model explainability
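As an illustration of how a behavioral feature like "time since last transaction" can be derived from raw rows, here is a pandas sketch; the column names are assumptions based on the description above:

```python
# Sketch: deriving the "time since last transaction" behavioral feature per
# account with pandas. Column names are assumptions, values are made up.
import pandas as pd

tx = pd.DataFrame({
    "sender_account": ["A", "A", "B", "A", "B"],
    "timestamp": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:05",
        "2025-01-01 11:00", "2025-01-02 09:00", "2025-01-01 11:30",
    ]),
})
tx = tx.sort_values(["sender_account", "timestamp"])
# Seconds since the same account's previous transaction (NaN for the first one).
tx["secs_since_last"] = (
    tx.groupby("sender_account")["timestamp"].diff().dt.total_seconds()
)
print(tx)
```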

  5. Student Performance Factors Dataset

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    Mosap Abdel-Ghany (2025). Student Performance Factors Dataset [Dataset]. https://www.kaggle.com/datasets/mosapabdelghany/student-performance-factors-dataset
    Available download formats: zip (96178 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    Mosap Abdel-Ghany
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset contains data on 6,607 students and the factors influencing their academic performance. It’s designed to help researchers, educators, and data scientists analyze how habits, environment, and background affect exam scores.

    You can use this dataset for:

    • Predictive modeling of student success
    • Feature importance and correlation studies
    • Machine learning projects on education analytics
    • Educational policy or intervention analysis

    The dataset includes demographic, behavioral, and academic variables such as study hours, attendance, parental involvement, and more.

    Target Variable: Exam_Score

    Example research ideas:

    • What is the most influential factor affecting student performance?
    • Can machine learning accurately predict academic success?
    • How do socioeconomic and behavioral factors interact in education?

    Data source: Synthetic data generated for research and educational purposes.
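A minimal predictive-modeling sketch for a dataset like this, using two assumed predictors and a synthetic generating rule (the real column names and relationships may differ):

```python
# Sketch: predicting an exam score from study habits, as suggested above.
# The columns (hours, attendance) and the generating rule are illustrative
# assumptions, not the dataset's actual schema.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500
hours = rng.uniform(0, 30, n)          # weekly study hours
attendance = rng.uniform(50, 100, n)   # attendance percentage
exam_score = 40 + 1.2 * hours + 0.3 * attendance + rng.normal(0, 3, n)

X = np.column_stack([hours, attendance])
model = LinearRegression().fit(X, exam_score)
print(model.coef_)                 # should recover roughly [1.2, 0.3]
print(model.score(X, exam_score))  # R² should be high, since noise is small
```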

  6. Credit Risk Benchmark Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Adil Shamim (2025). Credit Risk Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/credit-risk-benchmark-dataset
    Available download formats: zip (316073 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This dataset has been designed as a benchmark for AutoML and predictive modeling in the financial domain. It focuses on assessing credit risk by predicting whether a borrower will experience serious delinquency within two years. The data comprises a mix of financial metrics and personal attributes, which allow users to build and evaluate models for credit risk scoring.

    Dataset Characteristics:

    • Total Features: 10 predictors and 1 target variable.
    • Data Types: All predictors are numerical (real numbers) while the target variable is binary ({0, 1}).
    • Task: Binary classification focused on credit risk prediction.

    Column Descriptions:
    Below is a list of the available columns along with their abbreviated names for ease-of-use:

    • rev_util: Ratio of revolving credit utilization (balance/credit limit)
    • age: Age of the borrower
    • late_30_59: Number of times 30-59 days past due (worse than current)
    • debt_ratio: Debt to income (or assets) ratio
    • monthly_inc: Monthly income of the borrower
    • open_credit: Number of open credit lines and loans
    • late_90: Number of times 90 days or more late on a payment
    • real_estate: Number of real estate loans or credit lines
    • late_60_89: Number of times 60-89 days past due (worse than current)
    • dependents: Number of dependents
    • dlq_2yrs: Target variable indicating if a serious delinquency occurred in the next 2 years (0 = No, 1 = Yes)

    Use Cases and Applications:
    • Risk Management: Build and validate credit scoring models to forecast borrower default risks.
    • AutoML Benchmarking: Evaluate and compare the performance of various AutoML frameworks on a structured financial dataset.
    • Academic Research: Explore trends and relationships in credit behavior, along with the predictive power of financial indicators.
    • Model Interpretability: Given the regulated nature of financial models, this dataset provides an excellent context for testing feature importance and creating explainable AI solutions.

    Additional Information:
    - Preprocessing & Feature Engineering: Users are encouraged to perform exploratory data analysis, handle potential missing values or outliers, and experiment with scaling techniques and feature transformations.
    - Regulatory Considerations: Since credit scoring models often require transparency, it’s important to incorporate techniques that ensure model interpretability.
    - Benchmarking: Ideal for comparing traditional modeling techniques (like logistic regression) with modern approaches (such as gradient boosting and neural networks).

    This dataset is now available on Kaggle for anyone looking to experiment with or benchmark predictive models for credit risk analysis. Whether you're a data scientist, researcher, or financial analyst, the dataset provides a straightforward yet robust framework for exploring credit-related behavior and risk factors.
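A baseline sketch for the binary task described above, fitting a logistic regression over the listed columns. The data here is synthetic and the delinquency rule is an assumption; only the column names come from the description:

```python
# Sketch: baseline logistic-regression credit-risk model over the documented
# columns. Distributions and the delinquency rule are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "rev_util": rng.beta(2, 5, n),
    "age": rng.integers(21, 80, n),
    "late_30_59": rng.poisson(0.3, n),
    "debt_ratio": rng.gamma(2.0, 0.2, n),
    "monthly_inc": rng.lognormal(8.5, 0.5, n),
    "open_credit": rng.poisson(8, n),
    "late_90": rng.poisson(0.1, n),
    "real_estate": rng.poisson(1, n),
    "late_60_89": rng.poisson(0.2, n),
    "dependents": rng.poisson(0.8, n),
})
# Assumed rule: delinquency risk rises with utilization and past lateness.
logit = -2.5 + 2.0 * df["rev_util"] + 0.8 * df["late_30_59"] + 1.5 * df["late_90"]
df["dlq_2yrs"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X, y = df.drop(columns="dlq_2yrs"), df["dlq_2yrs"]
# Scaling matters for regularized logistic regression on mixed-scale features.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```

Logistic regression is a natural first benchmark here precisely because of the interpretability requirements the description emphasizes.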

  7. Spam Detection Dataset

    • kaggle.com
    zip
    Updated Apr 12, 2025
    Cite
    AJ (2025). Spam Detection Dataset [Dataset]. https://www.kaggle.com/datasets/smayanj/spam-detection-dataset
    Available download formats: zip (234723 bytes)
    Dataset updated
    Apr 12, 2025
    Authors
    AJ
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.

    Features:

    1. num_links

      • Type: Integer
      • Meaning: Number of links present in the email body
      • Generated using a Poisson distribution with an average (λ) of 1.5
      • Assumption: More links often mean higher chances of spam
    2. num_words

      • Type: Integer
      • Meaning: Total number of words in the email
      • Randomly picked between 20 and 200
      • Assumption: Short or overly long emails might look suspicious, but this is more of a neutral feature
    3. has_offer

      • Type: Binary (0 or 1)
      • Meaning: Whether the email contains the word “offer”
      • Simulated using a binomial distribution (30% chance of being 1)
      • Assumption: Marketing language like “offer” is common in spam
    4. sender_score

      • Type: Float between 0 and 1
      • Meaning: A simulated reputation score of the email sender
      • Normally distributed around 0.7, clipped to stay between 0 and 1
      • Assumption: A low sender score means the sender is less trustworthy (and more likely to send spam)
    5. all_caps

      • Type: Binary (0 or 1)
      • Meaning: Whether the subject line is written in ALL CAPS
      • Simulated with a 10% chance of being 1
      • Assumption: All-caps subject lines are usually attention-grabbing and common in spam

    Target:

    1. is_spam
      • Type: Binary (0 or 1)
      • Meaning: Whether the email is spam
      • Generated using a rule-based formula:
        • Spam probability increases if:
          • Links > 2
          • It contains an “offer”
          • Sender score < 0.4
          • Subject is in all caps
        • These factors are combined with different weights
        • A little noise is added using Gaussian randomness to simulate real-world uncertainty
        • Emails are labeled as spam if the final probability crosses 0.5

    Why this dataset is useful:

    • You can try binary classification algorithms like Logistic Regression, Decision Trees, Random Forests, or Neural Networks.
    • It's great for feature importance analysis—you can check which features most affect spam prediction.
    • You can test model robustness using noisy, rule-based labels.
    • Good for building and evaluating explainable AI models since the rules are known.
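The generation rules above can be sketched directly. The distributions are as documented; the combination weights, the sender-score standard deviation, and the noise scale are assumptions, since only the rule structure is given:

```python
# Sketch: regenerating data with the documented rules. Poisson λ=1.5, the 30%
# offer rate, sender score around 0.7 clipped to [0, 1], and the 10% all-caps
# rate come from the description; the weights, the 0.15 sender-score SD, and
# the 0.1 noise SD are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
num_links = rng.poisson(1.5, n)
num_words = rng.integers(20, 201, n)
has_offer = rng.binomial(1, 0.3, n)
sender_score = np.clip(rng.normal(0.7, 0.15, n), 0, 1)
all_caps = rng.binomial(1, 0.1, n)

# Weighted rule-based probability plus Gaussian noise, thresholded at 0.5.
prob = (0.30 * (num_links > 2)
        + 0.25 * has_offer
        + 0.30 * (sender_score < 0.4)
        + 0.15 * all_caps
        + rng.normal(0, 0.1, n))
is_spam = (prob > 0.5).astype(int)
print(f"spam rate: {is_spam.mean():.3f}")
```

Because the labels are rule-based, any model trained on this data can be checked against the known ground-truth rules, which is what makes it useful for explainability work.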
  8. Lifestyle and Health Risk Prediction

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    Arif Miah (2025). Lifestyle and Health Risk Prediction [Dataset]. https://www.kaggle.com/datasets/miadul/lifestyle-and-health-risk-prediction
    Available download formats: zip (61139 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Arif Miah
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description:

    This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.

    The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.

    🧾 Columns Description:

    • age: Age of the person (years). Numeric, e.g. 35
    • weight: Body weight in kilograms. Numeric, e.g. 70
    • height: Height in centimeters. Numeric, e.g. 172
    • exercise: Exercise frequency level. Categorical (none, low, medium, high), e.g. medium
    • sleep: Average hours of sleep per night. Numeric, e.g. 7
    • sugar_intake: Level of sugar consumption. Categorical (low, medium, high), e.g. high
    • smoking: Smoking habit. Categorical (yes, no), e.g. no
    • alcohol: Alcohol consumption habit. Categorical (yes, no), e.g. yes
    • married: Marital status. Categorical (yes, no), e.g. yes
    • profession: Type of work or profession. Categorical (office_worker, teacher, doctor, engineer, etc.), e.g. teacher
    • bmi: Body Mass Index calculated as weight / (height²). Numeric, e.g. 24.5
    • health_risk: Target label showing overall health risk. Categorical (low, high), e.g. high

    🧩 Use Cases:

    1. Health Risk Prediction: Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).

    2. Feature Importance Analysis: Identify which lifestyle factors most influence health risk.

    3. Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.

    4. Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.

    5. Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.

    💡 Case Study Example:

    Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.

    You could:

    • Perform EDA to find correlations between age, BMI, and health risk.
    • Train a model using Random Forest to predict health_risk.
    • Deploy a Streamlit app where users can input their lifestyle information and get a risk score instantly.

    ⚙️ Technical Information:

    • Rows: 5,000 (adjustable, you can create more)
    • Columns: 12
    • Target variable: health_risk
    • Data type: Mixed (Numeric + Categorical)
    • Source: Fully synthetic, generated using Python (NumPy, Faker)
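A quick sketch of the bmi column described above (weight in kg, height in cm), plus one-hot encoding of a categorical column for the preprocessing practice the use cases mention; the values are made up:

```python
# Sketch: deriving bmi from weight (kg) and height (cm), then one-hot encoding
# a categorical column. Values are made up.
import pandas as pd

df = pd.DataFrame({
    "weight": [70, 85],            # kg
    "height": [172, 180],          # cm
    "exercise": ["medium", "none"],
})
# BMI uses height in metres, so convert from cm first.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df = pd.get_dummies(df, columns=["exercise"])
print(df.round(1))
```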

    📈 License:

    CC0: Public Domain. You are free to use this dataset for research, learning, or commercial projects.

    🌍 Author:

    Created by Arif Miah (Machine Learning Engineer | Kaggle Expert | Data Scientist). 📧 arifmiahcse@gmail.com

  9. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Available download formats: zip (2104444 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    • Transaction_ID: Unique identifier for each transaction
    • User_ID: Unique identifier for the user
    • Transaction_Amount: Amount of money involved in the transaction
    • Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    • Timestamp: Date and time of the transaction
    • Account_Balance: User's current account balance before the transaction
    • Device_Type: Type of device used (Mobile, Desktop, etc.)
    • Location: Geographical location of the transaction
    • Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    • IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    • Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    • Daily_Transaction_Count: Number of transactions made by the user that day
    • Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    • Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    • Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    • Card_Age: Age of the card in months
    • Transaction_Distance: Distance between the user's usual location and the transaction location
    • Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    • Risk_Score: Fraud risk score computed for the transaction
    • Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    • Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
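A minimal evaluation sketch for the Fraud_Label task, using scikit-learn's gradient boosting as a stand-in for the XGBoost/LightGBM models named above; the features and the fraud rule here are synthetic assumptions:

```python
# Sketch: train/evaluate a gradient-boosting fraud classifier with ROC AUC,
# the usual metric for imbalanced fraud labels. Features and the fraud rule
# are synthetic assumptions, not the actual dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 3000
amount = rng.lognormal(4, 1, n)
risk_score = rng.random(n)
distance = rng.exponential(10, n)
# Assumed rule: fraud when risk and amount are both high, plus 2% random noise.
y = ((risk_score > 0.8) & (amount > 80) | (rng.random(n) < 0.02)).astype(int)
X = np.column_stack([amount, risk_score, distance])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```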
  10. Predictive Maintenance Dataset (AI4I 2020)

    • kaggle.com
    zip
    Updated Nov 6, 2022
    Cite
    Stephan Matzka (2022). Predictive Maintenance Dataset (AI4I 2020) [Dataset]. https://www.kaggle.com/datasets/stephanmatzka/predictive-maintenance-dataset-ai4i-2020/data
    Available download formats: zip (138762 bytes)
    Dataset updated
    Nov 6, 2022
    Authors
    Stephan Matzka
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Please note that this is the original dataset with additional information and proper attribution. There is at least one other version of this dataset on Kaggle that was uploaded without permission. Please be fair and attribute the original author. This synthetic dataset is modeled after an existing milling machine and consists of 10,000 data points, stored as rows, with 14 features in columns:

    1. UID: unique identifier ranging from 1 to 10000
    2. product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
    3. type: just the product type L, M or H from column 2
    4. air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
    5. process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
    6. rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
    7. torque [Nm]: torque values are normally distributed around 40 Nm with a SD = 10 Nm and no negative values.
    8. tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
    9. a 'machine failure' label that indicates whether the machine has failed at this particular data point, i.e. whether any of the following failure modes are true.

    The machine failure consists of five independent failure modes:

    10. tool wear failure (TWF): the tool is replaced or fails at a randomly selected tool wear time between 200 and 240 minutes (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
    11. heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
    12. power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
    13. overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 data points.
    14. random failures (RNF): each process has a 0.1% chance to fail regardless of its process parameters. This is the case for only 5 data points, fewer than could be expected for 10,000 data points in our dataset.

    If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes has caused the process to fail.
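The three deterministic failure rules above (HDF, PWF, OSF) can be written as explicit checks; TWF and RNF involve random draws and are omitted. A sketch:

```python
# Sketch: the three deterministic AI4I failure-mode rules as explicit checks.
# TWF and RNF are random processes, so they are not encoded here.
import math

OSF_LIMIT = {"L": 11_000, "M": 12_000, "H": 13_000}  # minNm per product variant

def failure_modes(air_t, process_t, speed_rpm, torque_nm, tool_wear_min, variant):
    """Return which deterministic failure modes fire for one data point."""
    # HDF: small air/process temperature gap at low rotational speed.
    hdf = (process_t - air_t) < 8.6 and speed_rpm < 1380
    # PWF: mechanical power = torque * angular velocity (rpm -> rad/s).
    power_w = torque_nm * speed_rpm * 2 * math.pi / 60
    pwf = power_w < 3500 or power_w > 9000
    # OSF: tool wear * torque above the variant-specific limit.
    osf = tool_wear_min * torque_nm > OSF_LIMIT[variant]
    return {"HDF": hdf, "PWF": pwf, "OSF": osf, "failure": hdf or pwf or osf}

print(failure_modes(300.0, 308.0, 1300, 20.0, 100, "L"))
```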

    This dataset is part of the following publication, please cite when using this dataset: S. Matzka, "Explainable Artificial Intelligence for Predictive Maintenance Applications," 2020 Third International Conference on Artificial Intelligence for Industries (AI4I), 2020, pp. 69-74, doi: 10.1109/AI4I49448.2020.00023.

    The image of the milling process is the work of Daniel Smyth @ Pexels: https://www.pexels.com/de-de/foto/industrie-herstellung-maschine-werkzeug-10406128/

  11. UCI ML Parkinsons dataset

    • kaggle.com
    zip
    Updated Jul 8, 2025
    Cite
    Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
    Available download formats: zip (316796 bytes)
    Dataset updated
    Jul 8, 2025
    Authors
    Elnaz Alikarami
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford Parkinson's Disease Detection Dataset, UCI Machine Learning Repository

    Original dataset link: https://archive.ics.uci.edu/dataset/174/parkinsons

    Dataset Characteristics: Multivariate

    Subject Area: Health and Medicine

    Associated Tasks: Classification

    Feature Type: Real

    Instances: 197

    Features: 22

    Additional Information

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

    The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

    Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

    Has Missing Values?

    No

  12. Iris_Data

    • kaggle.com
    Updated May 1, 2025
    + more versions
    Cite
    Aniket Gaikwad (2025). Iris_Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/11634170
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aniket Gaikwad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📄 Dataset Description: Iris Flower Dataset

    The Iris dataset is one of the most famous and widely used datasets in the field of machine learning and pattern recognition. It was first introduced by the British biologist and statistician Ronald A. Fisher in 1936.

    🌸 Dataset Overview

    The dataset consists of 150 samples of iris flowers from three different species:

    • Setosa
    • Versicolor
    • Virginica

    Each sample contains four features (all numeric):

    1. Sepal Length (cm)
    2. Sepal Width (cm)
    3. Petal Length (cm)
    4. Petal Width (cm)

    The target variable is the species of the flower.

    📊 Dataset Characteristics

    • Sepal Length: Length of the sepal in cm (Float)
    • Sepal Width: Width of the sepal in cm (Float)
    • Petal Length: Length of the petal in cm (Float)
    • Petal Width: Width of the petal in cm (Float)
    • Species: Category — one of Setosa, Versicolor, Virginica (String)

    🔍 Applications

    This dataset is commonly used for:

    • Supervised learning (classification)
    • Data visualization and EDA
    • Algorithm comparison (e.g., Logistic Regression, SVM, KNN)
    • Dimensionality reduction (e.g., PCA)

    ✅ Why This Dataset?

    • Small and easy to understand
    • Contains both numeric features and categorical labels
    • Useful for demonstrating classification algorithms and metrics
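A minimal classification workflow of the kind described above, using the copy of Iris that ships with scikit-learn:

```python
# Sketch: the classic Iris classification workflow, using scikit-learn's
# built-in copy of the dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```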
  13. YouTube Likes Prediction AV HackLive

    • kaggle.com
    zip
    Updated Oct 2, 2020
    Cite
    Vishal Gupta (2020). YouTube Likes Prediction AV HackLive [Dataset]. https://www.kaggle.com/datasets/jinxzed/youtube-likes-prediction-av-hacklive/discussion
    Available download formats: zip (21795242 bytes)
    Dataset updated
    Oct 2, 2020
    Authors
    Vishal Gupta
    License

    GNU GPL v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    YouTube
    Description

    Context

    As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has developed into a new type of career in recent decades. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority for YouTubers. Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms. This raised our interest in predicting the performance of a video. If creators can have a preliminary prediction and understanding of their videos' performance, they can adjust their videos to gain the most attention from the public.

    You have been provided details on videos along with some features as well. Can you accurately predict the number of likes for each video using the set of input variables?

    Content

    Train Set

    video_id -> Identifier for each video

    title -> Name of the Video on Youtube

    channel_title -> Name of the Channel on Youtube

    category_id -> Category of the Video (anonymous)

    publish_date -> The date video was published

    tags -> Different tags for the video

    views -> Number of views received by the Video

    dislikes -> Number of dislikes on the Video

    comment_count -> Number of comments on the Video

    description -> Textual description of the Video

    country_code -> Country from which the Video was published

    likes -> Number of Likes on the video
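A baseline sketch for the likes-prediction task: engagement counts are heavy-tailed, so a log transform helps linear models. The data and relationships here are synthetic assumptions:

```python
# Sketch: baseline likes prediction with a log transform on heavy-tailed
# engagement counts. Data and relationships are synthetic assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 1000
views = rng.lognormal(10, 1.5, n)
comments = views * rng.uniform(0.001, 0.01, n)
likes = views * rng.uniform(0.01, 0.05, n)

# Log-transform so the linear model works on multiplicative relationships.
X = np.log1p(np.column_stack([views, comments]))
model = LinearRegression().fit(X, np.log1p(likes))
print(f"R² (log scale): {model.score(X, np.log1p(likes)):.3f}")
```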

    Acknowledgements

    Thank You Analytics Vidhya for providing this dataset.

  14. CIFAKE: Real and AI-Generated Synthetic Images

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jordan J. Bird (2023). CIFAKE: Real and AI-Generated Synthetic Images [Dataset]. https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Jordan J. Bird
    Description

    CIFAKE: Real and AI-Generated Synthetic Images

    The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.

    CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect whether an image is real or has been generated by AI?

    Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Dataset details

    The dataset contains two classes - REAL and FAKE.

    For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset

    For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4

    There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
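
    A minimal baseline for the real-vs-fake question is a linear classifier on flattened pixels. The sketch below uses synthetic stand-ins for 32x32 RGB images (the real dataset's folder layout, and a stronger CNN pipeline, are left out for brevity); the small mean shift between the two classes is an assumption so the baseline has something to learn.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for 32x32 RGB images, one array per class.
    rng = np.random.default_rng(0)
    real = rng.normal(0.45, 0.2, size=(200, 32, 32, 3))  # class 0: REAL
    fake = rng.normal(0.55, 0.2, size=(200, 32, 32, 3))  # class 1: FAKE
    X = np.vstack([real, fake]).reshape(400, -1)          # flatten pixels
    y = np.array([0] * 200 + [1] * 200)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)
    ```

    On the actual CIFAKE images a convolutional network is the more natural choice; this flattened-pixel baseline mainly gives a floor to compare against.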

    Papers with Code

    The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images

    References

    If you use this dataset, you must cite the following sources

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

    Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Real images are from Krizhevsky & Hinton (2009), fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is available here.

    Notes

    The updates to the dataset on the 28th of March 2023 did not change the data itself; files with the ".jpeg" extension were renamed to ".jpg", and the root folder was re-uploaded to meet Kaggle's usability requirements.

    License

    This dataset is published under the same MIT license as CIFAR-10:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  15. Phishing website Detector

    • kaggle.com
    zip
    Updated Feb 28, 2020
    Cite
    Eswar Chand (2020). Phishing website Detector [Dataset]. https://www.kaggle.com/eswarchandt/phishing-website-detector
    Authors
    Eswar Chand
    Description


    The data set is provided as both a text file and a CSV file, which supply the following resources that can be used as inputs for model building:

    1. A collection of URLs for 11,000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).

    2. A code template containing these code blocks: a. Import modules (Part 1) b. Load data function + input/output field descriptions

    The data set also serves as an input for project scoping, helping to specify the functional and non-functional requirements of the project.

    Background of Problem Statement :

    You are expected to write the code for a binary classification model (phishing website or not) using Python scikit-learn that trains on the data and calculates the accuracy score on the test data. You must use one or more classification algorithms to train a model on the phishing website data set.

    Dataset Description:

    1. The ".txt" version of the dataset has no headers and contains only the column values.
    2. The column-wise header is described below; if needed, add the header manually when using the ".txt" file. The ".csv" file already includes the column names.
    3. The header list (column names) is as follows: [ 'UsingIP', 'LongURL', 'ShortURL', 'Symbol@', 'Redirecting//', 'PrefixSuffix-', 'SubDomains', 'HTTPS', 'DomainRegLen', 'Favicon', 'NonStdPort', 'HTTPSDomainURL', 'RequestURL', 'AnchorURL', 'LinksInScriptTags', 'ServerFormHandler', 'InfoEmail', 'AbnormalURL', 'WebsiteForwarding', 'StatusBarCust', 'DisableRightClick', 'UsingPopupWindow', 'IframeRedirection', 'AgeofDomain', 'DNSRecording', 'WebsiteTraffic', 'PageRank', 'GoogleIndex', 'LinksPointingToPage', 'StatsReport', 'class' ]

    Brief description of the features in the data set:

    ● UsingIP (categorical - signed numeric): { -1,1 }
    ● LongURL (categorical - signed numeric): { 1,0,-1 }
    ● ShortURL (categorical - signed numeric): { 1,-1 }
    ● Symbol@ (categorical - signed numeric): { 1,-1 }
    ● Redirecting// (categorical - signed numeric): { -1,1 }
    ● PrefixSuffix- (categorical - signed numeric): { -1,1 }
    ● SubDomains (categorical - signed numeric): { -1,0,1 }
    ● HTTPS (categorical - signed numeric): { -1,1,0 }
    ● DomainRegLen (categorical - signed numeric): { -1,1 }
    ● Favicon (categorical - signed numeric): { 1,-1 }
    ● NonStdPort (categorical - signed numeric): { 1,-1 }
    ● HTTPSDomainURL (categorical - signed numeric): { -1,1 }
    ● RequestURL (categorical - signed numeric): { 1,-1 }
    ● AnchorURL (categorical - signed numeric):

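
    The requested scikit-learn workflow might look like the sketch below. Since the file is not bundled here, a synthetic frame with the same signed categorical coding ({-1, 0, 1}) and a toy labelling rule stands in for the real data; with the real file you would instead load the CSV and split off the 'class' column.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in using the signed categorical coding described above;
    # with the real CSV: df = pd.read_csv(...), X = df.drop(columns="class").
    rng = np.random.default_rng(0)
    n = 1000
    X = pd.DataFrame({
        "UsingIP": rng.choice([-1, 1], n),
        "LongURL": rng.choice([-1, 0, 1], n),
        "HTTPS": rng.choice([-1, 0, 1], n),
        "SubDomains": rng.choice([-1, 0, 1], n),
    })
    # Toy labelling rule: no HTTPS plus an IP-based URL leans phishing (-1).
    y = np.where(X["HTTPS"] + X["UsingIP"] < 0, -1, 1)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    ```

    Because all 30 real features are already numeric categorical codes, no encoding step is needed before fitting a tree-based model like this.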
  16. Insurance Premium Data

    • kaggle.com
    zip
    Updated Apr 22, 2021
    Cite
    Prachi Gopalani (2021). Insurance Premium Data [Dataset]. https://www.kaggle.com/datasets/prachi13/insurance13m-persistency/discussion?sort=undefined
    Authors
    Prachi Gopalani
    Description

    1. Problem Description:

    • Prepare a machine learning model to predict the Persistency 13M payment behaviour at the New Business stage.

    2. Objective:

    • Using machine learning techniques, provide a score for each policy at the New Business stage for its likelihood of paying the 13M premium.
    • Identify the segments where the maximum number of non-payers is captured.

    3. Dataset:

    • “Training” and “Test” datasets with the raw input attributes and the actual 13M paid/not-paid flag.
    • “Out of Time” datasets are provided with just the raw input attributes.

    4. Expected Steps:

      1. Conduct appropriate data treatments, e.g. missing-value imputation, outlier treatment, etc.
      2. Conduct the required feature engineering, e.g. binning, ratio, interaction, and polynomial features.
      3. Use any machine learning algorithm, or combination of algorithms, you deem fit.
      4. Prepare your model on the train data; you can evaluate its generalization capability using K-Fold Cross Validation, Leave-One-Out Cross Validation, or any other validation technique that you see appropriate.
      5. Score the Test and Out of Time data and share it back to us along with the scored Train data for evaluation. Also share all the model code and documentation.
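
    The cross-validation step above can be sketched as follows. The real persistency attributes are not public, so a synthetic binary task from make_classification (with an assumed 70/30 class split standing in for the paid/not-paid flag) illustrates only the validation mechanics.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in: a 70/30 binary task replaces the real paid/not-paid flag.
    X, y = make_classification(n_samples=600, n_features=10,
                               weights=[0.7, 0.3], random_state=0)

    # Stratified folds preserve the class ratio in every train/validation split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc = scores.mean()
    ```

    ROC AUC is a reasonable scoring choice here because the task asks for likelihood scores per policy rather than hard labels.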
  17. E-Commerce Fraud Detection Dataset

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    UmutUygurr (2025). E-Commerce Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/e-commerce-fraud-detection-dataset
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌍 Storyline: The Digital Bazaar

    In 2024, e-commerce platforms across Istanbul, Berlin, New York, London, and Paris began noticing strange transaction bursts. Some cards were tested with $1 purchases at midnight. Others shipped “gaming accessories” 5,000 km away. Promo codes were being reused by freshly created accounts.

    To investigate these global patterns safely, this synthetic dataset recreates realistic fraud behavior across countries, channels, and user profiles — allowing anyone to build, test, and compare fraud-detection models without exposing any real user data.

    💡 What makes it special

    🧍‍♀️ 6,000 unique users performing ≈300,000 transactions

    💳 Multiple transactions per user (40–60) → enables behavioral analysis

    🧩 Strong feature correlations — not random noise

    🌐 Cross-country dynamics (country, bin_country)

    💸 Natural imbalance (~2% fraud) just like real financial systems

    🕓 Time realism — night-time fraud spikes, daily rhythms

    🧠 Feature explainability — easy to visualize, model, and interpret
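
    The ~2% fraud rate is the main modeling difficulty. One common way to handle it, sketched here on synthetic data with the same imbalance (the real columns are not reproduced), is to reweight the rare class during fitting:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic task mirroring the ~2% fraud rate described above.
    X, y = make_classification(n_samples=5000, n_features=8,
                               weights=[0.98, 0.02], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    # class_weight="balanced" upweights the rare fraud class during fitting,
    # trading some precision for better recall on the minority class.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
    fraud_recall = recall_score(y_te, clf.predict(X_te))
    ```

    With imbalance like this, accuracy is a misleading metric (predicting "not fraud" everywhere already scores ~98%); recall, precision, and PR-AUC on the fraud class are more informative.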

  18. Best scored model

    • kaggle.com
    zip
    Updated Jan 23, 2021
    Cite
    Abid Ali Awan (2021). Best scored model [Dataset]. https://www.kaggle.com/kingabzpro/best-scored-model
    Authors
    Abid Ali Awan
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Context

    Hurricanes can cause upwards of 1,000 deaths and $50 billion in damages in a single event, and have been responsible for well over 160,000 deaths globally in recent history. During a tropical cyclone, humanitarian response efforts hinge on accurate risk approximation models that depend on wind speed measurements at different points in time throughout a storm’s life cycle.

    For several decades, forecasters have relied on visual pattern recognition of complex cloud features in visible and infrared imagery. While the longevity of this technique indicates the strong relationship between spatial patterns and cyclone intensity, visual inspection is manual, subjective, and often leads to inconsistent estimates between even well-trained analysts.

  19. Consumer Defensive Stock Predictions

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Alden Lin (2025). Consumer Defensive Stock Predictions [Dataset]. https://www.kaggle.com/datasets/aldenlin/consumer-defensive-stock-predictions
    Authors
    Alden Lin
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset

    Why This Dataset Matters

    This dataset is created for Kaggle users who want to explore, experiment, and innovate in the field of stock prediction and financial machine learning. It bridges real-world financial data with practical modeling challenges, enabling you to build, test, and showcase predictive models that simulate professional-grade investment analysis.

    Whether you're a beginner exploring quantitative finance or an experienced data scientist refining predictive strategies, this dataset offers a rich playground for uncovering insights and improving model performance.

    What Makes This Dataset Unique

    • ✅ Combines fundamentals + analyst sentiment + historical price data
    • ✅ Designed specifically for prediction tasks, not just visualization
    • ✅ Feature-ready structure to shorten your data preprocessing time
    • ✅ Ideal for classification problems (e.g. predicting 20% gain within 6 months)
    • ✅ Suitable for both EDA and end-to-end ML pipelines

    This dataset allows contributors to focus on what matters most: building impactful models and sharing innovative approaches with the Kaggle community.

    Data Sources

    Data is aggregated from widely-used financial intelligence platforms:

    • FinancialModelingPrep (FMP) Company profiles, financial fundamentals, financial ratios, S&P 500 performance indicators, and analyst ratings.

    • Alpha Vantage Daily historical OHLC and adjusted close stock price data.

    These sources ensure high relevance, broad coverage, and strong analytical value for market-based modeling.

    Dataset Structure

    The dataset is organized for intuitive exploration and modeling, with each record structured by stock ticker and time period. Feature categories include:

    • Company metadata: sector, industry, market capitalization
    • Fundamental indicators: valuation ratios, profitability, revenue growth
    • Analyst sentiment: ratings and consensus measures
    • Price behavior: OHLC and adjusted close data
    • Engineered predictors: derived metrics to improve model accuracy
    • Target variable: stock performance outcome (e.g. achieving a defined % gain within a future horizon)

    This design supports: ✔ Binary classification ✔ Regression modeling ✔ Time-series experimentation
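
    The target variable described above (e.g. a 20% gain within a ~6-month horizon) could be constructed as in this sketch. The column names ticker and adj_close and the 126-trading-day horizon are assumptions, and the deterministic toy price series is for illustration only.

    ```python
    import numpy as np
    import pandas as pd

    # Toy price series: one ticker growing 0.2% per trading day.
    prices = pd.DataFrame({
        "ticker": ["KO"] * 300,
        "adj_close": [50 * 1.002 ** i for i in range(300)],
    })

    horizon = 126  # ~6 months of trading days (assumed)
    # Future price `horizon` rows ahead, computed per ticker.
    future = prices.groupby("ticker")["adj_close"].shift(-horizon)
    ratio = future / prices["adj_close"]

    prices["target"] = (ratio >= 1.20).astype(float)  # 1.0 if >= 20% gain
    prices.loc[ratio.isna(), "target"] = np.nan       # last rows have no label
    ```

    Leaving the final `horizon` rows unlabeled (rather than filling them) avoids look-ahead leakage when the frame is later split for training.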

    Ideal For Kaggle Projects Like

    • 📈 "Can fundamentals predict market winners?"
    • 🤖 Stock prediction ML competitions
    • 🧮 Feature importance & model explainability studies
    • 📊 Financial dashboard prototypes
    • 🧠 Algorithm comparison challenges

    Data Processing Pipeline

    To ensure usability and consistency, the data underwent:

    • Cleaning and removal of anomalous values
    • Standardization of formats and units
    • Multi-source alignment and validation
    • Feature transformation for predictive readiness

    This ensures a smooth experience whether you're performing quick EDA or building production-ready models.

    Community Invitation

    You are encouraged to:

    • 🚀 Build and share predictive notebooks
    • 💬 Discuss modeling strategies
    • 🔍 Explore novel feature engineering ideas
    • ⭐ Fork, upvote, and contribute improvements

    If you find this dataset valuable, feel free to follow for future releases and updates as this project evolves with enhanced features, expanded stock coverage, and refined modeling strategies.

    Disclaimer

    This dataset is intended for research and educational purposes only and does not constitute financial or investment advice. Market conditions and external factors may significantly influence real-world outcomes.

    ✨ This dataset is part of an ongoing effort to build a transparent, reusable, and community-driven resource for advancing financial machine learning on Kaggle.

  20. Adverse Drug Effects (ADE) Detection

    • kaggle.com
    zip
    Updated Oct 8, 2025
    Cite
    Sai Kiran Udayana (2025). Adverse Drug Effects (ADE) Detection [Dataset]. https://www.kaggle.com/datasets/saikiranudayana/adverse-drug-effects-ade-detection
    Authors
    Sai Kiran Udayana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    💉 COVID-19 Vaccine Adverse Events (2020-2025): VAERS Real-World Surveillance Data

    This dataset offers a critical, large-scale look into the real-world safety surveillance of COVID-19 vaccines, sourced from the Vaccine Adverse Event Reporting System (VAERS). Maintained by the CDC and FDA, this collection spans the unprecedented period of mass vaccination from 2020 through 2025, providing an invaluable resource for pharmacovigilance, public health research, and regulatory decision-making.

    Key Features & Challenge

    The dataset is a rich blend of structured and unstructured information detailing reported Adverse Drug Events (ADEs), which range from mild local reactions to severe, life-threatening complications.

    Structured Data: Includes standardized symptom codes, offering a direct, quantitative view of reported reactions.

    Free-Text Notes: Contains verbose, real-world symptom descriptions provided by reporters. This text is a "treasure trove" of granular context, including details on duration, intensity, and location of symptoms.

    The Challenge: The structured entries are limited in scope. The free-text notes, while rich, are inherently noisy and lack standardized metadata such as clinical severity scores or age-specific pattern normalization.

    Value to Data Scientists

    This dataset presents a significant Natural Language Processing (NLP) and Machine Learning (ML) challenge:

    Extracting Context: Develop models to effectively extract critical clinical context (e.g., "headache lasting three days, severe") from the raw, non-standardized free-text notes.

    Standardizing Severity: Create predictive models to assign standardized severity and age-specific risk patterns to ADEs.

    Informed Decision Making: The ultimate goal is to generate actionable, timely insights for regulators, healthcare providers, and pharmaceutical companies, improving both vaccine safety monitoring and public trust.

    Dive into this dataset to apply your skills in advanced data cleaning, feature engineering, and state-of-the-art NLP to solve a crucial, high-impact public health challenge.
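
    Extracting context such as duration and severity from the free-text notes can start with simple pattern matching before moving to heavier NLP. The severity vocabulary and duration regex below are illustrative assumptions, not a clinical standard.

    ```python
    import re

    # Duration phrases like "lasting three days" or "for two weeks";
    # longer alternatives listed first so "days" is not cut to "day".
    DURATION = re.compile(
        r"(?:lasting|for)\s+(\w+)\s+(days|day|weeks|week|hours|hour)")

    def extract(note: str) -> dict:
        """Pull a coarse severity cue and a duration phrase from a free-text note."""
        note_l = note.lower()
        severity = next(
            (s for s in ("severe", "moderate", "mild") if s in note_l), None)
        m = DURATION.search(note_l)
        duration = f"{m.group(1)} {m.group(2)}" if m else None
        return {"severity": severity, "duration": duration}

    result = extract("Headache lasting three days, severe; mild fever.")
    ```

    A rule-based pass like this gives a cheap baseline and labeled seeds; named-entity models fine-tuned on clinical text would be the natural next step for the noisier notes.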
