20 datasets found
  1. Breast Cancer Exploratory Data Analysis EDA

    • kaggle.com
    Cite
    Dr. Nagendra (2025). Breast Cancer Exploratory Data Analysis EDA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/breast-cancer-exploratory-data-analysis-eda
    Explore at:
    Available download formats: zip (7609364 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains clinical and diagnostic features related to Breast Cancer, designed for comprehensive Exploratory Data Analysis (EDA) and subsequent predictive modeling.

    It is derived from digitized images of Fine Needle Aspirates (FNA) of breast masses.

    The dataset features quantitative measurements, typically calculated from the characteristics of cell nuclei, including:

    • Radius
    • Texture
    • Perimeter
    • Area
    • Smoothness
    • Compactness
    • Concavity
    • Concave Points
    • Symmetry
    • Fractal Dimension

    These features are provided as mean, standard error, and "worst" (largest) values.

    The primary goal of this resource is to support the validation of EDA techniques necessary for clinical data science:

    • Data quality assessment (missing values, inconsistencies).
    • Feature assessment (distributions, correlations).
    • Visualization for diagnostic modeling.

    The primary target variable is the binary classification of the tissue sample: Malignant vs. Benign.
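    A minimal Python sketch of this EDA workflow; the file name and the column names (diagnosis, *_mean) are assumptions based on the classic WDBC layout, so adjust them to the actual download:

    import pandas as pd

    # Load the dataset (file name assumed)
    df = pd.read_csv("breast_cancer.csv")

    # Data quality assessment: missing values and inconsistencies
    print(df.isna().sum())
    print(df.dtypes)

    # Feature assessment: distributions and correlations of the mean-value features
    mean_cols = [c for c in df.columns if c.endswith("_mean")]  # assumed naming
    print(df[mean_cols].describe())
    print(df[mean_cols].corr())

    # Target balance for Malignant vs. Benign (column name assumed)
    print(df["diagnosis"].value_counts())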

  2. Stock Price EDA(Time Series Analysis)

    • kaggle.com
    Cite
    RITIK MAHESHWARI (2021). Stock Price EDA(Time Series Analysis) [Dataset]. https://www.kaggle.com/ritikmaheshwari/stock-price-edatime-series-analysis
    Explore at:
    Available download formats: zip (11875814 bytes)
    Dataset updated
    May 4, 2021
    Authors
    RITIK MAHESHWARI
    Description


    Inspiration


    Suggested analyses:

    • Analyze the closing price of all the stocks.
    • Analyze the total volume of stock traded each day.
    • Analyze the daily price change in each stock.
    • Analyze the monthly mean of the close feature.
    • Analyze whether the stock prices of these tech companies are correlated.
    • Analyze the daily return of each stock and how the returns are correlated.
    • Perform a value-at-risk analysis for the tech companies.
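    A short Pandas sketch of the return, correlation, and value-at-risk tasks above; the ticker list, file names, and column layout are assumptions to adapt to the actual files:

    import pandas as pd

    # Hypothetical tickers; one CSV per ticker with Date and Close columns assumed
    tickers = ["AAPL", "AMZN", "GOOG", "MSFT"]
    closes = pd.DataFrame({
        t: pd.read_csv(f"{t}.csv", parse_dates=["Date"], index_col="Date")["Close"]
        for t in tickers
    })

    # Daily returns and their cross-correlations
    daily_returns = closes.pct_change().dropna()
    print(daily_returns.corr())

    # One-day 95% historical Value at Risk per stock
    print(daily_returns.quantile(0.05))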

  3. BI intro to data cleaning eda and machine learning

    • kaggle.com
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Explore at:
    Available download formats: zip (9961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    | Column | Description |
    | --- | --- |
    | fNAME, lNAME | Student first and last name |
    | Age | Student age (21–71 years) |
    | gender | Gender (standardized as "Male"/"Female") |
    | country | Student’s country of origin |
    | residence | Student housing/residence type |
    | entryEXAM | Entry test score (28–98) |
    | prevEducation | Prior education (High School, Diploma, etc.) |
    | studyHOURS | Total study hours logged |
    | Python | Final Python exam score |
    | DB | Final Database exam score |

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:

    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification
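    A minimal cleaning sketch for these steps, assuming the column names from the overview table and the example mappings listed above (fused variants such as "BIResidence" may need explicit mappings):

    import pandas as pd

    df = pd.read_csv("bi.csv")

    # String standardization (mappings taken from the examples above)
    df["country"] = df["country"].replace({"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
    df["residence"] = df["residence"].str.replace(r"[-_]", " ", regex=True).str.strip()
    df["gender"] = df["gender"].str.strip().str.lower().map(
        {"m": "Male", "male": "Male", "f": "Female", "female": "Female"})

    # Type correction and null value imputation for the Python scores
    df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
    df["Python"] = df["Python"].fillna(df["Python"].mean())

    # Validation
    print(df["country"].unique())
    print(df.isna().sum())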

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

  4. SAP Historical Stock Prices Dataset

    • kaggle.com
    Cite
    Umair Zia (2024). SAP Historical Stock Prices Dataset [Dataset]. https://www.kaggle.com/datasets/stealthtechnologies/sap-historical-stock-prices-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Umair Zia
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ABOUT SAP:

    Headquartered in Walldorf, Germany, SAP is the market leader in enterprise application software. Founded in 1972, SAP (which stands for "Systems, Applications, and Products in Data Processing") has a rich history of innovation and growth as a true industry leader.


    ABOUT DATASET:

    This dataset contains detailed historical stock price data for SAP, covering the period from 09/22/1995 to 06/14/2024. The data is collected from Yahoo Finance and includes daily records of the stock's opening price, highest price, lowest price, closing price, and trading volume. Each entry in the dataset represents a single trading day, providing a comprehensive view of the stock's price movements and market activity.

    PURPOSE OF DATASET

    The purpose of this dataset is to provide analysts, traders, and researchers with accurate and granular historical stock price data for SAP. This data can be used for various applications, including:

    • Technical Analysis: Identify trends and patterns in the stock's price movements. Calculate technical indicators such as moving averages, RSI, and Bollinger Bands.

    • Market Sentiment Analysis: Analyze how the stock's price responds to market events and news. Compare the opening and closing prices to understand daily sentiment.

    • Algorithmic Trading: Develop and test trading algorithms based on historical price and volume data. Use past price movements to simulate trading strategies.

    • Predictive Modeling: Build models to forecast future prices and trading volumes. Use historical data to identify potential price movements and market trends.

    • Educational Purposes: Serve as a teaching tool for financial education. Help students and researchers understand the dynamics of stock price changes and market behavior.
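    As a hedged illustration of the technical-analysis use case above, here is a short Pandas sketch computing moving averages and a 14-day RSI; the file name and column names (Date, Close) are assumptions based on the Yahoo Finance layout:

    import pandas as pd

    df = pd.read_csv("sap_stock.csv", parse_dates=["Date"])
    df = df.sort_values("Date").set_index("Date")

    # Simple moving averages for trend analysis
    df["SMA_20"] = df["Close"].rolling(20).mean()
    df["SMA_50"] = df["Close"].rolling(50).mean()

    # 14-day RSI from daily close-to-close changes
    delta = df["Close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["RSI_14"] = 100 - 100 / (1 + gain / loss)

    print(df[["Close", "SMA_20", "SMA_50", "RSI_14"]].tail())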

    This dataset offers a solid foundation for a wide range of financial analyses and trading applications.

  5. Walmart Stocks Data 2025

    • kaggle.com
    Cite
    Mehar Shan Ali (2025). Walmart Stocks Data 2025 [Dataset]. https://www.kaggle.com/meharshanali/walmart-stocks-data-2025
    Explore at:
    Available download formats: zip (467062 bytes)
    Dataset updated
    Feb 23, 2025
    Authors
    Mehar Shan Ali
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Walmart Stock Price Dataset & Exploratory Data Analysis (EDA)

    🏢 About Walmart

    Walmart Inc. is a multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is one of the world's largest companies by revenue and a key player in the retail sector. Walmart's stock is actively traded on major stock exchanges, making it an interesting subject for financial analysis.

    📌 Dataset Overview

    This dataset contains historical stock price data for Walmart, sourced directly from Yahoo Finance using the yfinance Python API. The data covers daily stock prices and includes multiple key financial indicators.

    📊 Features Included in the Dataset

    • Date 📅 – The trading day recorded.
    • Open Price 🟢 – Price at market open.
    • High Price 🔼 – Highest price of the day.
    • Low Price 🔽 – Lowest price of the day.
    • Close Price 🔴 – Price at market close.
    • Adjusted Close Price 📉 – Closing price adjusted for splits & dividends.
    • Trading Volume 📈 – Total shares traded.
    • Dividends 💰 – Cash payments to shareholders.
    • Stock Splits 🔄 – Records stock split events.

    🔍 Exploratory Data Analysis (EDA) Steps

    This notebook performs an extensive EDA to uncover insights into Walmart's stock price trends, volatility, and overall behavior in the stock market. The following analysis steps are included:

    1️⃣ Data Preprocessing & Cleaning

    • Load data using Pandas
    • Handle missing values (if any)
    • Check data types and format them properly
    • Convert date column into a datetime format

    2️⃣ Descriptive Statistics & Summary

    • Calculate key statistical measures like mean, median, standard deviation, and interquartile range (IQR)
    • Identify stock price trends over time
    • Check data distribution and skewness

    3️⃣ Data Visualizations

    • 📉 Line Plot – Analyze trends in closing prices over time.
    • 📦 Box Plot – Detect potential outliers in stock prices.
    • 📊 Histogram – Understand the distribution of closing prices.
    • 📈 Moving Averages – Use short-term and long-term moving averages to observe stock trends.
    • 🔥 Correlation Heatmap – Find relationships between stock market indicators.

    4️⃣ Time Series Analysis

    • Identify trends and seasonality in the stock price data.
    • Calculate daily, weekly, and monthly returns.
    • Use rolling windows to analyze moving averages and volatility.
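    A brief sketch of the returns, rolling-volatility, and monthly-aggregation steps above; the file name and column names are assumptions based on the yfinance export described:

    import pandas as pd

    df = pd.read_csv("walmart_stock.csv", parse_dates=["Date"], index_col="Date")

    # Daily returns and a 30-day rolling volatility estimate
    df["Return"] = df["Close"].pct_change()
    df["Volatility_30d"] = df["Return"].rolling(30).std()

    # Monthly mean closing price, useful for spotting seasonality
    monthly_close = df["Close"].groupby(df.index.to_period("M")).mean()
    print(monthly_close.tail())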

    5️⃣ Insights & Conclusions

    • How volatile is Walmart’s stock over the given period?
    • Does the stock exhibit strong uptrends or downtrends?
    • Are there any strong correlations between features?
    • What insights can be drawn for investors and traders?

    🚀 Use Cases & Applications

    This dataset and analysis can be useful for:

    • 📡 Stock Market Analysis – Evaluating Walmart’s stock price trends and volatility.
    • 🏦 Investment Research – Assisting traders and investors in making informed decisions.
    • 🎓 Educational Purposes – Teaching data science and financial analysis using real-world stock data.
    • 📊 Algorithmic Trading – Developing trading strategies based on historical stock price trends.

    📥 Download the dataset and explore Walmart’s stock performance today! 🚀

  6. Real Madrid UEFA Champions League Perform Analysis

    • kaggle.com
    Cite
    Joaco Romero Flores (2023). Real Madrid UEFA Champions League Perform Analysis [Dataset]. https://www.kaggle.com/datasets/joaquinaromerof/real-madrid-analysis
    Explore at:
    Available download formats: zip (32668239 bytes)
    Dataset updated
    Aug 26, 2023
    Authors
    Joaco Romero Flores
    License

    https://cdla.io/permissive-1-0/

    Description

    Introduction

    In the high-stakes world of professional football, public opinion often forms around emotions, loyalties, and subjective interpretations. The project at hand aims to transcend these biases by delving into a robust, data-driven analysis of Real Madrid's performance in the UEFA Champions League over the past decade.

    Through a blend of traditional statistical methods, machine learning models, game theory, psychology, philosophy, and even military strategies, this investigation presents a multifaceted view of what contributes to a football team's success and how performance can be objectively evaluated.

    Exploratory Data Analysis (EDA)

    The EDA consists of two layers:

    1. Statistical Analysis:

    • Set-Up Process: Loading libraries, data frames, determining position relevancy, and calculating average minutes played.
    • Kurtosis: Understanding data variance and its internal behavior.
    • Feature Engineering: Preprocessing with standard scaler for later ML applications.
    • Sample Statistics, Distribution, and Standard Errors: Essential for inference.
    • Central Limit Theorem: A focus for understanding by experienced data scientists.
    • A/B Testing & ANOVA: Used for null hypothesis testing.

    2. Machine Learning Models:

    • Ordinary Least Squares: To estimate the unknown parameters.
    • Linear Regression Models with Sci-Kit Learn: Predicting the dependent variable.
    • XGBoost & Cross-Validation: A powerful algorithm for making predictions.
    • Conformal Prediction: To create valid prediction regions.
    • Radar Maps: For visualizing player performance during their match campaigns.
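    A minimal sketch of the scaling, linear regression, and cross-validation steps named above; the file, feature, and target names are hypothetical placeholders, not the project's actual variables:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("real_madrid_players.csv")

    X = df[["minutes_played", "passes_completed", "duels_won"]]  # hypothetical features
    y = df["rating"]                                             # hypothetical target

    # Standard-scaler preprocessing, then a cross-validated linear model
    X_scaled = StandardScaler().fit_transform(X)
    scores = cross_val_score(LinearRegression(), X_scaled, y, cv=5, scoring="r2")
    print(scores.mean())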

    Objectives

    The goal of this analysis is multifaceted:

    1. Unveil Hidden Statistics: To reveal the underlying patterns often overlooked in casual discussions.
    2. Demonstrate the Impact of Probability: How it shapes matches and seasons.
    3. Explore Interdisciplinary Influences: Including Game Theory, Strategy, Cooperation, Psychology, Physiology, Military Training, Luck, Economics, Philosophy, and even Freudian Analysis.
    4. Challenge Subjective Bias: By presenting a well-rounded, evidence-based view of football performance.

    Conclusion

    This project stands as a testament to the profound complexity of football performance and the nuanced insights that can be derived through rigorous scientific analysis. Whether a data scientist recruiter, football fanatic, or curious mind, the findings herein offer a unique perspective that bridges the gap between passion and empiricism.

  7. India's Fast Delivery Agents Reviews and Ratings

    • kaggle.com
    Cite
    Kanak Baghel (2025). India's Fast Delivery Agents Reviews and Ratings [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/indias-fast-delivery-agents-reviews-and-ratings
    Explore at:
    Available download formats: zip (176771 bytes)
    Dataset updated
    May 5, 2025
    Authors
    Kanak Baghel
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    India
    Description

    1.1 Industry Landscape of Fast Delivery Services in India

    India’s fast delivery ecosystem is characterized by intense competition among multiple players offering expedited grocery and food delivery services with promised delivery windows as low as 10 to 30 minutes. Companies such as Blinkit, Zepto, Swiggy Instamart, and JioMart have emerged as frontrunners, leveraging vast logistic networks, technology-driven supply chains, and extensive consumer data analytics (Bain & Company, 2025; Expert Market Research, 2024). The sector’s growth trajectory is robust, with the online food delivery market alone valued at USD 48.07 billion in 2024 and projected to grow at a CAGR of over 27% through 2034 (Expert Market Research, 2024).

    1.2 Importance of Customer Ratings and Reviews

    Customer reviews and ratings provide granular feedback on delivery agents’ punctuality, professionalism, order accuracy, and communication. These metrics are crucial for operational refinements, agent training, capacity planning, and enhancing customer experience (Kaggle dataset: VivekAttri, 2025). Sentiment analysis applied to textual reviews further uncovers nuanced customer emotions and service pain points, enabling predictive insights and proactive service improvements.

    1.3 Dataset Overview

    The focal dataset includes structured customer reviews and numerical ratings collected for fast delivery agents across India’s leading quick-commerce platforms. Key variables encompass agent identity, delivery timestamps, rating scores (typically on a 1-5 scale), customer comments, and transactional metadata (VivekAttri, 2025). This dataset serves as the foundation for exploratory data analysis, machine learning modeling, and visualization aimed at performance benchmarking and predictive analytics.

    2. Data Handling and Preprocessing Methodologies

    2.1 Data Acquisition and Integration

    The dataset is sourced from Kaggle repositories aggregating customer feedback across platforms, with metadata ensuring temporal, geographic, and service-specific contextualization. Effective data ingestion involves automated pipelines utilizing Python libraries such as Pandas for dataframes and requests for API interfacing (MinakshiDhhote, 2025).

    2.2 Data Cleaning and Normalization

    Critical preprocessing steps include:

    • Removal of Redundant and Irrelevant Columns: Columns unrelated to delivery agent performance (e.g., user identifiers when anonymized) are discarded to streamline analysis.

    • Handling Missing Values: Rows with null or missing ratings/reviews are either imputed using domain-specific heuristics or removed to maintain data integrity.

    • Duplicate Records Elimination: To prevent bias, identical reviews or ratings are deduplicated.

    • Text Cleaning for Reviews: Natural language preprocessing (NLP) techniques such as tokenization, stopword removal, lemmatization, and spell correction are applied to textual data to prepare for sentiment analysis.

    • Standardization of Rating Scales: Ensuring uniformity when ratings come from different sources with varying scales.

    2.3 Feature Engineering

    Derived features enhance modeling capabilities:

    • Sentiment Scores: Using models like VADER or BERT-based classifiers to convert textual reviews into quantifiable sentiment metrics.

    • Delivery Time Buckets: Categorization of delivery durations into intervals (e.g., under 15 minutes, 15-30 minutes) to analyze performance impact.

    • Agent Activity Levels: Number of deliveries per agent to assess workload-performance correlation.

    • Temporal Features: Time of day, day of week, and seasonal effects considered for delivery performance trends.
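    A short sketch of the sentiment scoring and delivery-time bucketing described above, using NLTK's VADER implementation; the file and column names (review_text, delivery_minutes) are assumptions:

    import nltk
    import pandas as pd
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")
    df = pd.read_csv("delivery_reviews.csv")

    # Sentiment scores from review text
    sia = SentimentIntensityAnalyzer()
    df["sentiment"] = df["review_text"].fillna("").map(
        lambda t: sia.polarity_scores(t)["compound"])

    # Delivery time buckets, per the intervals above
    df["delivery_bucket"] = pd.cut(df["delivery_minutes"],
                                   bins=[0, 15, 30, 60],
                                   labels=["<15 min", "15-30 min", "30-60 min"])
    print(df.groupby("delivery_bucket", observed=True)["sentiment"].mean())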

    3. Exploratory Data Analysis (EDA) and Visualization

    3.1 Rating Distribution and Statistical Summary

    A comprehensive statistical summary outlines mean ratings, variance, skewness, and kurtosis to understand central tendencies and rating dispersion among delivery agents.

    Table 1: Rating Summary Statistics for Delivery Agents (2025 Dataset Sample)

    | Metric | Value |
    | --- | --- |
    | Mean Rating | 3.8 ± 0.15 |
    | Median Rating | 4.0 |
    | Standard Deviation | 0.75 |
    | Skewness | -0.45 |
    | Kurtosis | 2.1 |
    | Number of Ratings | 250,000+ |

    Data validated with 95% confidence interval from Kaggle 2025 dataset (VivekAttri, 2025).

    3.2 Geographical and Platform-Based Ratings Comparison

    Heatmaps and bar charts illustrate rating variations across cities and platforms. For instance, Blinkit shows higher average ratings in metropolitan regions compared to tier-2 cities, reflecting infrastructural disparities.

    3.3 Service Attributes and Rating Correlations

    Scatter plots and corr...

  8. Tradyflow - Options Trading!

    • kaggle.com
    Cite
    Muhammad Anas (2022). Tradyflow - Options Trading! [Dataset]. https://www.kaggle.com/datasets/muhammadanas0716/tradyflow-options-trading/data
    Explore at:
    Available download formats: zip (208306 bytes)
    Dataset updated
    Jun 24, 2022
    Authors
    Muhammad Anas
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset was obtained from tradytics.com on 21st June 2022. By the time you get to see the code, this dataset will no longer be available on the website; you can only access it on my GitHub.

    What is the Dataset?

    This dataset is the options flow of the stock market on 17th June 2022. It contains many tickers, making it an excellent dataset for practicing time series analysis and testing your data science skills.

    What Do the Rows Stand For?

    Time - The time when this ticker was caught in the flow.

    Sym - The ticker symbol, e.g. AAPL, TSLA, SPY.

    C/P - Call or Put trade.

    Exp - The expiration date of the contract.

    Str - The strike price.

    Spot - The stock price at the moment the flow was reported.

    Bidask - The bid/ask quote of the contract.

    Orders - The total number of orders for the contract.

    Volume - The number of shares traded at the moment this contract was caught.

    Premiums - The total money spent on this contract.

    Open Interest - The total number of open contracts at the moment this contract was caught.

    Diff % - The % difference between the spot and strike prices.

    ITM - Whether the contract was a win or a loss: 0 is LOSS, 1 is WIN.
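    A quick sketch of how this flow could be aggregated with Pandas; the file name is assumed, and Premiums is assumed to already be numeric:

    import pandas as pd

    df = pd.read_csv("tradyflow.csv")

    # Where is the money flowing? Total premiums per ticker
    print(df.groupby("Sym")["Premiums"].sum().nlargest(10))

    # Call vs. put premium totals as a crude sentiment gauge
    print(df.groupby("C/P")["Premiums"].sum())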

    **Not financial advice.**

    This is an amazing dataset for beginners, or for coders refreshing their data science skills; no harm if professionals use it either. You can do so much with it, maybe even build a stock market bot (though that is done at your own risk). Enjoy, and share your code!

  9. House Price Regression Dataset

    • kaggle.com
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Explore at:
    Available download formats: zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
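    A minimal regression baseline along these lines, using the column names from the feature list above (the file name is assumed):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("house_prices.csv")

    X = df[["Square_Footage", "Num_Bedrooms", "Num_Bathrooms", "Year_Built",
            "Lot_Size", "Garage_Size", "Neighborhood_Quality"]]
    y = df["House_Price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    pred = model.predict(X_test)
    print("R^2:", r2_score(y_test, pred))
    print("MAE:", mean_absolute_error(y_test, pred))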

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.

  10. Stock Market: Historical Data of Top 10 Companies

    • kaggle.com
    Cite
    Khushi Pitroda (2023). Stock Market: Historical Data of Top 10 Companies [Dataset]. https://www.kaggle.com/datasets/khushipitroda/stock-market-historical-data-of-top-10-companies
    Explore at:
    Available download formats: zip (486977 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    Khushi Pitroda
    Description

    The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.

    Data Analysis Tasks:

    1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.

    2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.

    3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.

    4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.

    5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
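    A short sketch of the volatility task, assuming a long-format file with Date, Company, and Close columns (the names are assumptions):

    import pandas as pd

    df = pd.read_csv("stocks.csv", parse_dates=["Date"])

    # Annualized volatility per company from daily close-to-close returns
    returns = (df.pivot(index="Date", columns="Company", values="Close")
                 .sort_index()
                 .pct_change())
    volatility = returns.std() * (252 ** 0.5)
    print(volatility.sort_values(ascending=False))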

    Machine Learning Tasks:

    1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).

    2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).

    3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.

    4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.

    5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.

    The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.

    It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.

    This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.

    By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.

    Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.

    In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.

  11. Premier League Statistics from 2015 to 2023

    • kaggle.com
    Cite
    Ghaith Mechi (2024). Premier League Statistics from 2015 to 2023 [Dataset]. https://www.kaggle.com/datasets/ghaithmechi/premier-league-statistics-from-2015-to-2023
    Explore at:
    Available download formats: zip (7326 bytes)
    Dataset updated
    Jan 16, 2024
    Authors
    Ghaith Mechi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    I've created this dataset using the project Premier League Statistics Scraping. It contains the statistics of Premier League matches from 2015 to 2023. You can use the data for some EDA or to predict this year's winner. So, enjoy the data!

    Here are some additional details about the features (columns):
    1. members: the number of players.
    2. foreign_players: the number of foreign players in the team.
    3. mean_age: the mean age of all players.
    4. salaries: monthly salary charge.
    5. spending: transfer expenditure.
    6. MOY: average player rating.
    7. rank: the rank of the team in the season.
    8. points: points gained in the season.
    9. BP: goals scored.
    10. BC: goals against.
    11. DIF: goal difference (BP − BC).
    12. Gain: the number of wins.
    13. Null: the number of draws.
    14. defeat: the number of losses.

    For further information, visit: foot

  12. Subreddit Interactions for 25,000 Users

    • kaggle.com
    Cite
    colemaclean (2017). Subreddit Interactions for 25,000 Users [Dataset]. https://www.kaggle.com/colemaclean/subreddit-interactions
    Explore at:
    Available download formats: zip (82083361 bytes)
    Dataset updated
    Feb 19, 2017
    Authors
    colemaclean
    Description

    Context

    The dataset is a CSV file compiled using a Python scraper developed with PRAW (the Python Reddit API Wrapper). The raw data is a list of 3-tuples of [username, subreddit, utc timestamp]. Each row represents a single comment made by the user, and the file covers about 5 days' worth of Reddit data. Note that the actual comment text is not included; only the user, subreddit, and comment timestamp are recorded. The goal of the dataset is to provide a lens for discovering user patterns from Reddit metadata alone. The original use case was to compile a dataset suitable for training a neural network for a subreddit recommender system. That final system can be found here.

    A very unpolished EDA for the dataset can be found here. Note the published dataset is only half of the one used in the EDA and recommender system, to meet Kaggle's 500 MB size limitation.

    Content

    user - The username of the person submitting the comment
    subreddit - The title of the subreddit the user made the comment in
    utc_stamp - The UTC timestamp of when the user made the comment

    Acknowledgements

    The dataset was compiled as part of a school project. The final project report, with my collaborators, can be found here

    Inspiration

    We were able to build a pretty cool subreddit recommender with the dataset. A blog post for it can be found here, and the stand alone jupyter notebook for it here. Our final model is very undertuned, so there's definitely improvements to be made there, but I think there are many other cool data projects and visualizations that could be built from this dataset. One example would be to analyze the spread of users through the Reddit ecosystem, whether the average user clusters in close communities, or traverses wide and far to different corners. If you do end up building something on this, please share! And have fun!
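    A small sketch of the "user spread" idea above; the file name and header layout are assumptions, so adjust them to the actual CSV:

    import pandas as pd

    df = pd.read_csv("reddit_data.csv", names=["user", "subreddit", "utc_stamp"])

    # How widely does each user roam? Distinct subreddits per user
    spread = df.groupby("user")["subreddit"].nunique()
    print(spread.describe())

    # Most active communities in the sample
    print(df["subreddit"].value_counts().head(20))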

    Released under Reddit's API licence

  13. Employee Performance & Salary (Synthetic Dataset)

    • kaggle.com
    Cite
    Mamun Hasan (2025). Employee Performance & Salary (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/employee-performance-and-salary-synthetic-dataset
    Explore at:
    Available download formats: zip (13002 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧑‍💼 Employee Performance and Salary Dataset

    This synthetic dataset simulates employee information in a medium-sized organization, designed specifically for data preprocessing and exploratory data analysis (EDA) tasks in Data Mining and Machine Learning labs.

    It includes over 1,000 employee records with realistic variations in age, gender, department, experience, performance score, and salary — along with missing values, duplicates, and outliers to mimic real-world data quality issues.

    📊 Columns Description

    | Column Name | Description |
    | --- | --- |
    | Employee_ID | Unique employee identifier (E0001, E0002, …) |
    | Age | Employee age (22–60 years) |
    | Gender | Gender of the employee (Male/Female) |
    | Department | Department where the employee works (HR, Finance, IT, Marketing, Sales, Operations) |
    | Experience_Years | Total years of work experience (contains missing values) |
    | Performance_Score | Employee performance score (0–100, contains missing values) |
    | Salary | Annual salary in USD (contains outliers) |

    🧠 Example Lab Tasks

    • Identify and impute missing values using mean or median.
    • Detect and remove duplicate employee records.
    • Detect outliers in Salary using IQR or Z-score.
    • Normalize Salary and Performance_Score using Min-Max scaling.
    • Encode categorical columns (Gender, Department) for model training.
    • Ideal for Regression

    🎯 Possible Regression Targets (Dependent Variables)

    • Salary → Predict salary based on experience, performance, department, and age.
    • Performance_Score → Predict employee performance based on age, experience, and department.

    🧩 Example Regression Problem

    Predict the employee's salary based on their experience, performance score, and department.

    🧠 Sample Features:

    X = ['Age', 'Experience_Years', 'Performance_Score', 'Department', 'Gender']
    y = ['Salary']

    You can apply:

    • Linear Regression
    • Ridge/Lasso Regression
    • Random Forest Regressor
    • XGBoost Regressor
    • SVR (Support Vector Regression)
    • and evaluate with metrics like:

    R², MAE, MSE, RMSE, and residual plots.
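    A compact sketch tying these steps together (imputation, one-hot encoding, a Random Forest regressor, MAE); the file name is an assumption:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("employees.csv").drop_duplicates()

    # Impute missing numeric values with the median, per the lab tasks above
    for col in ["Experience_Years", "Performance_Score"]:
        df[col] = df[col].fillna(df[col].median())

    X = df[["Age", "Experience_Years", "Performance_Score", "Department", "Gender"]]
    y = df["Salary"]

    pre = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"),
                              ["Department", "Gender"])], remainder="passthrough")
    model = Pipeline([("pre", pre), ("rf", RandomForestRegressor(random_state=42))])

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))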

  14. Saudi Arabia Events & Crowding Impact Dataset

    • kaggle.com
    Cite
    Mohamed Samy (2025). Saudi Arabia Events & Crowding Impact Dataset [Dataset]. https://www.kaggle.com/datasets/mohamedsamy16/saudi-arabia-events-and-crowding-impact-dataset
    Explore at:
    Available download formats: zip (22590 bytes)
    Dataset updated
    Feb 12, 2025
    Authors
    Mohamed Samy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Saudi Arabia
    Description

    📦 Saudi Arabia Events & Crowding Impact Dataset

    Unlock insights into crowding, sales trends, and delivery optimization using public events, weather, and paydays.

    📝 Dataset Overview

    This dataset captures public events, holidays, weather conditions, and financial factors that influence crowding, consumer behavior, and online deliveries across Saudi Arabia.

    Key Highlights:
    ✅ Covers multiple Saudi cities with rich event data.
    ✅ Includes weather conditions affecting business & logistics.
    ✅ Tracks paydays & school schedules for demand forecasting.
    ✅ Ideal for crowding prediction, sales analysis, and delivery optimization.

    📊 Data Description

    Each row represents a daily snapshot of city conditions with the following variables:

    📆 Date & Calendar Information

    • DateG – Gregorian date (YYYY-MM-DD).
    • DateH – Hijri date.
    • Day – Day of the week (Sunday, Monday, etc.).

    🎉 Public Holidays & Events

    • Holiday Name – Name of the holiday (if applicable).
    • Type of Public Holiday – National, Religious, or School-related holidays.
    • Event – Major events (e.g., festivals, matches, etc.).
    • Match – Includes Premier League & KSA League games.

    🌦 Weather Conditions

    • Cloudy, Fog, Rain, Widespread Dust, Blowing Dust, etc.
    • Useful for studying weather impact on mobility & sales.

    🏙 Crowding & City Impact

    • City – Name of the city.
    • Effect on City – Expected impact (e.g., increased traffic, reduced mobility).

    💰 Economic & Financial Impact

    • Pay Day – Indicates whether it was a salary payout day.
    • days till next payday – How many days until the next salary payout.
    • days after payday – How many days after the last payday.

    🎓 Education & School Impact

    • days after school – Number of days since school ended.
    • days before school – Number of days until school resumes.

    🚀 Potential Use Cases

    This dataset can be leveraged for:

    📌 Crowding Prediction – Identify peak congestion periods based on holidays, weather, and events.
    📌 Sales & Demand Forecasting – Analyze payday effects on consumer spending & delivery volumes.
    📌 Delivery Optimization – Find the best times for online deliveries to avoid congestion.
    📌 Weather Impact Analysis – Study how dust storms & rain affect mobility & e-commerce.
    📌 Event-driven Business Planning – Plan logistics around national events & sports matches.

    📈 Exploratory Data Analysis (EDA)

    🔍 Ideas for Data Exploration

    • Visualize order volume trends across paydays, school terms, & holidays.
    • Analyze correlations between weather conditions & delivery delays.
    • Find seasonal trends in crowding & online shopping behavior.

    🔥 Example Analysis in Python

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Load the dataset
    df = pd.read_csv("saudi_events.csv")
    
    # Convert date column to datetime format
    df['DateG'] = pd.to_datetime(df['DateG'])
    
    # Plot orders over time
    plt.figure(figsize=(10,5))
    df.groupby('DateG')['days after payday'].mean().plot()
    plt.title("Effect of Payday on Consumer Activity")
    plt.xlabel("Date")
    plt.ylabel("Days After Payday")
    plt.show()
    

    📌 Getting Started

    How to Use the Dataset:

    1️⃣ Download the dataset and load it into Python or R.
    2️⃣ Perform EDA to uncover insights into crowding & spending patterns.
    3️⃣ Use classification models to predict crowding based on weather, holidays & city impact.
    4️⃣ Apply time-series forecasting for sales & delivery demand projections.

    🏆 Why This Dataset is Valuable

    📊 Multidimensional Insights – Combines weather, paydays, and events for a complete picture of crowding & sales trends.
    📌 Business & Logistics Applications – Helps companies plan deliveries, optimize marketing, and predict demand.
    Unique & Rich Data – A rare dataset covering Saudi Arabia's socio-economic events & crowd impact.

    📜 License & Acknowledgments

    • 📖 License: CC BY 4.0 – Free to use with attribution.

    Conclusion

    This dataset is a powerful tool for online delivery companies, businesses, and city planners looking to optimize operations. By analyzing external factors like holidays, paydays, weather, and events, we can predict crowding, improve delivery timing, and forecast sales trends.

    🚀 We welcome feedback and contributions! If you find this dataset useful, please ⭐ it on Kaggle and share your insights!

  15. Student Academic Performance (Synthetic Dataset)

    • kaggle.com
    Cite
    Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset
    Explore at:
    Available download formats: zip (9287 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

    The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

    • Handling missing values
    • Removing duplicates
    • Detecting and treating outliers
    • Data normalization and transformation
    • Encoding categorical variables
    • Exploratory data analysis (EDA)
    • Regression Analysis

    📊 Columns Description

    | Column Name | Description |
    | --- | --- |
    | Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
    | Age | Age of the student (between 18 and 25 years) |
    | Gender | Gender of the student (Male/Female) |
    | Study_Hours | Average number of study hours per day (contains missing values and outliers) |
    | Attendance(%) | Percentage of class attendance (contains missing values) |
    | Test_Score | Final exam score (0–100 scale) |
    | Grade | Letter grade derived from test scores (F, C, B, A, A+) |

    🧠 Example Lab Tasks Using This Dataset:

    • Identify and impute missing values using mean/median.
    • Detect and remove duplicate records.
    • Use IQR or Z-score methods to handle outliers.
    • Normalize Study_Hours and Test_Score using Min-Max scaling.
    • Encode categorical variables (Gender, Grade) for model input.
    • Prepare a clean dataset ready for classification/regression analysis.
    • Can be used for Limited Regression

    🎯 Possible Regression Targets

    Test_Score → Predict test score based on study hours, attendance, age, and gender.

    🧩 Example Regression Problem

    Predict the student’s test score using their study hours, attendance percentage, and age.

    🧠 Sample Features:

    X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
    y = ['Test_Score']

    You can use:

    • Linear Regression (for simplicity)
    • Polynomial Regression (to explore nonlinear patterns)
    • Decision Tree Regressor or Random Forest Regressor

    And analyze feature influence using correlation or SHAP/LIME explainability.
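    A minimal sketch of this regression problem, reusing the lab-task steps above; the file name and the binary gender encoding are assumptions:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("students.csv").drop_duplicates()

    # Impute missing values and encode Gender, per the lab tasks above
    for col in ["Study_Hours", "Attendance(%)"]:
        df[col] = df[col].fillna(df[col].median())
    df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
    df = df.dropna(subset=["Gender", "Test_Score"])

    X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
    y = df["Test_Score"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2:", r2_score(y_test, model.predict(X_test)))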

  16. GTA San Andreas Vehicle Stats - Full Handling Data

    • kaggle.com
    Cite
    Marcel Biezunski (2025). GTA San Andreas Vehicle Stats - Full Handling Data [Dataset]. https://www.kaggle.com/datasets/marcelbiezunski/gta-san-andreas-vehicle-stats-full-handling-data/discussion
    Explore at:
    Available download formats: zip (7300 bytes)
    Dataset updated
    Jul 14, 2025
    Authors
    Marcel Biezunski
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Don't forget to upvote if you enjoy my work :)

    Hi! I'm a big fan of the GTA series and recently got back into playing GTA San Andreas - a game I still love after all these years. I thought it would be fun to analyze the internal car data from the game files like a data scientist would.

    This dataset contains detailed handling and performance statistics of 162 cars from the legendary game Grand Theft Auto: San Andreas. The data originates from the game's internal configuration files and provides a technical breakdown of each vehicle’s physical and mechanical attributes.

    Each row represents one vehicle with columns including:

    • identifier - Internal vehicle name used by the game (e.g., "INFERNUS")
    • mass_(kg) - Vehicle mass in kilograms
    • turn_mass_(kg) - "Turning mass" or rotational inertia; higher values mean more resistance to turning
    • drag_multiplier - Air resistance coefficient; higher = stronger aerodynamic drag = lower top speed
    • center_of_mass_x/y/z - Coordinates of the center of mass in the game’s physics system (X = lateral, Y = longitudinal, Z = vertical)
    • center_of_mass_%submerged - Submersion threshold; defines how deep the vehicle must be submerged before the game considers it underwater. This affects how the vehicle behaves in water and when it becomes uncontrollable or begins to sink
    • traction_multiplier - Grip multiplier; higher = better traction
    • traction_loss - How easily the vehicle loses grip (i.e., slides); lower = more stable on the road
    • traction_bias(%) - Distribution of traction between front and rear axles (e.g., 50% = equal balance)
    • #_of_gears - Number of gears in the transmission
    • max_velocity(km/h) - Maximum top speed in kilometers per hour
    • acceleration(ms^2) - Acceleration in meters per second squared (real-world physics metric)
    • interia - Vehicle inertia (resistance to changes in speed/direction)
    • drive_type - Drivetrain type: "Front" = front-wheel drive, "Rear" = rear-wheel drive, "4" = 4WD
    • engine_type - Petrol, Diesel or Electric
    • brakes_deceleration(ms^2) - Braking deceleration in m/s²
    • brakes_bias(%) - Brake force distribution between front and rear
    • abs - Anti-lock braking system: 1 = enabled, 0 = disabled
    • steering_lock_(°) - Maximum steering angle in degrees
    • suspension_force_level - Stiffness of the suspension springs
    • suspension_damping_level - Damping level of suspension oscillations
    • suspension_high_speed_com_damping - Additional damping for high-speed situations
    • suspension_lines_upper/lower_limit - Upper and lower suspension travel limits
    • suspension_lines_bias_between_front_and_rear - Bias of suspension travel between front and rear
    • suspension_anti-dive_multiplier - How much the vehicle resists forward dipping under braking
    • seat_offset_distance - Horizontal offset of the driver seat relative to the vehicle’s center
    • collision_damage_multiplier - Multiplier for collision damage; higher = more damage on impact
    • monetary_value_($) - In-game dollar value of the vehicle
    • model_flags, handling_flags - Binary flags that define special behavior (e.g., lowrider, can slide, off-road tuned)
    • lights_front, lights_rear, lights_anim_group - Types of front/rear lights and animation group (e.g., police flashing)

    With 35 unique attributes, this dataset is ideal for: 📈 Exploratory Data Analysis (EDA), 📊 Data Visualization, 🤖 Machine Learning, 🔧 Physics or game logic analysis, 🎮 Reverse engineering game mechanics, 🧪 Feature importance / ranking of in-game vehicle performance.

    I’ve also included a Jupyter Notebook with EDA to showcase some interesting insights from this data. You're welcome to fork, explore, or build your own models on top of it!
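    A small EDA sketch using the column names documented above (the CSV file name is an assumption):

    import pandas as pd

    df = pd.read_csv("gta_sa_vehicles.csv")

    # Which drivetrains deliver the highest average top speeds?
    print(df.groupby("drive_type")["max_velocity(km/h)"].mean())

    # A crude power-to-weight style ranking: acceleration per tonne of mass
    df["accel_per_tonne"] = df["acceleration(ms^2)"] / (df["mass_(kg)"] / 1000)
    print(df.nlargest(10, "accel_per_tonne")[["identifier", "accel_per_tonne"]])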

  17. EdgeFogFinDSet

    • kaggle.com
    Cite
    Muhammad Nuraddeen Ado (2025). EdgeFogFinDSet [Dataset]. https://www.kaggle.com/datasets/muhammadnuraddeenado/edgefogfindset
    Explore at:
    Available download formats: zip (5136133 bytes)
    Dataset updated
    May 27, 2025
    Authors
    Muhammad Nuraddeen Ado
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Edge-Fog Financial Transactions Dataset Overview

    Synthesized Dataset Overview

    This dataset is a synthetic financial transactions dataset tailored for financial crimes detection in edge and fog computing environments. It simulates transactional activities that could be monitored across decentralized computing layers (e.g., ATMs, mobile apps, IoT financial devices) to train AI/ML models for detecting financial fraud.

    | Feature | Description |
    | --- | --- |
    | TransID | Unique transaction identifier |
    | AcctNo | Synthetic account number |
    | AcctName | Account holder name |
    | TransAmount | Transaction amount (some negative values may simulate refunds/fraud) |
    | TrnsMode | Mode of transaction (ATM, POS, USSDC, etc.) |
    | TrnsType | Transaction type (Transfer, Withdrawal, Deposit) |
    | TrnsDate | Date of transaction |
    | TrnNature | Role in transaction (Source or Destination) |
    | MACAdres | Device MAC address simulating IoT or edge device |
    | IPAdres | IP address simulating the device/network location |
    | Protocol | Communication protocol used (HTTP, UDP, ICMP, etc.) |
    | Length | Size of the data packet (used to simulate network activity) |

    Preliminary Exploratory Data Analysis (EDA)

    1. Missing Values: No missing values; the dataset is fully populated.
    2. Categorical Variables: TrnsMode, TrnsType, Protocol, and TrnNature show realistic variation:
       • 5 transaction modes: ATM, Bank, POS, USSDC, Mobile
       • 3 transaction types: Transfer, Withdrawal, Deposit
       • 5 network protocols: HTTP, UDP, ICMP, etc.
    3. Numerical Distributions:
       • TransAmount: mean ~$3,452, std ~$9,585, min -$908, max ~$49,996. Highly skewed, indicating outliers or potential fraud.
       • Length (network packet size): min 60, max 1500, within the expected network transmission range.
    4. Date Field (TrnsDate): Covers multiple years (2023–2025) and is suitable for time-series modeling.

    Suitability for Machine Learning

    The dataset is well-suited for:

    • Supervised Learning: if fraud/non-fraud labels are introduced or derived.
    • Unsupervised Learning: anomaly detection using clustering or density-based methods.
    • Ensemble Methods: the feature variety and volume allow robust ensemble modeling.
    • Deep Learning: a rich and diverse feature space suitable for sequential or graph models in fraud detection.
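    A brief unsupervised-learning sketch along the lines suggested above, using scikit-learn's IsolationForest on the numeric fields; the file name and contamination rate are assumptions:

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.read_csv("edgefog_transactions.csv")

    # Unsupervised anomaly detection on transaction amount and packet length
    iso = IsolationForest(contamination=0.01, random_state=42)
    df["anomaly"] = iso.fit_predict(df[["TransAmount", "Length"]])  # -1 flags suspected anomalies

    print(df[df["anomaly"] == -1][["TransID", "TransAmount", "Length"]].head())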

  18. Phone Price Predict 2020-2024

    • kaggle.com
    Cite
    Jerowai (2024). Phone Price Predict 2020-2024 [Dataset]. https://www.kaggle.com/datasets/jerowai/phone-price-predict-2020-2024
    Explore at:
    zip (1002 bytes)
    Dataset updated
    Dec 10, 2024
    Authors
    Jerowai
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset Overview

    This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.

    What’s Inside?

    • Phone Name & Release Year: quickly reference the time frame and model.
    • Camera Specs: understand rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
    • Processor & GPU: gain insight into performance capabilities from the processor and graphics chip.
    • Memory & Storage: review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
    • Display & Battery: compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
    • Operating System: note the Android version at release.
    • Price (USD): examine relative pricing trends over the years.

    How to Use This Dataset

    Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities.
      Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.

    Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users.
      Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]

    Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you can still practice predictive modeling, for example by fitting a simple regression model that predicts price from features like RAM and display size.
      Example: Train a regression model (e.g., LinearRegression in scikit-learn) to see whether increasing RAM or battery capacity correlates with higher prices.

    Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved, and see whether there is a noticeable shift toward larger displays, bigger batteries, or higher camera megapixels over the years.
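
    Combining the EDA and regression ideas above, a minimal sketch could look like the following. The CSV file name is a placeholder; the column names ('Release Year', 'Price (USD)', 'RAM (GB)', 'Battery Capacity (mAh)') are taken from the examples above and should be checked against the actual file.

      import pandas as pd
      from sklearn.linear_model import LinearRegression

      # Placeholder file name; adjust to the actual CSV in the dataset.
      df = pd.read_csv("phone_price_2020_2024.csv")

      # EDA: average price per release year (as in the groupby example above).
      print(df.groupby("Release Year")["Price (USD)"].mean())

      # Simple regression: does price scale with RAM and battery capacity?
      X = df[["RAM (GB)", "Battery Capacity (mAh)"]]
      y = df["Price (USD)"]
      model = LinearRegression().fit(X, y)
      print(dict(zip(X.columns, model.coef_)))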

    Recommended Tools & Libraries

    • Python & Pandas: for data cleaning, manipulation, and initial analysis.
    • Matplotlib & Seaborn: for creating visualizations to understand trends and distributions.
    • scikit-learn: for modeling and basic predictive tasks, if you choose to use these example values as a training ground.
    • Jupyter Notebooks or Kaggle Kernels: for interactive analysis and iterative exploration.

    Disclaimer

    This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It is intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.

  19. IMDB top 250 French movies

    • kaggle.com
    zip
    Updated Aug 3, 2023
    Cite
    Khushi Pitroda (2023). IMDB top 250 French movies [Dataset]. https://www.kaggle.com/datasets/khushipitroda/imdb-top-250-french-movies/code
    Explore at:
    zip (36031 bytes)
    Dataset updated
    Aug 3, 2023
    Authors
    Khushi Pitroda
    Area covered
    French
    Description

    Important Note: The "Top 250 French Movies" dataset comprises information on the highest-rated French movies according to user ratings on various platforms. This dataset contains 250 unique French movies that have garnered critical acclaim and popularity among viewers. Each movie is associated with essential details, including its rank, title, release year, duration, genre, IMDb rating, image source link, and a brief description.

    This dataset is intended for learning, research, and analysis purposes. The movie ratings and details provided in the dataset are based on publicly available information at the time of scraping. As IMDb ratings and movie information may change over time, it is essential to verify and update the data for the latest information.

    By using this dataset, you acknowledge that the accuracy and completeness of the information cannot be guaranteed, and you assume responsibility for any analysis or decision-making based on the data. Additionally, please adhere to IMDb's terms of use and copyright policies when using the data for any public dissemination or commercial purposes.

    Data Analysis Tasks:

    1. Exploratory Data Analysis (EDA): Explore the distribution of movies by genre, release year, and IMDb rating. Visualize the top-rated French movies and their IMDb ratings using bar charts or histograms.

    2. Year-wise Trends: Observe trends in French movie production over the years using line charts or area plots. Analyze whether there is any correlation between release year and IMDb rating.

    3. Word Cloud Analysis: Create word clouds from movie descriptions to visualize the most common words and themes among the top-rated French movies. This can provide insight into popular topics and genres.

    4. Network Analysis: Build a network graph connecting French movies that share actors or directors. Analyze the interconnectedness of movies based on their production teams.
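
    As a quick illustration of tasks 1 and 2, a sketch like the one below could plot the rating distribution and the year-by-year production trend. The CSV file name and the 'Rating' and 'Year' column names are hypothetical, since the scraped column names are not listed here.

      import matplotlib.pyplot as plt
      import pandas as pd

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")

      # Task 1: distribution of IMDb ratings.
      df["Rating"].plot(kind="hist", bins=20, title="IMDb rating distribution")
      plt.show()

      # Task 2: movies per release year, plus the year-rating correlation.
      df.groupby("Year").size().plot(title="Top-250 French movies per release year")
      plt.show()
      print("Year/Rating correlation:", round(df["Year"].corr(df["Rating"]), 3))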

    Machine Learning Tasks:

    1. Movie Recommendation System: Implement a content-based recommendation system that suggests French movies based on similarities in genre, release year, and IMDb rating. Use techniques like cosine similarity or Jaccard similarity to measure movie similarity.
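
    For instance, a minimal content-based sketch could vectorize the genre text with TF-IDF and rank movies by cosine similarity; the file name and the 'Title' and 'Genre' columns are hypothetical.

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")

      # Represent each movie by its genre text and compute pairwise similarity.
      vectors = TfidfVectorizer().fit_transform(df["Genre"].fillna(""))
      sim = cosine_similarity(vectors)

      # The five movies most similar to the first movie (excluding itself).
      best = sim[0].argsort()[::-1][1:6]
      print(df.iloc[best][["Title", "Genre"]])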

    2. Movie Genre Classification: Build a multi-class classification model to predict the genre of a French movie from its description. Utilize Natural Language Processing (NLP) techniques such as text preprocessing, TF-IDF, or word embeddings, with classifiers like Logistic Regression, Naive Bayes, or Support Vector Machines.
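
    A minimal sketch of that pipeline, treating the 250 rows as a toy corpus and assuming hypothetical 'Description' and 'Genre' columns:

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          df["Description"], df["Genre"], test_size=0.2, random_state=42
      )

      # TF-IDF features feeding a logistic regression classifier.
      clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
      clf.fit(X_train, y_train)
      print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")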

    3. Movie Sentiment Analysis: Perform sentiment analysis on movie descriptions to determine the overall sentiment (positive, negative, or neutral) of each movie. Use sentiment lexicons or pre-trained sentiment analysis models.
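
    One pre-trained option is NLTK's VADER lexicon; the sketch below assumes a hypothetical 'Description' column and a placeholder file name.

      import nltk
      import pandas as pd
      from nltk.sentiment import SentimentIntensityAnalyzer

      nltk.download("vader_lexicon")  # one-time lexicon download
      sia = SentimentIntensityAnalyzer()

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")
      df["compound"] = df["Description"].apply(lambda t: sia.polarity_scores(str(t))["compound"])

      # Bucket compound scores into negative / neutral / positive.
      df["sentiment"] = pd.cut(
          df["compound"], [-1, -0.05, 0.05, 1],
          labels=["negative", "neutral", "positive"], include_lowest=True,
      )
      print(df["sentiment"].value_counts())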

    4. Movie Rating Prediction: Develop a regression model to predict the IMDb rating of a French movie based on features like genre, release year, and description sentiment. Employ regression algorithms like Linear Regression, Decision Trees, or Random Forests.
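
    A hedged regression sketch using a random forest, with hypothetical 'Year', 'Genre', and 'Rating' column names:

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import cross_val_score

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")
      X = pd.get_dummies(df[["Year", "Genre"]], columns=["Genre"])
      y = df["Rating"]

      # Cross-validated mean squared error as the evaluation metric.
      model = RandomForestRegressor(n_estimators=200, random_state=42)
      scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
      print(f"Mean CV MSE: {-scores.mean():.3f}")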

    5. Movie Clustering: Apply unsupervised clustering algorithms to group French movies with similar attributes. Use features like genre, IMDb rating, and release year to identify movie clusters. Experiment with algorithms like K-means or hierarchical clustering.
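
    For example, a K-means sketch on standardized numeric features (hypothetical 'Rating' and 'Year' columns):

      import pandas as pd
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      # Placeholder file and column names; adjust to the actual CSV schema.
      df = pd.read_csv("imdb_top_250_french_movies.csv")
      X = StandardScaler().fit_transform(df[["Rating", "Year"]])

      # Four clusters is an arbitrary starting point; tune with an elbow plot.
      df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
      print(df.groupby("cluster")[["Rating", "Year"]].mean())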

    Important Note: Ensure that the data is appropriately preprocessed and encoded for machine learning tasks. Handle any missing values, perform feature engineering, and split the dataset into training and testing sets. Evaluate each model with metrics appropriate to the task, such as accuracy, precision, and recall for classification, or Mean Squared Error (MSE) for regression.

    It is crucial to remember that the performance of machine learning models may vary based on the dataset's size and quality. Interpret the results carefully and consider using cross-validation techniques to assess model generalization.

    Lastly, please adhere to IMDb's terms of use and any applicable data usage policies while conducting data analysis and implementing machine learning models with this dataset.

  20. Pavement Dataset

    • kaggle.com
    zip
    Updated May 24, 2025
    Cite
    Gifrey Sulay (2025). Pavement Dataset [Dataset]. https://www.kaggle.com/datasets/gifreysulay/pavement-dataset/discussion?sort=undefined
    Explore at:
    zip (20890601 bytes)
    Dataset updated
    May 24, 2025
    Authors
    Gifrey Sulay
    License

    CDLA Permissive 1.0 (https://cdla.io/permissive-1-0/)

    Description

    🏗️ Pavement Condition Monitoring and Maintenance Prediction

    📘 Scenario

    You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.

    Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.

    🔍 Target Variable: Needs_Maintenance

    This binary label indicates whether the road segment requires immediate maintenance, defined by the following rule:

    • Needs_Maintenance = 1
    • Needs_Maintenance = 0 otherwise

    🎯 Learning Objectives

    • Perform exploratory data analysis (EDA) on civil engineering infrastructure data
    • Engineer features relevant to road quality and maintenance
    • Build and evaluate a binary classification model using Python
    • Interpret model results to support maintenance prioritization decisions

    📊 Dataset Features

    Column descriptions:
    • Segment ID: Unique identifier for the road segment
    • PCI: Pavement Condition Index (0 = worst, 100 = best)
    • Road Type: Type of road (Primary, Secondary, Barangay)
    • AADT: Average Annual Daily Traffic
    • Asphalt Type: Asphalt mix classification (e.g., Dense, Open-graded, SMA)
    • Last Maintenance: Year of the last major maintenance
    • Average Rainfall: Average annual rainfall in the area (mm)
    • Rutting: Depth of rutting (mm)
    • IRI: International Roughness Index (m/km)
    • Needs Maintenance: Target label, 1 if urgent maintenance is needed, 0 otherwise

    🎓 Final Exam Task (For Students)

    Using this 1,050,000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:

    • Plots using Matplotlib or Seaborn
    • Tables or summary statistics using Pandas
    • Numerical calculations using NumPy
    • Grouped analyses, cross-tabulations, or pivot tables

    You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but no AI-assisted tools or generative models are permitted, even for code snippets or data exploration.

    What counts as an “Observation”

    1. Distribution Insight

      • E.g. plot the distribution of IRI and comment on its skewness.
    2. Correlation or Relationship

      • E.g. scatterplot of Rutting vs. Average Rainfall, plus calculation of Pearson or Spearman correlation.
    3. Group Comparison

      • E.g. pivot table of mean AADT by Road Type and a bar chart.
    4. Derived Feature Analysis

      • E.g. create decay = Rutting / Last Maintenance, then describe its summary statistics and plot.
    5. Conditional Probability or Rate

      • E.g. compute the proportion of Needs Maintenance = 1 within each Road Type and visualize it as a line plot (a minimal sketch follows this list).
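
    A minimal sketch of what observations 3 and 5 might look like, assuming a hypothetical CSV file name; the column names follow the feature table above.

      import matplotlib.pyplot as plt
      import pandas as pd

      # Placeholder file name; adjust to the actual CSV in the dataset.
      df = pd.read_csv("pavement_dataset.csv")

      # Observation 3: mean AADT by road type.
      print(df.pivot_table(values="AADT", index="Road Type", aggfunc="mean"))

      # Observation 5: proportion of segments needing maintenance per road type.
      rate = df.groupby("Road Type")["Needs Maintenance"].mean()
      rate.plot(marker="o", title="Maintenance rate by road type")
      plt.show()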

    You must deliver:

    • A Jupyter Notebook containing at least five well-labeled observations, each with a title, code cell(s), output (plot/table), and a short interpretation (2–4 sentences).
    • No AI tools: all code must be handwritten or copied from official docs/examples; do not use ChatGPT, Copilot, or similar.
    • Set your random seeds where appropriate to ensure reproducibility.
