MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains clinical and diagnostic features related to Breast Cancer, designed for comprehensive Exploratory Data Analysis (EDA) and subsequent predictive modeling.
It is derived from digitized images of Fine Needle Aspirates (FNA) of breast masses.
The dataset features quantitative measurements, typically calculated from the characteristics of cell nuclei, including:
- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave Points
- Symmetry
- Fractal Dimension
These features are provided as mean, standard error, and "worst" (largest) values.
The primary goal of this resource is to support the validation of EDA techniques necessary for clinical data science:
- Data quality assessment (missing values, inconsistencies)
- Feature assessment (distributions, correlations)
- Visualization for diagnostic modeling
The primary target variable is the binary classification of the tissue sample: Malignant vs. Benign.
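A minimal EDA sketch along these lines, assuming the data sits in a single CSV with a diagnosis column coded M/B and feature columns named in the radius_mean / radius_se / radius_worst style (the file name and column naming are assumptions):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv("breast_cancer.csv")

# Data quality assessment: missing values and duplicate rows
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Class balance of the binary target (column name 'diagnosis' is an assumption)
print(df["diagnosis"].value_counts(normalize=True))

# Feature assessment: distributions and correlations of the mean-value features
mean_cols = [c for c in df.columns if c.endswith("_mean")]
print(df[mean_cols].describe())
print(df[mean_cols].corr())
```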
Suggested analyses:
- Analyze the closing price of all the stocks.
- Analyze the total volume of stock traded each day.
- Analyze the daily price change in each stock.
- Analyze the monthly mean of the close feature.
- Analyze whether the stock prices of these tech companies are correlated.
- Analyze the daily return of each stock and how the returns are correlated.
- Perform a Value at Risk (VaR) analysis for the tech companies.
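A sketch of the daily-return and correlation steps, assuming one CSV per ticker with Date and Close columns (the ticker list and file names are assumptions):

```python
import pandas as pd

tickers = ["AAPL", "MSFT", "GOOG", "AMZN"]  # assumed set of tech tickers

# Date-indexed frame of closing prices, one column per ticker
closes = pd.DataFrame({
    t: pd.read_csv(f"{t}.csv", parse_dates=["Date"]).set_index("Date")["Close"]
    for t in tickers
})

# Daily returns and their correlation across companies
returns = closes.pct_change().dropna()
print(returns.corr())

# Monthly mean of the close feature
print(closes.groupby(closes.index.to_period("M")).mean().head())

# Simple historical Value at Risk (5% quantile of daily returns)
print(returns.quantile(0.05))
```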
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Hosted by: Walsoft Computer Institute
Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.
As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve the program.
Answer this central question:
“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”
You are required to analyze and provide actionable insights for the following three areas:
Should entry exams remain the primary admissions filter?
Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.
✅ Deliverables:
Are there at-risk student groups who need extra support?
Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.
✅ Deliverables:
How can we allocate resources for maximum student success?
Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.
✅ Deliverables:
| Column | Description |
|---|---|
| fNAME, lNAME | Student first and last name |
| Age | Student age (21–71 years) |
| gender | Gender (standardized as "Male"/"Female") |
| country | Student's country of origin |
| residence | Student housing/residence type |
| entryEXAM | Entry test score (28–98) |
| prevEducation | Prior education (High School, Diploma, etc.) |
| studyHOURS | Total study hours logged |
| Python | Final Python exam score |
| DB | Final Database exam score |
You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.
Download: bi.csv
This dataset includes common data quality challenges:
Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using the column mean or another suitable imputation strategy
Participants using this dataset are expected to apply data cleaning techniques such as:
- String standardization
- Null value imputation
- Type correction (e.g., scores as float)
- Validation and visual verification
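A minimal cleaning sketch along these lines, using the mappings listed above (treat it as a starting point rather than a complete solution):

```python
import pandas as pd

df = pd.read_csv("bi.csv")

# Standardize country names
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})

# Unify residence type variants
df["residence"] = df["residence"].replace(
    {"BI-Residence": "BI Residence", "BIResidence": "BI Residence", "BI_Residence": "BI Residence"})

# Fix education typos and casing
df["prevEducation"] = df["prevEducation"].replace(
    {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})

# Standardize gender values
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female", "female": "Female"})

# Type correction and null imputation for the Python scores
df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
df["Python"] = df["Python"].fillna(df["Python"].mean())

# Visual verification of the cleaned categorical columns
for col in ["country", "residence", "prevEducation", "gender"]:
    print(col, sorted(df[col].dropna().unique()))
```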
✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
Download: cleaned_bi.csv
This version has been fully standardized and preprocessed:
- All fields cleaned and renamed consistently
- Missing Python scores filled with th...
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Headquartered in Walldorf, Germany, SAP is the market leader in enterprise application software. Founded in 1972, SAP (which stands for "Systems, Applications, and Products in Data Processing") has a rich history of innovation and growth as a true industry leader.
This dataset contains detailed historical stock price data for SAP, covering the period from 09/22/1995 to 06/14/2024. The data is collected from Yahoo Finance and includes daily records of the stock's opening price, highest price, lowest price, closing price, and trading volume. Each entry in the dataset represents a single trading day, providing a comprehensive view of the stock's price movements and market activity.
The purpose of this dataset is to provide analysts, traders, and researchers with accurate and granular historical stock price data for SAP. This data can be used for various applications, including:
Technical Analysis: Identify trends and patterns in the stock's price movements. Calculate technical indicators such as moving averages, RSI, and Bollinger Bands.
Market Sentiment Analysis: Analyze how the stock's price responds to market events and news. Compare the opening and closing prices to understand daily sentiment.
Algorithmic Trading: Develop and test trading algorithms based on historical price and volume data. Use past price movements to simulate trading strategies.
Predictive Modeling: Build models to forecast future prices and trading volumes. Use historical data to identify potential price movements and market trends.
Educational Purposes: Serve as a teaching tool for financial education. Help students and researchers understand the dynamics of stock price changes and market behavior.
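For example, a moving-average and RSI calculation on the daily closes might look like the sketch below (the file name and the exact Date/Close column labels are assumptions):

```python
import pandas as pd

df = pd.read_csv("sap_stock.csv", parse_dates=["Date"]).set_index("Date").sort_index()

# Simple moving averages of the closing price
df["SMA_50"] = df["Close"].rolling(50).mean()
df["SMA_200"] = df["Close"].rolling(200).mean()

# 14-day RSI from daily close-to-close changes
delta = df["Close"].diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = -delta.clip(upper=0).rolling(14).mean()
df["RSI_14"] = 100 - 100 / (1 + gain / loss)

print(df[["Close", "SMA_50", "SMA_200", "RSI_14"]].tail())
```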
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Walmart Inc. is a multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is one of the world's largest companies by revenue and a key player in the retail sector. Walmart's stock is actively traded on major stock exchanges, making it an interesting subject for financial analysis.
This dataset contains historical stock price data for Walmart, sourced directly from Yahoo Finance using the yfinance Python API. The data covers daily stock prices and includes multiple key financial indicators.
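Re-pulling the same series yourself with the yfinance API is straightforward; a minimal sketch (the ticker symbol WMT and the rolling window are the only assumptions):

```python
import yfinance as yf

# Download Walmart's full daily OHLCV history from Yahoo Finance
wmt = yf.Ticker("WMT").history(period="max")
print(wmt[["Open", "High", "Low", "Close", "Volume"]].tail())

# Rolling 30-day standard deviation of daily returns as a quick volatility measure
volatility = wmt["Close"].pct_change().rolling(30).std()
print(volatility.tail())
```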
This notebook performs an extensive EDA to uncover insights into Walmart's stock price trends, volatility, and overall behavior in the stock market. The following analysis steps are included:
This dataset and analysis can be useful for:
- 📡 Stock Market Analysis – Evaluating Walmart’s stock price trends and volatility.
- 🏦 Investment Research – Assisting traders and investors in making informed decisions.
- 🎓 Educational Purposes – Teaching data science and financial analysis using real-world stock data.
- 📊 Algorithmic Trading – Developing trading strategies based on historical stock price trends.
📥 Download the dataset and explore Walmart’s stock performance today! 🚀
Community Data License Agreement - Permissive 1.0: https://cdla.io/permissive-1-0/
In the high-stakes world of professional football, public opinion often forms around emotions, loyalties, and subjective interpretations. The project at hand aims to transcend these biases by delving into a robust, data-driven analysis of Real Madrid's performance in the UEFA Champions League over the past decade.
Through a blend of traditional statistical methods, machine learning models, game theory, psychology, philosophy, and even military strategies, this investigation presents a multifaceted view of what contributes to a football team's success and how performance can be objectively evaluated.
The EDA consists of two layers:
The goal of this analysis is multifaceted:
1. Unveil Hidden Statistics: reveal the underlying patterns often overlooked in casual discussions.
2. Demonstrate the Impact of Probability: how it shapes matches and seasons.
3. Explore Interdisciplinary Influences: including Game Theory, Strategy, Cooperation, Psychology, Physiology, Military Training, Luck, Economics, Philosophy, and even Freudian Analysis.
4. Challenge Subjective Bias: by presenting a well-rounded, evidence-based view of football performance.
This project stands as a testament to the profound complexity of football performance and the nuanced insights that can be derived through rigorous scientific analysis. Whether a data scientist recruiter, football fanatic, or curious mind, the findings herein offer a unique perspective that bridges the gap between passion and empiricism.
Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/
1.1 Industry Landscape of Fast Delivery Services in India
India’s fast delivery ecosystem is characterized by intense competition among multiple players offering expedited grocery and food delivery services with promised delivery windows as low as 10 to 30 minutes. Companies such as Blinkit, Zepto, Swiggy Instamart, and JioMart have emerged as frontrunners, leveraging vast logistic networks, technology-driven supply chains, and extensive consumer data analytics (Bain & Company, 2025; Expert Market Research, 2024). The sector’s growth trajectory is robust, with the online food delivery market alone valued at USD 48.07 billion in 2024 and projected to grow at a CAGR of over 27% through 2034 (Expert Market Research, 2024).
Customer reviews and ratings provide granular feedback on delivery agents’ punctuality, professionalism, order accuracy, and communication. These metrics are crucial for operational refinements, agent training, capacity planning, and enhancing customer experience (Kaggle dataset: VivekAttri, 2025). Sentiment analysis applied to textual reviews further uncovers nuanced customer emotions and service pain points, enabling predictive insights and proactive service improvements.
The focal dataset includes structured customer reviews and numerical ratings collected for fast delivery agents across India’s leading quick-commerce platforms. Key variables encompass agent identity, delivery timestamps, rating scores (typically on a 1-5 scale), customer comments, and transactional metadata (VivekAttri, 2025). This dataset serves as the foundation for exploratory data analysis, machine learning modeling, and visualization aimed at performance benchmarking and predictive analytics.
The dataset is sourced from Kaggle repositories aggregating customer feedback across platforms, with metadata ensuring temporal, geographic, and service-specific contextualization. Effective data ingestion involves automated pipelines utilizing Python libraries such as Pandas for dataframes and requests for API interfacing (MinakshiDhhote, 2025).
Critical preprocessing steps include:
Removal of Redundant and Irrelevant Columns: Columns unrelated to delivery agent performance (e.g., user identifiers when anonymized) are discarded to streamline analysis.
Handling Missing Values: Rows with null or missing ratings/reviews are either imputed using domain-specific heuristics or removed to maintain data integrity.
Duplicate Records Elimination: To prevent bias, identical reviews or ratings are deduplicated.
Text Cleaning for Reviews: Natural language processing (NLP) techniques such as tokenization, stopword removal, lemmatization, and spell correction are applied to textual data to prepare it for sentiment analysis.
Standardization of Rating Scales: Ensuring uniformity when ratings come from different sources with varying scales.
Derived features enhance modeling capabilities:
Sentiment Scores: Using models like VADER or BERT-based classifiers to convert textual reviews into quantifiable sentiment metrics.
Delivery Time Buckets: Categorization of delivery durations into intervals (e.g., under 15 minutes, 15-30 minutes) to analyze performance impact.
Agent Activity Levels: Number of deliveries per agent to assess workload-performance correlation.
Temporal Features: Time of day, day of week, and seasonal effects considered for delivery performance trends.
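As one concrete possibility for the sentiment-score feature, a VADER pass over the review text might look like this (the file name and the review_text/rating column names are assumptions):

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

df = pd.read_csv("delivery_reviews.csv")  # file name is an assumption

sia = SentimentIntensityAnalyzer()

# Compound score in [-1, 1], from strongly negative to strongly positive
df["sentiment"] = df["review_text"].astype(str).map(
    lambda text: sia.polarity_scores(text)["compound"])

# Relate text sentiment to the numeric 1-5 rating
print(df.groupby("rating")["sentiment"].mean())
```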
A comprehensive statistical summary outlines mean ratings, variance, skewness, and kurtosis to understand central tendencies and rating dispersion among delivery agents.
Table 1: Rating Summary Statistics for Delivery Agents (2025 Dataset Sample)
| Metric | Value |
|---|---|
| Mean Rating | 3.8 ± 0.15 |
| Median Rating | 4.0 |
| Standard Deviation | 0.75 |
| Skewness | -0.45 |
| Kurtosis | 2.1 |
| Number of Ratings | 250,000+ |
Data validated with 95% confidence interval from Kaggle 2025 dataset (VivekAttri, 2025).
Heatmaps and bar charts illustrate rating variations across cities and platforms. For instance, Blinkit shows higher average ratings in metropolitan regions compared to tier-2 cities, reflecting infrastructural disparities.
Scatter plots and corr...
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was obtained from tradytics.com on 21 June 2022. By the time you see this code, the dataset will no longer be available on the website; you can only access it on my GitHub.
This dataset captures the options flow of the stock market on 17 June 2022. It contains many tickers and is an excellent dataset for practicing time series analysis and testing your data science skills.
Time - Time when this ticker was caught in the flow.
Sym - The ticker symbol, e.g. AAPL, TSLA, SPY.
C/P - Call or Put trade?
Exp - The expiration date of the contract.
Str - The strike price.
Spot - The stock price at the moment the flow was reported.
Bidask - The bid/ask of the contract.
Orders - The total number of orders for the contract.
Volume - The number of shares traded at the moment this contract was caught.
Premiums - The total money spent on this contract.
Open Interest - The total number of open contracts at the moment this contract was caught.
Diff % - The % difference between the Spot and Strike price.
ITM - Whether the contract was a win or a loss: 0 is LOSS, 1 is WIN.
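A small loading sketch using the columns above (the file name is an assumption, and Premiums is assumed to already be numeric):

```python
import pandas as pd

flow = pd.read_csv("options_flow.csv")  # file name is an assumption

# Parse expirations and compute days to expiry from the flow date (17 June 2022)
flow["Exp"] = pd.to_datetime(flow["Exp"], errors="coerce")
flow["days_to_expiry"] = (flow["Exp"] - pd.Timestamp("2022-06-17")).dt.days

# Total premiums by ticker and call/put side
print(flow.groupby(["Sym", "C/P"])["Premiums"].sum().sort_values(ascending=False).head(10))

# ITM (win) rate by call/put side
print(flow.groupby("C/P")["ITM"].mean())
```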
**NOT FINANCIAL ADVICE**
This is a great dataset for beginners or for coders refreshing their data science skills, and there is no harm if professionals use it either. You can do a lot with it, maybe even build a stock market bot, though that is at your own risk. Enjoy and share your code!
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
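A starter regression sketch under assumed names (the file name, the price target column, and features such as square_footage and bedrooms are all illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("house_prices.csv")  # file name is an assumption

X = df[["square_footage", "bedrooms"]]  # illustrative feature names
y = df["price"]                         # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```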
The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.
Data Analysis Tasks:
1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.
2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.
3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.
4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.
5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
Machine Learning Tasks:
1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).
3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.
4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.
5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
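As a sketch of the stock-movement classification task (task 2 above), assuming the CSV holds Date, Company, Close, and Volume columns (the file name and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("stocks.csv", parse_dates=["Date"])   # file/column names are assumptions
df = df[df["Company"] == "Apple"].sort_values("Date")

# Features: lagged returns and volume change; target: does tomorrow's close rise?
df["return_1d"] = df["Close"].pct_change()
df["return_5d"] = df["Close"].pct_change(5)
df["volume_chg"] = df["Volume"].pct_change()
df["target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
df = df.dropna()

X = df[["return_1d", "return_5d", "volume_chg"]]
y = df["target"]

# Keep chronological order: no shuffling for time-series data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```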
The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.
It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.
This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.
By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.
Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.
In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I've created this dataset using the project: Premier League Statistics Scraping. It contains statistics for Premier League matches from 2015 to 2023. You can use the data for EDA or to predict this season's winner. So, enjoy the data!
Here are some additional details about the features (columns):
1. members: the number of players.
2. foreign_players: the number of foreign players in the team.
3. mean_age: the mean age of all players.
4. salaries: monthly salary charge.
5. spending: transfer expenditure.
6. MOY: average player rating.
7. rank: the rank of the team in the season.
8. points: points gained in the season.
9. BP: goals scored.
10. BC: goals against.
11. DIF: goal difference (BP - BC).
12. Gain: the number of wins.
13. Null: the number of draws.
14. Defeat: the number of losses.
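A quick look at how squad spending and rating relate to results might start like this (the file name is an assumption; column names follow the list above):

```python
import pandas as pd

pl = pd.read_csv("premier_league.csv")  # file name is an assumption

# How do salaries, spending and average rating relate to points and final rank?
print(pl[["salaries", "spending", "MOY", "points", "rank"]].corr())

# Average points per finishing position across the 2015-2023 seasons
print(pl.groupby("rank")["points"].mean())
```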
For further information, visit: foot
The dataset is a CSV file compiled using a Python scraper built with Reddit's PRAW API. The raw data is a list of 3-tuples of [username, subreddit, utc timestamp]. Each row represents a single comment made by the user, covering about 5 days' worth of Reddit data. Note that the actual comment text is not included, only the user, subreddit, and timestamp of the user's comment. The goal of the dataset is to provide a lens for discovering user patterns from Reddit metadata alone. The original use case was to compile a dataset suitable for training a neural network for a subreddit recommender system. That final system can be found here
A very unpolished EDA for the dataset can be found here. Note that the published dataset is only half of the one used in the EDA and recommender system, to meet Kaggle's 500MB size limit.
user - The username of the person submitting the comment
subreddit - The title of the subreddit the user made the comment in
utc_stamp - The UTC timestamp of when the user made the comment
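Given those three columns, a small usage sketch (the file name is an assumption, and utc_stamp is assumed to be epoch seconds):

```python
import pandas as pd

comments = pd.read_csv("reddit_comments.csv")  # file name is an assumption

# Convert the UTC timestamp to datetime (assumes epoch seconds; adjust if already a datetime string)
comments["utc_stamp"] = pd.to_datetime(comments["utc_stamp"], unit="s")

# Most active subreddits, and the users who roam across the most subreddits
print(comments["subreddit"].value_counts().head(10))
print(comments.groupby("user")["subreddit"].nunique().sort_values(ascending=False).head(10))
```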
The dataset was compiled as part of a school project. The final project report, with my collaborators, can be found here
We were able to build a pretty cool subreddit recommender with the dataset. A blog post about it can be found here, and the standalone Jupyter notebook here. Our final model is very undertuned, so there are definitely improvements to be made there, but I think there are many other cool data projects and visualizations that could be built from this dataset. One example would be to analyze the spread of users through the Reddit ecosystem: whether the average user clusters in close communities, or traverses far and wide to different corners. If you do end up building something on this, please share! And have fun!
Released under Reddit's API licence
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
🧑💼 Employee Performance and Salary Dataset
This synthetic dataset simulates employee information in a medium-sized organization, designed specifically for data preprocessing and exploratory data analysis (EDA) tasks in Data Mining and Machine Learning labs.
It includes over 1,000 employee records with realistic variations in age, gender, department, experience, performance score, and salary — along with missing values, duplicates, and outliers to mimic real-world data quality issues.
| Column Name | Description |
|---|---|
| Employee_ID | Unique employee identifier (E0001, E0002, …) |
| Age | Employee age (22–60 years) |
| Gender | Gender of the employee (Male/Female) |
| Department | Department where the employee works (HR, Finance, IT, Marketing, Sales, Operations) |
| Experience_Years | Total years of work experience (contains missing values) |
| Performance_Score | Employee performance score (0–100, contains missing values) |
| Salary | Annual salary in USD (contains outliers) |
Salary → Predict salary based on experience, performance, department, and age.
Performance_Score → Predict employee performance based on age, experience, and department.
Predict the employee's salary based on their experience, performance score, and department.
X = ['Age', 'Experience_Years', 'Performance_Score', 'Department', 'Gender']
y = ['Salary']
You can apply:
R², MAE, MSE, RMSE, and residual plots.
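A minimal sketch of that salary regression, using the columns from the table above (only the file name is an assumption):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("employees.csv")  # file name is an assumption

X = df[["Age", "Experience_Years", "Performance_Score", "Department", "Gender"]]
y = df["Salary"]

# Impute missing numeric values and one-hot encode the categoricals
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Age", "Experience_Years", "Performance_Score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department", "Gender"]),
])
model = Pipeline([("pre", pre), ("reg", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```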
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Unlock insights into crowding, sales trends, and delivery optimization using public events, weather, and paydays.
This dataset captures public events, holidays, weather conditions, and financial factors that influence crowding, consumer behavior, and online deliveries across Saudi Arabia.
Key Highlights:
✅ Covers multiple Saudi cities with rich event data.
✅ Includes weather conditions affecting business & logistics.
✅ Tracks paydays & school schedules for demand forecasting.
✅ Ideal for crowding prediction, sales analysis, and delivery optimization.
Each row represents a daily snapshot of city conditions with the following variables:
- DateG – Gregorian date (YYYY-MM-DD).
- DateH – Hijri date.
- Day – Day of the week (Sunday, Monday, etc.).
- Holiday Name – Name of the holiday (if applicable).
- Type of Public Holiday – National, Religious, or School-related holidays.
- Event – Major events (e.g., festivals, matches, etc.).
- Match – Includes Premier League & KSA League games.
- Weather condition flags – Cloudy, Fog, Rain, Widespread Dust, Blowing Dust, etc.
- City – Name of the city.
- Effect on City – Expected impact (e.g., increased traffic, reduced mobility).
- Pay Day – Indicates whether it was a salary payout day.
- days till next payday – How many days until the next salary payout.
- days after payday – How many days after the last payday.
- days after school – Number of days since school ended.
- days before school – Number of days until school resumes.

This dataset can be leveraged for:
📌 Crowding Prediction – Identify peak congestion periods based on holidays, weather, and events.
📌 Sales & Demand Forecasting – Analyze payday effects on consumer spending & delivery volumes.
📌 Delivery Optimization – Find the best times for online deliveries to avoid congestion.
📌 Weather Impact Analysis – Study how dust storms & rain affect mobility & e-commerce.
📌 Event-driven Business Planning – Plan logistics around national events & sports matches.
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("saudi_events.csv")
# Convert date column to datetime format
df['DateG'] = pd.to_datetime(df['DateG'])
# Plot orders over time
plt.figure(figsize=(10,5))
df.groupby('DateG')['days after payday'].mean().plot()
plt.title("Effect of Payday on Consumer Activity")
plt.xlabel("Date")
plt.ylabel("Days After Payday")
plt.show()
1️⃣ Download the dataset and load it into Python or R.
2️⃣ Perform EDA to uncover insights into crowding & spending patterns.
3️⃣ Use classification models to predict crowding based on weather, holidays & city impact.
4️⃣ Apply time-series forecasting for sales & delivery demand projections.
📊 Multidimensional Insights – Combines weather, paydays, and events for a complete picture of crowding & sales trends.
📌 Business & Logistics Applications – Helps companies plan deliveries, optimize marketing, and predict demand.
⚡ Unique & Rich Data – A rare dataset covering Saudi Arabia's socio-economic events & crowd impact.
This dataset is a powerful tool for online delivery companies, businesses, and city planners looking to optimize operations. By analyzing external factors like holidays, paydays, weather, and events, we can predict crowding, improve delivery timing, and forecast sales trends.
🚀 We welcome feedback and contributions! If you find this dataset useful, please ⭐ it on Kaggle and share your insights!
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Test_Score → Predict test score based on study hours, attendance, age, and gender.
Predict the student’s test score using their study hours, attendance percentage, and age.
🧠 Sample Features:
X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
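A minimal sketch of that workflow with the columns from the table above (only the file name is an assumption):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("students.csv").drop_duplicates()  # file name is an assumption

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

# Impute missing numeric values and one-hot encode Gender
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Study_Hours", "Attendance(%)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
model = Pipeline([("pre", pre), ("reg", RandomForestRegressor(random_state=42))])

# 5-fold cross-validated R^2 as a quick baseline
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```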
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Don't forget to upvote if you enjoy my work :)
Hi! I'm a big fan of the GTA series and recently got back into playing GTA San Andreas - a game I still love after all these years. I thought it would be fun to analyze the internal car data from the game files like a data scientist would.
This dataset contains detailed handling and performance statistics of 162 cars from the legendary game Grand Theft Auto: San Andreas. The data originates from the game's internal configuration files and provides a technical breakdown of each vehicle’s physical and mechanical attributes.
Each row represents one vehicle with columns including:
With 35 unique attributes, this dataset is ideal for: 📈 Exploratory Data Analysis (EDA), 📊 Data Visualization, 🤖 Machine Learning, 🔧 Physics or game logic analysis, 🎮 Reverse engineering game mechanics, 🧪 Feature importance / ranking of in-game vehicle performance.
I’ve also included a Jupyter Notebook with EDA to showcase some interesting insights from this data. You're welcome to fork, explore, or build your own models on top of it!
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Edge-Fog Financial Transactions Dataset Overview
Synthesized Dataset Overview This dataset is a synthetic financial transactions dataset tailored for financial crimes detection in edge and fog computing environments. It simulates transactional activities that could be monitored across decentralized computing layers (e.g., ATMs, mobile apps, IoT financial devices) to train AI/ML models for detecting financial fraud.
| Feature | Description |
|---|---|
| TransID | Unique transaction identifier |
| AcctNo | Synthetic account number |
| AcctName | Account holder name |
| TransAmount | Transaction amount (some negative values may simulate refunds/fraud) |
| TrnsMode | Mode of transaction (ATM, POS, USSDC, etc.) |
| TrnsType | Transaction type (Transfer, Withdrawal, Deposit) |
| TrnsDate | Date of transaction |
| TrnNature | Role in transaction (Source or Destination) |
| MACAdres | Device MAC address simulating IoT or edge device |
| IPAdres | IP address simulating the device/network location |
| Protocol | Communication protocol used (HTTP, UDP, ICMP, etc.) |
| Length | Size of the data packet (used to simulate network activity) |
Preliminary Exploratory Data Analysis (EDA)
1. Missing Values: No missing values; the dataset is fully populated.
2. Categorical Variables: TrnsMode, TrnsType, Protocol, and TrnNature show realistic variation: 5 transaction modes (ATM, Bank, POS, USSDC, Mobile), 3 transaction types (Transfer, Withdrawal, Deposit), and 5 network protocols (HTTP, UDP, ICMP, etc.).
3. Numerical Distributions: TransAmount has mean ~$3,452, std ~$9,585, min -$908, max ~$49,996; it is highly skewed, indicating outliers or potential fraud. Length (network packet size) ranges from 60 to 1500, within the expected network transmission range.
4. Date Field (TrnsDate): Covers multiple years (2023–2025), suitable for time-series modeling.

Suitability for Machine Learning
The dataset is well-suited for:
- Supervised Learning: if labels (fraud/non-fraud) are introduced or derived.
- Unsupervised Learning: anomaly detection using clustering or density-based methods.
- Ensemble Methods: feature variety and volume allow robust ensemble modeling.
- Deep Learning: a rich and diverse feature space suitable for sequential or graph models in fraud detection.
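For the unsupervised route, a minimal anomaly-detection sketch over the numeric fields (the file name and the 1% contamination rate are assumptions):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

tx = pd.read_csv("edge_fog_transactions.csv")  # file name is an assumption

# Unsupervised anomaly detection on the numeric fields
features = tx[["TransAmount", "Length"]]
iso = IsolationForest(contamination=0.01, random_state=42)
tx["anomaly"] = iso.fit_predict(features)  # -1 = flagged anomaly, 1 = normal

print(tx["anomaly"].value_counts())
print(tx.loc[tx["anomaly"] == -1, ["TransID", "TransAmount", "TrnsMode", "TrnsType"]].head())
```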
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Dataset Overview This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.
What’s Inside?
- Phone Name & Release Year: Quickly reference the time frame and model.
- Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
- Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
- Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
- Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
- Operating System: Note the Android version at release.
- Price (USD): Examine relative pricing trends over the years.

How to Use This Dataset
- Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.
- Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]
- Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling. For example, create a simple regression model to predict price based on features like RAM and display size (a minimal sketch follows this list). Example: train a regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices.
- Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there's a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
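The minimal regression sketch referenced above, assuming the column labels used in the earlier examples plus an illustrative 'Display Size (inches)' column (file and column names are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

phones = pd.read_csv("samsung_phones.csv")  # file name is an assumption

# Predict price from RAM, battery capacity and display size (illustrative only)
X = phones[["RAM (GB)", "Battery Capacity (mAh)", "Display Size (inches)"]]
y = phones["Price (USD)"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))
```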
Recommended Tools & Libraries
- Python & Pandas: For data cleaning, manipulation, and initial analysis.
- Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
- scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
- Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

Disclaimer
This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It's intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.
Important Note: The "Top 250 French Movies" dataset comprises information on the highest-rated French movies according to user ratings on various platforms. This dataset contains 250 unique French movies that have garnered critical acclaim and popularity among viewers. Each movie is associated with essential details, including its rank, title, release year, duration, genre, IMDb rating, image source link, and a brief description.
This dataset is intended for learning, research, and analysis purposes. The movie ratings and details provided in the dataset are based on publicly available information at the time of scraping. As IMDb ratings and movie information may change over time, it is essential to verify and update the data for the latest information.
By using this dataset, you acknowledge that the accuracy and completeness of the information cannot be guaranteed, and you assume responsibility for any analysis or decision-making based on the data. Additionally, please adhere to IMDb's terms of use and copyright policies when using the data for any public dissemination or commercial purposes.
Data Analysis Tasks:
1. Exploratory Data Analysis (EDA): Explore the distribution of movies by genres, release years, and IMDb ratings. Visualize the top-rated French movies and their IMDb ratings using bar charts or histograms.
2. Year-wise Trends: Observe trends in French movie production over the years using line charts or area plots. Analyze if there's any correlation between release year and IMDb ratings.
3. Word Cloud Analysis: Create word clouds from movie descriptions to visualize the most common words and themes among the top-rated French movies. This can provide insights into popular topics and genres.
4. Network Analysis: Build a network graph connecting French movies that share common actors or directors. Analyze the interconnectedness of movies based on their production teams.
Machine Learning Tasks:
1. Movie Recommendation System: Implement a content-based recommendation system that suggests French movies based on similarities in genre, release year, and IMDb ratings. Use techniques like cosine similarity or Jaccard similarity to measure movie similarities.
2. Movie Genre Classification: Build a multi-class classification model to predict the genre of a French movie based on its description. Utilize Natural Language Processing (NLP) techniques like text preprocessing, TF-IDF, or word embeddings. Use classifiers like Logistic Regression, Naive Bayes, or Support Vector Machines.
3. Movie Sentiment Analysis: Perform sentiment analysis on movie descriptions to determine the overall sentiment (positive, negative, neutral) of each movie. Use sentiment lexicons or pre-trained sentiment analysis models.
4. Movie Rating Prediction: Develop a regression model to predict the IMDb rating of a French movie based on features like genre, release year, and description sentiment. Employ regression algorithms like Linear Regression, Decision Trees, or Random Forests.
5. Movie Clustering: Apply unsupervised clustering algorithms to group French movies with similar attributes. Use features like genre, IMDb rating, and release year to identify movie clusters. Experiment with algorithms like K-means clustering or hierarchical clustering.
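A minimal sketch of the genre-classification task (task 2 above), assuming description and genre columns (file and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

movies = pd.read_csv("top_250_french_movies.csv")  # file/column names are assumptions

X_train, X_test, y_train, y_test = train_test_split(
    movies["description"].astype(str), movies["genre"], test_size=0.2, random_state=42)

# TF-IDF text features feeding a linear classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
```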
Important Note: Ensure that the data is appropriately preprocessed and encoded for machine learning tasks. Handle any missing values, perform feature engineering, and split the dataset into training and testing sets. Evaluate the performance of each machine learning model using appropriate metrics such as accuracy, precision, recall, or Mean Squared Error (MSE) depending on the task.
It is crucial to remember that the performance of machine learning models may vary based on the dataset's size and quality. Interpret the results carefully and consider using cross-validation techniques to assess model generalization.
Lastly, please adhere to IMDb's terms of use and any applicable data usage policies while conducting data analysis and implementing machine learning models with this dataset.
Community Data License Agreement - Permissive 1.0: https://cdla.io/permissive-1-0/
You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.
Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.
Needs_Maintenance: This binary label indicates whether the road segment requires immediate maintenance, defined by the following rule: Needs_Maintenance = 1 if the rule's condition is met, Needs_Maintenance = 0 otherwise.

| Column Name | Description |
|---|---|
| Segment ID | Unique identifier for the road segment |
| PCI | Pavement Condition Index (0 = worst, 100 = best) |
| Road Type | Type of road (Primary, Secondary, Barangay) |
| AADT | Average Annual Daily Traffic |
| Asphalt Type | Asphalt mix classification (e.g. Dense, Open-graded, SMA) |
| Last Maintenance | Year of the last major maintenance |
| Average Rainfall | Average annual rainfall in the area (mm) |
| Rutting | Depth of rutting (mm) |
| IRI | International Roughness Index (m/km) |
| Needs Maintenance | Target label: 1 if urgent maintenance is needed, 0 otherwise |
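For orientation, a baseline classification sketch using the columns above might look like this (the file name is an assumption; this is a sketch, not the required solution):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

roads = pd.read_csv("road_segments.csv")  # file name is an assumption

num_cols = ["PCI", "AADT", "Last Maintenance", "Average Rainfall", "Rutting", "IRI"]
cat_cols = ["Road Type", "Asphalt Type"]

X = roads[num_cols + cat_cols]
y = roads["Needs Maintenance"]

# Pass numeric columns through; one-hot encode the categoricals
pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=100, random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```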
Using this 1 050 000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:
You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but NO AI-assisted tools or generative models are permitted, not even for code snippets or data exploration.
1. Distribution Insight: the distribution of IRI and a comment on its skewness.
2. Correlation or Relationship: Rutting vs. Average Rainfall, plus calculation of a Pearson or Spearman correlation.
3. Group Comparison: AADT by Road Type and a bar chart.
4. Derived Feature Analysis: decay = Rutting / Last Maintenance, then describe its summary statistics and plot it.
5. Conditional Probability or Rate: the rate of Needs Maintenance = 1 within each Road Type, visualized as a line plot.

You must deliver: