MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains clinical and diagnostic features related to Breast Cancer, designed for comprehensive Exploratory Data Analysis (EDA) and subsequent predictive modeling.
It is derived from digitized images of Fine Needle Aspirates (FNA) of breast masses.
The dataset features quantitative measurements, typically calculated from the characteristics of cell nuclei, including:
- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave Points
- Symmetry
- Fractal Dimension
These features are provided as mean, standard error, and "worst" (largest) values.
The primary goal of this resource is to support the validation of EDA techniques necessary for clinical data science:
- Data quality assessment (missing values, inconsistencies)
- Feature assessment (distributions, correlations)
- Visualization for diagnostic modeling
The primary target variable is the binary classification of the tissue sample: Malignant vs. Benign.
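A minimal EDA sketch along these lines, assuming the data sits in a single CSV with a diagnosis column coded M/B and feature columns named in the radius_mean / radius_se / radius_worst style (the file name and column naming are assumptions):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv("breast_cancer.csv")

# Data quality assessment: missing values and duplicate rows
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Class balance of the binary target (column name 'diagnosis' is an assumption)
print(df["diagnosis"].value_counts(normalize=True))

# Feature assessment: distributions and correlations of the mean-value features
mean_cols = [c for c in df.columns if c.endswith("_mean")]
print(df[mean_cols].describe())
print(df[mean_cols].corr())
```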
Suggested analyses:
- Analyze the closing price of all the stocks.
- Analyze the total volume of stock traded each day.
- Analyze the daily price change in each stock.
- Analyze the monthly mean of the close feature.
- Analyze whether the stock prices of these tech companies are correlated.
- Analyze the daily return of each stock and how the returns are correlated.
- Perform a Value at Risk (VaR) analysis for the tech companies.
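A sketch of the daily-return and correlation steps, assuming one CSV per ticker with Date and Close columns (the ticker list and file names are assumptions):

```python
import pandas as pd

tickers = ["AAPL", "MSFT", "GOOG", "AMZN"]  # assumed set of tech tickers

# Date-indexed frame of closing prices, one column per ticker
closes = pd.DataFrame({
    t: pd.read_csv(f"{t}.csv", parse_dates=["Date"]).set_index("Date")["Close"]
    for t in tickers
})

# Daily returns and their correlation across companies
returns = closes.pct_change().dropna()
print(returns.corr())

# Monthly mean of the close feature
print(closes.groupby(closes.index.to_period("M")).mean().head())

# Simple historical Value at Risk (5% quantile of daily returns)
print(returns.quantile(0.05))
```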
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Hosted by: Walsoft Computer Institute
Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.
As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve the program.
Answer this central question:
“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”
You are required to analyze and provide actionable insights for the following three areas:
Should entry exams remain the primary admissions filter?
Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.
✅ Deliverables:
Are there at-risk student groups who need extra support?
Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.
✅ Deliverables:
How can we allocate resources for maximum student success?
Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.
✅ Deliverables:
| Column | Description |
|---|---|
| fNAME, lNAME | Student first and last name |
| Age | Student age (21–71 years) |
| gender | Gender (standardized as "Male"/"Female") |
| country | Student's country of origin |
| residence | Student housing/residence type |
| entryEXAM | Entry test score (28–98) |
| prevEducation | Prior education (High School, Diploma, etc.) |
| studyHOURS | Total study hours logged |
| Python | Final Python exam score |
| DB | Final Database exam score |
You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.
Download: bi.csv
This dataset includes common data quality challenges:
Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using the column mean or another suitable imputation strategy
Participants using this dataset are expected to apply data cleaning techniques such as:
- String standardization
- Null value imputation
- Type correction (e.g., scores as float)
- Validation and visual verification
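A minimal cleaning sketch along these lines, using the mappings listed above (treat it as a starting point rather than a complete solution):

```python
import pandas as pd

df = pd.read_csv("bi.csv")

# Standardize country names
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})

# Unify residence type variants
df["residence"] = df["residence"].replace(
    {"BI-Residence": "BI Residence", "BIResidence": "BI Residence", "BI_Residence": "BI Residence"})

# Fix education typos and casing
df["prevEducation"] = df["prevEducation"].replace(
    {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})

# Standardize gender values
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female", "female": "Female"})

# Type correction and null imputation for the Python scores
df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
df["Python"] = df["Python"].fillna(df["Python"].mean())

# Visual verification of the cleaned categorical columns
for col in ["country", "residence", "prevEducation", "gender"]:
    print(col, sorted(df[col].dropna().unique()))
```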
✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
Download: cleaned_bi.csv
This version has been fully standardized and preprocessed:
- All fields cleaned and renamed consistently
- Missing Python scores filled with th...
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Headquartered in Walldorf, Germany, SAP is the market leader in enterprise application software. Founded in 1972, SAP (which stands for "Systems, Applications, and Products in Data Processing") has a rich history of innovation and growth as a true industry leader.
This dataset contains detailed historical stock price data for SAP, covering the period from 09/22/1995 to 06/14/2024. The data is collected from Yahoo Finance and includes daily records of the stock's opening price, highest price, lowest price, closing price, and trading volume. Each entry in the dataset represents a single trading day, providing a comprehensive view of the stock's price movements and market activity.
The purpose of this dataset is to provide analysts, traders, and researchers with accurate and granular historical stock price data for SAP. This data can be used for various applications, including:
Technical Analysis: Identify trends and patterns in the stock's price movements. Calculate technical indicators such as moving averages, RSI, and Bollinger Bands.
Market Sentiment Analysis: Analyze how the stock's price responds to market events and news. Compare the opening and closing prices to understand daily sentiment.
Algorithmic Trading: Develop and test trading algorithms based on historical price and volume data. Use past price movements to simulate trading strategies.
Predictive Modeling: Build models to forecast future prices and trading volumes. Use historical data to identify potential price movements and market trends.
Educational Purposes: Serve as a teaching tool for financial education. Help students and researchers understand the dynamics of stock price changes and market behavior.
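For example, a moving-average and RSI calculation on the daily closes might look like the sketch below (the file name and the exact Date/Close column labels are assumptions):

```python
import pandas as pd

df = pd.read_csv("sap_stock.csv", parse_dates=["Date"]).set_index("Date").sort_index()

# Simple moving averages of the closing price
df["SMA_50"] = df["Close"].rolling(50).mean()
df["SMA_200"] = df["Close"].rolling(200).mean()

# 14-day RSI from daily close-to-close changes
delta = df["Close"].diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = -delta.clip(upper=0).rolling(14).mean()
df["RSI_14"] = 100 - 100 / (1 + gain / loss)

print(df[["Close", "SMA_50", "SMA_200", "RSI_14"]].tail())
```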
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Walmart Inc. is a multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is one of the world's largest companies by revenue and a key player in the retail sector. Walmart's stock is actively traded on major stock exchanges, making it an interesting subject for financial analysis.
This dataset contains historical stock price data for Walmart, sourced directly from Yahoo Finance using the yfinance Python API. The data covers daily stock prices and includes multiple key financial indicators.
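Re-pulling the same series yourself with the yfinance API is straightforward; a minimal sketch (the ticker symbol WMT and the rolling window are the only assumptions):

```python
import yfinance as yf

# Download Walmart's full daily OHLCV history from Yahoo Finance
wmt = yf.Ticker("WMT").history(period="max")
print(wmt[["Open", "High", "Low", "Close", "Volume"]].tail())

# Rolling 30-day standard deviation of daily returns as a quick volatility measure
volatility = wmt["Close"].pct_change().rolling(30).std()
print(volatility.tail())
```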
This notebook performs an extensive EDA to uncover insights into Walmart's stock price trends, volatility, and overall behavior in the stock market. The following analysis steps are included:
This dataset and analysis can be useful for:
- 📡 Stock Market Analysis – Evaluating Walmart’s stock price trends and volatility.
- 🏦 Investment Research – Assisting traders and investors in making informed decisions.
- 🎓 Educational Purposes – Teaching data science and financial analysis using real-world stock data.
- 📊 Algorithmic Trading – Developing trading strategies based on historical stock price trends.
📥 Download the dataset and explore Walmart’s stock performance today! 🚀
Community Data License Agreement - Permissive 1.0: https://cdla.io/permissive-1-0/
In the high-stakes world of professional football, public opinion often forms around emotions, loyalties, and subjective interpretations. The project at hand aims to transcend these biases by delving into a robust, data-driven analysis of Real Madrid's performance in the UEFA Champions League over the past decade.
Through a blend of traditional statistical methods, machine learning models, game theory, psychology, philosophy, and even military strategies, this investigation presents a multifaceted view of what contributes to a football team's success and how performance can be objectively evaluated.
The EDA consists of two layers:
The goal of this analysis is multifaceted:
1. Unveil Hidden Statistics: reveal the underlying patterns often overlooked in casual discussions.
2. Demonstrate the Impact of Probability: how it shapes matches and seasons.
3. Explore Interdisciplinary Influences: including Game Theory, Strategy, Cooperation, Psychology, Physiology, Military Training, Luck, Economics, Philosophy, and even Freudian Analysis.
4. Challenge Subjective Bias: by presenting a well-rounded, evidence-based view of football performance.
This project stands as a testament to the profound complexity of football performance and the nuanced insights that can be derived through rigorous scientific analysis. Whether a data scientist recruiter, football fanatic, or curious mind, the findings herein offer a unique perspective that bridges the gap between passion and empiricism.
Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/
1.1 Industry Landscape of Fast Delivery Services in India
India’s fast delivery ecosystem is characterized by intense competition among multiple players offering expedited grocery and food delivery services with promised delivery windows as low as 10 to 30 minutes. Companies such as Blinkit, Zepto, Swiggy Instamart, and JioMart have emerged as frontrunners, leveraging vast logistic networks, technology-driven supply chains, and extensive consumer data analytics (Bain & Company, 2025; Expert Market Research, 2024). The sector’s growth trajectory is robust, with the online food delivery market alone valued at USD 48.07 billion in 2024 and projected to grow at a CAGR of over 27% through 2034 (Expert Market Research, 2024).
Customer reviews and ratings provide granular feedback on delivery agents’ punctuality, professionalism, order accuracy, and communication. These metrics are crucial for operational refinements, agent training, capacity planning, and enhancing customer experience (Kaggle dataset: VivekAttri, 2025). Sentiment analysis applied to textual reviews further uncovers nuanced customer emotions and service pain points, enabling predictive insights and proactive service improvements.
The focal dataset includes structured customer reviews and numerical ratings collected for fast delivery agents across India’s leading quick-commerce platforms. Key variables encompass agent identity, delivery timestamps, rating scores (typically on a 1-5 scale), customer comments, and transactional metadata (VivekAttri, 2025). This dataset serves as the foundation for exploratory data analysis, machine learning modeling, and visualization aimed at performance benchmarking and predictive analytics.
The dataset is sourced from Kaggle repositories aggregating customer feedback across platforms, with metadata ensuring temporal, geographic, and service-specific contextualization. Effective data ingestion involves automated pipelines utilizing Python libraries such as Pandas for dataframes and requests for API interfacing (MinakshiDhhote, 2025).
Critical preprocessing steps include:
Removal of Redundant and Irrelevant Columns: Columns unrelated to delivery agent performance (e.g., user identifiers when anonymized) are discarded to streamline analysis.
Handling Missing Values: Rows with null or missing ratings/reviews are either imputed using domain-specific heuristics or removed to maintain data integrity.
Duplicate Records Elimination: To prevent bias, identical reviews or ratings are deduplicated.
Text Cleaning for Reviews: Natural language processing (NLP) techniques such as tokenization, stopword removal, lemmatization, and spell correction are applied to textual data to prepare it for sentiment analysis.
Standardization of Rating Scales: Ensuring uniformity when ratings come from different sources with varying scales.
Derived features enhance modeling capabilities:
Sentiment Scores: Using models like VADER or BERT-based classifiers to convert textual reviews into quantifiable sentiment metrics.
Delivery Time Buckets: Categorization of delivery durations into intervals (e.g., under 15 minutes, 15-30 minutes) to analyze performance impact.
Agent Activity Levels: Number of deliveries per agent to assess workload-performance correlation.
Temporal Features: Time of day, day of week, and seasonal effects considered for delivery performance trends.
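As one concrete possibility for the sentiment-score feature, a VADER pass over the review text might look like this (the file name and the review_text/rating column names are assumptions):

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

df = pd.read_csv("delivery_reviews.csv")  # file name is an assumption

sia = SentimentIntensityAnalyzer()

# Compound score in [-1, 1], from strongly negative to strongly positive
df["sentiment"] = df["review_text"].astype(str).map(
    lambda text: sia.polarity_scores(text)["compound"])

# Relate text sentiment to the numeric 1-5 rating
print(df.groupby("rating")["sentiment"].mean())
```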
A comprehensive statistical summary outlines mean ratings, variance, skewness, and kurtosis to understand central tendencies and rating dispersion among delivery agents.
Table 1: Rating Summary Statistics for Delivery Agents (2025 Dataset Sample)
| Metric | Value |
|---|---|
| Mean Rating | 3.8 ± 0.15 |
| Median Rating | 4.0 |
| Standard Deviation | 0.75 |
| Skewness | -0.45 |
| Kurtosis | 2.1 |
| Number of Ratings | 250,000+ |
Data validated with 95% confidence interval from Kaggle 2025 dataset (VivekAttri, 2025).
Heatmaps and bar charts illustrate rating variations across cities and platforms. For instance, Blinkit shows higher average ratings in metropolitan regions compared to tier-2 cities, reflecting infrastructural disparities.
Scatter plots and corr...
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was obtained from tradytics.com on 21 June 2022. By the time you see this code, the dataset will no longer be available on the website; you can only access it on my GitHub.
This dataset captures the options flow of the stock market on 17 June 2022. It contains many tickers and is an excellent dataset for practicing time series analysis and testing your data science skills.
Time - Time when this ticker was caught in the flow.
Sym - The ticker symbol, e.g. AAPL, TSLA, SPY.
C/P - Call or Put trade?
Exp - The expiration date of the contract.
Str - The strike price.
Spot - The stock price at the moment the flow was reported.
Bidask - The bid/ask of the contract.
Orders - The total number of orders for the contract.
Volume - The number of shares traded at the moment this contract was caught.
Premiums - The total money spent on this contract.
Open Interest - The total number of open contracts at the moment this contract was caught.
Diff % - The % difference between the Spot and Strike price.
ITM - Whether the contract was a win or a loss: 0 is LOSS, 1 is WIN.
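A small loading sketch using the columns above (the file name is an assumption, and Premiums is assumed to already be numeric):

```python
import pandas as pd

flow = pd.read_csv("options_flow.csv")  # file name is an assumption

# Parse expirations and compute days to expiry from the flow date (17 June 2022)
flow["Exp"] = pd.to_datetime(flow["Exp"], errors="coerce")
flow["days_to_expiry"] = (flow["Exp"] - pd.Timestamp("2022-06-17")).dt.days

# Total premiums by ticker and call/put side
print(flow.groupby(["Sym", "C/P"])["Premiums"].sum().sort_values(ascending=False).head(10))

# ITM (win) rate by call/put side
print(flow.groupby("C/P")["ITM"].mean())
```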
**NOT FINANCIAL ADVICE**
This is a great dataset for beginners or for coders refreshing their data science skills, and there is no harm if professionals use it either. You can do a lot with it, maybe even build a stock market bot, though that is at your own risk. Enjoy and share your code!
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
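A starter regression sketch under assumed names (the file name, the price target column, and features such as square_footage and bedrooms are all illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("house_prices.csv")  # file name is an assumption

X = df[["square_footage", "bedrooms"]]  # illustrative feature names
y = df["price"]                         # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```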
The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.
Data Analysis Tasks:
1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.
2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.
3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.
4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.
5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
Machine Learning Tasks:
1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).
3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.
4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.
5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
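As a sketch of the stock-movement classification task (task 2 above), assuming the CSV holds Date, Company, Close, and Volume columns (the file name and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("stocks.csv", parse_dates=["Date"])   # file/column names are assumptions
df = df[df["Company"] == "Apple"].sort_values("Date")

# Features: lagged returns and volume change; target: does tomorrow's close rise?
df["return_1d"] = df["Close"].pct_change()
df["return_5d"] = df["Close"].pct_change(5)
df["volume_chg"] = df["Volume"].pct_change()
df["target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
df = df.dropna()

X = df[["return_1d", "return_5d", "volume_chg"]]
y = df["target"]

# Keep chronological order: no shuffling for time-series data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```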
The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.
It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.
This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.
By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.
Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.
In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I've created this dataset using the project: Premier League Statistics Scraping. It contains statistics for Premier League matches from 2015 to 2023. You can use the data for EDA or to predict this season's winner. So, enjoy the data!
Here are some additional details about the features (columns):
1. members: the number of players.
2. foreign_players: the number of foreign players in the team.
3. mean_age: the mean age of all players.
4. salaries: monthly salary charge.
5. spending: transfer expenditure.
6. MOY: average player rating.
7. rank: the rank of the team in the season.
8. points: points gained in the season.
9. BP: goals scored.
10. BC: goals against.
11. DIF: goal difference (BP - BC).
12. Gain: the number of wins.
13. Null: the number of draws.
14. Defeat: the number of losses.
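A quick look at how squad spending and rating relate to results might start like this (the file name is an assumption; column names follow the list above):

```python
import pandas as pd

pl = pd.read_csv("premier_league.csv")  # file name is an assumption

# How do salaries, spending and average rating relate to points and final rank?
print(pl[["salaries", "spending", "MOY", "points", "rank"]].corr())

# Average points per finishing position across the 2015-2023 seasons
print(pl.groupby("rank")["points"].mean())
```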
For further information, visit: foot
The dataset is a CSV file compiled using a Python scraper built with Reddit's PRAW API. The raw data is a list of 3-tuples of [username, subreddit, utc timestamp]. Each row represents a single comment made by the user, covering about 5 days' worth of Reddit data. Note that the actual comment text is not included, only the user, subreddit, and timestamp of the user's comment. The goal of the dataset is to provide a lens for discovering user patterns from Reddit metadata alone. The original use case was to compile a dataset suitable for training a neural network for a subreddit recommender system. That final system can be found here
A very unpolished EDA for the dataset can be found here. Note that the published dataset is only half of the one used in the EDA and recommender system, to meet Kaggle's 500MB size limit.
user - The username of the person submitting the comment
subreddit - The title of the subreddit the user made the comment in
utc_stamp - The UTC timestamp of when the user made the comment
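Given those three columns, a small usage sketch (the file name is an assumption, and utc_stamp is assumed to be epoch seconds):

```python
import pandas as pd

comments = pd.read_csv("reddit_comments.csv")  # file name is an assumption

# Convert the UTC timestamp to datetime (assumes epoch seconds; adjust if already a datetime string)
comments["utc_stamp"] = pd.to_datetime(comments["utc_stamp"], unit="s")

# Most active subreddits, and the users who roam across the most subreddits
print(comments["subreddit"].value_counts().head(10))
print(comments.groupby("user")["subreddit"].nunique().sort_values(ascending=False).head(10))
```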
The dataset was compiled as part of a school project. The final project report, with my collaborators, can be found here
We were able to build a pretty cool subreddit recommender with the dataset. A blog post about it can be found here, and the standalone Jupyter notebook here. Our final model is very undertuned, so there are definitely improvements to be made there, but I think there are many other cool data projects and visualizations that could be built from this dataset. One example would be to analyze the spread of users through the Reddit ecosystem: whether the average user clusters in close communities, or traverses far and wide to different corners. If you do end up building something on this, please share! And have fun!
Released under Reddit's API licence
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
🧑💼 Employee Performance and Salary Dataset
This synthetic dataset simulates employee information in a medium-sized organization, designed specifically for data preprocessing and exploratory data analysis (EDA) tasks in Data Mining and Machine Learning labs.
It includes over 1,000 employee records with realistic variations in age, gender, department, experience, performance score, and salary — along with missing values, duplicates, and outliers to mimic real-world data quality issues.
| Column Name | Description |
|---|---|
| Employee_ID | Unique employee identifier (E0001, E0002, …) |
| Age | Employee age (22–60 years) |
| Gender | Gender of the employee (Male/Female) |
| Department | Department where the employee works (HR, Finance, IT, Marketing, Sales, Operations) |
| Experience_Years | Total years of work experience (contains missing values) |
| Performance_Score | Employee performance score (0–100, contains missing values) |
| Salary | Annual salary in USD (contains outliers) |
Salary → Predict salary based on experience, performance, department, and age.
Performance_Score → Predict employee performance based on age, experience, and department.
Predict the employee's salary based on their experience, performance score, and department.
X = ['Age', 'Experience_Years', 'Performance_Score', 'Department', 'Gender']
y = ['Salary']
You can apply:
R², MAE, MSE, RMSE, and residual plots.
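A minimal sketch of that salary regression, using the columns from the table above (only the file name is an assumption):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("employees.csv")  # file name is an assumption

X = df[["Age", "Experience_Years", "Performance_Score", "Department", "Gender"]]
y = df["Salary"]

# Impute missing numeric values and one-hot encode the categoricals
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Age", "Experience_Years", "Performance_Score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department", "Gender"]),
])
model = Pipeline([("pre", pre), ("reg", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```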
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Unlock insights into crowding, sales trends, and delivery optimization using public events, weather, and paydays.
This dataset captures public events, holidays, weather conditions, and financial factors that influence crowding, consumer behavior, and online deliveries across Saudi Arabia.
Key Highlights:
✅ Covers multiple Saudi cities with rich event data.
✅ Includes weather conditions affecting business & logistics.
✅ Tracks paydays & school schedules for demand forecasting.
✅ Ideal for crowding prediction, sales analysis, and delivery optimization.
Each row represents a daily snapshot of city conditions with the following variables:
- DateG – Gregorian date (YYYY-MM-DD).
- DateH – Hijri date.
- Day – Day of the week (Sunday, Monday, etc.).
- Holiday Name – Name of the holiday (if applicable).
- Type of Public Holiday – National, Religious, or School-related holidays.
- Event – Major events (e.g., festivals, matches, etc.).
- Match – Includes Premier League & KSA League games.
- Weather condition flags – Cloudy, Fog, Rain, Widespread Dust, Blowing Dust, etc.
- City – Name of the city.
- Effect on City – Expected impact (e.g., increased traffic, reduced mobility).
- Pay Day – Indicates whether it was a salary payout day.
- days till next payday – How many days until the next salary payout.
- days after payday – How many days after the last payday.
- days after school – Number of days since school ended.
- days before school – Number of days until school resumes.

This dataset can be leveraged for:
📌 Crowding Prediction – Identify peak congestion periods based on holidays, weather, and events.
📌 Sales & Demand Forecasting – Analyze payday effects on consumer spending & delivery volumes.
📌 Delivery Optimization – Find the best times for online deliveries to avoid congestion.
📌 Weather Impact Analysis – Study how dust storms & rain affect mobility & e-commerce.
📌 Event-driven Business Planning – Plan logistics around national events & sports matches.
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("saudi_events.csv")
# Convert date column to datetime format
df['DateG'] = pd.to_datetime(df['DateG'])
# Plot orders over time
plt.figure(figsize=(10,5))
df.groupby('DateG')['days after payday'].mean().plot()
plt.title("Effect of Payday on Consumer Activity")
plt.xlabel("Date")
plt.ylabel("Days After Payday")
plt.show()
1️⃣ Download the dataset and load it into Python or R.
2️⃣ Perform EDA to uncover insights into crowding & spending patterns.
3️⃣ Use classification models to predict crowding based on weather, holidays & city impact.
4️⃣ Apply time-series forecasting for sales & delivery demand projections.
📊 Multidimensional Insights – Combines weather, paydays, and events for a complete picture of crowding & sales trends.
📌 Business & Logistics Applications – Helps companies plan deliveries, optimize marketing, and predict demand.
⚡ Unique & Rich Data – A rare dataset covering Saudi Arabia's socio-economic events & crowd impact.
This dataset is a powerful tool for online delivery companies, businesses, and city planners looking to optimize operations. By analyzing external factors like holidays, paydays, weather, and events, we can predict crowding, improve delivery timing, and forecast sales trends.
🚀 We welcome feedback and contributions! If you find this dataset useful, please ⭐ it on Kaggle and share your insights!
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Test_Score → Predict test score based on study hours, attendance, age, and gender.
Predict the student’s test score using their study hours, attendance percentage, and age.
🧠 Sample Features:
X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
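A minimal sketch of that workflow with the columns from the table above (only the file name is an assumption):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("students.csv").drop_duplicates()  # file name is an assumption

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

# Impute missing numeric values and one-hot encode Gender
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Study_Hours", "Attendance(%)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
model = Pipeline([("pre", pre), ("reg", RandomForestRegressor(random_state=42))])

# 5-fold cross-validated R^2 as a quick baseline
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```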
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Don't forget to upvote if you enjoy my work :)
Hi! I'm a big fan of the GTA series and recently got back into playing GTA San Andreas - a game I still love after all these years. I thought it would be fun to analyze the internal car data from the game files like a data scientist would.
This dataset contains detailed handling and performance statistics of 162 cars from the legendary game Grand Theft Auto: San Andreas. The data originates from the game's internal configuration files and provides a technical breakdown of each vehicle’s physical and mechanical attributes.
Each row represents one vehicle with columns including:
With 35 unique attributes, this dataset is ideal for: 📈 Exploratory Data Analysis (EDA), 📊 Data Visualization, 🤖 Machine Learning, 🔧 Physics or game logic analysis, 🎮 Reverse engineering game mechanics, 🧪 Feature importance / ranking of in-game vehicle performance.
I’ve also included a Jupyter Notebook with EDA to showcase some interesting insights from this data. You're welcome to fork, explore, or build your own models on top of it!
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Edge-Fog Financial Transactions Dataset Overview
Synthesized Dataset Overview This dataset is a synthetic financial transactions dataset tailored for financial crimes detection in edge and fog computing environments. It simulates transactional activities that could be monitored across decentralized computing layers (e.g., ATMs, mobile apps, IoT financial devices) to train AI/ML models for detecting financial fraud.
| Feature | Description |
|---|---|
| TransID | Unique transaction identifier |
| AcctNo | Synthetic account number |
| AcctName | Account holder name |
| TransAmount | Transaction amount (some negative values may simulate refunds/fraud) |
| TrnsMode | Mode of transaction (ATM, POS, USSDC, etc.) |
| TrnsType | Transaction type (Transfer, Withdrawal, Deposit) |
| TrnsDate | Date of transaction |
| TrnNature | Role in transaction (Source or Destination) |
| MACAdres | Device MAC address simulating IoT or edge device |
| IPAdres | IP address simulating the device/network location |
| Protocol | Communication protocol used (HTTP, UDP, ICMP, etc.) |
| Length | Size of the data packet (used to simulate network activity) |
Preliminary Exploratory Data Analysis (EDA)
1. Missing Values: No missing values; the dataset is fully populated.
2. Categorical Variables: TrnsMode, TrnsType, Protocol, and TrnNature show realistic variation: 5 transaction modes (ATM, Bank, POS, USSDC, Mobile), 3 transaction types (Transfer, Withdrawal, Deposit), and 5 network protocols (HTTP, UDP, ICMP, etc.).
3. Numerical Distributions: TransAmount has mean ~$3,452, std ~$9,585, min -$908, max ~$49,996; it is highly skewed, indicating outliers or potential fraud. Length (network packet size) ranges from 60 to 1500, within the expected network transmission range.
4. Date Field (TrnsDate): Covers multiple years (2023–2025), suitable for time-series modeling.

Suitability for Machine Learning
The dataset is well-suited for:
- Supervised Learning: if labels (fraud/non-fraud) are introduced or derived.
- Unsupervised Learning: anomaly detection using clustering or density-based methods.
- Ensemble Methods: feature variety and volume allow robust ensemble modeling.
- Deep Learning: a rich and diverse feature space suitable for sequential or graph models in fraud detection.
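For the unsupervised route, a minimal anomaly-detection sketch over the numeric fields (the file name and the 1% contamination rate are assumptions):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

tx = pd.read_csv("edge_fog_transactions.csv")  # file name is an assumption

# Unsupervised anomaly detection on the numeric fields
features = tx[["TransAmount", "Length"]]
iso = IsolationForest(contamination=0.01, random_state=42)
tx["anomaly"] = iso.fit_predict(features)  # -1 = flagged anomaly, 1 = normal

print(tx["anomaly"].value_counts())
print(tx.loc[tx["anomaly"] == -1, ["TransID", "TransAmount", "TrnsMode", "TrnsType"]].head())
```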
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Dataset Overview This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.
What’s Inside?
- Phone Name & Release Year: Quickly reference the time frame and model.
- Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
- Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
- Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
- Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
- Operating System: Note the Android version at release.
- Price (USD): Examine relative pricing trends over the years.

How to Use This Dataset
- Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.
- Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]
- Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling. For example, create a simple regression model to predict price based on features like RAM and display size (a minimal sketch follows this list). Example: train a regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices.
- Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there's a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
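The minimal regression sketch referenced above, assuming the column labels used in the earlier examples plus an illustrative 'Display Size (inches)' column (file and column names are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

phones = pd.read_csv("samsung_phones.csv")  # file name is an assumption

# Predict price from RAM, battery capacity and display size (illustrative only)
X = phones[["RAM (GB)", "Battery Capacity (mAh)", "Display Size (inches)"]]
y = phones["Price (USD)"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))
```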
Recommended Tools & Libraries
- Python & Pandas: For data cleaning, manipulation, and initial analysis.
- Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
- scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
- Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

Disclaimer
This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It's intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.
Important Note: The "Top 250 French Movies" dataset comprises information on the highest-rated French movies according to user ratings on various platforms. This dataset contains 250 unique French movies that have garnered critical acclaim and popularity among viewers. Each movie is associated with essential details, including its rank, title, release year, duration, genre, IMDb rating, image source link, and a brief description.
This dataset is intended for learning, research, and analysis purposes. The movie ratings and details provided in the dataset are based on publicly available information at the time of scraping. As IMDb ratings and movie information may change over time, it is essential to verify and update the data for the latest information.
By using this dataset, you acknowledge that the accuracy and completeness of the information cannot be guaranteed, and you assume responsibility for any analysis or decision-making based on the data. Additionally, please adhere to IMDb's terms of use and copyright policies when using the data for any public dissemination or commercial purposes.
Data Analysis Tasks:
1. Exploratory Data Analysis (EDA): Explore the distribution of movies by genres, release years, and IMDb ratings. Visualize the top-rated French movies and their IMDb ratings using bar charts or histograms.
2. Year-wise Trends: Observe trends in French movie production over the years using line charts or area plots. Analyze if there's any correlation between release year and IMDb ratings.
3. Word Cloud Analysis: Create word clouds from movie descriptions to visualize the most common words and themes among the top-rated French movies. This can provide insights into popular topics and genres.
4. Network Analysis: Build a network graph connecting French movies that share common actors or directors. Analyze the interconnectedness of movies based on their production teams.
Machine Learning Tasks:
1. Movie Recommendation System: Implement a content-based recommendation system that suggests French movies based on similarities in genre, release year, and IMDb ratings. Use techniques like cosine similarity or Jaccard similarity to measure movie similarities.
2. Movie Genre Classification: Build a multi-class classification model to predict the genre of a French movie based on its description. Utilize Natural Language Processing (NLP) techniques like text preprocessing, TF-IDF, or word embeddings. Use classifiers like Logistic Regression, Naive Bayes, or Support Vector Machines.
3. Movie Sentiment Analysis: Perform sentiment analysis on movie descriptions to determine the overall sentiment (positive, negative, neutral) of each movie. Use sentiment lexicons or pre-trained sentiment analysis models.
4. Movie Rating Prediction: Develop a regression model to predict the IMDb rating of a French movie based on features like genre, release year, and description sentiment. Employ regression algorithms like Linear Regression, Decision Trees, or Random Forests.
5. Movie Clustering: Apply unsupervised clustering algorithms to group French movies with similar attributes. Use features like genre, IMDb rating, and release year to identify movie clusters. Experiment with algorithms like K-means clustering or hierarchical clustering.
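A minimal sketch of the genre-classification task (task 2 above), assuming description and genre columns (file and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

movies = pd.read_csv("top_250_french_movies.csv")  # file/column names are assumptions

X_train, X_test, y_train, y_test = train_test_split(
    movies["description"].astype(str), movies["genre"], test_size=0.2, random_state=42)

# TF-IDF text features feeding a linear classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
```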
Important Note: Ensure that the data is appropriately preprocessed and encoded for machine learning tasks. Handle any missing values, perform feature engineering, and split the dataset into training and testing sets. Evaluate the performance of each machine learning model using appropriate metrics such as accuracy, precision, recall, or Mean Squared Error (MSE) depending on the task.
It is crucial to remember that the performance of machine learning models may vary based on the dataset's size and quality. Interpret the results carefully and consider using cross-validation techniques to assess model generalization.
Lastly, please adhere to IMDb's terms of use and any applicable data usage policies while conducting data analysis and implementing machine learning models with this dataset.
Community Data License Agreement - Permissive 1.0: https://cdla.io/permissive-1-0/
You are a data analyst for a city engineering office tasked with identifying which road segments require urgent maintenance. The office has collected inspection data on various roads, including surface conditions, traffic volume, and environmental factors.
Your goal is to analyze this data and build a binary classification model to predict whether a given road segment needs maintenance, based on pavement and environmental indicators.
Needs_Maintenance: This binary label indicates whether the road segment requires immediate maintenance, defined by the following rule: Needs_Maintenance = 1 if the rule's condition is met, Needs_Maintenance = 0 otherwise.

| Column Name | Description |
|---|---|
| Segment ID | Unique identifier for the road segment |
| PCI | Pavement Condition Index (0 = worst, 100 = best) |
| Road Type | Type of road (Primary, Secondary, Barangay) |
| AADT | Average Annual Daily Traffic |
| Asphalt Type | Asphalt mix classification (e.g. Dense, Open-graded, SMA) |
| Last Maintenance | Year of the last major maintenance |
| Average Rainfall | Average annual rainfall in the area (mm) |
| Rutting | Depth of rutting (mm) |
| IRI | International Roughness Index (m/km) |
| Needs Maintenance | Target label: 1 if urgent maintenance is needed, 0 otherwise |
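For orientation, a baseline classification sketch using the columns above might look like this (the file name is an assumption; this is a sketch, not the required solution):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

roads = pd.read_csv("road_segments.csv")  # file name is an assumption

num_cols = ["PCI", "AADT", "Last Maintenance", "Average Rainfall", "Rutting", "IRI"]
cat_cols = ["Road Type", "Asphalt Type"]

X = roads[num_cols + cat_cols]
y = roads["Needs Maintenance"]

# Pass numeric columns through; one-hot encode the categoricals
pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=100, random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```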
Using this 1 050 000-row dataset, perform at least five (5) distinct observations. An observation may combine one or more of the following:
You may consult official documentation online (e.g., pandas.pydata.org, matplotlib.org, seaborn.pydata.org, numpy.org), but NO AI-assisted tools or generative models are permitted, not even for code snippets or data exploration.
1. Distribution Insight: the distribution of IRI and a comment on its skewness.
2. Correlation or Relationship: Rutting vs. Average Rainfall, plus calculation of a Pearson or Spearman correlation.
3. Group Comparison: AADT by Road Type and a bar chart.
4. Derived Feature Analysis: decay = Rutting / Last Maintenance, then describe its summary statistics and plot it.
5. Conditional Probability or Rate: the rate of Needs Maintenance = 1 within each Road Type, visualized as a line plot.

You must deliver: