Facebook
TwitterThis dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.
Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Anubhav Kumar Gupta
Released under Apache 2.0
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Hussein Al Chami
Released under MIT
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Rashmi Tiwari
Released under CC0: Public Domain
Facebook
TwitterThis dataset was created by Sohail K. Nikouzad
Facebook
TwitterThis dataset was created by Deepak Kalvankar
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains data more than 9.5k car sales in Ukraine.Most of then are used car so it open the possibility to analyze featurs related to car operation. This is subset of all car data in Ukraine. Using this we will analyze the various parameter of used car sales in Ukraine.
1.1 Introduction: This Exploratory Data Analysis is to practice python skills till now on a structured dataset including loading, inspecting,wrangling,Exploring and drawing conclusions from the data.The notebook has the obeservations with each step in order to explain throughtly how to approach the dataset. Based on the obseravation some quetions also are answered in the notebook for the reference though not all them are explored in the analysis.
1.2 Data Source and Dataset: a. How was it collected?
Name: Car Sales Sponsering Organization: Dont Know! Year :2019 Description: This is case study of more than 9.5k car sales in Ukraine. b. it is sample? If yes ,What is properly sampled?
Yes .It is sample .We dont have official information about the data collection method, but its appears not to be random sample, so we can assume that it is not representative.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This is a cleaned version of a Netflix movies dataset prepared for exploratory data analysis (EDA). Missing values have been handled, invalid rows removed, and numerical + categorical columns cleaned for analysis using Python and Pandas.
Facebook
TwitterThis is a transactions data from an Electronics store chain in the US. The data contains 12 CSV files for each month of 2019.
The naming convention is as follows: Sales_[MONTH_NAME]_2019
Each file contains anywhere from around 9000 to 26000 rows and 6 columns. The columns are as follows:
Order ID, Product, Quantity Ordered, Price Each, Order Date, Purchase Address
There are around 186851 data points combining all the 12-month files. There may be null values in some rows.
Facebook
TwitterThis dataset was created by Bhagya sree
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile
Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.
As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:
Answer this central question:
“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”
You are required to analyze and provide actionable insights for the following three areas:
Should entry exams remain the primary admissions filter?
Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.
✅ Deliverables:
Are there at-risk student groups who need extra support?
Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.
✅ Deliverables:
How can we allocate resources for maximum student success?
Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.
✅ Deliverables:
| Column | Description |
|---|---|
fNAME, lNAME | Student first and last name |
Age | Student age (21–71 years) |
gender | Gender (standardized as "Male"/"Female") |
country | Student’s country of origin |
residence | Student housing/residence type |
entryEXAM | Entry test score (28–98) |
prevEducation | Prior education (High School, Diploma, etc.) |
studyHOURS | Total study hours logged |
Python | Final Python exam score |
DB | Final Database exam score |
You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.
Download: bi.csv
This dataset includes common data quality challenges:
Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using column mean or suitable imputation strategy
Participants using this dataset are expected to apply data cleaning techniques such as:
- String standardization
- Null value imputation
- Type correction (e.g., scores as float)
- Validation and visual verification
✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
Download: cleaned_bi.csv
This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Nikhil raman K
Released under CC0: Public Domain
Facebook
TwitterThis dataset was created by Shivam Chaurasia
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by BharathiD8
Released under Apache 2.0
Facebook
TwitterTailor made data to apply the machine learning models on the dataset. Where the newcomers can easily perform their EDA.
The data consists of all the features of the four wheelers available in the market in 1985. We need to predict the **price of the car ** using Linear Regression or PCA or SVM-R etc.,
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Demographic Analysis of Shopping Behavior: Insights and Recommendations
Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.
Cleaned Data Details: Data cleaned and standardized, 15,079 unique entries with attributes including - Customer ID, age, gender, annual income, and spending score. Can be used by marketing analysts to produce a better strategy for mall specific marketing.
Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.
Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.
Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.
References OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt. Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.
Unlike most public datasets, this one includes a diverse mix of column types:
📅 Date columns (for time series and trend plots) 🔢 Numerical columns (for histograms, boxplots, scatter plots) 🏷️ Categorical columns (for bar charts, group analysis)
Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.
Feel free to:
Create EDA notebooks Practice plotting techniques Experiment with filtering, grouping, and aggregations 🛠️ No missing values, no data cleaning needed — just download and start exploring!
Hope you find this helpful. Looking forward to hearing from you all.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Exploratory Data Analysis (EDA) of ZOMATO BANGALORE DATASET using Python and its libraries (Pandas , Matplotlib and Seaborn ). Analyzed restaurant distribution ,top cuisines ,rating distribution, cost for two and other interesting insights.
Included files: - NOTEBOOK : : ZOMATO_EDA.ipynb -IMAGES : : Visualizations of key insights - requirement.txt : : Python dependencies
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Capstone project for Google Advanced Data Analytics. Dataset to build predictive models to provide insights to the HR department, of a large consulting firm. The HR department wants to improve employee satisfaction at the company. Data cleaning, EDA, visualization, and modeling was completed in Python.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Radhika Padsala
Released under MIT
Facebook
TwitterThis dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.
Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning