THE USE OF MICROSOFT EXCEL IN TITANIC MACHINE LEARNING ON KAGGLE
This is the Titanic dataset, cleaned with the help of Microsoft Excel. I used several Excel functions (e.g. IF, COUNTIFS) to apply one-hot encoding to the categorical features such as Cabin, Embarked, and Sex. The passengers' full names were reduced to their titles ("Mr", "Mrs", "Master", and "Miss") with the help of Excel's "Remove Duplicates" feature, which made it easy to extract only the titles from the full names. The family-related features were encoded into a new feature "IsAlone" using the IF function, where "0" means the passenger travelled alone (no siblings, spouse, or relatives aboard) and "1" means the passenger travelled with at least one other person (spouse, family member, or relative). The "Ticket" feature, which comes in many varieties, was also one-hot encoded with the same IF and COUNTIFS functions to group the tickets taken by the passengers into distinct types. Missing values in the "Age" feature were replaced with the mode, i.e. the age with the highest frequency of occurrence. At the end of this process, the training score was 84.9% (using XGBClassifier) while the test prediction score was 84.3%, a difference of 0.6 percentage points.
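For anyone who would rather do the same cleaning in code, here is a minimal pandas sketch of the equivalent steps (title extraction, the IsAlone flag, mode imputation of Age, and one-hot encoding). It assumes the standard Kaggle Titanic train.csv column names and omits the Cabin and Ticket groupings for brevity, so treat it as an illustrative approximation of the Excel workflow rather than a reproduction of it.

import pandas as pd

# Assumed input: the standard Kaggle Titanic training file
df = pd.read_csv("train.csv")

# Extract the title (Mr, Mrs, Miss, Master, ...) from the Name column
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
df["Title"] = df["Title"].where(df["Title"].isin(["Mr", "Mrs", "Miss", "Master"]), "Other")

# IsAlone flag: 0 = travelled alone, 1 = travelled with at least one other person
df["IsAlone"] = ((df["SibSp"] + df["Parch"]) > 0).astype(int)

# Replace missing Age values with the mode (the most frequent age)
df["Age"] = df["Age"].fillna(df["Age"].mode()[0])

# One-hot encode the remaining categorical features
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"])

print(df.head())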
Titanic Machine Learning leaderboard score: 0.8181, placing in the top 4% of the LB on my first machine learning submission to Kaggle.com.
Note: I have since reached the top 1% on Kaggle with an LB score of 0.8889 after rigorous research into different machine learning approaches.
This small achievement came barely five months after I joined Kaggle and started applying my machine learning knowledge. I acknowledge those behind this amazing platform, Kaggle.com, and I really appreciate everyone who has taken time out to teach online how to clean a dataset using Microsoft Excel functions. I learnt a lot from those videos, and they show what the combination of Excel and Python code can do.
Using Excel functions to clean up a dataset impressed me and showed me how powerful Microsoft Excel can be. Nevertheless, I would love to see whether there is a new or different approach to encoding dataset features and to fixing or replacing missing values in a dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the major cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.

A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose here is to collect characteristics of heart attacks, or the factors that contribute to them. A form was created in Microsoft Excel to accomplish this. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. Age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and troponin test represent the input fields, while the output field indicates the presence of a heart attack, divided into two categories (negative and positive): negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows detailed information and the maximum and minimum attribute values for the 1,319 cases in the whole database.

To confirm the validity of this data, we examined the patient files in the hospital archive and compared them with the data stored in the laboratory system. We also interviewed the patients and specialist doctors. Table 2 is a sample drawn from the 1,320 cases; it shows 44 cases and the factors that lead to a heart attack.

After collecting this data, we checked whether it contained null values (invalid values) or errors introduced during data collection. A value is null if it is unknown, and null values require special treatment: a null indicates that the target is not a valid data element, and attempting to retrieve data that is not present yields null. Arithmetic operations on a numeric column containing one or more null values produce a null result. An example of null-value processing is shown in Figure 2.

The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs receive equal attention and to eliminate their dimensionality. Normalizing the data before applying AI models has two major advantages: first, it prevents attributes in larger numeric ranges from overshadowing attributes in smaller numeric ranges; second, it avoids numerical problems during the process. After completing the normalization, we split the dataset into two parts, a training set and a test set, using 1,060 cases for training and 259 for testing. Modeling was then carried out using the input and output variables.
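Purely as an illustration of the preprocessing described above (min-max scaling of the inputs to [0, 1] followed by a 1,060/259 train/test split), here is a minimal Python sketch using pandas and scikit-learn. The file name and the name of the output column are assumptions for the example; the original work carried these steps out in Microsoft Excel.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumed file and output-column names for illustration
df = pd.read_csv("heart_attack.csv")
X = df.drop(columns=["result"])   # the eight input fields
y = df["result"]                  # negative / positive

# Scale every input to the [0, 1] range, as described in the text
X_scaled = MinMaxScaler().fit_transform(X)

# Split into 1,060 training cases and 259 test cases
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=259, random_state=42)
print(X_train.shape, X_test.shape)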
https://creativecommons.org/publicdomain/zero/1.0/
Business roles at AgroStar require a baseline of analytical skills, and it is also critical that we are able to explain complex concepts in a simple way to a variety of audiences. This test is structured so that someone with the baseline skills needed to succeed in the role should be able to complete this in under 4 hours without assistance.
Use the data in the included sheet to address the following scenario...
Since its inception, AgroStar has been leveraging an assisted marketplace model. Given that the market potential is huge and that the target customer appreciates a physical store nearby, we have taken a call to explore the offline retail model to drive growth. The primary objective is to get a larger wallet share for AgroStar among existing customers.
Assume you are back in time, in August 2018, and you have been asked to determine the location (taluka) of the first AgroStar offline retail store.
1. What are the key factors you would use to determine the location? Why?
2. Which taluka (across the three states) would you look to open in? Why?
-- (1) Please mention any assumptions you have made and the underlying thought process
-- (2) Please treat the assignment as standalone (it should be self-explanatory to someone who reads it), but we will have a follow-up discussion with you in which we will walk through your approach to this assignment.
-- (3) Mention any data that may be missing that would make this study more meaningful
-- (4) Kindly conduct your analysis within the spreadsheet; we would like to see the working sheet. If you face any issues due to the file size, kindly download the file and share an Excel sheet with us.
-- (5) If you would like to append a word document/presentation to summarize, please go ahead.
-- (6) In case you use any external data source/article, kindly share the source.
The file CDNOW_master.txt contains the entire purchase history up to the end of June 1998 of the cohort of 23,570 individuals who made their first-ever purchase at CDNOW in the first quarter of 1997. This CDNOW dataset was first used by Fader and Hardie (2001).
Each record in this file, 69,659 in total, comprises four fields: the customer's ID, the date of the transaction, the number of CDs purchased, and the dollar value of the transaction.
CustID = CDNOW_master(:,1); % customer id
Date   = CDNOW_master(:,2); % transaction date
Quant  = CDNOW_master(:,3); % number of CDs purchased
Spend  = CDNOW_master(:,4); % dollar value (excl. S&H)
See "Notes on the CDNOW Master Data Set" (http://brucehardie.com/notes/026/) for details of how the 1/10th systematic sample (http://brucehardie.com/datasets/CDNOW_sample.zip) used in many papers was created.
Reference:
Fader, Peter S. and Bruce G. S. Hardie (2001), "Forecasting Repeat Sales at CDNOW: A Case Study," Interfaces, 31 (May-June), Part 2 of 2, S94-S107.
I have merged all three datasets into one file and also done some feature engineering.
Available Data: You will be given anonymized user gameplay data in the form of 3 csv files.
Fields in the data are as described below:
Gameplay_Data.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Eventtime: DateTime on which user played the tournament
* Entry_Fee: Entry Fee of tournament
* Win_Loss: ‘W’ if the user won that particular tournament, ‘L’ otherwise
* Winnings: How much money the user won in the tournament (0 for ‘L’)
* Tournament_Type: Type of tournament user played (A / B / C / D)
* Num_Players: Number of players that played in this tournament
Wallet_Balance.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Timestamp: DateTime at which user's wallet balance is given
* Wallet_Balance: User's wallet balance at the given timestamp
Demographic.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Installed_At: Timestamp at which the user installed the app
* Connection_Type: User's internet connection type (Ex: Cellular / Dial Up)
* Cpu_Type: CPU type of the device that the user is playing with
* Network_Type: Network type in encoded form
* Device_Manufacturer: Ex: Realme
* ISP: Internet Service Provider. Ex: Airtel
* Country
* Country_Subdivision
* City
* Postal_Code
* Language: Language that the user has selected for gameplay
* Device_Name
* Device_Type
Build a basic recommendation system that can rank/recommend relevant tournaments and entry prices to the user. The main objectives are:
1. A user should not have to scroll too much before selecting a tournament of their preference.
2. We would like the user to play as high an entry-fee tournament as possible.
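As a hedged starting point rather than a prescribed solution, here is a minimal pandas sketch of a popularity-plus-history heuristic: it scores each (Tournament_Type, Entry_Fee) pair for a user by blending the user's own win rate at that combination with its overall popularity, and breaks ties toward higher entry fees to serve the second objective. Column names follow Gameplay_Data.csv above; the 0.6/0.4 weighting and the placeholder user id are arbitrary illustrative choices.

import pandas as pd

# Load the gameplay log described above
games = pd.read_csv("Gameplay_Data.csv")

# Per-user statistics for each (type, fee) pair: number of plays and win rate
stats = (
    games.assign(won=games["Win_Loss"].eq("W").astype(int))
         .groupby(["Uid", "Tournament_Type", "Entry_Fee"])
         .agg(plays=("won", "size"), win_rate=("won", "mean"))
         .reset_index()
)

# Overall popularity of each (type, fee) pair across all users, scaled to [0, 1]
popularity = (
    games.groupby(["Tournament_Type", "Entry_Fee"]).size()
         .rename("popularity").reset_index()
)
popularity["popularity"] /= popularity["popularity"].max()

def recommend(uid, k=5):
    # Blend personal affinity with popularity; sort ties toward higher entry fees
    user = stats[stats["Uid"] == uid].merge(popularity, on=["Tournament_Type", "Entry_Fee"])
    user["score"] = 0.6 * user["win_rate"] + 0.4 * user["popularity"]
    return user.sort_values(["score", "Entry_Fee"], ascending=False).head(k)

print(recommend("some_uid"))   # "some_uid" is a placeholder user id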
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
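The project itself carries out these steps in Excel and Power Query; purely to make steps 3 and 4 concrete, here is a minimal pandas sketch of the two calculations. It assumes a Superstore-style Orders sheet with Sales, Profit, Discount, and Category columns, and it uses COGS = Sales - Profit and discount value = list price times the discount rate (with list price recovered as Sales / (1 - Discount)) as working definitions; the workbook may define these differently.

import pandas as pd

# Assumed file, sheet, and column names for illustration
orders = pd.read_excel("Superstore.xlsx", sheet_name="Orders")

# Working definition: cost of goods sold = revenue minus profit
orders["COGS"] = orders["Sales"] - orders["Profit"]

# Working definition: discount value = list price * discount rate,
# where list price is recovered as Sales / (1 - Discount)
orders["Discount_Value"] = orders["Sales"] / (1 - orders["Discount"]) * orders["Discount"]

# Roll the metrics up by product category
summary = orders.groupby("Category")[["Sales", "COGS", "Discount_Value", "Profit"]].sum()
print(summary)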
https://creativecommons.org/publicdomain/zero/1.0/
Zomato Food Orders – Data Analysis Project 📌 Description: This dataset contains food order data from Zomato, one of India’s leading food delivery platforms. It includes information on customer orders, order status, restaurants, delivery times, and more. The goal of this project is to explore and analyze key insights around customer behavior, delivery patterns, restaurant performance, and order trends.
🔍 Project Objectives: 📊 Perform Exploratory Data Analysis (EDA)
📦 Analyze most frequently ordered cuisines and items
⏱️ Understand average delivery times and delays
🧾 Identify top restaurants and order volumes
📈 Uncover order trends by time (hour/day/week)
💬 Visualize data using Matplotlib & Seaborn
🧹 Clean and preprocess data (missing values, outliers, etc.)
📁 Dataset Features (Example Columns):
* Order ID - Unique ID for each order
* Customer ID - Unique customer identifier
* Restaurant - Name of the restaurant
* Cuisine - Type of cuisine ordered
* Order Time - Timestamp when the order was placed
* Delivery Time - Timestamp when the order was delivered
* Order Status - Status of the order (Delivered, Cancelled)
* Payment Method - Mode of payment (Cash, Card, UPI, etc.)
* Order Amount - Total price of the order
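To make a couple of the objectives above concrete, here is a minimal pandas sketch that computes delivery duration and plots order volume by hour of day. The file name and exact column spellings are assumptions based on the example columns listed above.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name; column names follow the example columns above
orders = pd.read_csv("zomato_orders.csv", parse_dates=["Order Time", "Delivery Time"])

# Average delivery time in minutes (delivered orders only)
delivered = orders[orders["Order Status"] == "Delivered"].copy()
delivered["delivery_minutes"] = (
    delivered["Delivery Time"] - delivered["Order Time"]
).dt.total_seconds() / 60
print("Average delivery time (min):", round(delivered["delivery_minutes"].mean(), 1))

# Order volume by hour of day
orders["hour"] = orders["Order Time"].dt.hour
orders["hour"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Hour of day")
plt.ylabel("Number of orders")
plt.title("Orders by hour of day")
plt.show()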
🛠 Tools & Libraries Used: Python
Pandas, NumPy for data manipulation
Matplotlib, Seaborn for visualization
Excel (for raw dataset preview and checks)
✅ Outcomes: Customer ordering trends by cuisine and location
Time-of-day and day-of-week analysis for peak delivery times
Delivery efficiency evaluation
Business recommendations for improving customer experience