THE USE OF MICROSOFT EXCEL IN TITANIC MACHINE LEARNING ON KAGGLE
This is the Titanic dataset, cleaned with the help of Microsoft Excel. I used several Excel functions (e.g. IF, COUNTIFS) to apply one-hot encoding to the categorical features such as Cabin, Embarked, and Sex. The passengers' full names were reduced to their titles ("Mr", "Mrs", "Master", and "Miss") with the help of Excel's "Remove Duplicates" feature, which made it easy to extract only the titles from the full names. The family-related features were encoded into a new feature "IsAlone" using the IF function, where "0" means the passenger travelled alone (no siblings, spouse, or relatives aboard) and "1" means the passenger travelled with at least one other person (spouse, family member, or relative). The "Ticket" feature, which comes in many varieties, was also one-hot encoded with the same IF and COUNTIFS functions to group the tickets taken by the passengers into distinct types. Missing values in the "Age" feature were replaced with the mode, i.e. the age with the highest frequency of occurrence. At the end of this process, the training score was 84.9% (using XGBClassifier) while the test prediction score was 84.3%, a difference of 0.6 percentage points.
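For anyone who would rather do the same cleaning in code, here is a minimal pandas sketch of the equivalent steps (title extraction, the IsAlone flag, mode imputation of Age, and one-hot encoding). It assumes the standard Kaggle Titanic train.csv column names and omits the Cabin and Ticket groupings for brevity, so treat it as an illustrative approximation of the Excel workflow rather than a reproduction of it.

import pandas as pd

# Assumed input: the standard Kaggle Titanic training file
df = pd.read_csv("train.csv")

# Extract the title (Mr, Mrs, Miss, Master, ...) from the Name column
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
df["Title"] = df["Title"].where(df["Title"].isin(["Mr", "Mrs", "Miss", "Master"]), "Other")

# IsAlone flag: 0 = travelled alone, 1 = travelled with at least one other person
df["IsAlone"] = ((df["SibSp"] + df["Parch"]) > 0).astype(int)

# Replace missing Age values with the mode (the most frequent age)
df["Age"] = df["Age"].fillna(df["Age"].mode()[0])

# One-hot encode the remaining categorical features
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"])

print(df.head())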
Titanic Machine Learning leaderboard score: 0.8181, placing in the top 4% of the LB on my first machine learning submission to Kaggle.com.
Note: I have since reached the top 1% on Kaggle with an LB score of 0.8889 after rigorous research into different machine learning approaches.
This small achievement came barely five months after I joined Kaggle and started applying my machine learning knowledge. I acknowledge those behind this amazing platform, Kaggle.com, and I really appreciate everyone who has taken time out to teach online how to clean a dataset using Microsoft Excel functions. I learnt a lot from those videos, and they show what the combination of Excel and Python code can do.
Using Excel functions to clean up a dataset impressed me and showed me how powerful Microsoft Excel can be. Nevertheless, I would love to see whether there is a new or different approach to encoding dataset features and to fixing or replacing missing values in a dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the major cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.

A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose here is to collect characteristics of heart attacks, or the factors that contribute to them. A form was created in Microsoft Excel to accomplish this. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. Age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and troponin test represent the input fields, while the output field indicates the presence of a heart attack, divided into two categories (negative and positive): negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows detailed information and the maximum and minimum attribute values for the 1,319 cases in the whole database.

To confirm the validity of this data, we examined the patient files in the hospital archive and compared them with the data stored in the laboratory system. We also interviewed the patients and specialist doctors. Table 2 is a sample drawn from the 1,320 cases; it shows 44 cases and the factors that lead to a heart attack.

After collecting this data, we checked whether it contained null values (invalid values) or errors introduced during data collection. A value is null if it is unknown, and null values require special treatment: a null indicates that the target is not a valid data element, and attempting to retrieve data that is not present yields null. Arithmetic operations on a numeric column containing one or more null values produce a null result. An example of null-value processing is shown in Figure 2.

The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs receive equal attention and to eliminate their dimensionality. Normalizing the data before applying AI models has two major advantages: first, it prevents attributes in larger numeric ranges from overshadowing attributes in smaller numeric ranges; second, it avoids numerical problems during the process. After completing the normalization, we split the dataset into two parts, a training set and a test set, using 1,060 cases for training and 259 for testing. Modeling was then carried out using the input and output variables.
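Purely as an illustration of the preprocessing described above (min-max scaling of the inputs to [0, 1] followed by a 1,060/259 train/test split), here is a minimal Python sketch using pandas and scikit-learn. The file name and the name of the output column are assumptions for the example; the original work carried these steps out in Microsoft Excel.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumed file and output-column names for illustration
df = pd.read_csv("heart_attack.csv")
X = df.drop(columns=["result"])   # the eight input fields
y = df["result"]                  # negative / positive

# Scale every input to the [0, 1] range, as described in the text
X_scaled = MinMaxScaler().fit_transform(X)

# Split into 1,060 training cases and 259 test cases
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=259, random_state=42)
print(X_train.shape, X_test.shape)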
https://creativecommons.org/publicdomain/zero/1.0/
Business roles at AgroStar require a baseline of analytical skills, and it is also critical that we are able to explain complex concepts in a simple way to a variety of audiences. This test is structured so that someone with the baseline skills needed to succeed in the role should be able to complete this in under 4 hours without assistance.
Use the data in the included sheet to address the following scenario...
Since its inception, AgroStar has been leveraging an assisted marketplace model. Given that the market potential is huge and that the target customer appreciates a physical store nearby, we have taken a call to explore the offline retail model to drive growth. The primary objective is to get a larger wallet share for AgroStar among existing customers.
Assume you are back in time, in August 2018, and you have been asked to determine the location (taluka) of the first AgroStar offline retail store.
1. What are the key factors you would use to determine the location? Why?
2. Which taluka (across the three states) would you look to open in? Why?
-- (1) Please mention any assumptions you have made and the underlying thought process
-- (2) Please treat the assignment as standalone (it should be self-explanatory to someone who reads it), but we will have a follow-up discussion with you in which we will walk through your approach to this assignment.
-- (3) Mention any data that may be missing that would make this study more meaningful
-- (4) Kindly conduct your analysis within the spreadsheet; we would like to see the working sheet. If you face any issues due to the file size, kindly download the file and share an Excel sheet with us.
-- (5) If you would like to append a word document/presentation to summarize, please go ahead.
-- (6) In case you use any external data source/article, kindly share the source.
The file CDNOW_master.txt contains the entire purchase history up to the end of June 1998 of the cohort of 23,570 individuals who made their first-ever purchase at CDNOW in the first quarter of 1997. This CDNOW dataset was first used by Fader and Hardie (2001).
Each record in this file, 69,659 in total, comprises four fields: the customer's ID, the date of the transaction, the number of CDs purchased, and the dollar value of the transaction.
CustID = CDNOW_master(:,1); % customer id
Date   = CDNOW_master(:,2); % transaction date
Quant  = CDNOW_master(:,3); % number of CDs purchased
Spend  = CDNOW_master(:,4); % dollar value (excl. S&H)
See "Notes on the CDNOW Master Data Set" (http://brucehardie.com/notes/026/) for details of how the 1/10th systematic sample (http://brucehardie.com/datasets/CDNOW_sample.zip) used in many papers was created.
Reference:
Fader, Peter S. and Bruce G. S. Hardie (2001), "Forecasting Repeat Sales at CDNOW: A Case Study," Interfaces, 31 (May-June), Part 2 of 2, S94-S107.
I have merged all three datasets into one file and also done some feature engineering.
Available Data: You will be given anonymized user gameplay data in the form of 3 csv files.
Fields in the data are as described below:
Gameplay_Data.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Eventtime: DateTime on which user played the tournament
* Entry_Fee: Entry Fee of tournament
* Win_Loss: ‘W’ if the user won that particular tournament, ‘L’ otherwise
* Winnings: How much money the user won in the tournament (0 for ‘L’)
* Tournament_Type: Type of tournament user played (A / B / C / D)
* Num_Players: Number of players that played in this tournament
Wallet_Balance.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Timestamp: DateTime at which user's wallet balance is given
* Wallet_Balance: User's wallet balance at the given timestamp
Demographic.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Installed_At: Timestamp at which the user installed the app
* Connection_Type: User's internet connection type (Ex: Cellular / Dial Up)
* Cpu_Type: CPU type of the device that the user is playing with
* Network_Type: Network type in encoded form
* Device_Manufacturer: Ex: Realme
* ISP: Internet Service Provider. Ex: Airtel
* Country
* Country_Subdivision
* City
* Postal_Code
* Language: Language that the user has selected for gameplay
* Device_Name
* Device_Type
Build a basic recommendation system that can rank/recommend relevant tournaments and entry prices to the user. The main objectives are:
1. A user should not have to scroll too much before selecting a tournament of their preference.
2. We would like the user to play as high an entry-fee tournament as possible.
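As a hedged starting point rather than a prescribed solution, here is a minimal pandas sketch of a popularity-plus-history heuristic: it scores each (Tournament_Type, Entry_Fee) pair for a user by blending the user's own win rate at that combination with its overall popularity, and breaks ties toward higher entry fees to serve the second objective. Column names follow Gameplay_Data.csv above; the 0.6/0.4 weighting and the placeholder user id are arbitrary illustrative choices.

import pandas as pd

# Load the gameplay log described above
games = pd.read_csv("Gameplay_Data.csv")

# Per-user statistics for each (type, fee) pair: number of plays and win rate
stats = (
    games.assign(won=games["Win_Loss"].eq("W").astype(int))
         .groupby(["Uid", "Tournament_Type", "Entry_Fee"])
         .agg(plays=("won", "size"), win_rate=("won", "mean"))
         .reset_index()
)

# Overall popularity of each (type, fee) pair across all users, scaled to [0, 1]
popularity = (
    games.groupby(["Tournament_Type", "Entry_Fee"]).size()
         .rename("popularity").reset_index()
)
popularity["popularity"] /= popularity["popularity"].max()

def recommend(uid, k=5):
    # Blend personal affinity with popularity; sort ties toward higher entry fees
    user = stats[stats["Uid"] == uid].merge(popularity, on=["Tournament_Type", "Entry_Fee"])
    user["score"] = 0.6 * user["win_rate"] + 0.4 * user["popularity"]
    return user.sort_values(["score", "Entry_Fee"], ascending=False).head(k)

print(recommend("some_uid"))   # "some_uid" is a placeholder user id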
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
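The project itself carries out these steps in Excel and Power Query; purely to make steps 3 and 4 concrete, here is a minimal pandas sketch of the two calculations. It assumes a Superstore-style Orders sheet with Sales, Profit, Discount, and Category columns, and it uses COGS = Sales - Profit and discount value = list price times the discount rate (with list price recovered as Sales / (1 - Discount)) as working definitions; the workbook may define these differently.

import pandas as pd

# Assumed file, sheet, and column names for illustration
orders = pd.read_excel("Superstore.xlsx", sheet_name="Orders")

# Working definition: cost of goods sold = revenue minus profit
orders["COGS"] = orders["Sales"] - orders["Profit"]

# Working definition: discount value = list price * discount rate,
# where list price is recovered as Sales / (1 - Discount)
orders["Discount_Value"] = orders["Sales"] / (1 - orders["Discount"]) * orders["Discount"]

# Roll the metrics up by product category
summary = orders.groupby("Category")[["Sales", "COGS", "Discount_Value", "Profit"]].sum()
print(summary)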
https://creativecommons.org/publicdomain/zero/1.0/
Zomato Food Orders – Data Analysis Project 📌 Description: This dataset contains food order data from Zomato, one of India’s leading food delivery platforms. It includes information on customer orders, order status, restaurants, delivery times, and more. The goal of this project is to explore and analyze key insights around customer behavior, delivery patterns, restaurant performance, and order trends.
🔍 Project Objectives: 📊 Perform Exploratory Data Analysis (EDA)
📦 Analyze most frequently ordered cuisines and items
⏱️ Understand average delivery times and delays
🧾 Identify top restaurants and order volumes
📈 Uncover order trends by time (hour/day/week)
💬 Visualize data using Matplotlib & Seaborn
🧹 Clean and preprocess data (missing values, outliers, etc.)
📁 Dataset Features (Example Columns):
* Order ID - Unique ID for each order
* Customer ID - Unique customer identifier
* Restaurant - Name of the restaurant
* Cuisine - Type of cuisine ordered
* Order Time - Timestamp when the order was placed
* Delivery Time - Timestamp when the order was delivered
* Order Status - Status of the order (Delivered, Cancelled)
* Payment Method - Mode of payment (Cash, Card, UPI, etc.)
* Order Amount - Total price of the order
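To make a couple of the objectives above concrete, here is a minimal pandas sketch that computes delivery duration and plots order volume by hour of day. The file name and exact column spellings are assumptions based on the example columns listed above.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name; column names follow the example columns above
orders = pd.read_csv("zomato_orders.csv", parse_dates=["Order Time", "Delivery Time"])

# Average delivery time in minutes (delivered orders only)
delivered = orders[orders["Order Status"] == "Delivered"].copy()
delivered["delivery_minutes"] = (
    delivered["Delivery Time"] - delivered["Order Time"]
).dt.total_seconds() / 60
print("Average delivery time (min):", round(delivered["delivery_minutes"].mean(), 1))

# Order volume by hour of day
orders["hour"] = orders["Order Time"].dt.hour
orders["hour"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Hour of day")
plt.ylabel("Number of orders")
plt.title("Orders by hour of day")
plt.show()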
🛠 Tools & Libraries Used: Python
Pandas, NumPy for data manipulation
Matplotlib, Seaborn for visualization
Excel (for raw dataset preview and checks)
✅ Outcomes: Customer ordering trends by cuisine and location
Time-of-day and day-of-week analysis for peak delivery times
Delivery efficiency evaluation
Business recommendations for improving customer experience