83 datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Netflix Movies and TV Shows Dataset Cleaned(excel)
kaggle.com
Updated Apr 8, 2025
Cite
Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
Dataset updated
Apr 8, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Gaurav Tawri
Description
This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

🎯 What’s Included:
- Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
- A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.)
- Columns like 'date_added' have been properly formatted into DMY structure
- Multi-valued columns like 'listed_in' are split for better analysis
- Null values replaced with “Unknown” for clarity
- Duration field broken into numeric + unit components
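The cleaning steps listed above were done in Excel; as a rough illustration only, the same three transformations (null replacement, splitting 'listed_in', and splitting the duration) might look like this in Python. The function names, the sample record, and the assumption that durations look like "90 min" or "2 Seasons" are hypothetical and are not taken from the dataset's "formulas_used.txt".

```python
import re


def split_duration(duration):
    """Split text such as '90 min' or '2 Seasons' into (number, unit).

    Returns None when the text does not match the assumed 'number unit' shape.
    """
    match = re.fullmatch(r"\s*(\d+)\s+(\w+)\s*", duration or "")
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def clean_record(record):
    """Return a cleaned copy of one row (a dict of column name to value)."""
    cleaned = {}
    for key, value in record.items():
        # Trim surrounding whitespace, similar in spirit to Excel's TRIM.
        text = value.strip() if isinstance(value, str) else value
        # Empty or missing values become the placeholder "Unknown".
        cleaned[key] = text if text else "Unknown"
    # Multi-valued column: one list entry per genre.
    cleaned["listed_in"] = [g.strip() for g in cleaned["listed_in"].split(",")]
    # Duration: numeric part and unit part in separate fields.
    parts = split_duration(record.get("duration"))
    cleaned["duration_value"], cleaned["duration_unit"] = parts if parts else (None, None)
    return cleaned


# Hypothetical sample row, not copied from the dataset.
sample = {
    "title": "  Example Title ",
    "director": None,
    "listed_in": "Dramas, International Movies",
    "duration": "90 min",
}
result = clean_record(sample)
```

Keeping the numeric part and the unit in separate fields makes it possible to compare movie runtimes without mixing them with season counts.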

🔍 Dataset Purpose: Ideal for beginners and analysts who want to:
- Practice data cleaning in Excel
- Explore Netflix content trends
- Analyze content by type, country, genre, or date added

📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio
qubeshub.org
Updated Jul 16, 2020
Cite
Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
Unique identifier
https://doi.org/10.25334/DRGD-F069
Dataset updated
Jul 16, 2020
Dataset provided by
QUBES
Authors
Shelly Gaynor
Description
Access and clean an open source herbarium dataset using Excel or RStudio.
Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop
search.dataone.org
borealisdata.ca
Updated Jul 31, 2024
Cite
Costanzo, Lucia; Jadon, Vivek (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
Unique identifier
https://doi.org/10.5683/SP3/FF6AI9
Dataset updated
Jul 31, 2024
Dataset provided by
Borealis
Authors
Costanzo, Lucia; Jadon, Vivek
Description
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
Cleaned-Data Pakistan's Largest Ecommerce Dataset
kaggle.com
Updated Mar 25, 2023
Cite
umaraziz97 (2023). Cleaned-Data Pakistan's Largest Ecommerce Dataset [Dataset]. https://www.kaggle.com/datasets/umaraziz97/cleaned-data-pakistans-largest-ecommerce-dataset
Dataset updated
Mar 25, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
umaraziz97
Area covered
Pakistan
Description
Pakistan’s largest ecommerce data – Power BI Report

Dataset Link: pakistan’s_largest_ecommerce_dataset Cleaned Data: Cleaned_Pakistan’s_largest_ecommerce_dataset

Raw Data:

Rows: 584525 Columns: 21

Process:

All the raw data was transformed and saved in a new Excel file: Working – Pakistan Largest Ecommerce Dataset

Processed Data:

Rows: 582250 Columns: 22

Visualization:

Link to the visualization report: Pakistan-s-largest-ecommerce-data-Power-BI-Data-Visualization-Report

Conclusion:

Among categories, Mobiles & Tablets generates the most revenue, selling the highest number of products while also offering the highest discounts. Men’s Fashion sells the second-highest number of products but does not generate revenue in the same proportion, possibly because of the prices of its individual products. In the order details, Mobiles & Tablets has the highest number of canceled orders, while its completed orders are almost the same as Men’s Fashion. Most orders are completed, but a large number are canceled. Among payment methods, cod has the most completed orders, and most canceled orders used the payment method Easyaxis.
Excel-project: Glassdoor Data Cleaning
kaggle.com
zip
Updated Sep 26, 2023
Cite
Luis Lira (2023). Excel-project: Glassdoor Data Cleaning [Dataset]. https://www.kaggle.com/datasets/luisliraportfolio/excel-project-clean-dataset/discussion
Available download formats: zip (12085049 bytes)
Dataset updated
Sep 26, 2023
Authors
Luis Lira
Description
Dataset

This dataset was created by Luis Lira

Contents
Global import data of Clean Excel
volza.com
csv
Updated Nov 21, 2025
Cite
Volza FZ LLC (2025). Global import data of Clean Excel [Dataset]. https://www.volza.com/imports-united-states/united-states-import-data-of-clean+excel
Available download formats: csv
Dataset updated
Nov 21, 2025
Dataset authored and provided by
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
Description
27 Global import shipment records of Clean Excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Household Income and Expenditure 2010 - Tuvalu
catalog.ihsn.org
Updated Mar 29, 2019
Cite
Central Statistics Division (2019). Household Income and Expenditure 2010 - Tuvalu [Dataset]. http://catalog.ihsn.org/catalog/3203
Dataset updated
Mar 29, 2019
Dataset authored and provided by
Central Statistics Division
Time period covered
2010
Area covered
Tuvalu
Description
Abstract

The main objectives of the survey were:
- To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti;
- To provide information on the nature and distribution of household income, expenditure and food consumption patterns;
- To provide data on the household sector's contribution to the National Accounts;
- To provide information on the economic activity of men and women to study gender issues;
- To undertake some poverty analysis.

Geographic coverage

National, including Funafuti and Outer islands

Analysis unit

Household

individual

Universe

All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as are people who are usual residents but are currently away (for work, health or holiday reasons, or as boarding students, for example). If the household had been residing in Tuvalu for less than one year:
- but intends to reside for more than 12 months: the household is included
- and does not intend to reside for more than 12 months: out of scope

Kind of data

Sample survey data [ssd]

Sampling procedure

It was decided that a 33% (one third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey. The sample selection was spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample would be spread out across each island as much as possible and thus be more representative.

For details please refer to Table 1.1 of the Report.
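As a minimal sketch of the systematic skip described above, assuming a one-third sample corresponds to taking every third dwelling from a geographically ordered list, the selection could be expressed as follows. The dwelling identifiers, the "Island A" stratum, the fixed starting position, and the function names are hypothetical; the report (Table 1.1) remains the reference for the actual procedure.

```python
def systematic_sample(dwellings, skip=3, start=0):
    """Take every `skip`-th dwelling from an ordered list, beginning at `start`."""
    return dwellings[start::skip]


def sample_by_island(dwellings_by_island, skip=3, excluded=("Niulakita",)):
    """Sample each island independently, treating each island as its own stratum."""
    return {
        island: systematic_sample(dwellings, skip)
        for island, dwellings in dwellings_by_island.items()
        if island not in excluded
    }


# Hypothetical sampling frame: dwellings listed in geographical order per island.
frame = {
    "Funafuti": [f"F{i}" for i in range(1, 10)],   # F1 .. F9
    "Island A": [f"A{i}" for i in range(1, 7)],    # A1 .. A6
    "Niulakita": ["N1", "N2", "N3"],               # excluded from the frame
}
selected = sample_by_island(frame)
```

Because the list is ordered by position, a fixed skip spreads the selected dwellings across the whole island, which is the property the survey designers wanted. In practice the starting position is usually chosen at random within the first skip interval; the sketch fixes it at 0 for reproducibility.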

Sampling deviation

Only the island of Niulakita was excluded from the sampling frame, as it was considered too small.

Mode of data collection

Face-to-face [f2f]

Research instrument

There were three main survey forms used to collect data for the survey. Each question is written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.

HOUSEHOLD FORM
- composition of the household and demographic profile of each member
- dwelling information
- dwelling expenditure
- transport expenditure
- education expenditure
- health expenditure
- land and property expenditure
- household furnishing
- home appliances
- cultural and social payments
- holidays/travel costs
- loans and saving
- clothing
- other major expenditure items

INDIVIDUAL FORM
- health and education
- labor force (individuals aged 15 and above)
- employment activity and income (individuals aged 15 and above): wages and salaries, working own business, agriculture and livestock, fishing, income from handicraft, income from gambling, small scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarer

DIARY (one diary per week over a 2-week period; 2 diaries per household were required)
- All kinds of expenses
- Home production
- Food and drink (eaten by the household, given away, sold)
- Goods taken from own business (consumed, given away)
- Monetary gift (given away, received, winning from gambling)
- Non monetary gift (given away, received, winning from gambling)

Questionnaire Design Flaws

Questionnaire design flaws are problems with the way questions were worded that result in the respondent providing an incorrect answer. Despite every effort to minimize this problem during the design of the respective survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

Gifts, Remittances & Donations

Collecting information on the following:
- the receipt and provision of gifts
- the receipt and provision of remittances
- the provision of donations to the church, other communities and family occasions
is a very difficult task in a HIES. The extent of these activities in Tuvalu is very high, so every effort should be made to address these activities as best as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if the activity occurs on a regular basis, and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate. On the other hand, if the activity is less frequent, and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such, both the diary and questionnaire were used to collect this information. Unfortunately it probably wasn't made clear enough what types of transactions were being collected from the different sources, and as such some transactions might have been missed, and others counted twice. The effects of these problems are hopefully minimal overall.

Defining Remittances

Because people have different interpretations of what constitutes remittances, the questionnaire needs to be very clear as to how this concept is defined in the survey. Unfortunately this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift which was transferred between two households.

Business Expenses Still Recorded

The aim of the survey is to measure "household" expenditure, and as such, any expenditure made by a household for an item or service which was primarily used for a business activity should be excluded. It was not always clear in the questionnaire that this was the case, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses which would impact significantly on survey results.

Purchased goods given away as a gift

When a household makes a gift donation of an item it has purchased, this is recorded in section 5 of the diary. Unfortunately it was difficult to know how to treat these items, as it was not clear whether the item had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all information on gifts given which were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as a purchased gift given away, which in turn is not household consumption expenditure, but this was not possible.

Some key items missed in the Questionnaire

Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule. A key example is electric fans, which many households in Tuvalu own.

Cleaning operations

Consistency of the data:
- each questionnaire was checked by the supervisor during and after the collection
- before data entry, all the questionnaires were coded
- the CSPro data entry system included inconsistency checks, which allowed the NSO staff to identify some errors and correct them by imputation based on their own knowledge (there was no time for double entry; there were 4 data entry operators)
- after data entry, outliers were identified in order to check their consistency.

All data entry, including editing, edit checks and queries, was done using CSPro (Census Survey Processing System) with additional data editing and cleaning taking place in Excel.

The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.

Although enumeration was not completed until mid June, the coding and data entry commenced as soon as forms were available from Funafuti, which was towards the end of March. The coding and data entry was then completed around the middle of July.

A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.

Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.

Under-Reporting and Incorrect Reporting as a result of Poor Field Work Procedures

The most crucial stage of any survey activity, whether it be a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results appearing in the diaries. Efforts were made to identify the main issues which would have the greatest impact on final results, and this information was modified using local knowledge, to a more reasonable answer, when required.

Data Entry Errors

Data entry errors are always expected, but can be kept to a minimum with
Video Game Sales Dataset (Excel Dashboard Project)
kaggle.com
Updated Oct 7, 2025
Cite
Adewale Lateef W (2025). Video Game Sales Dataset (Excel Dashboard Project) [Dataset]. https://www.kaggle.com/datasets/adewalelateefw/video-game-sales-dataset-excel-dashboard-project
Dataset updated
Oct 7, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Adewale Lateef W
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains video game sales data prepared for an Excel data analysis and dashboard project.

It includes detailed information on:

Game titles

Platforms

Genres

Publishers

Regional and global sales

The dataset was cleaned, structured, and analyzed in Microsoft Excel to explore patterns in the global video game market. It can be used to:

Practice data cleaning and pivot tables

Build interactive dashboards

Perform sales comparisons across regions and genres

Develop business insights from entertainment data

🧩 File Information

Format: .xlsx (Excel Workbook)

Columns: Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales

💡 Use Cases

Excel dashboard and chart creation

Data visualization and storytelling

Business and market analysis practice

Portfolio or learning projects

👤 Prepared by

Adewale Lateef W — for data analysis and Excel dashboard learning purposes.
Global export data of Clean Excel
volza.com
csv
Updated Nov 14, 2025
Cite
Volza FZ LLC (2025). Global export data of Clean Excel [Dataset]. https://www.volza.com/exports-india/india-export-data-of-clean+excel
Available download formats: csv
Dataset updated
Nov 14, 2025
Dataset authored and provided by
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of exporters, Sum of export value, 2014-01-01/2021-09-30, Count of export shipments
Description
123 Global export shipment records of Clean Excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
Figshare: http://figshare.com/
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file which includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
Global import data of Clean,excel
volza.com
csv
Updated Nov 14, 2025
Cite
Volza FZ LLC (2025). Global import data of Clean,excel [Dataset]. https://www.volza.com/imports-india/india-import-data-of-clean-excel-from-italy
Available download formats: csv
Dataset updated
Nov 14, 2025
Dataset authored and provided by
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
Description
955 Global import shipment records of Clean,excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.

Netflix Data: Cleaning, Analysis and Visualization

kaggle.com

zip

Updated Aug 26, 2022

Cite

Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization

Available download formats: zip (276607 bytes)

Dataset updated

Aug 26, 2022

Authors

Abdulrasaq Ariyo

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original version, which can be found here. The data consist of content added to Netflix from 2008 to 2021. The oldest title dates from 1925 and the newest from 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.

Data Cleaning

We are going to:
1. Treat the Nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns

Extra steps and more explanation of the process are given in the code comments.

--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
--(only show_id values that occur more than once are returned)

SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
HAVING COUNT(*) > 1
ORDER BY show_id DESC;

--No duplicates

--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
    COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
    COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
    COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
    COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
    COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
    COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
    COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
    COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
    COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;

We can see that there are NULLS. 
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
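For readers working outside PostgreSQL, the per-column null count produced by the FILTER query above can be sketched in plain Python. This is an illustration under the assumption that each row is a dict and that a missing value is represented as None; the function name and the three sample rows are hypothetical and are not part of the dataset.

```python
def count_nulls(rows, columns):
    """Count, for each column, how many rows hold None in that column."""
    return {
        f"{column}_nulls": sum(1 for row in rows if row.get(column) is None)
        for column in columns
    }


# Hypothetical rows standing in for the netflix table.
rows = [
    {"show_id": "s1", "director": None, "country": "United States"},
    {"show_id": "s2", "director": "Jane Doe", "country": None},
    {"show_id": "s3", "director": None, "country": None},
]
null_counts = count_nulls(rows, ["show_id", "director", "country"])
```

The result uses the same `<column>_nulls` naming as the SQL aliases, so the two outputs can be compared side by side.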

The nulls in the director column make up about 30% of the whole column, therefore I will not delete them. Instead, I will use another column to populate it. To populate the director column, we want to find out if there is a relationship between the movie_cast column and the director column.

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

With this, we can now populate NULL rows in directors 
using their record with movie_cast

UPDATE netflix 
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;

--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--When I was doing this, I found a less complex and faster way to populate a column which I will use next

Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column

--Populate the country using the director column

SELECT COALESCE(nt.country,nt2.country) 
FROM netflix AS nt
JOIN netflix AS nt2 
ON nt.director = nt2.director 
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
--Apply the same self-join as an update

UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director
AND netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;


--To confirm if there are still directors linked to country that refuse to update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL in country as "Not Given"

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;
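The country step above borrows a value from another title by the same director and then falls back to 'Not Given'. A minimal Python sketch of that logic follows, assuming rows are dicts and missing values are None. Two choices in it are assumptions of this sketch, not something the SQL states: when a director has several known countries, the first one seen is used, and the placeholder director 'Not Given' is skipped so that unrelated titles do not inherit each other's country.

```python
PLACEHOLDER = "Not Given"


def fill_country_by_director(rows):
    """Fill missing countries from another row with the same director."""
    known = {}
    for row in rows:
        director, country = row["director"], row["country"]
        # Remember the first known country for each real (non-placeholder) director.
        if director not in (None, PLACEHOLDER) and country is not None:
            known.setdefault(director, country)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unchanged
        if new_row["country"] is None:
            new_row["country"] = known.get(new_row["director"], PLACEHOLDER)
        filled.append(new_row)
    return filled


# Hypothetical rows; the director names are invented for illustration.
titles = [
    {"show_id": "s1", "director": "Director A", "country": "India"},
    {"show_id": "s2", "director": "Director A", "country": None},
    {"show_id": "s3", "director": "Not Given", "country": "Japan"},
    {"show_id": "s4", "director": "Not Given", "country": None},
]
filled_titles = fill_country_by_director(titles)
```

Building the lookup once and then filling in a second pass keeps the work linear in the number of rows, whereas a self-join compares every pair of rows that share a director.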

Only 10 of the more than 8,000 rows have a null date_added, so deleting them will not materially affect our analysis or visualization.

--Show date_added nulls

SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;

--DELETE nulls

DELETE F...

Well Cleaned datasets (train & test) using Excel
kaggle.com
zip
Updated Feb 16, 2020
Cite
MICADEE (2020). Well Cleaned datasets (train & test) using Excel [Dataset]. https://www.kaggle.com/adegladius/well-cleaned-datasets-train-test-using-excel
Available download formats: zip (11762 bytes)
Dataset updated
Feb 16, 2020
Authors
MICADEE
Description
Context

THE USE OF MICROSOFT EXCEL IN TITANIC MACHINE LEARNING ON KAGGLE

Content

This is a Titanic dataset, cleaned with the help of Microsoft Excel. I used several Excel functions (e.g. IF, COUNTIFS) to apply one-hot encoding to all the categorical features such as Cabin, Embarked and Sex. The full names of the Titanic passengers were categorized into "Mr", "Mrs", "Master" and "Miss" using Excel's "Remove Duplicates" function, which helps to extract only the titles from the passengers' full names. The feature "Family" was encoded as "IsAlone" using the Excel function IF, where "0" represents IsAlone (i.e. the passenger came alone, without any siblings, spouse or relative) and "1" represents IsNotAlone (i.e. the passenger possibly came with at least one person, e.g. a spouse, family member or relative). The feature "Ticket", which comes in many varieties, was also one-hot encoded with the same Excel functions (IF, COUNTIFS) to categorize the different types of tickets taken by Titanic passengers. The missing values under the feature "Age" were replaced using the mode, i.e. the passenger age with the highest frequency. At the end of this, the train score is 84.9% (when using XGBClassifier) while the test prediction score is 84.3% (difference = 84.9% - 84.3% = 0.6 percentage points).
Titanic Machine Learning leaderboard score: 0.8181, among the top 4% on the LB, this being my first machine learning submission on Kaggle.com. Note: I have since attained the top 1% on Kaggle with an LB score of 0.8889, after rigorous research on different approaches in machine learning.
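The encodings described above were built with Excel functions; the following Python sketch only illustrates the same three ideas (title extraction, the IsAlone flag, and mode imputation for Age). It assumes names follow the "Surname, Title. Given names" pattern and that "Family" is a count of accompanying relatives; the function names and sample values are hypothetical.

```python
import re
from statistics import mode


def extract_title(full_name):
    """Return the title between the comma and the first period, e.g. 'Mr'."""
    match = re.search(r",\s*([A-Za-z]+)\.", full_name)
    return match.group(1) if match else None


def encode_is_alone(family_size):
    """Follow the description's convention: 0 = travelling alone, 1 = not alone."""
    return 0 if family_size == 0 else 1


def fill_missing_ages(ages):
    """Replace None with the most frequent known age (the mode)."""
    known = [age for age in ages if age is not None]
    most_common = mode(known)
    return [most_common if age is None else age for age in ages]


# Hypothetical sample values for illustration.
title = extract_title("Braund, Mr. Owen Harris")
flags = [encode_is_alone(n) for n in (0, 2, 1)]
ages = fill_missing_ages([22, None, 38, 22, None])
```

Note that the 0/1 convention mirrors the description, where 0 means the passenger was alone; many tutorials use the opposite coding, so the column name and its meaning should be checked before modelling.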

Acknowledgements

First of all, with this little achievement of mine in a Kaggle machine learning competition, barely five (5) months after I joined and started applying my acquired ML knowledge, I acknowledge those behind the amazing platform Kaggle.com. I also really appreciate those who took the time to teach online how to clean a dataset using several Microsoft Excel functions; I learnt a lot from these online videos. Thus, we can imagine the combination of Excel and Python code.

Inspiration

The use of Excel functions to clean up a dataset amazed me, as did seeing how powerful Microsoft Excel can be. Nevertheless, I would love to see whether there is a new or different approach to encoding dataset features and to fixing or replacing missing values in a dataset.
Global export data of Clean,excel
volza.com
csv
Updated Nov 14, 2025
+ more versions
Cite
Volza FZ LLC (2025). Global export data of Clean,excel [Dataset]. https://www.volza.com/exports-india/india-export-data-of-clean-excel-to-saudi-arabia
Explore at:
Available download formats: csv
Dataset updated
Nov 14, 2025
Dataset authored and provided by
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of exporters, Sum of export value, 2014-01-01/2021-09-30, Count of export shipments
Description
116 Global export shipment records of Clean,excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Data from: Designing data science workshops for data-intensive environmental...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+1 more
zip
Updated Dec 8, 2020
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
Montana State University
California State Polytechnic University
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
License
https://spdx.org/licenses/CC0-1.0.html
Description
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

Methods

Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
Data from: Survey data from the Australian Marine Debris Initiative
research.usc.edu.au
researchdata.edu.au
csv
+ more versions
Cite
Heidi Tait; Jodi Jones; Caitlin Smith; Kathy Townsend, Survey data from the Australian Marine Debris Initiative [Dataset]. https://research.usc.edu.au/esploro/outputs/dataset/Survey-data-from-the-Australian-Marine/991016398702621
Explore at:
Available download formats: csv (7054018 bytes)
Dataset provided by
University of the Sunshine Coast
Authors
Heidi Tait; Jodi Jones; Caitlin Smith; Kathy Townsend
Time period covered
2024
Description
Survey data from the Australian Marine Debris Initiative and the result of spatial analysis from multiple creative commons datasets. Data consists of:
• Spatial Data Queensland Coastline – Event summaries within an Excel data table and shapefile
• All years
• Number of Items removed, Weight volunteers, Volume, Distance, Latitude and Longitude.
• Contributing organisation files table/ sites
• Environmental, physical and biological variables associated with the closest catchment to each debris survey.

TBF has made all reasonable efforts to ensure that the information in the Custom Dataset is accurate. TBF will not be held responsible:
• for the way these data are used by the Entity for their Reports;
• for any errors that may be contained in the Custom Dataset; or
• any direct or indirect damage the use of the Custom Dataset may cause.

Data collected by TBF comes from citizen science initiatives and is taken at face value from contributors with each entry being vetted and periodic checks being made to maintain the integrity of the overall dataset. Some clean-up data has been extrapolated by data collectors. Some weight and distance details have not been provided by contributors.

The data was collected by various organisations and individuals in clean-up events at their chosen locations where man-made items greater than 5mm were removed from the beach, and sorted, counted and recorded on data sheets, using CyberTracker software devices or the AMDI mobile application. Items were identified according to the method laid out in the TBF Marine Debris Identification Manual in which items are grouped according to their material categories (the manual is available on the TBF website). The length of beach cleaned is at the discretion of the clean-up group and the total weight of items removed is either weighed with handheld scales or estimated.
Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping
figshare.com
Updated Jan 6, 2025
Cite
Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
Explore at:
Unique identifier
https://doi.org/10.6084/m9.figshare.28147451.v1
Dataset updated
Jan 6, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Maryam Binti Haji Abdul Halim
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

Key Features and Tools:
- Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
- Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
- Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
- Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.

Retail Store Sales: Dirty for Data Cleaning

kaggle.com

zip

Updated Jan 18, 2025

Cite

Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning

Explore at:

Available download formats: zip (226740 bytes)

Dataset updated

Jan 18, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Retail Store Sales Dataset

Overview

The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

File Information

File Name: retail_store_sales.csv
Number of Rows: 12,575
Number of Columns: 11

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Customer ID`	A unique identifier for each customer. 25 unique customers.	`CUST_01`
`Category`	The category of the purchased item.	`Food`, `Furniture`
`Item`	The name of the purchased item. May contain missing values or `None`.	`Item_1_FOOD`, `None`
`Price Per Unit`	The static price of a single unit of the item. May contain missing or `None` values.	`4.00`, `None`
`Quantity`	The quantity of the item purchased. May contain missing or `None` values.	`1`, `None`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `None`
`Payment Method`	The method of payment used. May contain missing or invalid values.	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Online`
`Transaction Date`	The date of the transaction. Always present and valid.	`2023-01-15`
`Discount Applied`	Indicates if a discount was applied to the transaction. May contain missing values.	`True`, `False`, `None`
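
Because `Total Spent` is defined as `Quantity * Price Per Unit`, any one of the three amounts can be recovered when the other two are present. The following is a minimal sketch of that idea, assuming rows are read as dictionaries of strings (for example with `csv.DictReader`) and that missing cells appear as empty strings or the literal text `None`; the helper names `parse_number` and `fill_amounts` are hypothetical.

```python
def parse_number(raw):
    """Return a float, or None for blank, 'None', 'nan', or unparseable cells."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "" or text.lower() in {"none", "nan"}:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def fill_amounts(row):
    """Fill at most one missing amount using Total = Quantity * Price."""
    price = parse_number(row.get("Price Per Unit"))
    qty = parse_number(row.get("Quantity"))
    total = parse_number(row.get("Total Spent"))
    if total is None and price is not None and qty is not None:
        total = price * qty
    elif qty is None and price not in (None, 0) and total is not None:
        qty = total / price
    elif price is None and qty not in (None, 0) and total is not None:
        price = total / qty
    # Rows with two or more missing amounts are left as None.
    return {**row, "Price Per Unit": price, "Quantity": qty, "Total Spent": total}


# Hypothetical example rows.
r1 = fill_amounts({"Price Per Unit": "4.00", "Quantity": "2", "Total Spent": "None"})
r2 = fill_amounts({"Price Per Unit": "None", "Quantity": "2", "Total Spent": "8.00"})
r3 = fill_amounts({"Price Per Unit": "4.00", "Quantity": "", "Total Spent": "8.00"})
r4 = fill_amounts({"Price Per Unit": "None", "Quantity": "", "Total Spent": "8.00"})
```

Rows where two of the three amounts are missing cannot be repaired this way and need another source of information, such as the item's listed price.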

Categories and Items

The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

Electric Household Essentials

Item Code	Item Name	Price
Item_1_EHE	Blender	5.0
Item_2_EHE	Microwave	6.5
Item_3_EHE	Toaster	8.0
Item_4_EHE	Vacuum Cleaner	9.5
Item_5_EHE	Air Purifier	11.0
Item_6_EHE	Electric Kettle	12.5
Item_7_EHE	Rice Cooker	14.0
Item_8_EHE	Iron	15.5
Item_9_EHE	Ceiling Fan	17.0
Item_10_EHE	Table Fan	18.5
Item_11_EHE	Hair Dryer	20.0
Item_12_EHE	Heater	21.5
Item_13_EHE	Humidifier	23.0
Item_14_EHE	Dehumidifier	24.5
Item_15_EHE	Coffee Maker	26.0
Item_16_EHE	Portable AC	27.5
Item_17_EHE	Electric Stove	29.0
Item_18_EHE	Pressure Cooker	30.5
Item_19_EHE	Induction Cooktop	32.0
Item_20_EHE	Water Dispenser	33.5
Item_21_EHE	Hand Blender	35.0
Item_22_EHE	Mixer Grinder	36.5
Item_23_EHE	Sandwich Maker	38.0
Item_24_EHE	Air Fryer	39.5
Item_25_EHE	Juicer	41.0
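
In the rows listed above, the price rises by 1.5 per item number, starting from 5.0 for item 1 and reaching 41.0 for item 25. If that pattern holds, a missing `Price Per Unit` could be recovered from the item code alone. The sketch below assumes the pattern applies to every category, which this listing does not confirm; the function name `price_from_item_code` is hypothetical.

```python
import re


def price_from_item_code(code):
    """Guess the static price from a code such as 'Item_4_EHE'.

    Assumes price = 5.0 + 1.5 * (n - 1), the pattern seen in the listed rows.
    Returns None when the code is missing or malformed.
    """
    match = re.fullmatch(r"Item_(\d+)_([A-Z]+)", str(code))
    if not match:
        return None
    n = int(match.group(1))
    if not 1 <= n <= 25:
        return None
    return 5.0 + 1.5 * (n - 1)


first = price_from_item_code("Item_1_EHE")
last = price_from_item_code("Item_25_EHE")
missing = price_from_item_code("None")
```

A lookup table built from the full item lists would be the safer choice where the complete listing is available; the formula is only a shortcut.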

Furniture

Item Code	Item Name	Price
Item_1_FUR	Office Chair	5.0
Item_2_FUR	Sofa	6.5
Item_3_FUR	Coffee Table	8.0
Item_4_FUR	Dining Table	9.5
Item_5_FUR	Bookshelf	11.0
Item_6_FUR	Bed F...

Dataset Supporting: Dynamic Modeling of Poultry Litter Composting in High...
data.mendeley.com
Updated Nov 22, 2024
+ more versions
Cite
Alvaro Patiño-Forero (2024). Dataset Supporting: Dynamic Modeling of Poultry Litter Composting in High Mountain Climates using System Identification Techniques [Dataset]. http://doi.org/10.17632/dgxxj2pk8s.2
Explore at:
Unique identifier
https://doi.org/10.17632/dgxxj2pk8s.2
Dataset updated
Nov 22, 2024
Authors
Alvaro Patiño-Forero
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Description: This dataset contains detailed measurements collected during two controlled experiments designed to study the dynamics of the composting process using the forced aeration technique. The dataset is divided into two main parts:

- Experiment 1: Includes parameters such as temperatures (hot air and compost pile), relative humidity, and air and heat inputs.
- Experiment 2: Complements the first experiment with oxygen levels in addition to the previously mentioned variables.

Both datasets are organized in a chronological format, with records that allow the analysis of trends and correlations among the studied variables.

Purpose: The primary objective of this dataset is to facilitate the study of composting dynamics in high mountain environments using the forced aeration technique. It can be used for:

- Bioprocess modeling.
- Studies on energy optimization in biological and chemical processes.
- Research in environmental biology, process engineering, and clean technologies.

Dataset Features:

- Total Size: Experiment 1: 4302 records and 8 variables. Experiment 2: 3076 records and 9 variables.
- Temporal Coverage: Records are organized by hour and minute over several days of experimentation.
- Key Variables: Hour and minute of the record. Heater and compost pile temperatures. Relative humidity. Air and heat inputs. Oxygen levels (in Experiment 2). Days elapsed since the start of the experiment.
- Available Formats: The dataset is available in Excel format (.xlsx), with each experiment documented on separate sheets.

Access and Use:

- Restrictions: Commercial use of the dataset requires prior authorization.
- Potential Applications: This resource is valuable for researchers in fields such as:
  - Environmental engineering and bioprocesses.
  - Design and optimization of thermal and environmental control systems.

Cite

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:

167 scholarly articles cite this dataset (View in Google Scholar)

Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.5683/SP3/ZCN177

Dataset updated

Jul 13, 2023

Dataset provided by

Borealis

Authors

Rong Luo

License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
