This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.
🎯 What’s Included: - Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values) - A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.) - Columns like 'date_added' have been properly formatted into DMY structure - Multi-valued columns like 'listed_in' are split for better analysis - Null values replaced with “Unknown” for clarity - Duration field broken into numeric + unit components
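Although the cleaning itself was done entirely in Excel, here is a minimal pandas sketch for anyone who wants to reproduce the same transformations programmatically; the file name and column labels are assumptions based on the original Kaggle dataset:

```python
import pandas as pd

# Hypothetical file name; the Kaggle original is usually "netflix_titles.csv".
df = pd.read_csv("netflix_titles.csv")

# Replace missing values with "Unknown", mirroring the Excel cleaning above.
df = df.fillna("Unknown")

# Split the duration field into numeric and unit parts, e.g. "90 min" -> 90, "min".
df[["duration_value", "duration_unit"]] = df["duration"].str.extract(r"(\d+)\s*(\D+)")
df["duration_value"] = pd.to_numeric(df["duration_value"], errors="coerce")

# Reformat date_added into a day-month-year structure.
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(),
                                  errors="coerce").dt.strftime("%d-%m-%Y")
```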
🔍 Dataset Purpose: Ideal for beginners and analysts who want to: - Practice data cleaning in Excel - Explore Netflix content trends - Analyze content by type, country, genre, or date added
📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows
📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021, with release years ranging from 1925 to 2021. The dataset will be cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns

Extra steps and further explanation of the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset; therefore, we are going to check it for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
Nulls make up about 30% of the director column, so I will not delete those rows. Instead, I will find another column to populate it from. To populate the director column, we first check whether there is a relationship between the movie_cast column and the director column.
--Below, we find out whether some directors tend to work with particular cast members
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records.
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--While doing this, I found a simpler and faster way to populate a column, which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to director and cast, we are going to populate missing countries from other rows that share the same director.
--Preview the country values that can be borrowed from rows sharing the same director
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--Confirm which rows still have a NULL country (directors with no other record to borrow from)
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL country values as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE date_added IS NULL;
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.
retail_store_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01 |
| Category | The category of the purchased item. | Food, Furniture |
| Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None |
| Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None |
| Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None |
| Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online |
| Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15 |
| Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None |
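Since Total Spent is defined as Quantity * Price Per Unit, a missing value in any one of the three numeric fields can often be recovered from the other two. A hedged pandas sketch, with column names taken from the table above and the file name from the description:

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")

# Coerce the numeric fields; "None" strings and blanks become NaN.
for col in ["Price Per Unit", "Quantity", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Recover each field from the other two where possible.
df["Total Spent"] = df["Total Spent"].fillna(df["Quantity"] * df["Price Per Unit"])
df["Quantity"] = df["Quantity"].fillna(df["Total Spent"] / df["Price Per Unit"])
df["Price Per Unit"] = df["Price Per Unit"].fillna(df["Total Spent"] / df["Quantity"])
```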
The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_EHE | Blender | 5.0 |
| Item_2_EHE | Microwave | 6.5 |
| Item_3_EHE | Toaster | 8.0 |
| Item_4_EHE | Vacuum Cleaner | 9.5 |
| Item_5_EHE | Air Purifier | 11.0 |
| Item_6_EHE | Electric Kettle | 12.5 |
| Item_7_EHE | Rice Cooker | 14.0 |
| Item_8_EHE | Iron | 15.5 |
| Item_9_EHE | Ceiling Fan | 17.0 |
| Item_10_EHE | Table Fan | 18.5 |
| Item_11_EHE | Hair Dryer | 20.0 |
| Item_12_EHE | Heater | 21.5 |
| Item_13_EHE | Humidifier | 23.0 |
| Item_14_EHE | Dehumidifier | 24.5 |
| Item_15_EHE | Coffee Maker | 26.0 |
| Item_16_EHE | Portable AC | 27.5 |
| Item_17_EHE | Electric Stove | 29.0 |
| Item_18_EHE | Pressure Cooker | 30.5 |
| Item_19_EHE | Induction Cooktop | 32.0 |
| Item_20_EHE | Water Dispenser | 33.5 |
| Item_21_EHE | Hand Blender | 35.0 |
| Item_22_EHE | Mixer Grinder | 36.5 |
| Item_23_EHE | Sandwich Maker | 38.0 |
| Item_24_EHE | Air Fryer | 39.5 |
| Item_25_EHE | Juicer | 41.0 |
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_FUR | Office Chair | 5.0 |
| Item_2_FUR | Sofa | 6.5 |
| Item_3_FUR | Coffee Table | 8.0 |
| Item_4_FUR | Dining Table | 9.5 |
| Item_5_FUR | Bookshelf | 11.0 |
| Item_6_FUR | Bed F... |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
PECD Hydro modelling
This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.
The original URLs:
The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019
As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data-wrangling efforts to make this dataset more accessible.
Data description
The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.
In this repository you can find the following CSV files:

- PECD-hydro-capacities.csv: installed capacities
- PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
- PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
- PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
- PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities
The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
- Reservoir, rows from 5 to 7, columns from 1 to 3
- Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
- Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows
The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 13 to 66, columns from 16 to 51
- Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river
The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Minimum and maximum reservoir generation
The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 13 to 66, columns from 196 to 231
- Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels
The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 14 to 66, column 12
- Reservoir, rows from 14 to 66, column 13
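As an illustration of the extractions described above, a pandas sketch that pulls one such block from a PEMM workbook; the file and sheet names are assumptions to be checked against the archive:

```python
import pandas as pd

# Weekly reservoir inflows: rows 13-66, columns 16-51 of the "Reservoir" section.
inflows = pd.read_excel(
    "PEMM_FR00.xlsx",        # hypothetical file name
    sheet_name="Reservoir",  # assumed sheet name
    header=None,
    skiprows=12,             # skip Excel rows 1-12
    nrows=54,                # keep Excel rows 13-66
    usecols=range(15, 51),   # Excel columns 16-51 (0-indexed)
)
# Label the 36 columns with the climatic years 1982-2017.
inflows.columns = range(1982, 2018)
```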
CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Original dataset

The original year-2019 dataset was downloaded from the World Bank Databank using the following approach on July 23, 2022:

- Database: "World Development Indicators"
- Country: 266 (all available)
- Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total"
- Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021
- Layout: Custom -> Time: Column, Country: Row, Series: Column
- Download options: Excel
Preprocessing
With LibreOffice:

- remove non-country entries (lines after Zimbabwe),
- shorten column names for easy processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note '_', not '-', for R),
- remove unnecessary rows after the line for Zimbabwe.
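The same preprocessing can be scripted; a rough pandas equivalent, assuming a standard Databank Excel export layout and a placeholder file name:

```python
import re
import pandas as pd

df = pd.read_excel("world_bank_indicators.xlsx")  # placeholder file name

# Keep only country rows: everything up to and including Zimbabwe.
last = df.index[df["Country Name"] == "Zimbabwe"][0]
df = df.loc[:last]

# Shorten column names; use '_' rather than '-' so they stay valid in R.
df = df.rename(columns={"Country Name": "Country", "Country Code": "Code"})
# Compress headers like "GNI, Atlas method (current US$) [1990]" to "GNI_1990"
# (the exact header pattern varies by export, so adjust the regex as needed).
df.columns = [re.sub(r".*GNI.*?(\d{4}).*", r"GNI_\1", str(c)) for c in df.columns]
```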
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
An Excel spreadsheet listing the information recorded on each of 18,686 costume designs can be viewed, downloaded, and explored. All the usual Excel sorting possibilities are available, and in addition a useful filter has been installed. For example, to find the number of designs that are Frieze Type #1, go to the top of the frieze type 2 column (column AS), click on the drop-down arrow and unselect every option box except True (i.e. True should be turned on, all other choices turned off). Then in the lower left corner, one reads “1111 of 18686 records found”.
Much more sophisticated exploration can be carried out by downloading the rich and flexible Access database. The terms used for this database are described in detail in three sections of the Deep Blue paper associated with this project. The database can be downloaded and explored.
HOW TO USE THE ACCESS DATABASE

1. Click on the Create Cohort and View Math Trait Data button, and select your cohort by clicking on the features of interest (for example: Apron and Blouse).

Note: Depending on how you exited on your previous visit to the database, there may be items to clear up before creating the cohorts.

a) (Usually unnecessary) Click on the small box near the top left corner to allow connection to Access.

b) (Usually unnecessary) If an undesired window blocks part of the screen, click near the top of this window to minimize it.

c) Make certain under Further Filtering that all four Exclude boxes are checked, to get rid of stripes, circles, and circular buttons, and the D1 that is trivially associated with shoes.

2. Click on the Filter Records to Form the Cohort button. Note the # of designs, # of pieces, and # of costumes beside Recalculate.

3. Click on the Calculate Average Math Trait Frequency of Cohort button, and select the symmetry types of interest (for example: D1 and D2).

4. To view the Stage 1 table, click on Create Stage 1 table. To edit and print this table, click on Create Excel (after the table has been created). The same process works for the Stage 2, 3, and 4 tables.

5. To view the matrix listing the math category impact numbers, move over to a button on the right side and click on View Matrix of Math Category Impact Numbers. To edit and print this matrix, click on Create Excel and use the Excel table as usual.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The present dataset was developed using the MODIS Normalized Difference Snow Index with a spatial resolution of 500 m as input for the SNOWMAP algorithm to detect lake ice from daily clear-sky observations. For cloud-cover conditions, lake ice was identified based on the spatial and temporal continuity of lake-ice data. On this basis, daily lake-ice monitoring data for 2612 lakes of the Tibetan Plateau from 2002 to 2018 were calculated and classified. Moreover, a time-series analysis of lake-ice coverage, which included lakes with a surface area greater than 1 km2, was carried out to provide a clear list of lakes for which lake-ice phenology can be estimated. The dataset contains 5834 raster files, one vector file, and 2612 Excel files (including 1134 time series with and without classification statistics). The raster files are named after the daily lake-ice extent. The vector file contains information such as the number, name, location, surface area, and classification number of each processed lake. The names of the Excel files correspond to lake numbers. Each Excel file contains four columns with the daily lake-ice coverage information of its corresponding lake from July 2002 to June 2018; the columns are, in order, date, lake water coverage, lake ice coverage, and cloud coverage. Users can first use the vector file to determine the number, location, and classification number of a given lake, and then obtain the corresponding daily lake-ice coverage data for a given year from the Excel file, for monitoring lake-ice freeze-thaw and research on climate change.
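A small pandas sketch of that recommended workflow for a single lake; the file name and header layout are assumptions, since the Excel files are named by lake number:

```python
import pandas as pd

lake = pd.read_excel("2612.xlsx")  # hypothetical lake-number file name

# The four columns are, in order: date, lake water coverage, lake ice coverage,
# cloud coverage.
lake.columns = ["date", "water_cover", "ice_cover", "cloud_cover"]
lake["date"] = pd.to_datetime(lake["date"])

# Example: mean ice coverage per hydrological year (July-June).
lake["hydro_year"] = lake["date"].dt.year.where(lake["date"].dt.month >= 7,
                                                lake["date"].dt.year - 1)
print(lake.groupby("hydro_year")["ice_cover"].mean())
```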
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
A journal paper published in Energy Strategy Reviews details the method to create the data.
https://www.sciencedirect.com/science/article/pii/S2211467X21001280
2023-10-10: Version 8.0.5 has additional columns: one for the day of the year, and one for the half-hour period of the year (17,520 in a standard year and 17,568 in a leap year). A new interconnector (https://www.viking-link.com/) has posted values since 2023-07-12, but all values have been zero so far (until 2023-09-30).
2023-03-15: Version 8.0.1 is a major rewrite with column names that now include the units and the data type. Also, pumped storage has charging values included from 2012, i.e., the negative values when pumped storage is being charged, as well as the positive values when it was discharging (which were available previously). The raw version of the data (rather than cleaned) has been dropped for the time being.
2023-01-06: Version 7.0.0 was created. Now includes data for the Eleclink interconnector from Great Britain to France through the Channel Tunnel (https://www.eleclink.co.uk/index.php). This supersedes previous versions - as the Eleclink data is now included for historical data (including in the ESPENI total).
2021-09-09: Version 6.0.0 was created. Now includes data for the North Sea Link (NSL) interconnector from Great Britain to Norway (https://www.northsealink.com). The previous version (5.0.4) should not be used - as there was an error with interconnector data having a static value over the summer 2021.
2021-05-05: Version 5.0.0 was created. Datetimes are now in ISO 8601 format (with a capital 'T' between the date and time) rather than with a space as previously (RFC 3339 format), and carry an offset to identify both UTC and local time. MW values are now all saved as integers rather than floats. Elexon data is, as always, from www.elexonportal.co.uk/fuelhh; National Grid data is from https://data.nationalgrideso.com/demand/historic-demand-data. Raw data is now added again for comparison of pre- and post-cleaning, to allow for training of additional cleaning methods. If using Microsoft Excel, the T between the date and time can be removed using the =SUBSTITUTE() command, substituting "T" with a space " ".
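In pandas, the ISO 8601 datetimes parse directly, so there is no need to strip the 'T' in Excel. The column names below are placeholders; check the header of the version you download:

```python
import pandas as pd

espeni = pd.read_csv("espeni.csv")  # placeholder file name

# UTC timestamps parse directly; the local-time column carries +00:00/+01:00
# offsets, so parse it as UTC and convert back to UK local time.
espeni["utc"] = pd.to_datetime(espeni["datetime_utc"])        # assumed column
espeni["local"] = (pd.to_datetime(espeni["datetime_local"],   # assumed column
                                  utc=True)
                   .dt.tz_convert("Europe/London"))
```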
2021-03-02: Version 4.0.0 was created. Due to a new interconnecter (IFA2 - https://en.wikipedia.org/wiki/IFA-2) being commissioned in Q1 2021, there is an additional column with data from National Grid - this is called 'POWER_NGEM_IFA2_FLOW_MW' in the espeni dataset. In addition, National Grid has dropped the column name 'FRENCH_FLOW' that used to provide the value for the column 'POWER_NGEM_FRENCH_FLOW_MW' in previous espeni versions. However, this has been changed to 'IFA_FLOW' in National Grid's original data, which is now called 'POWER_NGEM_IFA_FLOW_MW' in the espeni dataset. Lastly, the IO14 columns have all been dropped by National Grid - and potentially unlikely to appear again in future.
2020-12-02: Version 3.0.0 was created. There was a problem with earlier versions local time format - where the +01:00 value was not carried through into the data properly. Now addressed - therefore - local time now has the format e.g. 2020-03-31 20:00:00+01:00 when in British Summer Time.
2020-10-03: Version 2.0.0 was created, as it looks like National Grid made a significant change to the methodology underpinning the embedded wind calculations. The wind profile seems similar to previous values, but the divergence from earlier published values grows as the embedded value increases. The 'new' values are from https://data.nationalgrideso.com/demand/daily-demand-update from 2013.
Previously: raw and cleaned datasets for Great Britain's publicly available electrical data from Elexon (www.elexonportal.co.uk) and National Grid (https://demandforecast.nationalgrid.com/efs_demand_forecast/faces/DataExplorer). Updated versions with more recent data will be uploaded with a differing version number and doi
All data is released in accordance with Elexon's disclaimer and reservation of rights.
https://www.elexon.co.uk/using-this-website/disclaimer-and-reservation-of-rights/
This disclaimer is also felt to cover the data from National Grid, and the parsed data from the Energy Informatics Group at the University of Birmingham.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description of data preparation performed on data from 2001 to 2007 (end).
Cleaning Data

In the original form of the Sabana data (both daily and hourly), the instrument frequently recorded the minimum values of TIRRa and Total PFD as negative and the maximum value of RH as over 100%. These are clearly unrealistic values, so they were replaced by 0 (zero) for the TIRRa and Total PFD minima and by 100% for the RH maxima.
Defective Data

There were noticeable defects in the Total PFD values in 2003 and 2006 (both daily and hourly data). Specifically, in 2003 the defective Total PFD values ran from January 1st (Day # = 1) through September 3rd (Day # = 247), and in 2006 from March 24th (Day # = 83) through October 31st (Day # = 304). Therefore, four-year (2001, 2002, 2004, and 2005) monthly averages were calculated, and a multiplier was developed from the ratio [four-year average] / [2003 (or 2006) defective data]. The detailed calculation can be seen in the Modification file (MS Excel file). Columns denoted as "Modified Total PFD" are the results of this modification; note that red and black within the column indicate modified and non-modified (original) values, respectively.
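A pandas sketch of both corrections, under assumed column names (the original files label these differently):

```python
import pandas as pd

df = pd.read_csv("sabana_daily.csv")  # placeholder file name

# Cleaning: negative TIRRa / Total PFD minima -> 0; RH maxima above 100 -> 100.
df["TIRRa_min"] = df["TIRRa_min"].clip(lower=0)
df["TotalPFD_min"] = df["TotalPFD_min"].clip(lower=0)
df["RH_max"] = df["RH_max"].clip(upper=100)

# Defect correction for 2003 (days 1-247): scale by the ratio of the four-year
# monthly mean to the 2003 monthly mean.
good = df[df["year"].isin([2001, 2002, 2004, 2005])]
multiplier = (good.groupby("month")["TotalPFD"].mean()
              / df[df["year"] == 2003].groupby("month")["TotalPFD"].mean())
mask = (df["year"] == 2003) & df["day"].between(1, 247)
df.loc[mask, "TotalPFD_mod"] = (df.loc[mask, "TotalPFD"]
                                * df.loc[mask, "month"].map(multiplier))
```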
Missing Data

Large amounts of data are missing from both the daily and hourly datasets, as outlined below. Additionally, there were a few significantly defective values in some columns, which were omitted from the dataset. Missing and omitted data were left blank (no values).
Grizelle González - Project Leader, Research Unit
USDA FS - International Institute of Tropical Forestry
voice: 787-764-7800
ggonzalez@fs.fed.us
License: other-closed (http://dcat-ap.de/def/licenses/other-closed)
The dataset contains the results of the mayoral election on 25 May 2014 and the mayoral runoff election on 15 June 2014 in the City of Düsseldorf.

The local elections took place on 25 May 2014. Because no candidate reached a clear majority, a runoff election for mayor was held on 15 June 2014.

An authority may set up different territorial levels for presenting election results, from the lowest level (voting districts) through constituencies and districts up to the level of the whole city or municipality. Not all levels are necessary for each type of election. For each territorial level that an authority has set up, there is a file containing the overview of those areas for which quick messages (preliminary result reports) have already been received.
Further data sets contain information on the division of electoral areas for local elections and the division of voting districts.
Information on terms in the field of ‘Elections’ can be found in the Election ABC of the interactive learning platform for election workers of the City of Düsseldorf.
The files are encoded in UTF-8. By default, Excel does not display the umlauts in the files correctly. You can avoid this as follows:
Excel 2003: From the 'Data' menu, select 'Import external data' and then 'Import data'. The 'Select data source' dialog opens. Select the file you want to open and press the 'Open' button. Set the file origin to '65001 Unicode (UTF-8)' and continue with the 'Next' button. In the next dialog, set the separator to 'Semicolon' instead of 'Tab' and continue with the 'Next' button again. Then select the 'Text' option as the data format of the columns and exit the wizard with the 'Finish' button. Use the 'OK' button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.
Excel 2010: On the 'Data' tab, in the 'Get external data' section, select the option 'From text'. The 'Import text file' dialog opens. Select the file you want to open and press the 'Open' button. Set the file origin to '65001 Unicode (UTF-8)' and continue with the 'Next' button. In the next dialog, set the separator to 'Semicolon' instead of 'Tab' and continue with the 'Next' button again. Then select the 'Text' option as the data format of the columns and exit the wizard with the 'Finish' button. Use the 'OK' button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.
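If you prefer to skip the Excel import wizard altogether, the semicolon-separated, UTF-8 encoded files read directly in pandas (the file name here is a placeholder):

```python
import pandas as pd

results = pd.read_csv("stichwahl-2014.csv", sep=";", encoding="utf-8", dtype=str)
print(results.head())
```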
The files contain the following column information:
- Number: constituency number
- Name: name of the constituency
- MaxQuickMessages: maximum number of quick messages
- AnzQuickMessages: number of quick messages already recorded
- Eligible voters: number of eligible voters
- Filed under: number of ballot papers submitted
- Turnout: voter turnout at the respective view level
- valid Voting List: number of valid ballot papers
- valid: number of valid votes cast
- invalid Voting List: number of invalid ballot papers
- invalid: number of invalid votes cast

In addition, the following fields are available for each party (example for a party called 'A Party'):
- A Party: number of total votes for the party
- A-Party_Proz: percentage of the party's votes out of the total result
This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel dataset, a cleaned and audited version of the raw Excel dataset, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file “Hive Annotation Job Results – Audited”.
Observation
The “Hive Annotation Job Results” data comes in the form of a single excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and the pseudonym for five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique (that is, there are multiple instances of the same value in the column) numbers separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE – which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
- A table that is a definition list should automatically be tabular and also semantic
- Semantic tables should automatically be tabular
- If a table is NOT tabular, then it is definitely not semantic nor a definition list
- A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular
For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column, it is also semantic
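These criteria translate directly into Boolean logic. A hedged pandas sketch of the violation check, assuming the five answer columns are named as in the description above:

```python
import pandas as pd

df = pd.read_excel("Hive Annotation Job Results.xlsx")  # assumed file name

tab, sem = df["tabular"], df["semantic"]
dl, hr, hc = df["definition list"], df["header row"], df["header column"]

violations = (
    (dl & ~(tab & sem))         # a definition list must be tabular and semantic
    | (sem & ~tab)              # semantic implies tabular
    | (~tab & (sem | dl))       # not tabular -> neither semantic nor definition list
    | (tab & (hr | hc) & ~sem)  # tabular with a header row/column -> semantic
)
print(f"{violations.sum()} of {len(df)} rows violate the background facts")
```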
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new Excel file called "Simcoe_Zooplankton.csv". The ZDATE column needs to be switched from GENERAL to SHORT DATE so that the dates read "YYYY/MM/DD". Save as .csv in the appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period and to combine data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" is then used for the calculation of variability, stabilization, asynchrony, and Shannon Diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis" contains the final statistical analyses as well as code to reproduce the original figures. Data and code for the main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).
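The manual ZDATE reformatting step can also be scripted; a minimal pandas equivalent of that one step:

```python
import pandas as pd

zp = pd.read_csv("Simcoe_Zooplankton.csv")
zp["ZDATE"] = pd.to_datetime(zp["ZDATE"]).dt.strftime("%Y/%m/%d")
zp.to_csv("Simcoe_Zooplankton.csv", index=False)
```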
License: Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
This dataset contains information on Government of Canada tender information published according to the Financial Administration Act. It includes data for all Schedule I, Schedule II and Schedule III departments, agencies, Crown corporations, and other entities (unless specifically exempt) who must comply with the Government of Canada trade agreement obligations. CanadaBuys is the authoritative source of this information. Visit the How procurement works page on the CanadaBuys website to learn more.

All data files in this collection share a common column structure, and the procurement category field (labelled as “procurementCategory-categorieApprovisionnement”) can be used to filter by the following four major categories of tenders:

- Tenders for construction, which will have a value of “CNST”
- Tenders for goods, which will have a value of “GD”
- Tenders for services, which will have a value of “SRV”
- Tenders for services related to goods, which will have a value of “SRVTGD”

A tender may be associated with one or more of the above procurement categories.

Note: Some records contain long tender description values that may cause issues when viewed in certain spreadsheet programs, such as Microsoft Excel. When the information doesn’t fit within the cell’s character limit, the program will insert extra rows that don’t conform to the expected column formatting. (All other records will still be displayed properly, in their own rows.) To quickly remove the “spill-over data” caused by this display error in Excel, select the publication date field (labelled as “publicationDate-datePublication”), then click the Filter button on the Data menu ribbon. You can then use the filter pull-down list to remove any blank or non-date values from this field, which will hide the rows that only contain “spill-over” description information.

The following list describes the resources associated with this CanadaBuys tender notices dataset. Additional information on Government of Canada tenders can also be found on the Tender notices tab of the CanadaBuys tender opportunities page. NOTE: While the CanadaBuys online portal includes tender opportunities from across multiple levels of government, the data files in this related dataset only include notices from federal government organizations.

(1) CanadaBuys data dictionary: This XML file offers descriptions of each data field in the tender notices files linked below, as well as other procurement-related datasets CanadaBuys produces. Use this as a guide for understanding the data elements in these files. This dictionary is updated as needed to reflect changes to the data elements.

(2) New tender notices: This file contains up-to-date information on all new tender notices that are published to CanadaBuys throughout a given day. The file is updated every two hours, from 6:15 am until 10:15 pm (UTC-0500), to include new tenders as they are published. All tenders in this file will have a publication date matching the current day (displayed in the field labelled “publicationDate-datePublication”), or the day prior for systems that feed into this file on a nightly basis.

(3) Open tender notices: This file contains up-to-date information on all tender notices that are open for bidding on CanadaBuys, including any amendments made to these tender notices during their lifecycles. The file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include newly published open tenders. All tenders in this file will have a status of open (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

(4) All CanadaBuys tender notices, 2022-08-08 onwards: This file contains up-to-date information on all tender notices published through CanadaBuys. This includes any tender notices that were open for bids on or after August 8, 2022, when CanadaBuys launched as the system of record for all tender notices for the Government of Canada. This file includes any amendments made to these tender notices during their lifecycles. It is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. Tender notices in this file can have any publication date on or after August 8, 2022 (displayed in the field labelled “publicationDate-datePublication”), and can have a status of open, cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

(5) Legacy tender notices, 2009 to 2022-08 (prior to CanadaBuys): This file contains details of the tender notices that were launched prior to the implementation of CanadaBuys, which became the system of record for all tender notices for the Government of Canada on August 8, 2022. This data file is refreshed monthly. The over 70,000 tenders in this file have publication dates from August 5, 2022 and before (displayed in the field labelled “publicationDate-datePublication”) and have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). Note: Procurement data was structured differently in the legacy applications previously used to administer Government of Canada tender notices. Efforts have been made to reshape these historical records into the structure used by the CanadaBuys data files, to make them easier to analyse and compare with new records. This process is not perfect, since simple one-to-one mappings can’t be made in many cases. You can access these historical records in their original format as part of the archived copy of the original tender notices dataset. You can also refer to the supporting documentation for understanding the new CanadaBuys tender and award notices datasets.

(6) Tender notices, YYYY-YYYY: These files contain information on all tender notices published in the specified fiscal year that are no longer open to bidding. The current fiscal year's file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. The files associated with past fiscal years are refreshed monthly. Tender notices in these files can have any publication date between April 1 of a given year and March 31 of the subsequent year (displayed in the field labelled “publicationDate-datePublication”) and can have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). New records are added to these files once related tenders reach their close date, or are cancelled. Note: New tender notice data files will be added on April 1 for each fiscal year.
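For readers working with these files programmatically, a pandas sketch of the category filter and the spill-over cleanup described above; the CSV file name is a placeholder, while the column labels are quoted from the description:

```python
import pandas as pd

notices = pd.read_csv("canadabuys_tender_notices.csv")  # placeholder file name

# Keep only construction tenders; a notice can carry several categories.
cnst = notices[notices["procurementCategory-categorieApprovisionnement"]
               .str.contains("CNST", na=False)]

# Rows whose publication date is blank or not a date are "spill-over" rows.
dates = pd.to_datetime(notices["publicationDate-datePublication"],
                       errors="coerce")
clean = notices[dates.notna()]
```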
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Context
In small and medium-sized firms that aim to do CRM, employees sometimes use Excel to track customer feedback. Excel is widely used due to its popularity and clean interface. However, Excel is not similar to other advanced CRM software and websites such as Slack, HubSpot, Salesforce, or Zoho. In cases where an organization aims to collect lower-level feedback that can then be uploaded to a larger CRM software, Excel is a good choice. I did some research on how to make it easier for a CRM officer, salesperson, or company data manager to automate client feedback tracking using Excel's VBA functionality and VLOOKUP.
Content
This dataset has one file- CRM Finance Loan Tracking Excel File.xlsm which has columns related to customers of a medium-sized financial institution such as Client, Bank Branch Name, Phone Number, Client Account No., Loan Account No., Product, Loan Amount, Disbursed Date, Maturity, Repaid, Debt Owing, Current Note, 1st Latest Note, 2nd Latest Note, 3rd Latest Note, 4th Latest Note, and 5th Latest Note.
How to Use the Excel File
First, enable macros in the Excel file. Then proceed as follows: on the first sheet, called CLIENT LOANS, try typing in column M (Current Note) for any client. The VBA code will automatically update the 1st to 5th Latest Notes in columns N to R. You can view the note logs in the second sheet, called LogSheet. The third sheet, called CountSpecific, shows the count of specific notes for each client.
Note that you can tweak the functionality of these XLSM files to suit your needs by removing unneeded columns and adding new ones. Just remember to modify the VBA code accordingly.
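For illustration, the note-rotation logic the VBA implements can be sketched in Python; the column names follow the description above, and add_note is a hypothetical helper, not part of the workbook:

```python
import pandas as pd

NOTE_COLS = ["1st Latest Note", "2nd Latest Note", "3rd Latest Note",
             "4th Latest Note", "5th Latest Note"]

def add_note(row: pd.Series, new_note: str) -> pd.Series:
    """Shift the note history down one slot and store the new current note."""
    for dst, src in zip(reversed(NOTE_COLS[1:]), reversed(NOTE_COLS[:-1])):
        row[dst] = row[src]
    row[NOTE_COLS[0]] = row["Current Note"]
    row["Current Note"] = new_note
    return row
```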
Acknowledgements
This dataset is a compilation of random client names obtained from https://1000randomnames.com/. Other columns also contain random facts of the clients. For illustrative purposes, I typed the notes for the first five clients.
Inspiration
Can we have a simple Excel file that helps track client feedback? Can we use Excel formulas to track recurring customer complaints? Can we make it easier to see previous client feedback?
Use Cases - Portfolio management - Sales pipeline management - Client feedback tracking - Student progress tracking - Organizational records tracking - Budget management
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned) 📝 Dataset Overview This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.
It is ideal for practicing:
Data cleaning
Exploratory Data Analysis (EDA)
Marketing analytics
Campaign performance insights
Dashboard creation using tools like Excel, Python, or Power BI
📁 Columns in the Dataset

| Column Name | Description |
|---|---|
| Ad_ID | Unique ID of the ad campaign |
| Campaign_Name | Name of the campaign (with typos and variations) |
| Clicks | Number of clicks received |
| Impressions | Number of ad impressions |
| Cost | Total cost of the ad (in ₹ or $ format, with missing values) |
| Leads | Number of leads generated |
| Conversions | Number of actual conversions (signups, sales, etc.) |
| Conversion Rate | Calculated conversion rate (Conversions ÷ Clicks) |
| Sale_Amount | Revenue generated from the conversions |
| Ad_Date | Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY) |
| Location | City where the ad was served (includes spelling/case variations) |
| Device | Device type (Mobile, Desktop, Tablet with mixed casing) |
| Keyword | Keyword that triggered the ad (with typos) |
⚠️ Data Quality Issues (Intentional) This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:
Inconsistent date formats
Spelling errors (e.g., "analitics", "anaytics")
Duplicate rows
Mixed units and symbols in cost/revenue columns
Missing values
Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
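A hedged pandas sketch of how those issues might be tackled; the file name is a placeholder, and format="mixed" needs pandas 2.0 or later:

```python
import pandas as pd

ads = pd.read_csv("google_ads_sales.csv")  # placeholder file name

ads = ads.drop_duplicates()

# Strip currency symbols and thousands separators, then coerce to numbers.
for col in ["Cost", "Sale_Amount"]:
    ads[col] = pd.to_numeric(
        ads[col].astype(str).str.replace(r"[₹$,]", "", regex=True),
        errors="coerce")

# Normalise the inconsistent date formats and the categorical casing.
ads["Ad_Date"] = pd.to_datetime(ads["Ad_Date"], errors="coerce", format="mixed")
for col in ["Device", "Location"]:
    ads[col] = ads[col].str.strip().str.title()
```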
🎯 Use Cases

Data cleaning exercises in Python (Pandas), R, Excel
Data preprocessing for machine learning
Campaign performance analysis
Conversion optimization tracking
Building dashboards in Power BI, Tableau, or Looker
💡 Sample Analysis Ideas

Track campaign cost vs. return (ROI)
Analyze click-through rates (CTR) by device or location
Clean and standardize campaign names and keywords
Investigate keyword performance vs. conversions
🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
PROJECT OBJECTIVE
We are part of XYZ Co Pvt Ltd, a company in the business of organizing sports events at the international level. Countries nominate sportsmen from different departments, and our team has been given the responsibility to systematize the membership roster and generate different reports as per business requirements.
Questions (KPIs)
TASK 1: STANDARDIZING THE DATASET
TASK 2: DATA FORMATTING
TASK 3: SUMMARIZE DATA - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a PIVOT table in the worksheet ANALYSIS, starting at cell B3, with the following details:
TASK 4: SUMMARIZE DATA - EXCEL FUNCTIONS (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a SUMMARY table in the worksheet ANALYSIS, starting at cell G4, with the following details:
TASK 5: GENERATE REPORT - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a PIVOT table report in the worksheet REPORT, starting at cell A3, with the following information:
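Although the tasks are written for Excel, the same summaries can be prototyped in pandas; the workbook, sheet, and column names below are assumptions:

```python
import pandas as pd

roster = pd.read_excel("membership_roster.xlsx",  # placeholder workbook
                       sheet_name="SPORTSMEN")

# Rough equivalent of a PIVOT table: count of sportsmen per country and department.
summary = roster.pivot_table(index="Country", columns="Department",
                             values="Name", aggfunc="count", fill_value=0)
print(summary)
```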
Process
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
🛒 E-Commerce Data Analysis (Excel & Python Project) 📖 Overview
This project analyzes 10,000+ e-commerce sales records using Excel and Python (Pandas) to uncover valuable business insights. It covers essential data analysis techniques such as cleaning, aggregation, and visualization — perfect for beginners and data analyst learners.
🎯 Objectives
Understand customer purchasing trends
Identify top-selling products
Analyze monthly sales and revenue performance
Calculate business KPIs such as Total Revenue, Total Orders, and Average Order Value (AOV)
🧩 Dataset Information
File: ecommerce_simple_10k.csv
Total Rows: 10,000
Columns:

| Column Name | Description |
|---|---|
| order_id | Unique order identifier |
| product | Product name |
| quantity | Number of items ordered |
| price | Price of a single item |
| order_date | Date of order placement |
| city | City where the order was placed |

🧹 Data Cleaning (Python)
Key cleaning steps:
Removed currency symbols (₹) and commas from price and total_sales
Converted order_date into proper datetime format
Created new column month from order_date
Handled missing or incorrect data entries
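A condensed pandas version of those steps, plus the KPIs named above; total_sales is assumed to be quantity * price:

```python
import pandas as pd

df = pd.read_csv("ecommerce_simple_10k.csv")

# Strip the currency symbol and commas from price, then coerce to numeric.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[₹,]", "", regex=True),
    errors="coerce")

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["month"] = df["order_date"].dt.to_period("M")
df["total_sales"] = df["quantity"] * df["price"]

# KPIs: Total Revenue, Total Orders, Average Order Value (AOV).
total_revenue = df["total_sales"].sum()
total_orders = df["order_id"].nunique()
aov = total_revenue / total_orders
```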
The primary business task is to analyze how casual riders and annual members use Cyclistic's bike-share services differently. The insights gained from this analysis will help the marketing team develop strategies aimed at converting casual riders into annual members. This analysis needs to be supported by data and visualizations to convince the Cyclistic executive team.
Casual Riders vs. Annual Members: The core focus of the case study is on the behavioral differences between casual riders and annual members. Cyclistic Historical Trip Data: The data being used is Cyclistic's bike-share trip data, which includes variables like trip duration, start and end stations, user type (casual or member), and bike IDs. Goal: The goal is to design a marketing strategy that targets casual riders and converts them into annual members, as annual members are more profitable for the company.
Lily Moreno: Director of marketing, responsible for Cyclistic’s marketing strategy. Cyclistic Marketing Analytics Team: The team analyzing and reporting on the data. Cyclistic Executive Team: The decision-makers who need to be convinced by the analysis to approve the proposed marketing strategy.
For Q2, the raw data has incorrect column names:
- 01 - Rental Details Rental ID: identifier for each bike rental.
- 01 - Rental Details Local Start Time: The local date and time when the rental started, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Local End Time: The local date and time when the rental ended, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Bike ID: identifier for the bike used during the rental.
- 01 - Rental Details Duration In Seconds Uncapped: The total duration of the rental in seconds, including trips that exceed standard time limits (uncapped).
- 03 - Rental Start Station ID: identifier for the station where the rental began.
- 03 - Rental Start Station Name: The name of the station where the rental began.
- 02 - Rental End Station ID: identifier for the station where the rental ended.
- 02 - Rental End Station Name: The name of the station where the rental ended.
- User Type: Specifies whether the user is a "Subscriber" (member) or a "Customer" rider (casual).
- Member Gender: The gender of the member (if available).
- 05 - Member Details Member Birthyear: The birth year of the member (if available).
- Created a ride_length column using ride_length = D2 - C2 to reflect each trip's duration.
- Created a day_of_week column using the formula =TEXT(C2,"dddd") to extract the weekday from the start time.
- Removed the gender and birthyear columns due to excessive missing values.
- Standardized dates to MM/DD/YYYY HH:MM and ensured uniform number formatting for trip IDs.
- Checked the member_casual column to ensure correct identification of casual riders and members.
- Combined the files with a UNION ALL query.
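The same derived columns can be computed in pandas; the file name is a placeholder, while the column labels are the Q2 names listed above:

```python
import pandas as pd

trips = pd.read_csv("Divvy_Trips_Q2.csv")  # placeholder file name

start = pd.to_datetime(trips["01 - Rental Details Local Start Time"])
end = pd.to_datetime(trips["01 - Rental Details Local End Time"])

trips["ride_length"] = end - start          # Excel: =D2-C2
trips["day_of_week"] = start.dt.day_name()  # Excel: =TEXT(C2,"dddd")

# Drop sparsely populated demographic columns, as described above.
trips = trips.drop(columns=["Member Gender",
                            "05 - Member Details Member Birthyear"])
```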
Context
After reaching historic lows during the pandemic, energy consumption increased in the aftermath of deconfinement. This trend was mostly due to economic factors; as restrictions were either reduced or removed, several countries saw a rise in both consumption and general business activity. With the rapid normalization of daily life, many supply chains came increasingly under strain. Several months later, the Russo-Ukrainian War placed further stress on global logistics networks. Energy prices soared, and inflation became a major issue in nations around the world. In an attempt to curb the consequences of this trend, several governments decided to adopt a series of energy-saving measures. France was no exception. In 2022, the French government launched its own Energy Saving Plan (Plan de sobriété énergétique). With measures aimed at households, businesses and the public sector, authorities are now hoping to cut 10% of national energy consumption by 2024 (2019 being the reference year).
Project objective
To reach these energy-saving goals, it is crucial to understand which trends affect French consumption over time. As such, we will be analyzing national gas and electricity use over a ten-year period (2011-2021). Hopefully, this will allow us to identify the main sources of energy consumption in France.
About the dataset
The project dataset was imported from the French government’s Open Data website. Showing the evolution of national electricity and gas consumption over a ten-year period (2011-2021), it was created and collected by Agence ORE, an association of national gas and electricity distribution network operators. The dataset is released under an open license and includes variables such as operator, year, energy type, consumption category code, consumer category, consumer sector console, consumer sector, company business identification (NAF code), energy consumed, energy delivery point (pdl), and consumption region. The dataset contains almost 30,000 rows.
The dataset was imported and stored on my computer. However, copies of both the raw and clean files can be found in this post.
Our dataset provides extensive information. Nevertheless, we are aware of two potential limitations:
While such information is missing, our project should not face any major obstacles. Given the long-term nature of our data, national trends should be detected even without 2022 energy consumption. In addition, gas and electricity are two of France’s major energy sources and can thus provide many of the expected insights.
Processing
Since the dataset was relatively small (under 30000 rows), I processed the data using Microsoft Excel. First, I created two folders called “Raw Data” and “Working Sheet” (the latter being for the clean data). Afterwards, I eliminated the following unnecessary columns:
Once only the useful columns remained, I translated their names from French to English. Thus:
With this done, I proceeded to remove any potential duplicates from the data using the “remove duplicates” option in Excel’s “Data” section (about 200 rows were removed). Following this, I proceeded to both spell-check and translate data values by using the “Find and replace” option in Excel. As such, the following changes were made:
I then proceeded to eliminate rows with empty and 0 values. Once this was completed, I was left with over 15000 rows of data.
To get a better sense of energy consumption at different scales, I also converted MWh to kWh and TWh in separate columns: “Energy consumption (KWh)” and “Energy consumption (TWh)”. In the end, however, I preferred MWh as a metric since it was simpler to analyze.
All values were rounded to the nearest whole number.
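The unit conversions themselves are straightforward (1 MWh = 1,000 kWh; 1 TWh = 1,000,000 MWh); a pandas sketch under assumed column names:

```python
import pandas as pd

energy = pd.read_csv("french_energy_clean.csv")  # placeholder file name

mwh = energy["Energy consumption (MWh)"]         # assumed column name
energy["Energy consumption (KWh)"] = (mwh * 1_000).round()
energy["Energy consumption (TWh)"] = mwh / 1_000_000  # left unrounded: whole-number
                                                      # rounding would zero most rows
```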
Analysis
Once my data was clean, I used Power BI to create a dashboard (all of my files are available in this post).
At first sight, it would seem that French gas and electricity use gre...