This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.
🎯 What’s Included: - Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values) - A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.) - Columns like 'date_added' have been properly formatted into DMY structure - Multi-valued columns like 'listed_in' are split for better analysis - Null values replaced with “Unknown” for clarity - Duration field broken into numeric + unit components
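Although the cleaning itself was done entirely in Excel, here is a minimal pandas sketch for anyone who wants to reproduce the same transformations programmatically; the file name and column labels are assumptions based on the original Kaggle dataset:

```python
import pandas as pd

# Hypothetical file name; the Kaggle original is usually "netflix_titles.csv".
df = pd.read_csv("netflix_titles.csv")

# Replace missing values with "Unknown", mirroring the Excel cleaning above.
df = df.fillna("Unknown")

# Split the duration field into numeric and unit parts, e.g. "90 min" -> 90, "min".
df[["duration_value", "duration_unit"]] = df["duration"].str.extract(r"(\d+)\s*(\D+)")
df["duration_value"] = pd.to_numeric(df["duration_value"], errors="coerce")

# Reformat date_added into a day-month-year structure.
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(),
                                  errors="coerce").dt.strftime("%d-%m-%Y")
```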
🔍 Dataset Purpose: Ideal for beginners and analysts who want to: - Practice data cleaning in Excel - Explore Netflix content trends - Analyze content by type, country, genre, or date added
📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows
📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021, with release years ranging from 1925 to 2021. The dataset will be cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns

Extra steps and further explanation of the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset; therefore, we are going to check it for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
Nulls make up about 30% of the director column, so I will not delete those rows. Instead, I will find another column to populate it from. To populate the director column, we first check whether there is a relationship between the movie_cast column and the director column.
--Below, we find out whether some directors tend to work with particular cast members
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records.
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--While doing this, I found a simpler and faster way to populate a column, which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to director and cast, we are going to populate missing countries from other rows that share the same director.
--Preview the country values that can be borrowed from rows sharing the same director
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--Confirm which rows still have a NULL country (directors with no other record to borrow from)
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL country values as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE date_added IS NULL;
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.
retail_store_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01 |
| Category | The category of the purchased item. | Food, Furniture |
| Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None |
| Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None |
| Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None |
| Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online |
| Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15 |
| Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None |
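Since Total Spent is defined as Quantity * Price Per Unit, a missing value in any one of the three numeric fields can often be recovered from the other two. A hedged pandas sketch, with column names taken from the table above and the file name from the description:

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")

# Coerce the numeric fields; "None" strings and blanks become NaN.
for col in ["Price Per Unit", "Quantity", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Recover each field from the other two where possible.
df["Total Spent"] = df["Total Spent"].fillna(df["Quantity"] * df["Price Per Unit"])
df["Quantity"] = df["Quantity"].fillna(df["Total Spent"] / df["Price Per Unit"])
df["Price Per Unit"] = df["Price Per Unit"].fillna(df["Total Spent"] / df["Quantity"])
```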
The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_EHE | Blender | 5.0 |
| Item_2_EHE | Microwave | 6.5 |
| Item_3_EHE | Toaster | 8.0 |
| Item_4_EHE | Vacuum Cleaner | 9.5 |
| Item_5_EHE | Air Purifier | 11.0 |
| Item_6_EHE | Electric Kettle | 12.5 |
| Item_7_EHE | Rice Cooker | 14.0 |
| Item_8_EHE | Iron | 15.5 |
| Item_9_EHE | Ceiling Fan | 17.0 |
| Item_10_EHE | Table Fan | 18.5 |
| Item_11_EHE | Hair Dryer | 20.0 |
| Item_12_EHE | Heater | 21.5 |
| Item_13_EHE | Humidifier | 23.0 |
| Item_14_EHE | Dehumidifier | 24.5 |
| Item_15_EHE | Coffee Maker | 26.0 |
| Item_16_EHE | Portable AC | 27.5 |
| Item_17_EHE | Electric Stove | 29.0 |
| Item_18_EHE | Pressure Cooker | 30.5 |
| Item_19_EHE | Induction Cooktop | 32.0 |
| Item_20_EHE | Water Dispenser | 33.5 |
| Item_21_EHE | Hand Blender | 35.0 |
| Item_22_EHE | Mixer Grinder | 36.5 |
| Item_23_EHE | Sandwich Maker | 38.0 |
| Item_24_EHE | Air Fryer | 39.5 |
| Item_25_EHE | Juicer | 41.0 |
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_FUR | Office Chair | 5.0 |
| Item_2_FUR | Sofa | 6.5 |
| Item_3_FUR | Coffee Table | 8.0 |
| Item_4_FUR | Dining Table | 9.5 |
| Item_5_FUR | Bookshelf | 11.0 |
| Item_6_FUR | Bed F... |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
PECD Hydro modelling
This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.
The original URLs:
The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019
As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data-wrangling efforts to make this dataset more accessible.
Data description
The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.
In this repository you can find the following CSV files:

- PECD-hydro-capacities.csv: installed capacities
- PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
- PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
- PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
- PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities
The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
- Reservoir, rows from 5 to 7, columns from 1 to 3
- Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
- Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows
The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 13 to 66, columns from 16 to 51
- Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river
The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Minimum and maximum reservoir generation
The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 13 to 66, columns from 196 to 231
- Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels
The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:
- Reservoir, rows from 14 to 66, column 12
- Reservoir, rows from 14 to 66, column 13
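As an illustration of the extractions described above, a pandas sketch that pulls one such block from a PEMM workbook; the file and sheet names are assumptions to be checked against the archive:

```python
import pandas as pd

# Weekly reservoir inflows: rows 13-66, columns 16-51 of the "Reservoir" section.
inflows = pd.read_excel(
    "PEMM_FR00.xlsx",        # hypothetical file name
    sheet_name="Reservoir",  # assumed sheet name
    header=None,
    skiprows=12,             # skip Excel rows 1-12
    nrows=54,                # keep Excel rows 13-66
    usecols=range(15, 51),   # Excel columns 16-51 (0-indexed)
)
# Label the 36 columns with the climatic years 1982-2017.
inflows.columns = range(1982, 2018)
```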
CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Original dataset

The original year-2019 dataset was downloaded from the World Bank Databank using the following approach on July 23, 2022:

- Database: "World Development Indicators"
- Country: 266 (all available)
- Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total"
- Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021
- Layout: Custom -> Time: Column, Country: Row, Series: Column
- Download options: Excel
Preprocessing
With LibreOffice:

- remove non-country entries (lines after Zimbabwe),
- shorten column names for easy processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note '_', not '-', for R),
- remove unnecessary rows after the line for Zimbabwe.
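The same preprocessing can be scripted; a rough pandas equivalent, assuming a standard Databank Excel export layout and a placeholder file name:

```python
import re
import pandas as pd

df = pd.read_excel("world_bank_indicators.xlsx")  # placeholder file name

# Keep only country rows: everything up to and including Zimbabwe.
last = df.index[df["Country Name"] == "Zimbabwe"][0]
df = df.loc[:last]

# Shorten column names; use '_' rather than '-' so they stay valid in R.
df = df.rename(columns={"Country Name": "Country", "Country Code": "Code"})
# Compress headers like "GNI, Atlas method (current US$) [1990]" to "GNI_1990"
# (the exact header pattern varies by export, so adjust the regex as needed).
df.columns = [re.sub(r".*GNI.*?(\d{4}).*", r"GNI_\1", str(c)) for c in df.columns]
```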
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
An Excel spreadsheet listing the information recorded on each of 18,686 costume designs can be viewed, downloaded, and explored. All the usual Excel sorting possibilities are available, and in addition a useful filter has been installed. For example, to find the number of designs that are Frieze Type #1, go to the top of the frieze type 2 column (column AS), click on the drop-down arrow and unselect every option box except True (i.e. True should be turned on, all other choices turned off). Then in the lower left corner, one reads “1111 of 18686 records found”.
Much more sophisticated exploration can be carried out by downloading the rich and flexible Access database. The terms used for this database are described in detail in three sections of the Deep Blue paper associated with this project. The database can be downloaded and explored.
HOW TO USE THE ACCESS DATABASE

1. Click on the Create Cohort and View Math Trait Data button, and select your cohort by clicking on the features of interest (for example: Apron and Blouse).

Note: Depending on how you exited on your previous visit to the database, there may be items to clear up before creating the cohorts.

a) (Usually unnecessary) Click on the small box near the top left corner to allow connection to Access.

b) (Usually unnecessary) If an undesired window blocks part of the screen, click near the top of this window to minimize it.

c) Make certain under Further Filtering that all four Exclude boxes are checked, to get rid of stripes, circles, and circular buttons, and the D1 that is trivially associated with shoes.

2. Click on the Filter Records to Form the Cohort button. Note the # of designs, # of pieces, and # of costumes beside Recalculate.

3. Click on the Calculate Average Math Trait Frequency of Cohort button, and select the symmetry types of interest (for example: D1 and D2).

4. To view the Stage 1 table, click on Create Stage 1 table. To edit and print this table, click on Create Excel (after the table has been created). The same process works for the Stage 2, 3, and 4 tables.

5. To view the matrix listing the math category impact numbers, move over to a button on the right side and click on View Matrix of Math Category Impact Numbers. To edit and print this matrix, click on Create Excel and use the Excel table as usual.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The present dataset was developed using the MODIS Normalized Difference Snow Index with a spatial resolution of 500 m as input for the SNOWMAP algorithm to detect lake ice from daily clear-sky observations. For cloud-cover conditions, lake ice was identified based on the spatial and temporal continuity of lake-ice data. On this basis, daily lake-ice monitoring data for 2612 lakes of the Tibetan Plateau from 2002 to 2018 were calculated and classified. Moreover, a time-series analysis of lake-ice coverage, which included lakes with a surface area greater than 1 km2, was carried out to provide a clear list of lakes for which lake-ice phenology can be estimated. The dataset contains 5834 raster files, one vector file, and 2612 Excel files (including 1134 time series with and without classification statistics). The raster files are named after the daily lake-ice extent. The vector file contains information such as the number, name, location, surface area, and classification number of each processed lake. The names of the Excel files correspond to lake numbers. Each Excel file contains four columns with the daily lake-ice coverage information of its corresponding lake from July 2002 to June 2018; the columns are, in order, date, lake water coverage, lake ice coverage, and cloud coverage. Users can first use the vector file to determine the number, location, and classification number of a given lake, and then obtain the corresponding daily lake-ice coverage data for a given year from the Excel file, for monitoring lake-ice freeze-thaw and research on climate change.
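A small pandas sketch of that recommended workflow for a single lake; the file name and header layout are assumptions, since the Excel files are named by lake number:

```python
import pandas as pd

lake = pd.read_excel("2612.xlsx")  # hypothetical lake-number file name

# The four columns are, in order: date, lake water coverage, lake ice coverage,
# cloud coverage.
lake.columns = ["date", "water_cover", "ice_cover", "cloud_cover"]
lake["date"] = pd.to_datetime(lake["date"])

# Example: mean ice coverage per hydrological year (July-June).
lake["hydro_year"] = lake["date"].dt.year.where(lake["date"].dt.month >= 7,
                                                lake["date"].dt.year - 1)
print(lake.groupby("hydro_year")["ice_cover"].mean())
```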
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
A journal paper published in Energy Strategy Reviews details the method to create the data.
https://www.sciencedirect.com/science/article/pii/S2211467X21001280
2023-10-10: Version 8.0.5 has additional columns: one for the day of the year, and one for the half-hour period of the year (17,520 in a standard year and 17,568 in a leap year). A new interconnector (https://www.viking-link.com/) has posted values since 2023-07-12, but all values have been zero so far (until 2023-09-30).
2023-03-15: Version 8.0.1 is a major rewrite with column names that now include the units and the data type. Also, pumped storage has charging values included from 2012, i.e., the negative values when pumped storage is being charged, as well as the positive values when it was discharging (which were available previously). The raw version of the data (rather than cleaned) has been dropped for the time being.
2023-01-06: Version 7.0.0 was created. Now includes data for the Eleclink interconnector from Great Britain to France through the Channel Tunnel (https://www.eleclink.co.uk/index.php). This supersedes previous versions - as the Eleclink data is now included for historical data (including in the ESPENI total).
2021-09-09: Version 6.0.0 was created. Now includes data for the North Sea Link (NSL) interconnector from Great Britain to Norway (https://www.northsealink.com). The previous version (5.0.4) should not be used - as there was an error with interconnector data having a static value over the summer 2021.
2021-05-05: Version 5.0.0 was created. Datetimes are now in ISO 8601 format (with a capital 'T' between the date and time) rather than with a space as previously (RFC 3339 format), and carry an offset to identify both UTC and local time. MW values are now all saved as integers rather than floats. Elexon data is, as always, from www.elexonportal.co.uk/fuelhh; National Grid data is from https://data.nationalgrideso.com/demand/historic-demand-data. Raw data is now added again for comparison of pre- and post-cleaning, to allow for training of additional cleaning methods. If using Microsoft Excel, the T between the date and time can be removed using the =SUBSTITUTE() command, substituting "T" with a space " ".
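In pandas, the ISO 8601 datetimes parse directly, so there is no need to strip the 'T' in Excel. The column names below are placeholders; check the header of the version you download:

```python
import pandas as pd

espeni = pd.read_csv("espeni.csv")  # placeholder file name

# UTC timestamps parse directly; the local-time column carries +00:00/+01:00
# offsets, so parse it as UTC and convert back to UK local time.
espeni["utc"] = pd.to_datetime(espeni["datetime_utc"])        # assumed column
espeni["local"] = (pd.to_datetime(espeni["datetime_local"],   # assumed column
                                  utc=True)
                   .dt.tz_convert("Europe/London"))
```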
2021-03-02: Version 4.0.0 was created. Due to a new interconnecter (IFA2 - https://en.wikipedia.org/wiki/IFA-2) being commissioned in Q1 2021, there is an additional column with data from National Grid - this is called 'POWER_NGEM_IFA2_FLOW_MW' in the espeni dataset. In addition, National Grid has dropped the column name 'FRENCH_FLOW' that used to provide the value for the column 'POWER_NGEM_FRENCH_FLOW_MW' in previous espeni versions. However, this has been changed to 'IFA_FLOW' in National Grid's original data, which is now called 'POWER_NGEM_IFA_FLOW_MW' in the espeni dataset. Lastly, the IO14 columns have all been dropped by National Grid - and potentially unlikely to appear again in future.
2020-12-02: Version 3.0.0 was created. There was a problem with earlier versions local time format - where the +01:00 value was not carried through into the data properly. Now addressed - therefore - local time now has the format e.g. 2020-03-31 20:00:00+01:00 when in British Summer Time.
2020-10-03: Version 2.0.0 was created, as it looks like National Grid made a significant change to the methodology underpinning the embedded wind calculations. The wind profile seems similar to previous values, but the divergence from earlier published values grows as the embedded value increases. The 'new' values are from https://data.nationalgrideso.com/demand/daily-demand-update from 2013.
Previously: raw and cleaned datasets for Great Britain's publicly available electrical data from Elexon (www.elexonportal.co.uk) and National Grid (https://demandforecast.nationalgrid.com/efs_demand_forecast/faces/DataExplorer). Updated versions with more recent data will be uploaded with a differing version number and doi
All data is released in accordance with Elexon's disclaimer and reservation of rights.
https://www.elexon.co.uk/using-this-website/disclaimer-and-reservation-of-rights/
This disclaimer is also felt to cover the data from National Grid, and the parsed data from the Energy Informatics Group at the University of Birmingham.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description of data preparation performed on data from 2001 to 2007 (end).
Cleaning Data

In the original form of the Sabana data (both daily and hourly), the instrument frequently recorded the minimum values of TIRRa and Total PFD as negative and the maximum value of RH as over 100%. These are clearly unrealistic values, so they were replaced by 0 (zero) for the TIRRa and Total PFD minima and by 100% for the RH maxima.
Defective Data

There were noticeable defects in the Total PFD values in 2003 and 2006 (both daily and hourly data). Specifically, in 2003 the defective Total PFD values ran from January 1st (Day # = 1) through September 3rd (Day # = 247), and in 2006 from March 24th (Day # = 83) through October 31st (Day # = 304). Therefore, four-year (2001, 2002, 2004, and 2005) monthly averages were calculated, and a multiplier was developed from the ratio [four-year average] / [2003 (or 2006) defective data]. The detailed calculation can be seen in the Modification file (MS Excel file). Columns denoted as "Modified Total PFD" are the results of this modification; note that red and black within the column indicate modified and non-modified (original) values, respectively.
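A pandas sketch of both corrections, under assumed column names (the original files label these differently):

```python
import pandas as pd

df = pd.read_csv("sabana_daily.csv")  # placeholder file name

# Cleaning: negative TIRRa / Total PFD minima -> 0; RH maxima above 100 -> 100.
df["TIRRa_min"] = df["TIRRa_min"].clip(lower=0)
df["TotalPFD_min"] = df["TotalPFD_min"].clip(lower=0)
df["RH_max"] = df["RH_max"].clip(upper=100)

# Defect correction for 2003 (days 1-247): scale by the ratio of the four-year
# monthly mean to the 2003 monthly mean.
good = df[df["year"].isin([2001, 2002, 2004, 2005])]
multiplier = (good.groupby("month")["TotalPFD"].mean()
              / df[df["year"] == 2003].groupby("month")["TotalPFD"].mean())
mask = (df["year"] == 2003) & df["day"].between(1, 247)
df.loc[mask, "TotalPFD_mod"] = (df.loc[mask, "TotalPFD"]
                                * df.loc[mask, "month"].map(multiplier))
```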
Missing Data

Large amounts of data are missing from both the daily and hourly datasets, as outlined below. Additionally, there were a few significantly defective values in some columns, which were omitted from the dataset. Missing and omitted data were left blank (no values).
Grizelle González - Project Leader, Research Unit
USDA FS - International Institute of Tropical Forestry
voice: 787-764-7800
ggonzalez@fs.fed.us
License: other-closed (http://dcat-ap.de/def/licenses/other-closed)
The dataset contains the results of the mayoral election on 25 May 2014 and the mayoral runoff election on 15 June 2014 in the City of Düsseldorf.

The local elections took place on 25 May 2014. Because no candidate reached a clear majority, a runoff election for mayor was held on 15 June 2014.

An authority may set up different territorial levels for presenting election results, from the lowest level (voting districts) through constituencies and districts up to the level of the whole city or municipality. Not all levels are necessary for each type of election. For each territorial level that an authority has set up, there is a file containing the overview of those areas for which quick messages (preliminary result reports) have already been received.
Further data sets contain information on the division of electoral areas for local elections and the division of voting districts.
Information on terms in the field of ‘Elections’ can be found in the Election ABC of the interactive learning platform for election workers of the City of Düsseldorf.
The files are encoded in UTF-8. By default, Excel does not display the umlauts in the files correctly. You can avoid this as follows:
Excel 2003: From the 'Data' menu, select 'Import external data' and then 'Import data'. The 'Select data source' dialog opens. Select the file you want to open and press the 'Open' button. Set the file origin to '65001 Unicode (UTF-8)' and continue with the 'Next' button. In the next dialog, set the separator to 'Semicolon' instead of 'Tab' and continue with the 'Next' button again. Then select the 'Text' option as the data format of the columns and exit the wizard with the 'Finish' button. Use the 'OK' button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.
Excel 2010: On the 'Data' tab, in the 'Get external data' section, select the option 'From text'. The 'Import text file' dialog opens. Select the file you want to open and press the 'Open' button. Set the file origin to '65001 Unicode (UTF-8)' and continue with the 'Next' button. In the next dialog, set the separator to 'Semicolon' instead of 'Tab' and continue with the 'Next' button again. Then select the 'Text' option as the data format of the columns and exit the wizard with the 'Finish' button. Use the 'OK' button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.
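If you prefer to skip the Excel import wizard altogether, the semicolon-separated, UTF-8 encoded files read directly in pandas (the file name here is a placeholder):

```python
import pandas as pd

results = pd.read_csv("stichwahl-2014.csv", sep=";", encoding="utf-8", dtype=str)
print(results.head())
```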
The files contain the following column information:
- Number: constituency number
- Name: name of the constituency
- MaxQuickMessages: maximum number of quick messages
- AnzQuickMessages: number of quick messages already recorded
- Eligible voters: number of eligible voters
- Filed under: number of ballot papers submitted
- Turnout: voter turnout at the respective view level
- valid Voting List: number of valid ballot papers
- valid: number of valid votes cast
- invalid Voting List: number of invalid ballot papers
- invalid: number of invalid votes cast

In addition, the following fields are available for each party (example for a party called 'A Party'):
- A Party: number of total votes for the party
- A-Party_Proz: percentage of the party's votes out of the total result
This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel dataset, a cleaned and audited version of the raw Excel dataset, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file “Hive Annotation Job Results – Audited”.
Observation
The “Hive Annotation Job Results” data comes in the form of a single excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and the pseudonym for five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique (that is, there are multiple instances of the same value in the column) numbers separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE – which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
- A table that is a definition list should automatically be tabular and also semantic
- Semantic tables should automatically be tabular
- If a table is NOT tabular, then it is definitely not semantic nor a definition list
- A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular
For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column, it is also semantic
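These criteria translate directly into Boolean logic. A hedged pandas sketch of the violation check, assuming the five answer columns are named as in the description above:

```python
import pandas as pd

df = pd.read_excel("Hive Annotation Job Results.xlsx")  # assumed file name

tab, sem = df["tabular"], df["semantic"]
dl, hr, hc = df["definition list"], df["header row"], df["header column"]

violations = (
    (dl & ~(tab & sem))         # a definition list must be tabular and semantic
    | (sem & ~tab)              # semantic implies tabular
    | (~tab & (sem | dl))       # not tabular -> neither semantic nor definition list
    | (tab & (hr | hc) & ~sem)  # tabular with a header row/column -> semantic
)
print(f"{violations.sum()} of {len(df)} rows violate the background facts")
```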
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new Excel file called "Simcoe_Zooplankton.csv". The ZDATE column needs to be switched from GENERAL to SHORT DATE so that the dates read "YYYY/MM/DD". Save as .csv in the appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period and to combine data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" is then used for the calculation of variability, stabilization, asynchrony, and Shannon Diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis" contains the final statistical analyses as well as code to reproduce the original figures. Data and code for the main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).
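The manual ZDATE reformatting step can also be scripted; a minimal pandas equivalent of that one step:

```python
import pandas as pd

zp = pd.read_csv("Simcoe_Zooplankton.csv")
zp["ZDATE"] = pd.to_datetime(zp["ZDATE"]).dt.strftime("%Y/%m/%d")
zp.to_csv("Simcoe_Zooplankton.csv", index=False)
```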
License: Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
This dataset contains information on Government of Canada tender information published according to the Financial Administration Act. It includes data for all Schedule I, Schedule II and Schedule III departments, agencies, Crown corporations, and other entities (unless specifically exempt) who must comply with the Government of Canada trade agreement obligations. CanadaBuys is the authoritative source of this information. Visit the How procurement works page on the CanadaBuys website to learn more.

All data files in this collection share a common column structure, and the procurement category field (labelled as “procurementCategory-categorieApprovisionnement”) can be used to filter by the following four major categories of tenders:

- Tenders for construction, which will have a value of “CNST”
- Tenders for goods, which will have a value of “GD”
- Tenders for services, which will have a value of “SRV”
- Tenders for services related to goods, which will have a value of “SRVTGD”

A tender may be associated with one or more of the above procurement categories.

Note: Some records contain long tender description values that may cause issues when viewed in certain spreadsheet programs, such as Microsoft Excel. When the information doesn’t fit within the cell’s character limit, the program will insert extra rows that don’t conform to the expected column formatting. (All other records will still be displayed properly, in their own rows.) To quickly remove the “spill-over data” caused by this display error in Excel, select the publication date field (labelled as “publicationDate-datePublication”), then click the Filter button on the Data menu ribbon. You can then use the filter pull-down list to remove any blank or non-date values from this field, which will hide the rows that only contain “spill-over” description information.

The following list describes the resources associated with this CanadaBuys tender notices dataset. Additional information on Government of Canada tenders can also be found on the Tender notices tab of the CanadaBuys tender opportunities page. NOTE: While the CanadaBuys online portal includes tender opportunities from across multiple levels of government, the data files in this related dataset only include notices from federal government organizations.

(1) CanadaBuys data dictionary: This XML file offers descriptions of each data field in the tender notices files linked below, as well as other procurement-related datasets CanadaBuys produces. Use this as a guide for understanding the data elements in these files. This dictionary is updated as needed to reflect changes to the data elements.

(2) New tender notices: This file contains up-to-date information on all new tender notices that are published to CanadaBuys throughout a given day. The file is updated every two hours, from 6:15 am until 10:15 pm (UTC-0500), to include new tenders as they are published. All tenders in this file will have a publication date matching the current day (displayed in the field labelled “publicationDate-datePublication”), or the day prior for systems that feed into this file on a nightly basis.

(3) Open tender notices: This file contains up-to-date information on all tender notices that are open for bidding on CanadaBuys, including any amendments made to these tender notices during their lifecycles. The file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include newly published open tenders. All tenders in this file will have a status of open (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

(4) All CanadaBuys tender notices, 2022-08-08 onwards: This file contains up-to-date information on all tender notices published through CanadaBuys. This includes any tender notices that were open for bids on or after August 8, 2022, when CanadaBuys launched as the system of record for all tender notices for the Government of Canada. This file includes any amendments made to these tender notices during their lifecycles. It is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. Tender notices in this file can have any publication date on or after August 8, 2022 (displayed in the field labelled “publicationDate-datePublication”), and can have a status of open, cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

(5) Legacy tender notices, 2009 to 2022-08 (prior to CanadaBuys): This file contains details of the tender notices that were launched prior to the implementation of CanadaBuys, which became the system of record for all tender notices for the Government of Canada on August 8, 2022. This data file is refreshed monthly. The over 70,000 tenders in this file have publication dates from August 5, 2022 and before (displayed in the field labelled “publicationDate-datePublication”) and have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). Note: Procurement data was structured differently in the legacy applications previously used to administer Government of Canada tender notices. Efforts have been made to reshape these historical records into the structure used by the CanadaBuys data files, to make them easier to analyse and compare with new records. This process is not perfect, since simple one-to-one mappings can’t be made in many cases. You can access these historical records in their original format as part of the archived copy of the original tender notices dataset. You can also refer to the supporting documentation for understanding the new CanadaBuys tender and award notices datasets.

(6) Tender notices, YYYY-YYYY: These files contain information on all tender notices published in the specified fiscal year that are no longer open to bidding. The current fiscal year's file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. The files associated with past fiscal years are refreshed monthly. Tender notices in these files can have any publication date between April 1 of a given year and March 31 of the subsequent year (displayed in the field labelled “publicationDate-datePublication”) and can have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). New records are added to these files once related tenders reach their close date, or are cancelled. Note: New tender notice data files will be added on April 1 for each fiscal year.
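For readers working with these files programmatically, a pandas sketch of the category filter and the spill-over cleanup described above; the CSV file name is a placeholder, while the column labels are quoted from the description:

```python
import pandas as pd

notices = pd.read_csv("canadabuys_tender_notices.csv")  # placeholder file name

# Keep only construction tenders; a notice can carry several categories.
cnst = notices[notices["procurementCategory-categorieApprovisionnement"]
               .str.contains("CNST", na=False)]

# Rows whose publication date is blank or not a date are "spill-over" rows.
dates = pd.to_datetime(notices["publicationDate-datePublication"],
                       errors="coerce")
clean = notices[dates.notna()]
```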
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Context
In small and medium-sized firms that aim to do CRM, employees sometimes use Excel to track customer feedback. Excel is widely used due to its popularity and clean interface. However, Excel is not similar to other advanced CRM software and websites such as Slack, HubSpot, Salesforce, or Zoho. In cases where an organization aims to collect lower-level feedback that can then be uploaded to a larger CRM software, Excel is a good choice. I did some research on how to make it easier for a CRM officer, salesperson, or company data manager to automate client feedback tracking using Excel's VBA functionality and VLOOKUP.
Content
This dataset has one file- CRM Finance Loan Tracking Excel File.xlsm which has columns related to customers of a medium-sized financial institution such as Client, Bank Branch Name, Phone Number, Client Account No., Loan Account No., Product, Loan Amount, Disbursed Date, Maturity, Repaid, Debt Owing, Current Note, 1st Latest Note, 2nd Latest Note, 3rd Latest Note, 4th Latest Note, and 5th Latest Note.
How to Use the Excel File
First, enable macros in the Excel file. Then proceed as follows: on the first sheet, called CLIENT LOANS, try typing in column M (Current Note) for any client. The VBA code will automatically update the 1st to 5th Latest Notes in columns N to R. You can view the note logs in the second sheet, called LogSheet. The third sheet, called CountSpecific, shows the count of specific notes for each client.
Note that you can tweak the functionality of these XLSM files to suit your needs by removing unneeded columns and adding new ones. Just remember to modify the VBA code accordingly.
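For illustration, the note-rotation logic the VBA implements can be sketched in Python; the column names follow the description above, and add_note is a hypothetical helper, not part of the workbook:

```python
import pandas as pd

NOTE_COLS = ["1st Latest Note", "2nd Latest Note", "3rd Latest Note",
             "4th Latest Note", "5th Latest Note"]

def add_note(row: pd.Series, new_note: str) -> pd.Series:
    """Shift the note history down one slot and store the new current note."""
    for dst, src in zip(reversed(NOTE_COLS[1:]), reversed(NOTE_COLS[:-1])):
        row[dst] = row[src]
    row[NOTE_COLS[0]] = row["Current Note"]
    row["Current Note"] = new_note
    return row
```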
Acknowledgements
This dataset is a compilation of random client names obtained from https://1000randomnames.com/. Other columns also contain random facts of the clients. For illustrative purposes, I typed the notes for the first five clients.
Inspiration
Can we have a simple Excel file that helps track client feedback? Can we use Excel formulas to track recurring customer complaints? Can we make it easier to see previous client feedback?
Use Cases - Portfolio management - Sales pipeline management - Client feedback tracking - Student progress tracking - Organizational records tracking - Budget management
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned) 📝 Dataset Overview This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.
It is ideal for practicing:
Data cleaning
Exploratory Data Analysis (EDA)
Marketing analytics
Campaign performance insights
Dashboard creation using tools like Excel, Python, or Power BI
📁 Columns in the Dataset

| Column Name | Description |
|---|---|
| Ad_ID | Unique ID of the ad campaign |
| Campaign_Name | Name of the campaign (with typos and variations) |
| Clicks | Number of clicks received |
| Impressions | Number of ad impressions |
| Cost | Total cost of the ad (in ₹ or $ format, with missing values) |
| Leads | Number of leads generated |
| Conversions | Number of actual conversions (signups, sales, etc.) |
| Conversion Rate | Calculated conversion rate (Conversions ÷ Clicks) |
| Sale_Amount | Revenue generated from the conversions |
| Ad_Date | Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY) |
| Location | City where the ad was served (includes spelling/case variations) |
| Device | Device type (Mobile, Desktop, Tablet with mixed casing) |
| Keyword | Keyword that triggered the ad (with typos) |
⚠️ Data Quality Issues (Intentional) This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:
Inconsistent date formats
Spelling errors (e.g., "analitics", "anaytics")
Duplicate rows
Mixed units and symbols in cost/revenue columns
Missing values
Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
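A hedged pandas sketch of how those issues might be tackled; the file name is a placeholder, and format="mixed" needs pandas 2.0 or later:

```python
import pandas as pd

ads = pd.read_csv("google_ads_sales.csv")  # placeholder file name

ads = ads.drop_duplicates()

# Strip currency symbols and thousands separators, then coerce to numbers.
for col in ["Cost", "Sale_Amount"]:
    ads[col] = pd.to_numeric(
        ads[col].astype(str).str.replace(r"[₹$,]", "", regex=True),
        errors="coerce")

# Normalise the inconsistent date formats and the categorical casing.
ads["Ad_Date"] = pd.to_datetime(ads["Ad_Date"], errors="coerce", format="mixed")
for col in ["Device", "Location"]:
    ads[col] = ads[col].str.strip().str.title()
```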
🎯 Use Cases

Data cleaning exercises in Python (Pandas), R, Excel
Data preprocessing for machine learning
Campaign performance analysis
Conversion optimization tracking
Building dashboards in Power BI, Tableau, or Looker
💡 Sample Analysis Ideas

Track campaign cost vs. return (ROI)
Analyze click-through rates (CTR) by device or location
Clean and standardize campaign names and keywords
Investigate keyword performance vs. conversions
🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
PROJECT OBJECTIVE
We are part of XYZ Co Pvt Ltd, a company in the business of organizing sports events at the international level. Countries nominate sportsmen from different departments, and our team has been given the responsibility to systematize the membership roster and generate different reports as per business requirements.
Questions (KPIs)
TASK 1: STANDARDIZING THE DATASET
TASK 2: DATA FORMATTING
TASK 3: SUMMARIZE DATA - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a PIVOT table in the worksheet ANALYSIS, starting at cell B3, with the following details:
TASK 4: SUMMARIZE DATA - EXCEL FUNCTIONS (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a SUMMARY table in the worksheet ANALYSIS, starting at cell G4, with the following details:
TASK 5: GENERATE REPORT - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)
• Create a PIVOT table report in the worksheet REPORT, starting at cell A3, with the following information:
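Although the tasks are written for Excel, the same summaries can be prototyped in pandas; the workbook, sheet, and column names below are assumptions:

```python
import pandas as pd

roster = pd.read_excel("membership_roster.xlsx",  # placeholder workbook
                       sheet_name="SPORTSMEN")

# Rough equivalent of a PIVOT table: count of sportsmen per country and department.
summary = roster.pivot_table(index="Country", columns="Department",
                             values="Name", aggfunc="count", fill_value=0)
print(summary)
```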
Process
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
🛒 E-Commerce Data Analysis (Excel & Python Project) 📖 Overview
This project analyzes 10,000+ e-commerce sales records using Excel and Python (Pandas) to uncover valuable business insights. It covers essential data analysis techniques such as cleaning, aggregation, and visualization — perfect for beginners and data analyst learners.
🎯 Objectives
Understand customer purchasing trends
Identify top-selling products
Analyze monthly sales and revenue performance
Calculate business KPIs such as Total Revenue, Total Orders, and Average Order Value (AOV)
🧩 Dataset Information
File: ecommerce_simple_10k.csv
Total Rows: 10,000
Columns:

| Column Name | Description |
|---|---|
| order_id | Unique order identifier |
| product | Product name |
| quantity | Number of items ordered |
| price | Price of a single item |
| order_date | Date of order placement |
| city | City where the order was placed |

🧹 Data Cleaning (Python)
Key cleaning steps:
Removed currency symbols (₹) and commas from price and total_sales
Converted order_date into proper datetime format
Created new column month from order_date
Handled missing or incorrect data entries
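A condensed pandas version of those steps, plus the KPIs named above; total_sales is assumed to be quantity * price:

```python
import pandas as pd

df = pd.read_csv("ecommerce_simple_10k.csv")

# Strip the currency symbol and commas from price, then coerce to numeric.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[₹,]", "", regex=True),
    errors="coerce")

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["month"] = df["order_date"].dt.to_period("M")
df["total_sales"] = df["quantity"] * df["price"]

# KPIs: Total Revenue, Total Orders, Average Order Value (AOV).
total_revenue = df["total_sales"].sum()
total_orders = df["order_id"].nunique()
aov = total_revenue / total_orders
```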
The primary business task is to analyze how casual riders and annual members use Cyclistic's bike-share services differently. The insights gained from this analysis will help the marketing team develop strategies aimed at converting casual riders into annual members. This analysis needs to be supported by data and visualizations to convince the Cyclistic executive team.
Casual Riders vs. Annual Members: The core focus of the case study is on the behavioral differences between casual riders and annual members. Cyclistic Historical Trip Data: The data being used is Cyclistic's bike-share trip data, which includes variables like trip duration, start and end stations, user type (casual or member), and bike IDs. Goal: The goal is to design a marketing strategy that targets casual riders and converts them into annual members, as annual members are more profitable for the company.
Lily Moreno: Director of marketing, responsible for Cyclistic’s marketing strategy. Cyclistic Marketing Analytics Team: The team analyzing and reporting on the data. Cyclistic Executive Team: The decision-makers who need to be convinced by the analysis to approve the proposed marketing strategy.
For Q2, the raw data has incorrect column names:
- 01 - Rental Details Rental ID: identifier for each bike rental.
- 01 - Rental Details Local Start Time: The local date and time when the rental started, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Local End Time: The local date and time when the rental ended, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Bike ID: identifier for the bike used during the rental.
- 01 - Rental Details Duration In Seconds Uncapped: The total duration of the rental in seconds, including trips that exceed standard time limits (uncapped).
- 03 - Rental Start Station ID: identifier for the station where the rental began.
- 03 - Rental Start Station Name: The name of the station where the rental began.
- 02 - Rental End Station ID: identifier for the station where the rental ended.
- 02 - Rental End Station Name: The name of the station where the rental ended.
- User Type: Specifies whether the user is a "Subscriber" (member) or a "Customer" rider (casual).
- Member Gender: The gender of the member (if available).
- 05 - Member Details Member Birthyear: The birth year of the member (if available).
- Created a ride_length column using ride_length = D2 - C2 to reflect each trip's duration.
- Created a day_of_week column using the formula =TEXT(C2,"dddd") to extract the weekday from the start time.
- Removed the gender and birthyear columns due to excessive missing values.
- Standardized dates to MM/DD/YYYY HH:MM and ensured uniform number formatting for trip IDs.
- Checked the member_casual column to ensure correct identification of casual riders and members.
- Combined the files with a UNION ALL query.
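The same derived columns can be computed in pandas; the file name is a placeholder, while the column labels are the Q2 names listed above:

```python
import pandas as pd

trips = pd.read_csv("Divvy_Trips_Q2.csv")  # placeholder file name

start = pd.to_datetime(trips["01 - Rental Details Local Start Time"])
end = pd.to_datetime(trips["01 - Rental Details Local End Time"])

trips["ride_length"] = end - start          # Excel: =D2-C2
trips["day_of_week"] = start.dt.day_name()  # Excel: =TEXT(C2,"dddd")

# Drop sparsely populated demographic columns, as described above.
trips = trips.drop(columns=["Member Gender",
                            "05 - Member Details Member Birthyear"])
```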
Context
After reaching historic lows during the pandemic, energy consumption increased in the aftermath of deconfinement. This trend was mostly due to economic factors; as restrictions were either reduced or removed, several countries saw a rise in both consumption and general business activity. With the rapid normalization of daily life, many supply chains came increasingly under strain. Several months later, the Russo-Ukrainian War placed further stress on global logistics networks. Energy prices soared, and inflation became a major issue in nations around the world. In an attempt to curb the consequences of this trend, several governments decided to adopt a series of energy-saving measures. France was no exception. In 2022, the French government launched its own Energy Saving Plan (Plan de sobriété énergétique). With measures aimed at households, businesses and the public sector, authorities are now hoping to cut 10% of national energy consumption by 2024 (2019 being the reference year).
Project objective
To reach these energy-saving goals, it is crucial to understand which trends affect French consumption over time. As such, we will be analyzing national gas and electricity use over a ten-year period (2011-2021). Hopefully, this will allow us to identify the main sources of energy consumption in France.
About the dataset
The project dataset was imported from the French government’s Open Data website. Showing the evolution of national electricity and gas consumption over a ten-year period (2011-2021), it was created and collected by Agence ORE, an association of national gas and electricity distribution network operators. The dataset is released under an open license and includes variables such as operator, year, energy type, consumption category code, consumer category, consumer sector console, consumer sector, company business identification (NAF code), energy consumed, energy delivery point (pdl), and consumption region. The dataset contains almost 30,000 rows.
The dataset was imported and stored on my computer. However, copies of both the raw and clean files can be found in this post.
Our dataset provides extensive information. Nevertheless, we are aware of two potential limitations:
While such information is missing, our project should not face any major obstacles. Given the long-term nature of our data, national trends should be detected even without 2022 energy consumption. In addition, gas and electricity are two of France’s major energy sources and can thus provide many of the expected insights.
Processing
Since the dataset was relatively small (under 30000 rows), I processed the data using Microsoft Excel. First, I created two folders called “Raw Data” and “Working Sheet” (the latter being for the clean data). Afterwards, I eliminated the following unnecessary columns:
Once only the useful columns remained, I translated their names from French to English. Thus:
With this done, I proceeded to remove any potential duplicates from the data using the “remove duplicates” option in Excel’s “Data” section (about 200 rows were removed). Following this, I proceeded to both spell-check and translate data values by using the “Find and replace” option in Excel. As such, the following changes were made:
I then proceeded to eliminate rows with empty and 0 values. Once this was completed, I was left with over 15000 rows of data.
To get a better sense of energy consumption at different scales, I also converted MWh to kWh and TWh in separate columns: “Energy consumption (KWh)” and “Energy consumption (TWh)”. In the end, however, I preferred MWh as a metric since it was simpler to analyze.
All values were rounded to the nearest whole number.
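The unit conversions themselves are straightforward (1 MWh = 1,000 kWh; 1 TWh = 1,000,000 MWh); a pandas sketch under assumed column names:

```python
import pandas as pd

energy = pd.read_csv("french_energy_clean.csv")  # placeholder file name

mwh = energy["Energy consumption (MWh)"]         # assumed column name
energy["Energy consumption (KWh)"] = (mwh * 1_000).round()
energy["Energy consumption (TWh)"] = mwh / 1_000_000  # left unrounded: whole-number
                                                      # rounding would zero most rows
```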
Analysis
Once my data was clean, I used Power BI to create a dashboard (all of my files are available in this post).
At first sight, it would seem that French gas and electricity use gre...