31 datasets found
  1. Netflix Movies and TV Shows Dataset Cleaned(excel)

    • kaggle.com
    Updated Apr 8, 2025
    Cite
    Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Tawri
    Description

    This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

    🎯 What’s Included:
    • Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
    • A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.)
    • Columns like 'date_added' properly formatted into DMY structure
    • Multi-valued columns like 'listed_in' split for better analysis
    • Null values replaced with “Unknown” for clarity
    • Duration field broken into numeric + unit components
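
    As a rough illustration of the cleaning steps listed above, here is a minimal pandas sketch (the dataset itself was cleaned entirely in Excel; the column names follow the original Netflix dataset, and the file name is hypothetical):

    import pandas as pd

    df = pd.read_excel("netflix_cleaned.xlsx")          # hypothetical file name
    df = df.drop_duplicates()
    df["date_added"] = pd.to_datetime(df["date_added"], dayfirst=True, errors="coerce")  # DMY
    df["listed_in"] = df["listed_in"].str.split(", ")   # split multi-valued genres
    df[["duration_value", "duration_unit"]] = df["duration"].str.extract(r"(\d+)\s*(\D+)")
    for col in df.select_dtypes(include="object"):      # nulls -> "Unknown" in text columns
        df[col] = df[col].fillna("Unknown")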

    🔍 Dataset Purpose: Ideal for beginners and analysts who want to:
    • Practice data cleaning in Excel
    • Explore Netflix content trends
    • Analyze content by type, country, genre, or date added

    📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

    📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.

  2. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. The dataset was cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns
    Extra steps and further explanation of the process are given in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLS. 
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    Nulls make up about 30% of the director column, so I will not delete them; instead, I will populate the column from another one. To populate the director column, we first find out whether there is a relationship between the movie_cast and director columns.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL director rows
    using their associated movie_cast records:
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL ;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --While doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to the director and movie_cast columns, we are going to populate country using the director column.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix_clean
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE FROM netflix
    WHERE date_added IS NULL;
    
  3. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
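
    Since Total Spent is defined as Quantity * Price Per Unit, the dirty rows can be located with a quick consistency check; below is a minimal pandas sketch (the file name comes from the description, the tolerance is an assumption):

    import pandas as pd

    df = pd.read_csv("retail_store_sales.csv")

    # Recompute the total and flag rows where the stored value disagrees,
    # plus rows where any of the three fields is missing.
    expected = df["Quantity"] * df["Price Per Unit"]
    bad_total = (df["Total Spent"] - expected).abs() > 0.005
    missing = df[["Quantity", "Price Per Unit", "Total Spent"]].isna().any(axis=1)
    print(df[bad_total | missing])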

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  4. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3949757
    Explore at:
    csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data-wrangling efforts to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find 6 CSV files:

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

    Capacities

    The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51
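
    As a sketch, one of these ranges could be pulled into pandas as follows (the sheet name and row/column ranges come from the bullets above; the per-zone file name is hypothetical):

    import pandas as pd

    # Weekly reservoir inflows: sheet "Reservoir", rows 13-66, columns 16-51,
    # one column per climatic year 1982-2017.
    inflows = pd.read_excel(
        "PEMM_AT00.xlsx",             # hypothetical file name for one zone
        sheet_name="Reservoir",
        header=None,
        skiprows=12,                  # start at row 13
        nrows=54,                     # rows 13-66
        usecols=list(range(15, 51)),  # columns 16-51 (0-based)
    )
    inflows.columns = list(range(1982, 2018))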

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at the beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  5. Population and GDP/GNI/CO2 emissions (2019, raw data)

    • figshare.com
    txt
    Updated Feb 23, 2023
    Cite
    Liang Zhao (2023). Population and GDP/GNI/CO2 emissions (2019, raw data) [Dataset]. http://doi.org/10.6084/m9.figshare.22085060.v6
    Explore at:
    txt
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Liang Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Original dataset

    The original year-2019 dataset was downloaded from the World Bank Databank by the following approach on July 23, 2022.

    Database: "World Development Indicators" Country: 266 (all available) Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total" Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021 Layout: Custom -> Time: Column, Country: Row, Series: Column Download options: Excel

    Preprocessing

    With LibreOffice:

    • remove non-country entries (the rows after Zimbabwe)
    • shorten column names for easy processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (notice '_', not '-', for R)
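
    A pandas equivalent of these LibreOffice steps might look like this (the file name is hypothetical; column names follow the World Bank layout described above):

    import pandas as pd

    wdi = pd.read_excel("wdi_2019_raw.xlsx")                       # hypothetical file name
    last = wdi["Country Name"].eq("Zimbabwe").idxmax()             # last country row
    wdi = wdi.loc[:last]                                           # drop non-country rows below it
    wdi = wdi.rename(columns={"Country Name": "Country", "Country Code": "Code"})
    # e.g. "1990 ... GNI, Atlas method ..." -> "GNI_1990" ('_' keeps the names R-friendly)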

  6. European Folk Costumes Excel Spreadsheet and Access Database

    • deepblue.lib.umich.edu
    Updated Mar 9, 2017
    Cite
    James, David A. (2017). European Folk Costumes Excel Spreadsheet and Access Database [Dataset]. http://doi.org/10.7302/Z2HD7SKC
    Explore at:
    Dataset updated
    Mar 9, 2017
    Dataset provided by
    Deep Blue Data
    Authors
    James, David A.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1997
    Description

    An Excel spreadsheet listing the information recorded on each of 18,686 costume designs can be viewed, downloaded, and explored. All the usual Excel sorting possibilities are available, and in addition a useful filter has been installed. For example, to find the number of designs that are Frieze Type #1, go to the top of the frieze type 2 column (column AS), click on the drop-down arrow and unselect every option box except True (i.e. True should be turned on, all other choices turned off). Then in the lower left corner, one reads “1111 of 18686 records found”.

    Much more sophisticated exploration can be carried out by downloading the rich and flexible Access database. The terms used for this database are described in detail in three sections of the Deep Blue paper associated with this project. The database can be downloaded and explored.

    HOW TO USE THE ACCESS DATABASE

    1. Click on the Create Cohort and View Math Trait Data button, and select your cohort by clicking on the features of interest (for example: Apron and Blouse).

    Note: Depending on how you exited on your previous visit to the database, there may be items to clear up before creating the cohorts.
    a) (Usually unnecessary) Click on the small box near the top left corner to allow connection to Access.
    b) (Usually unnecessary) If an undesired window blocks part of the screen, click near the top of this window to minimize it.
    c) Make certain under Further Filtering that all four Exclude boxes are checked to get rid of stripes and circles, and circular buttons, and the D1 that is trivially associated with shoes.

    2. Click on the Filter Records to Form the Cohort button. Note the # of designs, # of pieces, and # of costumes beside Recalculate.

    3. Click on the Calculate Average Math Trait Frequency of Cohort button, and select the symmetry types of interest (for example: D1 and D2).

    4. To view the Stage 1 table, click on Create Stage 1 table. To edit and print this table, click on Create Excel (after the table has been created). The same process works for the Stage 2, 3 and 4 tables.

    5. To view the matrix listing the math category impact numbers, move over to a button on the right side and click on View Matrix of Math Category Impact Numbers. To edit and print this matrix, click on Create Excel and use the Excel table as usual.

  7. MODIS-based Daily Lake Ice Extent and Coverage Dataset for Tibetan Plateau...

    • data.4tu.nl
    zip
    Updated Mar 12, 2019
    Cite
    Y. (Yubao) Qiu; Pengfei Xie; M. (Matti) Leppäranta; X. (Xingxing) Wang; Juha Lemmetyinen; H. (Hui) Lin; L. (Lijuan) Shi (2019). MODIS-based Daily Lake Ice Extent and Coverage Dataset for Tibetan Plateau [version 1] [Dataset]. http://doi.org/10.4121/uuid:fdfd8c76-6b7c-4bbf-aec8-98ab199d9093
    Explore at:
    zip
    Dataset updated
    Mar 12, 2019
    Dataset provided by
    4TU.Centre for Research Data
    Authors
    Y. (Yubao) Qiu; Pengfei Xie; M. (Matti) Leppäranta; X. (Xingxing) Wang; Juha Lemmetyinen; H. (Hui) Lin; L. (Lijuan) Shi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jul 2002 - Jun 2018
    Area covered
    Tibetan Plateau
    Description

    The present dataset was developed using the MODIS Normalized Difference Snow Index with a spatial resolution of 500 m as input for the SNOWMAP algorithm to detect lake ice from daily clear-sky observations. Furthermore, for cloud-cover conditions, lake ice was identified based on the spatial and temporal continuity of lake-ice data. On this basis, the daily lake-ice monitoring data of 2612 lakes of the Tibetan Plateau from 2002 to 2018 were calculated and classified. Moreover, a time-series analysis of lake ice coverage, which included lakes with a surface area greater than 1 km², was carried out to provide a clear list of lakes for which lake ice phenology can be estimated. The dataset contains 5834 raster files, one vector file and 2612 Excel files (including 1134 time series with and without classification statistics). The raster files are named daily lake ice extent. The vector file contains such information as the number, name, location, surface area and classification number of each processed lake. The names of the Excel files correspond to lake numbers. Each Excel file contains four columns with the daily lake ice coverage information of its corresponding lake from July 2002 to June 2018. The attributes of the columns are, successively, date, lake water coverage, lake ice coverage and cloud coverage. Users can first use the vector file to determine the number, location and classification number of a given lake, and then obtain the corresponding daily lake ice coverage data for a given year from the Excel file, for use in monitoring lake-ice freeze-thaw and research on climate change.
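
    A sketch of the usage described above (the file and column names here are assumptions; each Excel file is named after its lake number):

    import pandas as pd

    # Read one lake's daily file; the four columns are, in order,
    # date, lake water coverage, lake ice coverage and cloud coverage.
    lake = pd.read_excel(
        "0001.xlsx",  # hypothetical lake-number file name
        names=["date", "water_coverage", "ice_coverage", "cloud_coverage"],
    )
    # Assuming 'date' is parsed as a datetime, pull one freeze season:
    season = lake[(lake["date"] >= "2010-10-01") & (lake["date"] <= "2011-06-30")]
    print(season["ice_coverage"].describe())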

  8. Electrical half hourly raw and cleaned datasets for Great Britain from...

    • zenodo.org
    csv
    Updated Jul 22, 2025
    Cite
    Grant Wilson; Grant Wilson (2025). Electrical half hourly raw and cleaned datasets for Great Britain from 2008-11-05 [Dataset]. http://doi.org/10.5281/zenodo.16328483
    Explore at:
    csv
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grant Wilson; Grant Wilson
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    A journal paper published in Energy Strategy Reviews details the method to create the data.

    https://www.sciencedirect.com/science/article/pii/S2211467X21001280

    2023-10-10: Version 8.0.5 has additional columns added, one for the day of the year, and one for the half-hour period of the year (17520 in a standard year and 17568 in a leap year). A new interconnector (https://www.viking-link.com/) has posted values since 2023-07-12, but the values have all been zero so far (as of 2023-09-30).

    2023-03-15: Version 8.0.1 is a major rewrite with column names that now include the units and the data type. Also, pumped storage has charging values included from 2012, i.e., the negative values when pumped storage is being charged, as well as the positive values when it was discharging (which were available previously). The raw version of the data (rather than cleaned) has been dropped for the time being.

    2023-01-06: Version 7.0.0 was created. Now includes data for the Eleclink interconnector from Great Britain to France through the Channel Tunnel (https://www.eleclink.co.uk/index.php). This supersedes previous versions - as the Eleclink data is now included for historical data (including in the ESPENI total).

    2021-09-09: Version 6.0.0 was created. Now includes data for the North Sea Link (NSL) interconnector from Great Britain to Norway (https://www.northsealink.com). The previous version (5.0.4) should not be used - as there was an error with interconnector data having a static value over the summer 2021.

    2021-05-05: Version 5.0.0 was created. Datetimes are now in ISO 8601 format (with a capital letter 'T' between the date and time) rather than with a space as previously (RFC 3339 format), and carry an offset to identify both UTC and local time. MW values are now all saved as integers rather than floats. Elexon data as always from www.elexonportal.co.uk/fuelhh, National Grid data from https://data.nationalgrideso.com/demand/historic-demand-data. Raw data is now added again for comparison of pre- and post-cleaning, to allow for training of additional cleaning methods. If using Microsoft Excel, the T between the date and time can be removed with the =SUBSTITUTE() function, substituting "T" with a space " ".
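
    Outside Excel, the ISO 8601 datetimes parse directly; a small pandas sketch (the file and column names here are hypothetical; check the dataset's own headers):

    import pandas as pd

    df = pd.read_csv("espeni.csv")                            # hypothetical file name
    df["utc"] = pd.to_datetime(df["ELEXM_utc"], utc=True)     # handles the 'T' separator natively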


    2021-03-02: Version 4.0.0 was created. Due to a new interconnector (IFA2 - https://en.wikipedia.org/wiki/IFA-2) being commissioned in Q1 2021, there is an additional column with data from National Grid - this is called 'POWER_NGEM_IFA2_FLOW_MW' in the espeni dataset. In addition, National Grid has dropped the column name 'FRENCH_FLOW' that used to provide the value for the column 'POWER_NGEM_FRENCH_FLOW_MW' in previous espeni versions. However, this has been changed to 'IFA_FLOW' in National Grid's original data, which is now called 'POWER_NGEM_IFA_FLOW_MW' in the espeni dataset. Lastly, the IO14 columns have all been dropped by National Grid - and are unlikely to appear again in future.

    2020-12-02: Version 3.0.0 was created. There was a problem with earlier versions local time format - where the +01:00 value was not carried through into the data properly. Now addressed - therefore - local time now has the format e.g. 2020-03-31 20:00:00+01:00 when in British Summer Time.

    2020-10-03: Version 2.0.0 was created as it looks like National Grid has had a significant change to the methodology underpinning the embedded wind calculations. The wind profile seems similar to previous values, but the difference from the previously published values grows as the embedded value increases. The 'new' values are from https://data.nationalgrideso.com/demand/daily-demand-update from 2013.

    Previously: raw and cleaned datasets for Great Britain's publicly available electrical data from Elexon (www.elexonportal.co.uk) and National Grid (https://demandforecast.nationalgrid.com/efs_demand_forecast/faces/DataExplorer). Updated versions with more recent data will be uploaded with a differing version number and DOI.

    All data is released in accordance with Elexon's disclaimer and reservation of rights.

    https://www.elexon.co.uk/using-this-website/disclaimer-and-reservation-of-rights/

    This disclaimer is also felt to cover the data from National Grid, and the parsed data from the Energy Informatics Group at the University of Birmingham.

  9. IP Australia - [Superseded] Intellectual Property Government Open Data 2019...

    • gimi9.com
    Updated Jul 20, 2018
    Cite
    (2018). IP Australia - [Superseded] Intellectual Property Government Open Data 2019 | gimi9.com [Dataset]. https://gimi9.com/dataset/au_intellectual-property-government-open-data-2019
    Explore at:
    Dataset updated
    Jul 20, 2018
    Area covered
    Australia
    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.

    IP Data Platform

    IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software.

    References

    The following pages can help you gain an understanding of intellectual property administration and processes in Australia to help your analysis of the dataset.

    • Patents
    • Trade Marks
    • Designs
    • Plant Breeder’s Rights

    Updates

    Tables and columns

    Due to changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements

    Data quality has been improved across all tables.

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
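
    Given the data quality notes above (UTF-8 encoding, backslash escape character, ISO dates), a table could be loaded along these lines; the file and column names here are hypothetical:

    import pandas as pd

    ipgod = pd.read_csv(
        "ipgod101.csv",                      # hypothetical table file name
        encoding="utf-8",
        escapechar="\\",
        parse_dates=["application_date"],    # hypothetical ISO 'yyyy-mm-dd' column
    )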

  10. LCZO -- Meteorology -- Daily -- Sabana Field Station -- (2001-2010)

    • beta.hydroshare.org
    • hydroshare.org
    • +1 more
    zip
    Updated Jun 18, 2020
    Cite
    Grizelle González; IITF (2020). LCZO -- Meteorology -- Daily -- Sabana Field Station -- (2001-2010) [Dataset]. https://beta.hydroshare.org/resource/ecb79f4688674bf1bf954722da992fc7/
    Explore at:
    zip (517.3 KB)
    Dataset updated
    Jun 18, 2020
    Dataset provided by
    HydroShare
    Authors
    Grizelle González; IITF
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2001 - Aug 31, 2010
    Description

    Description of data preparation performed on data from 2001 through the end of 2007.

    Cleaning Data

    In the original form of the Sabana data (both daily and hourly), the instruments frequently recorded the minimum values of TIRRa and Total PFD as negative and the maximum value of RH as over 100%. Unquestionably, these are unrealistic values. Thus, they were replaced by 0 (zero) for the TIRRa and Total PFD minimum values and by 100% for the RH maximum values.

    Defective Data

    There were noticeable defects in the Total PFD values in 2003 and 2006 (both daily and hourly data). Specifically, in 2003 the defective Total PFD values ran from January 1st (Day # = 1) through September 3rd (Day # = 247), and in 2006 from March 24th (Day # = 83) through October 31st (Day # = 304). Therefore, four-year (2001, 2002, 2004, and 2005) monthly averages were calculated, and a multiplier was developed based on the ratio [four-year average] / [2003 (or 2006) defective data]. The detailed calculation can be seen in the Modification file (MS Excel file). Accordingly, columns denoted as “Modified Total PFD” are the results of this modification. However, note that red and black colors within the column indicate modified and non-modified (original) values, respectively.
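
    A minimal numeric sketch of that multiplier, using made-up values:

    # Hypothetical values for one month of the year.
    baseline_avg = 45.2            # four-year (2001, 2002, 2004, 2005) monthly average Total PFD
    defective_2003 = 30.1          # defective 2003 value for the same month
    multiplier = baseline_avg / defective_2003
    modified_pfd = defective_2003 * multiplier   # the "Modified Total PFD" value
    print(round(modified_pfd, 1))                # recovers the baseline scale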

    Missing Data

    A large amount of data is missing in both the daily and hourly datasets, as outlined below. Additionally, there were a couple of significantly noticeable defective values in some columns, which were omitted from the dataset. Thus, missing and omitted data were left blank (no values).

    Grizelle González - Project Leader, Research Unit

    USDA FS - International Institute of Tropical Forestry

    voice: 787-764-7800

    ggonzalez@fs.fed.us

  11. Mayor Election 2014 Düsseldorf

    • data.europa.eu
    csv, json
    Updated May 30, 2025
    Cite
    Düsseldorf (2025). Mayor Election 2014 Düsseldorf [Dataset]. https://data.europa.eu/data/datasets/851e793b-50ac-4e57-91fc-a10418b8bb56?locale=en
    Explore at:
    json, csv (33995), csv (272), csv (928), csv (5542), csv (510), csv (51497), csv (3583), csv (1575)
    Dataset updated
    May 30, 2025
    Dataset authored and provided by
    Düsseldorf
    License

    http://dcat-ap.de/def/licenses/other-closed

    Area covered
    Düsseldorf
    Description

    The dataset contains the results of the mayoral election of 25 May 2014 and the mayoral runoff election of 15 June 2014 of the City of Düsseldorf.

    The local elections took place on 25 May 2014. Because no candidate reached a clear majority, a runoff election for mayor was held on 15 June 2014.

    An authority may set up different territorial levels for presenting election results, from the lowest level (voting districts) through constituencies and districts up to the level of the city or municipality as a whole. However, not all levels are necessary for each type of election. For each territorial level that an authority has set up, there is a file containing the overview of those areas with the quick reports (preliminary returns) already received.

    Further data sets contain information on the division of electoral areas for local elections and the division of voting districts.

    Information on terms in the field of ‘Elections’ can be found in the Election ABC of the interactive learning platform for election workers of the City of Düsseldorf.

    The files are encoded in UTF-8. By default, Excel does not display the umlauts in the files correctly. You can avoid this as follows:

    Excel 2003: From the menu ‘Data’ -> ‘Import external data’, select the menu item ‘Import data’. The ‘Select data source’ dialog opens. Select the file you want to open and press the ‘Open’ button. Then set the file origin to ‘65001: Unicode (UTF-8)’ and continue with the ‘Next’ button. In the next dialog, set the separator to ‘Semicolon’ instead of ‘Tab’ and continue with the ‘Next’ button again. Then select the ‘Text’ option as the data format of the columns and exit the wizard with the ‘Finish’ button. Use the ‘OK’ button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.

    Excel 2010: On the ‘Data’ tab, in the section ‘Get external data’, select the option ‘From text’. The dialog ‘Import text file’ opens. Select the file you want to open and press the ‘Open’ button. Then set the file origin to ‘65001: Unicode (UTF-8)’ and continue with the ‘Next’ button. In the next dialog, set the separator to ‘Semicolon’ instead of ‘Tab’ and continue with the ‘Next’ button again. Then select the ‘Text’ option as the data format of the columns and exit the wizard with the ‘Finish’ button. Use the ‘OK’ button to finish the procedure, and the data is displayed UTF-8 encoded in Microsoft Excel.
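
    Outside Excel, the same import is a one-liner in pandas (the file name is hypothetical; the files are semicolon-separated UTF-8, and reading everything as text mirrors the wizard's ‘Text’ column format):

    import pandas as pd

    df = pd.read_csv("wahlergebnis.csv", sep=";", encoding="utf-8", dtype=str)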

    The files contain the following column information:

    Number: constituency number
    Name: name of the constituency
    MaxQuickMessages: maximum number of quick reports
    AnzQuickMessages: number of quick reports already recorded
    Eligible voters: number of eligible voters
    Submitted: number of ballot papers submitted
    Turnout: voter turnout at the respective view level
    valid Voting List: number of valid ballot papers
    valid: number of valid votes cast
    invalid Voting List: number of invalid ballot papers
    invalid: number of invalid votes cast

    In addition, the following fields are available for each party (example for a party called ‘A Party’):

    A Party: number of total votes for the party
    A-Party_Proz: percentage of the party's votes in the total result

  12. Hive Annotation Job Results - Cleaned and Audited

    • kaggle.com
    zip
    Updated Apr 28, 2021
    Cite
    Brendan Kelley (2021). Hive Annotation Job Results - Cleaned and Audited [Dataset]. https://www.kaggle.com/brendankelley/hive-annotation-job-results-cleaned-and-audited
    Explore at:
    zip (471571 bytes)
    Dataset updated
    Apr 28, 2021
    Authors
    Brendan Kelley
    Description

    Context

    This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:

    Hive Data Audit Prompt

    The raw data that accompanies the prompt can be found below:

    Hive Annotation Job Results - Raw Data

    ^ These are the tools I was given to complete my task. The rest of the work is entirely my own.

    To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.

    Content

    Brendan Kelley April 23, 2021

    Hive Data Audit Prompt Results

    This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file “Hive Annotation Job Results – Audited”.

    Observation

    The “Hive Annotation Job Results” data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes “file”, “object id”, and the pseudonym for five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column includes non-unique (that is, there are multiple instances of the same value in the column) numbers separated by a dash. The “object id” column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.

    Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.

    Assumptions

    Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.

    Preparation

    The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:

    • A table that is a definition list should automatically be tabular and also semantic
    • Semantic tables should automatically be tabular
    • If a table is NOT tabular, then it is definitely not semantic nor a definition list
    • A tabular table that has a header row OR header column should definitely be semantic

    These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:

    For the tabular column:
    • If the table is a definition list, it is also tabular
    • If the table is semantic, it is also tabular

    For the semantic column:
    • If the table is a definition list, it is also semantic
    • If the table is not tabular, it is not semantic
    • If the table is tabular and has either a header row or a header column...
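
    A minimal sketch of how these criteria could be checked programmatically (pandas, with simplified column names; the original audit was done with Excel functions):

    import pandas as pd

    # Toy stand-in for the real sheet; one row per job's answers.
    df = pd.DataFrame({
        "tabular":         [True,  False, True],
        "semantic":        [True,  True,  False],
        "definition_list": [True,  False, False],
        "header_row":      [True,  False, True],
        "header_column":   [False, False, False],
    })

    violations = (
        (df["definition_list"] & ~(df["tabular"] & df["semantic"]))                    # fact 1
        | (df["semantic"] & ~df["tabular"])                                            # fact 2
        | (~df["tabular"] & (df["semantic"] | df["definition_list"]))                  # fact 3
        | (df["tabular"] & (df["header_row"] | df["header_column"]) & ~df["semantic"]) # fact 4
    )
    print(df[violations])   # rows whose answers are inconsistent and may need a rerun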

  13. Data from: Species Portfolio Effects Dominate Seasonal Zooplankton...

    • data.niaid.nih.gov
    Updated Mar 16, 2022
    Cite
    O'Connor, Reilly (2022). Species Portfolio Effects Dominate Seasonal Zooplankton Stabilization Within a Large Temperate Lake [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6345004
    Explore at:
    Dataset updated
    Mar 16, 2022
    Dataset provided by
    University of Guelph
    Authors
    O'Connor, Reilly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open up the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new Excel file called "Simcoe_Zooplankton.csv". The column ZDATE in the Excel file needs to be switched from GENERAL to SHORT DATE so that the dates in the ZDATE column read "YYYY/MM/DD". Save as .csv in the appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data that has been subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period and to combine data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" is then used for the calculation of variability, stabilization, asynchrony, and Shannon Diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis" contains the final statistical analyses as well as code to reproduce the original figures. Data and code for main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).

  14. CanadaBuys tender notices - Catalogue - Canadian Urban Data Catalogue (CUDC)...

    • data.urbandatacentre.ca
    Updated Oct 19, 2025
    Cite
    (2025). CanadaBuys tender notices - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-6abd20d4-7a1c-4b38-baa2-9525d0bb2fd2
    Explore at:
    Dataset updated
    Oct 19, 2025
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    This dataset contains Government of Canada tender information published according to the Financial Administration Act. It includes data for all Schedule I, Schedule II and Schedule III departments, agencies, Crown corporations, and other entities (unless specifically exempt) who must comply with the Government of Canada trade agreement obligations. CanadaBuys is the authoritative source of this information. Visit the How procurement works page on the CanadaBuys website to learn more.

    All data files in this collection share a common column structure, and the procurement category field (labelled as “procurementCategory-categorieApprovisionnement”) can be used to filter by the following four major categories of tenders:

    • Tenders for construction, which will have a value of “CNST”
    • Tenders for goods, which will have a value of “GD”
    • Tenders for services, which will have a value of “SRV”
    • Tenders for services related to goods, which will have a value of “SRVTGD”

    A tender may be associated with one or more of the above procurement categories.

    Note: Some records contain long tender description values that may cause issues when viewed in certain spreadsheet programs, such as Microsoft Excel. When the information doesn’t fit within the cell’s character limit, the program will insert extra rows that don’t conform to the expected column formatting. (All other records will still be displayed properly, in their own rows.) To quickly remove the “spill-over data” caused by this display error in Excel, select the publication date field (labelled as “publicationDate-datePublication”), then click the Filter button on the Data menu ribbon. You can then use the filter pull-down list to remove any blank or non-date values from this field, which will hide the rows that only contain “spill-over” description information.

    The following list describes the resources associated with this CanadaBuys tender notices dataset. Additional information on Government of Canada tenders can also be found on the Tender notices tab of the CanadaBuys tender opportunities page. NOTE: While the CanadaBuys online portal includes tender opportunities from across multiple levels of government, the data files in this related dataset only include notices from federal government organizations.

    (1) CanadaBuys data dictionary: This XML file offers descriptions of each data field in the tender notices files linked below, as well as other procurement-related datasets CanadaBuys produces. Use this as a guide for understanding the data elements in these files. This dictionary is updated as needed to reflect changes to the data elements.

    (2) New tender notices: This file contains up-to-date information on all new tender notices that are published to CanadaBuys throughout a given day. The file is updated every two hours, from 6:15 am until 10:15 pm (UTC-0500), to include new tenders as they are published. All tenders in this file will have a publication date matching the current day (displayed in the field labelled “publicationDate-datePublication”), or the day prior for systems that feed into this file on a nightly basis.

    (3) Open tender notices: This file contains up-to-date information on all tender notices that are open for bidding on CanadaBuys, including any amendments made to these tender notices during their lifecycles. The file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include newly published open tenders. All tenders in this file will have a status of open (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

    (4) All CanadaBuys tender notices, 2022-08-08 onwards: This file contains up-to-date information on all tender notices published through CanadaBuys. This includes any tender notices that were open for bids on or after August 8, 2022, when CanadaBuys launched as the system of record for all tender notices for the Government of Canada. This file includes any amendments made to these tender notices during their lifecycles. It is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. Tender notices in this file can have any publication date on or after August 8, 2022 (displayed in the field labelled “publicationDate-datePublication”), and can have a status of open, cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”).

    (5) Legacy tender notices, 2009 to 2022-08 (prior to CanadaBuys): This file contains details of the tender notices that were launched prior to the implementation of CanadaBuys, which became the system of record for all tender notices for the Government of Canada on August 8, 2022. This data file is refreshed monthly. The over 70,000 tenders in this file have publication dates from August 5, 2022 and before (displayed in the field labelled “publicationDate-datePublication”) and have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). Note: Procurement data was structured differently in the legacy applications previously used to administer Government of Canada tender notices. Efforts have been made to reshape these historical records into the structure used by the CanadaBuys data files, to make them easier to analyse and compare with new records. This process is not perfect, since simple one-to-one mappings can’t be made in many cases. You can access these historical records in their original format as part of the archived copy of the original tender notices dataset. You can also refer to the supporting documentation for understanding the new CanadaBuys tender and award notices datasets.

    (6) Tender notices, YYYY-YYYY: These files contain information on all tender notices published in the specified fiscal year that are no longer open for bidding. The current fiscal year's file is refreshed each morning, between 7:00 am and 8:30 am (UTC-0500), to include any updates or amendments, as needed. The files associated with past fiscal years are refreshed monthly. Tender notices in these files can have any publication date between April 1 of a given year and March 31 of the subsequent year (displayed in the field labelled “publicationDate-datePublication”) and can have a status of cancelled or expired (displayed in the field labelled “tenderStatus-tenderStatut-eng”). New records are added to these files once related tenders reach their close date or are cancelled. Note: New tender notice data files will be added on April 1 for each fiscal year.
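
    For example, the category codes above could be used to filter a notices file in pandas (the file name is hypothetical; the field label comes from the description):

    import pandas as pd

    tenders = pd.read_csv("opentendernotice.csv", encoding="utf-8")   # hypothetical file name
    construction = tenders[
        tenders["procurementCategory-categorieApprovisionnement"]
        .str.contains("CNST", na=False)   # a tender may carry more than one category
    ]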

  15. CRM Finance Loan Tracking

    • kaggle.com
    zip
    Updated Mar 17, 2025
    Cite
    dixoncode (2025). CRM Finance Loan Tracking [Dataset]. https://www.kaggle.com/datadplyr/crm-finance-loan-tracking-excel-file
    Explore at:
    zip (327889 bytes)
    Dataset updated
    Mar 17, 2025
    Authors
    dixoncode
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    In small and medium-sized firms that aim to do CRM, employees sometimes use Excel to track customer feedback. Excel is widely used due to its popularity and clean interface. However, Excel is not similar to advanced CRM software and services such as Slack, HubSpot, Salesforce, or Zoho. In cases where an organization aims to collect lower-level feedback that can then be uploaded to a larger CRM system, Excel is a good choice. I did some research on how to make it easier for a CRM officer, salesperson, or company data manager to automate client feedback tracking using Excel's VBA functionality and VLOOKUP.

    Content

    This dataset has one file, CRM Finance Loan Tracking Excel File.xlsm, which has columns related to customers of a medium-sized financial institution, such as Client, Bank Branch Name, Phone Number, Client Account No., Loan Account No., Product, Loan Amount, Disbursed Date, Maturity, Repaid, Debt Owing, Current Note, 1st Latest Note, 2nd Latest Note, 3rd Latest Note, 4th Latest Note, and 5th Latest Note.

    How to Use the Excel File

    First, enable macros in the Excel file. Then you can proceed as follows: on the first sheet, called CLIENT LOANS, try typing in column M (Current Note) for any client. The VBA code will automatically update the 1st to 5th Latest Notes in columns N to R. You can view the note logs in the second sheet, called LogSheet. The third sheet, called CountSpecific, shows the count of specific notes for each client.
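
    For illustration only, the note rotation that the VBA code performs can be sketched like this in Python (this is not the workbook's actual code):

    def add_note(latest_notes: list[str], current_note: str, keep: int = 5) -> list[str]:
        """Push the current note onto the 1st-5th Latest Notes, newest first."""
        return ([current_note] + latest_notes)[:keep]

    history = ["sent statement", "called re: maturity"]
    history = add_note(history, "promised payment Friday")
    print(history)   # newest first, at most 5 notes kept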

    Note that you can tweak the functionality of these XLSM files to suit your needs, by removing some unneeded columns and adding new ones. Just remember to modify the VBA code accordingly.

    Acknowledgements

    This dataset is a compilation of random client names obtained from https://1000randomnames.com/. Other columns also contain random facts of the clients. For illustrative purposes, I typed the notes for the first five clients.

    Inspiration

    Can we have a simple Excel file that helps track client feedback? Can we use Excel formulas to track recurring customer complaints? Can we make it easier to see previous client feedback?

    Use Cases
    • Portfolio management
    • Sales pipeline management
    • Client feedback tracking
    • Student progress tracking
    • Organizational records tracking
    • Budget management

  16. Google Ads sales dataset

    • kaggle.com
    Updated Jul 22, 2025
    Cite
    NayakGanesh007 (2025). Google Ads sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/google-ads-sales-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NayakGanesh007
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned) 📝 Dataset Overview This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.

    It is ideal for practicing:

    Data cleaning

    Exploratory Data Analysis (EDA)

    Marketing analytics

    Campaign performance insights

    Dashboard creation using tools like Excel, Python, or Power BI

    📁 Columns in the Dataset

    Ad_ID: Unique ID of the ad campaign
    Campaign_Name: Name of the campaign (with typos and variations)
    Clicks: Number of clicks received
    Impressions: Number of ad impressions
    Cost: Total cost of the ad (in ₹ or $ format, with missing values)
    Leads: Number of leads generated
    Conversions: Number of actual conversions (signups, sales, etc.)
    Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
    Sale_Amount: Revenue generated from the conversions
    Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
    Location: City where the ad was served (includes spelling/case variations)
    Device: Device type (Mobile, Desktop, Tablet with mixed casing)
    Keyword: Keyword that triggered the ad (with typos)

    ⚠️ Data Quality Issues (Intentional) This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:

    Inconsistent date formats

    Spelling errors (e.g., "analitics", "anaytics")

    Duplicate rows

    Mixed units and symbols in cost/revenue columns

    Missing values

    Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
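
    A minimal cleaning sketch for the issues listed above (column names follow the table earlier in this description; the exact raw formats are assumptions):

    import pandas as pd

    ads = pd.read_csv("google_ads_sales.csv")                  # hypothetical file name
    ads = ads.drop_duplicates()
    ads["Device"] = ads["Device"].str.strip().str.title()      # mobile/MOBILE -> Mobile
    ads["Location"] = ads["Location"].str.strip().str.title()
    # format="mixed" needs pandas >= 2.0; unparseable dates become NaT
    ads["Ad_Date"] = pd.to_datetime(ads["Ad_Date"], format="mixed", dayfirst=True, errors="coerce")
    ads["Cost"] = (ads["Cost"].astype(str)
                   .str.replace(r"[₹$,]", "", regex=True)      # strip currency symbols
                   .pipe(pd.to_numeric, errors="coerce"))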

    🎯 Use Cases

    Data cleaning exercises in Python (Pandas), R, Excel

    Data preprocessing for machine learning

    Campaign performance analysis

    Conversion optimization tracking

    Building dashboards in Power BI, Tableau, or Looker

💡 Sample Analysis Ideas

Track campaign cost vs. return (ROI)

Analyze click-through rates (CTR) by device or location (see the sketch after this list)

    Clean and standardize campaign names and keywords

    Investigate keyword performance vs. conversions
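
Continuing from the cleaning sketch above, the CTR idea reduces to a small groupby:

# Click-through rate (clicks / impressions) per device type.
ctr_by_device = (
    df.groupby("Device")
      .agg(clicks=("Clicks", "sum"), impressions=("Impressions", "sum"))
      .assign(CTR=lambda t: t["clicks"] / t["impressions"])
      .sort_values("CTR", ascending=False)
)
print(ctr_by_device)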

🔖 Tags

Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data

  17. SPORTS_DATA_ANALYSIS_ON_EXCEL

    • kaggle.com
    zip
    Updated Dec 12, 2024
    Cite
    Nil kamal Saha (2024). SPORTS_DATA_ANALYSIS_ON_EXCEL [Dataset]. https://www.kaggle.com/datasets/nilkamalsaha/sports-data-analysis-on-excel
    Explore at:
zip (1203633 bytes); available download formats
    Authors
    Nil kamal Saha
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PROJECT OBJECTIVE

We are part of XYZ Co Pvt Ltd, a company that organizes sports events at the international level. Countries nominate sportsmen from different departments, and our team has been given the responsibility of systematizing the membership roster and generating different reports as per business requirements.

    Questions (KPIs)

    TASK 1: STANDARDIZING THE DATASET

• Populate the FULLNAME consisting of the following fields ONLY, in the prescribed format: PREFIX FIRSTNAME LASTNAME (Note: all UPPERCASE)
• Get the COUNTRY NAME to which these sportsmen belong. Make use of the LOCATION sheet to get the required data
• Populate the LANGUAGE SPOKEN by the sportsmen. Make use of the LOCATION sheet to get the required data
• Generate the EMAIL ADDRESS for members who speak English in the prescribed format lastname.firstname@xyz.org (Note: all lowercase); for all other members the format should be lastname.firstname@xyz.com (Note: all lowercase). See the sketch after this list.
    • Populate the SPORT LOCATION of the sport played by each player. Make use of SPORT sheet to get the required data
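
The tasks are meant to be solved in Excel; purely as an illustration, the email rule from the list above can be sketched in Python (field names are assumptions):

def make_email(firstname: str, lastname: str, language: str) -> str:
    """English speakers get xyz.org addresses, all others xyz.com; all lowercase."""
    domain = "xyz.org" if language.strip().lower() == "english" else "xyz.com"
    return f"{lastname}.{firstname}@{domain}".lower()

print(make_email("John", "Doe", "English"))   # doe.john@xyz.org
print(make_email("Ana", "Silva", "Spanish"))  # silva.ana@xyz.com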

TASK 2: DATA FORMATTING

• Display MEMBER ID always as a 3-digit number (Note: 001, 002, ..., 020, ... etc.)
• Format the BIRTHDATE as dd mmm'yyyy (Prescribed format example: 09 May' 1986)
• Display the units for the WEIGHT column (Prescribed format example: 80 kg)
• Format the SALARY to show the data in thousands. If SALARY is less than 100,000 then display the data with 2 decimal places, else display it with one decimal place. In both cases the unit should be thousands (k), e.g. 87670 -> 87.67 k and 123250 -> 123.2 k (see the sketch below)
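
Likewise as an illustration only, the two formatting rules most likely to trip people up look like this in Python:

def format_member_id(member_id: int) -> str:
    """Zero-pad the member ID to three digits: 1 -> 001, 20 -> 020."""
    return f"{member_id:03d}"

def format_salary(salary: float) -> str:
    """Show salary in thousands: 2 decimals below 100,000, else 1 decimal."""
    if salary < 100_000:
        return f"{salary / 1000:.2f} k"
    return f"{salary / 1000:.1f} k"

print(format_member_id(20))   # 020
print(format_salary(87670))   # 87.67 k
print(format_salary(123250))  # 123.2 k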

TASK 3: SUMMARIZE DATA - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)

• Create a PIVOT table in the worksheet ANALYSIS, starting at cell B3, with the following details:

• In COLUMNS, group: GENDER.
• In ROWS, group: COUNTRY (Note: use COUNTRY NAMES).
• In VALUES, calculate the count of candidates from each COUNTRY and GENDER type; remove GRAND TOTALs.

    TASK 4: SUMMARIZE DATA - EXCEL FUNCTIONS (Use SPORTSMEN worksheet after attempting TASK 1)

• Create a SUMMARY table in the worksheet ANALYSIS, starting at cell G4, with the following details:

• Starting from range H4, get the distinct GENDER. Use the remove-duplicates option and transpose the data.
• Starting from range G5, get the distinct COUNTRY (Note: use COUNTRY NAMES).
• In the cross table, get the count of candidates from each COUNTRY and GENDER type.

    TASK 5: GENERATE REPORT - PIVOT TABLE (Use SPORTSMEN worksheet after attempting TASK 1)

    • Create a PIVOT table report in the worksheet REPORT, starting at cell A3, with the following information:

    • Change the report layout to TABULAR form.
    • Remove expand and collapse buttons.
    • Remove GRAND TOTALs.
    • Allow user to filter the data by SPORT LOCATION.

    Process

• Verified the data for any missing values and anomalies, and resolved them.
• Made sure the data was consistent and clean with respect to data type, data format and values used.
• Created pivot tables according to the questions asked.
  18. ECOMMERCE-DATA-ANALYSING

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Harjot Singh (2025). ECOMMERCE-DATA-ANALYSING [Dataset]. https://www.kaggle.com/datasets/harjotsingh13/ecommerce-data-analysing
    Explore at:
zip (337900 bytes); available download formats
    Authors
    Harjot Singh
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

🛒 E-Commerce Data Analysis (Excel & Python Project)

📖 Overview

    This project analyzes 10,000+ e-commerce sales records using Excel and Python (Pandas) to uncover valuable business insights. It covers essential data analysis techniques such as cleaning, aggregation, and visualization — perfect for beginners and data analyst learners.

    🎯 Objectives

    Understand customer purchasing trends

    Identify top-selling products

    Analyze monthly sales and revenue performance

    Calculate business KPIs such as Total Revenue, Total Orders, and Average Order Value (AOV)

    🧩 Dataset Information

File: ecommerce_simple_10k.csv
Total Rows: 10,000
Columns:

• order_id: Unique order identifier
• product: Product name
• quantity: Number of items ordered
• price: Price of a single item
• order_date: Date of order placement
• city: City where the order was placed

🧹 Data Cleaning (Python)

    Key cleaning steps:

    Removed currency symbols (₹) and commas from price and total_sales

    Converted order_date into proper datetime format

    Created new column month from order_date

    Handled missing or incorrect data entries
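
A minimal pandas sketch of these cleaning steps plus the KPI calculations from the objectives (total_sales appears in the notes but not in the column list, so it is guarded):

import pandas as pd

df = pd.read_csv("ecommerce_simple_10k.csv")

# Remove currency symbols (₹) and commas, then coerce to numeric.
for col in ("price", "total_sales"):
    if col in df.columns:
        df[col] = pd.to_numeric(
            df[col].astype(str).str.replace(r"[₹,]", "", regex=True),
            errors="coerce",
        )

# Convert order_date to datetime and derive a month column.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["month"] = df["order_date"].dt.to_period("M")

# Drop rows whose key fields could not be parsed.
df = df.dropna(subset=["order_date", "price"])

# KPIs from the objectives: revenue, orders, Average Order Value.
total_revenue = (df["price"] * df["quantity"]).sum()
total_orders = df["order_id"].nunique()
aov = total_revenue / total_orders
print(total_revenue, total_orders, aov)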

  19. Divvy Bike Share Analysis

    • kaggle.com
    zip
    Updated Sep 21, 2024
    Cite
    Gabe Puente (2024). Divvy Bike Share Analysis [Dataset]. https://www.kaggle.com/gabepuente/divvy-bike-share-analysis
    Explore at:
zip (533248770 bytes); available download formats
    Authors
    Gabe Puente
    Description

    Business Task:

    The primary business task is to analyze how casual riders and annual members use Cyclistic's bike-share services differently. The insights gained from this analysis will help the marketing team develop strategies aimed at converting casual riders into annual members. This analysis needs to be supported by data and visualizations to convince the Cyclistic executive team.

    Key Identifiers for the Case Study:

• Casual Riders vs. Annual Members: The core focus of the case study is on the behavioral differences between casual riders and annual members.
• Cyclistic Historical Trip Data: The data being used is Cyclistic's bike-share trip data, which includes variables like trip duration, start and end stations, user type (casual or member), and bike IDs.
• Goal: The goal is to design a marketing strategy that targets casual riders and converts them into annual members, as annual members are more profitable for the company.

    Key Stakeholders:

• Lily Moreno: Director of marketing, responsible for Cyclistic’s marketing strategy.
• Cyclistic Marketing Analytics Team: The team analyzing and reporting on the data.
• Cyclistic Executive Team: The decision-makers who need to be convinced by the analysis to approve the proposed marketing strategy.

    File Description

    Column Descriptions:

    • trip_id: identifier for each bike trip.
    • start_time: The start date and time of the trip.
    • end_time: The end date and time of the trip.
    • bikeid: identifier for the bike used.
• tripduration: Duration of the trip as a numeric value.
    • from_station_id: ID of the station where the trip started.
    • from_station_name: Name of the station where the trip started.
    • to_station_id: ID of the station where the trip ended.
    • to_station_name: Name of the station where the trip ended.
    • usertype: Rider type, either 'Member' or 'Casual'.
    • gender: Rider’s gender.
    • birthyear: Rider’s birth year.

For Q2, the raw file ships with non-standard column names:

• 01 - Rental Details Rental ID: identifier for each bike rental.
• 01 - Rental Details Local Start Time: The local date and time when the rental started, recorded in MM/DD/YYYY HH:MM format.
• 01 - Rental Details Local End Time: The local date and time when the rental ended, recorded in MM/DD/YYYY HH:MM format.
• 01 - Rental Details Bike ID: identifier for the bike used during the rental.
• 01 - Rental Details Duration In Seconds Uncapped: The total duration of the rental in seconds, including trips that exceed standard time limits (uncapped).
• 03 - Rental Start Station ID: identifier for the station where the rental began.
• 03 - Rental Start Station Name: The name of the station where the rental began.
• 02 - Rental End Station ID: identifier for the station where the rental ended.
• 02 - Rental End Station Name: The name of the station where the rental ended.
• User Type: Specifies whether the user is a "Subscriber" (member) or a "Customer" (casual rider).
• Member Gender: The gender of the member (if available).
• 05 - Member Details Member Birthyear: The birth year of the member (if available).

    Steps Taken:

    Excel Cleaning Steps

    • Combined Data: Combined the 2019 Q1-Q4 data into one workbook for a unified dataset.
    • Calculated Ride Length: Replaced trip duration with a new calculated column ride_length using ride_length = D2 - C2 to reflect the trip’s duration.
    • Created Day of Week Column: Added a day_of_week column using the formula =TEXT(C2,"dddd") to extract the weekday from the start time.
    • Removed Outliers: Removed trips longer than 24 hours to eliminate outliers.
    • Removed Columns: Dropped gender and birthyear columns due to excessive missing values.
    • Formatting: Standardized date and time formats to MM/DD/YYYY HH:MM and ensured uniform number formatting for trip IDs.
    • Saved Workbook: Saved the cleaned dataset for further analysis.

    SQL Data Preparation Steps

    • Data Upload: Uploaded each quarter’s data to SQL and stored them as separate tables (Q1, Q2, Q3, Q4).
    • Row Count Check: Verified total rows to ensure data integrity using SQL queries.
    • Distinct Rider Types: Checked for distinct values in the member_casual column to ensure correct identification of casual riders and members.
    • Calculated Trip Durations: Used SQL to find the maximum, minimum, and average trip durations for deeper insights.
    • Data Union: Combined data from all four quarters into a unified table using a UNION ALL query.
    • Grouped Analysis: Performed grouping and aggregations by rider type, time of day, day of the week, and stations to understand usage patterns.
    • Calculated Seasonal and Daily Trends: Used SQL to analyze rides by time of day, day of the week, and by month to detect seasonality and daily variations.
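
For readers working outside SQL, the union and grouping steps map closely onto pandas; a rough equivalent follows (quarterly file names are assumptions, and tripduration is assumed to be numeric):

import pandas as pd

# UNION ALL: stack the four quarterly extracts into one table.
quarters = [pd.read_csv(f"divvy_trips_2019_q{i}.csv") for i in range(1, 5)]
trips = pd.concat(quarters, ignore_index=True)

# Row-count and distinct rider-type checks.
print(len(trips))
print(trips["usertype"].unique())

# Maximum, minimum, and average trip durations.
print(trips["tripduration"].agg(["max", "min", "mean"]))

# Grouped analysis: rides by rider type and day of week.
trips["start_time"] = pd.to_datetime(trips["start_time"])
rides = trips.groupby(["usertype", trips["start_time"].dt.day_name()]).size()
print(rides)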

    *...

  20. French gas and electricity consumption (2011-2021)

    • kaggle.com
    zip
    Updated Feb 25, 2023
    Cite
    Mario Fernandez (2023). French gas and electricity consumption (2011-2021) [Dataset]. https://www.kaggle.com/datasets/mariofdz/french-gas-and-electricity-consumption-2011-2021/discussion
    Explore at:
zip (3273067 bytes); available download formats
    Authors
    Mario Fernandez
    Area covered
France
    Description

    Context

    After reaching historic lows during the pandemic, energy consumption increased in the aftermath of deconfinement. This trend was mostly due to economic factors; as restrictions were either reduced or removed, several countries saw a rise in both consumption and general business activity. With the rapid normalization of daily life, many supply chains came increasingly under strain. Several months later, the Russo-Ukrainian War placed further stress on global logistics networks. Energy prices soared, and inflation became a major issue in nations around the world. In an attempt to curb the consequences of this trend, several governments decided to adopt a series of energy-saving measures. France was no exception. In 2022, the French government launched its own Energy Saving Plan (Plan de sobriété énergétique). With measures aimed at households, businesses and the public sector, authorities are now hoping to cut 10% of national energy consumption by 2024 (2019 being the reference year).

    Project objective

    To reach these energy-saving goals, it is crucial to understand which trends affect French consumption over time. As such, we will be analyzing national gas and electricity use over a ten-year period (2011-2021). Hopefully, this will allow us to identify the main sources of energy consumption in France.

    About the dataset

The project dataset was imported from the French government’s Open Data website. Showing the evolution of national electricity and gas consumption over a ten-year period (2011-2021), it was created and collected by Agence ORE, an association of national gas and electricity distribution network operators. The dataset operates under an open license, and includes variables such as operator, year, energy type, consumption category code, consumer category, consumer sector code, consumer sector, company business identification (NAF code), energy consumed, energy delivery point (pdl), and consumption regions. The dataset contains almost 30,000 rows of observations.

    The dataset was imported and stored on my computer. However, copies of both the raw and clean files can be found in this post.

    Our dataset provides extensive information. Nevertheless, we are aware of two potential limitations:

    • The data doesn’t concern all energy commodities (only gas and electricity)
    • The data doesn’t cover 2022 (year where the Russo-Ukrainian War began)

    While such information is missing, our project should not face any major obstacles. Given the long-term nature of our data, national trends should be detected even without 2022 energy consumption. In addition, gas and electricity are two of France’s major energy sources and can thus provide many of the expected insights.

    Processing

    Since the dataset was relatively small (under 30000 rows), I processed the data using Microsoft Excel. First, I created two folders called “Raw Data” and “Working Sheet” (the latter being for the clean data). Afterwards, I eliminated the following unnecessary columns:

    • Code_categorie_consommation
    • Code_grand_secteur
• Code_naf
• Operateur
    • Libelle_secteur_naf2
    • Pdl
    • Code_region
    • Indqual
    • Nombre_mailles_secretisees
    • Libellé_categorie_consommation

Once only the useful columns remained, I translated their names from French to English. Thus:

    • “Annee” became “Year”
    • “Filière” became “Energy type”
    • “Libelle_grand_secteur” became “Sector”
    • “Conso” became “Energy consumption (MWh)”
    • “Libelle_region” became “Region”

    With this done, I proceeded to remove any potential duplicates from the data using the “remove duplicates” option in Excel’s “Data” section (about 200 rows were removed). Following this, I proceeded to both spell-check and translate data values by using the “Find and replace” option in Excel. As such, the following changes were made:

    • “Energy” column:
      • “Electricité” became “Electricity”
      • “Gaz” became “Gas”
    • “Sector” column:
      • “Résidentiel” became “Households”
      • “Industrie” became “Industry”
      • “Tertiaire” became “Services”
    • “Region” column: corrected typos and other spelling errors for French regional names.

    I then proceeded to eliminate rows with empty and 0 values. Once this was completed, I was left with over 15000 rows of data.

To get a better sense of energy consumption on different scales, I also converted MWh to kWh and TWh in separate columns: “Energy consumption (kWh)” and “Energy consumption (TWh)”. In the end, however, I preferred MWh as a metric since it was simpler to analyze.

    All values were rounded to the nearest whole number.
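
The write-up above describes an Excel workflow; for reference only, here is a rough pandas equivalent of the same pipeline (the file name is an assumption; column names are taken from the lists above):

import pandas as pd

df = pd.read_csv("conso_energie_raw.csv")  # assumed file name

# Keep only the useful columns and translate their names.
df = df.rename(columns={
    "Annee": "Year",
    "Filière": "Energy type",
    "Libelle_grand_secteur": "Sector",
    "Conso": "Energy consumption (MWh)",
    "Libelle_region": "Region",
})[["Year", "Energy type", "Sector", "Energy consumption (MWh)", "Region"]]

# Translate and spell-check values (the "Find and replace" step).
df["Energy type"] = df["Energy type"].replace({"Electricité": "Electricity", "Gaz": "Gas"})
df["Sector"] = df["Sector"].replace(
    {"Résidentiel": "Households", "Industrie": "Industry", "Tertiaire": "Services"}
)

# Remove duplicates, then rows with empty or zero consumption.
df = df.drop_duplicates()
df = df[df["Energy consumption (MWh)"].fillna(0) > 0]

# Extra scales, rounded to whole numbers as in the write-up.
df["Energy consumption (kWh)"] = (df["Energy consumption (MWh)"] * 1000).round()
df["Energy consumption (TWh)"] = (df["Energy consumption (MWh)"] / 1_000_000).round()
df["Energy consumption (MWh)"] = df["Energy consumption (MWh)"].round()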

    Analysis

    Once my data was clean, I used Power BI to create a dashboard (all of my files are available in this post).

    At first sight, it would seem that French gas and electricity use gre...
