29 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Netflix Movies and TV Shows Dataset Cleaned(excel)

    • kaggle.com
    Updated Apr 8, 2025
    Cite
    Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Tawri
    Description

    This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

    🎯 What’s Included:
    • Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
    • A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN)
    • Columns like 'date_added' have been properly formatted into DMY structure
    • Multi-valued columns like 'listed_in' are split for better analysis
    • Null values replaced with “Unknown” for clarity
    • Duration field broken into numeric + unit components
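
    The cleaning above was done entirely in Excel; for comparison, the same steps can be sketched in pandas. The two rows below are invented, and only the column names follow the Netflix dataset:

```python
import pandas as pd

# Invented two-row sample; only the column names follow the Netflix dataset.
raw = pd.DataFrame({
    "title": ["  Movie A ", "Show B"],
    "country": ["United States", None],
    "date_added": ["September 9, 2019", "January 1, 2020"],
    "duration": ["90 min", "2 Seasons"],
    "listed_in": ["Dramas, Thrillers", "Kids' TV"],
})

clean = raw.copy()
clean["title"] = clean["title"].str.strip()                # TRIM equivalent
clean["country"] = clean["country"].fillna("Unknown")      # nulls -> "Unknown"
clean["date_added"] = pd.to_datetime(clean["date_added"])  # proper date type
clean["listed_in"] = clean["listed_in"].str.split(", ")    # split multi-valued column
# Break the duration field into numeric + unit components
clean[["duration_value", "duration_unit"]] = clean["duration"].str.extract(r"(\d+)\s+(\w+)")
clean["duration_value"] = clean["duration_value"].astype(int)
```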

    🔍 Dataset Purpose: Ideal for beginners and analysts who want to:
    • Practice data cleaning in Excel
    • Explore Netflix content trends
    • Analyze content by type, country, genre, or date added

    📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

    📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.

  3. Dirty Excel Data

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Shiva Vashishtha (2022). Dirty Excel Data [Dataset]. https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data/code
    Explore at:
    zip (13123 bytes); available download formats
    Dataset updated
    Feb 23, 2022
    Authors
    Shiva Vashishtha
    Description

    Dataset

    This dataset was created by Shiva Vashishtha


  4. Cyclistic Bike-share

    • kaggle.com
    zip
    Updated May 15, 2023
    + more versions
    Cite
    Arsenio Clark (2023). Cyclistic Bike-share [Dataset]. https://www.kaggle.com/datasets/arsenioclark/cyclistic-bike-share
    Explore at:
    zip (590509171 bytes); available download formats
    Dataset updated
    May 15, 2023
    Authors
    Arsenio Clark
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    Introduction: This case study is based on Cyclistic, a bike-sharing company in Chicago. I will perform the tasks of a junior data analyst to answer business questions, following a process that includes the phases ask, prepare, process, analyze, share, and act.

    Background: Cyclistic is a bike-sharing company that operates 5,828 bikes across 692 docking stations. The company has been around since 2016 and distinguishes itself from the competition by offering a variety of bike services, including assistive options. Lily Moreno is the director of the marketing team and will receive the insights from this analysis.

    Case study and business task: Lily Moreno's strategy for generating more income by marketing Cyclistic's services correctly is to convert casual riders (single-day passes and/or pay-per-ride customers) into annual members. According to the finance analysts, annual riders are more profitable than casual riders. She would rather see a campaign converting casual riders into annual riders than one targeting new customers, so her strategy as manager of the marketing team is simply to maximize the number of annual riders by converting casual ones.

    In order to make a data-driven decision, Moreno needs the following insights:
    • A better understanding of how casual riders and annual riders differ
    • Why a casual rider would become an annual one
    • How digital media can affect marketing tactics

    Moreno has directed me to the first question: how do casual riders and annual riders differ?

    Stakeholders: Lily Moreno (manager of the marketing team), the Cyclistic marketing team, and the executive team.

    Data sources and organization: Data used in this report is made available and licensed by Motivate International Inc. Personal data is hidden to protect personal information. The data covers the past 12 months (03/2022 – 02/2023) of the bike-share dataset.

    Merging all 12 monthly bike-share files provided yielded 5,785,180 rows of data, all of which were included in this analysis.

    Data security and limitations: Personal information is secured and hidden to prevent unlawful use. Original files are backed up in folders and subfolders.

    Tools and documentation of the cleaning process: The tool used for data verification and data cleaning is Microsoft Excel. The original files made accessible by Motivate International Inc. are backed up in their original format and in separate files.

    Microsoft Excel was used to look through the dataset and get an overview of its content. I performed simple checks of the data by filtering, sorting, formatting, and standardizing it to make it easily mergeable. In Excel, I also changed data types to the right format, removed incomplete or incorrect data, created new columns derived from existing ones, and deleted empty cells. These tasks are easily done in spreadsheets and provide an initial cleaning pass over the data.

    Limitations: Microsoft Excel has a limit of 1,048,576 rows, while the 12 months of data combined total over 5,785,180 rows. When combining the 12 months of data into one table/sheet, Excel is no longer efficient, so I switched over to R.
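
    The author performed the merge in R; as an illustration, the same step in Python (pandas), over two invented mini-frames standing in for the 12 monthly files:

```python
import pandas as pd

# Two invented monthly extracts standing in for the 12 real files.
march = pd.DataFrame({"ride_id": ["a1", "a2"], "member_casual": ["member", "casual"]})
april = pd.DataFrame({"ride_id": ["b1"], "member_casual": ["casual"]})

# Unlike an Excel sheet, a DataFrame has no 1,048,576-row ceiling,
# so all ~5.8 million rows can be concatenated in one pass.
merged = pd.concat([march, april], ignore_index=True)
```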

  5. E-Commerce Sales Data Analysis Using Excel

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Utkarsh Anand (2024). E-Commerce Sales Data Analysis Using Excel [Dataset]. https://www.kaggle.com/datasets/utkarshanand09/e-commerce-sales-data-analysis-using-excel
    Explore at:
    zip (60943371 bytes); available download formats
    Dataset updated
    Dec 27, 2024
    Authors
    Utkarsh Anand
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Performed in-depth analysis of Myntra's e-commerce data using Excel to identify sales trends, customer behavior, and performance metrics. Leveraged advanced Excel functionalities, including pivot tables, charts, conditional formatting, and data cleaning techniques, to derive actionable insights and create visually compelling reports.

  6. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Feb 25, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. http://doi.org/10.5061/dryad.wpzgmsbwj
    Explore at:
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because they are not part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Manuscript published in Scientific Data with DOI .

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format.

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
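
    The edge-building rule described above (two authors are connected if they share at least one paper) can be sketched in plain Python. The paper IDs and author names below are invented and do not reflect the dataset's actual schema:

```python
from collections import Counter
from itertools import combinations

# Invented papers table: paper id -> author list.
papers = {
    "p1": ["rossi", "bianchi"],
    "p2": ["rossi", "bianchi", "verdi"],
    "p3": ["verdi"],
}

# Two authors are connected if they co-authored at least one paper;
# the count doubles as a co-authorship edge weight.
edges = Counter()
for authors in papers.values():
    for a, b in combinations(sorted(set(authors)), 2):
        edges[(a, b)] += 1
```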

    Description of the main data files

    The `Coauthorship_Networ...

  7. Data from: Excel Project

    • kaggle.com
    zip
    Updated Jan 31, 2025
    Cite
    Carina Cruz (2025). Excel Project [Dataset]. https://www.kaggle.com/datasets/carinacruz/excel-project
    Explore at:
    zip (5592940 bytes); available download formats
    Dataset updated
    Jan 31, 2025
    Authors
    Carina Cruz
    Description

    This project includes a series of Excel files demonstrating key Excel functionalities, including:

    • Conditional Formatting for data visualization.
    • Pivot Tables for summarizing and analyzing data.
    • Excel Charts for visual representation of key insights.
    • Use of Formulas and XLOOKUP to automate calculations and data lookup.
    • Data Cleaning techniques to prepare the dataset for analysis.
    • Additionally, the project includes a final Excel file with bike sales data and an interactive dashboard.

    You can download the original Excel file with all formatting here: https://www.kaggle.com/datasets/carinacruz/excel-project

  8. University of Cape Town Student Admissions Data 2006-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    + more versions
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
    Explore at:
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2006 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data and the data is only made available to accredited researchers through DataFirst's Secure Data Service.

    The dataset was separated into the following data files:

    1. Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.
    2. Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.
    3. Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allows the user to distinguish between, for example, higher grade accounting and standard grade accounting. It also allows the user to identify the educational authority issuing the qualification, e.g. Cambridge International Examinations (CIE) versus National Senior Certificate (NSC).
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

    Analysis unit

    Applications, individuals

    Kind of data

    Administrative records [adm]

    Mode of data collection

    Other [oth]

    Cleaning operations

    The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services . The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.

  9. Ethiopia - Multi-Tier Framework (MTF) Survey - Dataset - ENERGYDATA.INFO

    • energydata.info
    Updated Sep 26, 2024
    + more versions
    Cite
    (2024). Ethiopia - Multi-Tier Framework (MTF) Survey - Dataset - ENERGYDATA.INFO [Dataset]. https://energydata.info/dataset/ethiopia-multi-tier-framework-mtf-survey-2018
    Explore at:
    Dataset updated
    Sep 26, 2024
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Ethiopia
    Description

    The MTF survey is a global baseline survey on household access to electricity and clean cooking, which goes beyond the binary approach to look at access as a spectrum of service levels experienced by households. Resources included are raw data, codebook, questionnaires, sampling strategy document, and country diagnostic report. Formats include zip file (which includes raw data sets of dta format), excel spreadsheet, pdf, and docx.

  10. Coffee Sales Excel Project

    • kaggle.com
    Updated Nov 13, 2024
    Cite
    Nuha Zahidi (2024). Coffee Sales Excel Project [Dataset]. https://www.kaggle.com/datasets/nuhazahidi/coffee-sales-excel-project
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nuha Zahidi
    Description

    Tool: Microsoft Excel

    Dataset: Coffee Sales

    Process:
    1. Data Cleaning:
    • Remove duplicates and blanks.
    • Standardize date and currency formats.

    2. Data Manipulation:
    • Use sorting and filtering to work with subsets of interest.
    • Use XLOOKUP, INDEX-MATCH, and IF formulas for efficient data manipulation, such as retrieving, matching, and organising information in spreadsheets.

    3. Data Analysis:
    • Create Pivot Tables and Pivot Charts with formatting to visualize trends.

    4. Dashboard Development:
    • Insert Slicers with formatting for easy filtering and dynamic updates.
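
    In dataframe terms, the XLOOKUP / INDEX-MATCH retrieval step is a left join; a minimal pandas sketch with invented tables (the column names are illustrative, not the dataset's):

```python
import pandas as pd

# Invented order and customer lookup tables.
orders = pd.DataFrame({"customer_id": [101, 102], "roast_type": ["Light", "Dark"]})
customers = pd.DataFrame({"customer_id": [101, 102], "country": ["US", "UK"]})

# XLOOKUP / INDEX-MATCH equivalent: pull each customer's country into the orders table.
enriched = orders.merge(customers, on="customer_id", how="left")
```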

    Highlights: This project aims to understand coffee sales trends by country, roast type, and year, which could help identify marketing opportunities and customer segments.

  11. IP Australia - [Superseded] Intellectual Property Government Open Data 2019...

    • gimi9.com
    Updated Jul 20, 2018
    Cite
    (2018). IP Australia - [Superseded] Intellectual Property Government Open Data 2019 | gimi9.com [Dataset]. https://gimi9.com/dataset/au_intellectual-property-government-open-data-2019
    Explore at:
    Dataset updated
    Jul 20, 2018
    Area covered
    Australia
    Description

    What is IPGOD? The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business, and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD? IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which calls for specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.

    IP Data Platform: IP Australia is also providing free trials of a cloud-based analytics platform with the capability to work with large intellectual property datasets, such as IPGOD, through the web browser, without any installation of software.

    References: The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset: Patents, Trade Marks, Designs, Plant Breeder’s Rights.

    Updates - Tables and columns: Due to changes in our systems, some tables have been affected.
    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements: Data quality has been improved across all tables.
    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match those in previous releases of IPGOD.

  12. Excel Data Cleaning - Montgomery Fleet Inventory

    • kaggle.com
    zip
    Updated Feb 9, 2025
    Cite
    Ibrahimryk (2025). Excel Data Cleaning - Montgomery Fleet Inventory [Dataset]. https://www.kaggle.com/datasets/ibrahimryk/excel-data-cleaning-montgomery-fleet-inventory/data
    Explore at:
    zip (10139 bytes); available download formats
    Dataset updated
    Feb 9, 2025
    Authors
    Ibrahimryk
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a cleaned version of the Montgomery County Fleet Equipment Inventory.

    ✅ Data Cleaning Steps:
    • Removed duplicate records
    • Fixed spelling errors
    • Merged department names using Flash Fill
    • Removed unnecessary whitespace
    • Converted CSV to Excel (.XLSX) format
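
    For comparison, the same steps sketched in pandas (the rows below are invented; the real cleaning was done in Excel):

```python
import pandas as pd

# Invented slice of a fleet inventory; stray whitespace and duplicates are deliberate.
raw = pd.DataFrame({
    "Department": ["  Police", "Police", "Fire Rescue  ", "Fire Rescue"],
    "Equipment Count": [10, 10, 4, 4],
})

clean = raw.copy()
clean["Department"] = clean["Department"].str.strip()   # remove unnecessary whitespace
clean = clean.drop_duplicates().reset_index(drop=True)  # remove duplicate records
# clean.to_excel("fleet_inventory.xlsx", index=False)   # CSV -> .xlsx (needs openpyxl)
```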

    📂 Original Dataset Source: Montgomery County Public Dataset

  13. ECOMMERCE-DATA-ANALYSING

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Harjot Singh (2025). ECOMMERCE-DATA-ANALYSING [Dataset]. https://www.kaggle.com/datasets/harjotsingh13/ecommerce-data-analysing
    Explore at:
    zip (337900 bytes); available download formats
    Dataset updated
    Nov 12, 2025
    Authors
    Harjot Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🛒 E-Commerce Data Analysis (Excel & Python Project) 📖 Overview

    This project analyzes 10,000+ e-commerce sales records using Excel and Python (Pandas) to uncover valuable business insights. It covers essential data analysis techniques such as cleaning, aggregation, and visualization — perfect for beginners and data analyst learners.

    🎯 Objectives

    Understand customer purchasing trends

    Identify top-selling products

    Analyze monthly sales and revenue performance

    Calculate business KPIs such as Total Revenue, Total Orders, and Average Order Value (AOV)

    🧩 Dataset Information

    File: ecommerce_simple_10k.csv
    Total Rows: 10,000
    Columns:
    • order_id: Unique order identifier
    • product: Product name
    • quantity: Number of items ordered
    • price: Price of a single item
    • order_date: Date of order placement
    • city: City where the order was placed

    🧹 Data Cleaning (Python)

    Key cleaning steps:

    Removed currency symbols (₹) and commas from price and total_sales

    Converted order_date into proper datetime format

    Created new column month from order_date

    Handled missing or incorrect data entries
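
    A minimal pandas sketch of those cleaning steps, over an invented two-row stand-in for ecommerce_simple_10k.csv:

```python
import pandas as pd

# Invented two-row stand-in; only the column names follow the dataset.
df = pd.DataFrame({
    "order_id": [1, 2],
    "price": ["₹1,299", "₹499"],
    "order_date": ["2024-01-15", "2024-02-03"],
})

# Strip the currency symbol and thousands separator, then cast to a number.
df["price"] = (df["price"].str.replace("₹", "", regex=False)
                          .str.replace(",", "", regex=False)
                          .astype(int))
df["order_date"] = pd.to_datetime(df["order_date"])  # proper datetime format
df["month"] = df["order_date"].dt.to_period("M")     # new month column
```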

  14. Google Certificate BellaBeats Capstone Project

    • kaggle.com
    zip
    Updated Jan 5, 2023
    Cite
    Jason Porzelius (2023). Google Certificate BellaBeats Capstone Project [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-certificate-bellabeats-capstone-project
    Explore at:
    zip (169161 bytes); available download formats
    Dataset updated
    Jan 5, 2023
    Authors
    Jason Porzelius
    Description

    Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access; therefore, completing a capstone project using web-based programs such as R Studio, SQL Workbench, or Google Sheets was not feasible. I was further limited in which option to choose, as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will act as a junior data analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeats' marketing strategies for future growth. My task is to provide data-driven insights for business tasks provided by the Bellabeats, Inc. executive and data analysis team. In order to accomplish this task, I will complete all parts of the data analysis process (ask, prepare, process, analyze, share, act). In addition, I will break each part of the process down into three sections to provide clarity and accountability: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each key task directly under the numbered key task, using an asterisk (*) as an identifier.

    Section 1 - Ask:

    A. Guiding Questions:
    1. Who are the key stakeholders and what are their goals for the data analysis project?
    2. What is the business task that this data analysis project is attempting to solve?

    B. Key Tasks:
    1. Identify key stakeholders and their goals for the data analysis project.
    *The key stakeholders for this project are as follows:
    -Urška Sršen and Sando Mur, co-founders of Bellabeats, Inc.
    -The Bellabeats marketing analytics team, of which I am a member.

    2. Identify the business task.
    *The business task is: as provided by co-founder Urška Sršen, to gain insight into how consumers use their non-Bellabeats smart devices in order to guide upcoming marketing strategies for the company and help drive future growth. Specifically, the researcher was tasked with applying insights from the data analysis process to one Bellabeats product and presenting those insights to Bellabeats stakeholders.

    Section 2 - Prepare:

    A. Guiding Questions:
    1. Where is the data stored and organized?
    2. Are there any problems with the data?
    3. How does the data help answer the business question?

    B. Key Tasks:

    1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders. *The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored in Kaggle and was made available through user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
      *Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents into my local laptop and decided to use 2 documents for the purposes of this project as they were files which had merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were: -sleepDay_merged.csv -dailyActivity_merged.csv

    2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias. *As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported. *As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...

  15. Google Ads sales dataset

    • kaggle.com
    Updated Jul 22, 2025
    Cite
    NayakGanesh007 (2025). Google Ads sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/google-ads-sales-dataset
    Explore at:
    Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NayakGanesh007
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned) 📝 Dataset Overview This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data — including typos, formatting issues, missing values, and inconsistencies.

    It is ideal for practicing:

    Data cleaning

    Exploratory Data Analysis (EDA)

    Marketing analytics

    Campaign performance insights

    Dashboard creation using tools like Excel, Python, or Power BI

    📁 Columns in the Dataset:
    • Ad_ID: Unique ID of the ad campaign
    • Campaign_Name: Name of the campaign (with typos and variations)
    • Clicks: Number of clicks received
    • Impressions: Number of ad impressions
    • Cost: Total cost of the ad (in ₹ or $ format, with missing values)
    • Leads: Number of leads generated
    • Conversions: Number of actual conversions (signups, sales, etc.)
    • Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
    • Sale_Amount: Revenue generated from the conversions
    • Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
    • Location: City where the ad was served (includes spelling/case variations)
    • Device: Device type (Mobile, Desktop, Tablet, with mixed casing)
    • Keyword: Keyword that triggered the ad (with typos)

    ⚠️ Data Quality Issues (Intentional) This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:

    Inconsistent date formats

    Spelling errors (e.g., "analitics", "anaytics")

    Duplicate rows

    Mixed units and symbols in cost/revenue columns

    Missing values

    Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
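    Each of these issues maps onto a standard pandas operation. The sketch below is a minimal, illustrative cleaning pass: the sample rows are invented to mimic the documented messiness, and it assumes pandas >= 2.0 for `format="mixed"` date parsing. For the real file, load it with `pd.read_csv` instead.

    ```python
    import pandas as pd

    # Tiny stand-in for the raw export; column names follow the dataset
    # description, but the values are invented for illustration.
    df = pd.DataFrame({
        "Ad_Date": ["2024/01/05", "06-01-24", None],
        "Cost": ["₹1,200", "$300.50", None],
        "Device": ["mobile", "MOBILE", "Desktop"],
        "Keyword": ["data analitics", "data anaytics", "data analytics"],
    })

    # 1. Inconsistent date formats: parse element-wise (pandas >= 2.0).
    df["Ad_Date"] = pd.to_datetime(df["Ad_Date"], format="mixed",
                                   dayfirst=True, errors="coerce")

    # 2. Mixed currency symbols and separators: strip, then coerce to float.
    df["Cost"] = pd.to_numeric(
        df["Cost"].str.replace(r"[₹$,]", "", regex=True), errors="coerce"
    )

    # 3. Irregular casing in categorical fields: normalise to title case.
    df["Device"] = df["Device"].str.strip().str.title()

    # 4. Known typos: map misspellings back to a canonical keyword.
    typo_map = {"data analitics": "data analytics",
                "data anaytics": "data analytics"}
    df["Keyword"] = df["Keyword"].replace(typo_map)

    # 5. Duplicate rows from the export.
    df = df.drop_duplicates()

    print(df["Device"].unique())  # → ['Mobile' 'Desktop']
    ```

    The same steps translate directly to Excel (TRIM/PROPER, Find & Replace, Remove Duplicates) or Power Query.
    
    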

    🎯 Use Cases Data cleaning exercises in Python (Pandas), R, Excel

    Data preprocessing for machine learning

    Campaign performance analysis

    Conversion optimization tracking

    Building dashboards in Power BI, Tableau, or Looker

    💡 Sample Analysis Ideas Track campaign cost vs. return (ROI)

    Analyze click-through rates (CTR) by device or location

    Clean and standardize campaign names and keywords

    Investigate keyword performance vs. conversions

    🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data

  16. Data from: Sales Performance

    • kaggle.com
    zip
    Updated Oct 31, 2025
    Vutikonda Johnpaul (2025). Sales Performance [Dataset]. https://www.kaggle.com/datasets/vutikondajohnpaul/sales-performance
    Explore at:
    zip (51903 bytes). Available download formats.
    Dataset updated
    Oct 31, 2025
    Authors
    Vutikonda Johnpaul
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains sales transaction records used to create an interactive Excel Sales Performance Dashboard for business analytics practice.

    It includes six columns capturing essential sales metrics such as date, region, product, quantity, sales revenue, and profit. The data is structured to help analysts and learners explore data visualization, PivotTable summarization, and dashboard design concepts in Excel.

    The dataset was created for educational and demonstration purposes to help users:

    1. Build dashboards that visualize total sales and profit trends
    2. Identify top-performing products and high-profit regions
    3. Practice Excel-based business analytics workflows

    Columns:
    Date – Transaction date (daily sales record)
    Region – Geographic area of the sale (East, West, North, South)
    Product – Product category or item sold
    Sales – Total revenue generated from the sale (USD)
    Profit – Net profit made per transaction
    Quantity – Number of units sold

    Typical uses include:
    Excel or Power BI dashboard projects
    PivotTable practice for business reporting
    Data cleaning and chart-building exercises
    Portfolio development for business analytics students

    Built and tested in Microsoft Excel using PivotTables, Charts, and Conditional Formatting.
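    The PivotTable summaries described above have a direct pandas equivalent. A minimal sketch, using the six documented columns with invented rows:

    ```python
    import pandas as pd

    # Invented transactions with the dataset's six documented columns.
    sales = pd.DataFrame({
        "Date": pd.to_datetime(["2024-01-01", "2024-01-02",
                                "2024-01-02", "2024-01-03"]),
        "Region": ["East", "West", "East", "North"],
        "Product": ["Laptop", "Laptop", "Phone", "Phone"],
        "Quantity": [2, 1, 3, 2],
        "Sales": [2000.0, 1000.0, 1500.0, 1000.0],
        "Profit": [400.0, 180.0, 300.0, 150.0],
    })

    # Equivalent of an Excel PivotTable: Region on rows,
    # Sales and Profit summed in the values area.
    pivot = sales.pivot_table(index="Region",
                              values=["Sales", "Profit"],
                              aggfunc="sum")
    print(pivot)
    ```

    From here, `pivot.plot.bar()` reproduces the kind of chart the Excel dashboard builds from the same summary.
    
    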

  17. Video Game Sales Dataset (Excel Dashboard Project)

    • kaggle.com
    Updated Oct 7, 2025
    Adewale Lateef W (2025). Video Game Sales Dataset (Excel Dashboard Project) [Dataset]. https://www.kaggle.com/datasets/adewalelateefw/video-game-sales-dataset-excel-dashboard-project
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adewale Lateef W
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains video game sales data prepared for an Excel data analysis and dashboard project.

    It includes detailed information on:

    Game titles

    Platforms

    Genres

    Publishers

    Regional and global sales

    The dataset was cleaned, structured, and analyzed in Microsoft Excel to explore patterns in the global video game market. It can be used to:

    Practice data cleaning and pivot tables

    Build interactive dashboards

    Perform sales comparisons across regions and genres

    Develop business insights from entertainment data

    🧩 File Information

    Format: .xlsx (Excel Workbook)

    Columns: Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales

    💡 Use Cases

    Excel dashboard and chart creation

    Data visualization and storytelling

    Business and market analysis practice

    Portfolio or learning projects
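    For the regional sales comparisons listed above, a pandas equivalent of the Excel analysis can be sketched as follows; the rows are invented stand-ins using the documented column names:

    ```python
    import pandas as pd

    # Invented rows using the dataset's documented columns.
    games = pd.DataFrame({
        "Name": ["Game A", "Game B", "Game C"],
        "Platform": ["Wii", "PS4", "Wii"],
        "Year": [2006, 2015, 2009],
        "Genre": ["Sports", "Action", "Sports"],
        "Publisher": ["Nintendo", "Sony", "Nintendo"],
        "NA_Sales": [41.5, 6.2, 15.8],
        "EU_Sales": [29.0, 7.7, 11.0],
        "JP_Sales": [3.8, 0.4, 3.3],
        "Other_Sales": [8.5, 2.0, 2.9],
        "Global_Sales": [82.8, 16.3, 33.0],
    })

    # Regional comparison: which regions drive each genre's sales?
    regional = games.groupby("Genre")[
        ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
    ].sum()

    # Top genre by worldwide sales.
    top_genre = games.groupby("Genre")["Global_Sales"].sum().idxmax()
    print(regional)
    print(top_genre)
    ```

    The same grouping is what a PivotTable with Genre on rows and the regional sales columns in the values area produces in Excel.
    
    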

    👤 Prepared by

    Adewale Lateef W — for data analysis and Excel dashboard learning purposes.

  18. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes). Available download formats.
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
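    Because Total Spent is defined as Quantity * Price Per Unit, a missing value in any one of those three numeric columns can usually be recovered from the other two, a common first step when cleaning this dataset. A minimal pandas sketch, with invented rows using the documented column names:

    ```python
    import pandas as pd

    # Invented stand-in rows with the documented columns and some gaps.
    tx = pd.DataFrame({
        "Transaction ID": ["TXN_1", "TXN_2", "TXN_3"],
        "Price Per Unit": [4.0, None, 8.0],
        "Quantity": [2, 3, None],
        "Total Spent": [8.0, 12.0, 16.0],
    })

    # Recover a missing price from total / quantity,
    # then a missing quantity from total / price.
    tx["Price Per Unit"] = tx["Price Per Unit"].fillna(
        tx["Total Spent"] / tx["Quantity"])
    tx["Quantity"] = tx["Quantity"].fillna(
        tx["Total Spent"] / tx["Price Per Unit"])
    print(tx)
    ```

    Rows where two or more of the three values are missing cannot be recovered this way and need a different treatment (e.g. dropping or category-level imputation).
    
    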
  19. Superstore Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Shivam Amrutkar (2023). Superstore Dataset [Dataset]. https://www.kaggle.com/datasets/yesshivam007/superstore-dataset
    Explore at:
    zip (2119716 bytes). Available download formats.
    Dataset updated
    Sep 25, 2023
    Authors
    Shivam Amrutkar
    License

    Community Data License Agreement – Sharing 1.0, https://cdla.io/sharing-1-0/

    Description

    The Superstore Sales Data dataset, available in Excel format as "Superstore.xlsx," is a comprehensive collection of sales and customer-related information from a retail superstore. This dataset comprises three distinct tables, each providing specific insights into the store's operations and customer interactions.

  20. Instagram Reach Analysis - Excel Project

    • kaggle.com
    zip
    Updated Jun 14, 2025
    Raghad Al-marshadi (2025). Instagram Reach Analysis - Excel Project [Dataset]. https://www.kaggle.com/datasets/raghadalmarshadi/instagram-reach-analysis-excel-project
    Explore at:
    zip (291841 bytes). Available download formats.
    Dataset updated
    Jun 14, 2025
    Authors
    Raghad Al-marshadi
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Instagram Reach Analysis

    An exploratory data analysis project using Excel to understand what influences Instagram post reach and engagement.

    📁 Project Description

    This project uses an Instagram dataset imported from Kaggle to explore how different factors like hashtags, saves, shares, and caption length influence impressions and engagement.

    🛠️ Tools Used

    • Microsoft Excel
    • Pivot Tables
    • TRIM, WRAP, and other Excel formulas

    🧹 Data Cleaning

    • Removed unnecessary spaces using TRIM
    • Removed 17 duplicate rows → 103 unique rows remained
    • Standardized formatting: freeze top row, wrap text, center align

    🔍 Key Analysis Highlights

    1. Impressions by Source

    • Highest reach: Home > Hashtags > Explore > Other
    • Some totals exceed 100% because reach sources overlap

    2. Engagement Insights

    • Saves strongly correlate with higher impressions
    • Caption length is inversely related to likes
    • Shares have weak correlation with impressions
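    These engagement insights can be checked numerically once the sheet is exported. A minimal pandas sketch with invented numbers (the real file's column names may differ):

    ```python
    import pandas as pd

    # Invented post-level metrics shaped like the Instagram reach data.
    posts = pd.DataFrame({
        "Impressions": [1200, 3400, 2200, 5600, 800],
        "Saves": [30, 120, 70, 210, 15],
        "Likes": [100, 180, 140, 260, 60],
        "Caption_Length": [220, 90, 150, 60, 300],
    })

    # Pairwise Pearson correlations, the pandas equivalent of
    # Excel's CORREL over each column pair.
    corr = posts.corr()

    print(corr.loc["Saves", "Impressions"])       # strong positive
    print(corr.loc["Caption_Length", "Likes"])    # strong negative
    ```

    With real data, scanning the `corr` matrix gives the same save/impression and caption-length/like relationships the dashboard reports visually.
    
    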

    3. Hashtag Patterns

    • Most used: #Thecleverprogrammer, #Amankharwal, #Python
    • Repeating hashtags does not guarantee higher reach

    ✅ Conclusion

    Shorter captions and higher save counts contribute more to reach than repeated hashtags. Profile visits are often linked to new followers.

    👩‍💻 Author

    Raghad's LinkedIn

    🧠 Inspiration

    Inspired by content from TheCleverProgrammer, Aman Kharwal, and Kaggle datasets.

    💬 Feedback

    Feel free to open an issue or share suggestions!

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177 (167 scholarly articles cite this dataset.)