Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A messy dataset for demonstrating "how to clean data using a spreadsheet". It was intentionally formatted to be messy for demonstration purposes, and was collated from https://openafrica.net/dataset/historic-and-projected-rainfall-and-runoff-for-4-lake-victoria-sub-regions
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.
This dataset is suitable for:
- Practicing data cleaning tasks, such as handling missing values and deducing missing information.
- Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
- Feature engineering to create new variables for machine learning tasks.
Column Name | Description | Example Values |
---|---|---|
Order ID | A unique identifier for each order. | ORD_123456 |
Customer ID | A unique identifier for each customer. | CUST_001 |
Category | The category of the purchased item. | Main Dishes, Drinks |
Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
Price | The static price of the item. May contain missing values. | 15.0, None |
Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
Order Date | The date when the order was placed. Always present. | 2022-01-15 |
Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |
Data Dirtiness:
- Missing values in five columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- A missing Item can often be deduced when Price is present, since each item has a static price.
- A missing Price can be deduced when Quantity and Order Total are present.
- When Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).

Menu Categories and Items:
- Starters: Chicken Melt, French Fries, ...
- Main Dishes: Grilled Chicken, Steak, ...
- Desserts: Chocolate Cake, Ice Cream, ...
- Drinks: Coca Cola, Water, ...
- Side Dishes: Mashed Potatoes, Garlic Bread, ...

Time Range:
- Orders span from January 1, 2022, to December 31, 2023.

Handle Missing Values:
- Fill in Order Total or Quantity using the formula: Order Total = Price * Quantity.
- Deduce Price from Order Total / Quantity if both are available.

Validate Data Consistency:
- Check that Price, Quantity, and Order Total agree (Order Total = Price * Quantity).

Analyze Missing Patterns:
- Examine which columns are missing values and how often.
Category | Item | Price |
---|---|---|
Starters | Chicken Melt | 8.0 |
Starters | French Fries | 4.0 |
Starters | Cheese Fries | 5.0 |
Starters | Sweet Potato Fries | 5.0 |
Starters | Beef Chili | 7.0 |
Starters | Nachos Grande | 10.0 |
Main Dishes | Grilled Chicken | 15.0 |
Main Dishes | Steak | 20.0 |
Main Dishes | Pasta Alfredo | 12.0 |
Main Dishes | Salmon | 18.0 |
Main Dishes | Vegetarian Platter | 14.0 |
Desserts | Chocolate Cake | 6.0 |
Desserts | Ice Cream | 5.0 |
Desserts | Fruit Salad | 4.0 |
Desserts | Cheesecake | 7.0 |
Desserts | Brownie | 6.0 |
Drinks | Coca Cola | 2.5 |
Drinks | Orange Juice | 3.0 |
Drinks ... |
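The deduction rules described above can be sketched in pandas. The mini-sample below is hypothetical, standing in for the real file; only the column names (Item, Price, Quantity, Order Total) follow the dataset's schema:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the dataset's columns.
df = pd.DataFrame({
    "Item": ["Steak", "Coca Cola", "Ice Cream"],
    "Price": [20.0, None, 5.0],
    "Quantity": [2, 3, None],
    "Order Total": [None, 7.5, 10.0],
})

# Order Total = Price * Quantity where Order Total is missing.
mask = df["Order Total"].isna() & df["Price"].notna() & df["Quantity"].notna()
df.loc[mask, "Order Total"] = df.loc[mask, "Price"] * df.loc[mask, "Quantity"]

# Price = Order Total / Quantity where Price is missing.
mask = df["Price"].isna() & df["Order Total"].notna() & df["Quantity"].notna()
df.loc[mask, "Price"] = df.loc[mask, "Order Total"] / df.loc[mask, "Quantity"]

# Quantity = Order Total / Price where Quantity is missing.
mask = df["Quantity"].isna() & df["Order Total"].notna() & df["Price"].notna()
df.loc[mask, "Quantity"] = df.loc[mask, "Order Total"] / df.loc[mask, "Price"]
```

Rows where two of the three numeric fields are missing cannot be repaired this way and are left as NaN.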
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – Demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
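Several of these transformations can be sketched in pandas. The toy frame below is hypothetical; only the column rename ('monthly_salary_(inr)' → 'Monthly Salary (INR)') comes from the description above, the rest is assumed for illustration:

```python
import pandas as pd

# Hypothetical raw frame with duplicates, a messy column name,
# inconsistent capitalization/spacing, and a missing critical value.
df = pd.DataFrame({
    "monthly_salary_(inr)": ["50000", "60000", "60000", None],
    "employment status": ["employed", "Employed ", "Employed ", "unemployed"],
})

df = df.drop_duplicates()  # remove duplicate records

# Unify column naming, then convert salary from string to float.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].astype(float)

# Standardize labels: trim whitespace, unify capitalization.
df["Employment Status"] = df["employment status"].str.strip().str.title()
df = df.drop(columns=["employment status"])

# Eliminate rows where critical data is missing.
df = df.dropna(subset=["Monthly Salary (INR)"])
```

Imputation, outlier handling, and age bucketing would follow the same pattern (e.g., `fillna`, domain-based filters, and `pd.cut`).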
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Access and clean an open source herbarium dataset using Excel or RStudio.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
An unclean copy of my GoodReads dataset (as of 2024/02/11) in CSV format with 406 entries.
Data types included are integers, floats, strings, date/time and booleans (in both TRUE/FALSE and 0/1 formats).
This is a good dataset to practice cleaning and analysing as it contains missing values, inconsistent formats and outliers.
Disclaimer: GoodReads flags duplicate entries when you add them, so the original export contained no duplicates; for the purposes of this project I asked an AI to add 20 random duplicate entries to the dataset.
This dataset was created by Narenrdra Panwar
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is not valid, but my purpose in uploading it was to fill a gap I felt—the lack of a truly messy dataset. A major part of data science, beyond choosing algorithms and other techniques, is cleaning and preprocessing data. Therefore, this dataset can serve as good practice for learning how to clean a messy dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain pixel-level hyperspectral data for six snow and glacier classes, extracted from a hyperspectral image. The dataset "data.csv" has 5417 samples with 142 band values each, belonging to the classes Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" holds the corresponding labels for "data.csv". The dataset "RGB.csv" has the same 5417 samples but with only three band values, whereas "data.csv" has 142.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Maintenance Monitoring: This model could be implemented in drones or satellite imaging systems to monitor the cleanliness of large solar panel installations. These systems could regularly analyze the status of the panels and notify maintenance staff when cleaning is required to maintain optimal efficiency.
Efficiency Optimization: Determining how much grime or dirt is on a solar panel can help estimate the reduction in efficiency. Using this model, energy companies can better plan cleanups to optimize energy production.
Damage Detection: The identification of dirt and grime on panels can also potentially assist in detecting physical damage or irregularities that could be a sign of bigger issues.
Automated Cleaning: Autonomous cleaning robots could utilize this model to identify dirty panels in real time and target specific areas that need to be cleaned, improving their efficiency and effectiveness.
Environmental Impact Studies: By identifying dirty solar panels, environmental scientists and researchers can analyze patterns, such as dust deposition over time or environmental impact, that might help in furthering research on solar panel placement strategies and environmental adjustments.
This dataset contains NYC Street Centerline (CSCL) physical_IDs which represent segments of streets and the date and time those street segments were last visited by a mechanical broom.
This dataset is connected to SweepNYC (nyc.gov/sweepnyc), a tool maintained by the NYC Department of Sanitation (DSNY) that allows New Yorkers to track the progress of DSNY mechanical brooms. The mechanical broom, also known as a street sweeper, is New York City's first line of defense against dirty curbs. Each one picks up 1,500 lbs. of litter on a single shift. For information on how to file a street sweeping complaint see the article on NYC 311.
This dataset was created by Michael Metter
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Airport Maintenance Monitoring: The Aircraft Cleanliness model can be used by airport authorities to monitor the cleanliness of aircraft and ensure timely cleaning services. This can help maintain a high standard of hygiene and visual appearance for airplanes while also reducing the risk of corrosion or damage due to accumulated dirt.
Airline Quality Control: Airlines can use the model to monitor and compare the cleanliness of their fleet, ensuring consistent quality associated with their brand. It can be employed to hold cleaning crews accountable and establish benchmarks for cleanliness quality.
Passenger Experience Enhancement: Airline ratings and review platforms can integrate the Aircraft Cleanliness model to rate airlines based on the cleanliness of their airplanes. This information can then be provided to passengers, helping them make informed decisions when choosing airlines.
Cleaning Service Optimization: Cleaning companies specializing in aircraft maintenance can utilize this model to optimize their cleaning services. By detecting specific dirt classes and focusing on those areas, they can save time and resources while providing a more effective cleaning process.
Environmental Impact Analysis: Researchers can use the Aircraft Cleanliness model to study the impact of different environmental conditions on the accumulation of dirt on airplanes. This information can lead to the development of new materials or coatings that help reduce the rate at which dirt and contaminants adhere to the aircraft surface, minimizing cleaning requirements and environmental impacts.
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Solar Photovoltaics Panel for Dust Detection Dataset is an image dataset designed to classify the presence of dust on the surface of solar panels. It consists of images of clean and dusty (dirty) panels.

2) Data Utilization
(1) Characteristics of the Solar Photovoltaics Panel for Dust Detection Dataset:
• The dataset contains images capturing the clean and dirty states of solar panels, which can be used to train AI models that detect performance degradation caused by dust accumulation.
• The images were collected in outdoor environments, accurately reflecting the real-world conditions of solar power systems.

(2) Applications of the Solar Photovoltaics Panel for Dust Detection Dataset:
• Development of automated solar panel diagnostic models: The dataset can be used to train deep learning classification models that automatically determine the cleanliness of solar panels and predict appropriate maintenance timing.
• Smart solar power plant monitoring systems: It can support the development of AI-powered monitoring systems that detect dusty panels in real time based on camera data collected from solar power facilities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Hygiene Monitoring and Alert System: Implement the "dropletdataset-01-03" model in public spaces, such as restrooms and kitchens, to detect droplets and hands. This can assist in promoting proper handwashing and hygiene practices by automatically alerting facility managers to spills or unclean surfaces.
Hand-droplet Interaction Analysis: Use this model in laboratory settings to study the dynamics of droplets and their interaction with hands. This can help understand the implications of various contact scenarios and inform safety protocols for hazardous materials or in medical environments.
Dry Erase Board Maintenance Assistance: Use the model to identify when a dry erase board has been wiped clean by detecting the presence of droplets and hands. This can be employed in educational settings to automatically trigger reminders for board cleanup or to evaluate the cleanliness of a board after use.
Artistic Rendering Assistance: Employ the "dropletdataset-01-03" model in computer-aided design software to help artists replicate realistic droplet textures and hand markings when creating digital or physical artwork, particularly in scenarios where the artwork involves fluid-like materials or hand gestures.
Robotics and Automation: Incorporate the model in robotic and automated cleaning systems to differentiate between droplets and hands during cleaning processes. This can improve precision and accuracy in maintaining cleanliness while minimizing the chances of unwanted interactions with human operators.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Solar Panel Maintenance: The model could be used by solar panel service providers to automate the process of assessment and maintenance. By analyzing the state of the panels (clean, unclean, or dusty) it can help them identify which panels need immediate cleaning or service.
Industrial Inspection: In facilities with a large number of solar panels such as solar farms, the model could assist in streamlining routine checks. Rather than manual inspection, images can be taken and analyzed for cleanliness, helping to efficiently allocate cleaning resources and maintain optimum efficiency.
Home Automation Systems: The model could be integrated into smart home systems to alert homeowners when their solar panels are dirty or dusty. It can act as a smart tool for homes using solar energy as one of their primary energy sources.
Drone-based Inspection: For large scale solar installations in hard-to-reach areas (e.g. large roofs, deserts), drones equipped with cameras and the computer vision model can perform inspections. This can be safer and more effective, with the AI determining the status of each panel.
Educational Purposes: This computer vision model could be used as a teaching tool in educational institutions for courses related to renewable energy. It can demonstrate the importance of solar panel cleanliness in energy efficiency, encouraging students to engage with practical, real-world issues in their learning.
Aims: Evaluate the microbiocidal efficacy of a cleaning and disinfection (C&D) treatment using stainless steel coupons applied to three common types of animal mortality transport vehicles when exposed to agricultural conditions.

Methods: Metal test coupons, inoculated with bacteriophage MS2, were affixed to the undercarriage of three types of animal mortality transport vehicles at various locations. Coupons were grimed by maneuvering the test vehicles down a series of wet dirt roads. Coupons were attached and extracted at various points to evaluate C&D performance with and without grime. C&D efficacy using a water-supplied pressure washing system and a dilute sodium hypochlorite (NaOCl) solution was determined by comparing the difference in recovered viable virus between positive control coupons and test coupons.

This dataset is associated with the following publication: Boe, T., W. Calfee, P. Lemieux, S. Serre, A. Abdel-Hady, M. Monge, D. Aslett, B. Akers, and J. Howard. Evaluation of Cleaning and Disinfection Protocols for Commercial Farm Equipment Following a Foreign Animal Disease Outbreak. Remediation Journal. John Wiley & Sons, Inc., Hoboken, NJ, USA, 33(4): 379-387, (2023).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Minimum Wage by State from 1968 to 2020’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lislejoem/us-minimum-wage-by-state-from-1968-to-2017 on 12 November 2021.
--- Dataset description provided by original source is as follows ---
What is this? In the United States, states and the federal government set minimum hourly pay ("minimum wage") that workers can receive to ensure that citizens experience a minimum quality of life. This dataset provides the minimum wage data set by each state and the federal government from 1968 to 2020.
Why did you put this together? While looking online for a clean dataset for minimum wage data by state, I was having trouble finding one. I decided to create one myself and provide it to the community.
Who do we thank for this data? The United States Department of Labor compiles a table of this data on their website. I took the time to clean it up and provide it here for you. :) The GitHub repository (with R Code for the cleaning process) can be found here!
This is a cleaned dataset of US state and federal minimum wages from 1968 to 2020 (including 2020 equivalency values). The data was scraped from the United States Department of Labor's table of minimum wage by state.
The values in the dataset are as follows:
- Year: The year of the data. All minimum wage values are as of January 1 except 1968 and 1969, which are as of February 1.
- State: The state or territory of the data.
- State.Minimum.Wage: The state's actual minimum wage on January 1 of Year.
- State.Minimum.Wage.2020.Dollars: The State.Minimum.Wage in 2020 dollars.
- Federal.Minimum.Wage: The federal minimum wage on January 1 of Year.
- Federal.Minimum.Wage.2020.Dollars: The Federal.Minimum.Wage in 2020 dollars.
- Effective.Minimum.Wage: The minimum wage enforced in State on January 1 of Year. Because the federal minimum wage takes effect if the state's minimum wage is lower than the federal minimum wage, this is the higher of the two.
- Effective.Minimum.Wage.2020.Dollars: The Effective.Minimum.Wage in 2020 dollars.
- CPI.Average: The average value of the Consumer Price Index in Year. When I pulled the data from the Bureau of Labor Statistics, I selected the dataset with "all items in U.S. city average, all urban consumers, not seasonally adjusted".
- Department.Of.Labor.Uncleaned.Data: The unclean, scraped value from the Department of Labor's website.
- Department.Of.Labor.Cleaned.Low.Value: The state's lowest enforced minimum wage on January 1 of Year. If there is only one minimum wage, this and Department.Of.Labor.Cleaned.High.Value are identical. (Some states enforce different minimum wage laws depending on the size of the business; in states where this is the case, smaller businesses generally have slightly lower minimum wage requirements.)
- Department.Of.Labor.Cleaned.Low.Value.2020.Dollars: The Department.Of.Labor.Cleaned.Low.Value in 2020 dollars.
- Department.Of.Labor.Cleaned.High.Value: The state's highest enforced minimum wage on January 1 of Year. If there is only one minimum wage, this and Department.Of.Labor.Cleaned.Low.Value are identical.
- Department.Of.Labor.Cleaned.High.Value.2020.Dollars: The Department.Of.Labor.Cleaned.High.Value in 2020 dollars.
- Footnote: The footnote provided on the Department of Labor's website. See more below.
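The Effective.Minimum.Wage rule above reduces to a simple maximum. A minimal sketch (the helper function is hypothetical, not part of the dataset):

```python
def effective_minimum_wage(state_wage: float, federal_wage: float) -> float:
    """The federal minimum takes effect whenever the state's minimum is
    lower, so the enforced wage is the higher of the two."""
    return max(state_wage, federal_wage)

# A state minimum below the federal floor is superseded by it;
# a state minimum above the floor is enforced as-is.
low = effective_minimum_wage(5.15, 7.25)
high = effective_minimum_wage(15.00, 7.25)
```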
As laws differ significantly from territory to territory, especially regarding who is protected by minimum wage laws, the following footnotes are located throughout the data in Footnote to add more context to the minimum wage. The original footnotes can be found here.
--- Original source retains full ownership of the source dataset ---
Abstract copyright UK Data Service and data collection copyright owner.

The Hygiene Council Global Survey on Personal and Household Hygiene, 2011 is the first study to highlight the role of manners, orderliness and routine in hygiene behaviours. A global survey on the determinants of personal and household hygiene, with particular reference to hand-washing with soap and cleaning of household surfaces, was conducted in 1000 households in each of twelve countries across the world. A structural equation model of hygiene behaviour and its consequences, derived from theory, was then estimated for both behaviours. The analysis showed that the frequency of hand washing with soap is strongly tied to how automatically it is performed. Whether or not someone is busy, or tired, can also affect whether they stop to wash hands. Surface cleaning was strongly linked to possessing a cleaning routine, so, like hand washing, it is primarily determined by non-cognitive causes. It is also inspired by the perception that one is living in a dirty environment, especially if one has a strong sense of contamination, as well as a need to keep one's surroundings tidy. Being concerned with good manners is also linked to the performance of these behaviours. Those who see others around them as practicing surface cleaning are also more likely to do so themselves.

Main Topics: Global determinants of personal and household hygiene behaviour.

Sampling: Multi-stage stratified random sample. At least one country was chosen to represent each of the seven continents (UK, USA, Canada, France, Germany, Australia, South Africa, Malaysia, Brazil, Middle East), with the addition of two of the most populated countries in the world (China and India). Within each country, samples were based on standard representative splits of gender, age, household income and geographical region.

Methods: Face-to-face interview; telephone interview; web-based survey.
This hands-on workshop has two parts. The first part covers working with SAS and the Postal Code Conversion File Plus. You'll start with Postal Codes, and leave with Census geography that can be linked to Census demographics. The second part introduces OpenRefine, an open source software platform for cleaning up messy data files. Initially developed by Google, OpenRefine will open your eyes to the beauty of clean data! No previous experience required.