15 datasets found

Cafe Sales - Dirty Data for Cleaning Training

kaggle.com

zip

Updated Jan 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Explore at:

zip(113510 bytes)Available download formats

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective price ranges:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3

Use Cases

This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.

Cleaning Steps Suggestions

To clean this dataset, consider the following steps: 1. Handle Missing Values: - Fill missing numeric values with the median or mean. - Replace missing categorical values with the mode or "Unknown."

Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

Dirty E-Commerce Data [80,000+ Products]
kaggle.com
zip
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oleksii Martusiuk (2024). Dirty E-Commerce Data [80,000+ Products] [Dataset]. https://www.kaggle.com/datasets/oleksiimartusiuk/e-commerce-data-shein
Explore at:
zip(3611849 bytes)Available download formats
Dataset updated
Jun 29, 2024
Authors
Oleksii Martusiuk
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
E-commerce Product Dataset - Clean and Enhance Your Data Analysis Skills or Check Out The Cleaned File Below!

This dataset offers a comprehensive collection of product information from an e-commerce store, spread across 20+ CSV files and encompassing over 80,000+ products. It presents a valuable opportunity to test and refine your data cleaning and wrangling skills.

What's Included:

A variety of product categories, including:

Apparel & Accessories

Electronics

Home & Kitchen

Beauty & Health

Toys & Games

Men's Clothes

Women's Clothes

Pet Supplies

Sports & Outdoor

(and more!)

Each product record contains details such as:

Product Title

Category

Price

Discount information

(and other attributes)

Challenges and Opportunities:

Data Cleaning: The dataset is "dirty," containing missing values, inconsistencies in formatting, and potential errors. This provides a chance to practice your data-cleaning techniques such as:

Identifying and handling missing values

Standardizing data formats

Correcting inconsistencies

Dealing with duplicate entries

Feature Engineering: After cleaning, you can explore opportunities to create new features from the existing data, such as: - Extracting keywords from product titles and descriptions - Deriving price categories - Calculating average discounts

Who can benefit from this dataset?

Data analysts and scientists looking to practice data cleaning and wrangling skills on a real-world e-commerce dataset

Machine learning enthusiasts interested in building models for product recommendation, price prediction, or other e-commerce tasks

Anyone interested in exploring and understanding the structure and organization of product data in an e-commerce setting

By contributing to this dataset and sharing your cleaning and feature engineering approaches, you can help create a valuable resource for the Kaggle community!
r
Semi-supervised data cleaning
resodate.org
Updated Dec 4, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
Explore at:
Unique identifier
https://doi.org/10.14279/depositonce-10928
Dataset updated
Dec 4, 2020
Dataset provided by
Technische Universität Berlin
DepositOnce
Authors
Mohammad Mahdavi Lahijani
Description
Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values. The goal of error correction is to fix these wrong values. Data cleaning is a challenging task due to the trade-off among correctness, completeness, and automation. In fact, detecting/correcting all data errors accurately without any user involvement is not possible for every dataset. We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner. In fact, the approach iteratively asks the user to annotate a tuple, i.e., marking/fixing a few data errors. The approach learns to generalize the user-provided error detection/correction examples to the rest of the dataset, accordingly. Our novel two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters. In fact, the approach considers the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task as its first and second steps maximize recall and precision, respectively. Third, the approach also minimizes human involvement as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation of our approach allows us to leverage previous data cleaning efforts to optimize the current data cleaning task. We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, our pipeline consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.

Retail Store Sales: Dirty for Data Cleaning

kaggle.com

zip

Updated Jan 18, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning

Explore at:

zip(226740 bytes)Available download formats

Dataset updated

Jan 18, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Retail Store Sales Dataset

Overview

The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

File Information

File Name: retail_store_sales.csv
Number of Rows: 12,575
Number of Columns: 11

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Customer ID`	A unique identifier for each customer. 25 unique customers.	`CUST_01`
`Category`	The category of the purchased item.	`Food`, `Furniture`
`Item`	The name of the purchased item. May contain missing values or `None`.	`Item_1_FOOD`, `None`
`Price Per Unit`	The static price of a single unit of the item. May contain missing or `None` values.	`4.00`, `None`
`Quantity`	The quantity of the item purchased. May contain missing or `None` values.	`1`, `None`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `None`
`Payment Method`	The method of payment used. May contain missing or invalid values.	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Online`
`Transaction Date`	The date of the transaction. Always present and valid.	`2023-01-15`
`Discount Applied`	Indicates if a discount was applied to the transaction. May contain missing values.	`True`, `False`, `None`

Categories and Items

The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

Electric Household Essentials

Item Code	Item Name	Price
Item_1_EHE	Blender	5.0
Item_2_EHE	Microwave	6.5
Item_3_EHE	Toaster	8.0
Item_4_EHE	Vacuum Cleaner	9.5
Item_5_EHE	Air Purifier	11.0
Item_6_EHE	Electric Kettle	12.5
Item_7_EHE	Rice Cooker	14.0
Item_8_EHE	Iron	15.5
Item_9_EHE	Ceiling Fan	17.0
Item_10_EHE	Table Fan	18.5
Item_11_EHE	Hair Dryer	20.0
Item_12_EHE	Heater	21.5
Item_13_EHE	Humidifier	23.0
Item_14_EHE	Dehumidifier	24.5
Item_15_EHE	Coffee Maker	26.0
Item_16_EHE	Portable AC	27.5
Item_17_EHE	Electric Stove	29.0
Item_18_EHE	Pressure Cooker	30.5
Item_19_EHE	Induction Cooktop	32.0
Item_20_EHE	Water Dispenser	33.5
Item_21_EHE	Hand Blender	35.0
Item_22_EHE	Mixer Grinder	36.5
Item_23_EHE	Sandwich Maker	38.0
Item_24_EHE	Air Fryer	39.5
Item_25_EHE	Juicer	41.0

Furniture

Item Code	Item Name	Price
Item_1_FUR	Office Chair	5.0
Item_2_FUR	Sofa	6.5
Item_3_FUR	Coffee Table	8.0
Item_4_FUR	Dining Table	9.5
Item_5_FUR	Bookshelf	11.0
Item_6_FUR	Bed F...

Messy IMDB dataset
kaggle.com
zip
Updated Mar 18, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Fuente Herraiz (2021). Messy IMDB dataset [Dataset]. https://www.kaggle.com/davidfuenteherraiz/messy-imdb-dataset
Explore at:
zip(5420 bytes)Available download formats
Dataset updated
Mar 18, 2021
Authors
David Fuente Herraiz
Description
This dataset contains 100 movies from the IMDb database and 11 variables: IMBd movie ID, original title, release year, genre, duration, country, content rating, director's name, worldwide income, number of votes and IMDb score. It is a messy dataset with plenty of errors to be corrected: missing values, empty rows and columns, bad variable names, multiple or wrong date formats, numeric columns containing symbols, units, characters, thousand separators, multiple and wrong decimal separators, typographic mistakes and a multiple categorical variable miscoded as a unique character variable. All variables are imported in R as character ones, but most of them are not in reality. To clean this dataset, we suggest to use clickR package. This package is now under review, but it is fully functional and allows semiautomatic and tracking-change data pre-processing without practically any external input or complicated code so that rather messy datasets can be cleaned within minutes.
Z
Messy Spreadsheet Example for Instruction
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Jun 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Curty, Renata Gonçalves (2024). Messy Spreadsheet Example for Instruction [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12586562
Explore at:
Dataset updated
Jun 28, 2024
Dataset provided by
University of California, Santa Barbara
Authors
Curty, Renata Gonçalves
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.
Data from: Urbanev: An open benchmark dataset for urban electric vehicle...
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Apr 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan (2025). Urbanev: An open benchmark dataset for urban electric vehicle charging demand prediction [Dataset]. http://doi.org/10.5061/dryad.np5hqc04z
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.np5hqc04z
Dataset updated
Apr 25, 2025
Dataset provided by
Sun Yat-sen University
Institute of High Performance Computing
Hong Kong Polytechnic University
Authors
Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The recent surge in electric vehicles (EVs), driven by a collective push to enhance global environmental sustainability, has underscored the significance of exploring EV charging prediction. To catalyze further research in this domain, we introduce UrbanEV—an open dataset showcasing EV charging space availability and electricity consumption in a pioneering city for vehicle electrification, namely Shenzhen, China. UrbanEV offers a rich repository of charging data (i.e., charging occupancy, duration, volume, and price) captured at hourly intervals across an extensive six-month span for over 20,000 individual charging stations. Beyond these core attributes, the dataset also encompasses diverse influencing factors like weather conditions and spatial proximity. These factors are thoroughly analyzed qualitatively and quantitatively to reveal their correlations and causal impacts on charging behaviors. Furthermore, comprehensive experiments have been conducted to showcase the predictive capabilities of various models, including statistical, deep learning, and transformer-based approaches, using the UrbanEV dataset. This dataset is poised to propel advancements in EV charging prediction and management, positioning itself as a benchmark resource within this burgeoning field. Methods To build a comprehensive and reliable benchmark dataset, we conduct a series of rigorous processes from data collection to dataset evaluation. The overall workflow sequentially includes data acquisition, data processing, statistical analysis, and prediction assessment. As follows, please see detailed descriptions. Study area and data acquisition

Shenzhen, a pioneering city in global vehicle electrification, has been selected for this study with the objective of offering valuable insights into electric vehicle (EV) development that can serve as a reference for other urban centers. This study encompasses the entire expanse of Shenzhen, where data on public EV charging stations distributed around the city have been meticulously gathered. Specifically, EV charging data was automatically collected from a mobile platform used by EV drivers to locate public charging stations. Through this platform, users could access real-time information on each charging pile, including its availability (e.g., busy or idle), charging price, and geographic coordinates. Accordingly, we recorded the charging-related data at five-minute intervals from September 1, 2022, to February 28, 2023. This data collection process was fully digital and did not require manual readings. Furthermore, to delve into the correlation between EV charging patterns and environmental elements, weather data for Shenzhen city were acquired from two meteorological observatories situated in the airport and central regions, respectively. These meteorological data are publicly available on the Shenzhen Government Data Open Platform. Thirdly, point of interest (POI) data was extracted through the Application Programming Interface Platform of AMap.com, along with three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contains detailed spatiotemporal information that can be analyzed to provide valuable insights about urban EV charging patterns and their correlations with meteorological conditions.

Shenzhen, a pioneering city in global vehicle electrification, has been selected for this study with the objective of offering valuable insights into electric vehicle (EV) development that can serve as a reference for other urban centers. This study encompasses the entire expanse of Shenzhen, where data on public EV charging stations distributed around the city have been meticulously gathered. Specifically, a program was employed to extract the status (e.g., busy or idle, charging price, electricity volume, and coordinates) of each charging pile at five-minute intervals from 1 September 2022 to 28 February 2023. Furthermore, to delve into the correlation between EV charging patterns and environmental elements, weather data for Shenzhen city was acquired from two meteorological observatories situated in the airport and central regions, respectively. Thirdly, point of interest (POI) data was extracted, along with three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contains detailed spatiotemporal information that can be analyzed to provide valuable insights about urban EV charging patterns and their correlations with meteorological conditions. Processing raw information into well-structured data To streamline the utilization of the UrbanEV dataset, we harmonize heterogeneous data from various sources into well-structured data with aligned temporal and spatial resolutions. This process can be segmented into two parts: the reorganization of EV charging data and the preparation of other influential factors. EV charging data The raw charging data, obtained from publicly available EV charging services, pertains to charging stations and predominantly comprises string-type records at a 5-minute interval. To transform this raw data into a structured time series tailored for prediction tasks, we implement the following three key measures:

Initial Extraction. From the string-type records, we extract vital information for each charging pile, such as availability (designated as "busy" or "idle"), rated power, and the corresponding charging and service fees applicable during the observed time periods. First, a charging pile is categorized as "active charging" if its states at two consecutive timestamps are both "busy". Consequently, the occupancy within a charging station can be defined as the count of in-use charging piles, while the charging duration is calculated as the product of the count of in-use piles and the time between the two timestamps (in our case, 5 minutes). Moreover, the charging volume in a station can correspondingly be estimated by multiplying the duration by the piles' rated power. Finally, the average electricity price and service price are calculated for each station in alignment with the same temporal resolution as the three charging variables.

Error Detection and Imputation. Ensuring data quality is paramount when utilizing charging data for decision-making, advanced analytics, and machine-learning applications. It is crucial to address concerns around data cleanliness, as the presence of inaccuracies and inconsistencies, often referred to as dirty data, can significantly compromise the reliability and validity of any subsequent analysis or modeling efforts. To improve data quality of our charging data, several errors are identified, particularly the negative values for charging fees and the inconsistencies between the counts of occupied, idle, and total charging piles. We remove the records containing these anomalies and treat them as missing data. Besides that, a two-step imputation process was implemented to address missing values. First, forward filling replaced missing values using data from preceding timestamps. Then, backward filling was applied to fill gaps at the start of each time series. Moreover, a certain number of outliers were identified in the dataset, which could significantly impact prediction performance. To address this, the interquartile range (IQR) method was used to detect outliers for metrics including charging volume (v), charging duration (d), and the rate of active charging piles at the charging station (o). To retain more original data and minimize the impact of outlier correction on the overall data distribution, we set the coefficient to 4 instead of the default 1.5. Finally, each outlier was replaced by the mean of its adjacent valid values. This preprocessing pipeline transformed the raw data into a structured and analyzable dataset.

Aggregation and Filtration. Building upon the station-level charging data that has been extracted and cleansed, we further organize the data into a region-level dataset with an hourly interval providing a new perspective for EV charging behavior analysis. This is achieved by two major processes: aggregation and filtration. First, we aggregate all the charging data from both temporal and spatial views: a. Temporally, we standardize all time-series data to a common time resolution of one hour, as it serves as the least common denominator among the various resolutions. This aims to establish a unified temporal resolution for all time-series data, including pricing schemes, weather records, and charging data, thereby creating a well-structured dataset. Aggregation rules specify that the five-minute charging volume v and duration $(d)$ are summed within each interval (i.e., one hour), whereas the occupancy o, electricity price pe, and service price ps are assigned specific values at certain hours for each charging pile. This distinction arises from the inherent nature of these data types: volume v and duration d are cumulative, while o, pe, and ps are instantaneous variables. Compared to using the mean or median values within each interval, selecting the instantaneous values of o, pe, and ps as representatives preserves the original data patterns more effectively and minimizes the influence of human interpretation. b. Spatially, stations are aggregated based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. After aggregation, our aggregated dataset comprises 331 regions (also called traffic zones) with 4344 timestamps. Second, variance tests and zero-value filtering functions were employed to filter out traffic zones with zero or no change in charging data. Specifically, it means that
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Raw Data.
plos.figshare.com
xlsx
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang (2025). Raw Data. [Dataset]. http://doi.org/10.1371/journal.pone.0326380.s001
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0326380.s001
Dataset updated
Jul 1, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaning indicators are widely used to evaluate the efficacy of cleaning processes in automated washer-disinfectors (AWDs) in healthcare settings. In this study, we systematically analyzed the performance of commercial indicators across multiple simulated cleaning protocols to guide the correct selection of suitable cleaning indicators in Central Sterile Supply Departments (CSSD). Eleven commercially available cleaning indicators were tested in five cleaning simulations, P0 to P4, where P1 represented the standard cleaning process in CSSD, while P2-P4 incorporated induced-error cleaning processes to mimic real-world errors. All indicators were uniformly positioned at the top level of the cleaning rack to ensure comparable exposure. Key parameters, including indicator response dynamics (e.g., wash-off sequence) and final residue results, were documented throughout the cleaning cycles. The final wash-off results given by the indicators under P0, in which no detergent was injected, were much worse than those of the other four processes. Under different simulations, the final results of the indicators and their wash-off sequences changed substantially. In conclusion, an effective indicator must be selected experimentally. The last indicator to be washed off during the normal cleaning process that can simultaneously clearly show the presence of dirt residue under induced error conditions is the optimal indicator for monitoring cleaning processes.
YouTube Recommendation Data (For Cleaning & ML )
kaggle.com
zip
Updated Oct 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shravan Kumar (2025). YouTube Recommendation Data (For Cleaning & ML ) [Dataset]. https://www.kaggle.com/datasets/iitanshravan/youtube-recommendation-data-for-cleaning-and-ml
Explore at:
zip(28098010 bytes)Available download formats
Dataset updated
Oct 1, 2025
Authors
Shravan Kumar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
YouTube
Description
🟥 Synthetic YouTube Recommendation Dataset (1M Rows, With Errors) 📌 Overview

This is a synthetic dataset of 1,000,000 user–video interactions generated to simulate how a YouTube-like recommendation system might log activity.

It is designed for data cleaning practice, feature engineering, and machine learning modeling. 👉 Unlike clean benchmark datasets, this one intentionally contains messy data and errors so you can practice real-world data wrangling before building ML models.
Z
BigMart Retail Sales
data.niaid.nih.gov
Updated May 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataman (2022). BigMart Retail Sales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6509954
Explore at:
Dataset updated
May 2, 2022
Authors
Dataman
License
Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
Nothing ever becomes real till it is experienced.

-John Keats

While we don't know the context in which John Keats mentioned this, we are sure about its implication in data science. While you would have enjoyed and gained exposure to real world problems in this challenge, here is another opportunity to get your hand dirty with this practice problem.

Problem Statement :

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

Data :

We have 14204 samples in data set.

Variable Description

Item Identifier: A code provided for the item of sale

Item Weight: Weight of item

Item Fat Content: A categorical column of how much fat is present in the item: ‘Low Fat’, ‘Regular’, ‘low fat’, ‘LF’, ‘reg’

Item Visibility: Numeric value for how visible the item is

Item Type: What category does the item belong to: ‘Dairy’, ‘Soft Drinks’, ‘Meat’, ‘Fruits and Vegetables’, ‘Household’, ‘Baking Goods’, ‘Snack Foods’, ‘Frozen Foods’, ‘Breakfast’, ’Health and Hygiene’, ‘Hard Drinks’, ‘Canned’, ‘Breads’, ‘Starchy Foods’, ‘Others’, ‘Seafood’.

Item MRP: The MRP price of item

Outlet Identifier: Which outlet was the item sold. This will be categorical column

Outlet Establishment Year: Which year was the outlet established

Outlet Size: A categorical column to explain size of outlet: ‘Medium’, ‘High’, ‘Small’.

Outlet Location Type: A categorical column to describe the location of the outlet: ‘Tier 1’, ‘Tier 2’, ‘Tier 3’

Outlet Type: Categorical column for type of outlet: ‘Supermarket Type1’, ‘Supermarket Type2’, ‘Supermarket Type3’, ‘Grocery Store’

Item Outlet Sales: The number of sales for an item.

Evaluation Metric:

We will use the Root Mean Square Error value to judge your response
Z
CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jun 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dogancan Temel; Gukyeong Kwon; Mohit Prabhushankar; Ghassan AlRegib (2020). CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3903065
Explore at:
Dataset updated
Jun 28, 2020
Dataset provided by
Georgia Institute of Technology
Authors
Dogancan Temel; Gukyeong Kwon; Mohit Prabhushankar; Ghassan AlRegib
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under diverse challenging conditions where trained models can possibly be depolyed. To achieve this goal, we introduced a large-sacle (>2M images) traffic sign recognition dataset (CURE-TSR) which is among the most comprehensive datasets with controlled synthetic challenging conditions. Traffic sign images in the CURE-TSR dataset were cropped from the CURE-TSD dataset, which includes around 1.7 million real-world and simulator images with more than 2 million traffic sign instances. Real-world images were obtained from the BelgiumTS video sequences and simulated images were generated with the Unreal Engine 4 game development tool. Sign types include speed limit, goods vehicles, no overtaking, no stopping, no parking, stop, bicycle, hump, no left, no right, priority to, no entry, yield, and parking. Unreal and real sequences were processed with state-of-the-art visual effect software Adobe(c) After Effects to simulate challenging conditions, which include rain, snow, haze, shadow, darkness, brightness, blurriness, dirtiness, colorlessness, sensor and codec errors. Please refer to our GitHub page for code, papers, and more information.

Instructions:

The name format of the provided images are as follows: "sequenceType_signType_challengeType_challengeLevel_Index.bmp"

sequenceType: 01 - Real data 02 - Unreal data

signType: 01 - speed_limit 02 - goods_vehicles 03 - no_overtaking 04 - no_stopping 05 - no_parking 06 - stop 07 - bicycle 08 - hump 09 - no_left 10 - no_right 11 - priority_to 12 - no_entry 13 - yield 14 - parking

challengeType: 00 - No challenge 01 - Decolorization 02 - Lens blur 03 - Codec error 04 - Darkening 05 - Dirty lens 06 - Exposure 07 - Gaussian blur 08 - Noise 09 - Rain 10 - Shadow 11 - Snow 12 - Haze

challengeLevel: A number in between [01-05] where 01 is the least severe and 05 is the most severe challenge.

Index: A number shows different instances of traffic signs in the same conditions.
Salary-Data
kaggle.com
zip
Updated Aug 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sourav Bose (2022). Salary-Data [Dataset]. https://www.kaggle.com/datasets/souravbose/salary-prediction
Explore at:
zip(14178368 bytes)Available download formats
Dataset updated
Aug 21, 2022
Authors
Sourav Bose
Description
Problem Description: Develop a salary prediction system based on the given dataset.

Data supplied: You are given two data files in CSV: • train_features.csv: Each row represents the metadata for an individual job posting. The “jobId” column represents a unique identifier for the job posting. The remaining columns describe the features of the job posting. • train_salaries.csv: Each row associates a “jobId” with a “salary”. The first row of each file contains headers for the columns. Keep in that the metadata and salary data were crawled from the internet. As such, it’s possible that the data is dirty (it may contain errors).

Questions 1. What steps did you take to prepare the data for the project? Was any cleaning necessary? 2. What algorithmic method did you apply? Why? What other methods did you consider? 3. Describe how the algorithmic method that you chose works? 4. What features did you use? Why? 5. How did you train your model? During training, what issues concerned you? 6. How did you assess the accuracy of your predictions? Why did you choose that method? Would you consider any alternative approaches for assessing accuracy? 7. Which features had the most significant impact on salary? How did you identify these to be most significant? Which features had the least impact on salary? How did you identify these?
Student Performance and Attendance Dataset
kaggle.com
zip
Updated Mar 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marvy Ayman Halim (2025). Student Performance and Attendance Dataset [Dataset]. https://www.kaggle.com/datasets/marvyaymanhalim/student-performance-and-attendance-dataset
Explore at:
zip(5849540 bytes)Available download formats
Dataset updated
Mar 10, 2025
Authors
Marvy Ayman Halim
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
📝 Description: This synthetic dataset is designed to help beginners and intermediate learners practice data cleaning and analysis in a realistic setting. It simulates a student tracking system, covering key areas like:

Attendance tracking 📅

Homework completion 📝

Exam performance 🎯

Parent-teacher communication 📢

✅ Why Use This Dataset? While many datasets are pre-cleaned, real-world data is often messy. This dataset includes intentional errors to help you develop essential data cleaning skills before diving into analysis. It’s perfect for building confidence in handling raw data!

🛠️ Cleaning Challenges You’ll Tackle This dataset is packed with real-world issues, including:

Messy data: Names in lowercase, typos in attendance status.

Inconsistent date formats: Mix of MM/DD/YYYY and YYYY-MM-DD.

Incorrect values: Homework completion rates in mixed formats (e.g., 80% and 90).

Missing data: Guardian signatures, teacher comments, and emergency contacts.

Outliers: Exam scores over 100 and negative homework completion rates.

🚀 Your Task: Clean, structure, and analyze this dataset using Python or SQL to uncover meaningful insights!

📌 5. Handle Outliers

Remove exam scores above 100.

Convert homework completion rates to consistent percentages.

📌 6. Generate Insights & Visualizations

What’s the average attendance rate per grade?

Which subjects have the highest performance?

What are the most common topics in parent-teacher communication?
e
Sewerage and connections
data.europa.eu
csv, esri shape, json
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sewerage and connections [Dataset]. https://data.europa.eu/data/datasets/29450-riolering-en-aansluitingen?locale=en
Explore at:
csv, json, esri shapeAvailable download formats
License
Public Domain Mark 1.0https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Description
This data collection includes connection pipes from gullies and houses to the main sewer, as well as the geometry of the sewer strands and wells of the main sewer. The data related to connections can also be retrieved using a web feature service: https://data.riox.online/eindhoven/wfs

Do you want to measure distances in the map? This can be done in the ArcGIS viewer: click here to go to the viewer.

https://data.eindhoven.nl/assets/theme_image/riool.PNG" alt="">

(Click to measure on the ruler on the right.)

A number of points for attention:

This data collection contains a large amount of line pieces, to view them all you need to zoom in for performance reasons.

Fittings House Connection, some Connection Lines House Connection, and Wells are originally points that are shown as closed line objects in the map.

The home connections (both connection lines and fittings) are not yet complete for the whole of Eindhoven, these are still being added per area at the time of writing.

The attribute TYPE can be used to deduce whether it is a part of the main sewer, a well, or data related to home connections.

Maindriole

The location of the main sewer is only visible for reference. The available categories are: Mixed water, Rainwater (sky water), Dirty water, Dirty water + Roof surface.

Putting

For reference purposes only, the name of the well and its location are visible from the wells.

Home connections

The connection pipe and the associated attachment have been made clear from the house connections.

For ‘House Connection Tools’, the following attributes are regularly available: ADRES (adres waarop het hulpstuk van toepassing is, dit attribuut is niet altijd gevuld), PLAATS (zou altijd Eindhoven moeten zijn voor deze data, betreft het niet Eindhoven, dan een fout), STELSEL (stelsel waarop het hulpstuk is aangesloten), DIAMETER (diameter van het hulpstuk in millimeters, als 0 dan onbekend), MATERIAAL (materiaal van het hulpstuk), de BEGINPUT en EINDPUT (stemmen overeen met de PUTNAAM van een rioolput uit het hoofdriool, dit attribuut is niet altijd gevuld), PUTAFSTAND (afstand hulpstuk tot de BEGINPUT, dit attribuut is niet altijd gevuld), DIEPTE (maar sporadisch gevuld, “-“ of leeg wanneer onbekend), JAAR (jaar van aanleg, niet altijd gevuld), DATUM (plaatsingsdatum in bestand, niet altijd gevuld), NLCS (laagnaam conform Nederlandse CAD standaard, niet altijd gevuld), REFERENTIE (dit attribuut is niet altijd gevuld), en TYPE (of het gaat om een ontstoppingsstuk of een inlaat hulpstuk).

The following attributes are available for ‘House Connection Lines’: ADDRESS (address to which the attachment applies, this attribute is not always filled), PLACE (should always be Eindhoven for these dates, it is not Eindhoven, then an error), STELSEL (system to which the attachment is connected), DIAMETER (diameter of the attachment in millimetres, if 0 then unknown), MATERIAL (material of the attachment), BEGINPUT and EINDPUT (corresponding to the PUTNAME of a sewer well from the main sewer, this attribute is not always filled), PUTAFSTAND (distance attachment from the BEGINPUT, this attribute is not always filled), DIEPTE (but sporadically filled, “-“ or empty when unknown), YEAR (year of construction, not always filled), DATE (placement date in file, not always filled), NLCS (low name according to Dutch CAD standard, not always filled), REFERENCE (this attribute is not always filled).
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Explore at:

zip(113510 bytes)Available download formats

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective price ranges:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3

Use Cases

Cleaning Steps Suggestions

Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

Clear search

Close search

Google apps

Main menu

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Overview

File Information

Columns Description

Data Characteristics

Menu Items

Use Cases

Cleaning Steps Suggestions

License

Feedback

Dirty E-Commerce Data [80,000+ Products]

Semi-supervised data cleaning

Retail Store Sales: Dirty for Data Cleaning

Dirty Retail Store Sales Dataset

Overview

File Information

Columns Description

Categories and Items

Electric Household Essentials

Furniture

Messy IMDB dataset

Messy Spreadsheet Example for Instruction

Data from: Urbanev: An open benchmark dataset for urban electric vehicle...

Data Cleaning Sample

Raw Data.

YouTube Recommendation Data (For Cleaning & ML )

BigMart Retail Sales

CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign...

Salary-Data

Student Performance and Attendance Dataset

Sewerage and connections

Do you want to measure distances in the map? This can be done in the ArcGIS viewer: click here to go to the viewer.

(Click to measure on the ruler on the right.)

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Dirty Cafe Sales Dataset

Overview

File Information

Columns Description

Data Characteristics

Menu Items

Use Cases

Cleaning Steps Suggestions

License

Feedback