15 datasets found
  1. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip (113510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich
    Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN
    Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00
    Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway
    Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective prices:

    Item | Price ($)
    Coffee | 2
    Tea | 1.5
    Sandwich | 4
    Salad | 5
    Cake | 3
    Cookie | 1
    Smoothie | 4
    Juice | 3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps (a brief code sketch follows the list):

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
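
    A minimal pandas sketch of these steps, using the file and column names listed above; the fill strategies and parsing choices are illustrative, not the only valid approach:

```python
import pandas as pd

# Load the raw file (file and column names as described above).
df = pd.read_csv("dirty_cafe_sales.csv")

# Handle invalid values: treat placeholder strings as missing.
df = df.replace(["ERROR", "UNKNOWN", ""], pd.NA)

# Coerce numeric and date columns; anything unparseable becomes NaN/NaT.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Handle missing values: median for numeric columns, "Unknown" for categoricals.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Date consistency: fill missing dates from nearby records (one plausible choice).
df = df.sort_values("Transaction Date")
df["Transaction Date"] = df["Transaction Date"].ffill().bfill()

# Feature engineering: calendar features for later analysis.
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.to_period("M").astype(str)
```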

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  2. Dirty E-Commerce Data [80,000+ Products]

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    Oleksii Martusiuk (2024). Dirty E-Commerce Data [80,000+ Products] [Dataset]. https://www.kaggle.com/datasets/oleksiimartusiuk/e-commerce-data-shein
    Explore at:
    zip (3611849 bytes)
    Dataset updated
    Jun 29, 2024
    Authors
    Oleksii Martusiuk
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    E-commerce Product Dataset - Clean and Enhance Your Data Analysis Skills or Check Out The Cleaned File Below!

    This dataset offers a comprehensive collection of product information from an e-commerce store, spread across 20+ CSV files and encompassing more than 80,000 products. It presents a valuable opportunity to test and refine your data cleaning and wrangling skills.

    What's Included:

    A variety of product categories, including:

    • Apparel & Accessories
    • Electronics
    • Home & Kitchen
    • Beauty & Health
    • Toys & Games
    • Men's Clothes
    • Women's Clothes
    • Pet Supplies
    • Sports & Outdoor
    • (and more!)

    Each product record contains details such as:

    • Product Title
    • Category
    • Price
    • Discount information
    • (and other attributes)

    Challenges and Opportunities:

    Data Cleaning: The dataset is "dirty," containing missing values, inconsistencies in formatting, and potential errors. This provides a chance to practice your data-cleaning techniques such as:

    • Identifying and handling missing values
    • Standardizing data formats
    • Correcting inconsistencies
    • Dealing with duplicate entries

    Feature Engineering: After cleaning, you can explore opportunities to create new features from the existing data (see the sketch below), such as:

    • Extracting keywords from product titles and descriptions
    • Deriving price categories
    • Calculating average discounts
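
    A rough pandas sketch of these feature-engineering ideas; the file name, the column names (title, price, discount, category), and the bin edges are assumptions for illustration, since the actual headers vary across the 20+ CSVs:

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual headers in each CSV.
df = pd.read_csv("products.csv")

# Derive price categories by binning a cleaned numeric price column.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price_category"] = pd.cut(
    df["price"],
    bins=[0, 10, 25, 50, 100, float("inf")],
    labels=["budget", "low", "mid", "high", "premium"],
)

# Extract simple keyword lists from product titles (lowercased word tokens).
df["title_keywords"] = df["title"].astype(str).str.lower().str.findall(r"[a-z]{3,}")

# Average discount per category (assuming a numeric 'discount' column).
print(df.groupby("category")["discount"].mean().head())
```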

    Who can benefit from this dataset?

    • Data analysts and scientists looking to practice data cleaning and wrangling skills on a real-world e-commerce dataset
    • Machine learning enthusiasts interested in building models for product recommendation, price prediction, or other e-commerce tasks
    • Anyone interested in exploring and understanding the structure and organization of product data in an e-commerce setting
    • By contributing to this dataset and sharing your cleaning and feature engineering approaches, you can help create a valuable resource for the Kaggle community!
  3. Semi-supervised data cleaning

    • resodate.org
    Updated Dec 4, 2020
    Cite
    Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
    Explore at:
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Mohammad Mahdavi Lahijani
    Description

    Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values. The goal of error correction is to fix these wrong values. Data cleaning is a challenging task due to the trade-off among correctness, completeness, and automation. In fact, detecting/correcting all data errors accurately without any user involvement is not possible for every dataset.

    We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner. In fact, the approach iteratively asks the user to annotate a tuple, i.e., marking/fixing a few data errors. The approach learns to generalize the user-provided error detection/correction examples to the rest of the dataset, accordingly.

    Our novel two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters. In fact, the approach considers the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task as its first and second steps maximize recall and precision, respectively. Third, the approach also minimizes human involvement as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation of our approach allows us to leverage previous data cleaning efforts to optimize the current data cleaning task.

    We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, our pipeline consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.

  4. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
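
    Since Total Spent is defined as Quantity * Price Per Unit, any one of the three values can be recovered whenever the other two are present. A small pandas sketch of that idea, assuming the file and column names listed above:

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Recover whichever of the three values is missing from the other two.
q, p, t = df["Quantity"], df["Price Per Unit"], df["Total Spent"]
df.loc[t.isna() & q.notna() & p.notna(), "Total Spent"] = q * p
df.loc[q.isna() & t.notna() & p.notna(), "Quantity"] = t / p
df.loc[p.isna() & t.notna() & q.notna(), "Price Per Unit"] = t / q
```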

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  5. Messy IMDB dataset

    • kaggle.com
    zip
    Updated Mar 18, 2021
    Cite
    David Fuente Herraiz (2021). Messy IMDB dataset [Dataset]. https://www.kaggle.com/davidfuenteherraiz/messy-imdb-dataset
    Explore at:
    zip (5420 bytes)
    Dataset updated
    Mar 18, 2021
    Authors
    David Fuente Herraiz
    Description

    This dataset contains 100 movies from the IMDb database and 11 variables: IMDb movie ID, original title, release year, genre, duration, country, content rating, director's name, worldwide income, number of votes, and IMDb score. It is a messy dataset with plenty of errors to be corrected: missing values, empty rows and columns, bad variable names, multiple or wrong date formats, numeric columns containing symbols, units, characters, thousand separators, multiple and wrong decimal separators, typographic mistakes, and a multi-valued categorical variable miscoded as a single character variable. All variables are imported into R as character vectors, but most of them are not character data in reality. To clean this dataset, we suggest using the clickR package. The package is currently under review, but it is fully functional and supports semi-automatic, change-tracking data pre-processing with practically no external input or complicated code, so even rather messy datasets can be cleaned within minutes.

  6. Messy Spreadsheet Example for Instruction

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jun 28, 2024
    Cite
    Curty, Renata Gonçalves (2024). Messy Spreadsheet Example for Instruction [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12586562
    Explore at:
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    University of California, Santa Barbara
    Authors
    Curty, Renata Gonçalves
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.

  7. Data from: Urbanev: An open benchmark dataset for urban electric vehicle...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Apr 25, 2025
    Cite
    Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan (2025). Urbanev: An open benchmark dataset for urban electric vehicle charging demand prediction [Dataset]. http://doi.org/10.5061/dryad.np5hqc04z
    Explore at:
    zip
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Sun Yat-sen University
    Institute of High Performance Computing
    Hong Kong Polytechnic University
    Authors
    Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The recent surge in electric vehicles (EVs), driven by a collective push to enhance global environmental sustainability, has underscored the significance of exploring EV charging prediction. To catalyze further research in this domain, we introduce UrbanEV—an open dataset showcasing EV charging space availability and electricity consumption in a pioneering city for vehicle electrification, namely Shenzhen, China. UrbanEV offers a rich repository of charging data (i.e., charging occupancy, duration, volume, and price) captured at hourly intervals across an extensive six-month span for over 20,000 individual charging stations. Beyond these core attributes, the dataset also encompasses diverse influencing factors like weather conditions and spatial proximity. These factors are thoroughly analyzed qualitatively and quantitatively to reveal their correlations and causal impacts on charging behaviors. Furthermore, comprehensive experiments have been conducted to showcase the predictive capabilities of various models, including statistical, deep learning, and transformer-based approaches, using the UrbanEV dataset. This dataset is poised to propel advancements in EV charging prediction and management, positioning itself as a benchmark resource within this burgeoning field.

    Methods

    To build a comprehensive and reliable benchmark dataset, we conduct a series of rigorous processes from data collection to dataset evaluation. The overall workflow sequentially includes data acquisition, data processing, statistical analysis, and prediction assessment. Detailed descriptions follow.

    Study area and data acquisition

    Shenzhen, a pioneering city in global vehicle electrification, has been selected for this study with the objective of offering valuable insights into electric vehicle (EV) development that can serve as a reference for other urban centers. This study encompasses the entire expanse of Shenzhen, where data on public EV charging stations distributed around the city have been meticulously gathered. Specifically, EV charging data was automatically collected from a mobile platform used by EV drivers to locate public charging stations. Through this platform, users could access real-time information on each charging pile, including its availability (e.g., busy or idle), charging price, and geographic coordinates. Accordingly, we recorded the charging-related data at five-minute intervals from September 1, 2022, to February 28, 2023. This data collection process was fully digital and did not require manual readings. Furthermore, to delve into the correlation between EV charging patterns and environmental elements, weather data for Shenzhen city were acquired from two meteorological observatories situated in the airport and central regions, respectively. These meteorological data are publicly available on the Shenzhen Government Data Open Platform. Thirdly, point of interest (POI) data was extracted through the Application Programming Interface Platform of AMap.com, along with three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contains detailed spatiotemporal information that can be analyzed to provide valuable insights about urban EV charging patterns and their correlations with meteorological conditions.

    Processing raw information into well-structured data

    To streamline the utilization of the UrbanEV dataset, we harmonize heterogeneous data from various sources into well-structured data with aligned temporal and spatial resolutions. This process can be segmented into two parts: the reorganization of EV charging data and the preparation of other influential factors.

    EV charging data

    The raw charging data, obtained from publicly available EV charging services, pertains to charging stations and predominantly comprises string-type records at a 5-minute interval. To transform this raw data into a structured time series tailored for prediction tasks, we implement the following three key measures:

    Initial Extraction. From the string-type records, we extract vital information for each charging pile, such as availability (designated as "busy" or "idle"), rated power, and the corresponding charging and service fees applicable during the observed time periods. First, a charging pile is categorized as "active charging" if its states at two consecutive timestamps are both "busy". Consequently, the occupancy within a charging station can be defined as the count of in-use charging piles, while the charging duration is calculated as the product of the count of in-use piles and the time between the two timestamps (in our case, 5 minutes). Moreover, the charging volume in a station can correspondingly be estimated by multiplying the duration by the piles' rated power. Finally, the average electricity price and service price are calculated for each station in alignment with the same temporal resolution as the three charging variables.
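
    Expressed as a toy calculation (with made-up numbers, not values from the released files), the station-level quantities described above reduce to a few arithmetic steps:

```python
# Illustrative numbers only: one station with piles observed at two consecutive
# 5-minute timestamps, each pile rated at 7 kW (made-up rating).
interval_hours = 5 / 60
rated_power_kw = 7.0

busy_prev = {"p1", "p2", "p3", "p4"}  # piles reported "busy" at t-1
busy_now = {"p2", "p3", "p4", "p5"}   # piles reported "busy" at t

# A pile counts as "active charging" only if it is busy at both timestamps.
active = busy_prev & busy_now

occupancy = len(active)                   # in-use piles in this interval
duration_h = occupancy * interval_hours   # pile-hours of charging
volume_kwh = duration_h * rated_power_kw  # estimated energy delivered

print(occupancy, round(duration_h, 2), round(volume_kwh, 2))  # 3 0.25 1.75
```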

    Error Detection and Imputation. Ensuring data quality is paramount when utilizing charging data for decision-making, advanced analytics, and machine-learning applications. It is crucial to address concerns around data cleanliness, as the presence of inaccuracies and inconsistencies, often referred to as dirty data, can significantly compromise the reliability and validity of any subsequent analysis or modeling efforts. To improve data quality of our charging data, several errors are identified, particularly the negative values for charging fees and the inconsistencies between the counts of occupied, idle, and total charging piles. We remove the records containing these anomalies and treat them as missing data. Besides that, a two-step imputation process was implemented to address missing values. First, forward filling replaced missing values using data from preceding timestamps. Then, backward filling was applied to fill gaps at the start of each time series. Moreover, a certain number of outliers were identified in the dataset, which could significantly impact prediction performance. To address this, the interquartile range (IQR) method was used to detect outliers for metrics including charging volume (v), charging duration (d), and the rate of active charging piles at the charging station (o). To retain more original data and minimize the impact of outlier correction on the overall data distribution, we set the coefficient to 4 instead of the default 1.5. Finally, each outlier was replaced by the mean of its adjacent valid values. This preprocessing pipeline transformed the raw data into a structured and analyzable dataset.
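
    A minimal pandas sketch of the imputation and outlier handling described above (forward then backward fill, IQR detection with a coefficient of 4, and replacement by the mean of adjacent valid values); the example series and naming are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def clean_series(s: pd.Series, k: float = 4.0) -> pd.Series:
    """Impute gaps, then dampen outliers, following the steps described above."""
    # Two-step imputation: forward fill, then backward fill for leading gaps.
    s = s.ffill().bfill()

    # IQR-based outlier detection with a widened coefficient (k = 4 instead of 1.5).
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (s < q1 - k * iqr) | (s > q3 + k * iqr)

    # Replace each outlier with the mean of its adjacent valid values.
    valid = s.mask(is_outlier)
    neighbor_mean = (valid.ffill() + valid.bfill()) / 2
    return s.mask(is_outlier, neighbor_mean)

# Example: hourly charging volume for one region (values are made up).
volume = pd.Series([1.0, 1.2, np.nan, 1.1, 50.0, 1.3, 1.2])
print(clean_series(volume).tolist())  # the spike at 50.0 is replaced by its neighbours' mean
```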

    Aggregation and Filtration. Building upon the station-level charging data that has been extracted and cleansed, we further organize the data into a region-level dataset with an hourly interval, providing a new perspective for EV charging behavior analysis. This is achieved by two major processes: aggregation and filtration. First, we aggregate all the charging data from both temporal and spatial views:

    a. Temporally, we standardize all time-series data to a common time resolution of one hour, as it serves as the least common denominator among the various resolutions. This aims to establish a unified temporal resolution for all time-series data, including pricing schemes, weather records, and charging data, thereby creating a well-structured dataset. Aggregation rules specify that the five-minute charging volume (v) and duration (d) are summed within each interval (i.e., one hour), whereas the occupancy (o), electricity price (pe), and service price (ps) are assigned specific values at certain hours for each charging pile. This distinction arises from the inherent nature of these data types: volume v and duration d are cumulative, while o, pe, and ps are instantaneous variables. Compared to using the mean or median values within each interval, selecting the instantaneous values of o, pe, and ps as representatives preserves the original data patterns more effectively and minimizes the influence of human interpretation.

    b. Spatially, stations are aggregated based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. After aggregation, our aggregated dataset comprises 331 regions (also called traffic zones) with 4344 timestamps.

    Second, variance tests and zero-value filtering functions were employed to filter out traffic zones with zero or no change in charging data. Specifically, it means that

  8. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  9. Raw Data.

    • plos.figshare.com
    xlsx
    Updated Jul 1, 2025
    Cite
    Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang (2025). Raw Data. [Dataset]. http://doi.org/10.1371/journal.pone.0326380.s001
    Explore at:
    xlsx
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaning indicators are widely used to evaluate the efficacy of cleaning processes in automated washer-disinfectors (AWDs) in healthcare settings. In this study, we systematically analyzed the performance of commercial indicators across multiple simulated cleaning protocols to guide the correct selection of suitable cleaning indicators in Central Sterile Supply Departments (CSSD). Eleven commercially available cleaning indicators were tested in five cleaning simulations, P0 to P4, where P1 represented the standard cleaning process in CSSD, while P2-P4 incorporated induced-error cleaning processes to mimic real-world errors. All indicators were uniformly positioned at the top level of the cleaning rack to ensure comparable exposure. Key parameters, including indicator response dynamics (e.g., wash-off sequence) and final residue results, were documented throughout the cleaning cycles. The final wash-off results given by the indicators under P0, in which no detergent was injected, were much worse than those of the other four processes. Under different simulations, the final results of the indicators and their wash-off sequences changed substantially. In conclusion, an effective indicator must be selected experimentally. The last indicator to be washed off during the normal cleaning process that can simultaneously clearly show the presence of dirt residue under induced error conditions is the optimal indicator for monitoring cleaning processes.

  10. YouTube Recommendation Data (For Cleaning & ML )

    • kaggle.com
    zip
    Updated Oct 1, 2025
    Cite
    Shravan Kumar (2025). YouTube Recommendation Data (For Cleaning & ML ) [Dataset]. https://www.kaggle.com/datasets/iitanshravan/youtube-recommendation-data-for-cleaning-and-ml
    Explore at:
    zip (28098010 bytes)
    Dataset updated
    Oct 1, 2025
    Authors
    Shravan Kumar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    🟥 Synthetic YouTube Recommendation Dataset (1M Rows, With Errors)

    📌 Overview

    This is a synthetic dataset of 1,000,000 user–video interactions generated to simulate how a YouTube-like recommendation system might log activity.

    It is designed for data cleaning practice, feature engineering, and machine learning modeling. 👉 Unlike clean benchmark datasets, this one intentionally contains messy data and errors so you can practice real-world data wrangling before building ML models.

  11. BigMart Retail Sales

    • data.niaid.nih.gov
    Updated May 2, 2022
    Cite
    Dataman (2022). BigMart Retail Sales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6509954
    Explore at:
    Dataset updated
    May 2, 2022
    Authors
    Dataman
    License

    Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Nothing ever becomes real till it is experienced.

    -John Keats

    While we don't know the context in which John Keats said this, we are sure about its implication in data science. While you would have enjoyed and gained exposure to real-world problems in this challenge, here is another opportunity to get your hands dirty with this practice problem.

    Problem Statement :

    The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

    Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

    Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

    Data :

    We have 14,204 samples in the data set.

    Variable Description

    Item Identifier: A code provided for the item of sale

    Item Weight: Weight of item

    Item Fat Content: A categorical column of how much fat is present in the item: ‘Low Fat’, ‘Regular’, ‘low fat’, ‘LF’, ‘reg’

    Item Visibility: Numeric value for how visible the item is

    Item Type: What category does the item belong to: ‘Dairy’, ‘Soft Drinks’, ‘Meat’, ‘Fruits and Vegetables’, ‘Household’, ‘Baking Goods’, ‘Snack Foods’, ‘Frozen Foods’, ‘Breakfast’, ’Health and Hygiene’, ‘Hard Drinks’, ‘Canned’, ‘Breads’, ‘Starchy Foods’, ‘Others’, ‘Seafood’.

    Item MRP: The maximum retail price (MRP) of the item

    Outlet Identifier: The outlet in which the item was sold. This is a categorical column

    Outlet Establishment Year: Which year was the outlet established

    Outlet Size: A categorical column to explain size of outlet: ‘Medium’, ‘High’, ‘Small’.

    Outlet Location Type: A categorical column to describe the location of the outlet: ‘Tier 1’, ‘Tier 2’, ‘Tier 3’

    Outlet Type: Categorical column for type of outlet: ‘Supermarket Type1’, ‘Supermarket Type2’, ‘Supermarket Type3’, ‘Grocery Store’

    Item Outlet Sales: Sales of the item at the particular outlet; this is the target variable to predict.

    Evaluation Metric:

    We will use the Root Mean Square Error (RMSE) value to judge your submission.
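
    As a small illustration, the sketch below harmonizes the inconsistent Item Fat Content spellings noted above and computes RMSE; the file name and the exact column spelling (Item_Fat_Content) are assumptions about the released files:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("bigmart_train.csv")  # hypothetical file name

# Collapse the five inconsistent fat-content spellings into two canonical labels.
fat_map = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(fat_map)

# Evaluation metric: Root Mean Square Error between predictions and actual sales.
def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([100.0, 250.0], [110.0, 240.0]))  # 10.0
```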

  12. CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 28, 2020
    Cite
    Dogancan Temel; Gukyeong Kwon; Mohit Prabhushankar; Ghassan AlRegib (2020). CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3903065
    Explore at:
    Dataset updated
    Jun 28, 2020
    Dataset provided by
    Georgia Institute of Technology
    Authors
    Dogancan Temel; Gukyeong Kwon; Mohit Prabhushankar; Ghassan AlRegib
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under diverse challenging conditions where trained models can possibly be deployed. To achieve this goal, we introduced a large-scale (>2M images) traffic sign recognition dataset (CURE-TSR) which is among the most comprehensive datasets with controlled synthetic challenging conditions. Traffic sign images in the CURE-TSR dataset were cropped from the CURE-TSD dataset, which includes around 1.7 million real-world and simulator images with more than 2 million traffic sign instances. Real-world images were obtained from the BelgiumTS video sequences and simulated images were generated with the Unreal Engine 4 game development tool. Sign types include speed limit, goods vehicles, no overtaking, no stopping, no parking, stop, bicycle, hump, no left, no right, priority to, no entry, yield, and parking. Unreal and real sequences were processed with state-of-the-art visual effect software Adobe(c) After Effects to simulate challenging conditions, which include rain, snow, haze, shadow, darkness, brightness, blurriness, dirtiness, colorlessness, sensor and codec errors. Please refer to our GitHub page for code, papers, and more information.

    Instructions:

    The name format of the provided images is as follows: "sequenceType_signType_challengeType_challengeLevel_Index.bmp"

    sequenceType: 01 - Real data 02 - Unreal data

    signType: 01 - speed_limit 02 - goods_vehicles 03 - no_overtaking 04 - no_stopping 05 - no_parking 06 - stop 07 - bicycle 08 - hump 09 - no_left 10 - no_right 11 - priority_to 12 - no_entry 13 - yield 14 - parking

    challengeType: 00 - No challenge 01 - Decolorization 02 - Lens blur 03 - Codec error 04 - Darkening 05 - Dirty lens 06 - Exposure 07 - Gaussian blur 08 - Noise 09 - Rain 10 - Shadow 11 - Snow 12 - Haze

    challengeLevel: A number between 01 and 05, where 01 is the least severe and 05 is the most severe challenge.

    Index: A number that distinguishes different instances of traffic signs under the same conditions.

  13. Salary-Data

    • kaggle.com
    zip
    Updated Aug 21, 2022
    Cite
    Sourav Bose (2022). Salary-Data [Dataset]. https://www.kaggle.com/datasets/souravbose/salary-prediction
    Explore at:
    zip (14178368 bytes)
    Dataset updated
    Aug 21, 2022
    Authors
    Sourav Bose
    Description

    Problem Description: Develop a salary prediction system based on the given dataset.

    Data supplied: You are given two data files in CSV format:

    • train_features.csv: Each row represents the metadata for an individual job posting. The "jobId" column is a unique identifier for the job posting; the remaining columns describe the features of the posting.
    • train_salaries.csv: Each row associates a "jobId" with a "salary".

    The first row of each file contains headers for the columns. Keep in mind that the metadata and salary data were crawled from the internet. As such, it's possible that the data is dirty (it may contain errors).
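
    A small pandas sketch of the first step this layout implies, joining features to salaries on jobId and running a couple of basic sanity checks; the checks themselves are illustrative assumptions, not part of the supplied brief:

```python
import pandas as pd

features = pd.read_csv("train_features.csv")
salaries = pd.read_csv("train_salaries.csv")

# Join the two files on their shared key.
train = features.merge(salaries, on="jobId", how="inner")

# Basic sanity checks for crawled (possibly dirty) data.
print("duplicate jobIds:", train["jobId"].duplicated().sum())
print("non-positive salaries:", (train["salary"] <= 0).sum())
train = train[train["salary"] > 0]  # drop rows with implausible salaries
```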

    Questions

    1. What steps did you take to prepare the data for the project? Was any cleaning necessary?
    2. What algorithmic method did you apply? Why? What other methods did you consider?
    3. Describe how the algorithmic method that you chose works.
    4. What features did you use? Why?
    5. How did you train your model? During training, what issues concerned you?
    6. How did you assess the accuracy of your predictions? Why did you choose that method? Would you consider any alternative approaches for assessing accuracy?
    7. Which features had the most significant impact on salary? How did you identify these to be most significant? Which features had the least impact on salary? How did you identify these?

  14. Student Performance and Attendance Dataset

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    Marvy Ayman Halim (2025). Student Performance and Attendance Dataset [Dataset]. https://www.kaggle.com/datasets/marvyaymanhalim/student-performance-and-attendance-dataset
    Explore at:
    zip(5849540 bytes)Available download formats
    Dataset updated
    Mar 10, 2025
    Authors
    Marvy Ayman Halim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Description: This synthetic dataset is designed to help beginners and intermediate learners practice data cleaning and analysis in a realistic setting. It simulates a student tracking system, covering key areas like:

    Attendance tracking 📅

    Homework completion 📝

    Exam performance 🎯

    Parent-teacher communication 📢

    ✅ Why Use This Dataset? While many datasets are pre-cleaned, real-world data is often messy. This dataset includes intentional errors to help you develop essential data cleaning skills before diving into analysis. It’s perfect for building confidence in handling raw data!

    🛠️ Cleaning Challenges You’ll Tackle This dataset is packed with real-world issues, including:

    Messy data: Names in lowercase, typos in attendance status.

    Inconsistent date formats: Mix of MM/DD/YYYY and YYYY-MM-DD.

    Incorrect values: Homework completion rates in mixed formats (e.g., 80% and 90).

    Missing data: Guardian signatures, teacher comments, and emergency contacts.

    Outliers: Exam scores over 100 and negative homework completion rates.

    🚀 Your Task: Clean, structure, and analyze this dataset using Python or SQL to uncover meaningful insights!

    📌 5. Handle Outliers

    Remove exam scores above 100.

    Convert homework completion rates to consistent percentages.
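
    A hedged pandas sketch of the date, percentage, and outlier fixes listed above; the file name and column names (attendance_date, homework_completion, exam_score) are assumptions, so adjust them to the actual headers:

```python
import pandas as pd

df = pd.read_csv("student_performance.csv")  # hypothetical file name

# Mixed MM/DD/YYYY and YYYY-MM-DD strings: parse each format separately, then combine.
as_mdy = pd.to_datetime(df["attendance_date"], format="%m/%d/%Y", errors="coerce")
as_iso = pd.to_datetime(df["attendance_date"], format="%Y-%m-%d", errors="coerce")
df["attendance_date"] = as_mdy.fillna(as_iso)

# Homework completion appears both as "80%" and as bare numbers like 90.
hw = pd.to_numeric(df["homework_completion"].astype(str).str.rstrip("%"), errors="coerce")
df["homework_completion"] = hw.where(hw.between(0, 100))  # out-of-range -> missing

# Outliers: drop exam scores above 100 (rows with missing scores are kept).
df["exam_score"] = pd.to_numeric(df["exam_score"], errors="coerce")
df = df[~(df["exam_score"] > 100)]
```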

    📌 6. Generate Insights & Visualizations

    What’s the average attendance rate per grade?

    Which subjects have the highest performance?

    What are the most common topics in parent-teacher communication?

  15. Sewerage and connections

    • data.europa.eu
    csv, esri shape, json
    Cite
    Sewerage and connections [Dataset]. https://data.europa.eu/data/datasets/29450-riolering-en-aansluitingen?locale=en
    Explore at:
    csv, json, esri shape
    License

    Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Description

    This data collection includes connection pipes from gullies and houses to the main sewer, as well as the geometry of the sewer strands and wells of the main sewer. The data related to connections can also be retrieved using a web feature service: https://data.riox.online/eindhoven/wfs

    Do you want to measure distances in the map? This can be done in the ArcGIS viewer.


    A number of points for attention:

    This data collection contains a large number of line segments; for performance reasons, you need to zoom in to view them all.

    House connection fittings, some house connection lines, and wells are originally point features that are shown as closed line objects in the map.

    The home connections (both connection lines and fittings) are not yet complete for the whole of Eindhoven; at the time of writing, these are still being added per area.

    The attribute TYPE can be used to deduce whether it is a part of the main sewer, a well, or data related to home connections.

    Main sewer

    The location of the main sewer is shown for reference only. The available categories are: Mixed water, Rainwater (stormwater), Dirty water, and Dirty water + Roof surface.

    Wells

    For the wells, only the well name and location are shown, for reference purposes.

    Home connections

    For the house connections, both the connection pipe and the associated fitting are shown.

    For 'House Connection Fittings', the following attributes are regularly available: ADRES (address to which the fitting applies; this attribute is not always filled), PLAATS (should always be Eindhoven for this data; if it is not Eindhoven, it is an error), STELSEL (system to which the fitting is connected), DIAMETER (diameter of the fitting in millimetres; 0 means unknown), MATERIAAL (material of the fitting), BEGINPUT and EINDPUT (correspond to the PUTNAAM of a sewer well in the main sewer; this attribute is not always filled), PUTAFSTAND (distance from the fitting to the BEGINPUT; this attribute is not always filled), DIEPTE (only sporadically filled; "-" or empty when unknown), JAAR (year of construction; not always filled), DATUM (placement date in the file; not always filled), NLCS (layer name according to the Dutch CAD standard; not always filled), REFERENTIE (this attribute is not always filled), and TYPE (whether it is an unblocking piece or an inlet fitting).

    The following attributes are available for 'House Connection Lines': ADDRESS (address to which the attachment applies; this attribute is not always filled), PLACE (should always be Eindhoven for this data; if it is not Eindhoven, it is an error), STELSEL (system to which the attachment is connected), DIAMETER (diameter of the attachment in millimetres; if 0 then unknown), MATERIAL (material of the attachment), BEGINPUT and EINDPUT (corresponding to the PUTNAME of a sewer well from the main sewer; this attribute is not always filled), PUTAFSTAND (distance from the attachment to the BEGINPUT; this attribute is not always filled), DIEPTE (only sporadically filled; "-" or empty when unknown), YEAR (year of construction; not always filled), DATE (placement date in the file; not always filled), NLCS (layer name according to the Dutch CAD standard; not always filled), and REFERENCE (this attribute is not always filled).
