https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
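As a sketch of what that practice might look like, the snippet below imputes a tiny frame with the same five columns. The values are invented; filling numeric columns with the median and categorical columns with the mode is just one reasonable choice:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the dataset described above: same five columns,
# a few missing entries in each type of column.
df = pd.DataFrame({
    "Category": ["A", "B", None, "A"],
    "Price": [10.0, np.nan, 25.5, 40.0],
    "Rating": [4.0, 3.0, np.nan, 5.0],
    "Stock": ["In Stock", None, "Out of Stock", "In Stock"],
    "Discount": [5.0, 10.0, np.nan, 0.0],
})

# Numerical columns: fill with the median.
for col in ["Price", "Rating", "Discount"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with the mode (most frequent value).
for col in ["Category", "Stock"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```

With structured missingness at the high percentages quoted above (e.g. 63%), simple statistics like these become unreliable, which is part of what the dataset is meant to demonstrate.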
Data cleaning (or data cleansing) means cleaning the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. In general, missing values arise because of collection errors or corrupted data.
Here is some more detail: Feature Engineering - Handling Missing Values.
The Wine_Quality.csv dataset has numerical missing data, and the students_Performance.mv.csv dataset has both numerical and categorical missing data.
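A dtype-driven sketch can cover both cases: numeric-only gaps (as in Wine_Quality.csv) and mixed numeric/categorical gaps (as in students_Performance.mv.csv). The frame below is an invented stand-in for either file:

```python
import numpy as np
import pandas as pd

# Invented sample with one numeric and one categorical column.
df = pd.DataFrame({
    "score": [55.0, np.nan, 72.0, 64.0],
    "grade": ["B", "A", None, "B"],
})

# Split columns by dtype so the same code handles both datasets.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numeric gaps: fill with the column mean; categorical gaps: the mode.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
```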
This dataset contains 1,000 employee records across different departments and cities, designed for practicing data cleaning, preprocessing, and handling missing values in real-world scenarios.
This dataset demonstrates handling missing values with Python libraries such as NumPy and pandas, including the use of NaN and None values and the detecting, dropping, and filling of null values.
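The three operations mentioned (detecting, dropping, and filling null values) might be sketched like this; pandas treats both np.nan and None as missing:

```python
import numpy as np
import pandas as pd

# A small series containing both NaN and None.
s = pd.Series([1.0, np.nan, 3.0, None])

detected = s.isna()          # boolean mask: True where missing
dropped = s.dropna()         # the series with null rows removed
filled = s.fillna(s.mean())  # nulls replaced with the mean of the rest
```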
This dataset was created by Safacan Metin
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Tuệ Nguyễn
Released under Apache 2.0
This dataset was created by Pankesh Patel
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Zaid Mohammed Ibrahim
Released under MIT
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Feroz Shinwari
Released under Apache 2.0
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.
It includes 500 rows of simulated employee data with intentional errors such as:
Missing values in Age and Salary
Typos in email addresses (@gamil.com)
Inconsistent city name casing (e.g., lahore, Karachi)
Extra spaces in department names (e.g., " HR ")
✅ Skills You Can Practice:
Detecting and handling missing data
String cleaning and formatting
Removing duplicates
Validating email formats
Standardizing categorical data
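A hedged sketch of these practice tasks follows. The column names (age, salary, email, city, department) mirror the description but are assumptions, and the rows are invented:

```python
import numpy as np
import pandas as pd

# Invented employee rows exhibiting the errors listed above:
# missing age/salary, @gamil.com typos, mixed city casing,
# padded department names, and an exact duplicate row.
df = pd.DataFrame({
    "age": [25, np.nan, 30, 25],
    "salary": [50000.0, 60000.0, np.nan, 50000.0],
    "email": ["a@gamil.com", "b@gmail.com", "c@gmail.com", "a@gamil.com"],
    "city": ["lahore", "Karachi", "LAHORE", "lahore"],
    "department": [" HR ", "IT", "Finance", " HR "],
})

df["department"] = df["department"].str.strip()   # extra spaces
df["city"] = df["city"].str.title()               # inconsistent casing
df["email"] = df["email"].str.replace("@gamil.com", "@gmail.com", regex=False)
df["age"] = df["age"].fillna(df["age"].median())  # missing values
df["salary"] = df["salary"].fillna(df["salary"].median())
df = df.drop_duplicates()                         # exact duplicates

# Validate email format with a simple (deliberately loose) regex.
valid = df["email"].str.match(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
```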
You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.
This dataset is the final solution for dealing with missing values in the Spaceship Titanic competition. Kaggle Notebook: https://www.kaggle.com/sardorabdirayimov/best-way-of-dealing-with-missing-values-titanic-2/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the paper "Identifying missing data handling methods with text mining".
It contains the type of missing data handling method used by a given paper.
id: ID of the article
origin: Source journal
pub_year: Publication year
discipline: Discipline category of the article based on origin
about_missing: Is the article about missing data handling? (0 - no, 1 - yes)
imputation: Was some kind of imputation technique used in the article? (0 - no, 1 - yes)
advanced: Was some kind of advanced imputation technique used in the article? (0 - no, 1 - yes)
deletion: Was some kind of deletion technique used in the article? (0 - no, 1 - yes)
text_tokens: Text snippets extracted from the original articles
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
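One way to sketch the identification step is to count nulls per column and per row and tally which missingness patterns occur (the frame is illustrative only):

```python
import numpy as np
import pandas as pd

# Small frame with both NaN and None as missing markers.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, None],
})

per_column = df.isna().sum()         # nulls per column
per_row = df.isna().sum(axis=1)      # nulls per row
patterns = df.isna().value_counts()  # frequency of each missingness pattern
# For a visual overview, libraries such as matplotlib or missingno can
# plot df.isna() as a matrix; that step is omitted here.
```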
This dataset was created by Himanshu Kumar
https://creativecommons.org/publicdomain/zero/1.0/
Activity Title: "Fix the Gaps: Data Hospital Simulation" (an activity created for students to practice techniques for handling missing data)
Description: Provide each team with a “broken patient record” dataset (incomplete entries with NaNs or blanks). Teams act as data doctors:
• Diagnose the type of missingness (MCAR, MAR, MNAR)
• Choose suitable imputation techniques (mean, median, KNN, regression)
• Compare outcomes from different methods
Tools: Jupyter notebook / Pandas
Outcome: Group presentation on the impact of imputation and justification of the method used.
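A minimal version of the exercise imputes the same broken column two ways and compares the outcomes. Here blood_pressure is a hypothetical column; KNN and regression imputation (e.g. via scikit-learn's KNNImputer) follow the same pattern but are omitted to keep the sketch self-contained:

```python
import numpy as np
import pandas as pd

# A "broken patient record" column with two missing readings.
bp = pd.Series([120.0, np.nan, 140.0, 200.0, np.nan], name="blood_pressure")

mean_imputed = bp.fillna(bp.mean())
median_imputed = bp.fillna(bp.median())

# The outlier (200) pulls the mean above the median, so the two
# methods disagree on the imputed rows -- exactly the kind of
# difference teams are asked to present and justify.
```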
This dataset was created by Deep Jani
Released under Data files © Original Authors
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv
| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Some columns (Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
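One way to use this menu as a lookup table is to repair the Price Per Unit column wherever the recorded price is missing or disagrees with the canonical price (the sample rows below are invented):

```python
import numpy as np
import pandas as pd

# Canonical prices taken from the menu table above.
menu = {"Coffee": 2.0, "Tea": 1.5, "Sandwich": 4.0, "Salad": 5.0,
        "Cake": 3.0, "Cookie": 1.0, "Smoothie": 4.0, "Juice": 3.0}

# Invented rows: one missing price, one inconsistent price.
df = pd.DataFrame({
    "Item": ["Coffee", "Tea", "Cake"],
    "Price Per Unit": [2.0, np.nan, 9.0],
})

canonical = df["Item"].map(menu)
mismatch = df["Price Per Unit"].isna() | (df["Price Per Unit"] != canonical)
df.loc[mismatch, "Price Per Unit"] = canonical[mismatch]
```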
This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency: Check the Transaction Date column for missing or incorrect dates.
4. Feature Engineering: Derive new columns, such as Day of the Week or Transaction Month, for further analysis.
This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
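The cleaning steps above might be sketched as follows on a few invented rows shaped like dirty_cafe_sales.csv:

```python
import numpy as np
import pandas as pd

# Invented rows following the dirty_cafe_sales.csv data dictionary.
df = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Sandwich"],
    "Quantity": ["1", "UNKNOWN", "3"],
    "Price Per Unit": ["2.00", "4.00", "4.00"],
    "Transaction Date": ["2023-01-01", None, "2023-01-03"],
})

# Invalid values: map "ERROR"/"UNKNOWN" to NaN first.
df = df.replace({"ERROR": np.nan, "UNKNOWN": np.nan})

# Missing values: numeric -> median, categorical -> "Unknown".
df["Quantity"] = pd.to_numeric(df["Quantity"])
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())
df["Item"] = df["Item"].fillna("Unknown")

# Date consistency: parse dates, coercing bad entries to NaT.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Feature engineering: derive Day of the Week.
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
```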
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
The original dataset shared on GitHub can be found here. These are hands-on practice datasets linked through the Coursera Guided Project certificate course Handling Missing Values in R, part of the Coursera Project Network. The dataset links were shared by the course's original author and instructor, Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and of handling (numeric-only) missing values in R. They are good for anyone looking for a beginner-to-intermediate understanding of these subjects.
Here are my notebooks as kernels using these datasets, plus a few more datasets preloaded in R, as suggested by the instructor: TidY DatA Practice and MissinG DatA HandlinG - NumeriC.
https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains sales records from a café. Initially, it was messy, with missing values represented as NaN, UNKNOWN, and ERROR. The following cleaning steps were applied:
1. Handling Missing Values Replaced missing values with appropriate statistics: (i) mode for categorical columns (Item, Payment Method, and Location); (ii) mean for numerical columns (Quantity); (iii) median for temporal data (Transaction Date).
2. Price Standardization Inconsistent values in the Price per Unit column were corrected by filling them with the appropriate consistent price from the dataset.
3. Data Type Conversion Converted all columns to their appropriate data types (e.g., datetime for transaction dates, numeric for quantities and prices, categorical for items, payment methods, and locations).
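The data type conversion step might look like this in pandas (the sample rows are invented stand-ins for the cleaned café data):

```python
import pandas as pd

# Invented cleaned rows, all still stored as strings.
df = pd.DataFrame({
    "Transaction Date": ["2023-01-01", "2023-01-02"],
    "Quantity": ["2", "5"],
    "Item": ["Coffee", "Tea"],
})

# Convert each column to its appropriate dtype.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"])
df["Quantity"] = pd.to_numeric(df["Quantity"])
df["Item"] = df["Item"].astype("category")
```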
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dive into this specially curated dataset on credit card applications 📊.
An interesting approach to privacy has been taken in this dataset: every name and value has been creatively altered to ensure confidentiality 🔒.
What's inside?
A diverse collection of data that's sure to pique your interest. You'll encounter a range of continuous variables, giving you a glimpse into quantitative insights 📈.
Then, there are categorical variables – some with just a handful of options offering a neat, compact view, and others with a plethora of choices, adding layers of complexity and richness.
But here's where it gets even more intriguing – the dataset has been intentionally peppered with additional missing values 💡.
This isn't your average dataset; it's a playground for those who love a good data challenge.
The goal?
To equip you with real-world skills in handling and imputing missing data 🧩. You'll learn to navigate through these informational gaps, employing various imputation techniques to unveil the hidden stories within the data.
This dataset isn't just about understanding credit card applications 💳. It's a journey into the heart of data analysis and machine learning 🤖.
Whether you're a beginner eager to learn the ropes or an experienced data enthusiast looking to refine your skills, this dataset offers a unique opportunity. It challenges you to apply theoretical knowledge to practical scenarios, transforming abstract concepts into tangible skills.
So, if you're ready to test your mettle against real-world data puzzles, this is your chance. Unleash your analytical prowess, explore diverse imputation strategies, and uncover the secrets hidden in incomplete data. Welcome to a world where data tells a story, and you're the storyteller 🌐