100+ datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Is it time to stop sweeping data cleaning under the carpet? A novel...
plos.figshare.com
docx
Updated Jun 1, 2023
Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0228154
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. 
This algorithm is broadly applicable to data cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.

Netflix Data: Cleaning, Analysis and Visualization

kaggle.com

zip

Updated Aug 26, 2022

Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization

Available download formats: zip (276607 bytes)

Dataset updated

Aug 26, 2022

Authors

Abdulrasaq Ariyo

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021. The oldest title dates from 1925 and the newest from 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.

Data Cleaning

We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns

Extra steps and further explanation of the process are given in the code comments.

--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, so we check it for duplicates
--Only show_id values that occur more than once are returned

SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
HAVING COUNT(*) > 1
ORDER BY show_id DESC;

--No duplicates

--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
    COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
    COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
    COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
    COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
    COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
    COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
    COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
    COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
    COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;

We can see that there are NULLS. 
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3

Nulls make up about 30% of the director column, so I will not delete them. Instead, I will use another column to populate it. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

With this, we can now populate the NULL rows in director using the matching movie_cast records

UPDATE netflix 
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;

--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--When I was doing this, I found a less complex and faster way to populate a column which I will use next

Just like the director column, I will not delete the nulls in country. Since the country column is related to the director and the movie, we are going to populate the country column using the director column.

--Populate the country using the director column

SELECT COALESCE(nt.country, nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;

UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director
AND netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;


--To confirm if there are still directors linked to country that refuse to update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL in country as "Not Given"

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;
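
For readers working outside the database, the fill logic above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `fill_country_from_director` and the sample rows are hypothetical, and unlike the SQL self-join (which may pick any matching row) this sketch takes the first known country seen for each director.

```python
def fill_country_from_director(rows, default="Not Given"):
    """Fill a missing 'country' from another row by the same director, else use a default."""
    # Map each director to the first non-missing country seen for them.
    known = {}
    for row in rows:
        if row["director"] is not None and row["country"] is not None:
            known.setdefault(row["director"], row["country"])

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row["country"] is None:
            new_row["country"] = known.get(new_row["director"], default)
        filled.append(new_row)
    return filled


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"show_id": "s1", "director": "Director A", "country": "Country X"},
    {"show_id": "s2", "director": "Director A", "country": None},
    {"show_id": "s3", "director": "Director B", "country": None},
]
result = fill_country_from_director(sample)
```

Here `s2` inherits the country of `s1` because both share a director, while `s3` has no sibling row to borrow from and falls back to the default.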

The date_added column has only 10 nulls out of over 8000 rows, so deleting them will not meaningfully affect our analysis or visualization

--Show date_added nulls

SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;

--DELETE nulls

DELETE F...

Dirty Dataset to practice Data Cleaning
kaggle.com
zip
Updated May 20, 2024
Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
Available download formats: zip (1235 bytes)
Dataset updated
May 20, 2024
Authors
Martin Kanju
Description
Dataset

This dataset was created by Martin Kanju

Released under Other (specified in description)

Data Cleaning Portfolio Project
kaggle.com
zip
Updated Apr 2, 2024
Deepali Sukhdeve (2024). Data Cleaning Portfolio Project [Dataset]. https://www.kaggle.com/datasets/deepalisukhdeve/data-cleaning-portfolio-project
Available download formats: zip (6053781 bytes)
Dataset updated
Apr 2, 2024
Authors
Deepali Sukhdeve
Description
Dataset

This dataset was created by Deepali Sukhdeve

Clean Data Yolo_detection Dataset
universe.roboflow.com
zip
Updated Sep 9, 2025
SamYolo (2025). Clean Data Yolo_detection Dataset [Dataset]. https://universe.roboflow.com/samyolo-5ewir/clean-data-yolo_detection-xn8eh
Available download formats: zip
Dataset updated
Sep 9, 2025
Dataset authored and provided by
SamYolo
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Objects Bounding Boxes
Description
Clean Data Yolo_detection

## Overview
Clean Data Yolo_detection is a dataset for object detection tasks - it contains Objects annotations for 1,000 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).

Retail Store Sales: Dirty for Data Cleaning

kaggle.com

zip

Updated Jan 18, 2025

Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning

Available download formats: zip (226740 bytes)

Dataset updated

Jan 18, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Retail Store Sales Dataset

Overview

The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

File Information

File Name: retail_store_sales.csv
Number of Rows: 12,575
Number of Columns: 11

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Customer ID`	A unique identifier for each customer. 25 unique customers.	`CUST_01`
`Category`	The category of the purchased item.	`Food`, `Furniture`
`Item`	The name of the purchased item. May contain missing values or `None`.	`Item_1_FOOD`, `None`
`Price Per Unit`	The static price of a single unit of the item. May contain missing or `None` values.	`4.00`, `None`
`Quantity`	The quantity of the item purchased. May contain missing or `None` values.	`1`, `None`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `None`
`Payment Method`	The method of payment used. May contain missing or invalid values.	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Online`
`Transaction Date`	The date of the transaction. Always present and valid.	`2023-01-15`
`Discount Applied`	Indicates if a discount was applied to the transaction. May contain missing values.	`True`, `False`, `None`
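
Because `Total Spent` is described as `Quantity * Price Per Unit`, a row missing exactly one of the three values can in principle be repaired from the other two. A minimal sketch, assuming the values have already been parsed to numbers with `None` for missing entries (the function name `reconcile` is hypothetical):

```python
def reconcile(price, quantity, total):
    """Recover one missing field using Total Spent = Quantity * Price Per Unit.

    Each argument is a number or None. If more than one field is missing,
    the row is returned unchanged because it cannot be recovered this way.
    """
    if total is None and price is not None and quantity is not None:
        total = price * quantity
    elif price is None and total is not None and quantity:
        price = total / quantity  # quantity is non-zero here
    elif quantity is None and total is not None and price:
        quantity = total / price  # price is non-zero here
    return price, quantity, total


# Example rows as (Price Per Unit, Quantity, Total Spent).
fixed_quantity = reconcile(4.0, None, 8.0)
fixed_price = reconcile(None, 2, 8.0)
fixed_total = reconcile(4.0, 2, None)
unrecoverable = reconcile(None, None, 8.0)
```

Rows with two or more missing values are left as they are, so a later step can decide whether to impute or drop them.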

Categories and Items

The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

Electric Household Essentials

Item Code	Item Name	Price
Item_1_EHE	Blender	5.0
Item_2_EHE	Microwave	6.5
Item_3_EHE	Toaster	8.0
Item_4_EHE	Vacuum Cleaner	9.5
Item_5_EHE	Air Purifier	11.0
Item_6_EHE	Electric Kettle	12.5
Item_7_EHE	Rice Cooker	14.0
Item_8_EHE	Iron	15.5
Item_9_EHE	Ceiling Fan	17.0
Item_10_EHE	Table Fan	18.5
Item_11_EHE	Hair Dryer	20.0
Item_12_EHE	Heater	21.5
Item_13_EHE	Humidifier	23.0
Item_14_EHE	Dehumidifier	24.5
Item_15_EHE	Coffee Maker	26.0
Item_16_EHE	Portable AC	27.5
Item_17_EHE	Electric Stove	29.0
Item_18_EHE	Pressure Cooker	30.5
Item_19_EHE	Induction Cooktop	32.0
Item_20_EHE	Water Dispenser	33.5
Item_21_EHE	Hand Blender	35.0
Item_22_EHE	Mixer Grinder	36.5
Item_23_EHE	Sandwich Maker	38.0
Item_24_EHE	Air Fryer	39.5
Item_25_EHE	Juicer	41.0

Furniture

Item Code	Item Name	Price
Item_1_FUR	Office Chair	5.0
Item_2_FUR	Sofa	6.5
Item_3_FUR	Coffee Table	8.0
Item_4_FUR	Dining Table	9.5
Item_5_FUR	Bookshelf	11.0
Item_6_FUR	Bed F...

Clean Data.csv
datasetcatalog.nlm.nih.gov
figshare.com
Updated Dec 3, 2023
Hattab, Zaid (2023). Clean Data.csv [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001038541
Dataset updated
Dec 3, 2023
Authors
Hattab, Zaid
Description
This subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and who responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and the insurance on a number of outcomes.
Jyutping Project - Raw Data and Clean Data
rdr.ucl.ac.uk
application/csv
Updated Aug 19, 2024
Joseph Lam (2024). Jyutping Project - Raw Data and Clean Data [Dataset]. http://doi.org/10.5522/04/26504347.v1
Available download formats: application/csv
Unique identifier
https://doi.org/10.5522/04/26504347.v1
Dataset updated
Aug 19, 2024
Dataset provided by
University College London
Authors
Joseph Lam
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Raw and clean data for the Jyutping project, submitted to the International Journal of Epidemiology. All data were openly available at the time of scraping. I only retained Chinese names and Hong Kong Government Romanised English names. This project aims to describe the problem of non-standardised romanisation and its impact on data linkage. The included data allow researchers to replicate my process of extracting Jyutping and Pinyin from Chinese characters. A fair amount of manual screening and reviewing was required, so the code itself was not fully automated. The code is stored on my personal GitHub, https://github.com/Jo-Lam/Jyutping_project/tree/main. Please cite this data resource: doi:10.5522/04/26504347
Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...
datarade.ai
.json, .csv
+ more versions
Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
Available download formats: .json, .csv
Dataset authored and provided by
Coresignal
Area covered
Hungary, Guinea-Bissau, Guatemala, Chile, Guadeloupe, Niue, Panama, Saint Barthélemy, Namibia, Andorra
Description
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).

Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
Clean data (script number 4)
figshare.com
bin
Updated May 22, 2025
Violeta Berdejo Espinola (2025). Clean data (script number 4) [Dataset]. http://doi.org/10.6084/m9.figshare.29036840.v4
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.29036840.v4
Dataset updated
May 22, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Violeta Berdejo Espinola
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw labelled data used for analysis
NSF/NCAR C-130 CN Clean Data
data.ucar.edu
ascii
Updated Oct 7, 2025
Antony D. Clarke (2025). NSF/NCAR C-130 CN Clean Data [Dataset]. http://doi.org/10.26023/FM4V-3BER-7X0J
Available download formats: ascii
Unique identifier
https://doi.org/10.26023/FM4V-3BER-7X0J
Dataset updated
Oct 7, 2025
Authors
Antony D. Clarke
Time period covered
Oct 31, 1995 - Dec 23, 1995
Description
Condensation Nuclei (CN) data collected by the University of Hawaii group (Clarke) in ACE1. All of the variables are average values for 15 second intervals. This dataset is a composite of all of the clean data files.

Cafe Sales - Dirty Data for Cleaning Training

kaggle.com

zip

Updated Jan 17, 2025

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Available download formats: zip (113510 bytes)

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective prices:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3
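
Since menu prices are fixed, the table above can serve as a lookup for repairing rows. The sketch below is one possible approach, not the dataset author's method; the names `MENU`, `price_for`, and `item_for` are hypothetical. Note that two prices (3 and 4) each map to two items, so an item can only be inferred from its price when that price is unique.

```python
# Menu prices copied from the table above.
MENU = {
    "Coffee": 2.0, "Tea": 1.5, "Sandwich": 4.0, "Salad": 5.0,
    "Cake": 3.0, "Cookie": 1.0, "Smoothie": 4.0, "Juice": 3.0,
}


def price_for(item):
    """Return the menu price for a known item, or None for missing/invalid names."""
    return MENU.get(item)


def item_for(price):
    """Return the item for a price only when exactly one menu item has that price."""
    matches = [name for name, menu_price in MENU.items() if menu_price == price]
    return matches[0] if len(matches) == 1 else None


tea_price = price_for("Tea")        # known item
bad_price = price_for("ERROR")      # placeholder value, not on the menu
salad = item_for(5.0)               # unique price
ambiguous = item_for(4.0)           # Sandwich or Smoothie, so no inference
```

Ambiguous prices return `None` rather than a guess, which keeps the repair step from introducing new errors.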

Use Cases

This dataset is suitable for:
- Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
- Exploring EDA techniques like visualizations and summary statistics.
- Performing feature engineering for machine learning workflows.

Cleaning Steps Suggestions

To clean this dataset, consider the following steps:

1. Handle Missing Values:
- Fill missing numeric values with the median or mean.
- Replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
4. Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.
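
The suggested steps can be sketched with the Python standard library alone. This is a rough illustration under assumptions: values are read as strings from the CSV, the placeholder set `INVALID` and all function names are hypothetical, and `None` stands in for NaN.

```python
from datetime import date
from statistics import median

# Placeholder strings treated as missing (an assumption based on the description).
INVALID = {"", "ERROR", "UNKNOWN", "None"}
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def to_missing(value):
    """Map placeholder strings (and None) to None; otherwise return the stripped text."""
    if value is None or value.strip() in INVALID:
        return None
    return value.strip()


def fill_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mid = median(present)
    return [mid if v is None else v for v in values]


def date_features(iso_date):
    """Return (day of the week, month number) for an ISO date such as '2023-01-01'."""
    d = date.fromisoformat(iso_date)
    return DAYS[d.weekday()], d.month


raw_quantities = ["1", "3", "UNKNOWN", "", "5"]
cleaned = [to_missing(q) for q in raw_quantities]
quantities = fill_with_median([None if q is None else float(q) for q in cleaned])
features = date_features("2023-01-01")
```

Whether the median is an appropriate fill depends on the column; for a derived column such as `Total Spent`, recomputing from the other fields is usually preferable to imputing.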

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

location-detection-clean-data
huggingface.co
Updated Mar 29, 2019
Toufiq Musah (2019). location-detection-clean-data [Dataset]. https://huggingface.co/datasets/toufiqmusah/location-detection-clean-data
Dataset updated
Mar 29, 2019
Authors
Toufiq Musah
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
toufiqmusah/location-detection-clean-data dataset hosted on Hugging Face and contributed by the HF Datasets community
salami-processed-enriched-clean-data-trunc
huggingface.co
Updated Jan 29, 2025
+ more versions
Taufiq Syed (2025). salami-processed-enriched-clean-data-trunc [Dataset]. https://huggingface.co/datasets/taufiqsyed/salami-processed-enriched-clean-data-trunc
Dataset updated
Jan 29, 2025
Authors
Taufiq Syed
Description
taufiqsyed/salami-processed-enriched-clean-data-trunc dataset hosted on Hugging Face and contributed by the HF Datasets community
Clean Data Input for HHLocation Study
dataverse.harvard.edu
Updated Dec 7, 2015
Andy Krause (2015). Clean Data Input for HHLocation Study [Dataset]. http://doi.org/10.7910/DVN/QDEMVF
Unique identifier
https://doi.org/10.7910/DVN/QDEMVF
Dataset updated
Dec 7, 2015
Dataset provided by
Harvard Dataverse
Authors
Andy Krause
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is the cleaned household location data for the reproducible HHLocation case study
Mauritania Imports from Spain of Dish Washing Machines, Machinery for...
tradingeconomics.com
csv, excel, json, xml
Updated Nov 23, 2022
TRADING ECONOMICS (2022). Mauritania Imports from Spain of Dish Washing Machines, Machinery for Cleaning [Dataset]. https://tradingeconomics.com/mauritania/imports/spain/machines-dishwash-clean-control-fill-packing
Available download formats: excel, json, csv, xml
Dataset updated
Nov 23, 2022
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
Mauritania
Description
Mauritania Imports from Spain of Dish Washing Machines, Machinery for Cleaning was US$459.97 thousand during 2024, according to the United Nations COMTRADE database on international trade. The data, historical chart and statistics for this series were last updated in November 2025.
Dirty E-Commerce Data [80,000+ Products]
kaggle.com
zip
Updated Jun 29, 2024
Oleksii Martusiuk (2024). Dirty E-Commerce Data [80,000+ Products] [Dataset]. https://www.kaggle.com/datasets/oleksiimartusiuk/e-commerce-data-shein
Available download formats: zip (3611849 bytes)
Dataset updated
Jun 29, 2024
Authors
Oleksii Martusiuk
License
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
E-commerce Product Dataset - Clean and Enhance Your Data Analysis Skills or Check Out The Cleaned File Below!

This dataset offers a comprehensive collection of product information from an e-commerce store, spread across 20+ CSV files and covering more than 80,000 products. It presents a valuable opportunity to test and refine your data cleaning and wrangling skills.

What's Included:

A variety of product categories, including:

Apparel & Accessories

Electronics

Home & Kitchen

Beauty & Health

Toys & Games

Men's Clothes

Women's Clothes

Pet Supplies

Sports & Outdoor

(and more!)

Each product record contains details such as:

Product Title

Category

Price

Discount information

(and other attributes)

Challenges and Opportunities:

Data Cleaning: The dataset is "dirty," containing missing values, inconsistencies in formatting, and potential errors. This provides a chance to practice your data-cleaning techniques such as:

Identifying and handling missing values

Standardizing data formats

Correcting inconsistencies

Dealing with duplicate entries

Feature Engineering: After cleaning, you can explore opportunities to create new features from the existing data, such as:

- Extracting keywords from product titles and descriptions
- Deriving price categories
- Calculating average discounts
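
Two of these features, price categories and average discounts, can be sketched as follows. The thresholds (`low=10.0`, `high=50.0`), the function names, and the assumption that discounts are already parsed to numeric percentages are all hypothetical and would need adjusting to the actual files.

```python
def price_category(price, low=10.0, high=50.0):
    """Bucket a price into 'low', 'mid', or 'high' using hypothetical thresholds."""
    if price is None:
        return None
    if price < low:
        return "low"
    if price < high:
        return "mid"
    return "high"


def average_discount(discounts):
    """Mean of the discount percentages that are present; None if none are present."""
    present = [d for d in discounts if d is not None]
    return sum(present) / len(present) if present else None


categories = [price_category(p) for p in [5.0, 10.0, 75.0, None]]
mean_discount = average_discount([10.0, None, 30.0])
```

Missing prices and discounts are passed through as `None` instead of being silently counted as zero, which would bias both features.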

Who can benefit from this dataset?

Data analysts and scientists looking to practice data cleaning and wrangling skills on a real-world e-commerce dataset

Machine learning enthusiasts interested in building models for product recommendation, price prediction, or other e-commerce tasks

Anyone interested in exploring and understanding the structure and organization of product data in an e-commerce setting

By contributing to this dataset and sharing your cleaning and feature engineering approaches, you can help create a valuable resource for the Kaggle community!
Mexico Imports from Italy of Dish Washing Machines, Machinery for Cleaning
tradingeconomics.com
csv, excel, json, xml
Updated Apr 17, 2018
TRADING ECONOMICS (2018). Mexico Imports from Italy of Dish Washing Machines, Machinery for Cleaning [Dataset]. https://tradingeconomics.com/mexico/imports/italy/machines-dishwash-clean-control-fill-packing
Available download formats: json, excel, csv, xml
Dataset updated
Apr 17, 2018
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
Mexico
Description
Mexico Imports from Italy of Dish Washing Machines, Machinery for Cleaning was US$314.87 million during 2024, according to the United Nations COMTRADE database on international trade. The data, historical chart and statistics for this series were last updated in December 2025.
Mexico Imports from Spain of Dish Washing Machines, Machinery for Cleaning
tradingeconomics.com
csv, excel, json, xml
Updated Sep 30, 2022
TRADING ECONOMICS (2022). Mexico Imports from Spain of Dish Washing Machines, Machinery for Cleaning [Dataset]. https://tradingeconomics.com/mexico/imports/spain/machines-dishwash-clean-control-fill-packing
Available download formats: excel, xml, json, csv
Dataset updated
Sep 30, 2022
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
Mexico
Description
Mexico Imports from Spain of Dish Washing Machines, Machinery for Cleaning was US$66.43 million during 2024, according to the United Nations COMTRADE database on international trade. The data, historical chart and statistics for this series were last updated in November 2025.

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

167 scholarly articles cite this dataset (View in Google Scholar)
