100+ datasets found
  1. Data Cleaning - Feature Imputation

    • kaggle.com
    zip
    Updated Aug 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mr.Machine (2022). Data Cleaning - Feature Imputation [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-cleaning-feature-imputation
    Explore at:
    zip(116097 bytes)Available download formats
    Dataset updated
    Aug 13, 2022
    Authors
    Mr.Machine
    Description

    Data Cleaning or Data cleansing is to clean the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. In general, the missing values are found due to collection error or data is corrupted.

    Here some info in details :Feature Engineering - Handling Missing Value

    Wine_Quality.csv dataset have the numerical missing data, and students_Performance.mv.csv dataset have Numerical and categorical missing data's.

  2. D

    Data Cleaning Tools Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jul 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Data Cleaning Tools Report [Dataset]. https://www.datainsightsmarket.com/reports/data-cleaning-tools-1943471
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Jul 27, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming market for data cleaning tools! Our comprehensive analysis reveals a $10 billion+ market in 2025, driven by AI, cloud adoption, and the critical need for high-quality data. Explore key trends, leading companies (Dundas BI, IBM, Sisense), and future growth projections to 2033.

  3. B

    Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  4. Is it time to stop sweeping data cleaning under the carpet? A novel...

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. This algorithm is broadly applicable to data cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.

  5. B

    Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop

    • borealisdata.ca
    • search.dataone.org
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lucia Costanzo; Vivek Jadon (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Borealis
    Authors
    Lucia Costanzo; Vivek Jadon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canada
    Description

    Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.

  6. D

    Data Cleansing Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Sep 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Data Cleansing Software Report [Dataset]. https://www.archivemarketresearch.com/reports/data-cleansing-software-559044
    Explore at:
    doc, pdf, pptAvailable download formats
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Cleansing Software market is poised for substantial growth, estimated to reach approximately USD 3,500 million by 2025, with a projected Compound Annual Growth Rate (CAGR) of around 18% through 2033. This robust expansion is primarily driven by the escalating volume of data generated across all sectors, coupled with an increasing awareness of the critical importance of data accuracy for informed decision-making. Organizations are recognizing that flawed data can lead to significant financial losses, reputational damage, and missed opportunities. Consequently, the demand for sophisticated data cleansing solutions that can effectively identify, rectify, and prevent data errors is surging. Key drivers include the growing adoption of AI and machine learning for automated data profiling and cleansing, the increasing complexity of data sources, and the stringent regulatory requirements around data quality and privacy, especially within industries like finance and healthcare. The market landscape for data cleansing software is characterized by a dynamic interplay of trends and restraints. Cloud-based solutions are gaining significant traction due to their scalability, flexibility, and cost-effectiveness, particularly for Small and Medium-sized Enterprises (SMEs). Conversely, large enterprises and government agencies often opt for on-premise solutions, prioritizing enhanced security and control over sensitive data. While the market presents immense opportunities, challenges such as the high cost of implementation and the need for specialized skill sets to manage and operate these tools can act as restraints. However, advancements in user-friendly interfaces and the integration of data cleansing capabilities within broader data management platforms are mitigating these concerns, paving the way for wider adoption. Major players like IBM, SAP SE, and SAS Institute Inc. are continuously innovating, offering comprehensive suites that address the evolving needs of businesses navigating the complexities of big data.

  7. D

    Data Cleansing Tools Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Data Cleansing Tools Report [Dataset]. https://www.datainsightsmarket.com/reports/data-cleansing-tools-1398134
    Explore at:
    pdf, doc, pptAvailable download formats
    Dataset updated
    May 4, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data cleansing tools market is experiencing robust growth, driven by the escalating volume and complexity of data across various sectors. The increasing need for accurate and reliable data for decision-making, coupled with stringent data privacy regulations (like GDPR and CCPA), fuels demand for sophisticated data cleansing solutions. Businesses, regardless of size, are recognizing the critical role of data quality in enhancing operational efficiency, improving customer experiences, and gaining a competitive edge. The market is segmented by application (agencies, large enterprises, SMEs, personal use), deployment type (cloud, SaaS, web, installed, API integration), and geography, reflecting the diverse needs and technological preferences of users. While the cloud and SaaS models are witnessing rapid adoption due to scalability and cost-effectiveness, on-premise solutions remain relevant for organizations with stringent security requirements. The historical period (2019-2024) showed substantial growth, and this trajectory is projected to continue throughout the forecast period (2025-2033). Specific growth rates will depend on technological advancements, economic conditions, and regulatory changes. Competition is fierce, with established players like IBM, SAS, and SAP alongside innovative startups continuously improving their offerings. The market's future depends on factors such as the evolution of AI and machine learning capabilities within data cleansing tools, the increasing demand for automated solutions, and the ongoing need to address emerging data privacy challenges. The projected Compound Annual Growth Rate (CAGR) suggests a healthy expansion of the market. While precise figures are not provided, a realistic estimate based on industry trends places the market size at approximately $15 billion in 2025. This is based on a combination of existing market reports and understanding of the growth of related fields (such as data analytics and business intelligence). This substantial market value is further segmented across the specified geographic regions. North America and Europe currently dominate, but the Asia-Pacific region is expected to exhibit significant growth potential driven by increasing digitalization and adoption of data-driven strategies. The restraints on market growth largely involve challenges related to data integration complexity, cost of implementation for smaller businesses, and the skills gap in data management expertise. However, these are being countered by the emergence of user-friendly tools and increased investment in data literacy training.

  8. m

    Comprehensive Data Cleansing Software Market Size, Share & Industry Insights...

    • marketresearchintellect.com
    Updated Jul 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Research Intellect (2025). Comprehensive Data Cleansing Software Market Size, Share & Industry Insights 2033 [Dataset]. https://www.marketresearchintellect.com/product/global-data-cleansing-software-market-size-and-forecast/
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policyhttps://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    Market Research Intellect's Data Cleansing Software Market Report highlights a valuation of USD 2.5 billion in 2024 and anticipates growth to USD 5.1 billion by 2033, with a CAGR of 9.2% from 2026-2033.Explore insights on demand dynamics, innovation pipelines, and competitive landscapes.

  9. Atlantic Hurricanes (Data Cleaning Challenge)

    • kaggle.com
    zip
    Updated Aug 5, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Valery Liamtsau (2022). Atlantic Hurricanes (Data Cleaning Challenge) [Dataset]. https://www.kaggle.com/datasets/valery2042/hurricanes
    Explore at:
    zip(13216 bytes)Available download formats
    Dataset updated
    Aug 5, 2022
    Authors
    Valery Liamtsau
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the information about Atlantic hurricanes of Categories 1,2,3 and 5 from 1920 to 2020. This very messy dataset is designed to improve your data cleaning skills

    Your task:

    1. Clean up duration column and get hurricane duration in days
    2. Clean up damage column
    3. Filter area affected column by USA states
    4. Clean up wind speed and pressure columns
    5. Get the date from duration column
  10. f

    Restaurant Menu (Data Cleaning)

    • rochester.figshare.com
    txt
    Updated Sep 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aabha Pandit; Alois Romanowski; Heather Owen (2025). Restaurant Menu (Data Cleaning) [Dataset]. http://doi.org/10.60593/ur.d.26462404.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    University of Rochester
    Authors
    Aabha Pandit; Alois Romanowski; Heather Owen
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Restaurant Menu DatasetWith approximately 45,000 menus dating from the 1840s to the present, The New York Public Library’s restaurant menu collection is one of the largest in the world. The menu data has been transcribed, dish by dish, into this dataset. For more information, please see http://menus.nypl.org/about.This dataset is not clean and contains many missing values, making it perfect to practice data cleaning tools and techniques.Dataset Variables:id: identifier for menuname: sponsor: who sponsored the meal (organizations, people, name of restaurant)event: categoryvenue: type of place (commercial, social, professional)place: where the meal took place (often a geographic location)physical_description: dimension and material description of the menuoccasion: occasion of the meal (holidays, anniversaries, daily)notes: notes by librarians about the original materialcall_number: call number of the menukeywords: language: date: date of the menulocation: organization or business who produced the menulocation_typecurrency: system of money the menu uses (dollars, etc)currency_symbol: symbol for the currency ($, etc)status: completeness of the menu transcription (transcribed, under review, etc)page_count: how many pages the menu hasdish_count: how many dishes the menu has

  11. Teaching & Learning Team Data Cleaning and Visualization Workshop

    • figshare.com
    pdf
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Elizabeth Joan Kelly (2023). Teaching & Learning Team Data Cleaning and Visualization Workshop [Dataset]. http://doi.org/10.6084/m9.figshare.6223541.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Elizabeth Joan Kelly
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Materials from workshop conducted for Monroe Library faculty as part of TLT/Faculty Development/Digital Scholarship on 2018-04-05. Objectives:Clean dataAnalyze data using pivot tablesVisualize dataDesign accessible instruction for working with dataAssociated Research Guide at http://researchguides.loyno.edu/data_workshopData sets are from the following:

    BaroqueArt Dataset by CulturePlex Lab is licensed under CC0 What's on the Menu? Menus by New York Public Library is licensed under CC0 Dog movie stars and dog breed popularity by Ghirlanda S, Acerbi A, Herzog H is licensed under CC BY 4.0 NOPD Misconduct Complaints, 2016-2018 by City of New Orleans Open Data is licensed under CC0 U.S. Consumer Product Safety Commission Recall Violations by CU.S. Consumer Product Safety Commission, Violations is licensed under CC0 NCHS - Leading Causes of Death: United States by Data.gov is licensed under CC0 Bob Ross Elements by Episode by Walt Hickey, FiveThirtyEight, is licensed under CC BY 4.0 Pacific Walrus Coastal Haulout 1852-2016 by U.S. Geological Survey, Alaska Science Center is licensed under CC0 Australia Registered Animals by Sunshine Coast Council is licensed under CC0

  12. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip(113510 bytes)Available download formats
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    Column NameDescriptionExample Values
    Transaction IDA unique identifier for each transaction. Always present and unique.TXN_1234567
    ItemThe name of the item purchased. May contain missing or invalid values (e.g., "ERROR").Coffee, Sandwich
    QuantityThe quantity of the item purchased. May contain missing or invalid values.1, 3, UNKNOWN
    Price Per UnitThe price of a single unit of the item. May contain missing or invalid values.2.00, 4.00
    Total SpentThe total amount spent on the transaction. Calculated as Quantity * Price Per Unit.8.00, 12.00
    Payment MethodThe method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN").Cash, Credit Card
    LocationThe location where the transaction occurred. May contain missing or invalid values.In-store, Takeaway
    Transaction DateThe date of the transaction. May contain missing or incorrect values.2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    ItemPrice($)
    Coffee2
    Tea1.5
    Sandwich4
    Salad5
    Cake3
    Cookie1
    Smoothie4
    Juice3

    Use Cases

    This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps: 1. Handle Missing Values: - Fill missing numeric values with the median or mean. - Replace missing categorical values with the mode or "Unknown."

    1. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    2. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    3. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  13. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip(276607 bytes)Available download formats
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .

    Data Cleaning

    We are going to: 1. Treat the Nulls 2. Treat the duplicates 3. Populate missing rows 4. Drop unneeded columns 5. Split columns Extra steps and more explanation on the process will be explained through the code comments

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLS. 
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate NULL rows in directors 
    using their record with movie_cast 
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL ;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --To confirm if there are still directors linked to country that refuse to update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix_clean
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  14. q

    Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Explore at:
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  15. [70,000+ Products] E-Commerce Data CLEAN

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oleksii Martusiuk (2024). [70,000+ Products] E-Commerce Data CLEAN [Dataset]. https://www.kaggle.com/datasets/oleksiimartusiuk/80000-products-e-commerce-data-clean
    Explore at:
    zip(3680956 bytes)Available download formats
    Dataset updated
    Jun 29, 2024
    Authors
    Oleksii Martusiuk
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Transformed E-commerce Product Data: Ready for Analysis!

    Sharpen your data analysis skills with this cleaned and enhanced e-commerce product dataset. This dataset was originally spread across 20+ CSV files with over 80,000 products, this dataset now offers a streamlined single CSV for you to focus on analysis.

    What's Inside:

    Clean and Consistent Data: Missing values, formatting inconsistencies, and errors have been addressed, ensuring a foundation for reliable analysis.

    Organized Product Information: Explore a variety of product categories, including:

    • Apparel & Accessories
    • Electronics
    • Home & Kitchen
    • Men's Clothing
    • Women's Clothing
    • Swimsuits and Sleepwear
    • And more!

    Detailed Product Records:

    • Product Title
    • Category
    • Price
    • Discount Information
    • Product subcategory
    • Product rank
    • Quantity sold
    • Color count
    • Image source URL

    Dive right into creating new features from existing data, such as:

    • Extracting keywords from titles and descriptions
    • Deriving price categories
    • Calculating average discounts
  16. Dirty E-Commerce Data [80,000+ Products]

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oleksii Martusiuk (2024). Dirty E-Commerce Data [80,000+ Products] [Dataset]. https://www.kaggle.com/datasets/oleksiimartusiuk/e-commerce-data-shein
    Explore at:
    zip(3611849 bytes)Available download formats
    Dataset updated
    Jun 29, 2024
    Authors
    Oleksii Martusiuk
    License

    Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    E-commerce Product Dataset - Clean and Enhance Your Data Analysis Skills or Check Out The Cleaned File Below!

    This dataset offers a comprehensive collection of product information from an e-commerce store, spread across 20+ CSV files and encompassing over 80,000+ products. It presents a valuable opportunity to test and refine your data cleaning and wrangling skills.

    What's Included:

    A variety of product categories, including:

    • Apparel & Accessories
    • Electronics
    • Home & Kitchen
    • Beauty & Health
    • Toys & Games
    • Men's Clothes
    • Women's Clothes
    • Pet Supplies
    • Sports & Outdoor
    • (and more!)

    Each product record contains details such as:

    • Product Title
    • Category
    • Price
    • Discount information
    • (and other attributes)

    Challenges and Opportunities:

    Data Cleaning: The dataset is "dirty," containing missing values, inconsistencies in formatting, and potential errors. This provides a chance to practice your data-cleaning techniques such as:

    • Identifying and handling missing values
    • Standardizing data formats
    • Correcting inconsistencies
    • Dealing with duplicate entries

    Feature Engineering: After cleaning, you can explore opportunities to create new features from the existing data, such as: - Extracting keywords from product titles and descriptions - Deriving price categories - Calculating average discounts

    Who can benefit from this dataset?

    • Data analysts and scientists looking to practice data cleaning and wrangling skills on a real-world e-commerce dataset
    • Machine learning enthusiasts interested in building models for product recommendation, price prediction, or other e-commerce tasks
    • Anyone interested in exploring and understanding the structure and organization of product data in an e-commerce setting
    • By contributing to this dataset and sharing your cleaning and feature engineering approaches, you can help create a valuable resource for the Kaggle community!
  17. R

    Autonomous Data Cleaning with AI Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Research Intelo (2025). Autonomous Data Cleaning with AI Market Research Report 2033 [Dataset]. https://researchintelo.com/report/autonomous-data-cleaning-with-ai-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policyhttps://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    Autonomous Data Cleaning with AI Market Outlook



    According to our latest research, the Global Autonomous Data Cleaning with AI market size was valued at $1.4 billion in 2024 and is projected to reach $8.2 billion by 2033, expanding at a robust CAGR of 21.8% during 2024–2033. This remarkable growth is primarily fueled by the exponential increase in enterprise data volumes and the urgent need for high-quality, reliable data to drive advanced analytics, machine learning, and business intelligence initiatives. The autonomous data cleaning with AI market is being propelled by the integration of artificial intelligence and machine learning algorithms that automate the tedious and error-prone processes of data cleansing, normalization, and validation, enabling organizations to unlock actionable insights with greater speed and accuracy. As businesses across diverse sectors increasingly recognize the strategic value of data-driven decision-making, the demand for autonomous data cleaning solutions is expected to surge, transforming how organizations manage and leverage their data assets globally.



    Regional Outlook



    North America currently holds the largest share of the autonomous data cleaning with AI market, accounting for over 38% of the global market value in 2024. This dominance is underpinned by the region’s mature technological infrastructure, high adoption rates of AI-driven analytics, and the presence of leading technology vendors and innovative startups. The United States, in particular, leads in enterprise digital transformation, with sectors such as BFSI, healthcare, and IT & telecommunications aggressively investing in automated data quality solutions. Stringent regulatory requirements around data governance, such as HIPAA and GDPR, have further incentivized organizations to deploy advanced data cleaning platforms to ensure compliance and mitigate risks. The region’s robust ecosystem of cloud service providers and AI research hubs also accelerates the deployment and integration of autonomous data cleaning tools, positioning North America at the forefront of market innovation and growth.



    Asia Pacific is emerging as the fastest-growing region in the autonomous data cleaning with AI market, projected to register a remarkable CAGR of 25.6% through 2033. The region’s rapid digitalization, expanding e-commerce sector, and government-led initiatives to promote smart manufacturing and digital health are driving significant investments in AI-powered data management solutions. Countries such as China, India, Japan, and South Korea are witnessing a surge in data generation from mobile applications, IoT devices, and cloud platforms, necessitating robust autonomous data cleaning capabilities to ensure data integrity and business agility. Local enterprises are increasingly partnering with global technology providers and investing in in-house AI talent to accelerate adoption. Furthermore, favorable policy reforms and incentives for AI research and development are catalyzing the advancement and deployment of autonomous data cleaning technologies across diverse industry verticals.



    In contrast, emerging economies in Latin America, the Middle East, and Africa are experiencing a gradual uptake of autonomous data cleaning with AI, shaped by unique challenges such as limited digital infrastructure, skills gaps, and budget constraints. While the potential for market expansion is substantial, particularly in sectors like banking, government, and telecommunications, adoption is often hindered by concerns over data privacy, lack of standardized frameworks, and the high upfront costs of AI integration. However, localized demand for real-time analytics, coupled with international investments in digital transformation and capacity building, is gradually fostering an environment conducive to the adoption of autonomous data cleaning solutions. Policy initiatives aimed at enhancing digital literacy and supporting startup ecosystems are also expected to play a pivotal role in bridging the adoption gap and unleashing new growth opportunities in these regions.



    Report Scope




    Attributes Details
    Report Title Autonomous Dat

  18. G

    Autonomous Data Cleaning with AI Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). Autonomous Data Cleaning with AI Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/autonomous-data-cleaning-with-ai-market
    Explore at:
    pdf, csv, pptxAvailable download formats
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Autonomous Data Cleaning with AI Market Outlook



    According to our latest research, the global Autonomous Data Cleaning with AI market size reached USD 1.68 billion in 2024, with a robust year-on-year growth driven by the surge in enterprise data volumes and the mounting demand for high-quality, actionable insights. The market is projected to expand at a CAGR of 24.2% from 2025 to 2033, which will take the overall market value to approximately USD 13.1 billion by 2033. This rapid growth is fueled by the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, aiming to automate and optimize the data cleaning process for improved operational efficiency and decision-making.




    The primary growth driver for the Autonomous Data Cleaning with AI market is the exponential increase in data generation across various industries such as BFSI, healthcare, retail, and manufacturing. Organizations are grappling with massive amounts of structured and unstructured data, much of which is riddled with inconsistencies, duplicates, and inaccuracies. Manual data cleaning is both time-consuming and error-prone, leading businesses to seek automated AI-driven solutions that can intelligently detect, correct, and prevent data quality issues. The integration of AI not only accelerates the data cleaning process but also ensures higher accuracy, enabling organizations to leverage clean, reliable data for analytics, compliance, and digital transformation initiatives. This, in turn, translates into enhanced business agility and competitive advantage.




    Another significant factor propelling the market is the increasing regulatory scrutiny and compliance requirements in sectors such as banking, healthcare, and government. Regulations such as GDPR, HIPAA, and others mandate strict data governance and quality standards. Autonomous Data Cleaning with AI solutions help organizations maintain compliance by ensuring data integrity, traceability, and auditability. Additionally, the evolution of cloud computing and the proliferation of big data analytics platforms have made it easier for organizations of all sizes to deploy and scale AI-powered data cleaning tools. These advancements are making autonomous data cleaning more accessible, cost-effective, and scalable, further driving market adoption.




    The growing emphasis on digital transformation and real-time decision-making is also a crucial growth factor for the Autonomous Data Cleaning with AI market. As enterprises increasingly rely on analytics, machine learning, and artificial intelligence for business insights, the quality of input data becomes paramount. Automated, AI-driven data cleaning solutions enable organizations to process, cleanse, and prepare data in real-time, ensuring that downstream analytics and AI models are fed with high-quality inputs. This not only improves the accuracy of business predictions but also reduces the time-to-insight, helping organizations stay ahead in highly competitive markets.




    From a regional perspective, North America currently dominates the Autonomous Data Cleaning with AI market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology companies, early adopters of AI, and a mature regulatory environment are key factors contributing to North America’s leadership. However, Asia Pacific is expected to witness the highest CAGR over the forecast period, driven by rapid digitalization, expanding IT infrastructure, and increasing investments in AI and data analytics, particularly in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also gradually emerging as promising markets, supported by growing awareness and adoption of AI-driven data management solutions.





    Component Analysis



    The Autonomous Data Cleaning with AI market is segmented by component into Software and Services. The software segment currently holds the largest market share, driven

  19. I

    Global Data Cleansing Tools Market Research and Development Focus 2025-2032

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stats N Data (2025). Global Data Cleansing Tools Market Research and Development Focus 2025-2032 [Dataset]. https://www.statsndata.org/report/data-cleansing-tools-market-339171
    Explore at:
    excel, pdfAvailable download formats
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-orderhttps://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Data Cleansing Tools market is rapidly evolving as businesses increasingly recognize the importance of data quality in driving decision-making and strategic initiatives. Data cleansing, also known as data scrubbing or data cleaning, involves the process of identifying and correcting errors and inconsistencies in

  20. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip(226740 bytes)Available download formats
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column NameDescriptionExample Values
    Transaction IDA unique identifier for each transaction. Always present and unique.TXN_1234567
    Customer IDA unique identifier for each customer. 25 unique customers.CUST_01
    CategoryThe category of the purchased item.Food, Furniture
    ItemThe name of the purchased item. May contain missing values or None.Item_1_FOOD, None
    Price Per UnitThe static price of a single unit of the item. May contain missing or None values.4.00, None
    QuantityThe quantity of the item purchased. May contain missing or None values.1, None
    Total SpentThe total amount spent on the transaction. Calculated as Quantity * Price Per Unit.8.00, None
    Payment MethodThe method of payment used. May contain missing or invalid values.Cash, Credit Card
    LocationThe location where the transaction occurred. May contain missing or invalid values.In-store, Online
    Transaction DateThe date of the transaction. Always present and valid.2023-01-15
    Discount AppliedIndicates if a discount was applied to the transaction. May contain missing values.True, False, None

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item CodeItem NamePrice
    Item_1_EHEBlender5.0
    Item_2_EHEMicrowave6.5
    Item_3_EHEToaster8.0
    Item_4_EHEVacuum Cleaner9.5
    Item_5_EHEAir Purifier11.0
    Item_6_EHEElectric Kettle12.5
    Item_7_EHERice Cooker14.0
    Item_8_EHEIron15.5
    Item_9_EHECeiling Fan17.0
    Item_10_EHETable Fan18.5
    Item_11_EHEHair Dryer20.0
    Item_12_EHEHeater21.5
    Item_13_EHEHumidifier23.0
    Item_14_EHEDehumidifier24.5
    Item_15_EHECoffee Maker26.0
    Item_16_EHEPortable AC27.5
    Item_17_EHEElectric Stove29.0
    Item_18_EHEPressure Cooker30.5
    Item_19_EHEInduction Cooktop32.0
    Item_20_EHEWater Dispenser33.5
    Item_21_EHEHand Blender35.0
    Item_22_EHEMixer Grinder36.5
    Item_23_EHESandwich Maker38.0
    Item_24_EHEAir Fryer39.5
    Item_25_EHEJuicer41.0

    Furniture

    Item CodeItem NamePrice
    Item_1_FUROffice Chair5.0
    Item_2_FURSofa6.5
    Item_3_FURCoffee Table8.0
    Item_4_FURDining Table9.5
    Item_5_FURBookshelf11.0
    Item_6_FURBed F...
Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Mr.Machine (2022). Data Cleaning - Feature Imputation [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-cleaning-feature-imputation
Organization logo

Data Cleaning - Feature Imputation

Handling the Missing Numerical/Categorical Value

Explore at:
zip(116097 bytes)Available download formats
Dataset updated
Aug 13, 2022
Authors
Mr.Machine
Description

Data Cleaning or Data cleansing is to clean the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. In general, the missing values are found due to collection error or data is corrupted.

Here some info in details :Feature Engineering - Handling Missing Value

Wine_Quality.csv dataset have the numerical missing data, and students_Performance.mv.csv dataset have Numerical and categorical missing data's.

Search
Clear search
Close search
Google apps
Main menu