Facebook
TwitterThis dataset was created by Mohanad Hazem Qabil
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Facebook
Twitterhttps://www.marketresearchintellect.com/privacy-policyhttps://www.marketresearchintellect.com/privacy-policy
Market Research Intellect's Data Cleansing Software Market Report highlights a valuation of USD 2.5 billion in 2024 and anticipates growth to USD 5.1 billion by 2033, with a CAGR of 9.2% from 2026-2033.Explore insights on demand dynamics, innovation pipelines, and competitive landscapes.
Facebook
TwitterThis dataset was created by AbdElRahman16
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Materials from workshop conducted for Monroe Library faculty as part of TLT/Faculty Development/Digital Scholarship on 2018-04-05. Objectives:Clean dataAnalyze data using pivot tablesVisualize dataDesign accessible instruction for working with dataAssociated Research Guide at http://researchguides.loyno.edu/data_workshopData sets are from the following:
BaroqueArt Dataset by CulturePlex Lab is licensed under CC0 What's on the Menu? Menus by New York Public Library is licensed under CC0 Dog movie stars and dog breed popularity by Ghirlanda S, Acerbi A, Herzog H is licensed under CC BY 4.0 NOPD Misconduct Complaints, 2016-2018 by City of New Orleans Open Data is licensed under CC0 U.S. Consumer Product Safety Commission Recall Violations by CU.S. Consumer Product Safety Commission, Violations is licensed under CC0 NCHS - Leading Causes of Death: United States by Data.gov is licensed under CC0 Bob Ross Elements by Episode by Walt Hickey, FiveThirtyEight, is licensed under CC BY 4.0 Pacific Walrus Coastal Haulout 1852-2016 by U.S. Geological Survey, Alaska Science Center is licensed under CC0 Australia Registered Animals by Sunshine Coast Council is licensed under CC0
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. This algorithm is broadly applicable to data cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dissertation_demo.zip contains the base code and demonstration purpose for the dissertation: A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder for demonstrating provenance queries or tools. The Airbnb dataset for demonstration and simulation is not included in this demo but is available to access directly from the reference website. Any updates on demonstration and examples can be found online at: https://github.com/nikolausn/dissertation_demo
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Restaurant Menu DatasetWith approximately 45,000 menus dating from the 1840s to the present, The New York Public Library’s restaurant menu collection is one of the largest in the world. The menu data has been transcribed, dish by dish, into this dataset. For more information, please see http://menus.nypl.org/about.This dataset is not clean and contains many missing values, making it perfect to practice data cleaning tools and techniques.Dataset Variables:id: identifier for menuname: sponsor: who sponsored the meal (organizations, people, name of restaurant)event: categoryvenue: type of place (commercial, social, professional)place: where the meal took place (often a geographic location)physical_description: dimension and material description of the menuoccasion: occasion of the meal (holidays, anniversaries, daily)notes: notes by librarians about the original materialcall_number: call number of the menukeywords: language: date: date of the menulocation: organization or business who produced the menulocation_typecurrency: system of money the menu uses (dollars, etc)currency_symbol: symbol for the currency ($, etc)status: completeness of the menu transcription (transcribed, under review, etc)page_count: how many pages the menu hasdish_count: how many dishes the menu has
Facebook
Twitterhttps://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Discover the booming market for data cleaning tools! Our comprehensive analysis reveals a $10 billion+ market in 2025, driven by AI, cloud adoption, and the critical need for high-quality data. Explore key trends, leading companies (Dundas BI, IBM, Sisense), and future growth projections to 2033.
Facebook
TwitterAccess and clean an open source herbarium dataset using Excel or RStudio.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we work on repairing three datasets:
country_protocol_code, conduct the same clinical trials which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants such as inclusion.code. Samples with the same code represent the same product but are extracted from a differentb source. The allergens are indicated by (‘2’) if present, or (‘1’) if there are traces of it, and (‘0’) if it is absent in a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients. N.B: Each '.zip' file contains a set of 5 '.csv' files which are part of the afro-mentioned datasets:
Facebook
TwitterData Cleaning or Data cleansing is to clean the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. In general, the missing values are found due to collection error or data is corrupted.
Here some info in details :Feature Engineering - Handling Missing Value
Wine_Quality.csv dataset have the numerical missing data, and students_Performance.mv.csv dataset have Numerical and categorical missing data's.
Facebook
TwitterThis resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the information about Atlantic hurricanes of Categories 1,2,3 and 5 from 1920 to 2020. This very messy dataset is designed to improve your data cleaning skills
Your task:
Facebook
TwitterU.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Petroleum production data are usually stored in a format that makes it easy to determine the year and month production started, if there are any breaks, and when production ends. However, in some cases, you may want to compare production runs where the start of production for all wells starts at month one regardless of the year the wells started producing. This report describes the JAVA program the U.S. Geological Survey developed to examine water-to-oil and water-to-gas ratios in the form of month one, month two, and so on with the objective of estimating quantities of water and proppant used in low-permeability petroleum production. The text covers the data used by the program, the challenges with production data, the program logic for checking the quality of the production data, and the program logic for checking the completeness of the data.
Facebook
Twitterhttps://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Discover the booming Data Preparation Tools market! Learn about its 18.5% CAGR, key players (Microsoft, Tableau, IBM), and regional growth trends from our comprehensive analysis. Explore market segments, drivers, and restraints shaping this crucial sector for businesses of all sizes.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Facebook
Twitterhttps://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
Global Data Cleaning Tools market size 2025 was XX Million. Data Cleaning Tools Industry compound annual growth rate (CAGR) will be XX% from 2025 till 2033.
Facebook
Twitterhttps://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The global Data Preparation Platform market is poised for substantial growth, estimated to reach $15,600 million by the study's end in 2033, up from $6,000 million in the base year of 2025. This trajectory is fueled by a Compound Annual Growth Rate (CAGR) of approximately 12.5% over the forecast period. The proliferation of big data and the increasing need for clean, usable data across all business functions are primary drivers. Organizations are recognizing that effective data preparation is foundational to accurate analytics, informed decision-making, and successful AI/ML initiatives. This has led to a surge in demand for platforms that can automate and streamline the complex, time-consuming process of data cleansing, transformation, and enrichment. The market's expansion is further propelled by the growing adoption of cloud-based solutions, offering scalability, flexibility, and cost-efficiency, particularly for Small & Medium Enterprises (SMEs). Key trends shaping the Data Preparation Platform market include the integration of AI and machine learning for automated data profiling and anomaly detection, enhanced collaboration features to facilitate teamwork among data professionals, and a growing focus on data governance and compliance. While the market exhibits robust growth, certain restraints may temper its pace. These include the complexity of integrating data preparation tools with existing IT infrastructures, the shortage of skilled data professionals capable of leveraging advanced platform features, and concerns around data security and privacy. Despite these challenges, the market is expected to witness continuous innovation and strategic partnerships among leading companies like Microsoft, Tableau, and Alteryx, aiming to provide more comprehensive and user-friendly solutions to meet the evolving demands of a data-driven world. Here's a comprehensive report description on Data Preparation Platforms, incorporating the requested information, values, and structure:
Facebook
TwitterThis dataset was created by Mohanad Hazem Qabil