Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is an expanded version of the popular "Sample - Superstore Sales" dataset, commonly used for introductory data analysis and visualization. It contains detailed transactional data for a US-based retail company, covering orders, products, and customer information.
This version is specifically designed for practicing Data Quality (DQ) and Data Wrangling skills, featuring a unique set of real-world "dirty data" problems (similar to those encountered in tools such as SPSS Modeler, Tableau Prep, or Alteryx) that must be cleaned before any analysis or machine learning can begin.
This dataset combines the original Superstore data with 15,000 plausibly generated synthetic records, totaling 25,000 rows of transactional data. It includes 21 columns detailing:
- Order Information: Order ID, Order Date, Ship Date, Ship Mode.
- Customer Information: Customer ID, Customer Name, Segment.
- Geographic Information: Country, City, State, Postal Code, Region.
- Product Information: Product ID, Category, Sub-Category, Product Name.
- Financial Metrics: Sales, Quantity, Discount, and Profit.
This dataset is intentionally corrupted to provide a robust practice environment for data cleaning. Challenges include the following (see the pandas sketch after this list):
Missing/Inconsistent Values: Deliberate gaps in Profit and Discount, and multiple inconsistent entries (-- or blank) in the Region column.
Data Type Mismatches: Order Date and Ship Date are stored as text strings, and the Profit column is polluted with comma-formatted strings (e.g., "1,234.56"), forcing the entire column to be read as an object (string) type.
Categorical Inconsistencies: The Category field contains variations and typos like "Tech", "technologies", "Furni", and "OfficeSupply" that require standardization.
Outliers and Invalid Data: Extreme outliers have been added to the Sales and Profit fields, alongside a subset of transactions with an invalid Sales value of 0.
Duplicate Records: Over 200 rows are duplicated (with slight financial variations) to test your deduplication logic.
This dataset is ideal for:
Data Wrangling/Cleaning (Primary Focus): Fix all the intentional data quality issues before proceeding.
Exploratory Data Analysis (EDA): Analyze sales distribution by region, segment, and category.
Regression: Predict the Profit based on Sales, Discount, and product features.
Classification: Build an RFM model (Recency, Frequency, Monetary) and create a target variable (HighValueCustomer = 1 if a customer's total sales are > $1,000) to be predicted with logistic regression or decision trees (see the sketch after this list).
Time Series Analysis: Aggregate sales by month/year to perform forecasting.
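For the classification task above, a rough sketch of building the RFM table and the HighValueCustomer target, continuing from the cleaned DataFrame in the earlier sketch, might look like this:

```python
# Continuing from the cleaned DataFrame `df` above; column names follow the schema listed earlier.
snapshot = df["Order Date"].max()

rfm = df.groupby("Customer ID").agg(
    Recency=("Order Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Order ID", "nunique"),
    Monetary=("Sales", "sum"),
)

# Binary target: customers whose total sales exceed $1,000.
rfm["HighValueCustomer"] = (rfm["Monetary"] > 1000).astype(int)
```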
This dataset is an expanded and corrupted derivative of the original Sample Superstore dataset, credited to Tableau and widely shared for educational purposes. All synthetic records were generated to follow the plausible distribution of the original data.
The Data Quality Tool Market was valued at USD 2.09 Billion in 2024 and is projected to reach USD 5.93 Billion by 2033, with an expected CAGR of 16.07% during the forecast period.
Recent developments include:
January 2022: IBM and Francisco Partners announced the execution of a definitive agreement under which Francisco Partners will purchase healthcare data and analytics assets from IBM that are currently part of the IBM Watson Health business.
October 2021: Informatica LLC announced a major cloud storage agreement with Google Cloud. This collaboration allows Informatica clients to transition to Google Cloud up to twelve times faster. Informatica's transactable solutions on the Google Cloud Marketplace now incorporate Master Data Management and Data Governance capabilities.
The Harvard Business Review estimates that completing a unit of work with incorrect data costs ten times more, and finding correct, effective data quality tools has never been difficult. A reliable system can be implemented by selecting and deploying intelligent, workflow-driven, self-service data quality tools with built-in quality controls.
Key drivers for this market are:
Increasing demand for data quality: Businesses are increasingly recognizing the importance of data quality for decision-making and operational efficiency. This is driving demand for data quality tools that can automate and streamline the data cleansing and validation process.
Growing adoption of cloud-based data quality tools: Cloud-based data quality tools offer several advantages over on-premises solutions, including scalability, flexibility, and cost-effectiveness. This is driving the adoption of cloud-based data quality tools across all industries.
Emergence of AI-powered data quality tools: AI-powered data quality tools can automate many of the tasks involved in data cleansing and validation, making it easier and faster to achieve high-quality data. This is driving the adoption of AI-powered data quality tools across all industries.
Potential restraints include:
Data privacy and security concerns: Data privacy and security regulations are becoming increasingly stringent, which can make it difficult for businesses to implement data quality initiatives.
Lack of skilled professionals: There is a shortage of skilled data quality professionals who can implement and manage data quality tools. This can make it difficult for businesses to achieve high-quality data.
Cost of data quality tools: Data quality tools can be expensive, especially for large businesses with complex data environments. This can make it difficult for businesses to justify the investment in data quality tools.
Notable trends are:
Adoption of AI-powered data quality tools: AI-powered data quality tools are becoming increasingly popular, as they can automate many of the tasks involved in data cleansing and validation. This makes it easier and faster to achieve high-quality data.
Growth of cloud-based data quality tools: Cloud-based data quality tools are becoming increasingly popular, as they offer several advantages over on-premises solutions, including scalability, flexibility, and cost-effectiveness.
Focus on data privacy and security: Data quality tools are increasingly being used to help businesses comply with data privacy and security regulations. This is driving the development of new data quality tools that can help businesses protect their data.
Study-specific data quality testing is an essential part of minimizing analytic errors, particularly for studies making secondary use of clinical data. We applied a systematic and reproducible approach for study-specific data quality testing to the analysis plan for PRESERVE, a 15-site, EHR-based observational study of chronic kidney disease in children. This approach integrated widely adopted data quality concepts with healthcare-specific evaluation methods. We implemented two rounds of data quality assessment. The first produced high-level evaluation using aggregate results from a distributed query, focused on cohort identification and main analytic requirements. The second focused on extended testing of row-level data centralized for analysis. We systematized reporting and cataloguing of data quality issues, providing institutional teams with prioritized issues for resolution. We tracked improvements and documented anomalous data for consideration during analyses. The checks we developed identified 115 and 157 data quality issues in the two rounds, involving completeness, data model conformance, cross-variable concordance, consistency, and plausibility, extending traditional data quality approaches to address more complex stratification and temporal patterns. Resolution efforts focused on higher priority issues, given finite study resources. In many cases, institutional teams were able to correct data extraction errors or obtain additional data, avoiding exclusion of 2 institutions entirely and resolving 123 other gaps. Other results identified complexities in measures of kidney function, bearing on the study’s outcome definition. Where limitations such as these are intrinsic to clinical data, the study team must account for them in conducting analyses. This study rigorously evaluated fitness of data for intended use. The framework is reusable and built on a strong theoretical underpinning. Significant data quality issues that would have otherwise delayed analyses or made data unusable were addressed. This study highlights the need for teams combining subject-matter and informatics expertise to address data quality when working with real world data.
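The study's own tooling is not reproduced here, but a toy pandas sketch illustrates what row-level checks in a few of these categories (completeness, conformance, plausibility, temporal consistency) can look like; the file and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical extract of row-level serum creatinine results; names are illustrative only.
labs = pd.read_csv("serum_creatinine.csv", parse_dates=["measurement_date", "birth_date"])

issues = {}

# Completeness: fraction of results missing a value or a unit.
issues["missing_value"] = labs["result_value"].isna().mean()
issues["missing_unit"] = labs["result_unit"].isna().mean()

# Conformance: units outside the expected vocabulary.
issues["bad_unit"] = (~labs["result_unit"].isin(["mg/dL", "umol/L"])).mean()

# Plausibility: creatinine values outside a physiologically plausible range (mg/dL).
mgdl = labs.loc[labs["result_unit"] == "mg/dL", "result_value"]
issues["implausible_value"] = ((mgdl < 0.1) | (mgdl > 20)).mean()

# Temporal consistency: measurements dated before the patient's birth date.
issues["before_birth"] = (labs["measurement_date"] < labs["birth_date"]).mean()

print(pd.Series(issues).sort_values(ascending=False))
```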
According to our latest research, the global PMU Data Quality Monitoring market size reached USD 565 million in 2024, and is expected to grow at a robust CAGR of 8.7% from 2025 to 2033, reaching a forecasted market size of USD 1.23 billion by 2033. The primary growth driver for this market is the increasing integration of grid modernization initiatives and the rising adoption of Phasor Measurement Units (PMUs) in power utilities worldwide, which necessitate advanced data quality monitoring solutions to ensure grid stability and reliability.
One of the most significant factors fueling the expansion of the PMU Data Quality Monitoring market is the global push towards smart grid infrastructure. As utilities and grid operators transition from traditional to digital grids, the volume and complexity of data generated by PMUs have surged. This data is critical for real-time monitoring, fault detection, and predictive maintenance. However, the effectiveness of these applications depends heavily on the quality, accuracy, and timeliness of the data collected. As a result, there is a growing demand for sophisticated PMU data quality monitoring solutions that can detect anomalies, correct errors, and ensure the reliability of grid operations. The increasing prevalence of renewable energy sources, which introduce variability and unpredictability into the grid, further amplifies the need for robust data quality monitoring systems to maintain grid stability and efficiency.
Another key growth driver is the stringent regulatory environment and compliance requirements imposed by governments and industry bodies. Regulatory authorities across North America, Europe, and Asia Pacific are mandating the adoption of advanced monitoring and reporting mechanisms to enhance grid resilience and prevent large-scale outages. This has prompted utilities and grid operators to invest heavily in PMU data quality monitoring tools that enable compliance with these standards while minimizing operational risks. Moreover, advancements in artificial intelligence and machine learning are revolutionizing the way data quality issues are detected and resolved, enabling real-time analytics and automated decision-making. These technological innovations are expected to further accelerate market growth by offering scalable, cost-effective, and highly accurate monitoring solutions.
The rapid digital transformation of the energy sector, coupled with the proliferation of distributed energy resources such as solar and wind, is also contributing to the rising demand for PMU data quality monitoring. As the grid becomes increasingly decentralized, the number of data points and the complexity of grid management grow exponentially. PMU data quality monitoring systems play a vital role in aggregating, validating, and analyzing this data to ensure seamless grid operations. Additionally, the growing focus on grid cybersecurity is driving the adoption of monitoring solutions that can detect and mitigate data integrity threats. These trends are expected to sustain the momentum of the PMU Data Quality Monitoring market over the forecast period.
Regionally, North America continues to dominate the PMU Data Quality Monitoring market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The high adoption rate of smart grid technologies, strong regulatory frameworks, and significant investments in grid modernization projects in the United States and Canada are the primary factors contributing to North America's leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing region, driven by rapid urbanization, increasing electricity demand, and large-scale renewable energy integration in countries like China, India, and Japan. Europe remains a key market, propelled by ambitious decarbonization goals and cross-border grid integration initiatives. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as utilities in these regions begin to modernize their grid infrastructure and enhance data management capabilities.
The PMU Data Quality Monitoring market by component is segmented into software, hardware, and services. The software segment currently holds the largest market share, owing to the growing need for advanced analytics and real-time data validation tools. As utilities and grid operators increasingly rely on P
The MRO (Maintenance, Repair, and Operations) Data Cleansing and Enrichment Service market is experiencing robust growth, driven by the increasing need for accurate and reliable data across various industries. The digital transformation sweeping sectors like manufacturing, oil and gas, and pharmaceuticals is fueling demand for streamlined data management. Businesses are realizing the significant cost savings and operational efficiencies achievable through improved data quality. Specifically, inaccurate or incomplete MRO data can lead to costly downtime, inefficient inventory management, and missed maintenance opportunities. Data cleansing and enrichment services address these challenges by identifying and correcting errors, filling in gaps, and standardizing data formats, ultimately improving decision-making and optimizing resource allocation.
The market is segmented by application (chemical, oil & gas, pharmaceutical, mining, transportation, others) and type of service (data cleansing, data enrichment). While precise market size figures are unavailable, considering a moderate CAGR of 15% and a 2025 market value in the hundreds of millions, a reasonable projection is a market size exceeding $500 million in 2025, growing to potentially over $1 billion by 2033. This projection reflects the increasing adoption of digital technologies and the growing awareness of the value proposition of high-quality MRO data.
The competitive landscape is fragmented, with numerous companies offering specialized services. Key players include both large established firms and smaller niche providers. The market's geographical distribution is diverse, with North America and Europe currently holding significant market shares, reflecting higher levels of digitalization and data management maturity in these regions. However, Asia-Pacific is emerging as a high-growth region due to rapid industrialization and increasing technological adoption. The long-term growth trajectory of the MRO Data Cleansing and Enrichment Service market will be influenced by factors such as advancements in data analytics, the expanding adoption of cloud-based solutions, and the continued focus on optimizing operational efficiency across industries. Challenges remain, however, including data security concerns and the need for skilled professionals to manage complex data cleansing and enrichment projects.
The global Data Observability Software market is poised for substantial growth, projected to reach approximately $8,500 million by 2025, with an anticipated Compound Annual Growth Rate (CAGR) of around 22% through 2033. This robust expansion is fueled by the escalating complexity of data landscapes and the critical need for organizations to proactively monitor, troubleshoot, and ensure the reliability of their data pipelines. The increasing volume, velocity, and variety of data generated across industries necessitate sophisticated solutions that provide end-to-end visibility, from data ingestion to consumption. Key drivers include the growing adoption of cloud-native architectures, the proliferation of big data technologies, and the rising demand for data quality and compliance. As businesses increasingly rely on data-driven decision-making, the imperative to prevent data downtime, identify anomalies, and maintain data integrity becomes paramount, further accelerating market penetration.
The market is segmented by application, with Large Enterprises constituting a significant share due to their extensive and complex data infrastructures, demanding advanced observability capabilities. Small and Medium-sized Enterprises (SMEs) are also showing increasing adoption, driven by more accessible cloud-based solutions and a growing awareness of data's strategic importance. On-premise deployments remain relevant for organizations with stringent data residency and security requirements, while cloud-based solutions are witnessing rapid growth due to their scalability, flexibility, and cost-effectiveness.
Prominent market trends include the integration of AI and machine learning for automated anomaly detection and root cause analysis, the development of unified platforms offering comprehensive data lineage and metadata management, and a focus on real-time monitoring and proactive alerting. Challenges such as the high cost of implementation and the need for skilled personnel to manage these sophisticated tools, alongside the potential for vendor lock-in, are being addressed through continuous innovation and strategic partnerships within the competitive vendor landscape.
This report provides an in-depth analysis of the global Data Observability Software market, forecasting its trajectory from 2019 to 2033, with a base year of 2025. The market is poised for significant expansion, driven by the escalating complexity of data ecosystems and the critical need for data reliability and trust.
According to our latest research, the global Autonomous Data Cleaning with AI market size reached USD 1.68 billion in 2024, with a robust year-on-year growth driven by the surge in enterprise data volumes and the mounting demand for high-quality, actionable insights. The market is projected to expand at a CAGR of 24.2% from 2025 to 2033, which will take the overall market value to approximately USD 13.1 billion by 2033. This rapid growth is fueled by the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, aiming to automate and optimize the data cleaning process for improved operational efficiency and decision-making.
The primary growth driver for the Autonomous Data Cleaning with AI market is the exponential increase in data generation across various industries such as BFSI, healthcare, retail, and manufacturing. Organizations are grappling with massive amounts of structured and unstructured data, much of which is riddled with inconsistencies, duplicates, and inaccuracies. Manual data cleaning is both time-consuming and error-prone, leading businesses to seek automated AI-driven solutions that can intelligently detect, correct, and prevent data quality issues. The integration of AI not only accelerates the data cleaning process but also ensures higher accuracy, enabling organizations to leverage clean, reliable data for analytics, compliance, and digital transformation initiatives. This, in turn, translates into enhanced business agility and competitive advantage.
Another significant factor propelling the market is the increasing regulatory scrutiny and compliance requirements in sectors such as banking, healthcare, and government. Regulations such as GDPR, HIPAA, and others mandate strict data governance and quality standards. Autonomous Data Cleaning with AI solutions help organizations maintain compliance by ensuring data integrity, traceability, and auditability. Additionally, the evolution of cloud computing and the proliferation of big data analytics platforms have made it easier for organizations of all sizes to deploy and scale AI-powered data cleaning tools. These advancements are making autonomous data cleaning more accessible, cost-effective, and scalable, further driving market adoption.
The growing emphasis on digital transformation and real-time decision-making is also a crucial growth factor for the Autonomous Data Cleaning with AI market. As enterprises increasingly rely on analytics, machine learning, and artificial intelligence for business insights, the quality of input data becomes paramount. Automated, AI-driven data cleaning solutions enable organizations to process, cleanse, and prepare data in real-time, ensuring that downstream analytics and AI models are fed with high-quality inputs. This not only improves the accuracy of business predictions but also reduces the time-to-insight, helping organizations stay ahead in highly competitive markets.
From a regional perspective, North America currently dominates the Autonomous Data Cleaning with AI market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology companies, early adopters of AI, and a mature regulatory environment are key factors contributing to North America’s leadership. However, Asia Pacific is expected to witness the highest CAGR over the forecast period, driven by rapid digitalization, expanding IT infrastructure, and increasing investments in AI and data analytics, particularly in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also gradually emerging as promising markets, supported by growing awareness and adoption of AI-driven data management solutions.
The Autonomous Data Cleaning with AI market is segmented by component into Software and Services. The software segment currently holds the largest market share, driven
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data release associated with "The curious case of GW200129: interplay between spin-precession inference and data-quality issues". We release frame files and result files for selected parameter estimation runs in the paper.
Each directory contains .json bilby result files for the PE run described by that directory. The specific channel names and frame files that we used in the PE runs are listed below. The frames for the PE runs with BayesWave glitch subtraction are included in this release and are in BW_frames/frame_{glitch label}.
L1 data with glitch subtraction:
Channel: L1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_P1800169_v4
Link: https://zenodo.org/record/5546680/files/L-L1_HOFT_CLEAN_SUB60HZ_C01_P1800169_v4-1264314068-4096.gwf
L1 data, no mitigation:
Channel: L1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01 (L1:GWOSC-16KHZ_R1_STRAIN)
H1 data, no mitigation:
Channel: H1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01 (H1:GWOSC-16KHZ_R1_STRAIN)
Channels for runs with BayesWave glitch-subtracted frames:
(Note: you need to read in the correct gwf file to access each channel; for example, the channel for BayesWave glitch A should be accessed after reading in the gwf file in `BW_frames/frame_A/`. A minimal read example follows the channel list below.)
BayesWave glitch A (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_30000
BayesWave glitch B (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_28395
BayesWave glitch C (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_32752
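As an illustration, a frame file and channel from the list above could be loaded with GWpy roughly as follows; the .gwf file name shown is a placeholder for the actual file shipped in `BW_frames/frame_A/`.

```python
from gwpy.timeseries import TimeSeries

# Read the BayesWave glitch-subtracted L1 strain for glitch model A from the matching frame.
# The file name is a placeholder; use the actual .gwf file included in BW_frames/frame_A/.
strain = TimeSeries.read(
    "BW_frames/frame_A/L-L1_HOFT_GLITCH_SUBTRACTED_A.gwf",
    "L1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_30000",
)
print(strain.sample_rate, strain.times[0])
```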
Reporting of Aggregate Case and Death Count data was discontinued on May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.
The surveillance case definition for COVID-19, a nationally notifiable disease, was first described in a position statement from the Council for State and Territorial Epidemiologists, which was later revised. However, there is some variation in how jurisdictions implemented these case definitions. More information on how CDC collects COVID-19 case surveillance data can be found at FAQ: COVID-19 Data and Surveillance.
Aggregate Data Collection Process
Since the beginning of the COVID-19 pandemic, data were reported from state and local health departments through a robust process with the following steps:
This process was collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provided the most up-to-date numbers on cases and deaths by report date. Throughout data collection, CDC retrospectively updated counts to correct known data quality issues.
Description
This archived public use dataset focuses on the cumulative and weekly case and death rates per 100,000 persons within various sociodemographic factors across all states and their counties. All resulting data are expressed as rates calculated as the number of cases or deaths per 100,000 persons in counties meeting various classification criteria using the US Census Bureau Population Estimates Program (2019 Vintage).
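As a small illustration of that rate calculation (the column names here are hypothetical, not the dataset's exact field names):

```python
import pandas as pd

# Illustrative only: hypothetical county counts and 2019 Vintage population estimates.
county = pd.DataFrame({
    "cumulative_cases": [1250, 430],
    "population_2019": [85000, 12000],
})
# Rate per 100,000 persons = 100,000 * cases / population.
county["cases_per_100k"] = 100_000 * county["cumulative_cases"] / county["population_2019"]
print(county)
```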
Each county within jurisdictions is classified into multiple categories for each factor. All rates in this dataset are based on classification of counties by the characteristics of their population, not individual-level factors. This applies to each of the available factors observed in this dataset. Specific factors and their corresponding categories are detailed below.
Population-level factors
Each unique population factor is detailed below. Please note that the “Classification” column describes each of the 12 factors in the dataset, including a data dict
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overflow: Sewage: Average Duration: Repair: North: Roraima data was reported at 0.050 Hour in 2022. This records a decrease from the previous number of 0.080 Hour for 2021. Overflow: Sewage: Average Duration: Repair: North: Roraima data is updated yearly, averaging 2.780 Hour from Dec 2014 (Median) to 2022, with 9 observations. The data reached an all-time high of 24.000 Hour in 2014 and a record low of 0.050 Hour in 2022. Overflow: Sewage: Average Duration: Repair: North: Roraima data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.
The MOD14A1 Version 6 data product was decommissioned on July 31, 2023. Users are encouraged to use the MOD14A1 Version 6.1 data product.
The Terra Moderate Resolution Imaging Spectroradiometer (MODIS) Thermal Anomalies and Fire Daily (MOD14A1) Version 6 data are generated every eight days at 1 kilometer (km) spatial resolution as a Level 3 product. MOD14A1 contains eight consecutive days of fire data conveniently packaged into a single file. The Science Dataset (SDS) layers include the fire mask, pixel quality indicators, maximum fire radiative power (MaxFRP), and the position of the fire pixel within the scan. Each layer consists of daily per-pixel information for each of the eight days of data acquisition.
Known Issues: Known issues are described on the MODIS Land Quality Assessment website and in Section 7.2 of the User Guide, which covers Pre-November 2000 Data Quality, Detection Confidence, Flagging of Static Sources, and the August 2020 MODIS Aqua Outage.
Improvements/Changes from Previous Versions:
- Refinements to the internal cloud mask, which sometimes flags heavy smoke as clouds.
- Fix for frequent false alarms occurring in the Amazon that are caused by small (~1 km²) clearings within forests.
- Fix to correct a bug that causes incorrect assessment of cloud and water pixels adjacent to fire pixels near the scan edge.
- Detection of small fires using dynamic thresholding.
- Processing of ocean and coastline pixels to detect fire from oil rigs. The Version 6 fire mask has the potential to detect fire over water pixels; therefore, class 3 pixel values have been changed to be classified as “non-fire water pixels”.
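As a hedged example, the FireMask SDS layer of a downloaded MOD14A1 granule could be read with pyhdf roughly as follows; the granule file name is a placeholder, and the class values follow the product user guide (classes 7–9 are fire pixels, class 3 is non-fire water).

```python
from pyhdf.SD import SD, SDC

# Placeholder granule name; substitute an actual MOD14A1 HDF file.
hdf = SD("MOD14A1.A2023001.h08v05.006.hdf", SDC.READ)

# FireMask holds one 1-km grid per day for the eight-day period.
fire_mask = hdf.select("FireMask")[:]

# Per the user guide: classes 7-9 are fire pixels (low/nominal/high confidence),
# and class 3 marks non-fire water pixels.
fire_pixels_per_day = (fire_mask >= 7).sum(axis=(1, 2))
water_pixels_per_day = (fire_mask == 3).sum(axis=(1, 2))
print(fire_pixels_per_day, water_pixels_per_day)
```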
Reporting of new Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. This dataset will receive a final update on June 1, 2023, to reconcile historical data through May 10, 2023, and will remain publicly available.
Aggregate Data Collection Process
Since the start of the COVID-19 pandemic, data have been gathered through a robust process with the following steps:
Methodology Changes
Several differences exist between the current, weekly-updated dataset and the archived version:
Confirmed and Probable Counts
In this dataset, counts by jurisdiction are not displayed by confirmed or probable status. Instead, confirmed and probable cases and deaths are included in the Total Cases and Total Deaths columns, when available. Not all jurisdictions report probable cases and deaths to CDC. Confirmed and probable case definition criteria are described here:
Council of State and Territorial Epidemiologists (ymaws.com).
Deaths
CDC reports death data on other sections of the website: CDC COVID Data Tracker: Home, CDC COVID Data Tracker: Cases, Deaths, and Testing, and NCHS Provisional Death Counts. Information presented on the COVID Data Tracker pages is based on the same source (to
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data was reported at 12.500 Hour in 2022. This records a decrease from the previous number of 17.300 Hour for 2021. Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data is updated yearly, averaging 15.960 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 134.630 Hour in 2013 and a record low of 10.710 Hour in 2015. Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.
Note: The cumulative case count for some counties (with small population) is higher than expected due to the inclusion of non-permanent residents in COVID-19 case counts.
Reporting of Aggregate Case and Death Count data was discontinued on May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.
Aggregate Data Collection Process
Since the beginning of the COVID-19 pandemic, data were reported through a robust process with the following steps:
This process was collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provided the most up-to-date numbers on cases and deaths by report date. Throughout data collection, CDC retrospectively updated counts to correct known data quality issues. CDC also worked with jurisdictions after the end of the public health emergency declaration to finalize county data.
Important note: The counts reflected during a given time period in this dataset may not match the counts reflected for the same time period in the daily archived dataset noted above. Discrepancies may exist due to differences between county and state COVID-19 case surveillance and reconciliation efforts.
The surveillance case definition for COVID-19, a nationally notifiable disease, was first described in a position statement from the Council for State and Territorial Epidemiologists, which was later revised. However, there is some variation in how jurisdictions implement these case classifications. More information on how CDC collects COVID-19 case surveillance data can be found at FAQ: COVID-19 Data and Surveillance.
Confirmed and Probable Counts
In this dataset, counts by jurisdiction are not displayed by confirmed or probable status. Instead, counts of confirmed and probable cases and deaths are included in the Total Cases and Total Deaths columns, when available. Not all jurisdictions report
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of the data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure their effects on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
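The full analysis pipeline is in the repository linked below; as a rough illustration only (not the authors' exact method), near-duplicate images leaking across train-test partitions can be flagged with perceptual hashing, assuming the imagehash package and hypothetical image directories:

```python
from pathlib import Path
from PIL import Image
import imagehash

def hashes(directory):
    """Map each image file in a directory to its perceptual hash."""
    return {p.name: imagehash.phash(Image.open(p)) for p in Path(directory).glob("*.jpg")}

# Hypothetical directory names for the train and test partitions.
train, test = hashes("train_images"), hashes("test_images")

# Flag train/test pairs whose hashes are within a small Hamming distance (possible leakage).
leaks = [(tr, te) for tr, h_tr in train.items()
         for te, h_te in test.items() if h_tr - h_te <= 4]
print(f"{len(leaks)} potential train-test duplicates")
```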
Citation
If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.
Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.
The corresponding BibTeX entry is:
@article{abhishek2024investigating,
  title   = {Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
  author  = {Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
  journal = {arXiv preprint arXiv:2401.14497},
  doi     = {10.48550/ARXIV.2401.14497},
  url     = {https://arxiv.org/abs/2401.14497},
  year    = {2024}
}
Project Website
The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.
Code
The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.
License
The metadata files (DermaMNIST-C.csv, DermaMNIST-E.csv, Fitzpatrick17k_DiagnosisMapping.xlsx, Fitzpatrick17k-C.csv) contained in this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The NPZ files associated with DermaMNIST-C (dermamnist_corrected_28.npz, dermamnist_corrected_224.npz) and DermaMNIST-E (dermamnist_extended_28.npz, dermamnist_extended_224.npz) contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.
The code hosted on GitHub is licensed under the Apache License 2.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data was reported at 13.280 Hour in 2022. This records a decrease from the previous number of 25.220 Hour for 2021. Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data is updated yearly, averaging 25.220 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 69.640 Hour in 2017 and a record low of 1.070 Hour in 2015. Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Patient health information is collected routinely in electronic health records (EHRs) and used for research purposes; however, many health conditions are known to be under-diagnosed or under-recorded in EHRs. In research, missing diagnoses result in under-ascertainment of true cases, which attenuates estimated associations between variables and results in a bias toward the null. Bayesian approaches allow the specification of prior information to the model, such as the likely rates of missingness in the data. This paper describes a Bayesian analysis approach which aimed to reduce attenuation of associations in EHR studies focussed on conditions characterized by under-diagnosis.
Methods: Study 1: We created synthetic data, produced to mimic structured EHR data where diagnoses were under-recorded. We fitted logistic regression (LR) models with and without Bayesian priors representing rates of misclassification in the data. We examined the LR parameters estimated by models with and without priors. Study 2: We used EHR data from UK primary care in a case-control design with dementia as the outcome. We fitted LR models examining risk factors for dementia, with and without generic prior information on misclassification rates. We examined LR parameters estimated by models with and without the priors, and estimated classification accuracy using the Area Under the Receiver Operating Characteristic curve.
Results: Study 1: In synthetic data, estimates of LR parameters were much closer to the true parameter values when Bayesian priors were added to the model; with no priors, parameters were substantially attenuated by under-diagnosis. Study 2: The Bayesian approach ran well on real-life clinical data from UK primary care, with the addition of prior information increasing LR parameter values in all cases. In multivariate regression models, Bayesian methods showed no improvement in classification accuracy over traditional LR.
Conclusions: The Bayesian approach showed promise but had implementation challenges in real clinical data: prior information on rates of misclassification was difficult to find. Our simple model made a number of assumptions, such as diagnoses being missing at random. Further development is needed to integrate the method into studies using real-life EHR data. Our findings nevertheless highlight the importance of developing methods to address missing diagnoses in EHR data.
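The paper's models are not reproduced here, but a minimal PyMC sketch of the general idea, a logistic regression whose likelihood includes a prior on the probability that a true case is actually recorded, might look like this (all data are synthetic and the priors are illustrative):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulate EHR-style data in which only ~60% of true cases receive a recorded diagnosis,
# so a naive logistic regression would attenuate the true effect of the risk factor x.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
p_true_sim = 1 / (1 + np.exp(-(-1.0 + 0.8 * x)))
y_true = rng.binomial(1, p_true_sim)
y_obs = y_true * rng.binomial(1, 0.6, size=n)   # under-recording of true cases

with pm.Model():
    beta0 = pm.Normal("beta0", 0.0, 2.0)
    beta1 = pm.Normal("beta1", 0.0, 2.0)
    # Prior information on the recording (sensitivity) rate, e.g. taken from the literature.
    sens = pm.Beta("sens", alpha=6, beta=4)
    p_true = pm.math.invlogit(beta0 + beta1 * x)
    # A case is observed only if it is both a true case and actually recorded.
    pm.Bernoulli("y", p=sens * p_true, observed=y_obs)
    idata = pm.sample(1000, tune=1000, target_accept=0.9, progressbar=False)

print(az.summary(idata, var_names=["beta0", "beta1", "sens"]))
```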
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overflow: Sewage: Average Duration: Repair: North: Amazonas data was reported at 7.000 Hour in 2022. This records a decrease from the previous number of 7.590 Hour for 2021. Overflow: Sewage: Average Duration: Repair: North: Amazonas data is updated yearly, averaging 7.000 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 36.350 Hour in 2014 and a record low of 0.190 Hour in 2015. Overflow: Sewage: Average Duration: Repair: North: Amazonas data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bug datasets play a vital role in advancing software engineering tasks, including bug detection, fault localization, and automated program repair. These datasets enable the development of more accurate algorithms, facilitate efficient fault identification, and drive the creation of reliable automated repair tools. However, the manual collection and curation of such data are labor-intensive and prone to inconsistency, which limits scalability and reliability. Current datasets often fail to provide detailed and accurate information, particularly regarding bug types, descriptions, and classifications, reducing their utility in diverse research and practical applications. To address these challenges, we introduce BugCatcher, a comprehensive approach for constructing large-scale, high-quality bug datasets. BugCatcher begins by enhancing PR-Issue linking mechanisms, extending data collection to 12 programming languages over a decade, and ensuring accurate linkage between pull requests and issues. It employs a two-stage filtering process, BugCurator, to refine data quality, and utilizes large language models with Zero-shot Chain-of-Thought prompting to generate precise bug types and detailed descriptions. Furthermore, BugCatcher incorporates a robust classification framework, fine-tuning models for improved categorization. The resulting dataset, BugCatcher-Data, includes 243,265 bug-fix entries with comprehensive fields such as code diffs, bug locations, detailed descriptions, and classifications, serving as a substantial resource for advancing software engineering research and practices.
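The exact prompts used by BugCatcher are not reproduced here; a rough sketch of what a zero-shot Chain-of-Thought prompt for generating a bug type and description might look like is shown below, where `complete` is a hypothetical placeholder for whichever LLM client is used.

```python
def build_prompt(code_diff: str, issue_text: str) -> str:
    """Zero-shot Chain-of-Thought style prompt for labelling one bug-fix entry."""
    return (
        "You are analysing a bug-fixing pull request.\n"
        f"Issue description:\n{issue_text}\n\n"
        f"Code diff:\n{code_diff}\n\n"
        "Let's think step by step. First explain what the defect is, then output a "
        "one-sentence bug description and a bug type "
        "(e.g. logic error, memory error, concurrency, configuration)."
    )

def complete(prompt: str) -> str:
    """Placeholder for an actual LLM client call."""
    raise NotImplementedError("wire this up to your LLM provider")

# Example usage:
# print(complete(build_prompt(diff_text, issue_text)))
```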
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data was reported at 2.520 Hour in 2022. This records a decrease from the previous number of 4.130 Hour for 2021. Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data is updated yearly, averaging 4.990 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 12.000 Hour in 2014 and a record low of 2.520 Hour in 2022. Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.