100+ datasets found
  1. Superstore Sales: The Data Quality Challenge

    • kaggle.com
    zip
    Updated Oct 25, 2025
    Cite
    Data Obsession (2025). Superstore Sales: The Data Quality Challenge [Dataset]. https://www.kaggle.com/datasets/dataobsession/superstore-sales-the-data-quality-challenge
    Explore at:
Available download formats: zip (1512911 bytes)
    Dataset updated
    Oct 25, 2025
    Authors
    Data Obsession
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Superstore Sales - The Data Quality Challenge Edition (25K Records)

    This dataset is an expanded version of the popular "Sample - Superstore Sales" dataset, commonly used for introductory data analysis and visualization. It contains detailed transactional data for a US-based retail company, covering orders, products, and customer information.

    This version is specifically designed for practicing Data Quality (DQ) and Data Wrangling skills, featuring a unique set of real-world "dirty data" problems (like those encountered in tools like SPSS Modeler, Tableau Prep, or Alteryx) that must be cleaned before any analysis or machine learning can begin.

    This dataset combines the original Superstore data with 15,000 plausibly generated synthetic records, totaling 25,000 rows of transactional data. It includes 21 columns detailing:

    • Order Information: Order ID, Order Date, Ship Date, Ship Mode.
    • Customer Information: Customer ID, Customer Name, Segment.
    • Geographic Information: Country, City, State, Postal Code, Region.
    • Product Information: Product ID, Category, Sub-Category, Product Name.
    • Financial Metrics: Sales, Quantity, Discount, and Profit.

    🚨 Introduced Data Quality Challenges (The Dirty Data)

    This dataset is intentionally corrupted to provide a robust practice environment for data cleaning. Challenges include:

    • Missing/Inconsistent Values: Deliberate gaps in Profit and Discount, and multiple inconsistent entries (-- or blank) in the Region column.

    • Data Type Mismatches: Order Date and Ship Date are stored as text strings, and the Profit column is polluted with comma-formatted strings (e.g., "1,234.56"), forcing the entire column to be read as an object (string) type.

    • Categorical Inconsistencies: The Category field contains variations and typos like "Tech", "technologies", "Furni", and "OfficeSupply" that require standardization.

    • Outliers and Invalid Data: Extreme outliers have been added to the Sales and Profit fields, alongside a subset of transactions with an invalid Sales value of 0.

    • Duplicate Records: Over 200 rows are duplicated (with slight financial variations) to test your deduplication logic.
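    The fixes above can be sketched with pandas. The column names follow the dataset description, but the toy rows and the category mapping below are illustrative assumptions, not actual records from the dataset:

```python
import pandas as pd

# Toy rows reproducing the issues described above (illustrative values,
# not actual records from the dataset).
df = pd.DataFrame({
    "Order Date": ["10/25/2025", "10/26/2025", "10/26/2025"],
    "Region": ["West", "--", ""],
    "Category": ["Tech", "OfficeSupply", "OfficeSupply"],
    "Profit": ["1,234.56", "12.30", "12.30"],
})

# Dates stored as text -> real datetimes
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%Y")

# Comma-formatted Profit strings -> numeric
df["Profit"] = pd.to_numeric(df["Profit"].str.replace(",", ""), errors="coerce")

# Inconsistent Region placeholders ("--" or blank) -> missing values
df["Region"] = df["Region"].replace({"--": None, "": None})

# Standardize Category variants (mapping guessed from the typos listed above)
category_map = {"Tech": "Technology", "technologies": "Technology",
                "Furni": "Furniture", "OfficeSupply": "Office Supplies"}
df["Category"] = df["Category"].replace(category_map)

# Drop exact duplicates; the near-duplicates with slight financial
# variations need fuzzier logic (e.g. matching on Order ID + Product ID).
df = df.drop_duplicates()
```

    Note that `drop_duplicates` only catches rows that become identical after cleaning; the deliberately perturbed duplicates require a key-based or fuzzy match.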

    ❓ Suggested Analysis and Modeling Tasks

    This dataset is ideal for:

    Data Wrangling/Cleaning (Primary Focus): Fix all the intentional data quality issues before proceeding.

    Exploratory Data Analysis (EDA): Analyze sales distribution by region, segment, and category.

    Regression: Predict the Profit based on Sales, Discount, and product features.

    Classification: Build an RFM Model (Recency, Frequency, Monetary) and create a target variable (HighValueCustomer = 1 if total sales > $1000) to be predicted by logistic regression or decision trees.

    Time Series Analysis: Aggregate sales by month/year to perform forecasting.
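    The RFM construction for the classification task might look like the following in pandas; the transactions are toy values and the $1000 threshold comes from the task description above:

```python
import pandas as pd

# Toy transactions (illustrative; the real data has Customer ID,
# Order Date, and Sales columns as described above).
orders = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2"],
    "Order Date": pd.to_datetime(["2025-01-05", "2025-03-10", "2025-02-01"]),
    "Sales": [800.0, 400.0, 150.0],
})

# Measure recency relative to the day after the last observed order
snapshot = orders["Order Date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("Customer ID").agg(
    Recency=("Order Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Order Date", "count"),
    Monetary=("Sales", "sum"),
)

# Target variable as suggested: HighValueCustomer = 1 if total sales > $1000
rfm["HighValueCustomer"] = (rfm["Monetary"] > 1000).astype(int)
```

    The resulting `rfm` table can feed directly into scikit-learn for logistic regression or a decision tree.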

    Acknowledgements

    This dataset is an expanded and corrupted derivative of the original Sample Superstore dataset, credited to Tableau and widely shared for educational purposes. All synthetic records were generated to follow the plausible distribution of the original data.

  2. Data Quality Tool Market Report

    • promarketreports.com
    doc, pdf, ppt
    Updated Jan 20, 2025
    Cite
    Pro Market Reports (2025). Data Quality Tool Market Report [Dataset]. https://www.promarketreports.com/reports/data-quality-tool-market-8996
    Explore at:
Available download formats: pdf, ppt, doc
    Dataset updated
    Jan 20, 2025
    Dataset authored and provided by
    Pro Market Reports
    License

    https://www.promarketreports.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    United States
    Variables measured
    Market Size
    Description

    The Data Quality Tool Market was valued at USD 2.09 billion in 2024 and is projected to reach USD 5.93 billion by 2033, an expected CAGR of 16.07% over the forecast period.

    Recent developments include:

    • January 2022: IBM and Francisco Partners disclosed the execution of a definitive agreement under which Francisco Partners will purchase the healthcare data and analytics assets that are currently part of the IBM Watson Health business.

    • October 2021: Informatica LLC announced a major cloud storage agreement with Google Cloud. This collaboration allows Informatica clients to transition to Google Cloud up to twelve times faster. Informatica's Google Cloud Marketplace transactable solutions now incorporate Master Data Management and Data Governance capabilities.

    The Harvard Business Review estimates that completing a unit of work with incorrect data costs ten times more than completing it with correct data, and finding the right tools for effective data quality has never been easy. A reliable system can be implemented by selecting and deploying intelligent, workflow-driven, self-service data quality tools with built-in quality controls.

    Key drivers for this market are:

    • Increasing demand for data quality: Businesses are increasingly recognizing the importance of data quality for decision-making and operational efficiency. This is driving demand for data quality tools that can automate and streamline the data cleansing and validation process.

    • Growing adoption of cloud-based data quality tools: Cloud-based data quality tools offer several advantages over on-premises solutions, including scalability, flexibility, and cost-effectiveness. This is driving the adoption of cloud-based data quality tools across all industries.

    • Emergence of AI-powered data quality tools: AI-powered data quality tools can automate many of the tasks involved in data cleansing and validation, making it easier and faster to achieve high-quality data. This is driving the adoption of AI-powered data quality tools across all industries.

    Potential restraints include:

    • Data privacy and security concerns: Data privacy and security regulations are becoming increasingly stringent, which can make it difficult for businesses to implement data quality initiatives.

    • Lack of skilled professionals: There is a shortage of skilled data quality professionals who can implement and manage data quality tools. This can make it difficult for businesses to achieve high-quality data.

    • Cost of data quality tools: Data quality tools can be expensive, especially for large businesses with complex data environments. This can make it difficult for businesses to justify the investment.

    Notable trends are:

    • Adoption of AI-powered data quality tools: AI-powered data quality tools are becoming increasingly popular, as they can automate many of the tasks involved in data cleansing and validation. This makes it easier and faster to achieve high-quality data.

    • Growth of cloud-based data quality tools: Cloud-based data quality tools are becoming increasingly popular, as they offer several advantages over on-premises solutions, including scalability, flexibility, and cost-effectiveness.

    • Focus on data privacy and security: Data quality tools are increasingly being used to help businesses comply with data privacy and security regulations. This is driving the development of new data quality tools that can help businesses protect their data.

  3. Tailored Site Data Quality Summaries

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jun 27, 2024
    + more versions
    Cite
    Chrischilles, Elizabeth A.; Davies, Amy Goodwin; Huang, Yungui; Forrest, Christopher B.; Dickinson, Kimberley; Walters, Kellie; Mendonca, Eneida A.; Hanauer, David; Matthews, Kevin; Bailey, L. Charles; Lehmann, Harold; Denburg, Michelle R.; Rosenman, Marc; Chen, Yong; Taylor, Bradley; Bunnell, H. Timothy; Katsoufis, Chryso; Razzaghi, Hanieh; Morse, Keith; Ilunga, K. T. Sandra; Boss, Samuel; Lemas, Dominick J.; Ranade, Daksha (2024). Tailored Site Data Quality Summaries. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001477156
    Explore at:
    Dataset updated
    Jun 27, 2024
    Authors
    Chrischilles, Elizabeth A.; Davies, Amy Goodwin; Huang, Yungui; Forrest, Christopher B.; Dickinson, Kimberley; Walters, Kellie; Mendonca, Eneida A.; Hanauer, David; Matthews, Kevin; Bailey, L. Charles; Lehmann, Harold; Denburg, Michelle R.; Rosenman, Marc; Chen, Yong; Taylor, Bradley; Bunnell, H. Timothy; Katsoufis, Chryso; Razzaghi, Hanieh; Morse, Keith; Ilunga, K. T. Sandra; Boss, Samuel; Lemas, Dominick J.; Ranade, Daksha
    Description

    Study-specific data quality testing is an essential part of minimizing analytic errors, particularly for studies making secondary use of clinical data. We applied a systematic and reproducible approach for study-specific data quality testing to the analysis plan for PRESERVE, a 15-site, EHR-based observational study of chronic kidney disease in children. This approach integrated widely adopted data quality concepts with healthcare-specific evaluation methods. We implemented two rounds of data quality assessment. The first produced high-level evaluation using aggregate results from a distributed query, focused on cohort identification and main analytic requirements. The second focused on extended testing of row-level data centralized for analysis. We systematized reporting and cataloguing of data quality issues, providing institutional teams with prioritized issues for resolution. We tracked improvements and documented anomalous data for consideration during analyses. The checks we developed identified 115 and 157 data quality issues in the two rounds, involving completeness, data model conformance, cross-variable concordance, consistency, and plausibility, extending traditional data quality approaches to address more complex stratification and temporal patterns. Resolution efforts focused on higher priority issues, given finite study resources. In many cases, institutional teams were able to correct data extraction errors or obtain additional data, avoiding exclusion of 2 institutions entirely and resolving 123 other gaps. Other results identified complexities in measures of kidney function, bearing on the study’s outcome definition. Where limitations such as these are intrinsic to clinical data, the study team must account for them in conducting analyses. This study rigorously evaluated fitness of data for intended use. The framework is reusable and built on a strong theoretical underpinning. 
Significant data quality issues that would have otherwise delayed analyses or made data unusable were addressed. This study highlights the need for teams combining subject-matter and informatics expertise to address data quality when working with real world data.

  4. PMU Data Quality Monitoring Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). PMU Data Quality Monitoring Market Research Report 2033 [Dataset]. https://dataintelo.com/report/pmu-data-quality-monitoring-market
    Explore at:
Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    PMU Data Quality Monitoring Market Outlook




    According to our latest research, the global PMU Data Quality Monitoring market size reached USD 565 million in 2024, and is expected to grow at a robust CAGR of 8.7% from 2025 to 2033, reaching a forecasted market size of USD 1.23 billion by 2033. The primary growth driver for this market is the increasing integration of grid modernization initiatives and the rising adoption of Phasor Measurement Units (PMUs) in power utilities worldwide, which necessitate advanced data quality monitoring solutions to ensure grid stability and reliability.




    One of the most significant factors fueling the expansion of the PMU Data Quality Monitoring market is the global push towards smart grid infrastructure. As utilities and grid operators transition from traditional to digital grids, the volume and complexity of data generated by PMUs have surged. This data is critical for real-time monitoring, fault detection, and predictive maintenance. However, the effectiveness of these applications depends heavily on the quality, accuracy, and timeliness of the data collected. As a result, there is a growing demand for sophisticated PMU data quality monitoring solutions that can detect anomalies, correct errors, and ensure the reliability of grid operations. The increasing prevalence of renewable energy sources, which introduce variability and unpredictability into the grid, further amplifies the need for robust data quality monitoring systems to maintain grid stability and efficiency.




    Another key growth driver is the stringent regulatory environment and compliance requirements imposed by governments and industry bodies. Regulatory authorities across North America, Europe, and Asia Pacific are mandating the adoption of advanced monitoring and reporting mechanisms to enhance grid resilience and prevent large-scale outages. This has prompted utilities and grid operators to invest heavily in PMU data quality monitoring tools that enable compliance with these standards while minimizing operational risks. Moreover, advancements in artificial intelligence and machine learning are revolutionizing the way data quality issues are detected and resolved, enabling real-time analytics and automated decision-making. These technological innovations are expected to further accelerate market growth by offering scalable, cost-effective, and highly accurate monitoring solutions.




    The rapid digital transformation of the energy sector, coupled with the proliferation of distributed energy resources such as solar and wind, is also contributing to the rising demand for PMU data quality monitoring. As the grid becomes increasingly decentralized, the number of data points and the complexity of grid management grow exponentially. PMU data quality monitoring systems play a vital role in aggregating, validating, and analyzing this data to ensure seamless grid operations. Additionally, the growing focus on grid cybersecurity is driving the adoption of monitoring solutions that can detect and mitigate data integrity threats. These trends are expected to sustain the momentum of the PMU Data Quality Monitoring market over the forecast period.




    Regionally, North America continues to dominate the PMU Data Quality Monitoring market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The high adoption rate of smart grid technologies, strong regulatory frameworks, and significant investments in grid modernization projects in the United States and Canada are the primary factors contributing to North America's leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing region, driven by rapid urbanization, increasing electricity demand, and large-scale renewable energy integration in countries like China, India, and Japan. Europe remains a key market, propelled by ambitious decarbonization goals and cross-border grid integration initiatives. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as utilities in these regions begin to modernize their grid infrastructure and enhance data management capabilities.



    Component Analysis




    The PMU Data Quality Monitoring market by component is segmented into software, hardware, and services. The software segment currently holds the largest market share, owing to the growing need for advanced analytics and real-time data validation tools. As utilities and grid operators increasingly rely on P

  5. MRO Data Cleansing and Enrichment Service Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 10, 2025
    + more versions
    Cite
    Market Report Analytics (2025). MRO Data Cleansing and Enrichment Service Report [Dataset]. https://www.marketreportanalytics.com/reports/mro-data-cleansing-and-enrichment-service-76164
    Explore at:
Available download formats: ppt, doc, pdf
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The MRO (Maintenance, Repair, and Operations) Data Cleansing and Enrichment Service market is experiencing robust growth, driven by the increasing need for accurate and reliable data across various industries. The digital transformation sweeping sectors like manufacturing, oil and gas, and pharmaceuticals is fueling demand for streamlined data management. Businesses are realizing the significant cost savings and operational efficiencies achievable through improved data quality. Specifically, inaccurate or incomplete MRO data can lead to costly downtime, inefficient inventory management, and missed maintenance opportunities. Data cleansing and enrichment services address these challenges by identifying and correcting errors, filling in gaps, and standardizing data formats, ultimately improving decision-making and optimizing resource allocation. The market is segmented by application (chemical, oil & gas, pharmaceutical, mining, transportation, others) and type of service (data cleansing, data enrichment). While precise market size figures are unavailable, considering a moderate CAGR of 15% and a 2025 market value in the hundreds of millions, a reasonable projection is a market size exceeding $500 million in 2025, growing to potentially over $1 billion by 2033. This projection reflects the increasing adoption of digital technologies and the growing awareness of the value proposition of high-quality MRO data. The competitive landscape is fragmented, with numerous companies offering specialized services. Key players include both large established firms and smaller niche providers. The market's geographical distribution is diverse, with North America and Europe currently holding significant market shares, reflecting higher levels of digitalization and data management maturity in these regions. However, Asia-Pacific is emerging as a high-growth region due to rapid industrialization and increasing technological adoption. 
The long-term growth trajectory of the MRO Data Cleansing and Enrichment Service market will be influenced by factors such as advancements in data analytics, the expanding adoption of cloud-based solutions, and the continued focus on optimizing operational efficiency across industries. Challenges remain, however, including data security concerns and the need for skilled professionals to manage complex data cleansing and enrichment projects.

  6. Data Observability Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Sep 17, 2025
    Cite
    Data Insights Market (2025). Data Observability Software Report [Dataset]. https://www.datainsightsmarket.com/reports/data-observability-software-528245
    Explore at:
Available download formats: pdf, doc, ppt
    Dataset updated
    Sep 17, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Observability Software market is poised for substantial growth, projected to reach approximately $8,500 million by 2025, with an anticipated Compound Annual Growth Rate (CAGR) of around 22% through 2033. This robust expansion is fueled by the escalating complexity of data landscapes and the critical need for organizations to proactively monitor, troubleshoot, and ensure the reliability of their data pipelines. The increasing volume, velocity, and variety of data generated across industries necessitate sophisticated solutions that provide end-to-end visibility, from data ingestion to consumption. Key drivers include the growing adoption of cloud-native architectures, the proliferation of big data technologies, and the rising demand for data quality and compliance. As businesses increasingly rely on data-driven decision-making, the imperative to prevent data downtime, identify anomalies, and maintain data integrity becomes paramount, further accelerating market penetration. The market is segmented by application, with Large Enterprises constituting a significant share due to their extensive and complex data infrastructures, demanding advanced observability capabilities. Small and Medium-sized Enterprises (SMEs) are also showing increasing adoption, driven by more accessible cloud-based solutions and a growing awareness of data's strategic importance. On-premise deployments remain relevant for organizations with stringent data residency and security requirements, while cloud-based solutions are witnessing rapid growth due to their scalability, flexibility, and cost-effectiveness. Prominent market trends include the integration of AI and machine learning for automated anomaly detection and root cause analysis, the development of unified platforms offering comprehensive data lineage and metadata management, and a focus on real-time monitoring and proactive alerting. 
Challenges such as the high cost of implementation and the need for skilled personnel to manage these sophisticated tools, alongside the potential for vendor lock-in, are being addressed through continuous innovation and strategic partnerships within the competitive vendor landscape. This report provides an in-depth analysis of the global Data Observability Software market, forecasting its trajectory from 2019 to 2033, with a base year of 2025. The market is poised for significant expansion, driven by the escalating complexity of data ecosystems and the critical need for data reliability and trust.

  7. Autonomous Data Cleaning with AI Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Cite
    Growth Market Reports (2025). Autonomous Data Cleaning with AI Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/autonomous-data-cleaning-with-ai-market
    Explore at:
Available download formats: pdf, csv, pptx
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Autonomous Data Cleaning with AI Market Outlook



    According to our latest research, the global Autonomous Data Cleaning with AI market size reached USD 1.68 billion in 2024, with a robust year-on-year growth driven by the surge in enterprise data volumes and the mounting demand for high-quality, actionable insights. The market is projected to expand at a CAGR of 24.2% from 2025 to 2033, which will take the overall market value to approximately USD 13.1 billion by 2033. This rapid growth is fueled by the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, aiming to automate and optimize the data cleaning process for improved operational efficiency and decision-making.




    The primary growth driver for the Autonomous Data Cleaning with AI market is the exponential increase in data generation across various industries such as BFSI, healthcare, retail, and manufacturing. Organizations are grappling with massive amounts of structured and unstructured data, much of which is riddled with inconsistencies, duplicates, and inaccuracies. Manual data cleaning is both time-consuming and error-prone, leading businesses to seek automated AI-driven solutions that can intelligently detect, correct, and prevent data quality issues. The integration of AI not only accelerates the data cleaning process but also ensures higher accuracy, enabling organizations to leverage clean, reliable data for analytics, compliance, and digital transformation initiatives. This, in turn, translates into enhanced business agility and competitive advantage.




    Another significant factor propelling the market is the increasing regulatory scrutiny and compliance requirements in sectors such as banking, healthcare, and government. Regulations such as GDPR, HIPAA, and others mandate strict data governance and quality standards. Autonomous Data Cleaning with AI solutions help organizations maintain compliance by ensuring data integrity, traceability, and auditability. Additionally, the evolution of cloud computing and the proliferation of big data analytics platforms have made it easier for organizations of all sizes to deploy and scale AI-powered data cleaning tools. These advancements are making autonomous data cleaning more accessible, cost-effective, and scalable, further driving market adoption.




    The growing emphasis on digital transformation and real-time decision-making is also a crucial growth factor for the Autonomous Data Cleaning with AI market. As enterprises increasingly rely on analytics, machine learning, and artificial intelligence for business insights, the quality of input data becomes paramount. Automated, AI-driven data cleaning solutions enable organizations to process, cleanse, and prepare data in real-time, ensuring that downstream analytics and AI models are fed with high-quality inputs. This not only improves the accuracy of business predictions but also reduces the time-to-insight, helping organizations stay ahead in highly competitive markets.




    From a regional perspective, North America currently dominates the Autonomous Data Cleaning with AI market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology companies, early adopters of AI, and a mature regulatory environment are key factors contributing to North America’s leadership. However, Asia Pacific is expected to witness the highest CAGR over the forecast period, driven by rapid digitalization, expanding IT infrastructure, and increasing investments in AI and data analytics, particularly in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also gradually emerging as promising markets, supported by growing awareness and adoption of AI-driven data management solutions.





    Component Analysis



    The Autonomous Data Cleaning with AI market is segmented by component into Software and Services. The software segment currently holds the largest market share, driven

  8. Data Release for "The curious case of GW200129: interplay between...

    • zenodo.org
    zip
    Updated Oct 28, 2022
    Cite
    Ethan Payne; Sophie Hourihane; Jacob Golomb; Rhiannon Udall; Derek Davis; Katerina Chatziioannou (2022). Data Release for "The curious case of GW200129: interplay between spin-precession inference and data-quality issues" [Dataset]. http://doi.org/10.5281/zenodo.7259655
    Explore at:
Available download formats: zip
    Dataset updated
    Oct 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ethan Payne; Sophie Hourihane; Jacob Golomb; Rhiannon Udall; Derek Davis; Katerina Chatziioannou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data release associated with "The curious case of GW200129: interplay between spin-precession inference and data-quality issues". We release frame files and result files for selected parameter estimation runs in the paper.

    Each directory contains .json bilby result files for the PE run described by that directory. The specific channel names and frame files that we used in the PE runs are listed below. The frames for the PE runs with BayesWave glitch subtraction are included in this release and are in BW_frames/frame_{glitch label}.

    L1 data with glitch subtraction:

    Channel: L1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_P1800169_v4

    Link: https://zenodo.org/record/5546680/files/L-L1_HOFT_CLEAN_SUB60HZ_C01_P1800169_v4-1264314068-4096.gwf

    L1 data, no mitigation:

    Channel: L1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01 (L1:GWOSC-16KHZ_R1_STRAIN)

    Link: https://www.gw-openscience.org/archive/data/O3b_16KHZ_R1/1263534080/L-L1_GWOSC_O3b_16KHZ_R1-1264312320-4096.gwf

    H1 data, no mitigation:

    Channel: H1:DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01 (H1:GWOSC-16KHZ_R1_STRAIN)

    Link: https://www.gw-openscience.org/archive/data/O3b_16KHZ_R1/1263534080/H-H1_GWOSC_O3b_16KHZ_R1-1264312320-4096.gwf

    Channels for runs with BayesWave glitch-subtracted frames:

    (Note you need to read in the correct gwf file to access each channel; for example the channel for BayesWave glitch A should be accessed after reading in the gwf file in `BW_frames/frame_A/`)

    BayesWave glitch A (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_30000

    BayesWave glitch B (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_28395

    BayesWave glitch C (applies only to L1): DCS-CALIB_STRAIN_CLEAN_SUB60HZ_C01_BW_DEGLITCHED_32752

  9. Trends in COVID-19 Cases and Deaths in the United States, by County-level...

    • data.virginia.gov
    • healthdata.gov
    • +1 more
    csv, json, rdf, xsl
    Updated Jan 13, 2025
    + more versions
    Cite
    Centers for Disease Control and Prevention (2025). Trends in COVID-19 Cases and Deaths in the United States, by County-level Population Factors - ARCHIVED [Dataset]. https://data.virginia.gov/dataset/trends-in-covid-19-cases-and-deaths-in-the-united-states-by-county-level-population-factors-arc
    Explore at:
Available download formats: csv, json, xsl, rdf
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Centers for Disease Control and Preventionhttp://www.cdc.gov/
    Area covered
    United States
    Description

    Reporting of Aggregate Case and Death Count data was discontinued on May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.

    The surveillance case definition for COVID-19, a nationally notifiable disease, was first described in a position statement from the Council for State and Territorial Epidemiologists, which was later revised. However, there is some variation in how jurisdictions implemented these case definitions. More information on how CDC collects COVID-19 case surveillance data can be found at FAQ: COVID-19 Data and Surveillance.

    Aggregate Data Collection Process Since the beginning of the COVID-19 pandemic, data were reported from state and local health departments through a robust process with the following steps:

    • Aggregate county-level counts were obtained indirectly, via automated overnight web collection, or directly, via a data submission process.
    • If more than one official county data source existed, CDC used a comprehensive data selection process comparing each official county data source to retrieve the highest case and death counts, unless otherwise specified by the state.
    • A CDC data team reviewed counts for congruency prior to integration and set up alerts to monitor for discrepancies in the data.
    • CDC routinely compiled these data and posted the finalized information on COVID Data Tracker.
    • County-level data were aggregated to obtain state- and territory-specific totals.
    • Counting of cases and deaths is based on date of report and not on the date of symptom onset. CDC calculates rates in these data by using population estimates provided by the US Census Bureau Population Estimates Program (2019 Vintage).
    • COVID-19 aggregate case and death data are organized in a time series that includes cumulative number of cases and deaths as reported by a jurisdiction on a given date. New case and death counts are calculated as the week-to-week change in cumulative counts of cases and deaths reported (i.e., newly reported cases and deaths = cumulative number of cases/deaths reported this week minus the cumulative total reported the prior week).
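    The week-to-week calculation in the last bullet can be sketched in a few lines of Python (the cumulative series below is hypothetical; the first week has no prior total, so its cumulative count is returned as-is):

```python
def weekly_new_counts(cumulative):
    """New cases/deaths per week = this week's cumulative total minus last week's.

    `cumulative` is an ordered list of cumulative counts, one per reporting week.
    """
    new = []
    prior = 0
    for total in cumulative:
        new.append(total - prior)
        prior = total
    return new

# Hypothetical cumulative counts for four reporting weeks:
print(weekly_new_counts([120, 180, 180, 250]))  # [120, 60, 0, 70]
```

    Note that negative weekly values can legitimately occur when a jurisdiction retrospectively corrects its cumulative counts, as described in the collection process above.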

    This process was collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provided the most up-to-date numbers on cases and deaths by report date. Throughout data collection, CDC retrospectively updated counts to correct known data quality issues.

    Description This archived public use dataset focuses on the cumulative and weekly case and death rates per 100,000 persons within various sociodemographic factors across all states and their counties. All resulting data are expressed as rates calculated as the number of cases or deaths per 100,000 persons in counties meeting various classification criteria using the US Census Bureau Population Estimates Program (2019 Vintage).
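    The rate definition above is a straightforward calculation; a minimal sketch with made-up numbers (function name and figures are ours):

```python
def rate_per_100k(count, population):
    """Cases or deaths per 100,000 persons, as used for the rates in this dataset."""
    return count / population * 100_000

# Hypothetical county group: 1,234 cases among 567,890 residents.
print(round(rate_per_100k(1234, 567890), 1))
```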

    Each county within jurisdictions is classified into multiple categories for each factor. All rates in this dataset are based on classification of counties by the characteristics of their population, not individual-level factors. This applies to each of the available factors observed in this dataset. Specific factors and their corresponding categories are detailed below.

    Population-level factors Each unique population factor is detailed below. Please note that the “Classification” column describes each of the 12 factors in the dataset, including a data dict

  10. Brazil Overflow: Sewage: Average Duration: Repair: North: Roraima

    • ceicdata.com
    Updated Feb 26, 2023
    + more versions
    Cite
    CEICdata.com (2023). Brazil Overflow: Sewage: Average Duration: Repair: North: Roraima [Dataset]. https://www.ceicdata.com/en/brazil/quality-indicators-sewage-issues-overflow
    Explore at:
    Dataset updated
    Feb 26, 2023
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2014 - Dec 1, 2022
    Area covered
    Brazil
    Description

    Overflow: Sewage: Average Duration: Repair: North: Roraima data was reported at 0.050 Hour in 2022. This records a decrease from the previous number of 0.080 Hour for 2021. Overflow: Sewage: Average Duration: Repair: North: Roraima data is updated yearly, averaging 2.780 Hour from Dec 2014 (Median) to 2022, with 9 observations. The data reached an all-time high of 24.000 Hour in 2014 and a record low of 0.050 Hour in 2022. Overflow: Sewage: Average Duration: Repair: North: Roraima data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.

  11. MODIS/Terra Thermal Anomalies/Fire Daily L3 Global 1km SIN Grid V006 -...

    • data.nasa.gov
    Updated Jun 12, 2025
    + more versions
    Cite
    nasa.gov (2025). MODIS/Terra Thermal Anomalies/Fire Daily L3 Global 1km SIN Grid V006 - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/modis-terra-thermal-anomalies-fire-daily-l3-global-1km-sin-grid-v006
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    The MOD14A1 Version 6 data product was decommissioned on July 31, 2023. Users are encouraged to use the MOD14A1 Version 6.1 data product. The Terra Moderate Resolution Imaging Spectroradiometer (MODIS) Thermal Anomalies and Fire Daily (MOD14A1) Version 6 data are generated every eight days at 1 kilometer (km) spatial resolution as a Level 3 product. MOD14A1 contains eight consecutive days of fire data conveniently packaged into a single file. The Science Dataset (SDS) layers include the fire mask, pixel quality indicators, maximum fire radiative power (MaxFRP), and the position of the fire pixel within the scan. Each layer consists of daily per-pixel information for each of the eight days of data acquisition.

    Known Issues: Known issues are described on the MODIS Land Quality Assessment website and in Section 7.2 of the User Guide, which covers Pre-November 2000 Data Quality, Detection Confidence, Flagging of Static Sources, and the August 2020 MODIS Aqua Outage.

    Improvements/Changes from Previous Versions:

    • Refinements to internal cloud mask, which sometimes flags heavy smoke as clouds.
    • Fix for frequent false alarms occurring in the Amazon that are caused by small (~1 km²) clearings within forests.
    • Fix to correct a bug that causes incorrect assessment of cloud and water pixels adjacent to fire pixels near the scan edge.
    • Detect small fires using dynamic thresholding.
    • Process ocean and coastline pixels to detect fire from oil rigs. The Version 6 fire mask has the potential to detect fire over water pixels. Therefore, class 3 pixel values have been changed to be classified as "non-fire water pixels".
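    As a rough illustration of how the fire mask layer mentioned above is typically decoded, here is a sketch using the commonly documented MOD14 class values (these should be confirmed against the product User Guide; only the class-3 meaning is stated in the description above):

```python
# Commonly documented MOD14A1 fire mask classes (verify against the User Guide).
# Per the description above, class 3 is "non-fire water pixel" in Version 6.
FIRE_MASK_CLASSES = {
    0: "not processed (missing input data)",
    1: "not processed (obsolete)",
    2: "not processed (other reason)",
    3: "non-fire water pixel",
    4: "cloud",
    5: "non-fire land pixel",
    6: "unknown",
    7: "fire (low confidence)",
    8: "fire (nominal confidence)",
    9: "fire (high confidence)",
}

def is_fire(pixel_value: int) -> bool:
    """Classes 7-9 are fire detections at increasing confidence."""
    return 7 <= pixel_value <= 9

print(FIRE_MASK_CLASSES[3], is_fire(8))
```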

  12. Weekly United States COVID-19 Cases and Deaths by State - ARCHIVED

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Oct 21, 2022
    + more versions
    Cite
    data.cdc.gov (2022). Weekly United States COVID-19 Cases and Deaths by State - ARCHIVED [Dataset]. https://healthdata.gov/w/hiqp-x67x/default?cur=_65-WvB31Cw
    Explore at:
    xlsx, csv, xmlAvailable download formats
    Dataset updated
    Oct 21, 2022
    Dataset provided by
    data.cdc.gov
    Area covered
    United States
    Description

    Reporting of new Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. This dataset will receive a final update on June 1, 2023, to reconcile historical data through May 10, 2023, and will remain publicly available.

    Aggregate Data Collection Process Since the start of the COVID-19 pandemic, data have been gathered through a robust process with the following steps:

    • A CDC data team reviews and validates the information obtained from jurisdictions’ state and local websites via an overnight data review process.
    • If more than one official county data source exists, CDC uses a comprehensive data selection process comparing each official county data source, and takes the highest case and death counts respectively, unless otherwise specified by the state.
    • CDC compiles these data and posts the finalized information on COVID Data Tracker.
    • County-level data are aggregated to obtain state- and territory-specific totals.
    This process is collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provide the most up-to-date numbers on cases and deaths by report date. CDC may retrospectively update counts to correct data quality issues.

    Methodology Changes Several differences exist between the current, weekly-updated dataset and the archived version:

    • Source: The current Weekly-Updated Version is based on county-level aggregate count data, while the Archived Version is based on State-level aggregate count data.
    • Confirmed/Probable Cases/Death breakdown:  While the probable cases and deaths are included in the total case and total death counts in both versions (if applicable), they were reported separately from the confirmed cases and deaths by jurisdiction in the Archived Version.  In the current Weekly-Updated Version, the counts by jurisdiction are not reported by confirmed or probable status (See Confirmed and Probable Counts section for more detail).
    • Time Series Frequency: The current Weekly-Updated Version contains weekly time series data (i.e., one record per week per jurisdiction), while the Archived Version contains daily time series data (i.e., one record per day per jurisdiction).
    • Update Frequency: The current Weekly-Updated Version is updated weekly, while the Archived Version was updated twice daily up to October 20, 2022.
    Important note: The counts reflected during a given time period in this dataset may not match the counts reflected for the same time period in the archived dataset noted above. Discrepancies may exist due to differences between county and state COVID-19 case surveillance and reconciliation efforts.

    Confirmed and Probable Counts In this dataset, counts by jurisdiction are not displayed by confirmed or probable status. Instead, confirmed and probable cases and deaths are included in the Total Cases and Total Deaths columns, when available. Not all jurisdictions report probable cases and deaths to CDC.* Confirmed and probable case definition criteria are described here:

    Council of State and Territorial Epidemiologists (ymaws.com).

    Deaths CDC reports death data on other sections of the website: CDC COVID Data Tracker: Home, CDC COVID Data Tracker: Cases, Deaths, and Testing, and NCHS Provisional Death Counts. Information presented on the COVID Data Tracker pages is based on the same source (to

  13. Brazil Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo

    • ceicdata.com
    Updated Feb 26, 2023
    + more versions
    Cite
    CEICdata.com (2023). Brazil Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo [Dataset]. https://www.ceicdata.com/en/brazil/quality-indicators-sewage-issues-overflow
    Explore at:
    Dataset updated
    Feb 26, 2023
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Brazil
    Description

    Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data was reported at 12.500 Hour in 2022. This records a decrease from the previous number of 17.300 Hour for 2021. Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data is updated yearly, averaging 15.960 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 134.630 Hour in 2013 and a record low of 10.710 Hour in 2015. Overflow: Sewage: Average Duration: Repair: Southeast: Espírito Santo data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.

  14. Weekly United States COVID-19 Cases and Deaths by County - ARCHIVED

    • healthdata.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Jun 9, 2023
    Cite
    data.cdc.gov (2023). Weekly United States COVID-19 Cases and Deaths by County - ARCHIVED [Dataset]. https://healthdata.gov/w/fjew-c6u8/default?cur=dIeHm7VTFLY
    Explore at:
    xml, xlsx, csvAvailable download formats
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    data.cdc.gov
    Area covered
    United States
    Description

    Note: The cumulative case count for some counties (with small population) is higher than expected due to the inclusion of non-permanent residents in COVID-19 case counts.

    Reporting of Aggregate Case and Death Count data was discontinued on May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.

    Aggregate Data Collection Process Since the beginning of the COVID-19 pandemic, data were reported through a robust process with the following steps:

    • Aggregate county-level counts were obtained indirectly, via automated overnight web collection, or directly, via a data submission process.
    • If more than one official county data source existed, CDC used a comprehensive data selection process comparing each official county data source to retrieve the highest case and death counts, unless otherwise specified by the state.
    • A CDC data team reviewed counts for congruency prior to integration. CDC routinely compiled these data and posted the finalized information on COVID Data Tracker.
    • Cases and deaths are based on date of report and not on the date of symptom onset. CDC calculates rates in this data by using population estimates provided by the US Census Bureau Population Estimates Program (2019 Vintage).
    • COVID-19 aggregate case and death data were organized in a time series that includes cumulative number of cases and deaths as reported by a jurisdiction on a given date. New case and death counts were calculated as the week-to-week change in reported cumulative cases and deaths (i.e., newly reported cases and deaths = cumulative number of cases/deaths reported this week minus the cumulative total reported the week before).

    This process was collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provided the most up-to-date numbers on cases and deaths by report date. Throughout data collection, CDC retrospectively updated counts to correct known data quality issues. CDC also worked with jurisdictions after the end of the public health emergency declaration to finalize county data.

    • Source: The weekly archived dataset is based on county-level aggregate count data
    • Confirmed/Probable Cases/Death breakdown: Cumulative cases and deaths for each county are included. Total reported cases include probable and confirmed cases.
    • Time Series Frequency: The weekly archived dataset contains weekly time series data (i.e., one record per week per county)

    Important note: The counts reflected during a given time period in this dataset may not match the counts reflected for the same time period in the daily archived dataset noted above. Discrepancies may exist due to differences between county and state COVID-19 case surveillance and reconciliation efforts.

    The surveillance case definition for COVID-19, a nationally notifiable disease, was first described in a position statement from the Council for State and Territorial Epidemiologists, which was later revised. However, there is some variation in how jurisdictions implement these case classifications. More information on how CDC collects COVID-19 case surveillance data can be found at FAQ: COVID-19 Data and Surveillance.

    Confirmed and Probable Counts In this dataset, counts by jurisdiction are not displayed by confirmed or probable status. Instead, counts of confirmed and probable cases and deaths are included in the Total Cases and Total Deaths columns, when available. Not all jurisdictions report

  15. Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets

    • data.niaid.nih.gov
    Updated Jul 14, 2024
    Cite
    Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan (2024). Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11101337
    Explore at:
    Dataset updated
    Jul 14, 2024
    Dataset provided by
    Simon Fraser University
    Indian Institute of Technology Delhi
    Authors
    Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k, uncovering these data quality issues, measuring the effects of these problems on the benchmark results, and proposing corrections to the datasets. Besides ensuring the reproducibility of our analysis by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
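    Duplicates and train-test leakage of the kind the abstract describes are often screened for by hashing image bytes and looking for collisions across splits. A minimal sketch (exact-duplicate detection only; the paper's actual pipeline is more sophisticated, and the file names here are made up):

```python
import hashlib

def find_cross_split_duplicates(train_files, test_files, read_bytes):
    """Flag test items whose raw bytes exactly match some training item.

    `read_bytes` maps a file identifier to its byte content; here a dict
    lookup stands in for reading image files so the sketch is self-contained.
    """
    train_hashes = {hashlib.sha256(read_bytes(f)).hexdigest() for f in train_files}
    return [f for f in test_files
            if hashlib.sha256(read_bytes(f)).hexdigest() in train_hashes]

# Toy byte store standing in for image files (hypothetical names/contents):
store = {"img_a": b"\x01\x02", "img_b": b"\x03\x04", "img_c": b"\x01\x02"}
leaked = find_cross_split_duplicates(["img_a", "img_b"], ["img_c"], store.__getitem__)
print(leaked)  # ['img_c'] duplicates 'img_a' across the split
```

    Near-duplicate detection (resized or re-encoded copies) requires perceptual hashing or feature-space comparison rather than exact byte hashes.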

    Citation

    If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.

    Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.

    The corresponding BibTeX entry is:

    @article{abhishek2024investigating,
      title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
      author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
      journal={arXiv preprint arXiv:2401.14497},
      doi={10.48550/ARXIV.2401.14497},
      url={https://arxiv.org/abs/2401.14497},
      year={2024}
    }

    Project Website

    The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.

    Code

    The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.

    License

    The metadata files (DermaMNIST-C.csv, DermaMNIST-E.csv, Fitzpatrick17k_DiagnosisMapping.xlsx, Fitzpatrick17k-C.csv) contained in this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The NPZ files associated with DermaMNIST-C (dermamnist_corrected_28.npz, dermamnist_corrected_224.npz) and DermaMNIST-E (dermamnist_extended_28.npz, dermamnist_extended_224.npz) contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

    The code hosted on GitHub is licensed under the Apache License 2.0.

  16. Brazil Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul

    • ceicdata.com
    Updated Feb 26, 2023
    Cite
    CEICdata.com (2023). Brazil Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul [Dataset]. https://www.ceicdata.com/en/brazil/quality-indicators-sewage-issues-overflow
    Explore at:
    Dataset updated
    Feb 26, 2023
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Brazil
    Description

    Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data was reported at 13.280 Hour in 2022. This records a decrease from the previous number of 25.220 Hour for 2021. Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data is updated yearly, averaging 25.220 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 69.640 Hour in 2017 and a record low of 1.070 Hour in 2015. Overflow: Sewage: Average Duration: Repair: South: Rio Grande do Sul data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.

  17. Table_2_Can the Use of Bayesian Analysis Methods Correct for Incompleteness...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    + more versions
    Cite
    Elizabeth Ford; Philip Rooney; Peter Hurley; Seb Oliver; Stephen Bremner; Jackie Cassell (2023). Table_2_Can the Use of Bayesian Analysis Methods Correct for Incompleteness in Electronic Health Records Diagnosis Data? Development of a Novel Method Using Simulated and Real-Life Clinical Data.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2020.00054.s002
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Elizabeth Ford; Philip Rooney; Peter Hurley; Seb Oliver; Stephen Bremner; Jackie Cassell
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Patient health information is collected routinely in electronic health records (EHRs) and used for research purposes, however, many health conditions are known to be under-diagnosed or under-recorded in EHRs. In research, missing diagnoses result in under-ascertainment of true cases, which attenuates estimated associations between variables and results in a bias toward the null. Bayesian approaches allow the specification of prior information to the model, such as the likely rates of missingness in the data. This paper describes a Bayesian analysis approach which aimed to reduce attenuation of associations in EHR studies focussed on conditions characterized by under-diagnosis.Methods: Study 1: We created synthetic data, produced to mimic structured EHR data where diagnoses were under-recorded. We fitted logistic regression (LR) models with and without Bayesian priors representing rates of misclassification in the data. We examined the LR parameters estimated by models with and without priors. Study 2: We used EHR data from UK primary care in a case-control design with dementia as the outcome. We fitted LR models examining risk factors for dementia, with and without generic prior information on misclassification rates. We examined LR parameters estimated by models with and without the priors, and estimated classification accuracy using Area Under the Receiver Operating Characteristic.Results: Study 1: In synthetic data, estimates of LR parameters were much closer to the true parameter values when Bayesian priors were added to the model; with no priors, parameters were substantially attenuated by under-diagnosis. Study 2: The Bayesian approach ran well on real life clinic data from UK primary care, with the addition of prior information increasing LR parameter values in all cases. 
In multivariate regression models, Bayesian methods showed no improvement in classification accuracy over traditional LR.Conclusions: The Bayesian approach showed promise but had implementation challenges in real clinical data: prior information on rates of misclassification was difficult to find. Our simple model made a number of assumptions, such as diagnoses being missing at random. Further development is needed to integrate the method into studies using real-life EHR data. Our findings nevertheless highlight the importance of developing methods to address missing diagnoses in EHR data.
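    The attenuation described in Study 1 can be illustrated with a toy calculation: under-diagnosis that misses true cases at the same rate in both exposure groups (nondifferential outcome misclassification with perfect specificity) pulls the odds ratio toward the null. A sketch with made-up rates (all numbers hypothetical, not from the paper):

```python
def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing outcome probability in exposed vs. unexposed."""
    odds1 = p_exposed / (1 - p_exposed)
    odds0 = p_unexposed / (1 - p_unexposed)
    return odds1 / odds0

# Hypothetical true outcome rates and a diagnosis sensitivity of 50%
# (half of true cases are missing from the record, in both groups alike).
p1, p0, sensitivity = 0.30, 0.10, 0.5
true_or = odds_ratio(p1, p0)
observed_or = odds_ratio(p1 * sensitivity, p0 * sensitivity)

print(round(true_or, 2), round(observed_or, 2))  # observed OR sits closer to 1
```

    Supplying the model with prior information on the misclassification rate is, in essence, what lets the Bayesian approach recover estimates closer to the true parameter.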

  18. Brazil Overflow: Sewage: Average Duration: Repair: North: Amazonas

    • ceicdata.com
    Updated Oct 15, 2025
    Cite
    CEICdata.com (2025). Brazil Overflow: Sewage: Average Duration: Repair: North: Amazonas [Dataset]. https://www.ceicdata.com/en/brazil/quality-indicators-sewage-issues-overflow/overflow-sewage-average-duration-repair-north-amazonas
    Explore at:
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Brazil
    Description

    Overflow: Sewage: Average Duration: Repair: North: Amazonas data was reported at 7.000 Hour in 2022. This records a decrease from the previous number of 7.590 Hour for 2021. Overflow: Sewage: Average Duration: Repair: North: Amazonas data is updated yearly, averaging 7.000 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 36.350 Hour in 2014 and a record low of 0.190 Hour in 2015. Overflow: Sewage: Average Duration: Repair: North: Amazonas data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.

  19. BugCatcher-Data

    • zenodo.org
    bin
    Updated Dec 20, 2024
    Cite
    Shaopeng XU; Shaopeng XU (2024). BugCatcher-Data [Dataset]. http://doi.org/10.5281/zenodo.14536733
    Explore at:
    binAvailable download formats
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Shaopeng XU; Shaopeng XU
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bug datasets play a vital role in advancing software engineering tasks, including bug detection, fault localization, and automated program repair. These datasets enable the development of more accurate algorithms, facilitate efficient fault identification, and drive the creation of reliable automated repair tools. However, the manual collection and curation of such data are labor-intensive and prone to inconsistency, which limits scalability and reliability. Current datasets often fail to provide detailed and accurate information, particularly regarding bug types, descriptions, and classifications, reducing their utility in diverse research and practical applications. To address these challenges, we introduce BugCatcher, a comprehensive approach for constructing large-scale, high-quality bug datasets. BugCatcher begins by enhancing PR-Issue linking mechanisms, extending data collection to 12 programming languages over a decade, and ensuring accurate linkage between pull requests and issues. It employs a two-stage filtering process, BugCurator, to refine data quality, and utilizes large language models with Zero-shot Chain-of-Thought prompting to generate precise bug types and detailed descriptions. Furthermore, BugCatcher incorporates a robust classification framework, fine-tuning models for improved categorization. The resulting dataset, BugCatcher-Data, includes 243,265 bug-fix entries with comprehensive fields such as code diffs, bug locations, detailed descriptions, and classifications, serving as a substantial resource for advancing software engineering research and practices.
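    PR-issue linking of the kind BugCatcher builds on is commonly bootstrapped from closing keywords in pull-request text. A simplified sketch (the actual BugCatcher mechanism is described as more elaborate; the regex below covers only GitHub-style closing keywords):

```python
import re

# GitHub-style closing keywords that tie a pull request to the issue it fixes.
LINK_PATTERN = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)

def linked_issues(pr_text: str) -> list[int]:
    """Extract issue numbers referenced with closing keywords in PR text."""
    return [int(num) for num in LINK_PATTERN.findall(pr_text)]

print(linked_issues("Fixes #101 and closes #202; see also #303."))  # [101, 202]
```

    Note that plain mentions like "#303" are deliberately excluded: only keyword-linked references are treated as fix links.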

  20. Brazil Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe

    • ceicdata.com
    Updated Feb 15, 2025
    Cite
    CEICdata.com (2025). Brazil Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe [Dataset]. https://www.ceicdata.com/en/brazil/quality-indicators-sewage-issues-overflow/overflow-sewage-average-duration-repair-northeast-sergipe
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Brazil
    Description

    Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data was reported at 2.520 Hour in 2022. This records a decrease from the previous number of 4.130 Hour for 2021. Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data is updated yearly, averaging 4.990 Hour from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 12.000 Hour in 2014 and a record low of 2.520 Hour in 2022. Overflow: Sewage: Average Duration: Repair: Northeast: Sergipe data remains active status in CEIC and is reported by Ministry of Cities. The data is categorized under Brazil Premium Database’s Environmental, Social and Governance Sector – Table BR.EVE005: Quality Indicators: Sewage: Issues: Overflow.


Superstore Sales: The Data Quality Challenge

Description

Superstore Sales - The Data Quality Challenge Edition (25K Records)

This dataset is an expanded version of the popular "Sample - Superstore Sales" dataset, commonly used for introductory data analysis and visualization. It contains detailed transactional data for a US-based retail company, covering orders, products, and customer information.

This version is specifically designed for practicing Data Quality (DQ) and Data Wrangling skills, featuring a unique set of real-world "dirty data" problems (like those encountered in tools like SPSS Modeler, Tableau Prep, or Alteryx) that must be cleaned before any analysis or machine learning can begin.

This dataset combines the original Superstore data with 15,000 plausibly generated synthetic records, totaling 25,000 rows of transactional data. It includes 21 columns detailing: - Order Information: Order ID, Order Date, Ship Date, Ship Mode. - Customer Information: Customer ID, Customer Name, Segment. - Geographic Information: Country, City, State, Postal Code, Region. - Product Information: Product ID, Category, Sub-Category, Product Name. - Financial Metrics: Sales, Quantity, Discount, and Profit.

🚨 Introduced Data Quality Challenges (The Dirty Data)

This dataset is intentionally corrupted to provide a robust practice environment for data cleaning. Challenges include: Missing/Inconsistent Values: Deliberate gaps in Profit and Discount, and multiple inconsistent entries (-- or blank) in the Region column.
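As a starting point, the Region placeholders and the numeric gaps can be normalized with pandas. The frame below is a toy reconstruction of the described problems (the real file has 25,000 rows), and median imputation is just one illustrative policy, not a prescribed one:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the described issues (real column names, made-up values).
df = pd.DataFrame({
    "Region": ["West", "--", "", "South", None],
    "Profit": [10.5, None, 3.2, None, 7.0],
    "Discount": [0.2, 0.0, None, 0.1, None],
})

# Collapse the inconsistent placeholders ("--" and blank) into real missing values.
df["Region"] = df["Region"].replace({"--": np.nan, "": np.nan})

# One simple policy: impute numeric gaps with the column median.
for col in ["Profit", "Discount"]:
    df[col] = df[col].fillna(df[col].median())

print(df["Region"].isna().sum())  # 3 rows are now flagged as missing
```

Whether to impute, flag, or drop the missing Region rows depends on the downstream task.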

  • Data Type Mismatches: Order Date and Ship Date are stored as text strings, and the Profit column is polluted with comma-formatted strings (e.g., "1,234.56"), forcing the entire column to be read as an object (string) type.
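A minimal sketch of the conversions this bullet calls for, using a toy two-row frame in place of the real file:

```python
import pandas as pd

# Toy frame: dates stored as text, Profit polluted with a comma-formatted string.
df = pd.DataFrame({
    "Order Date": ["2023-01-05", "2023-02-10"],
    "Ship Date": ["2023-01-08", "2023-02-14"],
    "Profit": ["1,234.56", "87.20"],
})

# Parse the date columns; errors="coerce" turns unparseable entries into NaT
# instead of raising, which is usually what you want with dirty data.
for col in ["Order Date", "Ship Date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Strip the thousands separators, then cast Profit back to float.
df["Profit"] = df["Profit"].str.replace(",", "", regex=False).astype(float)

print(df.dtypes)  # both date columns datetime64[ns], Profit float64
```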

  • Categorical Inconsistencies: The Category field contains variations and typos like "Tech", "technologies", "Furni", and "OfficeSupply" that require standardization.
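One way to standardize these variants is to strip, lower-case, and map onto canonical labels. The mapping below covers only the typos quoted above and would need extending for anything else found in the real column:

```python
import pandas as pd

# Toy Series with the quoted variants plus the three canonical labels.
cats = pd.Series(["Tech", "technologies", "Furni", "OfficeSupply",
                  "Technology", "Furniture", "Office Supplies"])

# Map every known variant onto its canonical label (keys are lower-cased).
canonical = {
    "tech": "Technology", "technologies": "Technology", "technology": "Technology",
    "furni": "Furniture", "furniture": "Furniture",
    "officesupply": "Office Supplies", "office supplies": "Office Supplies",
}
clean = cats.str.strip().str.lower().map(canonical)

print(clean.unique())  # ['Technology' 'Furniture' 'Office Supplies']
```

Note that `.map` leaves unmapped values as NaN, which conveniently surfaces any variant the dictionary missed.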

  • Outliers and Invalid Data: Extreme outliers have been added to the Sales and Profit fields, alongside a subset of transactions with an invalid Sales value of 0.
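A sketch of handling both problems: drop the invalid zero-sales rows, then fence outliers with the standard 1.5 × IQR rule (the toy values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy Sales column containing an invalid zero and an extreme outlier.
df = pd.DataFrame({"Sales": [120.0, 85.5, 0.0, 240.0, 99999.0, 150.0]})

# Drop the invalid zero-sales transactions first.
df = df[df["Sales"] > 0]

# Flag outliers with the 1.5 * IQR fence.
q1, q3 = df["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
within_fence = df["Sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[within_fence]

print(df_clean["Sales"].tolist())  # the 99999.0 outlier is gone
```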

  • Duplicate Records: Over 200 rows are duplicated (with slight financial variations) to test your deduplication logic.
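Because the duplicated rows carry slight financial variations, an exact `drop_duplicates()` will miss them. Keying on identifying columns only catches the varied copies (which columns uniquely identify a line item is an assumption here):

```python
import pandas as pd

# Toy frame: row 1 duplicates row 0 except for a slightly different Profit.
df = pd.DataFrame({
    "Order ID":   ["CA-1001", "CA-1001", "CA-1002"],
    "Product ID": ["P-01",    "P-01",    "P-02"],
    "Sales":      [100.0,     100.0,     55.0],
    "Profit":     [20.0,      20.5,      9.0],
})

# Exact-match dedup would keep the varied copy, so key on IDs only.
deduped = df.drop_duplicates(subset=["Order ID", "Product ID"], keep="first")

print(len(deduped))  # 2
```

In the real data you may want to inspect the conflicting financial values before deciding which copy to keep.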

❓ Suggested Analysis and Modeling Tasks

This dataset is ideal for:

  • Data Wrangling/Cleaning (Primary Focus): Fix all of the intentional data quality issues before proceeding.

  • Exploratory Data Analysis (EDA): Analyze sales distribution by region, segment, and category.

  • Regression: Predict Profit from Sales, Discount, and product features.

  • Classification: Build an RFM model (Recency, Frequency, Monetary) and create a target variable (HighValueCustomer = 1 if total sales exceed $1,000) to be predicted with logistic regression or decision trees.

  • Time Series Analysis: Aggregate sales by month/year to perform forecasting.
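The RFM/classification task above can be sketched as follows. The toy transactions and the snapshot-date convention are assumptions; only the $1,000 threshold comes from the task description:

```python
import pandas as pd

# Toy transactions standing in for the real 25,000-row file.
df = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C3"],
    "Order Date": pd.to_datetime(["2023-01-05", "2023-03-10",
                                  "2023-02-01", "2023-03-20"]),
    "Sales": [600.0, 700.0, 300.0, 50.0],
})

# Recency is measured against the day after the last observed order.
snapshot = df["Order Date"].max() + pd.Timedelta(days=1)

rfm = df.groupby("Customer ID").agg(
    recency=("Order Date", lambda d: (snapshot - d.max()).days),
    frequency=("Order Date", "count"),
    monetary=("Sales", "sum"),
)

# Target variable from the task: high-value if total sales exceed $1,000.
rfm["HighValueCustomer"] = (rfm["monetary"] > 1000).astype(int)

print(rfm)
```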

Acknowledgements

This dataset is an expanded and corrupted derivative of the original Sample Superstore dataset, credited to Tableau and widely shared for educational purposes. All synthetic records were generated to plausibly match the distributions of the original data.
