100+ datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.

Data Science Platform Market Analysis North America, Europe, APAC, South...

technavio.com

Updated Feb 13, 2025

Technavio (2025). Data Science Platform Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, China, Canada, UK, India, France, Japan, Brazil, UAE - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis

Dataset updated

Feb 13, 2025

Dataset provided by

TechNavio

Authors

Technavio

Time period covered

2021 - 2025

Area covered

United Kingdom, United States, Global

Description

Data Science Platform Market Size 2025-2029

The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.

The market is experiencing significant growth, driven by the integration of artificial intelligence (AI) and machine learning (ML). This enhancement enables more advanced data analysis and prediction capabilities, making data science platforms an essential tool for businesses seeking to gain insights from their data. Another trend shaping the market is the emergence of containerization and microservices in platforms. This development offers increased flexibility and scalability, allowing organizations to efficiently manage their projects. 
However, the use of platforms also presents challenges, particularly in the area of data privacy and security. Protecting sensitive data is crucial for businesses, and platforms must provide strong security measures to mitigate risks. In summary, the market is witnessing substantial growth due to the integration of AI and ML technologies, containerization, and microservices, while data privacy and security remain key challenges.

What will be the Size of the Data Science Platform Market During the Forecast Period?

The market is experiencing significant growth due to the increasing demand for advanced data analysis capabilities in various industries. Cloud-based solutions are gaining popularity as they offer scalability, flexibility, and cost savings. The market encompasses the entire project life cycle, from data acquisition and preparation to model development, training, and distribution. Big data, IoT, multimedia, machine data, consumer data, and business data are prime sources fueling this market's expansion. Unstructured data, previously challenging to process, is now being effectively managed through tools and software. Relational databases and machine learning models are integral components of platforms, enabling data exploration, preprocessing, and visualization.
Moreover, artificial intelligence (AI) and machine learning (ML) technologies are essential for handling complex workflows, including data cleaning, model development, and model distribution. These platforms help data scientists streamline their tasks, improve productivity, and train models accurately and efficiently. The market is expected to continue its growth trajectory as businesses increasingly recognize the value of data-driven insights.

How is this Data Science Platform Industry segmented and which is the largest segment?

The industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment

  On-premises
  Cloud


Component

  Platform
  Services


End-user

  BFSI
  Retail and e-commerce
  Manufacturing
  Media and entertainment
  Others


Sector

  Large enterprises
  SMEs


Geography

  North America

    Canada
    US


  Europe

    Germany
    UK
    France


  APAC

    China
    India
    Japan


  South America

    Brazil


  Middle East and Africa

By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period.

On-premises deployment is a traditional method for implementing technology solutions within an organization. This approach involves purchasing software with a one-time license fee and a service contract. On-premises solutions offer enhanced security, as they keep user credentials and data within the company's premises. They can be customized to meet specific business requirements, allowing for quick adaptation. On-premises deployment eliminates the need for third-party providers to manage and secure data, ensuring data privacy and confidentiality. Additionally, it enables rapid and easy data access, and keeps IP addresses and data confidential. This deployment model is particularly beneficial for businesses dealing with sensitive data, such as those in manufacturing and large enterprises. While cloud-based solutions offer flexibility and cost savings, on-premises deployment remains a popular choice for organizations prioritizing data security and control.

The on-premises segment was valued at USD 38.70 million in 2019 and showed a gradual increase during the forecast period.

Regional Analysis

North America is estimated to contribute 48% to the growth of the global market during the forecast period.

Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.

Data Cleansing Software Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 23, 2025
Archive Market Research (2025). Data Cleansing Software Report [Dataset]. https://www.archivemarketresearch.com/reports/data-cleansing-software-44630
Available download formats: ppt, doc, pdf
Dataset updated
Feb 23, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The data cleansing software market is expanding rapidly, with a market size of XXX million in 2023 and a projected CAGR of XX% from 2023 to 2033. This growth is driven by the increasing need for accurate and reliable data in various industries, including healthcare, finance, and retail. Key market trends include the growing adoption of cloud-based solutions, the increasing use of artificial intelligence (AI) and machine learning (ML) to automate the data cleansing process, and the increasing demand for data governance and compliance. The market is segmented by deployment type (cloud-based vs. on-premise) and application (large enterprises vs. SMEs vs. government agencies). Major players in the market include IBM, SAS Institute Inc, SAP SE, Trifacta, OpenRefine, Data Ladder, Analytics Canvas (nModal Solutions Inc.), Mo-Data, Prospecta, WinPure Ltd, Symphonic Source Inc, MuleSoft, MapR Technologies, V12 Data, and Informatica. This report provides a comprehensive overview of the global data cleansing software market, with a focus on market concentration, product insights, regional insights, trends, driving forces, challenges and restraints, growth catalysts, leading players, and significant developments.
Data Center Cleaning Service Report
marketresearchforecast.com
doc, pdf, ppt
Updated Jan 24, 2025
Market Research Forecast (2025). Data Center Cleaning Service Report [Dataset]. https://www.marketresearchforecast.com/reports/data-center-cleaning-service-14735
Available download formats: pdf, doc, ppt
Dataset updated
Jan 24, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The market for data center cleaning services is expected to grow from USD XXX million in 2025 to USD XXX million by 2033, at a CAGR of XX% during the forecast period 2025-2033. The growth of the market is attributed to the increasing number of data centers and the need to maintain these facilities in a clean environment. Data centers are critical to the functioning of the modern economy, as they house the servers that store and process vast amounts of data. Maintaining these facilities in a clean environment is essential to prevent the accumulation of dust and other contaminants, which can lead to equipment failures and downtime. The market for data center cleaning services is segmented by type, application, and region. By type, the market is segmented into equipment cleaning, ceiling cleaning, floor cleaning, and others. Equipment cleaning is the largest segment of the market, accounting for over XX% of the total market revenue in 2025. By application, the market is segmented into the internet industry, finance and insurance, manufacturing industry, government departments, and others. The internet industry is the largest segment of the market, accounting for over XX% of the total market revenue in 2025. By region, the market is segmented into North America, South America, Europe, the Middle East & Africa, and Asia Pacific. North America is the largest segment of the market, accounting for over XX% of the total market revenue in 2025.
COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam
microdata.worldbank.org
catalog.ihsn.org
Updated Oct 26, 2023
COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
Dataset updated
Oct 26, 2023
Dataset authored and provided by
World Bank (http://worldbank.org/)
Time period covered
2020
Area covered
Vietnam
Description
Geographic coverage

National, regional

Analysis unit

Households

Kind of data

Sample survey data [ssd]

Sampling procedure

The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of total communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the households in the large module of the baseline survey for the official VHFPS interviews and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
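
The two-stage selection described above (one EA per commune, then 15 households per EA) can be sketched in Python. This is only an illustration under stated assumptions, not the survey team's code: the function name `two_stage_sample`, the nested-dictionary layout, and the fixed seed are all hypothetical.

```python
import random


def two_stage_sample(communes, households_per_ea=15, seed=0):
    """Pick one EA per commune, then a fixed number of households in that EA.

    `communes` maps commune name -> {EA name -> list of household IDs}.
    This layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for commune, eas in communes.items():
        # Stage 1: choose one enumeration area at random within the commune.
        ea = rng.choice(sorted(eas))
        households = eas[ea]
        # Stage 2: choose households without replacement within that EA.
        k = min(households_per_ea, len(households))
        sample[(commune, ea)] = rng.sample(households, k)
    return sample


# Hypothetical toy frame: 2 communes, 2 EAs each, 20 households per EA.
toy_frame = {
    "commune_A": {"EA1": list(range(100, 120)), "EA2": list(range(200, 220))},
    "commune_B": {"EA1": list(range(300, 320)), "EA2": list(range(400, 420))},
}
toy_sample = two_stage_sample(toy_frame)
```

With 3,132 communes and 15 households each, this design yields 3,132 × 15 = 46,980 households, which matches the baseline figure quoted above.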

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire for Round 2 consisted of the following sections:

Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES

Cleaning operations

Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset of households interviewed in Vietnamese.
• Remove unnecessary variables that were automatically calculated by SurveyCTO.
• Remove duplicate households where the same form was submitted more than once.
• Remove observations of households that were not supposed to be interviewed under the identified replacement procedure.
• Format variables according to their type (string, integer, decimal, etc.).
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team could decide which code best suited the answer.
• Correct data based on supervisors’ notes where enumerators entered the wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked these answers thoroughly to decide whether each one needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie below the 5th or above the 95th percentile, by listening to interview recordings.
• Perform a final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
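
Two of the steps above, removing duplicate form submissions and flagging values outside the 5th to 95th percentile range for manual review, can be sketched as follows. This is a minimal illustration, not the survey team's actual procedure: the field name `household_id`, the function names, and the choice of the "inclusive" percentile method are assumptions.

```python
from statistics import quantiles


def drop_duplicate_submissions(records, key="household_id"):
    """Keep the first submission per household; drop repeated forms."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def flag_outliers(values, low_pct=5, high_pct=95):
    """Return indices of values below the 5th or above the 95th percentile.

    Values are only flagged, not removed, because the text says outliers
    were checked against interview recordings rather than dropped.
    """
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return [i for i, v in enumerate(values) if v < low or v > high]
```

Flagging rather than deleting keeps the decision with a human reviewer, which mirrors the recording-based check described in the list.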
Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...
datarade.ai
.json, .csv
Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
Available download formats: .json, .csv
Dataset authored and provided by
Coresignal
Area covered
Hungary, Guatemala, Guinea-Bissau, Namibia, Saint Barthélemy, Niue, Panama, Guadeloupe, Chile, Andorra
Description
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).

Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
Additional file 1 of Grouped data with survey revision
figshare.com
springernature.figshare.com
txt
Updated Aug 13, 2024
Chung-Han Liang; Da-Wei Wang; Mei-Lien Pan (2024). Additional file 1 of Grouped data with survey revision [Dataset]. http://doi.org/10.6084/m9.figshare.26561521.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26561521.v1
Dataset updated
Aug 13, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Chung-Han Liang; Da-Wei Wang; Mei-Lien Pan
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. Data and estimation in the simulation study.
Data from: Datasets for lot sizing and scheduling problems in the...
data.mendeley.com
narcis.nl
Updated Jan 19, 2021
Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
Unique identifier
https://doi.org/10.17632/j2x3gbskfw.1
Dataset updated
Jan 19, 2021
Authors
Juan Piñeros
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. Some papers in the literature have therefore proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets that researchers use to verify the accuracy and performance of proposed methods or that serve as a benchmark for other researchers considering this problem. Authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate how effectively the methods proposed in the scientific literature solve real scenarios of the problem. The datasets presented here are based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].
[1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. doi: 10.1016/j.compchemeng.2020.107038.
[2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. doi: 10.1007/s10696-017-9303-9.
[3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
[4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).
The Surface Water Chemistry (SWatCh) database
data.niaid.nih.gov
zenodo.org
Updated Apr 26, 2022
Heubach, Franz (2022). The Surface Water Chemistry (SWatCh) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4559695
Dataset updated
Apr 26, 2022
Dataset provided by
Heubach, Franz
Rotteveel, Lobke
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset presented in the following manuscript: The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research, which is currently under review at Earth System Science Data.

Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and determine critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database. We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database containing information on sites, methods, and samples, and a GIS shapefile of site locations. We remove poor quality data (for example, values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33,722 sites, and over 5 million samples collected between 1960 and 2022. Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. 
By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
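
The cleaning steps named in the description (dropping values flagged as “suspect” or “rejected” and standardizing units) might look like the sketch below. It is an illustration only and does not reproduce the SWatCh pipeline: the record fields `flag`, `unit`, and `value`, the conversion table, and the decision to skip unknown units are all assumptions.

```python
# Flags treated as poor quality, taken from the examples in the description.
REJECT_FLAGS = {"suspect", "rejected"}

# Hypothetical conversion factors to a single target unit (mg/L).
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 0.001, "g/L": 1000.0}


def clean_samples(samples):
    """Drop flagged samples and convert concentrations to mg/L."""
    cleaned = []
    for sample in samples:
        flag = (sample.get("flag") or "").lower()
        if flag in REJECT_FLAGS:
            continue  # poor-quality value: remove it
        factor = TO_MG_PER_L.get(sample["unit"])
        if factor is None:
            continue  # unknown unit: skipped here, an assumption of this sketch
        cleaned.append({**sample, "value": sample["value"] * factor, "unit": "mg/L"})
    return cleaned
```

A real pipeline would more likely log or quarantine records with unknown units than silently skip them; the skip keeps the sketch short.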
Data Preparation Tools Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 6, 2025
AMA Research & Media LLP (2025). Data Preparation Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/data-preparation-tools-51852
Available download formats: pdf, doc, ppt
Dataset updated
Mar 6, 2025
Dataset provided by
AMA Research & Media LLP
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Data Preparation Tools market is experiencing robust growth, projected to reach a market size of $3 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 17.7% from 2025 to 2033. This significant expansion is driven by several key factors. The increasing volume and velocity of data generated across industries necessitates efficient and effective data preparation processes to ensure data quality and usability for analytics and machine learning initiatives. The rising adoption of cloud-based solutions, coupled with the growing demand for self-service data preparation tools, is further fueling market growth. Businesses across various sectors, including IT and Telecom, Retail and E-commerce, BFSI (Banking, Financial Services, and Insurance), and Manufacturing, are actively seeking solutions to streamline their data pipelines and improve data governance. The diverse range of applications, from simple data cleansing to complex data transformation tasks, underscores the versatility and broad appeal of these tools. Leading vendors like Microsoft, Tableau, and Alteryx are continuously innovating and expanding their product offerings to meet the evolving needs of the market, fostering competition and driving further advancements in data preparation technology. This rapid growth is expected to continue, driven by ongoing digital transformation initiatives and the increasing reliance on data-driven decision-making. The segmentation of the market into self-service and data integration tools, alongside the varied applications across different industries, indicates a multifaceted and dynamic landscape. While challenges such as data security concerns and the need for skilled professionals exist, the overall market outlook remains positive, projecting substantial expansion throughout the forecast period. 
The adoption of advanced technologies like artificial intelligence (AI) and machine learning (ML) within data preparation tools promises to further automate and enhance the process, contributing to increased efficiency and reduced costs for businesses. The competitive landscape is dynamic, with established players alongside emerging innovators vying for market share, leading to continuous improvement and innovation within the industry.
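
To make the growth figures concrete, the compound annual growth rate relation, end = start × (1 + CAGR)^years, can be applied to the two numbers quoted above ($3 billion in 2025, 17.7% CAGR through 2033). This is a plain arithmetic sketch; the function names are hypothetical, and the 2033 value is an extrapolation from those two figures alone, not a number stated in the report.

```python
def project_value(start_value, cagr, years):
    """Compound a starting value forward at a constant annual growth rate."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Recover the constant annual growth rate linking two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


size_2025 = 3.0  # USD billion, as quoted in the description
projected_2033 = project_value(size_2025, 0.177, 2033 - 2025)
```

Under these inputs the projection comes to roughly USD 11 billion in 2033, and `implied_cagr(size_2025, projected_2033, 8)` returns the original 17.7% rate.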
Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s001
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
COVID-19 Case Surveillance Public Use Data
data.cdc.gov
data.virginia.gov
Updated Jul 9, 2024
CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/widgets/vbim-akqf
Available download formats: json, application/rdfxml, csv, xml, tsv, application/rssxml
Dataset updated
Jul 9, 2024
Dataset provided by
Centers for Disease Control and Prevention (http://www.cdc.gov/)
Authors
CDC Data, Analytics and Visualization Task Force
License
https://www.usa.gov/government-works
Description
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

CDC has three COVID-19 case surveillance datasets:
COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (33 data elements)
The following apply to all three datasets:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data cells are suppressed to protect individual privacy.
The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the current datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html.
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html
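
The 14-day lag rule described above can be illustrated with a minimal sketch. The field names `received_dt` and `onset_dt` and the function `eligible_cases` are hypothetical and are not taken from the CDC data dictionary. The sketch assumes that a record is included when its earliest available date is at least 14 days before the dataset creation date.

```python
from datetime import date, timedelta

LAG = timedelta(days=14)  # lag stated in the dataset description

def eligible_cases(cases, created_on):
    """Keep cases whose earliest available date is at least 14 days before created_on."""
    cutoff = created_on - LAG
    kept = []
    for case in cases:
        # Hypothetical field names: date received by CDC and illness/specimen date.
        dates = [d for d in (case.get("received_dt"), case.get("onset_dt")) if d is not None]
        if dates and min(dates) <= cutoff:
            kept.append(case)
    return kept

cases = [
    {"id": 1, "received_dt": date(2023, 6, 1), "onset_dt": None},
    {"id": 2, "received_dt": date(2023, 7, 10), "onset_dt": date(2023, 7, 5)},
    {"id": 3, "received_dt": date(2023, 7, 10), "onset_dt": date(2023, 6, 20)},
]
kept = eligible_cases(cases, created_on=date(2023, 7, 13))
```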

Overview

The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

For more information: NNDSS Supports the COVID-19 Response | CDC.

The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

COVID-19 Case Reports

COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

Data are Considered Provisional

The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

Data Limitations

To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

Data Quality Assurance Procedures

CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
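
The first two cleaning steps above can be sketched as follows. This is an illustration only, not CDC's actual procedure: the field names `hosp_yn` and `onset_dt` and the function `clean_record` are assumptions made for the example.

```python
from datetime import date

MISSING = "Missing"

def clean_record(record, today, answer_fields=("hosp_yn",), date_fields=("onset_dt",)):
    """Return a cleaned copy of one case record (a dict); the input is not modified."""
    cleaned = dict(record)
    for field in answer_fields:
        value = cleaned.get(field)
        # A blank answer (None, empty, or whitespace only) is recoded to Missing.
        if value is None or str(value).strip() == "":
            cleaned[field] = MISSING
    for field in date_fields:
        value = cleaned.get(field)
        # A date in the future is illogical, so it is set to null (None)
        # until the reporting jurisdiction supplies a corrected value.
        if value is not None and value > today:
            cleaned[field] = None
    return cleaned

raw = {"hosp_yn": "", "onset_dt": date(2030, 1, 1)}
cleaned = clean_record(raw, today=date(2023, 7, 13))
```

Answers such as "Yes", "No" or "Unknown" pass through unchanged; only blanks are recoded.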

Data Suppression

To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
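
A minimal sketch of low-frequency suppression, under stated assumptions: the field names are hypothetical, and the choice to recode all three demographic fields of a rare combination to NA is an assumption, since the description does not say which cells are recoded. Records are recoded, never dropped.

```python
from collections import Counter

SUPPRESSED = "NA"
THRESHOLD = 5  # combinations seen in fewer than 5 records count as low frequency

def suppress_rare(records, keys=("sex", "age_group", "race_ethnicity")):
    """Recode rare demographic combinations to NA; the number of records is unchanged."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    result = []
    for r in records:
        copy = dict(r)
        if counts[tuple(r[k] for k in keys)] < THRESHOLD:
            for k in keys:
                copy[k] = SUPPRESSED
        result.append(copy)
    return result

records = [{"sex": "F", "age_group": "18-49", "race_ethnicity": "A"} for _ in range(6)]
records.append({"sex": "M", "age_group": "65+", "race_ethnicity": "B"})
result = suppress_rare(records)
```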

For questions, please contact Ask SRRG (eocevent394@cdc.gov).

Additional COVID-19 Data

COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
Variation in methods, results and reporting in electronic health...
plos.figshare.com
pdf
Updated Jun 10, 2023
Cite
Samantha S. R. Crossfield; Lana Yin Hui Lai; Sarah R. Kingsbury; Paul Baxter; Owen Johnson; Philip G. Conaghan; Mar Pujades-Rodriguez (2023). Variation in methods, results and reporting in electronic health record-based studies evaluating routine care in gout: A systematic review [Dataset]. http://doi.org/10.1371/journal.pone.0224272
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0224272
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Samantha S. R. Crossfield; Lana Yin Hui Lai; Sarah R. Kingsbury; Paul Baxter; Owen Johnson; Philip G. Conaghan; Mar Pujades-Rodriguez
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: To perform a systematic review examining the variation in methods, results, reporting and risk of bias in electronic health record (EHR)-based studies evaluating management of a common musculoskeletal disease, gout.

Methods: Two reviewers systematically searched MEDLINE, Scopus, Web of Science, CINAHL, PubMed, EMBASE and Google Scholar for all EHR-based studies published by February 2019 investigating gout pharmacological treatment. Information was extracted on study design, eligibility criteria, definitions, medication usage, effectiveness and safety data, comprehensiveness of reporting (RECORD), and Cochrane risk of bias (registered PROSPERO CRD42017065195).

Results: We screened 5,603 titles/abstracts, 613 full-texts and selected 75 studies including 1.9M gout patients. Gout diagnosis was defined in 26 ways across the studies, most commonly using a single diagnostic code (n = 31, 41.3%). 48.4% did not specify a disease-free period before ‘incident’ diagnosis. Medication use was suboptimal and varied with disease definition while results regarding effectiveness and safety were broadly similar across studies despite variability in inclusion criteria. Comprehensiveness of reporting was variable, ranging from 73% (55/75) appropriately discussing the limitations of EHR data use, to 5% (4/75) reporting on key data cleaning steps. Risk of bias was generally low.

Conclusion: The wide variation in case definitions and medication-related analysis among EHR-based studies has implications for reported medication use. This is amplified by variable reporting comprehensiveness and the limited consideration of EHR-relevant biases (e.g. data adequacy) in study assessment tools. We recommend accounting for these biases and performing a sensitivity analysis on case definitions, and suggest changes to assessment tools to foster this.
STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe
catalog.ihsn.org
datacatalog.ihsn.org
Updated Jun 26, 2017
Cite
World Health Organization (2017). STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe [Dataset]. https://catalog.ihsn.org/catalog/6968
Dataset updated
Jun 26, 2017
Dataset provided by
World Health Organization (https://who.int/)
Ministry of Health and Child Welfare
Time period covered
2005
Area covered
Zimbabwe
Description
Abstract

Noncommunicable diseases are the leading cause of death. In 2008, more than 36 million people worldwide died of such diseases. Ninety per cent of those lived in low-income and middle-income countries (WHO Maps Noncommunicable Disease Trends in All Countries).

The STEPS Noncommunicable Disease Risk Factor Survey, part of the STEPwise approach to surveillance (STEPS) Adult Risk Factor Surveillance project by the World Health Organization (WHO), is a survey methodology to help countries begin to develop their own surveillance system to monitor and fight against noncommunicable diseases. The methodology prescribes three steps: questionnaire, physical measurements, and biochemical measurements. The steps consist of core items, core variables, and optional modules. Core topics covered by most surveys are demographics, health status, and health behaviors. These provide data on socioeconomic risk factors and metabolic, nutritional, and lifestyle risk factors. Details may differ from country to country and from year to year.

The general objective of the Zimbabwe NCD STEPS survey was to assess the risk factors of selected NCDs in the adult population of Zimbabwe using the WHO STEPwise approach to non-communicable diseases surveillance. The specific objectives were:
- To assess the distribution of life-style factors (physical activity, tobacco and alcohol use), and anthropometric measurements (body mass index and central obesity) which may impact on diabetes and cardiovascular risk factors.
- To identify dietary practices that are risk factors for selected NCDs.
- To determine the prevalence and determinants of hypertension.
- To determine the prevalence and determinants of diabetes.
- To determine the prevalence and determinants of serum lipid profile.

Geographic coverage

Mashonaland Central, Midlands and Matebeleland South Provinces.

Analysis unit

Household Individual

Universe

The survey comprised individuals aged 25 years and over.

Kind of data

Sample survey data [ssd]

Sampling procedure

A multistage sampling strategy with 3 stages consisting of province, district and health centre was employed. The World Health Organization STEPwise Approach (STEPS) was used as the design basis for the survey. The 3 randomly selected provinces for the survey were Mashonaland Central, Midlands and Matebeleland South. In each province four districts were chosen and four health centres were surveyed per district. The survey comprised individuals aged 25 years and over. The survey was carried out on 3,081 respondents consisting of 1,189 from Midlands, 944 from Mashonaland Central and 948 from Matebeleland South. A detailed description of the sampling process is provided in sections 3.8-3.9 of the survey report provided under the related materials tab.

Sampling deviation

Designing a community-based survey such as this one is fraught with difficulties in ensuring representativeness of the sample chosen. In this survey there was a preponderance of female respondents because of the pattern of employment of males and females which also influences urban rural migration.

The response rate in Midlands was lower than the other two provinces in both STEP 2 and 3. This notable difference was due to the fact that Midlands had more respondents sampled from the urban communities. A higher proportion of urban respondents was formally employed and therefore did not complete STEP 2 and 3 due to conflict with work schedules.

Mode of data collection

Face-to-face [f2f]

Research instrument

In this survey all the core and selected expanded and optional variables were collected. In addition a food frequency questionnaire and a UNICEF developed questionnaire, the Fortification Rapid Assessment Tool (FRAT) were administered to elicit relevant dietary information.

Cleaning operations

Data entry for Step 1 and Step 2 data was carried out as soon as data became available to the data management team. Step 3 data became available in October and data entry was carried out when data quality checks were completed in November. Report writing started in September and a preliminary report became available in December 2005.

Training of data entry clerks

Five data entry clerks were recruited and trained for one week. The selection of data entry clerks was based on their performance during previous research carried out by the MOH&CW. The training of the data entry clerks involved the following:
- Familiarization with the NCD, FRAT and FFQ questionnaires.
- Familiarization with the data entry template.
- Development of codes for open-ended questions.
- Statistical package (EPI Info 6).
- Development of a data entry template using EPI6.
- Development of check files for each template.
- Trial runs (mock runs) to check whether the template was complete and user friendly for data entry.
- Double entry (what it involves, how to do it and why it should be done).
- Pre-primary data cleaning (checking whether denominators tally) of the data entry template.

Data Entry for NCD, FRAT and FFQ questionnaires

The questionnaires were sequentially numbered and were then divided among the five data entry clerks. Each one of the data entry clerks had a unique identifier for quality control purposes. Hence, the data was entered into five separate files using the statistical package EPI Info version 6.0. The data entry clerks inter-changed their files for double entry and validation of the data. Preliminary data cleaning was done for each of the five files. The five files were then merged to give a single file. The merged file was then transferred to STATA Version 7.0 using Stat Transfer version 5.0.
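
Double entry validation can be illustrated with a small sketch. This is not the EPI Info validation routine used in the survey; the data layout (a dictionary of records keyed by questionnaire number) and the function name `compare_double_entry` are assumptions made for the example.

```python
def compare_double_entry(first_pass, second_pass):
    """Return (record_id, field, first_value, second_value) for every mismatch."""
    mismatches = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(record_id, {})
        b = second_pass.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            # A value that differs between the two passes (or is absent from
            # one of them) needs checking against the paper questionnaire.
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches

first_pass = {1: {"age": 34, "province": "Midlands"}, 2: {"age": 51, "province": "Midlands"}}
second_pass = {1: {"age": 34, "province": "Midlands"}, 2: {"age": 15, "province": "Midlands"}}
mismatches = compare_double_entry(first_pass, second_pass)
```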

Data Cleaning

A data-cleaning workshop was held with the core research team members. The objectives of the workshop were:
1. To check all data entry errors.
2. To assess any inconsistencies in data filling.
3. To assess any inconsistencies in data entry.
4. To assess completeness of the data entered.

Data Merging

There were two datasets (NCD questionnaire dataset and laboratory dataset) after the data entry process. The two files were merged by joining corresponding observations from the NCD questionnaire dataset with those from the laboratory dataset into single observations using a unique identifier. The ID number was chosen as the unique identifier since it appeared in both data sets. The main aim of merging was to combine the two datasets containing information on behaviour of individuals and the NCD laboratory parameters. When the two data sets were merged, a new merge variable was created. The merge variable took values 1, 2 and 3:
- Merge variable==1: Observation appeared in the NCD questionnaire data set but a corresponding observation was not in the laboratory data set.
- Merge variable==2: Observation appeared in the laboratory data set but a corresponding observation did not appear in the questionnaire data set.
- Merge variable==3: Observation appeared in both data sets and reflects a complete merge of the two data sets.
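
The merge variable can be reproduced in a few lines. The survey team worked in STATA; the sketch below is a plain Python illustration of the same 1/2/3 coding, with hypothetical ID numbers, field names and function name.

```python
def merge_with_indicator(questionnaire, laboratory):
    """Join two datasets on ID and record where each observation was found."""
    merged = {}
    for record_id in sorted(set(questionnaire) | set(laboratory)):
        in_questionnaire = record_id in questionnaire
        in_laboratory = record_id in laboratory
        row = {}
        row.update(questionnaire.get(record_id, {}))
        row.update(laboratory.get(record_id, {}))
        # 1 = questionnaire only, 2 = laboratory only, 3 = found in both.
        if in_questionnaire and in_laboratory:
            row["merge"] = 3
        elif in_questionnaire:
            row["merge"] = 1
        else:
            row["merge"] = 2
        merged[record_id] = row
    return merged

questionnaire = {101: {"smoker": "no"}, 102: {"smoker": "yes"}}
laboratory = {102: {"glucose": 5.4}, 103: {"glucose": 6.1}}
merged = merge_with_indicator(questionnaire, laboratory)
# Observations coded 1 or 2 are the ones that need follow-up.
needs_review = [record_id for record_id, row in merged.items() if row["merge"] != 3]
```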

Data Cleaning After Merging

Data cleaning involved identifying the observations where the merge variable values were either 1 or 2. Merge status for each observation was also changed after effecting any corrections. The other variables used in the cleaning were province, district and health centre, since they also appeared in both data sets.

Objectives of cleaning:
1. Match common variables in both data sets and identify inconsistencies in other matching variables e.g. province, district and health centre.
2. To check for any data entry errors.

Response rate

A total of 3,081 respondents were included in the survey against an estimated sample size of 3,000. The response rate was 80% for Step 1 and 70% for Step 2, taking Step 1 accrual as being 100%.
COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam
microdata.worldbank.org
datacatalog.ihsn.org
Updated Oct 26, 2023
Cite
World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/3813
Dataset updated
Oct 26, 2023
Dataset authored and provided by
World Bank (http://worldbank.org/)
Time period covered
2020
Area covered
Vietnam
Description
Abstract

The main objective of this project is to collect household data for the ongoing assessment and monitoring of the socio-economic impacts of COVID-19 on households and family businesses in Vietnam. The estimated field work and sample size of households in each round is as follows:

- Round 1: June fieldwork, approximately 6300 households (at least 1300 minority households)
- Round 2: August fieldwork, approximately 4000 households (at least 1000 minority households)
- Round 3: September fieldwork, approximately 4000 households (at least 1000 minority households)
- Round 4: December, approximately 4000 households (at least 1000 minority households)
- Round 5: pending discussion

Geographic coverage

National, regional

Analysis unit

Households

Kind of data

Sample survey data [ssd]

Sampling procedure

The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. Out of the 15 households, 3 households have information collected on both income and expenditure (large module) as well as many other aspects. The remaining 12 other households have information collected on income, but do not have information collected on expenditure (small module). Therefore, estimation of large module includes 9396 households and are representative at regional and national levels, while the whole sample is representative at the provincial level.
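
The sample sizes quoted above follow directly from the design (one EA per commune, 15 households per EA, of which 3 are large-module households). A quick arithmetic check, with variable names chosen only for this illustration:

```python
communes = 3132            # communes in the 2018 baseline survey
households_per_ea = 15     # households selected in the one EA per commune
large_module_per_ea = 3    # households with both income and expenditure data
small_module_per_ea = households_per_ea - large_module_per_ea

total_households = communes * households_per_ea   # whole baseline sample
large_module = communes * large_module_per_ea     # large-module households
small_module = communes * small_module_per_ea     # small-module (reserve) households
```

The totals agree with the 46980 baseline households and the 9396 large-module households reported in the description.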

We use the large module to select the households for official interview of the VHFPS survey and the small module households as reserve for replacement. The large module has a sample size of 9396 households, of which 7951 households have a phone number (cell phone or landline).

After data processing, the final sample size is 6,213 households.

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire for Round 1 consisted of the following sections:
- Section 2. Behavior
- Section 3. Health
- Section 4. Education & Child caring
- Section 5A. Employment (main respondent)
- Section 5B. Employment (other household member)
- Section 6. Coping
- Section 7. Safety Nets
- Section 8. FIES

Cleaning operations

Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO.
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.).
• Read through the interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code is most suitable for that answer.
• Correct data based on the supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie below the 5th percentile or above the 95th percentile, by listening to interview recordings.
• Final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files and in long form.
• Label variables using the full question text.
• Label variable values where necessary.
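
Two of the steps above, removing duplicate submissions and flagging outliers for review, can be sketched as follows. The field names and function names are hypothetical, keeping the first submission of a duplicated form is an assumption, and the percentile definition is the default of Python's `statistics.quantiles`, which may differ from the one the survey team used. The sketch only flags values, since the description says outliers were examined against interview recordings rather than removed.

```python
from statistics import quantiles

def drop_duplicates(rows, key="household_id"):
    """Keep the first submission for each household; later resubmissions are dropped."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def flag_outliers(rows, field):
    """Return rows whose value lies below the 5th or above the 95th percentile."""
    values = sorted(row[field] for row in rows)
    cuts = quantiles(values, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    low, high = cuts[0], cuts[-1]
    return [row for row in rows if row[field] < low or row[field] > high]

rows = [{"household_id": i, "income": i} for i in range(1, 101)]
rows.append({"household_id": 7, "income": 7})  # the same form submitted twice
unique_rows = drop_duplicates(rows)
to_review = flag_outliers(unique_rows, "income")
```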

Response rate

The target for Round 1 is to complete interviews for 6300 households, of which 1888 households are located in urban area and 4475 households in rural area. In addition, at least 1300 ethnic minority households are to be interviewed. A random selection of 6300 households was made out of 7951 households for official interview and the rest as for replacement. However, the refusal rate of the survey was about 27 percent, and households from the small module in the same EA were contacted for replacement and these households are also randomly selected.
LScDC (Leicester Scientific Dictionary-Core)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
Cite
Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
Available download formats: docx
Unique identifier
https://doi.org/10.25392/leicester.data.9896579.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LScDC (Leicester Scientific Dictionary-Core Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3]

The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below. We did not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below.

Number of words:
LScD (v3): 972,060
LScDC (v3): 103,998

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2] Getting Started

This file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), explains steps for sub-setting the LScD and basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the CSV file published. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts.

The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, the use of an enormous amount of text data challenges the performance and accuracy of data mining applications. The performance and the accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare occurrence of words in a collection is not useful in discriminating texts in large corpora as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

To build the LScDC, we decided the following process on LScD: removing words that appear in no more than 10 documents (
Global Data Cleansing Tools Market Research and Development Focus 2025-2032
statsndata.org
excel, pdf
Updated Feb 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stats N Data (2025). Global Data Cleansing Tools Market Research and Development Focus 2025-2032 [Dataset]. https://www.statsndata.org/report/data-cleansing-tools-market-339171
Available download formats: excel, pdf
Dataset updated
Feb 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-order
Area covered
Global
Description
The Data Cleansing Tools market is rapidly evolving as businesses increasingly recognize the importance of data quality in driving decision-making and strategic initiatives. Data cleansing, also known as data scrubbing or data cleaning, involves the process of identifying and correcting errors and inconsistencies in
Influence of slow sand filter cleaning process type on filter media biomass:...
narcis.nl
data.mendeley.com
Updated Oct 28, 2020
Cite
de Souza, F (via Mendeley Data) (2020). Influence of slow sand filter cleaning process type on filter media biomass: scraping versus backwashing - SEM images [Dataset]. http://doi.org/10.17632/b26d6fbg2t.2
Unique identifier
https://doi.org/10.17632/b26d6fbg2t.2
Dataset updated
Oct 28, 2020
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
de Souza, F (via Mendeley Data)
Description
The use of backwashing in slow sand filters was developed to simplify the slow sand filter cleaning process. This study aimed to assess biomass in backwashed slow sand filters and compare it with scraping. These data comprise Scanning Electron Microscopy (SEM) images of the slow sand filter media used in the backwashing study. Samples were taken before and after cleaning, and at different filtration depths (0 cm, 5 cm and 30 cm), from two types of slow sand filters: one scraped conventional slow sand filter (ScSF) and one backwashed slow sand filter (BSF). The micrographs presented here show different materials attached to the sand used as filtration media, such as biomass. It was possible to conclude that biomass accumulates in the top filtration layers and that scraping removed more biomass than backwashing. (v.2, title changed)
STEP Skills Measurement Household Survey 2012 (Wave 1) - Bolivia
catalog.ihsn.org
datacatalog.ihsn.org
Updated Mar 29, 2019
Cite
World Bank (2019). STEP Skills Measurement Household Survey 2012 (Wave 1) - Bolivia [Dataset]. https://catalog.ihsn.org/index.php/catalog/4780
Dataset updated
Mar 29, 2019
Dataset authored and provided by
World Bank (http://worldbank.org/)
Time period covered
2012
Area covered
Bolivia
Description
Abstract

The STEP (Skills Toward Employment and Productivity) Measurement program is the first ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in labor market of low-income countries.

The uniquely-designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adults Skills instruments. Modules also gather information about family, health and language.

Geographic coverage

The cities that are covered are La Paz, El Alto, Cochabamba and Santa Cruz de la Sierra.

Analysis unit

The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members 15 to 64 years old. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.

Universe

The STEP target population is the population 15-64 years old, living in urban areas, as defined by each country's statistical office. The following are excluded from the sample:
- Residents of institutions (prisons, hospitals, etc.)
- Residents of senior homes and hospices
- Residents of other group dwellings such as college dormitories, halfway homes, workers' quarters, etc.
- Persons living outside the country at the time of data collection

Kind of data

Sample survey data [ssd]

Sampling procedure

A stratified 3-stage sample design was implemented in Bolivia. The stratification variable is city-wealth category. There are 20 strata created by grouping the primary sample units (PSUs) into the 4 cities (1-La Paz, 2-El Alto, 3-Cochabamba, 4-Santa Cruz de la Sierra) and 5 wealth categories (1-Poorest, 2-Moderately Poor, 3-Middle Wealth, 4-Moderately Rich, 5-Rich).

The source of the sample frame of the first stage units is the 2001 National Census of Population and Housing carried out by the National Institute of Statistics. The primary sample unit (PSU) is a Census Sector. A sample of 218 PSUs was selected from the 10,304 PSUs on the sample frame. This sample of PSUs was comprised of 160 'initial' PSUs and 58 'reserve' PSUs. Of the 218 sampled PSUs, there were 169 activated PSUs consisting of 155 Initial Sampled PSUs and 14 Reserve sampled PSUs. Among the 160 'initial' PSUs, 5 PSUs were replaced due to security concerns; also, 14 reserve PSUs were activated to supplement the sample for initial PSUs where the target sample of 15 interviews was not achieved due to high levels of non-response; thus, only 169 PSUs were actually activated during data collection. The PSUs were grouped according to city-wealth strata, and within each city-wealth stratum PSUs were selected with probability proportional to size (PPS), where the measure of size was the number of households in a PSU.
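
Selection with probability proportional to size can be sketched with the systematic PPS method, one common way to implement it. The description does not say which PPS algorithm was used, so the method, the function name and the household counts below are assumptions for illustration only.

```python
import random
from itertools import accumulate

def pps_systematic_sample(sizes, n, rng):
    """Select n unit indices with probability proportional to size (systematic PPS)."""
    total = sum(sizes)
    step = total / n
    start = rng.random() * step          # random start in [0, step)
    cumulative = list(accumulate(sizes))
    selected = []
    unit = 0
    for k in range(n):
        point = start + k * step
        # Move to the unit whose cumulative size range contains the point.
        while unit < len(sizes) - 1 and cumulative[unit] <= point:
            unit += 1
        selected.append(unit)
    return selected

# Hypothetical household counts for five PSUs in one stratum.
sizes = [120, 80, 300, 50, 450]
sample = pps_systematic_sample(sizes, 2, random.Random(0))
```

A unit larger than the sampling step can be hit more than once with this method; real designs usually handle such units separately.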

The second stage sample unit (SSU) is a household. The sampling objective was to obtain interviews at 15 households within each initial PSU, resulting in an initial sample of 2,400 interviews. At the second stage of sample selection, 45 households were selected in each PSU using a systematic random method. The 45 households were randomly divided into 15 'Initial' households and 30 'Reserve' households, which were ranked according to the random sample selection order. Note: Due to higher than expected levels of non-response in some PSUs, additional households were sampled; thus, the final actual sample in some PSUs exceeded 45 households.
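The second-stage procedure can be sketched roughly as follows, assuming a complete household listing per PSU. The exact systematic method and ranking rule used in the field are not given here, so the function below (name, parameters, listing format) is an assumption for illustration only.

```python
import random

def select_households(household_ids, rng, n_total=45, n_initial=15):
    """Systematic random sample of n_total households from a PSU listing,
    then a random split into initial and ranked reserve households."""
    if len(household_ids) < n_total:
        raise ValueError("listing is smaller than the requested sample")
    interval = len(household_ids) / n_total
    start = rng.random() * interval               # random start in [0, interval)
    picked = [household_ids[int(start + k * interval)]
              for k in range(n_total)]
    rng.shuffle(picked)                           # random selection order
    initial = picked[:n_initial]                  # first 15 are 'Initial'
    reserve = picked[n_initial:]                  # remaining 30, in rank order
    return initial, reserve

# Hypothetical PSU listing of 300 households.
listing = [f"hh_{i:03d}" for i in range(300)]
initial, reserve = select_households(listing, random.Random(42))
print(len(initial), len(reserve))
```

Reserve households would then be activated in list order whenever an initial household cannot be interviewed.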

The third stage sample unit was an individual 15-64 years old (inclusive). The sampling objective was to select one individual with equal probability from each selected household.
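A minimal sketch of the third-stage selection, assuming the household roster is available as a list of members with an age field. The field names and roster format are hypothetical; the STEP team's actual selection procedure is documented separately.

```python
import random

def select_respondent(roster, rng, min_age=15, max_age=64):
    """Return one eligible household member chosen with equal probability,
    or None when no member is in the 15-64 age range (inclusive)."""
    eligible = [m for m in roster if min_age <= m["age"] <= max_age]
    if not eligible:
        return None
    return rng.choice(eligible)

# Hypothetical roster: only the members aged 30 and 45 are eligible.
roster = [
    {"member_id": 1, "age": 10},
    {"member_id": 2, "age": 30},
    {"member_id": 3, "age": 70},
    {"member_id": 4, "age": 45},
]
respondent = select_respondent(roster, random.Random(7))
print(respondent["member_id"])
```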

Mode of data collection

Face-to-face [f2f]

Research instrument

The STEP survey instruments include:

- The background questionnaire developed by the World Bank (WB) STEP team
- The Reading Literacy Assessment developed by Educational Testing Services (ETS)

All countries adapted and translated both instruments following the STEP technical standards: two independent translators adapted and translated the STEP background questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator.

The survey instruments were piloted as part of the survey pre-test.

The background questionnaire covers such topics as respondents' demographic characteristics, dwelling characteristics, education and training, health, employment, job skill requirements, personality, behavior and preferences, language and family background.

The background questionnaire, the structure of the Reading Literacy Assessment and Reading Literacy Data Codebook are provided in the document "Bolivia STEP Skills Measurement Survey Instruments", available in external resources.

Cleaning operations

STEP data management process:

1) Raw data is sent by the survey firm.
2) The World Bank (WB) STEP team runs data checks on the background questionnaire data. Educational Testing Services (ETS) runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3) The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4) The WB STEP team and ETS check if the data files are clean. This might require additional iterations with the survey firm.
5) Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6) ETS scales the Reading Literacy Assessment data.
7) The WB STEP team merges the background questionnaire data with the Reading Literacy Assessment data and computes derived variables.

Detailed information on data processing in STEP surveys is provided in the "STEP Guidelines for Data Processing" document, available in external resources. The template do-file used by the STEP team to check raw background questionnaire data is also provided as an external resource.

Response rate

An overall response rate of 43% was achieved in the Bolivia STEP Survey. All non-response cases were documented (refusal, not found, no eligible household member, etc.) and accounted for during the weighting process. In such cases, a reserve household was activated to replace the initial household. Procedures are described in the "Operation Manual", which is provided as an external resource.

Sampling error estimates

Weighting documentation was prepared for each participating country and provides some information on sampling errors. The weighting documentation for all countries is provided as an external resource.
General Household Survey-Panel Wave 3 (Post Harvest) 2015-2016 - Nigeria
microdata.fao.org
Updated Jul 17, 2019
National Bureau of Statistics (NBS) (2019). General Household Survey-Panel Wave 3 (Post Harvest) 2015-2016 - Nigeria [Dataset]. https://microdata.fao.org/index.php/catalog/930
Dataset updated
Jul 17, 2019
Dataset provided by
National Bureau of Statistics, Nigeria
Authors
National Bureau of Statistics (NBS)
Time period covered
2016
Area covered
Nigeria
Description
Abstract

The Nigerian General Household Survey (GHS) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program and was revised in 2010 to include a panel component (GHS-Panel). The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, inter-institutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of 5,000 households, which are also representative of the geopolitical zones (at both the urban and rural level). The households included in the GHS-Panel are a sub-sample of the overall GHS sample households (22,000). This survey is the third wave of the GHS-Panel, and was implemented in 2015-2016.

Geographic coverage

National Coverage Sector

Analysis unit

Households

Universe

Household Members

Kind of data

Sample survey data [ssd]

Sampling procedure

The GHS-Panel sample is fully integrated with the 2010 GHS Sample. The GHS sample comprises 60 Primary Sampling Units (PSUs) or Enumeration Areas (EAs) chosen from each of the 37 states in Nigeria. This results in a total of 2,220 EAs nationally. Each EA contributes 10 households to the GHS sample, resulting in a sample size of 22,200 households. Out of these 22,200 households, 5,000 households from 500 EAs were selected for the panel component, and 4,916 households completed their interviews in the first wave. Given the panel nature of the survey, some households had moved and could not be located by the time of the Wave 3 visit, resulting in a slightly smaller sample of 4,581 households for Wave 3.
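The sample-size figures in this paragraph can be checked with a few lines of arithmetic. The variable names are chosen for this example, and the two percentages are derived here from the counts above rather than quoted from the survey documentation.

```python
eas_per_state = 60
states = 37
households_per_ea = 10

total_eas = eas_per_state * states                 # 2220
ghs_households = total_eas * households_per_ea     # 22200

panel_selected = 5000
wave1_completed = 4916
wave3_completed = 4581

# Share of selected panel households interviewed in Wave 1,
# and share of Wave 1 households still interviewed in Wave 3.
wave1_rate = wave1_completed / panel_selected
wave3_retention = wave3_completed / wave1_completed

print(total_eas, ghs_households)                   # 2220 22200
print(f"{wave1_rate:.1%} {wave3_retention:.1%}")   # 98.3% 93.2%
```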

In order to collect detailed and accurate information on agricultural activities, GHS-Panel households are visited twice: first after the planting season (post-planting) between August and October and second after the harvest season (post-harvest) between February and April. All households are visited twice regardless of whether they participated in agricultural activities. Some important factors such as labour, food consumption, and expenditures are collected during both visits. Unless otherwise specified, the majority of the report will focus on the most recent information, collected during the post-harvest visit.

Mode of data collection

Face-to-face paper [f2f]

Cleaning operations

The data cleaning process was done in a number of stages. The first step was to ensure proper quality control during the fieldwork. This was achieved in part by using the concurrent data entry system, which was designed to highlight many of the errors that occurred during the fieldwork. Errors caught at the fieldwork stage were corrected based on re-visits to the household on the instruction of the supervisor. The data that had gone through this first stage of cleaning were then sent from the state to the head office of NBS, where a second stage of data cleaning was undertaken. During the second stage the data were examined for out-of-range values and outliers. The data were also examined for missing information for required variables, sections, questionnaires and EAs. Any problems found were reported back to the state, where the correction was made. This was an ongoing process until all data were delivered to the head office.
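The second-stage checks described above (out-of-range values and missing required information) could look roughly like the sketch below. The field names, valid ranges and record layout are hypothetical and are not taken from the NBS cleaning programs.

```python
def check_record(record, ranges, required):
    """Return a list of problems found in one questionnaire record."""
    problems = []
    # Required variables must be present and non-empty.
    for field in required:
        if record.get(field) in (None, ""):
            problems.append(f"{field}: missing")
    # Present values must fall inside their allowed range.
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value not in (None, "") and not (low <= value <= high):
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems

# Hypothetical rules and a record with two problems.
ranges = {"age": (0, 120), "hh_size": (1, 50)}
required = ["hhid", "age", "hh_size"]
record = {"hhid": 10234, "age": 130, "hh_size": None}

issues = check_record(record, ranges, required)
print(issues)
```

A list of such problems per record is the kind of report that could be sent back to the state office for correction.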

After all the data were received by the head office, there was an overall review of the data to identify outliers and other errors on the complete set of data. Where problems were identified, this was reported to the state. There the questionnaires were checked and where necessary the relevant households were revisited and a report sent back to the head office with the corrections.

The final stage of the cleaning process was to ensure that the household- and individual-level datasets were correctly merged across all sections of the household questionnaire. Special care was taken to see that the households included in the data matched the selected sample, and where there were differences these were properly assessed and documented. The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. The same check was applied to the crop-by-plot information.
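A merge check of this kind can be sketched with set comparisons on household identifiers. The identifier name `hhid`, the function and the sample data are assumptions for illustration; the same pattern would apply to plot or crop-by-plot identifiers.

```python
def check_merge(household_ids, individual_rows, sample_ids):
    """Compare household ids across files and against the selected sample."""
    households = set(household_ids)
    individual_households = {row["hhid"] for row in individual_rows}
    sample = set(sample_ids)
    return {
        "individuals_without_household": sorted(individual_households - households),
        "households_without_individuals": sorted(households - individual_households),
        "not_in_selected_sample": sorted(households - sample),
        "selected_but_absent": sorted(sample - households),
    }

# Hypothetical data: household 3 has no individuals, individual rows
# reference an unknown household 9, and household 4 was selected but
# is absent from the household file.
report = check_merge(
    household_ids=[1, 2, 3],
    individual_rows=[{"hhid": 1}, {"hhid": 1}, {"hhid": 2}, {"hhid": 9}],
    sample_ids=[1, 2, 3, 4],
)
print(report)
```

Any non-empty list in the report marks a difference that would need to be assessed and documented.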


Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177


141 scholarly articles cite this dataset (View in Google Scholar)
