100+ datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Data Science Platform Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, China, Canada, UK, India, France, Japan, Brazil, UAE - Size and Forecast 2025-2029

    • technavio.com
    Updated Feb 13, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, China, Canada, UK, India, France, Japan, Brazil, UAE - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Explore at:
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United Kingdom, United States, Global
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.

    The market is experiencing significant growth, driven by the integration of artificial intelligence (AI) and machine learning (ML). This enhancement enables more advanced data analysis and prediction capabilities, making data science platforms an essential tool for businesses seeking to gain insights from their data. Another trend shaping the market is the emergence of containerization and microservices in platforms. This development offers increased flexibility and scalability, allowing organizations to efficiently manage their projects. 
    However, the use of these platforms also presents challenges, particularly in the area of data privacy and security. Ensuring the protection of sensitive data is crucial for businesses, and platforms must provide strong security measures to mitigate risks. In summary, the market is witnessing substantial growth due to the integration of AI and ML technologies, containerization, and microservices, while data privacy and security remain key challenges.
    

    What will be the Size of the Data Science Platform Market During the Forecast Period?


    The market is experiencing significant growth due to the increasing demand for advanced data analysis capabilities in various industries. Cloud-based solutions are gaining popularity as they offer scalability, flexibility, and cost savings. The market encompasses the entire project life cycle, from data acquisition and preparation to model development, training, and distribution. Big data, IoT, multimedia, machine data, consumer data, and business data are prime sources fueling this market's expansion. Unstructured data, previously challenging to process, is now being effectively managed through tools and software. Relational databases and machine learning models are integral components of platforms, enabling data exploration, preprocessing, and visualization.
    Moreover, artificial intelligence (AI) and machine learning (ML) technologies are essential for handling complex workflows, including data cleaning, model development, and model distribution. Data scientists benefit from these platforms by streamlining their tasks, improving productivity, and ensuring accurate and efficient model training. The market is expected to continue its growth trajectory as businesses increasingly recognize the value of data-driven insights.
    

    How is this Data Science Platform Industry segmented and which is the largest segment?

    The industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
      On-premises
      Cloud

    Component
      Platform
      Services

    End-user
      BFSI
      Retail and e-commerce
      Manufacturing
      Media and entertainment
      Others

    Sector
      Large enterprises
      SMEs

    Geography
      North America
        Canada
        US
      Europe
        Germany
        UK
        France
      APAC
        China
        India
        Japan
      South America
        Brazil
      Middle East and Africa

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.
    

    On-premises deployment is a traditional method for implementing technology solutions within an organization. This approach involves purchasing software with a one-time license fee and a service contract. On-premises solutions offer enhanced security, as they keep user credentials and data within the company's premises. They can be customized to meet specific business requirements, allowing for quick adaptation. On-premises deployment eliminates the need for third-party providers to manage and secure data, ensuring data privacy and confidentiality. Additionally, it enables rapid and easy data access, and keeps IP addresses and data confidential. This deployment model is particularly beneficial for businesses dealing with sensitive data, such as those in manufacturing and large enterprises. While cloud-based solutions offer flexibility and cost savings, on-premises deployment remains a popular choice for organizations prioritizing data security and control.


    The on-premises segment was valued at USD 38.70 million in 2019 and showed a gradual increase during the forecast period.

    Regional Analysis

    North America is estimated to contribute 48% to the growth of the global market during the forecast period.
    

    Technavio's analysts explain in detail the regional trends and drivers that shape the market during the forecast period.


  3. Data Cleansing Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 23, 2025
    Cite
    Archive Market Research (2025). Data Cleansing Software Report [Dataset]. https://www.archivemarketresearch.com/reports/data-cleansing-software-44630
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Feb 23, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data cleansing software market is expanding rapidly, with a market size of XXX million in 2023 and a projected CAGR of XX% from 2023 to 2033. This growth is driven by the increasing need for accurate and reliable data in various industries, including healthcare, finance, and retail. Key market trends include the growing adoption of cloud-based solutions, the increasing use of artificial intelligence (AI) and machine learning (ML) to automate the data cleansing process, and the increasing demand for data governance and compliance. The market is segmented by deployment type (cloud-based vs. on-premise) and application (large enterprises vs. SMEs vs. government agencies). Major players in the market include IBM, SAS Institute Inc, SAP SE, Trifacta, OpenRefine, Data Ladder, Analytics Canvas (nModal Solutions Inc.), Mo-Data, Prospecta, WinPure Ltd, Symphonic Source Inc, MuleSoft, MapR Technologies, V12 Data, and Informatica. This report provides a comprehensive overview of the global data cleansing software market, with a focus on market concentration, product insights, regional insights, trends, driving forces, challenges and restraints, growth catalysts, leading players, and significant developments.

  4. Data Center Cleaning Service Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jan 24, 2025
    Cite
    Market Research Forecast (2025). Data Center Cleaning Service Report [Dataset]. https://www.marketresearchforecast.com/reports/data-center-cleaning-service-14735
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The market for data center cleaning services is expected to grow from USD XXX million in 2025 to USD XXX million by 2033, at a CAGR of XX% during the forecast period 2025-2033. The growth of the market is attributed to the increasing number of data centers and the need to keep these facilities clean. Data centers are critical to the functioning of the modern economy, as they house the servers that store and process vast amounts of data. Keeping these facilities clean is essential to prevent the accumulation of dust and other contaminants, which can lead to equipment failures and downtime.

    The market is segmented by type, application, and region. By type, the market is segmented into equipment cleaning, ceiling cleaning, floor cleaning, and others; equipment cleaning is the largest segment, accounting for over XX% of total market revenue in 2025. By application, the market is segmented into the internet industry, finance and insurance, manufacturing industry, government departments, and others; the internet industry is the largest segment, accounting for over XX% of total market revenue in 2025. By region, the market is segmented into North America, South America, Europe, the Middle East & Africa, and Asia Pacific; North America is the largest region, accounting for over XX% of total market revenue in 2025.

  5. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large-module households to select the households for official interview in the VHFPS survey, keeping the small-module households in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 5. Employment (main respondent)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps (a short code sketch of the outlier screen follows the list):

    • Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations of households that were not supposed to be interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers’ notes and make adjustments accordingly. Whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent’s answer in detail, so that the survey management team could decide which code suited the answer best.
    • Correct data based on supervisors’ notes where enumerators entered a wrong code.
    • Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to record free text. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or should be kept as originally recorded; an answer that appeared many times in the dataset could be assigned a completely new code.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th-95th percentile range, by listening to interview recordings.
    • Perform a final check on matching the main dataset with the section files; sections where information is collected at the individual level are kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
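    The percentile screen above can be expressed compactly in code. A minimal pandas sketch, with hypothetical file and column names (the survey team used their own tooling, so this is an illustration only):

      import pandas as pd

      def flag_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
          # Keep rows whose value lies outside the 5th-95th percentile range;
          # these are the candidates re-checked against interview recordings.
          lo, hi = df[column].quantile([0.05, 0.95])
          return df[(df[column] < lo) | (df[column] > hi)]

      households = pd.read_csv("round2_households.csv")         # hypothetical extract
      suspects = flag_outliers(households, "food_expenditure")  # hypothetical column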

  6. Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Hungary, Guatemala, Guinea-Bissau, Namibia, Saint Barthélemy, Niue, Panama, Guadeloupe, Chile, Andorra
    Description

    This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).

    Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  7. Additional file 1 of Grouped data with survey revision

    • figshare.com
    • springernature.figshare.com
    txt
    Updated Aug 13, 2024
    Cite
    Chung-Han Liang; Da-Wei Wang; Mei-Lien Pan (2024). Additional file 1 of Grouped data with survey revision [Dataset]. http://doi.org/10.6084/m9.figshare.26561521.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chung-Han Liang; Da-Wei Wang; Mei-Lien Pan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Data and estimation in the simulation study.

  8. Data from: Datasets for lot sizing and scheduling problems in the fruit-based beverage production process

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. Several papers in the literature have proposed different methods to solve this problem, but to the best of our knowledge there are no standard datasets that researchers can use to verify the accuracy and performance of proposed methods or that can serve as a benchmark. Authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the methods proposed in the scientific literature on realistic instances of the problem. The datasets presented here are based on real data collected from five beverage companies. We present four datasets specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].

    [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering 142 (2020) 107038. DOI: 10.1016/j.compchemeng.2020.107038.
    [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal 31 (2019) 142-173. DOI: 10.1007/s10696-017-9303-9.
    [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
    [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).

  9. The Surface Water Chemistry (SWatCh) database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 26, 2022
    + more versions
    Cite
    Heubach, Franz (2022). The Surface Water Chemistry (SWatCh) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4559695
    Explore at:
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Heubach, Franz
    Rotteveel, Lobke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset presented in the following manuscript: The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research, which is currently under review at Earth System Science Data.

    Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and determine critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database. We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database containing information on sites, methods, and samples, and a GIS shapefile of site locations. We remove poor quality data (for example, values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33,722 sites, and over 5 million samples collected between 1960 and 2022. Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
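    A minimal sketch of the flag-based screen and unit standardization described above, assuming illustrative column names (the actual SWatCh cleaning scripts may differ):

      import pandas as pd

      samples = pd.read_csv("swatch_samples.csv")  # hypothetical file name

      # Drop records whose quality flag marks them as poor quality.
      bad = {"suspect", "rejected"}
      samples = samples[~samples["result_flag"].str.lower().isin(bad)]

      # Standardize units: convert values reported in ug/L to mg/L.
      in_ugl = samples["unit"] == "ug/L"
      samples.loc[in_ugl, "value"] = samples.loc[in_ugl, "value"] / 1000.0
      samples.loc[in_ugl, "unit"] = "mg/L"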

  10. Data Preparation Tools Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 6, 2025
    Cite
    AMA Research & Media LLP (2025). Data Preparation Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/data-preparation-tools-51852
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    AMA Research & Media LLP
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Preparation Tools market is experiencing robust growth, projected to reach a market size of $3 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 17.7% from 2025 to 2033. This significant expansion is driven by several key factors. The increasing volume and velocity of data generated across industries necessitates efficient and effective data preparation processes to ensure data quality and usability for analytics and machine learning initiatives. The rising adoption of cloud-based solutions, coupled with the growing demand for self-service data preparation tools, is further fueling market growth. Businesses across various sectors, including IT and Telecom, Retail and E-commerce, BFSI (Banking, Financial Services, and Insurance), and Manufacturing, are actively seeking solutions to streamline their data pipelines and improve data governance. The diverse range of applications, from simple data cleansing to complex data transformation tasks, underscores the versatility and broad appeal of these tools. Leading vendors like Microsoft, Tableau, and Alteryx are continuously innovating and expanding their product offerings to meet the evolving needs of the market, fostering competition and driving further advancements in data preparation technology. This rapid growth is expected to continue, driven by ongoing digital transformation initiatives and the increasing reliance on data-driven decision-making. The segmentation of the market into self-service and data integration tools, alongside the varied applications across different industries, indicates a multifaceted and dynamic landscape. While challenges such as data security concerns and the need for skilled professionals exist, the overall market outlook remains positive, projecting substantial expansion throughout the forecast period. The adoption of advanced technologies like artificial intelligence (AI) and machine learning (ML) within data preparation tools promises to further automate and enhance the process, contributing to increased efficiency and reduced costs for businesses. The competitive landscape is dynamic, with established players alongside emerging innovators vying for market share, leading to continuous improvement and innovation within the industry.

  11. Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
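    The case study itself is written in R; as a rough Python analogue of the Iris warm-up steps it describes (summary statistics, a correlation, a histogram, and a scatter plot; requires scikit-learn and matplotlib):

      import pandas as pd
      from sklearn.datasets import load_iris       # offline copy of the Iris data

      df = load_iris(as_frame=True).frame          # sepal/petal measurements plus a species code

      print(df.describe())                         # summary statistics per column
      print(df["sepal length (cm)"].corr(df["petal length (cm)"]))   # correlation

      df["sepal length (cm)"].plot.hist(bins=20)                     # histogram
      df.plot.scatter(x="sepal length (cm)", y="petal length (cm)")  # scatter plot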

  12. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • data.virginia.gov
    • +6 more
    application/rdfxml +5
    Updated Jul 9, 2024
    + more versions
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/widgets/vbim-akqf
    Explore at:
    Available download formats: json, application/rdfxml, csv, xml, tsv, application/rssxml
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented (a short code sketch follows the list):

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
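    A hedged pandas sketch of the first two rules, using column names as they appear in the public use file (hosp_yn, onset_dt) but otherwise illustrative; this is not CDC’s actual pipeline:

      import pandas as pd

      cases = pd.read_csv("case_surveillance.csv", dtype=str)  # hypothetical extract

      # Rule 1: blank answers are reclassified as "Missing".
      cases["hosp_yn"] = cases["hosp_yn"].replace("", pd.NA).fillna("Missing")

      # Rule 2: an illogical (future) symptom onset date is set to null
      # until the reporting jurisdiction corrects it.
      onset = pd.to_datetime(cases["onset_dt"], errors="coerce")
      cases["onset_dt"] = onset.mask(onset > pd.Timestamp.today())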

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county.

  13. Variation in methods, results and reporting in electronic health record-based studies evaluating routine care in gout: A systematic review

    • plos.figshare.com
    pdf
    Updated Jun 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Samantha S. R. Crossfield; Lana Yin Hui Lai; Sarah R. Kingsbury; Paul Baxter; Owen Johnson; Philip G. Conaghan; Mar Pujades-Rodriguez (2023). Variation in methods, results and reporting in electronic health record-based studies evaluating routine care in gout: A systematic review [Dataset]. http://doi.org/10.1371/journal.pone.0224272
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Samantha S. R. Crossfield; Lana Yin Hui Lai; Sarah R. Kingsbury; Paul Baxter; Owen Johnson; Philip G. Conaghan; Mar Pujades-Rodriguez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: To perform a systematic review examining the variation in methods, results, reporting and risk of bias in electronic health record (EHR)-based studies evaluating management of a common musculoskeletal disease, gout.

    Methods: Two reviewers systematically searched MEDLINE, Scopus, Web of Science, CINAHL, PubMed, EMBASE and Google Scholar for all EHR-based studies published by February 2019 investigating gout pharmacological treatment. Information was extracted on study design, eligibility criteria, definitions, medication usage, effectiveness and safety data, comprehensiveness of reporting (RECORD), and Cochrane risk of bias (registered PROSPERO CRD42017065195).

    Results: We screened 5,603 titles/abstracts, 613 full-texts and selected 75 studies including 1.9M gout patients. Gout diagnosis was defined in 26 ways across the studies, most commonly using a single diagnostic code (n = 31, 41.3%). 48.4% did not specify a disease-free period before ‘incident’ diagnosis. Medication use was suboptimal and varied with disease definition, while results regarding effectiveness and safety were broadly similar across studies despite variability in inclusion criteria. Comprehensiveness of reporting was variable, ranging from 73% (55/75) appropriately discussing the limitations of EHR data use, to 5% (4/75) reporting on key data cleaning steps. Risk of bias was generally low.

    Conclusion: The wide variation in case definitions and medication-related analysis among EHR-based studies has implications for reported medication use. This is amplified by variable reporting comprehensiveness and the limited consideration of EHR-relevant biases (e.g. data adequacy) in study assessment tools. We recommend accounting for these biases and performing sensitivity analyses on case definitions, and suggest changes to assessment tools to foster this.

  14. STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Jun 26, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    World Health Organization (2017). STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe [Dataset]. https://catalog.ihsn.org/catalog/6968
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    World Health Organization (https://who.int/)
    Ministry of Health and Child Welfare
    Time period covered
    2005
    Area covered
    Zimbabwe
    Description

    Abstract

    Noncommunicable diseases are the top cause of deaths. In 2008, more than 36 million people worldwide died of such diseases; ninety per cent of them lived in low-income and middle-income countries.

    WHO Maps Noncommunicable Disease Trends in All Countries

    The STEPS Noncommunicable Disease Risk Factor Survey, part of the STEPwise approach to surveillance (STEPS) Adult Risk Factor Surveillance project by the World Health Organization (WHO), is a survey methodology to help countries begin to develop their own surveillance systems to monitor and fight noncommunicable diseases. The methodology prescribes three steps: questionnaire, physical measurements, and biochemical measurements. The steps consist of core items, core variables, and optional modules. Core topics covered by most surveys are demographics, health status, and health behaviors. These provide data on socioeconomic risk factors and metabolic, nutritional, and lifestyle risk factors. Details may differ from country to country and from year to year.

    The general objective of the Zimbabwe NCD STEPS survey was to assess the risk factors of selected NCDs in the adult population of Zimbabwe using the WHO STEPwise approach to noncommunicable disease surveillance. The specific objectives were:
    • To assess the distribution of lifestyle factors (physical activity, tobacco and alcohol use) and anthropometric measurements (body mass index and central obesity) which may impact diabetes and cardiovascular risk factors.
    • To identify dietary practices that are risk factors for selected NCDs.
    • To determine the prevalence and determinants of hypertension.
    • To determine the prevalence and determinants of diabetes.
    • To determine the prevalence and determinants of the serum lipid profile.

    Geographic coverage

    Mashonaland Central, Midlands and Matebeleland South Provinces.

    Analysis unit

    Household Individual

    Universe

    The survey comprised individuals aged 25 years and over.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A multistage sampling strategy with three stages (province, district and health centre) was employed. The World Health Organization STEPwise Approach (STEPS) was used as the design basis for the survey. The three randomly selected provinces were Mashonaland Central, Midlands and Matebeleland South. In each province four districts were chosen, and four health centres were surveyed per district. The survey comprised individuals aged 25 years and over and was carried out on 3,081 respondents: 1,189 from Midlands, 944 from Mashonaland Central and 948 from Matebeleland South. A detailed description of the sampling process is provided in sections 3.8-3.9 of the survey report provided under the Related Materials tab.

    Sampling deviation

    Designing a community-based survey such as this one is fraught with difficulties in ensuring representativeness of the sample. In this survey there was a preponderance of female respondents because of the pattern of male and female employment, which also influences urban-rural migration.

    The response rate in Midlands was lower than in the other two provinces in both Step 2 and Step 3. This difference arose because Midlands had more respondents sampled from urban communities; a higher proportion of urban respondents were formally employed and therefore did not complete Steps 2 and 3 due to conflicts with work schedules.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    In this survey all the core and selected expanded and optional variables were collected. In addition, a food frequency questionnaire and a UNICEF-developed questionnaire, the Fortification Rapid Assessment Tool (FRAT), were administered to elicit relevant dietary information.

    Cleaning operations

    Data entry for Step 1 and Step 2 data was carried out as soon as data became available to the data management team. Step 3 data became available in October and data entry was carried out when data quality checks were completed in November. Report writing started in September and a preliminary report became available in December 2005.

    Training of data entry clerks: Five data entry clerks were recruited and trained for one week. Selection was based on their performance during previous research carried out by the MOH&CW. The training involved the following:
    • Familiarization with the NCD, FRAT and FFQ questionnaires.
    • Familiarization with the data entry template.
    • Development of codes for open-ended questions.
    • The statistical package (EPI Info 6).
    • Development of a data entry template using EPI Info 6.
    • Development of check files for each template.
    • Trial (mock) runs to check whether the template was complete and user-friendly for data entry.
    • Double entry: what it involves, how to do it, and why it should be done.
    • Pre-primary data cleaning of the data entry template (checking whether denominators tally).

    Data Entry for NCD, FRAT and FFQ questionnaires: The questionnaires were sequentially numbered and divided among the five data entry clerks. Each data entry clerk had a unique identifier for quality control purposes. The data were entered into five separate files using the statistical package EPI Info version 6.0. The data entry clerks interchanged their files for double entry and validation of the data. Preliminary data cleaning was done for each of the five files. The five files were then merged to give a single file, which was transferred to Stata version 7.0 using Stat Transfer version 5.0.

    Data Cleaning: A data-cleaning workshop was held with the core research team members. The objectives of the workshop were:
    1. To check all data entry errors.
    2. To assess any inconsistencies in data filling.
    3. To assess any inconsistencies in data entry.
    4. To assess completeness of the data entered.

    Data Merging: There were two datasets after the data entry process (the NCD questionnaire dataset and the laboratory dataset). The two files were merged by joining corresponding observations from the NCD questionnaire dataset with those from the laboratory dataset into single observations, using a unique identifier. The ID number was chosen as the unique identifier since it appeared in both datasets. The main aim of merging was to combine the two datasets containing information on the behaviour of individuals and the NCD laboratory parameters. When the two datasets were merged, a new merge variable was created, taking the values 1, 2 and 3 (a short code sketch of this convention follows the list):
    • merge == 1: the observation appears in the NCD questionnaire dataset but has no corresponding observation in the laboratory dataset.
    • merge == 2: the observation appears in the laboratory dataset but has no corresponding observation in the questionnaire dataset.
    • merge == 3: the observation appears in both datasets, reflecting a complete merge.
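    The survey team worked in EPI Info and Stata; the same three-way convention corresponds to the merge indicator in, for example, pandas. A minimal sketch with hypothetical file and key names:

      import pandas as pd

      questionnaire = pd.read_stata("ncd_questionnaire.dta")  # hypothetical file
      laboratory = pd.read_stata("ncd_laboratory.dta")        # hypothetical file

      merged = questionnaire.merge(laboratory, on="id", how="outer",
                                   indicator="merge_var")
      # merge_var == "left_only"  -> merge variable 1 (questionnaire only)
      # merge_var == "right_only" -> merge variable 2 (laboratory only)
      # merge_var == "both"       -> merge variable 3 (complete merge)
      to_clean = merged[merged["merge_var"] != "both"]  # observations to reconcile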

    Data Cleaning After Merging: Data cleaning involved identifying the observations where the merge variable value was either 1 or 2. The merge status for each observation was updated after any corrections were made. The other variables used in the cleaning were province, district and health centre, since they also appeared in both datasets.

    Objectives of cleaning: 1. To match common variables in both datasets and identify inconsistencies in other matching variables (e.g. province, district and health centre). 2. To check for any data entry errors.

    Response rate

    A total of 3,081 respondents were included in the survey against an estimated sample size of 3,000. The response rate for Step 1 was 80%, and for Step 2 it was 70%, taking Step 1 accrual as 100%.

  15. COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam

    • microdata.worldbank.org
    • datacatalog.ihsn.org
    • +1 more
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/3813
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Abstract

    The main objective of this project is to collect household data for the ongoing assessment and monitoring of the socio-economic impacts of COVID-19 on households and family businesses in Vietnam. The estimated field work and sample size of households in each round is as follows:

    • Round 1 (June fieldwork): approximately 6,300 households (at least 1,300 minority households)
    • Round 2 (August fieldwork): approximately 4,000 households (at least 1,000 minority households)
    • Round 3 (September fieldwork): approximately 4,000 households (at least 1,000 minority households)
    • Round 4 (December fieldwork): approximately 4,000 households (at least 1,000 minority households)
    • Round 5: pending discussion

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. Of the 15 households, 3 have information collected on both income and expenditure (large module) as well as many other aspects, while the remaining 12 have information collected on income but not expenditure (small module). Therefore, the large-module sample of 9,396 households is representative at regional and national levels, while the whole sample is representative at the provincial level.

    We use the large-module households to select the households for official interview in the VHFPS survey, keeping the small-module households in reserve for replacement. The large module has 9,396 households, of which 7,951 have a phone number (cell phone or landline).

    After data processing, the final sample size is 6,213 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 1 consisted of the following sections:
    Section 2. Behavior
    Section 3. Health
    Section 4. Education & Child caring
    Section 5A. Employment (main respondent)
    Section 5B. Employment (other household member)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps:

    • Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations of households that were not supposed to be interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers’ notes and make adjustments accordingly. Whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent’s answer in detail, so that the survey management team could decide which code suited the answer best.
    • Correct data based on supervisors’ notes where enumerators entered a wrong code.
    • Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to record free text. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or should be kept as originally recorded; an answer that appeared many times in the dataset could be assigned a completely new code.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th-95th percentile range, by listening to interview recordings.
    • Perform a final check on matching the main dataset with the section files; sections where information is collected at the individual level are kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.

    Response rate

    The target for Round 1 was to complete interviews with 6,300 households, of which 1,888 were located in urban areas and 4,475 in rural areas. In addition, at least 1,300 ethnic minority households were to be interviewed. A random selection of 6,300 households was made out of the 7,951 households for official interview, with the rest kept for replacement. However, the refusal rate was about 27 percent, so households from the small module in the same EA were contacted as replacements; these households were also randomly selected.

  16. LScDC (Leicester Scientific Dictionary-Core)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LScDC (Leicester Scientific Dictionary-Core), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary), Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below.

    # of words:
    LScD (v3): 972,060
    LScDC (v3): 103,998

    * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
    ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

    [Version 2] Getting Started: This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created to be used in future work on the quantification of the sense of research texts.

    The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, working with an enormous number of words challenges the performance and accuracy of data mining applications, which depend heavily on the type of words in the corpus (such as stop words and content words) and on their number. Rare words are not useful for discriminating texts in large corpora, as they are likely to be non-informative signals (noise) and redundant in the collection of texts. Selecting relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we decided on the following process for the LScD: removing words that appear in no more than 10 documents (
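    The listing is truncated above, but the document-frequency cut it describes (discarding words that appear in no more than 10 documents) can be sketched as follows; docs is a hypothetical list of token lists, one per abstract in the corpus:

      from collections import Counter

      def core_dictionary(docs, min_docs=10):
          # Count, for each word, the number of documents containing it,
          # then keep words whose document frequency exceeds min_docs.
          doc_freq = Counter()
          for tokens in docs:
              doc_freq.update(set(tokens))  # each word counted once per document
          return {word: n for word, n in doc_freq.items() if n > min_docs}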

  17. Global Data Cleansing Tools Market Research and Development Focus 2025-2032

    • statsndata.org
    excel, pdf
    Updated Feb 2025
    Cite
    Stats N Data (2025). Global Data Cleansing Tools Market Research and Development Focus 2025-2032 [Dataset]. https://www.statsndata.org/report/data-cleansing-tools-market-339171
    Explore at:
    Available download formats: excel, pdf
    Dataset updated
    Feb 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Data Cleansing Tools market is rapidly evolving as businesses increasingly recognize the importance of data quality in driving decision-making and strategic initiatives. Data cleansing, also known as data scrubbing or data cleaning, involves identifying and correcting errors and inconsistencies in data.
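    As a concrete illustration of what such tools automate, the sketch below applies a few typical cleansing operations in pandas. The columns, values, and rules are invented for the example and are not drawn from the report.

```python
import pandas as pd

# Toy records with typical quality problems: inconsistent casing and
# whitespace, a duplicate row after normalization, a missing value,
# and an impossible age.
df = pd.DataFrame({
    "name": ["Ana", "ana ", "Ben", "Chloe"],
    "age":  [34.0, 34.0, None, 212.0],
    "city": ["La Paz", "la paz", "El Alto", "Cochabamba"],
})

df["name"] = df["name"].str.strip().str.title()           # normalize text
df["city"] = df["city"].str.title()
df = df.drop_duplicates()                                 # rows 0 and 1 now match
df.loc[~df["age"].between(0, 120), "age"] = float("nan")  # void impossible ages
df["age"] = df["age"].fillna(df["age"].median())          # impute missing values

print(df)
```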

  18. Influence of slow sand filter cleaning process type on filter media biomass: scraping versus backwashing - SEM images

    • narcis.nl
    • data.mendeley.com
    Updated Oct 28, 2020
    + more versions
    Cite
    de Souza, F (via Mendeley Data) (2020). Influence of slow sand filter cleaning process type on filter media biomass: scraping versus backwashing - SEM images [Dataset]. http://doi.org/10.17632/b26d6fbg2t.2
    Explore at:
    Dataset updated
    Oct 28, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    de Souza, F (via Mendeley Data)
    Description

    Backwashing of slow sand filters was developed to simplify the filter cleaning process. This study aimed to assess biomass in backwashed slow sand filters and compare it with scraped filters. The dataset comprises Scanning Electron Microscopy (SEM) images of the filter media used in the backwashing study. Samples were taken before and after cleaning, and at different filtration depths (0 cm, 5 cm and 30 cm), from two types of slow sand filters: one scraped conventional slow sand filter (ScSF) and one backwashed slow sand filter (BSF). The micrographs presented here show different materials attached to the sand used as filtration media, such as biomass. It was possible to conclude that biomass accumulates in the top filtration layers and that scraping removed more biomass than backwashing. (v.2, title changed)

  19. STEP Skills Measurement Household Survey 2012 (Wave 1) - Bolivia

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    • +1more
    Updated Mar 29, 2019
    Cite
    World Bank (2019). STEP Skills Measurement Household Survey 2012 (Wave 1) - Bolivia [Dataset]. https://catalog.ihsn.org/index.php/catalog/4780
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2012
    Area covered
    Bolivia
    Description

    Abstract

    The STEP (Skills Toward Employment and Productivity) Measurement program is the first-ever initiative to generate internationally comparable data on the skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills, and on the demand for skills, in the labor markets of low-income countries.

    The uniquely designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (a subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adult Skills instruments. Modules also gather information about family, health and language.

    Geographic coverage

    The cities that are covered are La Paz, El Alto, Cochabamba and Santa Cruz de la Sierra.

    Analysis unit

    The units of analysis are the individual respondents and households. A household roster is completed at the start of the survey, and the individual respondent is randomly selected from among all household members aged 15 to 64. The random selection process was designed by the STEP team, and compliance with the procedure is carefully monitored during fieldwork.
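    The selection step can be illustrated with a few lines of code: build the roster, restrict to eligible members, and draw one with equal probability. The data and names here are hypothetical, and this is not the STEP team's actual selection procedure.

```python
import random

# Hypothetical household roster as (name, age) pairs.
roster = [("Maria", 42), ("Jorge", 70), ("Lucia", 17), ("Tomas", 9)]

# Eligible respondents are members aged 15 to 64 inclusive.
eligible = [member for member in roster if 15 <= member[1] <= 64]

# Select exactly one respondent with equal probability.
respondent = random.choice(eligible)
print(respondent)  # ('Maria', 42) or ('Lucia', 17)
```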

    Universe

    The STEP target population is the population aged 15-64, living in urban areas as defined by each country's statistical office. The following are excluded from the sample:

    • Residents of institutions (prisons, hospitals, etc.)
    • Residents of senior homes and hospices
    • Residents of other group dwellings such as college dormitories, halfway homes, workers' quarters, etc.
    • Persons living outside the country at the time of data collection

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A stratified 3-stage sample design was implemented in Bolivia. The stratification variable is the city-wealth category. There are 20 strata, formed by grouping the primary sample units (PSUs) by city (1-La Paz, 2-El Alto, 3-Cochabamba, 4-Santa Cruz de la Sierra) and wealth category (1-Poorest, 2-Moderately Poor, 3-Middle Wealth, 4-Moderately Rich, 5-Rich).

    The sample frame for the first-stage units is the 2001 National Census of Population and Housing carried out by the National Institute of Statistics. The primary sample unit (PSU) is a census sector. A sample of 218 PSUs was selected from the 10,304 PSUs on the frame, comprising 160 'initial' PSUs and 58 'reserve' PSUs. Of the 218 sampled PSUs, 169 were activated during data collection: 155 initial PSUs and 14 reserve PSUs. Among the 160 initial PSUs, 5 were replaced due to security concerns, and 14 reserve PSUs were activated to supplement the sample in initial PSUs where the target of 15 interviews was not achieved due to high levels of non-response. The PSUs were grouped by city-wealth stratum, and within each stratum PSUs were selected with probability proportional to size (PPS), the measure of size being the number of households in a PSU.
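    Selection with probability proportional to size is commonly implemented as a systematic draw over cumulative sizes. The sketch below illustrates the idea on a made-up stratum; it is not the Bolivian frame or the STEP team's software.

```python
import random

def pps_systematic(psus, n):
    """Select n PSUs with probability proportional to size,
    via systematic sampling over cumulative size."""
    total = sum(size for _, size in psus)
    step = total / n
    start = random.uniform(0, step)
    picks, cumulative, idx = [], 0.0, 0
    for name, size in psus:
        cumulative += size
        # A PSU is selected once for every sampling point it covers.
        while idx < n and start + idx * step <= cumulative:
            picks.append(name)
            idx += 1
    return picks

# Toy stratum: census sectors with household counts as the size measure.
stratum = [("Sector-A", 500), ("Sector-B", 1200), ("Sector-C", 300),
           ("Sector-D", 900), ("Sector-E", 700)]
print(pps_systematic(stratum, n=2))  # larger sectors are more likely to appear
```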

    The second-stage sample unit (SSU) is a household. The sampling objective was to obtain interviews at 15 households within each initial PSU, for an initial sample of 2,400 interviews. At the second stage of selection, 45 households were selected in each PSU using a systematic random method; the 45 households were then randomly divided into 15 'initial' households and 30 'reserve' households ranked according to the random selection order. Note: due to higher-than-expected levels of non-response in some PSUs, additional households were sampled, so the final sample in some PSUs exceeded 45 households.
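    The second-stage step, systematic selection of 45 households followed by a random split into 15 initial and 30 ranked reserve households, might look like this sketch; the dwelling listing and identifiers are invented for illustration.

```python
import random

def select_households(listing, n=45, n_initial=15):
    """Systematically select n households, then randomly split them into
    an initial set and a reserve list ranked by the random order."""
    step = len(listing) / n
    start = random.uniform(0, step)
    sampled = [listing[int(start + i * step)] for i in range(n)]
    random.shuffle(sampled)  # the shuffled order ranks the reserves
    return sampled[:n_initial], sampled[n_initial:]

# Hypothetical PSU listing of 180 dwellings.
listing = [f"DW-{i:03d}" for i in range(180)]
initial, reserve = select_households(listing)
print(len(initial), len(reserve))  # 15 30
```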

    The third stage sample unit was an individual 15-64 years old (inclusive). The sampling objective was to select one individual with equal probability from each selected household.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The STEP survey instruments include:

    • The background questionnaire developed by the World Bank (WB) STEP team
    • Reading Literacy Assessment developed by Educational Testing Services (ETS).

    All countries adapted and translated both instruments following the STEP technical standards: two independent translators adapted and translated the STEP background questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator.

    The survey instruments were piloted as part of the survey pre-test.

    The background questionnaire covers such topics as respondents' demographic characteristics, dwelling characteristics, education and training, health, employment, job skill requirements, personality, behavior and preferences, language and family background.

    The background questionnaire, the structure of the Reading Literacy Assessment and Reading Literacy Data Codebook are provided in the document "Bolivia STEP Skills Measurement Survey Instruments", available in external resources.

    Cleaning operations

    STEP data management process:

    1) Raw data is sent by the survey firm.
    2) The World Bank (WB) STEP team runs data checks on the background questionnaire data, while Educational Testing Services (ETS) runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
    3) The survey firm reviews the comments and questions. When a data entry error is identified, the survey firm corrects the data.
    4) The WB STEP team and ETS check whether the data files are clean. This may require additional iterations with the survey firm.
    5) Once the data have been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
    6) ETS scales the Reading Literacy Assessment data.
    7) The WB STEP team merges the background questionnaire data with the Reading Literacy Assessment data and computes derived variables, as sketched below.
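    Step 7 is essentially a keyed one-to-one merge followed by variable derivation. The pandas sketch below shows the shape of such a step; the identifiers, columns, and derived variable are illustrative assumptions, not the STEP team's actual code.

```python
import pandas as pd

# Hypothetical cleaned inputs keyed by a shared respondent identifier.
background = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "age": [29, 45, 33],
    "years_schooling": [11, 16, 8],
})
assessment = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "literacy_score": [243.0, 301.5, 198.2],
})

# One-to-one merge on the identifier; `validate` catches duplicate keys.
merged = background.merge(assessment, on="respondent_id", validate="one_to_one")

# Example derived variable: a simple schooling-level category.
merged["edu_level"] = pd.cut(merged["years_schooling"],
                             bins=[0, 8, 12, 25],
                             labels=["primary", "secondary", "tertiary"])
print(merged)
```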

    Detailed information on data processing in STEP surveys is provided in "STEP Guidelines for Data Processing" document, available in external resources. The template do-file used by the STEP team to check raw background questionnaire data is provided as an external resource, too.

    Response rate

    An overall response rate of 43% was achieved in the Bolivia STEP Survey. All non-response cases were documented (refusal, not found, no eligible household member, etc.) and accounted for during the weighting process. When an initial household could not be interviewed, a reserve household was activated to replace it. Procedures are described in the "Operation Manual", provided as an external resource.

    Sampling error estimates

    Weighting documentation was prepared for each participating country and provides some information on sampling errors. The weighting documentation for all countries is provided as an external resource.

  20. General Household Survey-Panel Wave 3 (Post Harvest) 2015-2016 - Nigeria

    • microdata.fao.org
    Updated Jul 17, 2019
    + more versions
    Cite
    National Bureau of Statistics (NBS) (2019). General Household Survey-Panel Wave 3 (Post Harvest) 2015-2016 - Nigeria [Dataset]. https://microdata.fao.org/index.php/catalog/930
    Explore at:
    Dataset updated
    Jul 17, 2019
    Dataset provided by
    National Bureau of Statistics, Nigeria
    Authors
    National Bureau of Statistics (NBS)
    Time period covered
    2016
    Area covered
    Nigeria
    Description

    Abstract

    The Nigerian General Household Survey (GHS) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program and was revised in 2010 to include a panel component (GHS-Panel). The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, inter-institutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of 5,000 households, which are also representative of the geopolitical zones (at both the urban and rural levels). The households included in the GHS-Panel are a sub-sample of the overall GHS sample of 22,200 households. This survey is the third wave of the GHS-Panel and was implemented in 2015-2016.

    Geographic coverage

    National Coverage

    Analysis unit

    Households

    Universe

    Household Members

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The GHS-Panel sample is fully integrated with the 2010 GHS sample. The GHS sample comprises 60 Primary Sampling Units (PSUs), or Enumeration Areas (EAs), chosen from each of the 37 states in Nigeria, for a total of 2,220 EAs nationally. Each EA contributes 10 households to the GHS sample, resulting in a sample size of 22,200 households. Of these 22,200 households, 5,000 households from 500 EAs were selected for the panel component, and 4,916 households completed their interviews in the first wave. Given the panel nature of the survey, some households had moved from their location and could not be located by the time of the Wave 3 visit, resulting in a slightly smaller sample of 4,581 households for Wave 3.

    In order to collect detailed and accurate information on agricultural activities, GHS-Panel households are visited twice: first after the planting season (post-planting), between August and October, and again after the harvest season (post-harvest), between February and April. All households are visited twice regardless of whether they participated in agricultural activities. Some important factors, such as labour, food consumption, and expenditures, are collected during both visits. Unless otherwise specified, the majority of the report focuses on the most recent information, collected during the post-harvest visit.

    Mode of data collection

    Face-to-face paper [f2f]

    Cleaning operations

    The data cleaning process was carried out in a number of stages. The first step was to ensure proper quality control during fieldwork. This was achieved in part by using a concurrent data entry system designed to highlight many of the errors occurring during fieldwork; errors caught at this stage were corrected through re-visits to the household on the instruction of the supervisor. Data that had gone through this first stage of cleaning were then sent from the state offices to the NBS head office, where a second stage of data cleaning was undertaken. During the second stage, the data were examined for out-of-range values and outliers, and for missing information in required variables, sections, questionnaires and EAs. Any problems found were reported back to the state, where the corrections were made. This was an ongoing process until all data were delivered to the head office.
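    The second-stage checks, out-of-range values, outliers, and missing required fields, can be expressed as simple filters. The sketch below is illustrative only; the variables, bounds, and outlier rule are hypothetical, not the NBS rules.

```python
import pandas as pd

# Hypothetical household records from one EA.
df = pd.DataFrame({
    "hh_id": [1, 2, 3, 4, 5, 6],
    "plot_area_ha": [0.5, 2.1, 1.2, 95.0, None, 0.8],
    "hh_size": [4, 0, 7, 5, 3, 6],
})

report = []
# Out-of-range check against plausible bounds.
report.append(df[~df["hh_size"].between(1, 30)].assign(issue="hh_size out of range"))
# Simple outlier screen: values far beyond the interquartile range.
q1, q3 = df["plot_area_ha"].quantile([0.25, 0.75])
upper = q3 + 3 * (q3 - q1)
report.append(df[df["plot_area_ha"] > upper].assign(issue="plot_area outlier"))
# Missing information for a required variable.
report.append(df[df["plot_area_ha"].isna()].assign(issue="plot_area missing"))

print(pd.concat(report, ignore_index=True))
```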

    After all the data were received by the head office, an overall review of the complete dataset was undertaken to identify outliers and other errors. Where problems were identified, they were reported to the state; there the questionnaires were checked and, where necessary, the relevant households were revisited, and a report with the corrections was sent back to the head office.

    The final stage of the cleaning process was to ensure that the household- and individual-level datasets were correctly merged across all sections of the household questionnaire. Special care was taken to verify that the households included in the data matched the selected sample; where there were differences, these were assessed and documented. The agriculture data were likewise checked to ensure that the plots identified in the main sections merged with the plot information in the other sections, and the same check was applied to the crop-by-plot information.
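    The sample-match check described above is naturally expressed as an outer merge with an indicator column, which exposes mismatches in both directions. The identifiers and layout below are assumptions for illustration, not the NBS data structure.

```python
import pandas as pd

# Hypothetical selected sample versus households actually present in the data.
sample = pd.DataFrame({"hh_id": [10, 11, 12, 13]})
collected = pd.DataFrame({"hh_id": [10, 11, 14], "weight": [1.2, 0.9, 1.1]})

# An outer merge with an indicator exposes mismatches both ways.
check = sample.merge(collected, on="hh_id", how="outer", indicator=True)
missing = check[check["_merge"] == "left_only"]      # sampled but not in data
unexpected = check[check["_merge"] == "right_only"]  # in data but not sampled

print("missing from data:", missing["hh_id"].tolist())  # [12, 13]
print("not in sample:", unexpected["hh_id"].tolist())   # [14]
```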
