34 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. TagX Web Browsing clickstream Data - 300K Users North America, EU - GDPR -...

    • datarade.ai
    .json, .csv, .xls
    Updated Sep 16, 2024
    Cite
    TagX (2024). TagX Web Browsing clickstream Data - 300K Users North America, EU - GDPR - CCPA Compliant [Dataset]. https://datarade.ai/data-products/tagx-web-browsing-clickstream-data-300k-users-north-america-tagx
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Sep 16, 2024
    Dataset authored and provided by
    TagX
    Area covered
    United States
    Description

    TagX Web Browsing Clickstream Data: Unveiling Digital Behavior Across North America and the EU

    Unique Insights into Online User Behavior
    TagX Web Browsing clickstream Data offers an unparalleled window into the digital lives of 1 million users across North America and the European Union. This comprehensive dataset stands out in the market due to its breadth, depth, and stringent compliance with data protection regulations.

    What Makes Our Data Unique?

    • Extensive Geographic Coverage: Spanning two major markets, our data provides a holistic view of web browsing patterns in developed economies.
    • Large User Base: With 300K active users, our dataset offers statistically significant insights across various demographics and user segments.
    • GDPR and CCPA Compliance: We prioritize user privacy and data protection, ensuring that our data collection and processing methods adhere to the strictest regulatory standards.
    • Real-time Updates: Our clickstream data is continuously refreshed, providing up-to-the-minute insights into evolving online trends and user behaviors.
    • Granular Data Points: We capture a wide array of metrics, including time spent on websites, click patterns, search queries, and user journey flows.

    Data Sourcing: Ethical and Transparent
    Our web browsing clickstream data is sourced through a network of partnered websites and applications. Users explicitly opt in to data collection, ensuring transparency and consent. We employ advanced anonymization techniques to protect individual privacy while maintaining the integrity and value of the aggregated data. Key aspects of our data sourcing process include:

    • Voluntary user participation through clear opt-in mechanisms
    • Regular audits of data collection methods to ensure ongoing compliance
    • Collaboration with privacy experts to implement best practices in data anonymization
    • Continuous monitoring of regulatory landscapes to adapt our processes as needed

    Primary Use Cases and Verticals
    TagX Web Browsing clickstream Data serves a multitude of industries and use cases, including but not limited to:

    Digital Marketing and Advertising:
    • Audience segmentation and targeting
    • Campaign performance optimization
    • Competitor analysis and benchmarking

    E-commerce and Retail:
    • Customer journey mapping
    • Product recommendation enhancements
    • Cart abandonment analysis

    Media and Entertainment:
    • Content consumption trends
    • Audience engagement metrics
    • Cross-platform user behavior analysis

    Financial Services:
    • Risk assessment based on online behavior
    • Fraud detection through anomaly identification
    • Investment trend analysis

    Technology and Software:
    • User experience optimization
    • Feature adoption tracking
    • Competitive intelligence

    Market Research and Consulting:
    • Consumer behavior studies
    • Industry trend analysis
    • Digital transformation strategies

    Integration with Broader Data Offering
    TagX Web Browsing clickstream Data is a cornerstone of our comprehensive digital intelligence suite. It seamlessly integrates with our other data products to provide a 360-degree view of online user behavior:

    • Social Media Engagement Data: Combine clickstream insights with social media interactions for a holistic understanding of digital footprints.
    • Mobile App Usage Data: Cross-reference web browsing patterns with mobile app usage to map the complete digital journey.
    • Purchase Intent Signals: Enrich clickstream data with purchase intent indicators to power predictive analytics and targeted marketing efforts.
    • Demographic Overlays: Enhance web browsing data with demographic information for more precise audience segmentation and targeting.

    By leveraging these complementary datasets, businesses can unlock deeper insights and drive more impactful strategies across their digital initiatives.

    Data Quality and Scale
    We pride ourselves on delivering high-quality, reliable data at scale:

    • Rigorous Data Cleaning: Advanced algorithms filter out bot traffic, VPNs, and other non-human interactions.
    • Regular Quality Checks: Our data science team conducts ongoing audits to ensure data accuracy and consistency.
    • Scalable Infrastructure: Our robust data processing pipeline can handle billions of daily events, ensuring comprehensive coverage.
    • Historical Data Availability: Access up to 24 months of historical data for trend analysis and longitudinal studies.
    • Customizable Data Feeds: Tailor the data delivery to your specific needs, from raw clickstream events to aggregated insights.

    Empowering Data-Driven Decision Making
    In today's digital-first world, understanding online user behavior is crucial for businesses across all sectors. TagX Web Browsing clickstream Data empowers organizations to make informed decisions, optimize their digital strategies, and stay ahead of the competition. Whether you're a marketer looking to refine your targeting, a product manager seeking to enhance user experience, or a researcher exploring digital trends, our cli...
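    The exact schema of the delivered feeds is not published on this page. Purely as a rough illustration of what working with a raw clickstream export might look like, the sketch below assumes a flat .csv with hypothetical columns user_id, url, timestamp and duration_sec and aggregates it into per-user metrics with pandas.

    ```python
    # Hypothetical sketch: aggregating raw clickstream events into per-user metrics.
    # Column names (user_id, url, timestamp, duration_sec) are assumptions, not the actual TagX schema.
    import pandas as pd

    events = pd.read_csv("tagx_clickstream_sample.csv", parse_dates=["timestamp"])

    per_user = (
        events.sort_values("timestamp")
              .groupby("user_id")
              .agg(page_views=("url", "count"),
                   total_time_sec=("duration_sec", "sum"),
                   first_seen=("timestamp", "min"),
                   last_seen=("timestamp", "max"))
    )
    print(per_user.head())
    ```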

  3. Data from: Datasets for lot sizing and scheduling problems in the...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. Accordingly, several papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature to verify the accuracy and performance of proposed methods or to serve as a benchmark for other researchers considering this problem. Authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables the effectiveness of the methods proposed in the scientific literature to be evaluated on realistic instances of the problem. The datasets presented here include data based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].
    [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038.
    [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9.
    [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
    [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).
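    The papers cited above develop MIP formulations and heuristics for a two-stage problem with temporal cleanings and synchrony between tanks and lines; those models are not reproduced on this page. Purely for orientation, a generic single-stage capacitated lot-sizing core (a simplification that omits the cleaning and synchrony constraints specific to these datasets) can be written as:

    ```latex
    \begin{align*}
    \min \quad & \sum_{i}\sum_{t} \left( s_i\, y_{it} + h_i\, I_{it} \right) \\
    \text{s.t.} \quad & I_{i,t-1} + x_{it} - I_{it} = d_{it} && \forall i,\, t \\
    & \sum_{i} \left( a_i\, x_{it} + b_i\, y_{it} \right) \le C_t && \forall t \\
    & x_{it} \le M\, y_{it}, \quad y_{it} \in \{0,1\}, \quad x_{it},\, I_{it} \ge 0 && \forall i,\, t
    \end{align*}
    ```

    Here x_{it} is the quantity of beverage i produced in period t, y_{it} indicates a setup, I_{it} is ending inventory, d_{it} is demand, C_t is capacity, s_i and h_i are setup and holding costs, and a_i and b_i are processing and setup times.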

  4. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    LScDC Word-Category RIG Matrix [Dataset]. https://figshare.le.ac.uk/articles/dataset/LScDC_Word-Category_RIG_Matrix/12133431
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG Matrix. April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    Getting Started
    This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1], the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (the last two columns of the matrix). So the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns. This matrix was created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC, so the meaningfulness of words is evaluated by the words’ average informativeness in the categories. We decided to include the most informative 5,000 words in the scientific thesaurus.

    Words as a Vector of Frequencies in WoS Categories
    Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using a binary calculation of frequencies, we record the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of LSC texts containing the word in that category.

    Words as a Vector of Relative Information Gains Extracted for Categories
    In this section, we introduce our approach to the representation of a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by its information gains for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category obtained from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain and thereby allows information gains for different categories to be compared. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive. Given a word, we created a vector in which each component corresponds to a category; therefore, each word is represented as a vector of relative information gains, and the dimension of the vector is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, and a column vector represents the RIGs of all words for an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for that category. As well as ordering words within each category, words can be ordered by two overall criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in the 252 categories are calculated and the vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

    Leicester Scientific Thesaurus (LScT)
    The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words to be the most meaningful words in the scientific corpus: the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT, with the value of the sum for each word, can be found as a CSV file in the published archive.

    The published archive contains the following files:
    1) Word_Category_RIG_Matrix.csv: a 103,998 by 254 matrix where columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: a 103,998 by 252 matrix where columns are the 252 WoS categories and rows are words of the LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
    3) LScT.csv: list of words of the LScT with sum (S) values.
    4) Text_No_in_Cat.csv: the number of texts in categories.
    5) Categories_in_Documents.csv: list of WoS categories for each document of the LSC.
    6) README.txt: description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures for forming them.
    7) README.pdf: same as 6, in PDF format.

    References
    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
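    The archive's README gives the authoritative formulas for entropy, Information Gain and RIG. As a minimal sketch of the standard definitions used here, with a Boolean category indicator C and word indicator W over the corpus and RIG = (H(C) - H(C|W)) / H(C), one matrix entry could be computed as follows; the function names and toy inputs are illustrative, not part of the published archive.

    ```python
    # Minimal sketch of Relative Information Gain (RIG) for one (word, category) pair.
    # Inputs are Boolean per-text indicators over the corpus.
    import numpy as np

    def entropy(p):
        """Shannon entropy of a Bernoulli(p) variable, in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def relative_information_gain(in_category, contains_word):
        """RIG = (H(C) - H(C|W)) / H(C) for Boolean arrays of equal length."""
        c = np.asarray(in_category, dtype=bool)
        w = np.asarray(contains_word, dtype=bool)
        h_c = entropy(c.mean())
        if h_c == 0.0:
            return 0.0
        # Conditional entropy H(C|W), weighted by P(W=1) and P(W=0)
        h_c_given_w = 0.0
        for value in (True, False):
            mask = (w == value)
            if mask.any():
                h_c_given_w += mask.mean() * entropy(c[mask].mean())
        return (h_c - h_c_given_w) / h_c

    # Toy usage: 6 texts, 3 in the category, word present exactly in those 3 texts
    print(relative_information_gain([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]))  # 1.0
    ```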

  5. Hinch yourself happy : all the best cleaning tips to shine your sink and...

    • workwithdata.com
    Updated May 23, 2023
    Cite
    Work With Data (2023). Hinch yourself happy : all the best cleaning tips to shine your sink and so.. [Dataset]. https://www.workwithdata.com/object/hinch-yourself-happy-all-the-best-cleaning-tips-to-shine-your-sink-and-soothe-your-soul-book-by-sophie-hinchliffe-0000
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hinch yourself happy : all the best cleaning tips to shine your sink and soothe your soul is a book. It was written by Sophie Hinchliffe and published by Michael Joseph in 2019.

  6. Hinch yourself happy : all the best cleaning tips to shine your sink and...

    • workwithdata.com
    Updated May 18, 2023
    Cite
    Work With Data (2023). Hinch yourself happy : all the best cleaning tips to shine your sink and so.. [Dataset]. https://www.workwithdata.com/book/Hinch%20yourself%20happy%20:%20all%20the%20best%20cleaning%20tips%20to%20shine%20your%20sink%20and%20soothe%20your%20soul_117157
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hinch yourself happy : all the best cleaning tips to shine your sink and soothe your soul is a book. It was written by Hinch and published by Michael Joseph in 2019.

  7. Data from: Accounting for imperfect detection in data from museums and...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Jun 22, 2021
    Cite
    Kelley D. Erickson; Adam B. Smith (2021). Accounting for imperfect detection in data from museums and herbaria when modeling species distributions: Combining and contrasting data-level versus model-level bias correction [Dataset]. http://doi.org/10.5061/dryad.51c59zw8b
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Missouri Botanical Garden
    Authors
    Kelley D. Erickson; Adam B. Smith
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    The digitization of museum collections as well as an explosion in citizen science initiatives has resulted in a wealth of data that can be useful for understanding the global distribution of biodiversity, provided that the well-documented biases inherent in unstructured opportunistic data are accounted for. While traditionally used to model imperfect detection using structured data from systematic surveys of wildlife, occupancy models provide a framework for modelling the imperfect collection process that results in digital specimen data. In this study, we explore methods for adapting occupancy models for use with biased opportunistic occurrence data from museum specimens and citizen science platforms using 7 species of Anacardiaceae in Florida as a case study. We explored two methods of incorporating information about collection effort to inform our uncertainty around species presence: (1) filtering the data to exclude collectors unlikely to collect the focal species and (2) incorporating collection covariates (collection type, time of collection, and history of previous detections) into a model of collection probability. We found that the best models incorporated both the background data filtration step as well as collector covariates. Month, method of collection and whether a collector had previously collected the focal species were important predictors of collection probability. Efforts to standardize meta-data associated with data collection will improve efforts for modeling the spatial distribution of a variety of species.

    Methods
    R code for downloading data, cleaning data, and running occupancy models.
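    The deposited code is in R; the sketch below is only a Python illustration of the data-level filtration idea described in step (1), keeping records from collectors who have ever collected the focal species. The file name, column names and example species are hypothetical, not taken from the deposited scripts.

    ```python
    # Illustrative sketch of collector-based background filtering (not the authors' R code).
    # Column names (collector, species) are hypothetical.
    import pandas as pd

    records = pd.read_csv("specimen_records.csv")
    focal_species = "Rhus copallinum"  # example Anacardiaceae species, for illustration only

    relevant_collectors = set(
        records.loc[records["species"] == focal_species, "collector"].unique()
    )
    background = records[records["collector"].isin(relevant_collectors)]
    print(f"{len(background)} of {len(records)} records kept after filtering")
    ```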

  8. Sample stemming words.

    • plos.figshare.com
    xls
    Updated Jun 6, 2024
    + more versions
    Cite
    Ashraf Ullah; Khair Ullah Khan; Aurangzeb Khan; Sheikh Tahir Bakhsh; Atta Ur Rahman; Sajida Akbar; Bibi Saqia (2024). Sample stemming words. [Dataset]. http://doi.org/10.1371/journal.pone.0290915.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ashraf Ullah; Khair Ullah Khan; Aurangzeb Khan; Sheikh Tahir Bakhsh; Atta Ur Rahman; Sajida Akbar; Bibi Saqia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Urdu language is spoken and written on different social media platforms like Twitter, WhatsApp, Facebook, and YouTube. However, due to the lack of Urdu Language Processing (ULP) libraries, it is quite challenging to identify threats in textual and sequential social media data provided in Urdu. It is therefore necessary to preprocess Urdu data as efficiently as English by creating stemming and data-cleaning libraries for Urdu. Different lexical and machine learning-based techniques have been introduced in the literature, but all of them are limited by the unavailability of an online Urdu vocabulary. This research introduces an Urdu language vocabulary, including a stop-word list and a stemming dictionary, to preprocess Urdu data as efficiently as English. This reduces the input size of Urdu sentences and removes redundant and noisy information. Finally, a deep sequential model based on Long Short-Term Memory (LSTM) units is trained on the efficiently preprocessed data, then evaluated and tested. The proposed methodology achieved good prediction performance, i.e., an accuracy of 82%, which is higher than that of existing methods.
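    As a hedged sketch of the pipeline described above (stop-word removal and dictionary-based stemming, followed by an LSTM classifier), the code below uses tiny placeholder resources; the stop-word list, stemming dictionary, example sentences and labels are not the authors' published vocabulary or data.

    ```python
    # Hedged sketch of the described pipeline; all Urdu resources here are placeholders.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    urdu_stopwords = {"کا", "کی", "اور"}    # placeholder stop-word list
    stem_dict = {"کتابیں": "کتاب"}          # placeholder stemming dictionary

    def preprocess(sentence: str) -> str:
        tokens = [t for t in sentence.split() if t not in urdu_stopwords]
        return " ".join(stem_dict.get(t, t) for t in tokens)

    raw_sentences = ["یہ ایک مثال ہے", "یہ دوسری مثال ہے"]  # placeholder sentences
    labels = np.array([0, 1])                                # placeholder labels

    texts = [preprocess(s) for s in raw_sentences]
    tokenizer = Tokenizer(num_words=20000)
    tokenizer.fit_on_texts(texts)
    X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # threat / non-threat
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, labels, epochs=2)
    ```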

  9. Data from: Improper data practices erode the quality of global ecological...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Jan 2, 2024
    Cite
    Steven Augustine; Isaac Bailey-Marren; Katherine Charton; Nathan Kiel; Michael Peyton (2024). Improper data practices erode the quality of global ecological databases and impede the progress of ecological research [Dataset]. http://doi.org/10.5061/dryad.wdbrv15w1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 2, 2024
    Dataset provided by
    University of Wisconsin–Madison
    Authors
    Steven Augustine; Isaac Bailey-Marren; Katherine Charton; Nathan Kiel; Michael Peyton
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    The scientific community has entered an era of big data. However, with big data come big responsibilities, and best practices for how data are contributed to databases have not kept pace with the collection, aggregation, and analysis of big data. Here, we rigorously assess the quantity of data for specific leaf area (SLA) available within the largest and most frequently used global plant trait database, the TRY Plant Trait Database, exploring how much of the data were applicable (i.e., original, representative, logical, and comparable) and traceable (i.e., published, cited, and consistent). Over three-quarters of the SLA data in TRY either lacked applicability or traceability, leaving only 22.9% of the original data usable compared to the 64.9% typically deemed usable by standard data cleaning protocols. The remaining usable data differed markedly from the original for many species, which led to altered interpretation of ecological analyses. Though the data we consider here make up only 4.5% of SLA data within TRY, similar issues of applicability and traceability likely apply to SLA data for other species as well as other commonly measured, uploaded, and downloaded plant traits. We end with suggested steps forward for global ecological databases, including suggestions for both uploaders to and curators of databases, with the hope that, through addressing the issues raised here, we can increase data quality and integrity within the ecological community.

    Methods
    SLA data were downloaded from TRY (traits 3115, 3116, and 3117) for all conifer (Araucariaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, and Taxaceae), Plantago, Poa, and Quercus species. The data have not been processed in any way, but additional columns have been added to the dataset that provide the viewer with information about where each data point came from, how it was cited, how it was measured, whether it was uploaded correctly, whether it had already been uploaded to TRY, and whether it was uploaded by the individual who collected the data.

  10. Global Touchless Vehicle Wash Systems Market Size By Type, By Application,...

    • verifiedmarketresearch.com
    Updated Jul 3, 2023
    Cite
    VERIFIED MARKET RESEARCH (2023). Global Touchless Vehicle Wash Systems Market Size By Type, By Application, By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/touchless-vehicle-wash-systems-market/
    Explore at:
    Dataset updated
    Jul 3, 2023
    Dataset authored and provided by
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2030
    Area covered
    Global
    Description

    Touchless Vehicle Wash Systems Market size was valued at USD 4.4 Billion in 2023 and is projected to reach USD 7.3 Billion by 2030, growing at a CAGR of 7.2% during the forecast period 2024-2030.

    Global Touchless Vehicle Wash Systems Market Drivers

    Technological Advancements: High-pressure water jets, chemical application systems, and sensors for accurate vehicle recognition and alignment are just a few examples of the cutting-edge technologies that have been incorporated into touchless car wash systems over the years. These developments reduce the possibility of surface damage to the car while guaranteeing effective and complete cleaning. Innovations like IoT integration and AI-driven control systems also make it possible to remotely monitor and optimize wash processes, which improves both customer satisfaction and operational efficiency.

    Enhanced Cleaning Performance: Touchless wash systems present a convincing answer to the rising expectations of consumers for better cleaning outcomes. These devices remove filth, grime, and pollutants from car surfaces without requiring direct physical touch by using strong water jets and specific cleaning solutions. Improved cleaning efficiency keeps the car looking great for longer by lowering the possibility of swirl marks and scratches in addition to guaranteeing a flawless surface.

    Environmental Concerns: As people’s awareness of the environment grows, the automotive industry is moving toward more environmentally friendly cleaning products. In comparison to conventional techniques, touchless wash systems use water more effectively, lowering total water consumption and limiting chemical runoff into the environment. Furthermore, the sustainability profile of touchless wash systems is further improved by developments in recycling technologies and biodegradable cleaning agents, which comply with environmental regulations and consumer expectations for eco-friendly practices.

    Convenience and Time Efficiency: For customers, convenience and time efficiency are critical in the fast-paced world of today. With less human intervention needed and quick cleaning cycles, touchless car wash systems are a practical substitute for brush-based or manual systems. Automated procedures and features that integrate with mobile apps for scheduling and payment improve the car wash process so that clients can keep their cars clean without having to give up important time.

    Maintenance of Vehicle Finish: Owners place a high value on keeping their cars looking good, which is why touchless wash systems with their gentle yet efficient cleaning method are so popular. These methods reduce the possibility of scratches, swirls, and paint damage by removing direct contact with brushes or abrasive materials. This increases the longevity and resale value of automobiles. This is a feature that appeals to auto enthusiasts and luxury automobile buyers who value having a spotless outside appearance.

    Safety and Hygiene: Touchless wash systems provide a safe and hygienic way to clean cars in the face of public health concerns. These methods lessen the possibility of cross-contamination and pathogen transmission by preventing physical contact between cleaning tools and the vehicle’s surface. This is particularly relevant for shared or rented automobiles, since upholding cleanliness and sanitation standards is crucial to fostering consumer satisfaction and confidence.

  11. Global Floor Sweepers Market Industry Best Practices 2025-2032

    • statsndata.org
    excel, pdf
    Updated Feb 2025
    Cite
    Stats N Data (2025). Global Floor Sweepers Market Industry Best Practices 2025-2032 [Dataset]. https://www.statsndata.org/report/global-80811
    Explore at:
    Available download formats: excel, pdf
    Dataset updated
    Feb 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Floor Sweepers market plays a crucial role in maintaining cleanliness across various industrial, commercial, and residential spaces. These machines are designed to efficiently remove debris, dirt, and dust from floors, offering a significant advantage over traditional cleaning methods. With increasing global emp

  12. Dataset and trained models belonging to the article 'Distant reading...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 28, 2021
    Cite
    Smits, Thomas (2021). Dataset and trained models belonging to the article 'Distant reading patterns of iconicity in 940.000 online circulations of 26 iconic photographs' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4244000
    Explore at:
    Dataset updated
    Sep 28, 2021
    Dataset provided by
    Smits, Thomas
    Ros, Ruben
    Description

    Quantifying Iconicity - Zenodo

    The Dataset

    This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.

    The core dataset consists of .tsv files with the URLs that refer to the webpages. Other metadata provided by the GCV API, together with manually generated metadata, is also found in the files. This includes:
    • the URL that refers specifically to the image; this can be a URL that refers to a full match or a partial match
    • the title of the page
    • the iteration number (because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search; we continued these iterations until no more new unique URLs were found)
    • the language found by the langid Python module, along with the normalized score
    • the labels associated with the image by Google
    • the scrape date

    Alongside the .tsv-files, there are several other elements in the following folder structure:

    ├── data
    │  ├── embeddings
    │        └── doc2vec
    │        └── input-text
    │        └── metadata
    │        └── umap
    │  └── evaluation
    │  └── results
    │        └── diachronic-plots
    │        └── top-words
    │  └── tsv
    
    1. The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages and for this reason not all training texts have associated metadata.
    2. The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
    3. The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.

    Data Cleaning and Curation

    Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags, such as <p>, <h1> etc.
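    The paragraph above does not spell out whether the ten-keypoint threshold was applied to raw keypoints or to keypoints matched against the reference photograph. Purely as a rough illustration, a matched-keypoint filter using OpenCV's SIFT implementation could look like the sketch below; the file names, ratio and threshold are assumptions, not the project's actual code (which lives in the linked GitHub repository).

    ```python
    # Hedged sketch of a SIFT-based relevance filter: keep a scraped image only if it
    # shares enough matched keypoints with the reference iconic photograph.
    import cv2

    def enough_sift_matches(reference_path, candidate_path, min_matches=10, ratio=0.75):
        ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
        cand = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
        sift = cv2.SIFT_create()
        _, ref_desc = sift.detectAndCompute(ref, None)
        _, cand_desc = sift.detectAndCompute(cand, None)
        if ref_desc is None or cand_desc is None:
            return False
        matches = cv2.BFMatcher().knnMatch(ref_desc, cand_desc, k=2)
        # Lowe's ratio test keeps only distinctive matches
        good = [p for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        return len(good) >= min_matches

    print(enough_sift_matches("iconic_reference.jpg", "scraped_image_001.jpg"))
    ```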

  13. Data from: Skin hydrophobicity as an adaptation for self-cleaning in geckos

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Mar 16, 2021
    Cite
    Jendrian Riedel; Matthew Vucko; Lin Schwarzkopf; Simone Blomberg (2021). Skin hydrophobicity as an adaptation for self-cleaning in geckos [Dataset]. http://doi.org/10.5061/dryad.xwdbrv19s
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    James Cook University
    Authors
    Jendrian Riedel; Matthew Vucko; Lin Schwarzkopf; Simone Blomberg
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Hydrophobicity is common in plants and animals, typically caused by high relief microtexture functioning to keep the surface clean. Although the occurrence and physical causes of hydrophobicity are well understood, ecological factors promoting its evolution are unclear. Geckos have highly hydrophobic integuments. We predicted that, because the ground is dirty and filled with pathogens, high hydrophobicity should coevolve with terrestrial microhabitat use. Advancing contact angle (ACA) measurements of water droplets were used to quantify hydrophobicity in 24 species of Australian gecko. We reconstructed the evolution of ACA values, in relation to microhabitat use of geckos. To determine the best set of structural characteristics associated with the evolution of hydrophobicity, we used linear models fitted using phylogenetic generalized least squares (PGLS), and then model averaging based on AICc values. All species were highly hydrophobic (ACA > 132.72°), but terrestrial species had significantly higher ACA values than arboreal ones. The evolution of longer spinules and smaller scales were correlated with high hydrophobicity. These results suggest that hydrophobicity has co-evolved with terrestrial microhabitat use in Australian geckos via selection for long spinules and small scales, likely to keep their skin clean and prevent fouling and disease.

    Methods
    For details of data collection, see the methods section of the paper.

  14. Multi Country Study Survey 2000-2001 - Morocco

    • datacatalog.ihsn.org
    • dev.ihsn.org
    • +2more
    Updated Mar 29, 2019
    Cite
    World Health Organization (WHO) (2019). Multi Country Study Survey 2000-2001 - Morocco [Dataset]. https://datacatalog.ihsn.org/catalog/3880
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    World Health Organization (WHO)
    Time period covered
    2000 - 2001
    Area covered
    Morocco
    Description

    Abstract

    In order to develop various methods of comparable data collection on health and health system responsiveness, WHO started a scientific survey study in 2000-2001. This study used a common survey instrument in nationally representative populations, with a modular structure for assessing the health of individuals in various domains, health system responsiveness, household health care expenditures, and additional modules in other areas such as adult mortality and health state valuations.

    The health module of the survey instrument was based on selected domains of the International Classification of Functioning, Disability and Health (ICF) and was developed after a rigorous scientific review of various existing assessment instruments. The responsiveness module has been the result of ongoing work over the last 2 years that has involved international consultations with experts and key informants and has been informed by the scientific literature and pilot studies.

    Questions on household expenditure and proportionate expenditure on health have been borrowed from existing surveys. The survey instrument has been developed in multiple languages using cognitive interviews and cultural applicability tests, stringent psychometric tests for reliability (i.e. test-retest reliability to demonstrate the stability of application) and most importantly, utilizing novel psychometric techniques for cross-population comparability.

    The study was carried out in 61 countries completing 71 surveys, because two different modes were intentionally used for comparison purposes in 10 countries. Surveys were conducted in different modes: in-person 90-minute household interviews in 14 countries; brief face-to-face interviews in 27 countries and computerized telephone interviews in 2 countries; and postal surveys in 28 countries. All samples were selected from nationally representative sampling frames with a known probability so as to make estimates based on general population parameters.

    The survey study tested novel techniques to control the reporting bias between different groups of people in different cultures or demographic groups (i.e. differential item functioning) so as to produce comparable estimates across cultures and groups. To achieve comparability, the self-reports of individuals of their own health were calibrated against well-known performance tests (i.e. self-reported vision was measured against the standard Snellen visual acuity test) or against short descriptions in vignettes that marked known anchor points of difficulty (e.g. people with different levels of mobility such as a paraplegic person or an athlete who runs 4 km each day) so as to adjust the responses for comparability. The same method was also used for self-reports of individuals assessing the responsiveness of their health systems, where vignettes on different responsiveness domains describing different levels of responsiveness were used to calibrate the individual responses.

    These data are useful in their own right to standardize indicators for different domains of health (such as cognition, mobility, self-care, affect, usual activities, pain, social participation, etc.), but they also provide a better measurement basis for assessing the health of populations in a comparable manner. The data from the surveys can be fed into composite measures such as "Healthy Life Expectancy" and improve the empirical data input for health information systems in different regions of the world. Data from the surveys were also useful for improving the measurement of the responsiveness of different health systems to the legitimate expectations of the population.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample was selected by multi-stage stratified random sampling, described as follows. Twenty-two cities were selected following the main structure of Morocco, composed of 5 strata and seven areas.

    Once the stratum was defined, the city was chosen at random, except for Casablanca and Rabat, which were chosen intentionally due to their importance. At the second stage, a representative rural town from each area was chosen at random. The third stage consisted of selecting city quarters randomly. The fourth stage was the household selection according to the "step method", which meant approaching every third house.

    In the case of a building, the interviewers were asked to begin from the top floor, to choose only one apartment on each floor, and to go downstairs every other floor, with a maximum of two interviews per building. The last stage was the respondent's selection based on the Kish method.

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    Data Coding
    At each site the data was coded by investigators to indicate the respondent status and the selection of the modules for each respondent within the survey design. After the interview was edited by the supervisor and considered adequate, it was entered locally.

    Data Entry Program
    A data entry program was developed in WHO specifically for the survey study and provided to the sites. It was developed using a database program called the I-Shell (short for Interview Shell), a tool designed for easy development of computerized questionnaires and data entry (34). This program allows for easy data cleaning and processing.

    The data entry program checked for inconsistencies and validated the entries in each field by checking for valid response categories and range checks. For example, the program didn’t accept an age greater than 120. For almost all of the variables there existed a range or a list of possible values that the program checked for.

    In addition, the data was entered twice to capture other data entry errors. The data entry program was able to warn the user whenever a value that did not match the first entry was entered at the second data entry. In this case the program asked the user to resolve the conflict by choosing either the 1st or the 2nd data entry value to be able to continue. After the second data entry was completed successfully, the data entry program placed a mark in the database in order to enable the checking of whether this process had been completed for each and every case.

    Data Transfer
    The data entry program was capable of exporting the data that was entered into one compressed database file which could be easily sent to WHO using email attachments or a file transfer program onto a secure server, no matter how many cases were in the file. The sites were allowed the use of as many computers and as many data entry personnel as they wanted. Each computer used for this purpose produced one file, and they were merged once they were delivered to WHO with the help of other programs that were built for automating the process. The sites sent the data periodically as they collected it, enabling the checking procedures and preliminary analyses in the early stages of the data collection.

    Data quality checks
    Once the data was received, it was analyzed for missing information, invalid responses and representativeness. Inconsistencies were also noted and reported back to sites.

    Data Cleaning and Feedback
    After receipt of cleaned data from sites, another program was run to check for missing information, incorrect information (e.g. wrong use of center codes), duplicated data, etc. The output of this program was fed back to sites regularly. Mainly, this consisted of cases with duplicate IDs, duplicate cases (where the data for two respondents with different IDs were identical), wrong country codes, and missing age, sex, education and some other important variables.
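    The checks described above were implemented in WHO's own I-Shell-based tooling, which is not reproduced here. As an illustration only, the same kinds of checks (range validation, duplicate IDs, fully duplicated cases) could be expressed in pandas roughly as follows; the file and column names are hypothetical.

    ```python
    # Illustrative sketch of the kinds of cleaning checks described above (not WHO's program).
    import pandas as pd

    df = pd.read_csv("mcss_morocco_raw.csv")

    problems = {
        "age_out_of_range": df[(df["age"] < 0) | (df["age"] > 120)],
        "missing_sex": df[df["sex"].isna()],
        "duplicate_ids": df[df.duplicated(subset="respondent_id", keep=False)],
        "duplicate_cases": df[df.drop(columns="respondent_id").duplicated(keep=False)],
    }
    for name, rows in problems.items():
        print(f"{name}: {len(rows)} record(s)")
    ```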

  15. Large Scale Cleaning Telescope Mirrors with Electron Beams

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Large Scale Cleaning Telescope Mirrors with Electron Beams [Dataset]. https://data.nasa.gov/dataset/Large-Scale-Cleaning-Telescope-Mirrors-with-Electr/utqx-figq
    Explore at:
    Available download formats: xml, application/rdfxml, csv, json, tsv, application/rssxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    The Cleaning Lenses and Mirrored Surfaces with Electrons effort is a technology developed to provide access to large lens and mirror structures and to provide a low-risk technique for cleaning their surfaces.

    The Cleaning Lenses and Mirrored Surfaces with Electrons tasks include: Development of Fractal Wand Geometries; Vacuum Chamber testing of Fractal Wand Prototypes; and selection of the best prototype.

  16. Calcite saturation and yield and Mg/Ca, Mn/Ca, Al/Ca, Fe/Ca for four species...

    • search.dataone.org
    • doi.pangaea.de
    Updated Jan 6, 2018
    + more versions
    Cite
    Johnstone, Heather J H; Lee, R W; Schulz, Michael (2018). Calcite saturation and yield and Mg/Ca, Mn/Ca, Al/Ca, Fe/Ca for four species of planktic foraminifera cleaned using Mg-cleaning and Cd-cleaning methods [Dataset]. https://search.dataone.org/view/81901b2c8e0719f577ab29fd39b395f2
    Explore at:
    Dataset updated
    Jan 6, 2018
    Dataset provided by
    PANGAEA Data Publisher for Earth and Environmental Science
    Authors
    Johnstone, Heather J H; Lee, R W; Schulz, Michael
    Description

    Four species of planktic foraminifera from core-tops spanning a depth transect on the Ontong Java Plateau were prepared for Mg/Ca analysis both with (Cd-cleaning) and without (Mg-cleaning) a reductive cleaning step. Reductive cleaning caused etching of foraminiferal calcite, focused on Mg-rich inner calcite, even on tests which had already been partially dissolved at the seafloor. Despite corrosion, there was no difference in Mg/Ca of Pulleniatina obliquiloculata between cleaning methods. Reductive cleaning decreased Mg/Ca by an average (all depths) of ~ 4% for Globigerinoides ruber white and ~ 10% for Neogloboquadrina dutertrei. Mg/Ca of Globigerinoides sacculifer (above the calcite saturation horizon only) was 5% lower after reductive cleaning. The decrease in Mg/Ca due to reductive cleaning appeared insensitive to preservation state for G. ruber, N. dutertrei and P. obliquiloculata. Mg/Ca of Cd-cleaned G. sacculifer appeared less sensitive to dissolution than that of Mg-cleaned. Mg-cleaning is adequate, but SEM and contaminants (Al/Ca, Fe/Ca and Mn/Ca) show that Cd-cleaning is more effective for porous species. A second aspect of the study addressed sample loss during cleaning. Lower yield after Cd-cleaning for G. ruber, G. sacculifer and N. dutertrei confirmed this to be the more aggressive method. Strongest correlations between yield and Delta[CO3^2-] in core-top samples were for Cd-cleaned G. ruber (r = 0.88, p = 0.020) and Cd-cleaned P. obliquiloculata (r = 0.68, p = 0.030). In a down-core record (WIND28K) correlation, r, between yield values > 30% and dissolution index, XDX, was -0.61 (p = 0.002). Where cleaning yield < 30% most Mg-cleaned Mg/Ca values were biased by dissolution.

  17. Catalogus Epistularum Neerlandicarum (CEN): Letter and person metadata...

    • b2find.dkrz.de
    • dataverse.nl
    Updated Mar 28, 2024
    Cite
    (2024). Catalogus Epistularum Neerlandicarum (CEN): Letter and person metadata (1270-1820) curated by the SKILLNET project [Dataset]. https://b2find.dkrz.de/dataset/1effd553-abde-5728-aa8f-1ec198a4a3f3
    Explore at:
    Dataset updated
    Mar 28, 2024
    Description

    This dataset contains curated files derived from the cleaning process of a slice of the Catalogus Epistularum Neerlandicarum (CEN) metadata. If you wish to explore, query, filter, visualize or export the data with the support of a Jupyter notebook, one is provided for direct use here: https://edu.nl/bn93d. This dataset can be used by researchers interested in the correspondence exchanged during the Early Modern period, especially related to Dutch learned men and women of the time. The CEN is the Dutch national letter catalog, which aggregates letter metadata from different universities in the Netherlands and from the National Library of the Netherlands (KB), among others, from the 1980s to the present. Since January 2020, one can consult the CEN via Worldcat (https://picarta.on.worldcat.org, last accessed November 9, 2022). The entire database contains more than a million records (according to the KB website consulted on November 9, 2022: https://www.kb.nl/over-ons/diensten/cen). The database is curated by the KB, which owns the rights together with OCLC. A data dump in XML of the CEN database was obtained by Ingeborg van Vugt (https://orcid.org/0000-0002-7703-1791) from the KB and OCLC in October 2019. The dataset has been sliced (years 1200 to 1820) and cleaned in two phases: (a) manually during Ingeborg's Ph.D. thesis and (b) semi-automatically during the SKILLNET project. The second part of the cleaning process was carried out jointly by data manager Liliana Melgar Estrada (https://orcid.org/0000-0003-2003-4200) and Ingeborg van Vugt, with input from different collaborators from the SKILLNET team, student assistant Rosalie Versmissen (https://orcid.org/0000-0001-9558-8510), and some external collaborators. The curated version of a slice of the entire data is offered in this dataset. It includes the letters between 1270 and 1820 plus some undated letters, which is of interest for the study of Early Modern correspondence. The XML file was converted to a .csv file with the support of the Digital Humanities Center at Utrecht University. The initial XML data dump is not provided in this dataset. It has more than 500 thousand rows (each row representing either a letter or a group of letters); however, the cleaning process was non-destructive, which means that the original metadata and links to the original source can be found together with the curated data. These data underwent semi-automatic and manual cleaning, which resulted in two files that are deposited in this dataset: one containing the letters' metadata, and another containing the unique persons' metadata (which also includes the mappings, i.e., identifiers, to other datasets). The curation consisted of applying data wrangling operations (parsing, harmonization), adding missing metadata or correcting inaccuracies (dates of birth/death of correspondents, letter dates), validating correctness (between dates of birth/death and letter dates) and partial reconciliation (adding external identifiers from other letter databases). The CEN data is also available for free access online via Picarta (https://picarta.oclc.org/psi/xslt/DB=3.23/) and Worldcat (https://www.worldcat.org/). Some data may have been updated since 2020 in the online versions but, to the best of our knowledge, this doesn't happen often for the period of time (subset) that was cleaned by SKILLNET.
    If you use this dataset, it is recommended to give it a proper citation, including the version number and date. The datasets provided by the SKILLNET project have been curated (i.e., cleaned, harmonized, reconciled) using manual and semi-automatic methods. Even though a lot of dedication and effort was put into curating the datasets provided here, some errors, inaccuracies and/or missing data still exist. Until January 2023 the files will be constantly updated with more cleaned data and mappings. For that reason, it is important that users of the dataset always include the version number in their reports, or wait until January 2023 when the latest version will be deposited.
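    The notebook at https://edu.nl/bn93d is the recommended entry point for exploring the curated files. Purely as a rough usage sketch, assuming the letter metadata is a .csv with hypothetical columns sender and letter_date (the actual file and column names may differ), one could filter a date window like this:

    ```python
    # Hypothetical sketch: filter the curated letter metadata to a date window.
    # File and column names are assumptions; years are parsed as plain integers to
    # avoid pandas' limited Timestamp range for pre-1677 dates.
    import pandas as pd

    letters = pd.read_csv("cen_letters_curated.csv", dtype=str)
    letters["year"] = pd.to_numeric(letters["letter_date"].str[:4], errors="coerce")
    window = letters[(letters["year"] >= 1600) & (letters["year"] <= 1700)]
    print(window.groupby("sender").size().sort_values(ascending=False).head(10))
    ```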

  18. Living Standards Survey IV 1998-1999 - Ghana

    • microdata.worldbank.org
    Updated Jan 30, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Living Standards Survey IV 1998-1999 - Ghana [Dataset]. https://microdata.worldbank.org/index.php/catalog/2331
    Explore at:
    Dataset updated
    Jan 30, 2020
    Dataset provided by
    Ghana Statistical Services
    Authors
    Ghana Statistical Service (GSS)
    Time period covered
    1998 - 1999
    Area covered
    Ghana
    Description

    Abstract

    The Ghana Living Standards Survey (GLSS), with its focus on the household as a key social and economic unit, provides valuable insights into living conditions in Ghana. The survey was carried out by the Ghana Statistical Service (GSS) over a 12-month period (April 1998 to March 1999). A representative nationwide sample of 5,998 households, containing over 25,000 persons, was covered in GLSS IV.

    The fourth round of the GLSS has the following objectives:

    • To provide information on patterns of household consumption and expenditure disaggregated at greater levels.
    • In combination with the data from the earlier rounds, to serve as a database for national and regional planning.
    • To provide in-depth information on the structure and composition of the wages and conditions of work of the labor force in the country.
    • To provide benchmark data for compilation of current statistics on average earnings, hours of work and time rates of wages and salaries that will indicate wage/salary differentials between industries, occupations, geographic locations and gender.

    Additionally, the survey will enable policy-makers to:

    • Identify vulnerable groups for government assistance;
    • Analyze the impact of decisions that have already been implemented and of the economic situation on living conditions of households;
    • Monitor and evaluate employment policies and programs, income generating and maintenance schemes, vocational training and similar programs.

    The joint measure of employment, income and expenditure provides the basis for analyzing the adequacy of employment of different categories of workers and the income-generating capacity of employment-related economic development.

    Geographic coverage

    National

    Analysis unit

    • Household
    • Individual
    • Community
    • Commodity

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A nationally representative sample of households was selected in order to achieve the survey objectives. For the purposes of this survey the list of the 1984 population census Enumeration Areas (EAs) with population and household information was used as the sampling frame. The primary sampling units were the 1984 EAs with the secondary units being the households in the EAs. This frame, though quite old, was considered the best available at the time. Indeed, this frame was used for the earlier rounds of the GLSS.

    In order to increase precision and reliability of the estimates, the technique of stratification was employed in the sample design, using geographical factors, ecological zones and location of residence as the main controls. Specifically, the EAs were first stratified according to the three ecological zones namely; Coastal, Forest and Savannah, and then within each zone further stratification was done based on the size of the locality into rural or urban.

    A two-stage sample was selected for the survey. At the first stage, 300 EAs were selected using the systematic sampling with probability proportional to size (PPS) method, where the size measure is the 1984 number of households in the EA. This was achieved by ordering the list of EAs with their sizes according to the strata. The size column was then cumulated, and with a random start and a fixed interval the sample EAs were selected. It was observed that some of the selected EAs had grown in size over time and therefore needed segmentation. Such EAs were divided into approximately equal parts, each segment constituting about 200 households, and only one segment was then randomly selected for listing of the households. At the second stage, a fixed number of 20 households was systematically selected from each selected EA to give a total of 6,000 households. An additional 5 households were selected as reserves to replace missing households. An equal number of households was selected from each EA in order to reflect the labor force focus of the survey.

    NOTE: The above sample selection procedure deviated slightly from that used for the earlier rounds of the GLSS; as such, the sample is not self-weighting. This is because: (a) given the long period between 1984 and the GLSS 4 fieldwork, the number of households in the various EAs is likely to have grown at different rates; and (b) the listing exercise was not properly done, as some of the selected EAs were not listed completely. Moreover, the segmentation done for larger EAs during the listing was somewhat arbitrary.
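    As an illustration only, the systematic PPS selection described above (cumulating the size column, then selecting with a random start and a fixed interval) can be sketched as follows. The frame below is hypothetical random data, not the 1984 census frame, and this is not the program used by the GSS.

    import random

    def systematic_pps(frame, n_sample):
        """Systematic selection with probability proportional to size (PPS)."""
        total = sum(size for _, size in frame)              # cumulate the size column
        interval = total / n_sample                         # fixed sampling interval
        start = random.uniform(0, interval)                 # random start
        targets = [start + k * interval for k in range(n_sample)]

        selected, cumulative, i = [], 0, 0
        for ea_id, size in frame:                           # frame ordered by strata
            cumulative += size
            while i < n_sample and targets[i] <= cumulative:
                selected.append(ea_id)                      # very large EAs can be hit more than once
                i += 1
        return selected

    # Hypothetical frame of EAs (id, 1984 household count), ordered by zone and urban/rural stratum.
    frame = [(f"EA-{k:04d}", random.randint(80, 400)) for k in range(1, 12001)]
    sample_eas = systematic_pps(frame, 300)                 # 300 EAs at the first stage
    print(len(sample_eas))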

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The main questionnaire used in the survey was the household questionnaire. In addition, there were Community and Price questionnaires.

    • Household Questionnaire: The household questionnaire was used to collect information on various topics some of which pertain to eligible individual household members. The questionnaire is in two parts, A and B.
    • Community Questionnaire: The main aim of the community questionnaire was to identify the economic infrastructure, education and health facilities existing in the villages, as well as any related problems that affect their welfare. The questionnaire was administered in the rural EAs only.
    • Price Questionnaire: As part of the survey a price questionnaire was designed to collect prices of most essential commodities in the local markets.

    Cleaning operations

    Training: The project had 3 experienced computer programmers responsible for the data processing. Data processing started with a two-week training of 15 data entry operators, out of which the best 10 were chosen and 2 identified as standbys. The training took place one week after the commencement of the fieldwork.

    Data entry: Each data entry operator was assigned to one field team and stationed in the regional office of the GSS. The main data entry software used to capture the data was IMPS (Integrated Microcomputer Processing System). Data capture ran concurrently with data collection and lasted for 12 months.

    Tabulation/Analysis: The IMPS data was read into SAS (Statistical Analysis System), after which the analysis and generation of the statistical tables were done using SAS.

    Response rate

    Out of the 6,000 selected households, 5,999 were successfully interviewed. One household was further dropped during the data cleaning exercise because it had very few records for many of the sections in the questionnaire. This left 5,998 households, representing 99.97% coverage. Overall, 25,694 eligible household members (unweighted) were covered in the survey.

  19. National Survey on Household Living Conditions and Agriculture - Wave 2,...

    • microdata.fao.org
    Updated Nov 8, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2022). National Survey on Household Living Conditions and Agriculture - Wave 2, 2014 - 2015 - Niger [Dataset]. https://microdata.fao.org/index.php/catalog/1322
    Explore at:
    Dataset updated
    Nov 8, 2022
    Time period covered
    2014 - 2015
    Area covered
    Niger
    Description

    Abstract

    Niger is part of the Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) program. This program has developed a household level survey with a view to enhancing our knowledge of agriculture in Sub-Saharan Africa, in particular, its role in poverty reduction and the techniques for promoting efficiency and innovation in this sector. To achieve this objective, an innovative model for agricultural data collection in this region will need to be developed and implemented. To this end, activities conducted in the future will be supported by four main pillars: a multisectoral framework, institutional integration, analytical capacity building, and active dissemination.

    First, agricultural statistical data collection must be part of an expanded and multisectoral framework that goes beyond the rural area. This will facilitate generation of the data needed to formulate effective agricultural policies throughout Niger and in the broader framework of the rural economy.

    Second, agricultural statistical data collection must be supported by a well-adapted institutional framework suited to fostering collaboration and the integration of data sources. By supporting a multi-pronged approach to data collection, this project seeks to foster intersectoral collaboration and overcome a number of the current institutional constraints.

    Third, national capacity building needs to be strengthened in order to enhance the reliability of the data produced and strengthen the link between the producers and users of data. This entails having the capacity to analyze data and to produce appropriate public data sets in a timely manner. The lack of analytical expertise in developing countries perpetuates weak demand for statistical data.

    Consequently, the foregoing has a negative impact on the quality and availability of policy-related analyses. Scant dissemination of statistics and available results has compounded this problem.

    In all countries where the LSMS-ISA project will be executed, the process envisioned for data collection will be a national household survey, based on models of LSMS surveys to be conducted every three years for a panel of households. The sampling method to be adopted should ensure the quality of the data, taking into account the depth/complexity of the questionnaire and panel size, while ensuring that samples are representative.

    The main objectives of the ECVMA are to:

    • Gauge the progress made with achievement of the Millennium Development Goals (MDGs);
    • Facilitate the updating of the social indicators used in formulating the policies aimed at improving the living conditions of the population;
    • Provide data related to several areas that are important to Niger without conducting specific surveys on individual topics;
    • Provide data on several important areas for Niger that are not necessarily collected in other more specific surveys.

    Geographic coverage

    National Coverage

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Mode of data collection

    Face-to-face paper [f2f]

    Cleaning operations

    The data entry was done in the field simultaneously with the data collection. Each data collection team included a data entry operator who key entered the data soon after it was collected. The data entry program was designed in CSPro, a data entry package developed by the US Census Bureau. This program allows three types of data checks: (1) range checks; (2) intra-record checks to verify inconsistencies pertinent to the particular module of the questionnaire; and (3) inter-record checks to determine inconsistencies between the different modules of the questionnaire.
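    As a rough illustration (not CSPro itself), the three kinds of checks described above can be expressed as simple predicates over a record; the field names below are hypothetical and do not come from the actual ECVMA questionnaire.

    # (1) Range check: a single value must lie in a plausible interval.
    def range_check(person):
        return 0 <= person["age"] <= 110

    # (2) Intra-record check: consistency within one module, e.g. a household
    #     head should not be younger than 12.
    def intra_record_check(person):
        return not (person["relation_to_head"] == "head" and person["age"] < 12)

    # (3) Inter-record check: consistency between modules, e.g. the reported
    #     household size must match the number of roster lines.
    def inter_record_check(household, roster):
        return household["household_size"] == len(roster)

    household = {"household_size": 3}
    roster = [
        {"age": 42, "relation_to_head": "head"},
        {"age": 39, "relation_to_head": "spouse"},
        {"age": 7, "relation_to_head": "child"},
    ]
    print(all(range_check(p) and intra_record_check(p) for p in roster))   # True
    print(inter_record_check(household, roster))                           # True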

    The data as distributed represent the best effort to provide complete information. The data were collected and cleaned prior to the construction of the consumption aggregate. Using the same guidelines as in 2011, the households provided in the data set should have consumption data for both visits, but this may not always be the case. During the cleaning process it was found that some households had been misidentified, which allowed more households to be included in the final consumption aggregate file (see below). The raw household/item-level data used to calculate the consumption aggregate has been included in the distribution file. There are 3,614 households and 26,579 individuals in the data.

  20. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Explore at:
    zip(44486490 bytes)Available download formats
    Dataset updated
    Sep 11, 2019
    Authors
    Ajay
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners usually start with curated datasets, but it is well known that in a real data science project most of the time is spent on collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its image. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4,000 images of two popular Maruti Suzuki car models in India (Swift and WagonR), with 2,000 pictures belonging to each model. The data is divided into a training set with 2,400 images, a validation set with 800 images and a test set with 800 images. The data was randomized before being split into training, validation and test sets.

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization and transfer learning (a minimal baseline along these lines is sketched below).
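    A minimal sketch of such a baseline, assuming a hypothetical directory layout (cars/train, cars/val) produced from the training/validation split described above; this is not the starter kernel itself.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Assumed directory layout: one sub-folder per class (swift/, wagonr/).
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "cars/train", image_size=(128, 128), batch_size=32, label_mode="binary")
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "cars/val", image_size=(128, 128), batch_size=32, label_mode="binary")

    model = tf.keras.Sequential([
        layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
        layers.RandomFlip("horizontal"),        # data augmentation (active during training only)
        layers.RandomRotation(0.1),
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),                    # regularization
        layers.Dense(1, activation="sigmoid"),  # binary output: Swift vs WagonR
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=10)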

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?

    2. Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?

    3. I would also like someone to extend the concept to build a use case so that if a user uploads an incorrect car picture, the ML model could automatically flag it, for example a user uploading the wrong model or an image that is not a car.
