CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
This dataset was created by Luis Lira
This dataset was created by Mohamed Khaled Idris
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
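Because posts and labels sit on corresponding lines of two text files, they can be paired by line position. The following is a minimal sketch of that pairing, not the corpus authors' tooling; the function names are hypothetical, and the label codes used in the labels file are not stated in the description, so the code treats them as opaque strings.

```python
from collections import Counter


def read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def load_gold(posts_path, labels_path):
    """Pair each post with the label found on the same line number."""
    posts = read_lines(posts_path)
    labels = [label.strip() for label in read_lines(labels_path)]
    if len(posts) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(posts)} posts vs {len(labels)} labels"
        )
    return list(zip(posts, labels))


def label_distribution(pairs):
    """Count how many posts carry each label value."""
    return Counter(label for _, label in pairs)
```

Running `label_distribution` on the real files should reproduce the class sizes quoted above (2,587, 5,174, 1,991 and 248, which sum to 10,000); a different total would point to a misaligned line somewhere in the pair of files.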
About this course
Do you have messy data from multiple inconsistent sources, or open-responses to questionnaires? Do you want to improve the quality of your data by refining it and using the power of the internet? Open Refine is the perfect partner to Excel. It is a powerful, free tool for exploring, normalising and cleaning datasets, and extending data by accessing the internet through APIs. In this course we’ll work through the various features of Refine, including importing data, faceting, clustering, and calling remote APIs, by working on a fictional but plausible humanities research project.
Learning Outcomes
- Download, install and run Open Refine
- Import data from csv, text or online sources and create projects
- Navigate data using the Open Refine interface
- Explore data by using facets
- Clean data using clustering
- Parse data using GREL syntax
- Extend data using Application Programming Interfaces (APIs)
- Export project for use in other applications
Prerequisites
The course has no prerequisites.
Licence
Copyright © 2021 Intersect Australia Ltd. All rights reserved.
A fully cleaned Excel data file, prepared to support accurate data analysis and proper visualization.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
--------CALL CENTER PERFORMANCE DATASET ANALYSIS--------
This is a self-guided project.
The Call Center dataset contains customer data such as caller ID, customer name, date, call channel, city, state, reason for calling, call duration, etc.
I tasked myself with identifying trends and patterns in order to create a summary of the data that gives both technical and non-technical viewers an overview-level understanding.
OBJECTIVES: Create a dashboard (using charts, slicers and KPIs) which can be used to statistically track, monitor and visualize the performance of a Call Center.
SOFTWARE TOOLS USED: Microsoft Excel
ANALYTICAL ACTIONS PERFORMED: Data Importation, Data Processing, Data Cleaning, VLOOKUP, Pivot Tables, Data Visualization (Dashboard creation), Connection Reporting (connecting slicers to Dashboard)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.
Key Features and Tools:
- Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
- Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
- Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
- Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
The 2003 Agriculture Sample Census was designed to meet the data needs of a wide range of users down to district level, including policy makers at local, regional and national levels, rural development agencies, funding institutions, researchers, NGOs, farmer organisations, etc. As a result, the dataset has both a larger sample and a more detailed scope than previous censuses and surveys. To date this is the most detailed Agricultural Census carried out in Africa.
The census was carried out in order to:
· Identify structural changes, if any, in the size of farm household holdings, crop and livestock production, and farm input and implement use. It also seeks to determine whether there are any improvements in rural infrastructure and in the living conditions of agricultural households.
· Provide benchmark data on productivity, production and agricultural practices in relation to policies and interventions promoted by the Ministry of Agriculture and Food Security and other stakeholders.
· Establish baseline data for the measurement of the impact of high level objectives of the Agriculture Sector Development Programme (ASDP), National Strategy for Growth and Reduction of Poverty (NSGRP) and other rural development programs and projects.
· Obtain benchmark data that will be used to address specific issues such as food security, rural poverty, gender, agro-processing, marketing, service delivery, etc.
Tanzania Mainland and Zanzibar
Large scale, small scale and community farms.
Census/enumeration data [cen]
The Mainland sample consisted of 3,221 villages. These villages were drawn from the National Master Sample (NMS) developed by the National Bureau of Statistics (NBS) to serve as a national framework for the conduct of household based surveys in the country. The National Master Sample was developed from the 2002 Population and Housing Census. The total Mainland sample was 48,315 agricultural households. In Zanzibar a total of 317 enumeration areas (EAs) were selected and 4,755 agriculture households were covered. Nationwide, all regions and districts were sampled with the exception of three urban districts (two from Mainland and one from Zanzibar).
In both Mainland and Zanzibar, a stratified two stage sample was used. The number of villages/EAs selected for the first stage was based on a probability proportional to the number of villages in each district. In the second stage, 15 households were selected from a list of farming households in each selected Village/EA, using systematic random sampling, with the village chairpersons assisting to locate the selected households.
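The first-stage allocation and the resulting household totals can be checked with a short calculation. The sketch below assumes that "proportional to the number of villages in each district" means a proportional allocation of the village sample across districts, and it uses largest-remainder rounding, which is an assumption of this example rather than a documented census procedure. District names and counts in any call are hypothetical.

```python
HOUSEHOLDS_PER_VILLAGE = 15  # second-stage take stated in the text


def allocate_villages(villages_per_district, total_sample):
    """Split a village sample across districts in proportion to village counts.

    Rounding uses the largest-remainder method (an assumption of this sketch).
    """
    total = sum(villages_per_district.values())
    exact = {d: total_sample * n / total for d, n in villages_per_district.items()}
    allocation = {d: int(quota) for d, quota in exact.items()}
    leftover = total_sample - sum(allocation.values())
    # Give the remaining villages to the districts with the largest fractions.
    by_fraction = sorted(exact, key=lambda d: exact[d] - allocation[d], reverse=True)
    for district in by_fraction[:leftover]:
        allocation[district] += 1
    return allocation


def expected_households(n_villages, per_village=HOUSEHOLDS_PER_VILLAGE):
    """Households covered when a fixed number is taken in every village/EA."""
    return n_villages * per_village
```

With 15 households per unit, `expected_households(3221)` gives 48,315 and `expected_households(317)` gives 4,755, which matches the Mainland and Zanzibar totals reported above.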
Face-to-face [f2f]
The census covered agriculture in detail as well as many other aspects of rural development and was conducted using three different questionnaires:
• Small scale questionnaire
• Community level questionnaire
• Large scale farm questionnaire
The small scale farm questionnaire was the main census instrument and it includes questions related to crop and livestock production and practices; population demographics; access to services, resources and infrastructure; and issues on poverty, gender and subsistence versus profit making production unit.
The community level questionnaire was designed to collect village level data such as access and use of common resources, community tree plantation and seasonal farm gate prices.
The large scale farm questionnaire was administered to large farms either privately or corporately managed.
Questionnaire Design
The questionnaires were designed following user meetings to ensure that the questions asked were in line with users' data needs. Several features were incorporated into the design of the questionnaires to increase the accuracy of the data:
• Where feasible all variables were extensively coded to reduce post enumeration coding error.
• The definitions for each section were printed on the opposite page so that the enumerator could easily refer to the instructions whilst interviewing the farmer.
• The responses to all questions were placed in boxes printed on the questionnaire, with one box per character. This feature made it possible to use scanning and Intelligent Character Recognition (ICR) technologies for data entry.
• Skip patterns were used to reduce unnecessary and incorrect coding of sections which do not apply to the respondent.
• Each section was clearly numbered, which facilitated the use of skip patterns and provided a reference for data type coding for the programming of CSPro, SPSS and the dissemination applications.
Data processing consisted of the following processes:
· Data entry
· Data structure formatting
· Batch validation
· Tabulation
Data Entry
Scanning and ICR data capture technology for the small holder questionnaire were used on the Mainland. This not only increased the speed of data entry, it also increased the accuracy due to the reduction of keystroke errors. Interactive validation routines were incorporated into the ICR software to track errors during the verification process. The scanning operation was so successful that it is highly recommended for adoption in future censuses/surveys. In Zanzibar all data was entered manually using CSPro.
Prior to scanning, all questionnaires underwent a manual cleaning exercise. This involved checking that the questionnaire had a full set of pages, correct identification and good handwriting. A score was given to each questionnaire based on the legibility and the completeness of enumeration. This score will be used to assess the quality of enumeration and supervision in order to select the best field staff for future censuses/surveys.
CSPro was used for data entry of all Large Scale Farm and community based questionnaires due to the relatively small number of questionnaires. It was also used to enter data from the 2,880 small holder questionnaires that were rejected by the ICR extraction application.
Data Structure Formatting
A program was developed in Visual Basic to automatically alter the structure of the output from the scanning/extraction process in order to harmonise it with the manually entered data. The program automatically checked and changed the number of digits for each variable, the record type code, the number of questionnaires in the village and the consistency of the Village ID Code, and saved the data of each village in a file named after the village code.
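The harmonisation step can be illustrated with a short sketch. The original program was written in Visual Basic and is not reproduced here; this is a Python illustration of the same idea under stated assumptions. The field names and digit widths are hypothetical, since the actual record layout is not given in the text.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical field widths; the real census record layout is not documented here.
FIELD_WIDTHS = {"village_id": 8, "household_no": 3, "record_type": 2}


def pad_record(record, widths=FIELD_WIDTHS):
    """Zero-pad numeric codes so each variable has its expected number of digits."""
    fixed = dict(record)
    for field, width in widths.items():
        value = str(record[field]).strip()
        if not value.isdigit() or len(value) > width:
            raise ValueError(f"bad {field!r} value: {value!r}")
        fixed[field] = value.zfill(width)
    return fixed


def split_by_village(records, out_dir):
    """Write one CSV file per village, named after the padded village code.

    Returns the number of questionnaires written for each village.
    """
    groups = defaultdict(list)
    for record in records:
        fixed = pad_record(record)
        groups[fixed["village_id"]].append(fixed)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for village_id, rows in groups.items():
        target = out_dir / f"{village_id}.csv"
        with open(target, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return {village_id: len(rows) for village_id, rows in groups.items()}
```

The returned counts can be compared with the expected number of questionnaires per village, which is one of the consistency checks the text describes.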
Batch Validation
A batch validation program was developed in order to identify inconsistencies within a questionnaire. This is in addition to the interactive validation during the ICR extraction process. The procedures varied from simple range checking within each variable to the more complex checking between variables. It took six months to screen, edit and validate the data from the smallholder questionnaires. After the long process of data cleaning, tabulations were prepared based on a pre-designed tabulation plan.
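The two kinds of check described here, range checks within a variable and consistency checks between variables, can be sketched as follows. This is an illustrative example only: the variable names, limits and the cross-variable rule are hypothetical and are not taken from the census validation program.

```python
# Hypothetical variables and limits, chosen for illustration only.
RANGE_RULES = {
    "household_size": (1, 50),
    "members_farming": (0, 50),
    "area_planted_acres": (0.0, 500.0),
}


def range_errors(record, rules=RANGE_RULES):
    """Simple range check applied to each variable on its own."""
    errors = []
    for field, (low, high) in rules.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} outside [{low}, {high}]")
    return errors


def cross_errors(record):
    """Consistency check between variables (hypothetical rule)."""
    errors = []
    size = record.get("household_size")
    farming = record.get("members_farming")
    if size is not None and farming is not None and farming > size:
        errors.append("members_farming exceeds household_size")
    return errors


def validate_batch(records):
    """Return {record position: [error messages]} for records that fail a check."""
    report = {}
    for position, record in enumerate(records):
        errors = range_errors(record) + cross_errors(record)
        if errors:
            report[position] = errors
    return report
```

Keeping the rules in a table separate from the checking code makes it easier to extend the set of variables without touching the validation logic.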
Tabulations
Statistical Package for Social Sciences (SPSS) was used to produce the Census tabulations and Microsoft Excel was used to organize the tables and compute additional indicators. Excel was also used to produce charts while ArcView and Freehand were used for the maps.
Analysis and Report Preparation
The analysis in this report focuses on regional comparisons, time series and national production estimates. Microsoft Excel was used to produce charts; ArcView and Freehand were used for maps, whereas Microsoft Word was used to compile the report.
Data Quality
A great deal of emphasis was placed on data quality throughout the whole exercise from planning, questionnaire design, training, supervision, data entry, validation and cleaning/editing. As a result of this, it is believed that the census is highly accurate and representative of what was experienced at field level during the Census year. With very few exceptions, the variables in the questionnaire are within the norms for Tanzania and they follow expected time series trends when compared to historical data. Standard Errors and Coefficients of Variation for the main variables are presented in the Technical Report (Volume I).
Sampling errors are given on pages 21 to 22 of the Technical Report for the Agriculture Sample Census Survey 2002-2003.
We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.
The proposed network is built starting from two distinct data sources:
- the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets)
- the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).
By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors will compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...
# Data cleaning and enrichment through data integration: networking the Italian academia
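As a rough illustration of how a co-authorship edge list of this kind can be derived, the sketch below links two matched authors whenever they appear together on a publication and counts how many publications they share. The source sentence describing the edge rule is truncated, so this reading of the rule is an assumption; the function name and the input shapes are hypothetical and do not reflect the authors' actual pipeline.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers, known_authors):
    """Build weighted co-authorship edges.

    papers: iterable of author-ID lists, one list per publication.
    known_authors: set of author IDs that survived the name-matching step.
    Returns a Counter mapping (author_a, author_b) to shared publication count,
    with each pair stored in sorted order so it is counted once.
    """
    edges = Counter()
    for authors in papers:
        kept = sorted(set(authors) & known_authors)
        for pair in combinations(kept, 2):
            edges[pair] += 1
    return edges
```

Authors with no match in `known_authors` are dropped before pairs are formed, mirroring the discard step described in the text.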
https://doi.org/10.5061/dryad.wpzgmsbwj
Manuscript published in Scientific Data with DOI .
This repository contains two main data files:
- edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
- Coauthorship_Network_AGG.graphml, the full network in GraphML format.
along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):
- University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
- Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
The `Coauthorship_Networ...
Attribution-ShareAlike 2.0 (CC BY-SA 2.0) https://creativecommons.org/licenses/by-sa/2.0/
License information was derived automatically
These are Microsoft Excel files which contain the data used to generate the plots in the paper. The files are labelled by Figure number: a complete description is given in the paper.
Data for the manuscript "Functional morphology and efficiency of the antenna cleaner in Camponotus rufifemur ants". The Excel file includes 3 data sheets, one for each experiment. The corresponding figures from the manuscript are mentioned above the actual data.
Manuscript data.xlsx
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the results of a survey about the use of open government data applied to public agents working in public institutions in Brazil. It has two sets, one with questionnaire responses and metadata and the second with a coding table with interview extracts: 1) In the first dataset, each row holds a response to a questionnaire about the public agent's perceptions of the use and reuse of open government data in Brazilian public institutions. Columns store the questionnaire questions. Data were collected between 8 June and 13 July 2021, and this sample is composed of responses from 40 federal, state, and municipal public administrators. Thus, this dataset contains 40 rows and 158 columns. Data were collected on the LimeSurvey platform, where it was screened for missing values and incomplete responses. After cleaning, data were exported to Excel in tabular format. Questionnaire responses are provided in two files ResultsSurvey_OGDUseBRPubInstitutions_DataSet_PT and ResultsSurvey_OGDUseBRPubInstitutions_DataSet_EN. They contain the same information in Portuguese and English. 2) The second dataset records the code table of the interviews about the benefits, barriers, enablers, and drivers of open government data (OGD) use in Brazilian public institutions. A questionnaire applied to public agents working in Brazilian public institutions was followed up by interviews to broaden an understanding of the use of OGD. Nine interviews were conducted between May 17-31, 2022. This dataset represents the perspective of these public agents. The dataset contains 97 lines and six columns. Each row of the dataset lists the factor code used in the questionnaire, the factor descriptions in Portuguese and English, the interviewee code, the transcription extract of an interviewee narration collected in Portuguese, and the English translation. After collection in Portuguese, interviews were automatically transcribed using the NVivo Transcription software. 
Then, they were anonymized, and a human reviewed the transcriptions. Interviews were coded using NVivo and used the questionnaire factors to guide coding. Coded extracts were translated to English using Google and Microsoft translators. Then, translated extracts were revised by a human and were used for reporting. The coding table was exported to Excel. Interviews extracts are provided in one file, InterviewsExtracts_OGDUseBR_PublicInstitutions_Dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication materials for the manuscript "Skepticism in Science and Punitive Attitudes", published in the Journal of Criminal Justice. Note that the GSS repeated cross sections for 1972 to 2018 are too large to upload here, but they can be accessed from https://gss.norc.org/content/dam/gss/get-the-data/documents/spss/GSS_spss.zip
Included here are:
- A link to the repeated cross-sections data
- Each of the 3 wave panels (2006-2010; 2008-2012; 2010-2014)
- Replication R script for the repeated cross sections cleaning and analysis
- Replication R script for the panel data cleaning and analysis
- An Excel spreadsheet with Uniform Crime Report data to merge to the cross sections
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract [Related publication]:
Cleaning symbiosis is critical for maintaining healthy biological communities in tropical marine ecosystems. However, potential negative impacts of mutualism, such as the transmission of pathogens and parasites during cleaning interactions, have rarely been evaluated. Here, we investigated whether the dedicated bluestreak cleaner wrasse Labroides dimidiatus, is susceptible to, and can transmit generalist ectoparasites between client fish. In laboratory experiments, L. dimidiatus were exposed to infective stages of three generalist ectoparasite species with contrasting life-histories. Labroides dimidiatus were susceptible to infection by the gnathiid isopod, Gnathia aureamaculosa, but significantly less susceptible to the ciliate protozoan, Cryptocaryon irritans, and the monogenean flatworm, Neobenedenia girellae, compared to control host species (Coris batuensis or Lates calcarifer). The potential for parasite transmission from a client fish to the cleaner fish was simulated using experimentally transplanted mobile adult (i.e., egg-producing) monogenean flatworms on L. dimidiatus. Parasites remained attached to cleaners for an average of two days, during which parasite egg production continued, but was reduced compared to control fish. Over this timespan, a wild cleaner may engage in several thousand cleaning interactions, providing numerous opportunities for mobile parasites to exploit cleaners as vectors. Our study provides the first experimental evidence that L. dimidiatus exhibits resistance to infective stages of some parasites yet has the potential to temporarily transport adult parasites. We propose that some parasites that evade being eaten by cleaner fish could exploit cleaning interactions as a mechanism for transmission and spread.
Data methods:
In laboratory experiments, we first tested the susceptibility of L. dimidiatus to three generalist parasites with contrasting life-histories. To do so, we exposed 20 L. dimidiatus and 20 control individuals (Coris batuensis or Lates calcarifer) to infective stages of a gnathiid isopod (Gnathia aureamaculosa), a monogenean flatworm (Neobenedenia girellae) and a ciliate protozoan (Cryptocaryon irritans). We then tested whether adult N. girellae remained attached and produced viable eggs when transferred to the skin of live L. dimidiatus by manually transplanting adult N. girellae from a donor host to L. dimidiatus. Finally, we tested how long adult N. girellae could survive on L. dimidiatus after being manually transplanted. All data analyses were performed in R version 4.0.2 (R Core Team 2020).
The full methodology is available in the publication shown in the Related Publication link below.
Software/equipment used to create/collect the data: Excel version 2205
Software/equipment used to manipulate/analyse the data: RStudio 2021.09.0
https://www.skyquestt.com/privacy/
Global Cleaning Robot Market size was valued at USD 4.19 billion in 2022 and is poised to grow from USD 4.97 billion in 2023 to USD 12.81 billion by 2031, growing at a CAGR of 22.9% in the forecast period (2024-2031).
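The compound annual growth rate (CAGR) implied by two endpoint values can be checked directly with the standard formula, (end / start)^(1 / years) - 1. The sketch below applies it to the figures quoted above; the function names are illustrative, and the choice of 2023 to 2031 as the compounding window (8 years) is an assumption of this example.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value reached after compounding at a constant annual rate."""
    return start_value * (1 + rate) ** years


if __name__ == "__main__":
    implied = cagr(4.97, 12.81, 2031 - 2023)
    print(f"implied CAGR 2023-2031: {implied:.1%}")
    print(f"4.97 compounded at 22.9% for 8 years: {project(4.97, 0.229, 8):.2f}")
```

Under this reading, the endpoint values imply a rate of about 12.6% per year, while compounding USD 4.97 billion at 22.9% for eight years would give roughly USD 25.9 billion, so the quoted rate and the quoted endpoints are worth checking against the source report.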
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set supports the journal paper "Manipulating the consequences of tests: How Shanghai teens react to different consequences", published in Educational Research and Evaluation, v26 (n5-6), pp.221-251. The data were obtained to test the impact of different levels of consequence for taking a test on student test-taking effort. The data are part of the PhD project of Anran Zhao, supervised by Brown & Meissel. The data set is in MS Excel format. Sheet 1 provides an anonymous wide-format data set post-cleaning and missing value analysis of the data. Sheet 2 provides a description of each variable.
This project demonstrates the use of data cleaning techniques, Pivot Tables and charts in Excel to answer 3 main questions:
It includes 5 sheets:
You can download the Excel file with all formatting.
A Knowledge, Attitudes and Practices (KAP) survey was conducted in Ajuong Thok and Pamir Refugee Camps in October 2019 to determine the current Water, Sanitation and Hygiene (WASH) conditions as well as hygiene attitudes and practices within the households (HHs) surveyed. The assessment utilized a systematic random sampling method, and a total of 1,474 HHs (735 HHs in Ajuong Thok and 739 HHs in Pamir) were surveyed using mobile data collection (MDC) within a period of 21 days. Data was cleaned and analyzed in Excel. The summary of the results is presented in this report.
The findings show that the overall average number of liters of water per person per day was 23.4 in both Ajuong Thok and Pamir Camps, which was slightly higher than the recommended United Nations High Commissioner for Refugees (UNHCR) minimum standard of at least 20 liters of water available per person per day. This is a slight improvement from the 21 liters reported the previous year. The average HH size was six people. Women comprised 83% of the surveyed respondents and men 17%. Almost all the respondents were refugees, constituting 99.5% (n=1,466). The refugees were aware of the key health and hygiene practices, possibly as a result of routine health and hygiene messages delivered to them by Samaritan's Purse (SP) and other health partners. Most refugees had knowledge about keeping the water containers clean, washing hands during critical times, safe excreta disposal and disease prevention.
Ajuong Thok and Pamir Refugee Camps
Households
All households in Ajuong Thok and Pamir Refugee Camps
Sample survey data [ssd]
Households were selected using systematic random sampling. Enumerators systematically walked through the camp block by block, row by row, in such a way as to pass each HH. Within blocks, enumerators started at one corner, then systematically used the sampling interval as they walked up and down each of the rows throughout the block, covering every block in Ajuong Thok and Pamir.
In each location, the first HH sampled in a block was generated using an Excel tool customized by UNHCR which generated a Random Start and Sampling Interval.
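Systematic random sampling with a random start and a fixed sampling interval can be sketched in a few lines. This is an illustration of the general technique, not the UNHCR Excel tool; the function name is hypothetical, and the sketch assumes an integer interval computed from a known list of units.

```python
import random


def systematic_sample(units, sample_size, rng=None):
    """Select every k-th unit after a random start.

    Returns (sample, start, interval), where interval = len(units) // sample_size
    and start is drawn uniformly from 0 .. interval - 1.
    """
    if not 0 < sample_size <= len(units):
        raise ValueError("sample_size must be between 1 and len(units)")
    rng = rng or random.Random()
    interval = len(units) // sample_size
    start = rng.randrange(interval)
    sample = [units[start + step * interval] for step in range(sample_size)]
    return sample, start, interval
```

For example, drawing 50 households from a block list of 735 gives an interval of 14, so an enumerator would start at a random household among the first 14 and then visit every 14th household after it.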
Face-to-face [f2f]
The survey questionnaire used to collect the data consists of the following sections:
- Demographics
- Water collection and storage
- Drinking water hygiene
- Hygiene
- Sanitation
- Messaging
- Distribution (NFI)
- Diarrhea prevalence, knowledge and health seeking behaviour
- Menstrual hygiene
The data collected was uploaded to a server at the end of each day. IFormBuilder generated a Microsoft (MS) Excel spreadsheet dataset which was then cleaned and analyzed using MS Excel.
Given that SP is currently implementing a WASH program in Ajuong Thok and Pamir, the assessment data collected in these camps will not only serve as the endline for UNHCR 2018 programming but also as the baseline for 2019 programming.
Data was anonymized through decoding and local suppression.