CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
A Knowledge, Attitudes and Practices (KAP) survey was conducted in Ajuong Thok and Pamir Refugee Camps in October 2019 to determine the current Water, Sanitation and Hygiene (WASH) conditions as well as hygiene attitudes and practices within the households (HHs) surveyed. The assessment utilized a systematic random sampling method, and a total of 1,474 HHs (735 HHs in Ajuong Thok and 739 HHs in Pamir) were surveyed using mobile data collection (MDC) within a period of 21 days. Data was cleaned and analyzed in Excel. The summary of the results is presented in this report.
The findings show that the overall average number of liters of water per person per day was 23.4 in both Ajuong Thok and Pamir Camps, which was slightly higher than the United Nations High Commissioner for Refugees (UNHCR) recommended minimum standard of at least 20 liters of water available per person per day. This is a slight improvement from the 21 liters reported the previous year. The average HH size was six people. Women comprised 83% of the surveyed respondents and men 17%. Almost all the respondents were refugees, constituting 99.5% (n=1,466). The refugees were aware of the key health and hygiene practices, possibly as a result of routine health and hygiene messages delivered to them by Samaritan's Purse (SP) and other health partners. Most refugees had knowledge about keeping water containers clean, washing hands at critical times, safe excreta disposal and disease prevention.
Ajuong Thok and Pamir Refugee Camps
Households
All households in Ajuong Thok and Pamir Refugee Camps
Sample survey data [ssd]
Households were selected using systematic random sampling. Enumerators systematically walked through the camp block by block, row by row, in such a way as to pass each HH. Within blocks, enumerators started at one corner, then systematically used the sampling interval as they walked up and down each of the rows throughout the block, covering every block in Ajuong Thok and Pamir.
In each location, the first HH sampled in a block was generated using an Excel tool customized by UNHCR which generated a Random Start and Sampling Interval.
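The random start and sampling interval described above can be illustrated with a minimal Python sketch. This is not the UNHCR Excel tool; the function name `systematic_sample`, the household labels, and the block size of 100 are all hypothetical and chosen only for illustration.

```python
import random


def systematic_sample(units, sample_size, rng=None):
    """Select roughly `sample_size` units by stepping through `units` at a fixed interval."""
    rng = rng or random.Random()
    if sample_size <= 0 or not units:
        return []
    # Sampling interval: take every k-th household along the walking route.
    interval = max(1, len(units) // sample_size)
    # Random start: a position chosen at random within the first interval.
    start = rng.randrange(interval)
    return units[start::interval]


# Hypothetical block of 100 households, listed in walking order.
households = [f"HH-{i:03d}" for i in range(1, 101)]
sample = systematic_sample(households, 10, random.Random(42))
print(sample)
```

Passing a seeded `random.Random` makes the selection reproducible, which can help when a sample needs to be audited later.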
Face-to-face [f2f]
The survey questionnaire used to collect the data consists of the following sections:
- Demographics
- Water collection and storage
- Drinking water hygiene
- Hygiene
- Sanitation
- Messaging
- Distribution (NFI)
- Diarrhea prevalence, knowledge and health seeking behaviour
- Menstrual hygiene
The data collected was uploaded to a server at the end of each day. IFormBuilder generated a Microsoft (MS) Excel spreadsheet dataset which was then cleaned and analyzed using MS Excel.
Given that SP is currently implementing a WASH program in Ajuong Thok and Pamir, the assessment data collected in these camps will not only serve as the endline for UNHCR 2018 programming but also as the baseline for 2019 programming.
Data was anonymized through decoding and local suppression.
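As a rough illustration of local suppression, the sketch below blanks out quasi-identifier values in records whose combination of those values is rare. It is a simplified stand-in, not the procedure actually applied to this dataset: the function name `locally_suppress`, the threshold `k`, the field names, and the toy records are all assumptions, and dedicated disclosure-control tools typically suppress fewer values than this.

```python
from collections import Counter


def locally_suppress(records, quasi_identifiers, k=2):
    """Return copies of `records` with quasi-identifiers set to None when their
    combination of values occurs fewer than `k` times."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    cleaned = []
    for record, key in zip(records, keys):
        row = dict(record)  # copy so the input records stay untouched
        if counts[key] < k:
            for q in quasi_identifiers:
                row[q] = None
        cleaned.append(row)
    return cleaned


# Toy records with hypothetical field names and values.
records = [
    {"camp": "Ajuong Thok", "hh_size": 6, "liters": 22},
    {"camp": "Ajuong Thok", "hh_size": 6, "liters": 25},
    {"camp": "Pamir", "hh_size": 14, "liters": 19},
]
safe = locally_suppress(records, ["camp", "hh_size"], k=2)
print(safe)
```

Only the identifying fields of the unique record are blanked; the analytical variable (`liters` here) is kept so the record still contributes to summaries.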
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Excel spreadsheets providing the data plotted in the paper, and a Methods file. The Methods file provides a detailed description of the contents of the dataset.
About Datasets:
- Domain: Marketing
- Project: Amazon Product Review Sentiment Analysis
- Datasets: Reviews.csv
- Dataset Type: Excel Data
- Dataset Size: 56L+ records
KPIs:
1. Distribution of Amazon Product Ratings
2. How most people rated the products they bought from Amazon
3. Total of all sentiment scores
Process:
1. Understanding the problem
2. Data Collection
3. Data Cleaning
4. Exploring and analyzing the data
5. Interpreting the results
Tools and identifiers used: pandas, seaborn, matplotlib, nltk.sentiment.vader (SentimentIntensityAnalyzer), value_counts(), custom_colors, figsize, pie, sentiment_score
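The first KPI, the distribution of product ratings, amounts to counting how often each star value occurs. The sketch below uses only the standard library (`collections.Counter`) as a stand-in for the pandas `value_counts()` call named in the listing; the function name `rating_distribution` and the ratings are made up for illustration and are not taken from Reviews.csv.

```python
from collections import Counter


def rating_distribution(ratings):
    """Return the share of each rating value, keyed by rating in ascending order."""
    if not ratings:
        return {}
    counts = Counter(ratings)
    total = sum(counts.values())
    return {star: counts[star] / total for star in sorted(counts)}


# Made-up star ratings, for illustration only.
ratings = [5, 4, 5, 3, 5, 1, 4, 5]
shares = rating_distribution(ratings)
for star, share in shares.items():
    print(f"{star} stars: {share:.1%}")
```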
Transportation-disadvantaged populations often face significant challenges in meeting their basic travel needs. Microtransit, a technology-enabled transit mobility solution, can potentially address these issues by providing on-demand, affordable, and flexible services. However, the extent to which microtransit serves underserved populations and the factors influencing their adoption remain unclear. This research focuses on SmaRT Ride, a microtransit pilot program operated by the Sacramento Regional Transit (SacRT) in the Sacramento area. From early February to the end of May 2024, online and intercept surveys were conducted among underserved populations to understand their travel behavior. After data cleaning, 180 valid responses were collected. Descriptive analysis of the data shows that SmaRT Ride has significantly improved transportation access for these communities. Furthermore, logistic regressions were employed to explore factors influencing the willingness to adopt microtransit a...
Reaching underserved communities, especially for surveys, is challenging due to socioeconomic and language barriers. To improve our sample and gather more responses to our survey, we used contacts from existing datasets, worked with local food banks, and identified transit stops and other intercept survey site recommendations from stakeholders such as SacRT. Additionally, multiple recruitment methods (online, in-person, and telephone/text message) were used to accommodate the different preferences and access levels of underserved individuals.
Title: Dataset of underserved microtransit users in the Sacramento Area, California
Access this dataset on Dryad: https://doi.org/10.5061/dryad.r7sqv9smh
Dataset contents:
- Variables/Features: 273 variables. Detailed descriptions of these variables are provided in the attached "Variable description" Excel file.
- Number of Entries: 180 cases
- Time Frame: Data was collected from early February to the end of May 2024
- Format: SPSS (.sav)
Description: This dataset contains underserved populations' opinions, daily travel patterns, and use of SmaRT Ride, a microtransit pilot program operated by the Sacramento Regional Transit (SacRT) in the Sacramento area. Sampling methods used to reach underserved communities included obtaining email addresses from existing datasets for online surveys and conducting intercept surveys at food distribution sites associated with food banks, busy transit stops, and other locations recommended by stakeholders such as SacRT. Rea...
The General Household Survey-Panel (GHS-Panel) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program. The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, interinstitutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of approximately 5,000 households, which are also representative of the six geopolitical zones. The 2018/19 survey is the fourth round, with prior rounds conducted in 2010/11, 2012/13, and 2015/16. GHS-Panel households were visited twice: first after the planting season (post-planting) between July and September 2018, and second after the harvest season (post-harvest) between January and February 2019.
National: the survey covered all 36 states and the Federal Capital Territory (FCT).
Households, Individuals, Agricultural plots, Communities
Sample survey data [ssd]
The original GHS-Panel sample consisted of 5,000 households across 500 enumeration areas (EAs) and was designed to be representative at the national level as well as at the zonal level. The complete sampling information for the GHS-Panel is described in the Basic Information Document for GHS-Panel 2010/2011. However, after nearly a decade of visiting the same households, a partial refresh of the GHS-Panel sample was implemented in Wave 4. For the partial refresh of the sample, a new set of 360 EAs was randomly selected, 60 EAs per zone. The refresh EAs were selected from the same sampling frame as the original GHS-Panel sample in 2010 (the "master frame").
A listing of all households was conducted in the 360 EAs and 10 households were randomly selected in each EA, resulting in a total refresh sample of approximately 3,600 households. In addition to these 3,600 refresh households, a subsample of the original 5,000 GHS-Panel households from 2010 was selected to be included in the new sample. This "long panel" sample was designed to be nationally representative to enable continued longitudinal analysis for the sample going back to 2010. The long panel sample consisted of 159 EAs systematically selected across the 6 geopolitical Zones. The systematic selection ensured that the distribution of EAs across the 6 Zones (and urban and rural areas within them) was proportional to the original GHS-Panel sample.
Interviewers attempted to interview all households that originally resided in the 159 EAs and were successfully interviewed in the previous visit in 2016. This includes households that had moved away from their original location in 2010. In all, interviewers attempted to interview 1,507 households from the original panel sample. The combined sample of refresh and long panel EAs consisted of 519 EAs. The total number of households that were successfully interviewed in both visits was 4,976.
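The sample counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the description; the variable names are illustrative, and the final ratio is only a rough indication because the refresh sample is given as approximately 3,600 households.

```python
# Figures taken from the sampling description above.
ZONES = 6
REFRESH_EAS_PER_ZONE = 60
HOUSEHOLDS_PER_REFRESH_EA = 10
LONG_PANEL_EAS = 159
LONG_PANEL_HOUSEHOLDS_ATTEMPTED = 1507
INTERVIEWED_BOTH_VISITS = 4976

refresh_eas = ZONES * REFRESH_EAS_PER_ZONE                      # refresh EAs
refresh_households = refresh_eas * HOUSEHOLDS_PER_REFRESH_EA    # refresh households
combined_eas = refresh_eas + LONG_PANEL_EAS                     # refresh + long panel EAs

# Rough share of targeted households interviewed in both visits
# (approximate, since the refresh household count is itself approximate).
targeted_households = refresh_households + LONG_PANEL_HOUSEHOLDS_ATTEMPTED
share_interviewed = INTERVIEWED_BOTH_VISITS / targeted_households

print(refresh_eas, refresh_households, combined_eas)
print(f"{share_interviewed:.1%}")
```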
While the combined sample generally maintains both national and Zonal representativeness of the original GHS-Panel sample, the security situation in the North East of Nigeria prevented full coverage of the Zone. Due to security concerns, rural areas of Borno state were fully excluded from the refresh sample and some inaccessible urban areas were also excluded. Security concerns also prevented interviewers from visiting some communities in other parts of the country where conflict events were occurring. Refresh EAs that could not be accessed were replaced with another randomly selected EA in the Zone so as not to compromise the sample size. As a result, the combined sample is representative of areas of Nigeria that were accessible during 2018/19. The sample will not reflect conditions in areas that were undergoing conflict during that period. This compromise was necessary to ensure the safety of interviewers.
Computer Assisted Personal Interview [capi]
CAPI: For the first time in the GHS-Panel, the Wave 4 exercise was conducted using Computer Assisted Personal Interview (CAPI) techniques. All the questionnaires (household, agriculture and community) were implemented in both the post-planting and post-harvest visits of Wave 4 using the CAPI software Survey Solutions. The Survey Solutions software was developed and maintained by the Survey Unit within the Development Economics Data Group (DECDG) at the World Bank. Each enumerator was given a tablet, which they used to conduct the interviews. Overall, implementation of the survey using Survey Solutions CAPI was highly successful, as it allowed for timely availability of the data from completed interviews.
DATA COMMUNICATION SYSTEM: The data communication system used in Wave 4 was highly automated. Each field team was given a mobile modem to allow for internet connectivity and daily synchronization of their tablets. This ensured that the head office in Abuja had access to the data in real time. Once an interview was completed and uploaded to the server, the data were first reviewed by the Data Editors. The data were also downloaded from the server, and a Stata do-file was run on the downloaded data to check for additional errors that were not captured by the Survey Solutions application. An Excel error file was generated each time the Stata do-file was run on the raw dataset. The information contained in the Excel error files was communicated back to the respective field interviewers for action. This was done on a daily basis throughout the duration of the survey, in both the post-planting and post-harvest visits.
DATA CLEANING: The data cleaning process was done in three main stages. The first stage was to ensure proper quality control during the fieldwork. This was achieved in part by incorporating validation and consistency checks into the Survey Solutions application used for the data collection; these checks were designed to highlight many of the errors that occurred during the fieldwork. The second stage of cleaning involved the use of Data Editors and Data Assistants (Headquarters in Survey Solutions). As indicated above, once an interview was completed and uploaded to the server, the Data Editors reviewed the completed interview for inconsistencies and extreme values. Depending on the outcome, they could either approve or reject the case. If rejected, the case went back to the respective interviewer's tablet upon synchronization. Special care was taken to see that the households included in the data matched the selected sample, and where there were differences, these were properly assessed and documented.
The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. Additional errors observed were compiled into error reports that were regularly sent to the teams. These errors were then corrected based on re-visits to the household on the instruction of the supervisor. The data that had gone through this first stage of cleaning was then approved by the Data Editor. After the Data Editor's approval of the interview on the Survey Solutions server, Headquarters also reviewed it and, depending on the outcome, could either reject or approve it. The third stage of cleaning involved a comprehensive review of the final raw data following the first and second stage cleaning. Every variable was examined individually for (1) consistency with other sections and variables, (2) out-of-range responses, and (3) outliers. However, special care was taken to avoid making strong assumptions when resolving potential errors. Some minor errors remain in the data where the diagnosis and/or solution were unclear to the data cleaning team.
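The out-of-range and outlier checks in the third cleaning stage can be sketched as follows. This is a hedged Python illustration, not the survey team's Stata do-file: the function names, the plausible range of 1 to 30, the 1.5 x IQR rule, and the household-size values are all assumptions. In keeping with the caution described above, the functions only flag row positions for review and do not change any values.

```python
import statistics


def out_of_range(values, low, high):
    """Return positions of non-missing values that fall outside [low, high]."""
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]


def iqr_outliers(values, k=1.5):
    """Return positions of non-missing values beyond k * IQR from the quartiles."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values)
            if v is not None and not (lower <= v <= upper)]


# Hypothetical household-size responses; 48 and 0 look suspicious.
household_sizes = [4, 6, 5, 7, 6, 5, 48, 0, 6]

range_flags = out_of_range(household_sizes, low=1, high=30)
outlier_flags = iqr_outliers(household_sizes)
print("Out of range at rows:", range_flags)
print("Outliers at rows:", outlier_flags)
```

Flagged rows would then be checked against other sections or sent back to the field team, rather than being overwritten automatically.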
I developed an agent-based model (ABM), a social simulation method, to explain protest mobilisation through national identity polarisation and how social media and individual social networks contribute to this process. Using the simulation code, written in NetLogo, I collected data from multiple simulation runs across various parameter combinations. I then cleaned and analysed these data using RStudio. There are three types of data files: NetLogo files that contain the simulation code for my ABM; RStudio files that contain the code for the data cleaning and data analyses I carried out on the simulation outputs; and CSV files containing the simulation results collected for each of the parameter combinations.
Data content: Foreign economy and trade: total import and export of goods (1952-2019), and foreign economy and trade: total import and export by trade (1981-2019). Data sources and processing methods: The original data on China's foreign trade and investment from 2015 to 2019 (including the third pole) were obtained from the official website of the World Bank and sina.com, and the data set of China's foreign trade and investment (including the third pole) from 1952 to 2019 was obtained through data sorting, screening and cleaning. The data cover 1952 to 2019 and are in Microsoft Excel (xlsx) format.