CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.
DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. AmbiguousMNIST contains additional digits with varying degrees of ambiguity. The AmbiguousMNIST test set likewise contains 60k ambiguous samples.
Additional Guidance
DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. The current AmbiguousMNIST contains 6k unique samples with 10 labels each; this multi-label dataset is flattened to 60k samples. The assumption is that ambiguous samples have multiple "valid" labels because they are ambiguous. MNIST samples are intentionally undersampled (in comparison), which benefits AL acquisition functions that can select unambiguous samples.
• Pick your initial training samples (for warm-starting Active Learning) from the MNIST half of DirtyMNIST to avoid starting training with potentially very ambiguous samples, which might add a lot of variance to your experiments.
• Pick your validation set from the MNIST half as well, for the same reason.
• Make sure that your batch acquisition size is >= 10 (probably), given that there are 10 multi-labels per sample in AmbiguousMNIST.
• By default, Gaussian noise with stddev 0.05 is added to each sample to prevent acquisition functions (in Active Learning) from cheating by discarding "duplicates".
• If you want to split AmbiguousMNIST into subsets (or DirtyMNIST within its second, ambiguous half), make sure to split at multiples of 10 to avoid cutting through a flattened multi-label sample.
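A minimal index-bookkeeping sketch of the guidance above, in Python/PyTorch, assuming the concatenated training set places the 60k MNIST samples at indices 0-59,999 and the 60k flattened AmbiguousMNIST samples at indices 60,000-119,999; the variable `dirty_mnist_train` is a hypothetical dataset object.

```python
# Sketch only: warm-start AL from the MNIST half and split the ambiguous half
# on multiples of 10. `dirty_mnist_train` is a hypothetical dataset object
# following the index layout described above.
import torch
from torch.utils.data import Subset

MNIST_HALF = range(0, 60_000)            # unambiguous samples
AMBIGUOUS_HALF = range(60_000, 120_000)  # 6k images flattened to 60k (10 labels each)

# Warm-start and validation pools drawn only from the MNIST half.
perm = torch.randperm(len(MNIST_HALF)).tolist()
warm_start_idx = perm[:20]
validation_idx = perm[20:5_020]

# Splits inside the ambiguous half should start on multiples of 10 so that a
# flattened multi-label sample is never cut in two.
first_1000_groups = [60_000 + g * 10 + k for g in range(1_000) for k in range(10)]

# warm_start_set = Subset(dirty_mnist_train, warm_start_idx)
# validation_set = Subset(dirty_mnist_train, validation_idx)
print(len(warm_start_idx), len(validation_idx), len(first_1000_groups))
```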
An unclean employee dataset can contain various types of errors, inconsistencies, and missing values that affect the accuracy and reliability of the data. Some common issues in unclean datasets include duplicate records, incomplete data, incorrect data types, spelling mistakes, inconsistent formatting, and outliers.
For example, there might be multiple entries for the same employee with slightly different spellings of their name or job title. Additionally, some rows may have missing data for certain columns such as bonus or exit date, which can make it difficult to analyze trends or make accurate predictions. Inconsistent formatting of data, such as using different date formats or capitalization conventions, can also cause confusion and errors when processing the data.
Furthermore, there may be outliers in the data, such as employees with extremely high or low salaries or ages, which can distort statistical analyses and lead to inaccurate conclusions.
Overall, an unclean employee dataset can pose significant challenges for data analysis and decision-making, highlighting the importance of cleaning and preparing data before analyzing it.
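As a hedged illustration, a short pandas sketch of typical clean-up steps for such a file follows; the file name and the exact column headers ("Full Name", "Job Title", "Salary", "Bonus", "Exit Date") are hypothetical placeholders.

```python
# Sketch of common clean-up steps: normalise text, drop duplicates, parse
# dates, count missing values, and flag salary outliers. Column names are
# hypothetical; adjust to the actual headers.
import pandas as pd

df = pd.read_csv("employee_data.csv")

# Normalise free-text columns so near-duplicate spellings collapse together.
for col in ["Full Name", "Job Title"]:
    df[col] = df[col].str.strip().str.title()

# Drop exact duplicate rows and parse dates into one consistent format.
df = df.drop_duplicates()
df["Exit Date"] = pd.to_datetime(df["Exit Date"], errors="coerce")

# Report missing values (e.g. Bonus or Exit Date) per column.
print(df.isna().sum())

# Flag salary outliers with a simple interquartile-range rule.
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_outlier"] = ~df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```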
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Nothing ever becomes real till it is experienced.
-John Keats
While we don't know the context in which John Keats said this, we are sure about its implication in data science. While you may have enjoyed and gained exposure to real-world problems in this challenge, here is another opportunity to get your hands dirty with this practice problem.
Problem Statement:
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. Hence, these will need to be treated accordingly.
Data:
We have 14,204 samples in the data set.
Variable Description
Item Identifier: A code provided for the item of sale
Item Weight: Weight of the item
Item Fat Content: A categorical column of how much fat is present in the item: ‘Low Fat’, ‘Regular’, ‘low fat’, ‘LF’, ‘reg’
Item Visibility: Numeric value for how visible the item is
Item Type: What category does the item belong to: ‘Dairy’, ‘Soft Drinks’, ‘Meat’, ‘Fruits and Vegetables’, ‘Household’, ‘Baking Goods’, ‘Snack Foods’, ‘Frozen Foods’, ‘Breakfast’, ’Health and Hygiene’, ‘Hard Drinks’, ‘Canned’, ‘Breads’, ‘Starchy Foods’, ‘Others’, ‘Seafood’.
Item MRP: The maximum retail price (MRP) of the item
Outlet Identifier: The outlet in which the item was sold. This is a categorical column
Outlet Establishment Year: The year in which the outlet was established
Outlet Size: A categorical column describing the size of the outlet: ‘Medium’, ‘High’, ‘Small’.
Outlet Location Type: A categorical column to describe the location of the outlet: ‘Tier 1’, ‘Tier 2’, ‘Tier 3’
Outlet Type: Categorical column for type of outlet: ‘Supermarket Type1’, ‘Supermarket Type2’, ‘Supermarket Type3’, ‘Grocery Store’
Item Outlet Sales: The number of sales for an item.
Evaluation Metric:
We will use the Root Mean Square Error (RMSE) value to judge your response.
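For reference, RMSE can be computed in a few lines of Python; the example numbers below are made up and only illustrate the call.

```python
# Reference implementation of the evaluation metric.
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([2097.3, 732.4, 994.7], [2100.0, 700.0, 1010.0]))
```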
Surface-water samples were collected, processed, and analyzed for organics, estrogen equivalents, and fecal indicator bacteria. Filtered organic samples were sent to the National Water Quality Laboratory in Denver, Colorado. Unfiltered estrogen equivalent samples were sent to the Organic Geochemistry Research Lab in Lawrence, Kansas, for extraction, after which they were sent to the National Fish Health Research Laboratory in Leetown, West Virginia. Bacteria samples were processed at the Central-Midwest Water Science Center Iowa City, Iowa, office. Staff collected field parameters in-situ.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Muddy Creek township population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Muddy Creek township. The dataset can be utilized to understand the population distribution of Muddy Creek township by age. For example, using this dataset, we can identify the largest age group in Muddy Creek township.
Key observations
The largest age group in Muddy Creek Township, Pennsylvania was the 15 to 19 years group, with a population of 193 (9.08%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Muddy Creek Township, Pennsylvania was the 80 to 84 years group, with a population of 11 (0.52%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Muddy Creek township Population by Age. You can refer to the same here.
Four samples from the Dirty Shame Rockshelter (35ML65) in southeast Oregon were floated to recover macrofloral remains. This site is a large shelter (approximately 60 meters long) and was excavated as a part of the 2010 University of Oregon Archaeological Field School with support from Dianne Pritchard, Vale District BLM Archaeologist. Samples reflect fill from a pole and thatch structure (Feature 1), as well as sediments from levels within the shelter. Radiocarbon dates of 1140 ± 95 BP and 1175 ± 70 BP were previously obtained from Feature 1. A basket fragment from Level 19 of Unit 1 yielded a radiocarbon date of 2685 ± 20 BP, while a date of 2980 ± 20 BP was returned for a basket fragment from Level 22 of Unit 2. These dates suggest multiple occupations in the shelter. Macrofloral analysis was used to provide information concerning plant resources utilized by the shelter occupants.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.
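A small Python sketch of how the '#' convention can be used to separate transcriptions with and without crossed-out content; the root folder path is a hypothetical local download location.

```python
# Sketch: index which transcription files contain crossed-out content
# (marked with '#'). The root folder name is hypothetical.
from pathlib import Path

root = Path("handwritten-dataset")
with_crossouts, without_crossouts = [], []

for txt in sorted(root.rglob("*.txt")):
    text = txt.read_text(encoding="utf-8", errors="ignore")
    (with_crossouts if "#" in text else without_crossouts).append(str(txt))

print(len(with_crossouts), "transcriptions contain crossed-out content")
print(len(without_crossouts), "transcriptions do not")
```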
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called this the essay-based dataset. It was collected from 250 high school students, who were given 30 minutes to think about the topic and write.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called this the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with their classmates'. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class notes compared to the summary-based and academic-based collections.
In all four exercises, we did not impose any rules on the writers regarding, for example, spacing or pen usage. We asked them to cross out text that seemed inappropriate. Although writers usually made corrections in a second read, we also gave an extra 5 minutes for corrections.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains data files for 3 activities described in the accompanying PPT slides:
1. An Excel spreadsheet for analysing gain scores in a 2-group, 2-time data array. This activity requires access to https://campbellcollaboration.org/research-resources/effect-size-calculator.html to calculate effect size.
2. An AMOS path model and SPSS data set for an autoregressive, bivariate path model with cross-lagging. This activity is related to the following article: Brown, G. T. L., & Marshall, J. C. (2012). The impact of training students how to write introductions for academic essays: An exploratory, longitudinal study. Assessment & Evaluation in Higher Education, 37(6), 653-670. doi:10.1080/02602938.2011.563277
3. An AMOS latent curve model and SPSS data set for a 3-time latent factor model with an interaction mixed model that uses GPA as a predictor of the LCM start and slope (change) factors. This activity makes use of data reported previously and a published data analysis case: Peterson, E. R., Brown, G. T. L., & Jun, M. C. (2015). Achievement emotions in higher education: A diary study exploring emotions across an assessment event. Contemporary Educational Psychology, 42, 82-96. doi:10.1016/j.cedpsych.2015.05.002 and Brown, G. T. L., & Peterson, E. R. (2018). Evaluating repeated diary study responses: Latent curve modeling. In SAGE Research Methods Cases Part 2. Retrieved from http://methods.sagepub.com/case/evaluating-repeated-diary-study-responses-latent-curve-modeling doi:10.4135/9781526431592
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Agroecosystem management influences ecological interactions that underpin ecosystem services. In human-centered systems, people’s values and preferences influence management decisions. For example, aesthetic preferences for ‘tidy’ agroecosystems may remove vegetation complexity with potential negative impacts on beneficial associated biodiversity and ecosystem function. This may produce trade-offs in aesthetic- versus production-based management for ecosystem service provision. Yet, it is unclear how such preferences influence the ecology of small-scale urban agroecosystems, where aesthetic preferences for ‘tidiness’ are prominent among some gardener demographics. We used urban community gardens as a model system to experimentally test how aesthetic preferences for a ‘tidy garden’ versus a ‘messy garden’ influence insect pests, natural enemies, and pest control services. We manipulated gardens by mimicking a popular ‘tidy’ management practice – woodchip mulching – on the one hand, and simulating ‘messy’ gardens by adding ‘weedy’ plants to pathways on the other hand. Then, we measured for differences in natural enemy biodiversity (abundance, richness, community composition), and sentinel pest removal as a result of the tidy/messy manipulation. In addition, we measured vegetation and ground cover features of the garden system as measures of practices already in place. The tidy/messy manipulation did not significantly alter natural enemy or herbivore abundance within garden plots. The manipulation did, however, produce different compositions of natural enemy communities before and after the manipulation. Furthermore, the manipulation did affect short term gains and losses in predation services: the messy manipulation immediately lowered aphid pest removal compared to the tidy manipulation, while mulch already present in the system lowered Lepidoptera egg removal. Aesthetic preferences for ‘tidy’ green spaces often dominate urban landscapes. Yet, in urban food production systems, such aesthetic values and management preferences may create a fundamental tension in the provision of ecosystem services that support sustainable urban agriculture. Though human preferences may be hard to change, we suggest that gardeners allow some ‘messiness’ in their garden plots as a “lazy gardener” approach may promote particular natural enemy assemblages and may have no downsides to natural predation services.
A large spill of wastewater from oil and gas operations was discovered adjacent to Blacktail Creek near Williston, North Dakota in January 2015. To determine the effects of this spill on streambed microbial communities over time, bed sediment samples were taken from Blacktail Creek upstream, adjacent to, and at several locations downstream from the spill site. Blacktail Creek is a tributary of the Little Muddy River, and additional samples were taken upstream and downstream from the confluence of Blacktail Creek and the Little Muddy River. Samples were collected in February 2015, June 2015, June 2016, and June 2017. DNA was extracted from these sediments, and sequencing of the 16S ribosomal RNA gene was performed to enable analysis of the microbial community structure. Raw sequence data was processed, and taxonomy was assigned based on the Silva 132 database (Yilmaz et al, 2014) using the MOTHUR software package (Schloss et al, 2009). Raw sequence data are available from GenBank at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA666160.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Muddy by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Muddy. The dataset can be utilized to understand the population distribution of Muddy by gender and age. For example, using this dataset, we can identify the largest age group for both Men and Women in Muddy. Additionally, it can be used to see how the gender ratio changes from birth to the senior-most age group, and how the male-to-female ratio varies across each age group, for Muddy.
Key observations
Largest age group (population): Male # 65-69 years (6) | Female # 55-59 years (13). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Muddy Population by Gender. You can refer to the same here.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This model archive summary documents the suspended-sediment concentration (SSC) model developed to estimate 15-minute SSC at Muddy Creek above Paonia Reservoir, U.S. Geological Survey (USGS) site number 385903107210800. The methods used follow USGS guidance as referenced in relevant Office of Surface Water Technical Memorandum (TM) 2016.07 and Office of Water Quality TM 2016.10, and USGS Techniques and Methods, book 3, chap. C5 (Landers and others, 2016). A total of 438 suspended-sediment samples were collected during the calibration period. Forty-one of these samples (22 equal-width-interval [EWI] samples and 19 single-point pump samples) were used in the model calibration dataset. These 41 samples were collected over the range of observed streamflow, Sediment Corrected Backscatter (SCB), and Sediment Attenuation Coefficient (SAC) conditions. Samples used in calibration were plotted on duration curve plots for streamflow from March 2005 to November 2016 (Colorado Division of Wat ...
This is the dataset for the paper "Disjoint-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disorganized Texts". It includes two sub-datasets converted from CNN/DailyMail (D-CnnDM.zip) and WikiHow (D-WikiHow.zip). We include the data with training, validation, and test splits. The files for training the summarization model are WikiHowSep.zip and CnnDM.zip. We also include the small-scale data for D-WikiHow used for prompting experiments (D-WikiHow-sample). The generated summaries for all baselines, for further research and especially for human evaluation, are included (result.zip).
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
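As an illustration only (eLAB itself is written in R), the key-value remapping idea can be sketched in a few lines of Python; the data-dictionary code "potassium" and the small example data frame are hypothetical, and the real lookup table ships with the pipeline.

```python
# Hypothetical sketch of key-value remapping of lab subtypes to one DD code.
import pandas as pd

lookup = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}

labs = pd.DataFrame({
    "lab_name": ["Potassium(POC)", "Potassium,venous", "Sodium"],
    "value": [4.1, 3.9, 140.0],
})
labs["dd_code"] = labs["lab_name"].map(lookup)  # unmapped labs become NaN
labs = labs.dropna(subset=["dd_code"])          # keep only registry-defined labs
print(labs)
```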
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
In Peru, the handwashing project targets mothers/caregivers of children under five years old, and it is aimed at improving handwashing with soap practices. Children under five represent the age group most susceptible to diarrheal disease and acute respiratory infections, which are two major causes of childhood morbidity and mortality in less developed countries.
These infections, usually transferred from dirty hands to food or water sources, or by direct contact with the mouth, can be prevented if mothers/caregivers wash their hands with soap at critical times (such as before feeding a child, cooking, eating, and after using a toilet or changing a child’s diapers). In an effort to improve handwashing behavior, the intervention borrows from both commercial and social marketing fields. This entails the design of communications campaigns and messages likely to bring about the desired behavior changes, and delivering them strategically so that the target audiences are “surrounded” by handwashing promotion.
Some key elements of the intervention include: • Key behavioral concepts or triggers for each target audience • Persuasive arguments stating why and how a given concept or trigger will lead to behavior change, and • Communication ideas to convey the concepts through many integrated activities and communication channels.
The objective of the IE is to assess the effects of the project on individual-level handwashing behavior and practices of caregivers and children. By introducing exogenous variation in handwashing promotion (through randomized exposure to the project), the IE also addresses important issues related to the effect of intended behavioral change on child health and development outcomes. In particular, it provides information on the extent to which improved handwashing behavior impacts infant health and welfare.
The sample included in the IE study is not representative of the Peruvian population at the national level because the selection of provinces and districts was random and not weighted by population, as would be necessary to be geographically representative. Because populations differ across provinces and districts, the three-stage sampling design introduced a type of bias (with respect to geographical representativeness) because selection probabilities varied across administrative units.
Sample survey data [ssd]
The primary objective of the project is to improve the health and welfare of young children. The sample size (total number of households) was chosen to capture a minimum effect size of 20 percent on the key outcome indicator of diarrhea prevalence among children under two years old at the time of the baseline. The selection of households with children in this age group was made under the assumption that health outcome measurements for young children in this age range are most sensitive to changes in hygiene in the environment. Data was collected for household members of all age ranges and the corresponding data analysis was conducted for older children and adults as well. Power calculations indicated that, in order to capture a 20 percent reduction in diarrhea incidence, around 600 households per treatment arm would need to be surveyed. Therefore, since the evaluation consists of three treatment groups and two control groups, the final sample incorporates approximately 3,000 households, each with children less than two years of age at the time the survey was conducted. An additional 500 households were added to the sample size in order to address potential attrition (loss of participants during the project); thus the minimal necessary sample size was 3,500 households (around 700 households per arm).
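As a rough, hedged illustration of this kind of power calculation (not the IE team's actual computation), the snippet below uses statsmodels with a hypothetical baseline diarrhea prevalence of 25 percent; with these made-up inputs the result lands in the same ballpark as the roughly 600 households per arm cited above, though the real calculation may have used different inputs (for example clustering adjustments or a one-sided test).

```python
# Rough illustration only; the baseline prevalence (0.25) is a placeholder.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.25                 # assumed baseline diarrhea prevalence
target = baseline * (1 - 0.20)  # 20 percent relative reduction
effect = proportion_effectsize(baseline, target)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(round(n_per_arm), "households per arm before the attrition allowance")
```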
To select the sample, the IE team used a three-stage sampling methodology: • Stage 1: Province Level
From 195 total provinces in Peru, Pisco and Lima were excluded at the request of the implementation team. Of the remaining 193 provinces, 80 provinces were randomly chosen. Out of these 80 provinces, two groups of 40 provinces each were randomly formed: Group of Provinces 1 (GP1) and Group of Provinces 2 (GP2).
• Stage 2: District Level
In order to assess the impact of each of the components of the project in the health of children younger than five years old, the evaluation study has two main treatments, that is, one per component. These are the Mass Media Treatment at the provincial level, also referred to as Treatment 1 (T1), and the Social Mobilization Treatment at the district level, also referred to as Treatment 2 (T2). In order to evaluate and identify the health impacts of each component, a counterfactual to T1 and T2 is needed, which we refer to as the Control (C). The three groups, T1, T2, and C include households with children under two years old at the time of the baseline.
Out of the first group of 40 provinces, GP1, 40 districts with between 1,500 and 100,000 inhabitants were randomly chosen to receive T1. From the second group, GP2, 80 districts with between 1,500 and 100,000 inhabitants were selected randomly; 40 of them were randomly assigned to receive T2, and the other 40 districts serve as C to T1 and T2.
• Stage 3: Household Level
For each of the three sets of 40 districts (120 districts total) allocated to T1, T2, and C, 15-20 households with children under two years of age were selected at random in each district. Also, in each of the 40 districts
Face-to-face [f2f]
The following instruments were used to collect the data: • Household questionnaire: The household questionnaire was conducted in all households and was designed to collect data on household membership, education, labor, income, assets, dwelling characteristics, water sources, drinking water, sanitation, observations of handwashing facilities and other dwelling characteristics, handwashing behavior, child discipline, maternal depression, handwashing determinants, exposure to health interventions, relationship between family and school, and mortality.
• Health questionnaire: The health questionnaire was conducted in all households and designed to collect data on children’s diarrhea prevalence, ALRI and other health symptoms, child development, child growth, and anemia.
• Community questionnaire: The community questionnaire was conducted in 120 districts to collect data on community/districts variables.
• Structured observations: Structured observations were conducted in a subsample of 160 households to collect data on direct observation of handwashing behavior.
• Water samples: Water samples were collected in a subsample of 160 households, to identify Escherichia coli (E. coli) presence in hand rinses (mother and children), sentinel toy, and drinking water.
• Stool samples: Stool samples were collected in a subsample of 160 households to identify prevalence of parasites in children’s feces.
Baseline: The baseline survey was processed with the assistance of Sistemas Integrales in Chile. A manual for the data entry system is attached under the title: Data Entry Manual: Baseline.
Endline: Kimetrica International was contracted to design the data reduction system to be used during the endline. The data entry system was designed in CSPro (Version 4.1) using the DHS file management system as a standard for file management. Details of the system can be found in the attached manual entitled: Data Entry Manual for the Endline Survey.
The data entry system was based on a full double data entry (independent verification) of the various questionnaires. CSPro supports both dependent and independent verification (double keying) to ensure the accuracy of the data entry operation. Using independent verification, operators can key data into separate data files and use CSPro utilities to compare them and produce a report that indicates discrepancies in data entry.
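A minimal pandas sketch of the independent-verification idea (file names and the key column are hypothetical): two operators key the same questionnaires into separate files, and cell-level disagreements are extracted for adjudication.

```python
# Sketch: compare two independently keyed files and report discrepancies.
# File names and the "questionnaire_id" key are hypothetical placeholders.
import pandas as pd

first = pd.read_csv("entry_operator_1.csv").set_index("questionnaire_id").sort_index()
second = pd.read_csv("entry_operator_2.csv").set_index("questionnaire_id").sort_index()

# DataFrame.compare keeps only cells where the two keyings disagree
# (both files must share the same columns and questionnaire ids).
discrepancies = first.compare(second)
discrepancies.to_csv("keying_discrepancies.csv")
print(f"{discrepancies.shape[0]} questionnaires contain at least one keying difference")
```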
The DHS system uses a fully integrated tracking system to follow the stages in the data entry process. This includes the checking in of questionnaires; the programming of logic in what is known as a system controlled environment. System controlled applications generally place more restrictions on the data entry operator. This is typically used for complex survey applications. The behavior of these applications at data entry time has the following characteristics:
Files were processed using the unique cluster number and then concatenated after a final stage of editing and output to both SPSS and STATA.
Furthermore, attempts were made to respect the values and the naming conventions as provided in the baseline. This required using non-conventional values for "missing", such as -99. In most cases the same value sets were applied, or the WSP was alerted to such discrepancies during the questionnaire review process.
Baseline (result code, count, percent):
1  Completed interview       3,508   94.3
2  Incomplete interview         48    1.3
3  Not available                 7    0.2
4  Rescheduled interview         7    0.2
5  Nobody at home               48    1.3
6  Temporarily away             59    1.6
7  Refused to participate       44    1.2
   Total                     3,721
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On Tuesday, September 29th, from approximately 2:45 to 5:30 pm, an experiment for the Pan Trap Dataset was conducted with the assistance of my four other group members: Ashley, Adam, Katherine, and Kate. Initially, the weather was partly cloudy with drizzling rain; as time went on the rain got heavier, there was less sunlight, and there was a cool breeze. We hypothesized before gathering our sampling data that the rainy weather would influence the abundance of insects that would come out of the soil and become observable on the surface of the ground. In other words, we predicted that for most insects there would be a negative correlation between the amount of moisture gathered on the surface of the ground and the frequency at which insects would be observed.

For this dataset experiment, three different colours of Solo Bowls (white, blue, and yellow) were filled with soapy water and placed in groups. Each group, consisting of three different coloured bowls, was positioned at random places in the two distinct pre-set locations at the Danby Woodlot and the Grassland area. Three different colours were used in order to target a variety of insects that are attracted to certain colours, such as bees, which are specifically attracted to yellow when pollinating flowers; this way our data would be unbiased and would include most insects present in that environment. To ensure random sampling and further lower the chance of bias in our data collection, each group of three bowls was placed at a different random location within the Woodlot or Grassland site. Similarly, we placed the bowls in two different environmental settings, the Woodlot and the Grassland areas, in order to gather unbiased and random data; this way, a greater variety of insects present in different environmental settings could be captured.

In the end, as observed and counted, there was a greater variety of insects present in the Grassland area. Perhaps insects were able to take shelter among the heavy profusion of grass on the surface of the ground instead of the bare surface of the damp earth, which was more common in the Woodlot area. Insects can take refuge in hollow logs and between tree branches and trunks during rain, so perhaps that is why they were noticeably absent from the data samples from the Woodlot area. Overall, our hypothesis was to some extent correct with respect to the results of our data collection from the Woodlot area: not many insects were present on the damp, muddy soil of the Woodlot due to the heavy pouring rain. It is highly likely that most insects were hiding and taking shelter beneath the ground or in hollow tree trunks for their own survival.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
By Department of Energy [source]
The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.
This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relations between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV format as well as tabular format which can make it helpful for those who prefer to use programs like Excel or other statistical modeling software.
In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.
- Understand what's included: Before you start analyzing the data, read through the provided documentation so that you fully understand what is included in the datasets. Be aware of any limitations or requirements associated with each type of data point so that your results are valid and reliable when you draw conclusions from them.
- Clean up any outliers: You may need to spend some time upfront investigating suspicious outliers before using the dataset in further analyses; otherwise they can skew results later on and make complex statistical modeling more difficult, since they artificially inflate values in proportion to their magnitude (a single outlier can shift an entire model's prior distributions). Missing values should also be accounted for, since they may not be obvious at first glance when reviewing a table or a graphical representation. A short pandas sketch of these clean-up and exploration steps follows this guide.
- Exploratory data analysis: After cleaning up your dataset, do some basic exploration by visualizing summaries such as boxplots, histograms, and scatter plots. This gives an initial view of the trends that exist across regions and variables, which can inform future predictive models, and it will highlight any clear discontinuous changes over time.
- Analyze key metrics and observations: Once exploratory analyses have been carried out, post-processing steps follow, such as computing correlations among explanatory variables, performing significance tests and regression models, and imputing missing or outlier values, depending on the specific project needs at hand. Additionally, interpretation efforts based ...
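A minimal pandas sketch of the clean-then-explore steps above; the CSV file name and the assumption that the table contains numeric consumption columns are hypothetical placeholders.

```python
# Hedged sketch: flag outliers, count missing values, then explore.
import pandas as pd

df = pd.read_csv("building_energy_consumption.csv")  # hypothetical table

# Clean up: flag suspicious outliers (IQR rule) and count missing values.
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print("outliers per column:\n", outliers.sum())
print("missing values per column:\n", df.isna().sum())

# Explore: quick summaries before any modeling.
print(df.describe())
df.hist(figsize=(10, 8))  # histograms; rendering requires matplotlib
```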
- Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
- Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
- Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper presents a lightweight, flexible, extensible, machine readable and human-intelligible metadata schema that does not depend on a specific ontology. The metadata schema for metadata of data files is based on the concept of data lakes where data is stored as they are. The purpose of the schema is to enhance data interoperability. The lack of interoperability of messy socio-economic datasets that contain a mixture of structured, semi-structured, and unstructured data means that many datasets are underutilized. Adding a minimum set of rich metadata and describing new and existing data dictionaries in a standardized way goes a long way to make these high-variety datasets interoperable and reusable and hence allows timely and actionable information to be gleaned from those datasets. The presented metadata schema OIMS can help to standardize the description of metadata. The paper introduces overall concepts of metadata, discusses design principles of metadata schemes, and presents the structure and an applied example of OIMS.