Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A messy dataset for demonstrating "how to clean data using a spreadsheet". This dataset was intentionally formatted to be messy for the purpose of the demonstration. It was collated from https://openafrica.net/dataset/historic-and-projected-rainfall-and-runoff-for-4-lake-victoria-sub-regions
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the 17 anonymised datasets from the RAAAP-2 international survey of research management and administration professionals undertaken in 2019. To preserve anonymity the data are presented in 17 datasets linked only by AnalysisRegionofEmployment, as many of the textual responses, even though redacted to remove institutional affiliation, could be used to identify some individuals if linked to the other data. Each dataset is presented in the original SPSS format, suitable for further analyses, as well as an Excel equivalent for ease of viewing. Additional files in this collection show the questionnaire and the mappings to the datasets, together with the SPSS scripts used to produce the datasets. These data follow on from, but are not directly linked to, the first RAAAP survey undertaken in 2016, data from which can also be found on Figshare. Errata (16/5/23): an error in v13 of the main Data Cleansing syntax file (now updated to v14) meant that two variables were missing their value labels (the underlying codes were correct); a new version (SPSS & Excel) of the Main Dataset has been updated.
Drainage Gully Cleaning Programme DCC. Published by Dublin City Council. Available under the licence CC-BY 4.0. Schedule and monitoring of gully cleaning for Dublin City. These datasets show the gully cleaning statistics from 2004 to 14 September 2011. They consist of six attached Excel spreadsheets with the datasets from the Daily Returns section of the Gully Cleaning Application and one dataset from the Gully Repairs section of the gully application. They are divided into the five Dublin City Council administrative areas: Central, North Central, North West, South East, and South Central. There is also a dataset containing details of all gully repairs pending (all areas included). The datasets cover all Daily Returns since the gully cleaning programme commenced in 2004. Daily Returns are lists of the work that the gully cleaning crews carry out daily. All gullies on a street are cleaned where possible. A list of omissions is recorded where some gullies may not have been cleaned due to lack of access or other reasons; gullies that required repair were also noted. The Daily Returns datasets record only the number of gullies requiring repair on a particular street, not the details of the repair. Information in the fields is as follows:
- Road name: street name or laneway, denoted by the nearest house or lamp post etc. If a road name is followed by the letters PL in capitals, that road or a section of it has been placed on the priority list due to a history of flooding or a higher potential of the gully blocking because of its location. If a road name is followed by a number of zeros in the gullies inspected/gullies cleaned columns, it is very probable that this road was travelled during heavy rain as part of our flood zones and no flooding was noted along the road at the time of travelling. A road name followed by lower-case road names denotes a road that falls within more than one of our gully cleaning areas; the lower-case names denote the starting and finishing points for the crews working in the particular area, e.g. "Howth Road All Saints Rd-Fairview" denotes that the section of the Howth Road between All Saints Road and Fairview is within the area that the crew has been asked to work in.
- Gullies inspected: number of gullies inspected along the road/lane.
- Gullies cleaned: number of gullies cleaned from the total inspected.
- Gully omissions: number of gullies missed, i.e. unable to put a boom or shovel into the gully pot due to parked cars, unable to lift grids, hoarding over gullies, etc.
- Gully repairs: number of repairs based on inspections; note that not all repairs prevent the gully from being cleaned.
- Comments box: used to provide any additional information that may be of benefit; results of work carried out by the mini jet are also recorded in this box. ...
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
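Purely as an illustration of this kind of suppression rule (not CDC's actual procedure), a minimal pandas sketch might look like the following; the column names and the way the threshold is applied are assumptions.

```python
import pandas as pd

# Hypothetical public-use extract; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "age_group": ["0-17", "18-49", "50-64", "65+"],
    "race_ethnicity": ["Hispanic", "White, NH", "Black, NH", "Asian, NH"],
})

# Count how often each demographic combination occurs in the data.
combo_counts = df.groupby(["sex", "age_group", "race_ethnicity"])["sex"].transform("size")

# Re-code (never drop) rare combinations: cells in low-frequency (<5) groups become "NA".
rare = combo_counts < 5
df.loc[rare, ["sex", "age_group", "race_ethnicity"]] = "NA"
```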
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
https://creativecommons.org/publicdomain/zero/1.0/
Vrinda Store: Interactive MS Excel dashboard (Feb 2024 - Mar 2024). The owner of the Vrinda store wants to create an annual sales report for 2022 so that employees can understand their customers and grow sales further. The questions asked by the owner are as follows: 1) Compare the sales and orders using a single chart. 2) Which month got the highest sales and orders? 3) Who purchased more in 2022 - women or men? 4) What were the different order statuses in 2022? And some other business-related questions. The owner wanted a visual story of the data that depicts the real-time progress and sales insights of the store. This project is an MS Excel dashboard that presents an interactive visual story to help the owner and employees increase sales. Tasks performed: data cleaning, data processing, data analysis, data visualization, reporting. Tool used: MS Excel. Skills: Data Analysis · Data Analytics · MS Excel · Pivot Tables
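The project itself is built entirely in Excel (pivot tables and charts). Purely as an illustration, the second question - which month got the highest sales and orders - could also be answered with a short pandas sketch like the one below; the file and column names (Date, Amount, Order ID) are assumptions, not the actual workbook headers.

```python
import pandas as pd

# Hypothetical order-level export of the 2022 sales data; names are assumed.
orders = pd.read_csv("vrinda_store_2022.csv", parse_dates=["Date"])

monthly = (
    orders
    .assign(Month=orders["Date"].dt.month_name())
    .groupby("Month")
    .agg(total_sales=("Amount", "sum"), total_orders=("Order ID", "count"))
)

# Month with the highest sales (and its order count).
print(monthly.sort_values("total_sales", ascending=False).head(1))
```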
Our goals with this dataset were to 1) isolate, culture, and identify two fungal life stages of Aspergillus flavus, 2) characterize the volatile emissions from grain inoculated by each fungal morphotype, and 3) understand how microbially-produced volatile organic compounds (MVOCs) from each fungal morphotype affect foraging, attraction, and preference by S. oryzae. This dataset includes data derived from headspace collection coupled with GC-MS, where we found that the sexual life stage of A. flavus had the most unique emissions of MVOCs compared to the other semiochemical treatments. This translated to a higher arrestment with kernels containing grain with the A. flavus sexual life stage, as well as a higher cumulative time spent in those zones by S. oryzae in a video-tracking assay, in comparison to the asexual life stage. While fungal cues were important for foraging at close range, the release-recapture assay indicated that grain volatiles were more important for attraction at longer distances. There was no significant preference between grain and MVOCs in a four-way olfactometer, but methodological limitations in this assay prevent broad interpretation. Overall, this study enhances our understanding of how fungal cues affect the foraging ecology of a primary stored product insect. In the assays described herein, we analyzed the behavioral response of Sitophilus oryzae to five different blends of semiochemicals found and introduced in wheat (Table 1). Briefly, these included no stimuli (negative control), UV-sanitized grain, clean grain from storage (unmanipulated, positive control), as well as grain from storage inoculated with fungal morphotype 1 (M1, identified as the asexual life stage of Aspergillus flavus) and fungal morphotype 2 (M2, identified as the sexual life stage of A. flavus). Fresh samples of semiochemicals were used for each day of testing for each assay. In order to prevent cross-contamination, 300 g of grain (tempered to 15% grain moisture) was initially sanitized using UV for 20 min. This procedure was done before inoculating grain with either morphotype 1 or 2. The 300 g of grain was kept in a sanitized mason jar (8.5 D × 17 cm H). To inoculate grain with the two different morphologies, we scraped an entire isolation from a petri dish into the 300 g of grain. Each isolation was ~1 week old and completely colonized by the given morphotype. After inoculation, each treatment was placed in an environmental chamber (136VL, Percival Instruments, Perry, IA, USA) set at constant conditions (30°C, 65% RH, and 14:10 L:D). This procedure was the same for both morphologies and was done every 2 weeks to ensure fresh treatments for each experimental assay. See the file list for descriptions of each data file.
Resources in this dataset:
- Ethovision Movement Assay. File name: ponce_lizarraga_ethovision_assay_microbial_volatiles_2020.csv. Recommended software: Excel (https://www.microsoft.com/en-us/microsoft-365/excel)
- Olfactometer Round 1 Assay - With Fused Air Permeable Glass. File name: ponce_lizarraga_first_round_olfactometer_fungal_study_2020.csv. Recommended software: Excel
- Olfactometer Round 2 Assay - With Fused Air Permeable Glass Containing Holes. File name: ponce_lizarraga_second_round_olfactometer_fungal_study_2021.csv. Recommended software: Excel
- Small Release-Recapture Assay. File name: ponce_lizarraga_small_release_recapture_assay.csv. Recommended software: Excel
- Large Release-Recapture Assay. File name: ponce_lizarraga_large_release_recapture_assay.csv. Recommended software: Excel
- Headspace Volatile Collection Assay. File name: sandra_headspace_volatiles_2020.csv. Recommended software: Excel
- README file list. File name: file_list_stored_grain_Aspergillus_Sitophilus_oryzae.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV data record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R data record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
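For readers who prefer Python over the provided R tutorials, a rough equivalent of the first tutorial's merge step is sketched below; the join key (SEQN, the standard NHANES respondent identifier) and the file names are assumptions and should be checked against the data dictionary.

```python
import pandas as pd
from functools import reduce

# Hypothetical file names for the cleaned modules; adjust to the actual .csv names.
modules = ["demographics_clean.csv", "dietary_clean.csv", "chemicals_clean.csv"]

frames = [pd.read_csv(name) for name in modules]

# Outer-join the modules on the participant identifier (assumed to be SEQN).
merged = reduce(lambda left, right: left.merge(right, on="SEQN", how="outer"), frames)

print(merged.shape)
```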
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.
What this output is: This is an Excel spreadsheet file containing three sheets with a total of 3,478 Tweets publicly published with the hashtag #DH2016. The archive starts with a Tweet published on Sunday 10 July 2016 at 00:03:41 +0000 and finishes with a Tweet published on Tuesday 12 July 2016 at 23:55:47 +0000. The original collection has been organised into conference days, one sheet per day (GMT and Central European Time included). A breakdown of Tweets per day:
- Sunday 10 July 2016: 179 Tweets
- Monday 11 July 2016: 981 Tweets
- Tuesday 12 July 2016: 2,318 Tweets
Methodology and limitations: The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of the original authors. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.
Other considerations: Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata that users choose freely to use so their content is associated with, directly linked to and categorised under the chosen hashtag.
Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's Privacy and data sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection.Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time.The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods and anyone not wishing to attribute the data to the creator of this output is needless to say free to do their own collection and clean their own data.
This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel dataset, a cleaned and audited version of the raw Excel dataset, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process for the "Hive Annotation Job Results" data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file "Hive Annotation Job Results - Audited".
Observation
The "Hive Annotation Job Results" data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes "file", "object id", and pseudonyms for the five questions that each client was instructed to answer about their respective table: "tabular", "semantic", "definition list", "header row", and "header column". The "file" column includes non-unique numbers (that is, there are multiple instances of the same value in the column) separated by a dash. The "object id" column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
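For comparison, the same two sanity checks (only TRUE/FALSE answers, no missing cells) could be reproduced outside Excel with a short pandas sketch; the file name below is hypothetical.

```python
import pandas as pd

# Hypothetical export of the raw annotation results.
df = pd.read_excel("hive_annotation_job_results_raw.xlsx")

question_cols = ["tabular", "semantic", "definition list", "header row", "header column"]

# Equivalent of the COUNTIF() check: every answer must be a Boolean TRUE/FALSE.
assert df[question_cols].isin([True, False]).all().all()

# Equivalent of the missing-value check: no blank cells anywhere in the sheet.
assert df.notna().all().all()
```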
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
- A table that is a definition list should automatically be tabular and also semantic
- Semantic tables should automatically be tabular
- If a table is NOT tabular, then it is definitely not semantic nor a definition list
- A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular

For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column...

(A code sketch of these checks follows below.)
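One possible way to flag jobs for rerun is to mark every row whose answers violate any of the background facts. The following pandas sketch is an illustration of that idea, not necessarily the exact method used in the audit; the file name is hypothetical and the column names follow the question pseudonyms listed above.

```python
import pandas as pd

# Hypothetical frame of worker answers; one Boolean column per question.
df = pd.read_excel("hive_annotation_job_results_raw.xlsx")

# Translate the background facts into row-level consistency checks.
violates = (
    (df["definition list"] & ~(df["tabular"] & df["semantic"]))    # definition list => tabular and semantic
    | (df["semantic"] & ~df["tabular"])                            # semantic => tabular
    | (~df["tabular"] & (df["semantic"] | df["definition list"]))  # not tabular => not semantic, not definition list
    | (df["tabular"] & (df["header row"] | df["header column"]) & ~df["semantic"])  # tabular + header => semantic
)

# Rows whose answers break the background facts (candidates for rerun).
print(df.loc[violates, ["file", "object id"]])
```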
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Description: This dataset contains detailed measurements collected during two controlled experiments designed to study the dynamics of the composting process using the forced aeration technique. The dataset is divided into two main parts:
Experiment 1: Includes parameters such as temperatures (hot air and compost pile), relative humidity, and air and heat inputs. Experiment 2: Complements the first experiment with oxygen levels in addition to the previously mentioned variables. Both datasets are organized in a chronological format, with records that allow the analysis of trends and correlations among the studied variables.
Purpose: The primary objective of this dataset is to facilitate the study of composting dynamics in high mountain environments using the forced aeration technique. It can be used for:
- Bioprocess modeling.
- Studies on energy optimization in biological and chemical processes.
- Research in environmental biology, process engineering, and clean technologies.

Dataset features:
- Total size: Experiment 1: 4,302 records and 8 variables; Experiment 2: 3,076 records and 9 variables.
- Temporal coverage: records are organized by hour and minute over several days of experimentation.
- Key variables: hour and minute of the record; heater and compost pile temperatures; relative humidity; air and heat inputs; oxygen levels (in Experiment 2); days elapsed since the start of the experiment.
- Available formats: the dataset is available in Excel format (.xlsx), with each experiment documented on a separate sheet.
Access and use: Commercial use of the dataset requires prior authorization. Potential applications: this resource is valuable for researchers in fields such as:
- Environmental engineering and bioprocesses.
- Design and optimization of thermal and environmental control systems.
The 2003 Agriculture Sample Census was designed to meet the data needs of a wide range of users down to district level, including policy makers at local, regional and national levels, rural development agencies, funding institutions, researchers, NGOs, farmer organisations, etc. As a result, the dataset is both larger in sample size and more detailed in scope than previous censuses and surveys. To date this is the most detailed agricultural census carried out in Africa.
The census was carried out in order to:
- Identify structural changes, if any, in the size of farm household holdings, crop and livestock production, and farm input and implement use. It also seeks to determine whether there are any improvements in rural infrastructure and in the level of agricultural household living conditions;
- Provide benchmark data on productivity, production and agricultural practices in relation to policies and interventions promoted by the Ministry of Agriculture and Food Security and other stakeholders;
- Establish baseline data for the measurement of the impact of the high-level objectives of the Agriculture Sector Development Programme (ASDP), the National Strategy for Growth and Reduction of Poverty (NSGRP) and other rural development programmes and projects;
- Obtain benchmark data that will be used to address specific issues such as food security, rural poverty, gender, agro-processing, marketing, service delivery, etc.
Tanzania Mainland and Zanzibar
Large scale, small scale and community farms.
Census/enumeration data [cen]
The Mainland sample consisted of 3,221 villages. These villages were drawn from the National Master Sample (NMS) developed by the National Bureau of Statistics (NBS) to serve as a national framework for the conduct of household based surveys in the country. The National Master Sample was developed from the 2002 Population and Housing Census. The total Mainland sample was 48,315 agricultural households. In Zanzibar a total of 317 enumeration areas (EAs) were selected and 4,755 agriculture households were covered. Nationwide, all regions and districts were sampled with the exception of three urban districts (two from Mainland and one from Zanzibar).
In both Mainland and Zanzibar, a stratified two stage sample was used. The number of villages/EAs selected for the first stage was based on a probability proportional to the number of villages in each district. In the second stage, 15 households were selected from a list of farming households in each selected Village/EA, using systematic random sampling, with the village chairpersons assisting to locate the selected households.
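Purely as an illustrative sketch (not the NBS procedure itself), the second-stage selection of 15 households by systematic random sampling could be expressed as follows; the village listing is hypothetical.

```python
import random

def systematic_sample(household_list, n=15, seed=None):
    """Select n households from a village listing using systematic random sampling."""
    rng = random.Random(seed)
    interval = len(household_list) / n                # sampling interval
    start = rng.uniform(0, interval)                  # random start within the first interval
    picks = [min(int(start + i * interval), len(household_list) - 1) for i in range(n)]
    return [household_list[i] for i in picks]

# Example: a hypothetical village listing of 230 farming households.
village = [f"household_{i}" for i in range(1, 231)]
print(systematic_sample(village, n=15, seed=1))
```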
Face-to-face [f2f]
The census covered agriculture in detail as well as many other aspects of rural development and was conducted using three different questionnaires:
- Small scale questionnaire
- Community level questionnaire
- Large scale farm questionnaire
The small scale farm questionnaire was the main census instrument and it includes questions related to crop and livestock production and practices; population demographics; access to services, resources and infrastructure; and issues of poverty, gender and subsistence versus profit-making production units.
The community level questionnaire was designed to collect village level data such as access and use of common resources, community tree plantation and seasonal farm gate prices.
The large scale farm questionnaire was administered to large farms either privately or corporately managed.
Questionnaire Design
The questionnaires were designed following user meetings to ensure that the questions asked were in line with users' data needs. Several features were incorporated into the design of the questionnaires to increase the accuracy of the data:
- Where feasible, all variables were extensively coded to reduce post-enumeration coding error.
- The definitions for each section were printed on the opposite page so that the enumerator could easily refer to the instructions whilst interviewing the farmer.
- The responses to all questions were placed in boxes printed on the questionnaire, with one box per character. This feature made it possible to use scanning and Intelligent Character Recognition (ICR) technologies for data entry.
- Skip patterns were used to reduce unnecessary and incorrect coding of sections which do not apply to the respondent.
- Each section was clearly numbered, which facilitated the use of skip patterns and provided a reference for data type coding for the programming of CSPro, SPSS and the dissemination applications.
Data processing consisted of the following processes:
- Data entry
- Data structure formatting
- Batch validation
- Tabulation
Data Entry Scanning and ICR data capture technology for the small holder questionnaire were used on the Mainland. This not only increased the speed of data entry, it also increased the accuracy due to the reduction of keystroke errors. Interactive validation routines were incorporated into the ICR software to track errors during the verification process. The scanning operation was so successful that it is highly recommended for adoption in future censuses/surveys. In Zanzibar all data was entered manually using CSPro.
Prior to scanning, all questionnaires underwent a manual cleaning exercise. This involved checking that the questionnaire had a full set of pages, correct identification and good handwriting. A score was given to each questionnaire based on the legibility and the completeness of enumeration. This score will be used to assess the quality of enumeration and supervision in order to select the best field staff for future censuses/surveys.
CSPro was used for data entry of all Large Scale Farm and community based questionnaires due to the relatively small number of questionnaires. It was also used to enter data from the 2,880 small holder questionnaires that were rejected by the ICR extraction application.
Data Structure Formatting A program was developed in visual basic to automatically alter the structure of the output from the scanning/extraction process in order to harmonise it with the manually entered data. The program automatically checked and changed the number of digits for each variable, the record type code, the number of questionnaires in the village, the consistency of the Village ID Code and saved the data of one village in a file named after the village code.
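The original program was written in Visual Basic. Purely to illustrate the kind of harmonisation described (fixed field widths, record checks, one output file per village), a hedged Python sketch is given below; the field names and widths are assumptions.

```python
from collections import defaultdict

# Assumed fixed widths for a few identifier fields (illustrative only).
FIELD_WIDTHS = {"record_type": 2, "village_id": 8, "holding_id": 4}

def harmonise(record):
    """Zero-pad each identifier to a fixed width so scanned and keyed records match."""
    return {field: str(record[field]).zfill(width) for field, width in FIELD_WIDTHS.items()}

def write_by_village(records):
    """Group harmonised records by village ID and save one file per village."""
    by_village = defaultdict(list)
    for rec in records:
        fixed = harmonise(rec)
        by_village[fixed["village_id"]].append(fixed)
    for village_id, recs in by_village.items():
        with open(f"{village_id}.txt", "w") as fh:
            for rec in recs:
                fh.write(" ".join(rec.values()) + "\n")

# Example with two hypothetical scanned records.
write_by_village([
    {"record_type": 1, "village_id": 10203040, "holding_id": 7},
    {"record_type": 1, "village_id": 10203040, "holding_id": 8},
])
```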
Batch Validation A batch validation program was developed in order to identify inconsistencies within a questionnaire. This is in addition to the interactive validation during the ICR extraction process. The procedures varied from simple range checking within each variable to the more complex checking between variables. It took six months to screen, edit and validate the data from the smallholder questionnaires. After the long process of data cleaning, tabulations were prepared based on a pre-designed tabulation plan.
Tabulations Statistical Package for Social Sciences (SPSS) was used to produce the Census tabulations and Microsoft Excel was used to organize the tables and compute additional indicators. Excel was also used to produce charts while ArcView and Freehand were used for the maps.
Analysis and Report Preparation The analysis in this report focuses on regional comparisons, time series and national production estimates. Microsoft Excel was used to produce charts; ArcView and Freehand were used for maps, whereas Microsoft Word was used to compile the report.
Data Quality A great deal of emphasis was placed on data quality throughout the whole exercise from planning, questionnaire design, training, supervision, data entry, validation and cleaning/editing. As a result of this, it is believed that the census is highly accurate and representative of what was experienced at field level during the Census year. With very few exceptions, the variables in the questionnaire are within the norms for Tanzania and they follow expected time series trends when compared to historical data. Standard Errors and Coefficients of Variation for the main variables are presented in the Technical Report (Volume I).
Sampling errors are presented on pages 21-22 of the Technical Report for the Agriculture Sample Census Survey 2002-2003.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The attached datasets accompany the publication titled "Impact of material properties in determining quaternary ammonium compound adsorption and wipe product efficacy against biofilms", published in the Journal of Hospital Infection in 2022 (https://doi.org/10.1016/j.jhin.2022.03.013). The dataset consists of a summary Excel spreadsheet of the data gathered across 2020 and 2021, as well as .txt files containing the results of the BET surface analysis. Each tab of the spreadsheet is labelled with the location where the data is used in the publication.
This layer visualizes over 60,000 commercial flight paths. The data was obtained from openflights.org and was last updated in June 2014. The site states, "The third-party that OpenFlights uses for route data ceased providing updates in June 2014. The current data is of historical value only. As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67,663 routes between 3,321 airports on 548 airlines spanning the globe. Creating and maintaining this database has required and continues to require an immense amount of work. We need your support to keep this database up-to-date." To donate, visit the site and click the PayPal link. Routes were created using the XY-to-Line tool in ArcGIS Pro, inspired by Kenneth Field's work, and following a modified methodology from Michael Markieta (www.spatialanalysis.ca/2011/global-connectivity-mapping-out-flight-routes). Some cleanup was required in the original data, including adding missing location data for several airports and some missing IATA codes. Before performing the point-to-line conversion, the key to preserving attributes in the original data is a combination of the INDEX and MATCH functions in Microsoft Excel. Example function: =INDEX(Airlines!$B$2:$B$6200,MATCH(Routes!$A2,Airlines!$D$2:Airlines!$D$6200,0))
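For reference, the same attribute lookup can be reproduced outside Excel with a pandas merge; the sheet and column names below are assumptions based on the description above, not the actual workbook layout.

```python
import pandas as pd

# Hypothetical workbook layout: one sheet of routes, one sheet of airline attributes.
routes = pd.read_excel("openflights_routes.xlsx", sheet_name="Routes")
airlines = pd.read_excel("openflights_routes.xlsx", sheet_name="Airlines")

# Equivalent of INDEX/MATCH: look up each route's airline name by airline code.
routes_with_names = routes.merge(
    airlines[["IATA", "Name"]], left_on="Airline", right_on="IATA", how="left"
)

print(routes_with_names.head())
```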
The dataset was generated from a set of Excel spreadsheets extracted from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The data in this second part of the series contain information on applications to UCT made between January 2015 and September 2019.
In the original form received by DataFirst the data were ill suited to research purposes. The series represents an attempt at cleaning and organizing the data into a more tractable format.
Individuals, applications
All applications to study at the University of Cape Town
Administrative records data
Other [oth]
In order to lessen computation times the main applications file was split by year - this part contains the years 2014-2019. Note however that the other 3 files released with the application file (that can be merged into it for additional detail) did not need to be split. As such, the four files can be used to produce a series for 2014-2019 and are labelled as such, even though the person, secondary schooling and tertiary education files all span a longer time period.
Here is additional information about the files:
Further information on the processing of the original data files is summarised in a document entitled "Notes on preparing the UCT Student Admissions Data" accompanying the data.
Input and output data of the modelling work for the paper "Effects of a Delayed Expansion of Interconnector Capacities in a High RES-E European Electricity System".
Considered scenario years: 2030, 2040 and 2050. The profiles are based on the historical year 2016. Two scenarios are included: lower connectivity and high connectivity. The data cover the ENTSO-E member countries except Iceland and Cyprus and are given in country-specific resolution.
Input:
- Demand as hourly profile in MWh
- Variable RES-E as hourly profile in MWh
- Power plant fleet as capacities in MW
- NTCs as capacities in MW
Output:
- CO2 emissions as annual data in Mt
- Variable electricity generation costs as annual data in MEur
- Variable electricity generation costs per generation as annual data in Euro/MWh
- Electricity generation as annual data in TWh
- Electricity export as annual data in TWh
- Electricity import as annual data in TWh
- Transit flows as annual data in TWh
The sources are described in the corresponding paper under the following link: https://www.mdpi.com/1996-1073/12/16/3098
References:
- Repenning, J.; Hermann, H.; Emele, L.; Jörß, W.; Blanck, R.; Loreck, C.; Böttcher, H.; Ludig, S.; Dehoust, G.; Matthes, Felix Chr. et al. Klimaschutzszenario 2050. 2. Modellierungsrunde, Berlin, 2015. Available online: https://www.oeko.de/oekodoc/2451/2015-608-de.pdf
- Andersky, T.; Sanchis, G.; Betraoui, B. e-Highway 2050 - Database per country. Excel sheet, 2016.
- ENTSO-E. Transparency Platform. Available online: https://transparency.entsoe.eu/
- ENTSO-E. TYNDP 2018 Scenario Report. Data set, Brussels, 2018. Available online: https://tyndp.entsoe.eu/maps-data/
We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

The proposed network is built starting from two distinct data sources:
- the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
- the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because they are not part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

# Data cleaning and enrichment through data integration: networking the Italian academia
https://doi.org/10.5061/dryad.wpzgmsbwj
Manuscript published in Scientific Data with DOI .
This repository contains two main data files:
- edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
- Coauthorship_Network_AGG.graphml, the full network in GraphML format.

along with several supplementary data files, listed below, useful only to build the network (i.e., for reproducibility only):
- University-City-match.xlsx, an Excel file that maps the name of a university against the city where its headquarters is located;
- Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.

The `Coauthorship_Networ...
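A minimal sketch of loading the released network with NetworkX (assuming the GraphML file has been downloaded to the working directory):

```python
import networkx as nx

# Load the full co-authorship network released in GraphML format.
G = nx.read_graphml("Coauthorship_Network_AGG.graphml")

print(G.number_of_nodes(), G.number_of_edges())  # expected: 38,220 nodes, 507,050 edges

# Node and edge attributes (gender, research fields, temporal data, ...) are kept as
# GraphML properties and can be inspected directly.
some_node = next(iter(G.nodes))
print(G.nodes[some_node])
```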
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format
TL;DR: this is a tidy and friendly version of a subset of the PECD 2021.3 data by ENTSO-E: hourly capacity factors for onshore wind, offshore wind and solar PV, hourly electricity demand, weekly inflow for reservoir and pumping, and daily generation for run-of-river. All the data are provided for >30 climatic years (1982-2019 for wind and solar, 1982-2016 for demand, 1982-2017 for hydropower) and at national and sub-national (>140 zones) level.
ENTSO-E has released with the latest European Resource Adequacy Assessment (ERAA 2021) all the inputs used in the study.
Those inputs include:
- Demand dataset: https://eepublicdownloads.azureedge.net/clean-documents/sdc-documents/ERAA/Demand%20Dataset.7z
- Climate data: https://eepublicdownloads.entsoe.eu/clean-documents/sdc-documents/ERAA/Climate%20Data.7z
The data files and the methodology are available on the official webpage.
As done for the previous releases (see https://zenodo.org/record/3702418#.YbmhR23MKMo and https://zenodo.org/record/3985078#.Ybmhem3MKMo), the original data - stored in large Excel spreadsheets - have been tidied and formatted in open and friendly formats (CSV for the small tables and Parquet for the large files).
Furthermore, we have carried out a simple country-level aggregation of the original data, which instead uses >140 zones.
DISCLAIMER: the content of this dataset has been created with the greatest possible care. However, we invite to use the original data for critical applications and studies.
Description
This dataset includes the following files:
- capacities-national-estimates.csv: installed capacity in MW per zone, technology and the two scenarios (2025 and 2030). The files include also the total capacity for each technology per country (sum of all the zones within a country)
- PECD-2021.3-wide-LFSolarPV-2025 and PECD-2021.3-wide-LFSolarPV-2030: tables in Parquet format storing, in each row, the capacity factor for solar PV for an hour of the year and all the climatic years (1982-2019) for a specific zone. The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-wide-Onshore-2025 and PECD-2021.3-wide-Onshore-2030: same as above but for wind onshore
- PECD-2021.3-wide-Offshore-2025 and PECD-2021.3-wide-Offshore-2030: same as above but for wind offshore
- PECD-wide-demand_national_estimates-2025 and PECD-wide-demand_national_estimates-2030: hourly electricity demand for all the climatic years for a specific zone. The two files contain the load for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-country-LFSolarPV-2025 and PECD-2021.3-country-LFSolarPV-2030: tables in Parquet format storing in each row the capacity factor for country/climatic year and hour of the year. The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-country-Onshore-2025 and PECD-2021.3-country-Onshore-2030: same as above but for wind onshore
- PECD-2021.3-country-Offshore-2025 and PECD-2021.3-country-Offshore-2030: same as above but for wind offshore
- PECD-country-demand_national_estimates-2025 and PECD-country-demand_national_estimates-2030: same as above but for electricity demand
- PECD_EERA2021_reservoir_pumping.zip: archive with four files per each scenario: 1. table.csv with generation and storage capacities per zone/technology, 2. zone weekly inflow (GWh), 3. table.csv with generation and storage per country/technology and 4. country weekly inflow (GWh)
- PECD_EERA2021_ROR.zip: as for the previous file but the inflow is daily
- plots.zip: archive with 182 png figures with the weekly climatology for all the variables (daily for the electricity demand)
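As a minimal usage sketch (the exact file name, extension and column layout are assumptions - check the archive before relying on them):

```python
import pandas as pd

# Assumed file name; each row holds one zone/hour with one column per climatic year.
cf = pd.read_parquet("PECD-2021.3-wide-LFSolarPV-2025.parquet")

print(cf.head())

# Example: mean capacity factor per zone across all hours and climatic years,
# assuming a 'zone' identifier column and numeric climatic-year columns.
year_cols = [c for c in cf.columns if c not in ("zone", "hour")]
print(cf.groupby("zone")[year_cols].mean().mean(axis=1))
```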
Note
I would like to thank Laurens Stoop for sharing the onshore wind data for the scenario 2030, that was corrupted in the original archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes open-source projects written in the C# programming language, annotated for the presence of the Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:
Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.
The dataset contains two Excel datasheets:
The columns in the datasheet represent:
To help guide their reasoning for evaluating the presence and severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate Excel datasheets:
The columns of these two datasheets are: