Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A messy dataset for demonstrating "how to clean data using a spreadsheet". This dataset was intentionally formatted to be messy for the purpose of the demonstration. It was collated from https://openafrica.net/dataset/historic-and-projected-rainfall-and-runoff-for-4-lake-victoria-sub-regions
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the 17 anonymised datasets from the RAAAP-2 international survey of research management and administration professionals undertaken in 2019. To preserve anonymity the data are presented in 17 datasets linked only by AnalysisRegionofEmployment, as many of the textual responses, even though redacted to remove institutional affiliation, could be used to identify some individuals if linked to the other data. Each dataset is presented in the original SPSS format, suitable for further analyses, as well as an Excel equivalent for ease of viewing. Additional files in this collection show the questionnaire and the mappings to the datasets, together with the SPSS scripts used to produce the datasets. These data follow on from, but are not directly linked to, the first RAAAP survey undertaken in 2016, data from which can also be found on Figshare. Errata (16/5/23): an error in v13 of the main Data Cleansing syntax file (now updated to v14) meant that two variables were missing their value labels (the underlying codes were correct); a new version (SPSS & Excel) of the Main Dataset has been updated.
Drainage Gully Cleaning Programme DCC. Published by Dublin City Council. Available under the licence CC-BY 4.0. Schedule and monitoring of gully cleaning for Dublin City. These datasets show the gully cleaning statistics from 2004 to 14 September 2011. They consist of six attached Excel spreadsheets with the datasets from the Daily Returns section of the Gully Cleaning Application and one dataset from the Gully Repairs section of the gully application. They are divided into the five Dublin City Council administrative areas: Central, North Central, North West, South East, and South Central. There is also a dataset containing details of all gully repairs pending (all areas included). The datasets cover all Daily Returns since the gully cleaning programme commenced in 2004. Daily Returns are lists of the work that the gully cleaning crews carry out daily. All gullies on a street are cleaned where possible. A list of omissions is recorded where some gullies may not have been cleaned due to lack of access or other reasons; gullies that required repair were also noted. The Daily Returns datasets record only the number of gullies requiring repair on a particular street, not the details of the repair. Information in the fields is as follows:
- Road name: street name or laneway, denoted by the nearest house or lamp post etc. If a road name is followed by the letters PL in capitals, that road or a section of it has been placed on the priority list due to a history of flooding or a higher potential of the gully blocking because of its location. If a road name is followed by a number of zeros in the gullies inspected/gullies cleaned columns, it is very probable that this road was travelled during heavy rain as part of our flood zones and no flooding was noted along the road at the time of travelling. A road name followed by lower-case road names denotes a road that falls within more than one of our gully cleaning areas; the lower-case names denote the starting and finishing points for the crews working in the particular area, e.g. "Howth Road All Saints Rd-Fairview" denotes that the section of the Howth Road between All Saints Road and Fairview is within the area that the crew has been asked to work in.
- Gullies inspected: number of gullies inspected along the road/lane.
- Gullies cleaned: number of gullies cleaned from the total inspected.
- Gully omissions: number of gullies missed, i.e. unable to put a boom or shovel into the gully pot due to parked cars, unable to lift grids, hoarding over gullies, etc.
- Gully repairs: number of repairs based on inspections; note that not all repairs prevent the gully from being cleaned.
- Comments box: used to provide any additional information that may be of benefit; results of work carried out by the mini jet are also recorded in this box. ...
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
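Purely as an illustration of this kind of suppression rule (not CDC's actual procedure), a minimal pandas sketch might look like the following; the column names and the way the threshold is applied are assumptions.

```python
import pandas as pd

# Hypothetical public-use extract; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "age_group": ["0-17", "18-49", "50-64", "65+"],
    "race_ethnicity": ["Hispanic", "White, NH", "Black, NH", "Asian, NH"],
})

# Count how often each demographic combination occurs in the data.
combo_counts = df.groupby(["sex", "age_group", "race_ethnicity"])["sex"].transform("size")

# Re-code (never drop) rare combinations: cells in low-frequency (<5) groups become "NA".
rare = combo_counts < 5
df.loc[rare, ["sex", "age_group", "race_ethnicity"]] = "NA"
```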
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
https://creativecommons.org/publicdomain/zero/1.0/
Vrinda Store: Interactive MS Excel dashboard (Feb 2024 - Mar 2024). The owner of the Vrinda store wants to create an annual sales report for 2022 so that employees can understand their customers and grow sales further. The questions asked by the owner are as follows: 1) Compare the sales and orders using a single chart. 2) Which month got the highest sales and orders? 3) Who purchased more in 2022 - women or men? 4) What were the different order statuses in 2022? And some other business-related questions. The owner wanted a visual story of the data that depicts the real-time progress and sales insights of the store. This project is an MS Excel dashboard that presents an interactive visual story to help the owner and employees increase sales. Tasks performed: data cleaning, data processing, data analysis, data visualization, reporting. Tool used: MS Excel. Skills: Data Analysis · Data Analytics · MS Excel · Pivot Tables
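The project itself is built entirely in Excel (pivot tables and charts). Purely as an illustration, the second question - which month got the highest sales and orders - could also be answered with a short pandas sketch like the one below; the file and column names (Date, Amount, Order ID) are assumptions, not the actual workbook headers.

```python
import pandas as pd

# Hypothetical order-level export of the 2022 sales data; names are assumed.
orders = pd.read_csv("vrinda_store_2022.csv", parse_dates=["Date"])

monthly = (
    orders
    .assign(Month=orders["Date"].dt.month_name())
    .groupby("Month")
    .agg(total_sales=("Amount", "sum"), total_orders=("Order ID", "count"))
)

# Month with the highest sales (and its order count).
print(monthly.sort_values("total_sales", ascending=False).head(1))
```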
Our goals with this dataset were to 1) isolate, culture, and identify two fungal life stages of Aspergillus flavus, 2) characterize the volatile emissions from grain inoculated by each fungal morphotype, and 3) understand how microbially-produced volatile organic compounds (MVOCs) from each fungal morphotype affect foraging, attraction, and preference by S. oryzae. This dataset includes data derived from headspace collection coupled with GC-MS, where we found that the sexual life stage of A. flavus had the most unique emissions of MVOCs compared to the other semiochemical treatments. This translated to a higher arrestment with kernels containing grain with the A. flavus sexual life stage, as well as a higher cumulative time spent in those zones by S. oryzae in a video-tracking assay, in comparison to the asexual life stage. While fungal cues were important for foraging at close range, the release-recapture assay indicated that grain volatiles were more important for attraction at longer distances. There was no significant preference between grain and MVOCs in a four-way olfactometer, but methodological limitations in this assay prevent broad interpretation. Overall, this study enhances our understanding of how fungal cues affect the foraging ecology of a primary stored product insect. In the assays described herein, we analyzed the behavioral response of Sitophilus oryzae to five different blends of semiochemicals found and introduced in wheat (Table 1). Briefly, these included no stimuli (negative control), UV-sanitized grain, clean grain from storage (unmanipulated, positive control), as well as grain from storage inoculated with fungal morphotype 1 (M1, identified as the asexual life stage of Aspergillus flavus) and fungal morphotype 2 (M2, identified as the sexual life stage of A. flavus). Fresh samples of semiochemicals were used for each day of testing for each assay. In order to prevent cross-contamination, 300 g of grain (tempered to 15% grain moisture) was initially sanitized using UV for 20 min. This procedure was done before inoculating grain with either morphotype 1 or 2. The 300 g of grain was kept in a sanitized mason jar (8.5 D × 17 cm H). To inoculate grain with the two different morphologies, we scraped an entire isolation from a petri dish into the 300 g of grain. Each isolation was ~1 week old and completely colonized by the given morphotype. After inoculation, each treatment was placed in an environmental chamber (136VL, Percival Instruments, Perry, IA, USA) set at constant conditions (30°C, 65% RH, and 14:10 L:D). This procedure was the same for both morphologies and was done every 2 weeks to ensure fresh treatments for each experimental assay. See the file list for descriptions of each data file.
Resources in this dataset:
- Ethovision Movement Assay. File name: ponce_lizarraga_ethovision_assay_microbial_volatiles_2020.csv. Recommended software: Excel (https://www.microsoft.com/en-us/microsoft-365/excel)
- Olfactometer Round 1 Assay - With Fused Air Permeable Glass. File name: ponce_lizarraga_first_round_olfactometer_fungal_study_2020.csv. Recommended software: Excel
- Olfactometer Round 2 Assay - With Fused Air Permeable Glass Containing Holes. File name: ponce_lizarraga_second_round_olfactometer_fungal_study_2021.csv. Recommended software: Excel
- Small Release-Recapture Assay. File name: ponce_lizarraga_small_release_recapture_assay.csv. Recommended software: Excel
- Large Release-Recapture Assay. File name: ponce_lizarraga_large_release_recapture_assay.csv. Recommended software: Excel
- Headspace Volatile Collection Assay. File name: sandra_headspace_volatiles_2020.csv. Recommended software: Excel
- README file list. File name: file_list_stored_grain_Aspergillus_Sitophilus_oryzae.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV data record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R data record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
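For readers who prefer Python over the provided R tutorials, a rough equivalent of the first tutorial's merge step is sketched below; the join key (SEQN, the standard NHANES respondent identifier) and the file names are assumptions and should be checked against the data dictionary.

```python
import pandas as pd
from functools import reduce

# Hypothetical file names for the cleaned modules; adjust to the actual .csv names.
modules = ["demographics_clean.csv", "dietary_clean.csv", "chemicals_clean.csv"]

frames = [pd.read_csv(name) for name in modules]

# Outer-join the modules on the participant identifier (assumed to be SEQN).
merged = reduce(lambda left, right: left.merge(right, on="SEQN", how="outer"), frames)

print(merged.shape)
```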
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.
What this output is: This is an Excel spreadsheet file containing three sheets with a total of 3,478 Tweets publicly published with the hashtag #DH2016. The archive starts with a Tweet published on Sunday 10 July 2016 at 00:03:41 +0000 and finishes with a Tweet published on Tuesday 12 July 2016 at 23:55:47 +0000. The original collection has been organised into conference days, one sheet per day (GMT and Central European Time included). A breakdown of Tweets per day:
- Sunday 10 July 2016: 179 Tweets
- Monday 11 July 2016: 981 Tweets
- Tuesday 12 July 2016: 2,318 Tweets
Methodology and limitations: The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of the original authors. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.
Other considerations: Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata that users choose freely to use so their content is associated with, directly linked to and categorised under the chosen hashtag.
Though every reason for Tweeters' use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's Privacy and data sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection.Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time.The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods and anyone not wishing to attribute the data to the creator of this output is needless to say free to do their own collection and clean their own data.
This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel dataset, a cleaned and audited version of the raw Excel dataset, and a description of my thought process and the knowledge used during completion of the project. The prompt can be found below:
The raw data that accompanies the prompt can be found below:
Hive Annotation Job Results - Raw Data
^ These are the tools I was given to complete my task. The rest of the work is entirely my own.
To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.
Brendan Kelley April 23, 2021
Hive Data Audit Prompt Results
This paper explains the auditing process for the "Hive Annotation Job Results" data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file "Hive Annotation Job Results - Audited".
Observation
The "Hive Annotation Job Results" data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including column headers. The data includes "file", "object id", and pseudonyms for the five questions that each client was instructed to answer about their respective table: "tabular", "semantic", "definition list", "header row", and "header column". The "file" column includes non-unique numbers (that is, there are multiple instances of the same value in the column) separated by a dash. The "object id" column includes non-unique numbers ranging from 5 to 487539. The columns containing the answers to the five questions include Boolean values - TRUE or FALSE - which depend upon the yes/no worker judgement.
Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
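For comparison, the same two sanity checks (only TRUE/FALSE answers, no missing cells) could be reproduced outside Excel with a short pandas sketch; the file name below is hypothetical.

```python
import pandas as pd

# Hypothetical export of the raw annotation results.
df = pd.read_excel("hive_annotation_job_results_raw.xlsx")

question_cols = ["tabular", "semantic", "definition list", "header row", "header column"]

# Equivalent of the COUNTIF() check: every answer must be a Boolean TRUE/FALSE.
assert df[question_cols].isin([True, False]).all().all()

# Equivalent of the missing-value check: no blank cells anywhere in the sheet.
assert df.notna().all().all()
```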
Assumptions
Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.
Preparation
The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:
- A table that is a definition list should automatically be tabular and also semantic
- Semantic tables should automatically be tabular
- If a table is NOT tabular, then it is definitely not semantic nor a definition list
- A tabular table that has a header row OR header column should definitely be semantic
These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:
For the tabular column:
- If the table is a definition list, it is also tabular
- If the table is semantic, it is also tabular

For the semantic column:
- If the table is a definition list, it is also semantic
- If the table is not tabular, it is not semantic
- If the table is tabular and has either a header row or a header column...

(A code sketch of these checks follows below.)
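One possible way to flag jobs for rerun is to mark every row whose answers violate any of the background facts. The following pandas sketch is an illustration of that idea, not necessarily the exact method used in the audit; the file name is hypothetical and the column names follow the question pseudonyms listed above.

```python
import pandas as pd

# Hypothetical frame of worker answers; one Boolean column per question.
df = pd.read_excel("hive_annotation_job_results_raw.xlsx")

# Translate the background facts into row-level consistency checks.
violates = (
    (df["definition list"] & ~(df["tabular"] & df["semantic"]))    # definition list => tabular and semantic
    | (df["semantic"] & ~df["tabular"])                            # semantic => tabular
    | (~df["tabular"] & (df["semantic"] | df["definition list"]))  # not tabular => not semantic, not definition list
    | (df["tabular"] & (df["header row"] | df["header column"]) & ~df["semantic"])  # tabular + header => semantic
)

# Rows whose answers break the background facts (candidates for rerun).
print(df.loc[violates, ["file", "object id"]])
```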
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Description: This dataset contains detailed measurements collected during two controlled experiments designed to study the dynamics of the composting process using the forced aeration technique. The dataset is divided into two main parts:
Experiment 1: Includes parameters such as temperatures (hot air and compost pile), relative humidity, and air and heat inputs. Experiment 2: Complements the first experiment with oxygen levels in addition to the previously mentioned variables. Both datasets are organized in a chronological format, with records that allow the analysis of trends and correlations among the studied variables.
Purpose: The primary objective of this dataset is to facilitate the study of composting dynamics in high mountain environments using the forced aeration technique. It can be used for:
- Bioprocess modeling.
- Studies on energy optimization in biological and chemical processes.
- Research in environmental biology, process engineering, and clean technologies.

Dataset features:
- Total size: Experiment 1: 4,302 records and 8 variables; Experiment 2: 3,076 records and 9 variables.
- Temporal coverage: records are organized by hour and minute over several days of experimentation.
- Key variables: hour and minute of the record; heater and compost pile temperatures; relative humidity; air and heat inputs; oxygen levels (in Experiment 2); days elapsed since the start of the experiment.
- Available formats: the dataset is available in Excel format (.xlsx), with each experiment documented on a separate sheet.
Access and use: Commercial use of the dataset requires prior authorization. Potential applications: this resource is valuable for researchers in fields such as:
- Environmental engineering and bioprocesses.
- Design and optimization of thermal and environmental control systems.
The 2003 Agriculture Sample Census was designed to meet the data needs of a wide range of users down to district level, including policy makers at local, regional and national levels, rural development agencies, funding institutions, researchers, NGOs, farmer organisations, etc. As a result, the dataset is both larger in sample size and more detailed in scope than previous censuses and surveys. To date this is the most detailed agricultural census carried out in Africa.
The census was carried out in order to:
- Identify structural changes, if any, in the size of farm household holdings, crop and livestock production, and farm input and implement use. It also seeks to determine whether there are any improvements in rural infrastructure and in the level of agricultural household living conditions;
- Provide benchmark data on productivity, production and agricultural practices in relation to policies and interventions promoted by the Ministry of Agriculture and Food Security and other stakeholders;
- Establish baseline data for the measurement of the impact of the high-level objectives of the Agriculture Sector Development Programme (ASDP), the National Strategy for Growth and Reduction of Poverty (NSGRP) and other rural development programmes and projects;
- Obtain benchmark data that will be used to address specific issues such as food security, rural poverty, gender, agro-processing, marketing, service delivery, etc.
Tanzania Mainland and Zanzibar
Large scale, small scale and community farms.
Census/enumeration data [cen]
The Mainland sample consisted of 3,221 villages. These villages were drawn from the National Master Sample (NMS) developed by the National Bureau of Statistics (NBS) to serve as a national framework for the conduct of household based surveys in the country. The National Master Sample was developed from the 2002 Population and Housing Census. The total Mainland sample was 48,315 agricultural households. In Zanzibar a total of 317 enumeration areas (EAs) were selected and 4,755 agriculture households were covered. Nationwide, all regions and districts were sampled with the exception of three urban districts (two from Mainland and one from Zanzibar).
In both Mainland and Zanzibar, a stratified two stage sample was used. The number of villages/EAs selected for the first stage was based on a probability proportional to the number of villages in each district. In the second stage, 15 households were selected from a list of farming households in each selected Village/EA, using systematic random sampling, with the village chairpersons assisting to locate the selected households.
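Purely as an illustrative sketch (not the NBS procedure itself), the second-stage selection of 15 households by systematic random sampling could be expressed as follows; the village listing is hypothetical.

```python
import random

def systematic_sample(household_list, n=15, seed=None):
    """Select n households from a village listing using systematic random sampling."""
    rng = random.Random(seed)
    interval = len(household_list) / n                # sampling interval
    start = rng.uniform(0, interval)                  # random start within the first interval
    picks = [min(int(start + i * interval), len(household_list) - 1) for i in range(n)]
    return [household_list[i] for i in picks]

# Example: a hypothetical village listing of 230 farming households.
village = [f"household_{i}" for i in range(1, 231)]
print(systematic_sample(village, n=15, seed=1))
```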
Face-to-face [f2f]
The census covered agriculture in detail as well as many other aspects of rural development and was conducted using three different questionnaires:
- Small scale questionnaire
- Community level questionnaire
- Large scale farm questionnaire
The small scale farm questionnaire was the main census instrument and it includes questions related to crop and livestock production and practices; population demographics; access to services, resources and infrastructure; and issues of poverty, gender and subsistence versus profit-making production units.
The community level questionnaire was designed to collect village level data such as access and use of common resources, community tree plantation and seasonal farm gate prices.
The large scale farm questionnaire was administered to large farms either privately or corporately managed.
Questionnaire Design
The questionnaires were designed following user meetings to ensure that the questions asked were in line with users' data needs. Several features were incorporated into the design of the questionnaires to increase the accuracy of the data:
- Where feasible, all variables were extensively coded to reduce post-enumeration coding error.
- The definitions for each section were printed on the opposite page so that the enumerator could easily refer to the instructions whilst interviewing the farmer.
- The responses to all questions were placed in boxes printed on the questionnaire, with one box per character. This feature made it possible to use scanning and Intelligent Character Recognition (ICR) technologies for data entry.
- Skip patterns were used to reduce unnecessary and incorrect coding of sections which do not apply to the respondent.
- Each section was clearly numbered, which facilitated the use of skip patterns and provided a reference for data type coding for the programming of CSPro, SPSS and the dissemination applications.
Data processing consisted of the following processes:
- Data entry
- Data structure formatting
- Batch validation
- Tabulation
Data Entry Scanning and ICR data capture technology for the small holder questionnaire were used on the Mainland. This not only increased the speed of data entry, it also increased the accuracy due to the reduction of keystroke errors. Interactive validation routines were incorporated into the ICR software to track errors during the verification process. The scanning operation was so successful that it is highly recommended for adoption in future censuses/surveys. In Zanzibar all data was entered manually using CSPro.
Prior to scanning, all questionnaires underwent a manual cleaning exercise. This involved checking that the questionnaire had a full set of pages, correct identification and good handwriting. A score was given to each questionnaire based on the legibility and the completeness of enumeration. This score will be used to assess the quality of enumeration and supervision in order to select the best field staff for future censuses/surveys.
CSPro was used for data entry of all Large Scale Farm and community based questionnaires due to the relatively small number of questionnaires. It was also used to enter data from the 2,880 small holder questionnaires that were rejected by the ICR extraction application.
Data Structure Formatting A program was developed in visual basic to automatically alter the structure of the output from the scanning/extraction process in order to harmonise it with the manually entered data. The program automatically checked and changed the number of digits for each variable, the record type code, the number of questionnaires in the village, the consistency of the Village ID Code and saved the data of one village in a file named after the village code.
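The original program was written in Visual Basic. Purely to illustrate the kind of harmonisation described (fixed field widths, record checks, one output file per village), a hedged Python sketch is given below; the field names and widths are assumptions.

```python
from collections import defaultdict

# Assumed fixed widths for a few identifier fields (illustrative only).
FIELD_WIDTHS = {"record_type": 2, "village_id": 8, "holding_id": 4}

def harmonise(record):
    """Zero-pad each identifier to a fixed width so scanned and keyed records match."""
    return {field: str(record[field]).zfill(width) for field, width in FIELD_WIDTHS.items()}

def write_by_village(records):
    """Group harmonised records by village ID and save one file per village."""
    by_village = defaultdict(list)
    for rec in records:
        fixed = harmonise(rec)
        by_village[fixed["village_id"]].append(fixed)
    for village_id, recs in by_village.items():
        with open(f"{village_id}.txt", "w") as fh:
            for rec in recs:
                fh.write(" ".join(rec.values()) + "\n")

# Example with two hypothetical scanned records.
write_by_village([
    {"record_type": 1, "village_id": 10203040, "holding_id": 7},
    {"record_type": 1, "village_id": 10203040, "holding_id": 8},
])
```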
Batch Validation A batch validation program was developed in order to identify inconsistencies within a questionnaire. This is in addition to the interactive validation during the ICR extraction process. The procedures varied from simple range checking within each variable to the more complex checking between variables. It took six months to screen, edit and validate the data from the smallholder questionnaires. After the long process of data cleaning, tabulations were prepared based on a pre-designed tabulation plan.
Tabulations Statistical Package for Social Sciences (SPSS) was used to produce the Census tabulations and Microsoft Excel was used to organize the tables and compute additional indicators. Excel was also used to produce charts while ArcView and Freehand were used for the maps.
Analysis and Report Preparation The analysis in this report focuses on regional comparisons, time series and national production estimates. Microsoft Excel was used to produce charts; ArcView and Freehand were used for maps, whereas Microsoft Word was used to compile the report.
Data Quality A great deal of emphasis was placed on data quality throughout the whole exercise from planning, questionnaire design, training, supervision, data entry, validation and cleaning/editing. As a result of this, it is believed that the census is highly accurate and representative of what was experienced at field level during the Census year. With very few exceptions, the variables in the questionnaire are within the norms for Tanzania and they follow expected time series trends when compared to historical data. Standard Errors and Coefficients of Variation for the main variables are presented in the Technical Report (Volume I).
Sampling errors are presented on pages 21-22 of the Technical Report for the Agriculture Sample Census Survey 2002-2003.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The attached datasets accompany the publication titled "Impact of material properties in determining quaternary ammonium compound adsorption and wipe product efficacy against biofilms", published in the Journal of Hospital Infection in 2022 (https://doi.org/10.1016/j.jhin.2022.03.013). The dataset consists of a summary Excel spreadsheet of the data gathered across 2020 and 2021, as well as .txt files containing the results of the BET surface analysis. Each tab of the spreadsheet is labelled with the location where the data is used in the publication.
This layer visualizes over 60,000 commercial flight paths. The data was obtained from openflights.org and was last updated in June 2014. The site states, "The third-party that OpenFlights uses for route data ceased providing updates in June 2014. The current data is of historical value only. As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67,663 routes between 3,321 airports on 548 airlines spanning the globe. Creating and maintaining this database has required and continues to require an immense amount of work. We need your support to keep this database up-to-date." To donate, visit the site and click the PayPal link. Routes were created using the XY-to-Line tool in ArcGIS Pro, inspired by Kenneth Field's work, and following a modified methodology from Michael Markieta (www.spatialanalysis.ca/2011/global-connectivity-mapping-out-flight-routes). Some cleanup was required in the original data, including adding missing location data for several airports and some missing IATA codes. Before performing the point-to-line conversion, the key to preserving attributes in the original data is a combination of the INDEX and MATCH functions in Microsoft Excel. Example function: =INDEX(Airlines!$B$2:$B$6200,MATCH(Routes!$A2,Airlines!$D$2:Airlines!$D$6200,0))
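For reference, the same attribute lookup can be reproduced outside Excel with a pandas merge; the sheet and column names below are assumptions based on the description above, not the actual workbook layout.

```python
import pandas as pd

# Hypothetical workbook layout: one sheet of routes, one sheet of airline attributes.
routes = pd.read_excel("openflights_routes.xlsx", sheet_name="Routes")
airlines = pd.read_excel("openflights_routes.xlsx", sheet_name="Airlines")

# Equivalent of INDEX/MATCH: look up each route's airline name by airline code.
routes_with_names = routes.merge(
    airlines[["IATA", "Name"]], left_on="Airline", right_on="IATA", how="left"
)

print(routes_with_names.head())
```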
The dataset was generated from a set of Excel spreadsheets extracted from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The data in this second part of the series contain information on applications to UCT made between January 2015 and September 2019.
In the original form received by DataFirst the data were ill suited to research purposes. The series represents an attempt at cleaning and organizing the data into a more tractable format.
Individuals, applications
All applications to study at the University of Cape Town
Administrative records data
Other [oth]
In order to lessen computation times the main applications file was split by year - this part contains the years 2014-2019. Note however that the other 3 files released with the application file (that can be merged into it for additional detail) did not need to be split. As such, the four files can be used to produce a series for 2014-2019 and are labelled as such, even though the person, secondary schooling and tertiary education files all span a longer time period.
Here is additional information about the files:
Further information on the processing of the original data files is summarised in a document entitled "Notes on preparing the UCT Student Admissions Data" accompanying the data.
Input and output data of the modelling work for the paper "Effects of a Delayed Expansion of Interconnector Capacities in a High RES-E European Electricity System".
Considered scenario years: 2030, 2040 and 2050. The profiles are based on the historical year 2016. Two scenarios are included: lower connectivity and high connectivity. The data cover the ENTSO-E member countries except Iceland and Cyprus and are given in country-specific resolution.
Input:
- Demand as hourly profile in MWh
- Variable RES-E as hourly profile in MWh
- Power plant fleet as capacities in MW
- NTCs as capacities in MW
Output:
- CO2 emissions as annual data in Mt
- Variable electricity generation costs as annual data in MEur
- Variable electricity generation costs per generation as annual data in Euro/MWh
- Electricity generation as annual data in TWh
- Electricity export as annual data in TWh
- Electricity import as annual data in TWh
- Transit flows as annual data in TWh
The sources are described in the corresponding paper under the following link: https://www.mdpi.com/1996-1073/12/16/3098
References:
- Repenning, J.; Hermann, H.; Emele, L.; Jörß, W.; Blanck, R.; Loreck, C.; Böttcher, H.; Ludig, S.; Dehoust, G.; Matthes, Felix Chr. et al. Klimaschutzszenario 2050. 2. Modellierungsrunde, Berlin, 2015. Available online: https://www.oeko.de/oekodoc/2451/2015-608-de.pdf
- Andersky, T.; Sanchis, G.; Betraoui, B. e-Highway 2050 - Database per country. Excel sheet, 2016.
- ENTSO-E. Transparency Platform. Available online: https://transparency.entsoe.eu/
- ENTSO-E. TYNDP 2018 Scenario Report. Data set, Brussels, 2018. Available online: https://tyndp.entsoe.eu/maps-data/
We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

The proposed network is built starting from two distinct data sources:
- the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
- the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because they are not part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

# Data cleaning and enrichment through data integration: networking the Italian academia
https://doi.org/10.5061/dryad.wpzgmsbwj
Manuscript published in Scientific Data with DOI .
This repository contains two main data files:
- edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
- Coauthorship_Network_AGG.graphml, the full network in GraphML format.

along with several supplementary data files, listed below, useful only to build the network (i.e., for reproducibility only):
- University-City-match.xlsx, an Excel file that maps the name of a university against the city where its headquarters is located;
- Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.

The `Coauthorship_Networ...
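A minimal sketch of loading the released network with NetworkX (assuming the GraphML file has been downloaded to the working directory):

```python
import networkx as nx

# Load the full co-authorship network released in GraphML format.
G = nx.read_graphml("Coauthorship_Network_AGG.graphml")

print(G.number_of_nodes(), G.number_of_edges())  # expected: 38,220 nodes, 507,050 edges

# Node and edge attributes (gender, research fields, temporal data, ...) are kept as
# GraphML properties and can be inspected directly.
some_node = next(iter(G.nodes))
print(G.nodes[some_node])
```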
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format
TL;DR: this is a tidy and friendly version of a subset of the PECD 2021.3 data by ENTSO-E: hourly capacity factors for onshore wind, offshore wind and solar PV, hourly electricity demand, weekly inflow for reservoir and pumping, and daily generation for run-of-river. All the data are provided for >30 climatic years (1982-2019 for wind and solar, 1982-2016 for demand, 1982-2017 for hydropower) and at national and sub-national (>140 zones) level.
ENTSO-E has released with the latest European Resource Adequacy Assessment (ERAA 2021) all the inputs used in the study.
Those inputs include:
- Demand dataset: https://eepublicdownloads.azureedge.net/clean-documents/sdc-documents/ERAA/Demand%20Dataset.7z
- Climate data: https://eepublicdownloads.entsoe.eu/clean-documents/sdc-documents/ERAA/Climate%20Data.7z
The data files and the methodology are available on the official webpage.
As done for the previous releases (see https://zenodo.org/record/3702418#.YbmhR23MKMo and https://zenodo.org/record/3985078#.Ybmhem3MKMo), the original data - stored in large Excel spreadsheets - have been tidied and formatted in open and friendly formats (CSV for the small tables and Parquet for the large files).
Furthermore, we have carried out a simple country-level aggregation of the original data, which instead uses >140 zones.
DISCLAIMER: the content of this dataset has been created with the greatest possible care. However, we invite to use the original data for critical applications and studies.
Description
This dataset includes the following files:
- capacities-national-estimates.csv: installed capacity in MW per zone, technology and the two scenarios (2025 and 2030). The files include also the total capacity for each technology per country (sum of all the zones within a country)
- PECD-2021.3-wide-LFSolarPV-2025 and PECD-2021.3-wide-LFSolarPV-2030: tables in Parquet format storing, in each row, the capacity factor for solar PV for an hour of the year and all the climatic years (1982-2019) for a specific zone. The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-wide-Onshore-2025 and PECD-2021.3-wide-Onshore-2030: same as above but for wind onshore
- PECD-2021.3-wide-Offshore-2025 and PECD-2021.3-wide-Offshore-2030: same as above but for wind offshore
- PECD-wide-demand_national_estimates-2025 and PECD-wide-demand_national_estimates-2030: hourly electricity demand for all the climatic years for a specific zone. The two files contain the load for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-country-LFSolarPV-2025 and PECD-2021.3-country-LFSolarPV-2030: tables in Parquet format storing in each row the capacity factor for country/climatic year and hour of the year. The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
- PECD-2021.3-country-Onshore-2025 and PECD-2021.3-country-Onshore-2030: same as above but for wind onshore
- PECD-2021.3-country-Offshore-2025 and PECD-2021.3-country-Offshore-2030: same as above but for wind offshore
- PECD-country-demand_national_estimates-2025 and PECD-country-demand_national_estimates-2030: same as above but for electricity demand
- PECD_EERA2021_reservoir_pumping.zip: archive with four files per each scenario: 1. table.csv with generation and storage capacities per zone/technology, 2. zone weekly inflow (GWh), 3. table.csv with generation and storage per country/technology and 4. country weekly inflow (GWh)
- PECD_EERA2021_ROR.zip: as for the previous file but the inflow is daily
- plots.zip: archive with 182 png figures with the weekly climatology for all the variables (daily for the electricity demand)
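As a minimal usage sketch (the exact file name, extension and column layout are assumptions - check the archive before relying on them):

```python
import pandas as pd

# Assumed file name; each row holds one zone/hour with one column per climatic year.
cf = pd.read_parquet("PECD-2021.3-wide-LFSolarPV-2025.parquet")

print(cf.head())

# Example: mean capacity factor per zone across all hours and climatic years,
# assuming a 'zone' identifier column and numeric climatic-year columns.
year_cols = [c for c in cf.columns if c not in ("zone", "hour")]
print(cf.groupby("zone")[year_cols].mean().mean(axis=1))
```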
Note
I would like to thank Laurens Stoop for sharing the onshore wind data for the scenario 2030, that was corrupted in the original archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes open-source projects written in the C# programming language, annotated for the presence of the Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:
Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.
The dataset contains two Excel datasheets:
The columns in the datasheet represent:
To help guide their reasoning for evaluating the presence and severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate Excel datasheets:
The columns of these two datasheets are: