CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
Sample survey data [ssd]
The sample size was set to 8,000 households, with a fixed number of 25 households to be selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated in proportion to the size of each stratum (stratification by geo_1 and urban/rural). In the second stage, 25 households were randomly selected within each selected enumeration area. The R script used to draw the sample is provided as an external resource.
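The sampling script itself is distributed with the dataset rather than reproduced here; the following is only a rough R sketch of the two-stage draw described above, assuming a household frame with hypothetical columns ea_id, geo_1 and urban_rural.

library(dplyr)

set.seed(2024)
n_households <- 8000
hh_per_ea    <- 25
n_ea_total   <- n_households / hh_per_ea   # 320 enumeration areas in total

# 'hh_frame' is assumed to hold one row per household in the full population.
strata <- hh_frame %>%
  count(geo_1, urban_rural, name = "n_hh") %>%
  mutate(n_ea = round(n_ea_total * n_hh / sum(n_hh)))   # allocation proportional to stratum size

sample_hh <- hh_frame %>%
  inner_join(strata, by = c("geo_1", "urban_rural")) %>%
  group_by(geo_1, urban_rural) %>%
  filter(ea_id %in% sample(unique(ea_id), first(n_ea))) %>%  # stage 1: select EAs per stratum
  group_by(ea_id) %>%
  slice_sample(n = hh_per_ea) %>%                            # stage 2: 25 households per EA
  ungroup()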
other
The dataset is a synthetic dataset. Although the variables it contains are typically collected through sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, these data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all the inconsistencies for all affected variables used to curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
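As a flavour of what the starter code covers, joining the curated modules largely reduces to merging on the NHANES participant identifier (SEQN); the file names below are placeholders, and the authors' own version is in "example_0 - merge_datasets_together.Rmd".

library(readr)
library(dplyr)

# Placeholder file names for the cleaned modules described above.
demographics <- read_csv("demographics_clean.csv")
chemicals    <- read_csv("chemicals_clean.csv")
mortality    <- read_csv("mortality_clean.csv")

# SEQN is the NHANES respondent sequence number shared across modules.
merged <- demographics %>%
  left_join(chemicals, by = "SEQN") %>%
  left_join(mortality, by = "SEQN")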
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
These are the materials developed for the Mo(Wa)²TER Data Science workshop, which is designed for upper-level and graduate students in environmental engineering and for industry professionals in the water and wastewater treatment (W/WWT) fields. Working through this material will improve a learner's data analysis and programming skills with the free R language, focusing exclusively on problems arising in W/WWT. Training in basic R coding, data cleaning, visualization, data analysis, statistical modeling, and machine learning is provided. Real W/WWT examples and exercises are given with each topic to strengthen and deepen comprehension. These materials aim to equip students with the skills to handle data science challenges in their future careers. Materials were developed over three offerings of this workshop in 2021, 2022, and 2023. At the time of publication, all code runs, but we provide no guarantees on future versions of R or the packages used in this workshop.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in that text. The CSV file of the Word-Category RIG Matrix in the published archive includes two additional columns: the sum of RIGs over categories and the maximum of RIGs over categories (the last two columns of the matrix). The file 'Word-Category RIG Matrix.csv' therefore contains a total of 254 columns. This matrix was created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by information gains from word to categories.
The LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs over categories; that is, words are ranked by their informativeness in the scientific corpus LSC. The meaningfulness of a word is therefore evaluated by its average informativeness across the categories. We decided to include the 5,000 most informative words in the scientific thesaurus.
Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts containing the word in the corresponding category. Note that texts in a corpus do not necessarily belong to a single category, as they often correspond to multidisciplinary studies, especially in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using a binary calculation of frequencies, we record the presence of a word in a category and create a vector of frequencies for each word, where the dimensions are the categories in the corpus. The collection of vectors for all words and categories in the entire corpus can be shown as a table, where each entry corresponds to a pair (word, category). This table was built for the LScDC with 252 WoS categories and is presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category, determined by counting the number of LSC texts in that category that contain the word.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word appears in the text and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category gained from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the information gain and makes information gains comparable across categories. The calculations of entropy, Information Gain and Relative Information Gain can be found in the README file in the published archive.
Given a word, we created a vector with one component per category, so each word is represented as a vector of relative information gains whose dimension is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, and a column vector contains the RIGs of all words for an individual category. For any category, words can therefore be ordered by their RIGs from the most to the least informative for that category. Words can also be ordered by two global criteria: the sum and the maximum of RIGs over categories; the top n words in such a list can be considered the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in the 252 categories were calculated and the word vectors formed; we then assembled the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs over categories were calculated and appended as the last two columns of the matrix. The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.
Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs over categories, and the top 5,000 words are included in the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness across the categories, and the resulting list is treated as a 'thesaurus' for science. The LScT, together with the sum values, is provided as a CSV file in the published archive.
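As an informal illustration of the quantity defined above (the exact calculations are documented in the README of the archive), the relative information gain for one (word, category) pair can be sketched in R as follows, assuming two logical vectors defined over the LSC texts.

entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

relative_information_gain <- function(word_in_text, text_in_cat) {
  h_cat <- entropy(prop.table(table(text_in_cat)))     # H(category)
  # conditional entropy H(category | word), weighted by P(word present/absent)
  h_cond <- sum(prop.table(table(word_in_text)) *
                sapply(split(text_in_cat, word_in_text),
                       function(x) entropy(prop.table(table(x)))))
  if (h_cat == 0) return(0)
  (h_cat - h_cond) / h_cat                             # RIG, normalised to [0, 1]
}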
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: a 103,998 by 254 matrix where the columns are the 252 WoS categories plus the sum (S) and maximum (M) of RIGs over categories (last two columns), and the rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: a 103,998 by 252 matrix where the columns are the 252 WoS categories and the rows are words of the LScDC. Each entry is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
3) LScT.csv: the list of LScT words with their sum (S) values.
4) Text_No_in_Cat.csv: the number of texts in each category.
5) Categories_in_Documents.csv: the list of WoS categories for each document of the LSC.
6) README.txt: description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf: same as 6, in PDF format.
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
This dataset provides the R script and example data used to estimate snow depth from ultrasonic distance measurements collected by low-cost Geoprecision-Maxbotic devices, designed for autonomous operation in polar conditions. The dataset includes:
The full R script used for data preprocessing, filtering, and snow depth calculation, with all parameters fully documented.
Example raw and clean data files, ready to use, acquired from a sensor installed in the South Shetland Islands (Antarctica) between 2023 and 2024.
The processing pipeline includes outlier removal (Hampel filter), gap interpolation, moving-average smoothing, reference level estimation, and conversion to snow depth in millimetres and centimetres. Derived snow depths are exported alongside summary statistics.
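A minimal sketch of that chain in R is shown below, assuming a data frame 'raw' with a POSIXct 'timestamp' and an ultrasonic 'distance_mm' column, and assuming the pracma and zoo packages; the distributed script documents the actual parameter choices.

library(pracma)   # hampel()
library(zoo)      # na.approx(), rollmean()

d <- raw$distance_mm
d <- hampel(d, k = 5, t0 = 3)$y                        # outlier removal (Hampel filter)
d <- na.approx(d, na.rm = FALSE)                       # interpolate short gaps
d <- rollmean(d, k = 7, fill = NA, align = "center")   # moving-average smoothing

# Reference level: sensor-to-ground distance estimated from an assumed snow-free window.
snow_free <- raw$timestamp < as.POSIXct("2023-03-01", tz = "UTC")
ref_mm    <- median(d[snow_free], na.rm = TRUE)

snow_depth_mm <- ref_mm - d          # snow depth in millimetres
snow_depth_cm <- snow_depth_mm / 10  # and in centimetres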
This code was developed as part of a research project evaluating the performance and limitations of low-cost ultrasonic snow depth measurement systems in Antarctic permafrost monitoring networks. Although the script was designed for the specific configuration of Geoprecision dataloggers and Maxbotic MB7574-SCXL-Maxsonar-WRST7 sensors, it can be easily adapted to other distance-measuring devices providing similar output formats.
All files are provided in open formats (CSV, and R) to facilitate reuse and reproducibility. Users are encouraged to modify the script to fit their own instrumentation and field conditions.
The high-frequency phone survey of refugees monitors the economic and social impact of, and responses to, the COVID-19 pandemic among refugees and nationals by calling a sample of households every four weeks. The main objective is to inform timely and adequate policy and program responses. Since the outbreak of the COVID-19 pandemic in Ethiopia, two rounds of data collection with refugees were completed between September and November 2020. The first round of the joint national and refugee HFPS was implemented between 24 September and 17 October 2020, and the second round between 20 October and 20 November 2020.
Household
Sample survey data [ssd]
The sample was drawn using a simple random sample without replacement. Expecting a high non-response rate based on experience from the HFPS-HH, we drew a stratified sample of 3,300 refugee households for the first round. More details on sampling methodology are provided in the Survey Methodology Document available for download as Related Materials.
Computer Assisted Telephone Interview [cati]
The Ethiopia COVID-19 High Frequency Phone Survey of Refugees questionnaire consists of the following sections:
A more detailed description of the questionnaire is provided in Table 1 of the Survey Methodology Document that is provided as Related Materials. The Round 1 and 2 questionnaires are available for download.
DATA CLEANING
At the end of data collection, the raw dataset was cleaned by the research team. This included formatting and correcting results based on monitoring issues, enumerator feedback and survey changes. The data cleaning carried out is detailed below.
Variable naming and labeling:
• Variable names were changed to reflect the lowercase question name in the paper survey copy, and a word or two related to the question.
• Variables were labeled with longer descriptions of their contents, and the full question text was stored in Notes for each variable.
• "Other, specify" variables were named similarly to their related question, with "_other" appended to the name.
• Value labels were assigned where relevant, with options shown in English for all variables, unless preloaded from the roster in Amharic.
Variable formatting:
• Variables were formatted as their object type (string, integer, decimal, time, date, or datetime).
• Multi-select variables were saved both in space-separated single-variables and as multiple binary variables showing the yes/no value of each possible response.
• Time and date variables were stored as POSIX timestamp values and formatted to show Gregorian dates.
• Location information was left in separate ID and Name variables, following the format of the incoming roster. IDs were formatted to include only the variable-level digits, and not the higher-level prefixes (2-3 digits only).
• Only consented surveys were kept in the dataset, and all personal information and internal survey variables were dropped from the clean dataset.
• Roster data are separated from the main dataset and kept in long form, but can be merged on the key variable (the key can also be used to merge with the raw data).
• The variables were arranged in the same order as the paper instrument, with observations arranged according to their submission time.
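For reference, merging the long-form roster back onto the household-level file on the key variable described above might look like this in R (file and variable names here are placeholders, not the distributed names).

library(haven)   # read_dta()
library(dplyr)

main   <- read_dta("r1_household_clean.dta")   # placeholder file names
roster <- read_dta("r1_roster_clean.dta")

# 'key' is the unique submission identifier mentioned above; it also links
# the clean data back to the raw data.
merged <- roster %>%
  left_join(main, by = "key")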
Backcheck data review: Results of the backcheck survey are compared against the originally captured survey results using the bcstats command in Stata. This command delivers a comparison of variables and identifies any discrepancies. Any discrepancies identified are then examined individually to determine if they are within reason.
The following data quality checks were completed:
• Daily SurveyCTO monitoring: This included outlier checks, skipped questions, a review of "Other, specify" and other text responses, and enumerator comments. Enumerator comments were used to suggest new response options or to highlight situations where existing options should be used instead. Monitoring also included a review of variable relationship logic checks and checks of the logic of answers. Finally, outliers in phone variables such as survey duration or the percentage of time audio was at a conversational level were monitored. A survey duration of close to 15 minutes and a conversation-level audio percentage of around 40% were considered normal.
• Dashboard review: This included monitoring individual enumerator performance, such as the number of calls logged, duration of calls, percentage of calls responded to and percentage of non-consents. Non-consent reason rates and attempts per household were monitored as well. Duration analysis using R was used to monitor each module's duration and estimate the time required for subsequent rounds. The dashboard was also used to track overall survey completion and preview the results of key questions.
• Daily Data Team reporting: The Field Supervisors and the Data Manager reported daily feedback on call progress, enumerator feedback on the survey, and any suggestions to improve the instrument, such as adding options to multiple choice questions or adjusting translations.
• Audio audits: Audio recordings were captured during the consent portion of the interview for all completed interviews, for the enumerators' side of the conversation only. The recordings were reviewed for any surveys flagged by enumerators as having data quality concerns and for an additional random sample of 2% of respondents. A range of lengths was selected to observe edge cases. Most consent readings took around one minute, with some longer recordings due to questions on the survey or holding for the respondent. All reviewed audio recordings were completed satisfactorily.
• Back-check survey: Field Supervisors made back-check calls to a random sample of 5% of the households that completed a survey in Round 1. Field Supervisors called these households and administered a short survey, including (i) identifying the same respondent; (ii) determining the respondent's position within the household; (iii) confirming that a member of the data collection team had completed the interview; and (iv) a few questions from the original survey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each zipped folder contains results files from reanalysis of public data in our publication, "mirrorCheck: an R package facilitating informed use of DESeq2’s lfcShrink() function for differential gene expression analysis of clinical samples" (see also the Collection description).
These files were produced by rendering the Quarto documents provided as supplementary data with the publication (one per dataset). The Quarto code for the three main analyses (COVID, BRCA and cell line datasets) performed differential gene expression (DGE) analysis using both DESeq2 with lfcShrink(), via our R package mirrorCheck, and edgeR. Each zipped folder here contains two folders, one for each DGE analysis. Since DESeq2 was run on data without prior data cleaning, with prefiltering, or after Surrogate Variable Analysis, the 'mirrorCheck output' folders themselves contain three sub-folders titled 'DESeq_noclean', 'DESeq_prefilt' and 'DESeq_sva'. The COVID dataset also has a folder with results from Gene Set Enrichment Analysis. Finally, the fourth folder contains results from a tutorial/vignette-style supplementary file using the Bioconductor "parathyroidSE" dataset. This analysis only utilised DESeq2, with both data cleaning methods and testing two different design formulae, resulting in five sub-folders in the zipped folder.
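For orientation, the DESeq2 step that mirrorCheck wraps follows the standard lfcShrink() workflow sketched below; this is generic DESeq2 usage with placeholder object and coefficient names, not the mirrorCheck API itself.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts_matrix,   # gene x sample count matrix
                              colData   = sample_info,     # sample metadata
                              design    = ~ condition)
dds <- DESeq(dds)

res_unshrunk <- results(dds, name = "condition_treated_vs_control")
res_shrunk   <- lfcShrink(dds, coef = "condition_treated_vs_control",
                          type = "apeglm")   # shrinks noisy log2 fold changes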
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
[Version 2] A further cleaning is applied in Data Processing to the LSC abstracts of Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1
Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on quantifying the meaning of research texts and to make it available for use in Natural Language Processing projects.
The LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: the list of authors of the paper
2. Title: the title of the paper
3. Abstract: the abstract of the paper
4. Categories: one or more categories from the list of categories [2]; the full list of categories is presented in the file 'List_of_Categories.txt'
5. Research Areas: one or more research areas from the list of research areas [3]; the full list of research areas is presented in the file 'List_of_Research_Areas.txt'
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]
The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in the LSC is 1,673,350.
Data Processing
Step 1: Downloading the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.
Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents were imported into R.
Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.
Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. Such concatenated words were detected and identified by sampling medicine-related publications with human intervention, and each detected word was split into two words; for instance, 'ConclusionHigher' was split into 'Conclusion' and 'Higher'. The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy
Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In the LSC, we decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid length effects in the analysis.
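A rough R sketch of Steps 4 and 5, using an abbreviated headings vector and an assumed data frame 'abstracts' with an 'abstract' column, is given below; the corpus was actually processed with the authors' own scripts.

headings <- c("Background", "Methods", "Method", "Objectives", "Objective",
              "Results", "Result", "Conclusions", "Conclusion")   # abbreviated list

split_headings <- function(text) {
  for (h in headings) {
    # a heading fused onto a following capitalised word, e.g. "ConclusionHigher"
    text <- gsub(paste0("\\b(", h, ")([A-Z][a-z])"), "\\1 \\2", text)
  }
  text
}

abstracts$abstract <- split_headings(abstracts$abstract)

# Step 5: keep abstracts of 30-500 words
word_count <- lengths(strsplit(abstracts$abstract, "\\s+"))
abstracts  <- abstracts[word_count >= 30 & word_count <= 500, ]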
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from the LSC Abstracts of Version 1
Publications can include, below the abstract text, a footer containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name added by conferences and journals. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.
Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step left some abstracts below our minimum length criterion (30 words); 474 texts were removed.
Step 8: Saving the Dataset in CSV Format
Documents are saved in 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
Protected areas are one of the most widespread and accepted conservation interventions, yet their population trends are rarely compared to regional trends to gain insight into their effectiveness. Here, we leverage two long-term community science datasets to demonstrate mixed effects of protected areas on long-term bird population trends. We analyzed 31 years of bird transect data recorded by community volunteers across all major habitats of Stanford University’s Jasper Ridge Biological Preserve to determine the population trends for a sample of 66 species. We found that nearly a third of species experienced long-term declines, and on average, all species declined by 12%. Further, we averaged species trends by conservation status and key life history attributes to identify correlates and possible drivers of these trends. Observed increases in some cavity-nesters and declines of scrub-associated species suggest that long-term fire suppression may be a key driver, reshaping bird communit...,
From 1989 to 2020, volunteer observers conducted monthly surveys of six sectors within Stanford University's Jasper Ridge Biological Preserve (JRBP). Each survey consisted of a trail-based transect in which a group of observers walked the trail in the morning and counted all birds detected over roughly 3 hours. Observers recorded the number of each species seen or heard along the route, regardless of the distance to the bird. Over 31 years of surveys, 192 observers conducted 2,055 transects and recorded a total of 473,401 observations of 184 species (91% of JRBP’s documented avian richness). We used these data to estimate long-term avian population trends at JRBP. Prior to analysis, we performed extensive data cleaning, including the standardization of species names and observer identity. Unlikely species without notes or supporting information were removed from the analysis. All transects with fewer than seven species (n = 30) were considered incidental and removed. These transect...
Data and model code from: Mixed population trends inside a California protected area: evidence from long-term community science monitoring
Here, we provide the R code used to model the abundance of each species in the Jasper Ridge Biological Preserve. We have also provided a spreadsheet with each species' life history traits, taxonomy, annual trends in the preserve, and annual trends in the surrounding region (BCR 32) from the North American Breeding Bird Survey. Finally, we have attached R code that analyzes the trends for various life history traits and taxonomic families, compares trends within the protected area and in the surrounding region, and produces Figures 2, 4, and 5 in the main manuscript and all supplementary material figures.
Description of the data and file structure
The JRBP_Transect_Data_Species.R file provides the code required to create a generalized linear mixed model for each species in R-INLA and extract the percent change in ab...
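As a purely illustrative sketch of the kind of model that script fits (the actual formula, priors and variable names are defined in JRBP_Transect_Data_Species.R), a species-level count model in R-INLA might be specified as:

library(INLA)

fit <- inla(count ~ year_std +                    # long-term trend (hypothetical variable names)
              f(observer_id, model = "iid") +     # observer random effect
              f(month, model = "iid"),            # seasonal random effect
            family = "nbinomial",                 # overdispersed counts
            data   = transects,
            control.compute = list(dic = TRUE))

summary(fit)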
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned)
📝 Dataset Overview
This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data, including typos, formatting issues, missing values, and inconsistencies.
It is ideal for practicing:
Data cleaning
Exploratory Data Analysis (EDA)
Marketing analytics
Campaign performance insights
Dashboard creation using tools like Excel, Python, or Power BI
📁 Columns in the Dataset
Ad_ID: Unique ID of the ad campaign
Campaign_Name: Name of the campaign (with typos and variations)
Clicks: Number of clicks received
Impressions: Number of ad impressions
Cost: Total cost of the ad (in ₹ or $ format, with missing values)
Leads: Number of leads generated
Conversions: Number of actual conversions (signups, sales, etc.)
Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
Sale_Amount: Revenue generated from the conversions
Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
Location: City where the ad was served (includes spelling/case variations)
Device: Device type (Mobile, Desktop, Tablet, with mixed casing)
Keyword: Keyword that triggered the ad (with typos)
⚠️ Data Quality Issues (Intentional)
This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:
Inconsistent date formats
Spelling errors (e.g., "analitics", "anaytics")
Duplicate rows
Mixed units and symbols in cost/revenue columns
Missing values
Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
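As a starting point, a first cleaning pass on these issues could be written in R roughly as follows (column names as listed above; the parsing choices are illustrative only).

library(dplyr)
library(readr)      # parse_number()
library(stringr)
library(lubridate)  # parse_date_time()

ads_clean <- ads_raw %>%
  distinct() %>%                                                       # drop duplicate rows
  mutate(
    Ad_Date     = parse_date_time(Ad_Date, orders = c("Ymd", "dmy")),  # mixed date formats
    Device      = str_to_title(str_trim(Device)),                      # "MOBILE" -> "Mobile"
    Location    = str_to_title(str_trim(Location)),
    Cost        = parse_number(Cost),                                  # strip currency symbols
    Sale_Amount = parse_number(Sale_Amount)
  )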
🎯 Use Cases
Data cleaning exercises in Python (Pandas), R, Excel
Data preprocessing for machine learning
Campaign performance analysis
Conversion optimization tracking
Building dashboards in Power BI, Tableau, or Looker
💡 Sample Analysis Ideas
Track campaign cost vs. return (ROI)
Analyze click-through rates (CTR) by device or location
Clean and standardize campaign names and keywords
Investigate keyword performance vs. conversions
🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For any questions about the code please contact Christine at c.e.beardsworth@gmail.com
To use any data contained in this repository contact Joah at j.r.madden@exeter.ac.uk for permission.
Code:
The data for the following /code sections are found in /data, apart from the full filtered movement dataset (used in "1_cleaning and run HMM.R"), which is too large to be included here. Figures created using this code are stored in /figs.
1_cleaning and run HMM.R: Some extra cleaning of the pheasant movement data and subsetting so that only birds with 7 days of data (of at least 6 hours) are present in the dataset. The R code in the opening section is an example only, as the data are not available in this repository; they can be retrieved from Christine or Joah. However, the code shows how the subsequent files included here were created and how we ran hidden Markov models (a generic sketch of this kind of model fit appears after this list).
2_Choose HMM and describe states and HRs.R: This code shows how we chose the best HMM to describe state transitions. We also use the same dataset to create HRs that are shown in Fig 2 of the manuscript.
3_Statistics.R: This code performs the statistics used in the manuscript as well as some figures.
4_ESM.R: This code produces the figures found in the supplementary material.
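For readers unfamiliar with movement HMMs, a generic two-state fit is sketched below, assuming the moveHMM package and hypothetical object names; the models actually used in the manuscript are specified in "1_cleaning and run HMM.R" and "2_Choose HMM and describe states and HRs.R".

library(moveHMM)

trk <- prepData(pheasant_locs, type = "UTM",
                coordNames = c("x", "y"))        # computes step lengths and turning angles

fit2 <- fitHMM(data      = trk,
               nbStates  = 2,
               stepPar0  = c(10, 100, 10, 100),  # initial step mean/sd for each state
               anglePar0 = c(pi, 0, 0.3, 3))     # initial angle mean/concentration

plotStates(fit2)                                 # decoded behavioural states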
License: https://www.bco-dmo.org/dataset/651880/license
Dissolved lead data collected from the R/V Pourquoi pas (GEOVIDE) in the North Atlantic and Labrador Sea (section GA01) during 2014.
Access formats: .htmlTable, .csv, .json, .mat, .nc, .tsv, .esriCsv, .geoJson
Acquisition description: Sample storage bottle lids and threads were soaked overnight in 2N reagent grade HCl, then filled with 1N reagent grade HCl and heated in an oven at 60 degrees Celsius overnight, inverted, heated for a second day, and rinsed 5X with pure distilled water. The bottles were then filled with trace metal clean dilute HCl (0.01N HCl) and again heated in the oven for one day on either end. Clean sample bottles were emptied and double-bagged prior to rinsing and filling with sample.
As stated in the cruise report, trace metal clean seawater samples were collected using the French GEOTRACES clean rosette (General Oceanics Inc. Model 1018 Intelligent Rosette), equipped with twenty-two new 12L GO-FLO bottles (two bottles were leaking and were never deployed during the cruise). The 22 new GO-FLO bottles were initially cleaned in the LEMAR laboratory following the GEOTRACES procedures (Cutter and Bruland, 2012). The rosette was deployed on a 6mm Kevlar cable with a dedicated custom-designed clean winch. Immediately after recovery, GO-FLO bottles were individually covered at each end with plastic bags to minimize contamination. They were then transferred into a clean container (class-100) for sampling. On each trace metal cast, nutrient and/or salinity samples were taken to check for potential leakage of the GO-FLO bottles. Prior to filtration, GO-FLO bottles were mixed manually three times. GO-FLO bottles were pressurized to less than 8 psi with 0.2 um filtered N2 (Air Liquide). For Stations 1, 11, 15, 17, 19, 21, 25, 26, 29, 32, GO-FLO spigots were fitted with an acid-cleaned piece of Bev-a-Line tubing that fed into a 0.2 um capsule filter (SARTOBRAN 300, Sartorius). For all other stations (13, 34, 36, 38, 40, 42, 44, 49, 60, 64, 68, 69, 71, 77), seawater was filtered directly through paired filters (Pall Gelman Supor 0.45 um polyethersulfone, and Millipore mixed ester cellulose MF 5 um) mounted in Swinnex polypropylene filter holders, following the Planquette and Sherrell (2012) method. Filters were cleaned following the protocol described in Planquette and Sherrell (2012) and kept in acid-cleaned 1L LDPE bottles (Nalgene) filled with ultrapure water (Milli-Q, 18.2 megaohm/cm) until use. Subsamples were taken into acid-cleaned (see above) Nalgene HDPE bottles after a triple rinse with the sample. All samples were acidified back in the Boyle laboratory at 2 mL per liter of seawater (pH 2) with trace metal clean 6N HCl.
On this cruise, only the particulate samples were assigned GEOTRACES numbers. In this dataset, the dissolved Pb samples collected at the same depth (sometimes on a different cast) as the particulate samples have been assigned identifiers as "SAMPNO", which corresponds to the particulate GEOTRACES number. In cases where there were no corresponding particulate samples, a number was generated as "PI_SAMPNO".
Upon examining the data, we observed that the sample taken from rosette position 1 (usually the near-bottom sample) was always higher in [Pb] than the sample taken immediately above it, and that the excess decreased as the cruise proceeded. The Pb isotope ratios of these samples were higher than those of the comparison bottles as well. A similar situation was seen for the samples taken from rosette positions 5, 20 and 21 when compared to the depth-interpolated [Pb] from the samples immediately above and below. Also, at two stations where our near-bottom sample was taken from rosette position 2, there was no [Pb] excess over the samples immediately above. We believe that this evidence points to sampler-induced contamination that was being slowly washed out during the cruise, but never completely. We have therefore flagged all of these analyses with a "3", indicating that we do not believe that these samples should be trusted as reflecting the true ocean [Pb].
In addition, we observed high [Pb] in the samples at Station 1 and very scattered Pb isotope ratios. The majority of these concentrations were far in excess of the values observed at nearby Station 11, and also at the nearby USGT10-01. Discussion among other cruise participants revealed similarly anomalous data for other trace metals (e.g., Hg species). After discussion at the 2016 GEOVIDE Workshop, we came to the conclusion that this is evidence of GO-FLO bottles not having had sufficient time to "clean up" prior to use, and that most or all bottles from Station 1 were contaminated. We flagged all Station 1 data with a "3", indicating that we do not believe these values reflect the true ocean [Pb].
Samples were analyzed at least 1 month after acidification, over 36 analytical sessions, by a resin pre-concentration method. This method utilized the isotope-dilution ICP-MS method described in Lee et al. 2011, which includes pre-concentration on nitrilotriacetate (NTA) resin and analysis on a Fisons PQ2+ using a 400 µL/min nebulizer. Briefly, samples were poured into 30 mL subsample bottles. Then, triplicate 1.5 mL polypropylene vials (Nalgene) were rinsed three times with the 30 mL subsample. Each sample was pipetted (1.3 mL) from the 30 mL subsample to the 1.5 mL vial. Pipettes were calibrated daily to the desired volume. 25 µL of a 204Pb spike were added to each sample, and the pH was raised to 5.3 using a trace metal clean ammonium acetate buffer, prepared at a pH of between 7.95 and 7.98. 2400 beads of NTA Superflow resin (Qiagen Inc., Valencia, CA) were added to the mixture, and the vials were set to shake on a shaker for 3 to 6 days to allow the sample to equilibrate with the resin. After equilibration, the beads were centrifuged and washed 3 times with pure distilled water, using a trace metal clean siphon tip to remove the water wash from the sample vial following centrifugation. After the last wash, 350 µL of a 0.1N solution of trace metal clean HNO3 was added to the resin to elute the metals, and the samples were set to shake on a shaker for 1 to 2 days prior to analysis by ICP-MS.
NTA Superflow resin was cleaned by batch rinsing with 0.1N trace metal clean HCl for a few hours, followed by multiple washes until the pH of the solution was above 4. Resin was stored at 4 degrees Celsius in the dark until use, though it was allowed to equilibrate to room temperature prior to addition to the sample.
Nalgene polypropylene (PPCO) vials were cleaned by heated submersion for 2 days at 60 degrees Celsius in 1N reagent grade HCl, followed by a bulk rinse and a 4X individual rinse of each vial with pure distilled water. Each vial was then filled with trace metal clean dilute HCl (0.01N HCl) and heated in the oven at 60 degrees Celsius for one day on either end. Vials were kept filled until just before usage.
On each day of sample analysis, procedure blanks were determined. Replicates (12) of 300 µL of an in-house standard reference material seawater (low-Pb surface water) were used, where the amount of Pb in the 300 µL was verified as negligible. The procedural blank over the relevant sessions for the resin preconcentration method ranged from 2.2 to 9.9 pmol/kg, averaging 4.6 +/- 1.7 pmol/kg. Within a day, procedure blanks were very reproducible, with an average standard deviation of 0.7 pmol/kg, resulting in detection limits (3x this standard deviation) of 2.1 pmol/kg. Replicate analyses of three different large-volume seawater samples (one with 11 pmol/kg, another with 24 pmol/kg, and a third with 38 pmol/kg) indicated that the precision of the analysis is 4% or 1.6 pmol/kg, whichever is larger.
Triplicate analyses of an international reference standard gave SAFe D2: 27.2 +/- 1.7 pmol/kg. However, this standard run was linked into our own long-term quality control standards that are run on every analytical day to maintain long-term consistency.
For the most part, the reported numbers are simply as calculated from the isotope dilution equation on the day of the analysis. For some analytical days, however, quality control samples indicated offsets in the blank used to correct the samples. For the upper 5 depths of Station 29, all depths of Station 40, and the deepest 2 depths of Station 42, the quality control samples indicated our blank was overcorrecting by 3.4 pM, and we applied a -3.4 pM correction to our Pb concentrations for that day. For the deepest 11 depths of Station 34, the quality control samples indicated our blank was overcorrecting by 10.2 pM (due to contamination of the low trace metal seawater stock), and we applied a -10.2 pM correction to our Pb concentrations for that day. With these corrections, the overall internal comparability of the Pb collection should be better than 4%.
The errors associated with these Pb concentration measurements are on average 3.2% of the concentration (0.1 to 4.4 pmol/kg). Although there was a formal crossover station (1) that overlaps with USGT10-01 (GA-03), sample quality on the first station of GEOVIDE appears problematic, making the comparison unhelpful. However, GEOVIDE station 11 (40.33 degrees North, 12.22 degrees West) is not too far from USGT10-01 (38.325 degrees North, 9.66 degrees West) and makes for a reasonable comparison. It should also be noted that the MIT lab has intercalibrated Pb with other labs on the 2008 IC1 cruise, on the 2011 USGT11 (GA-03) cruise, and on the EPZT (GP-16) cruise, and maintains in-lab quality control standards for long-term data quality evaluation.
Ten percent of the samples were analyzed by Rick Kayser and the remaining ninety percent of the samples were analyzed by Cheryl Zurbrick. There was no significant difference between them for the lowest concentration large-volume seawater
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset provides subjective ratings of valence, imageability and frequency for 150 positive and 150 negative adjectives describing personality characteristics. The words in this dataset can be used for the development or validation of existing or novel experimental tasks used in a wide range of cognition research.
In Phase 1 of data collection, an initial sample of 100 participants provided self-referential valence ratings for a list of 482 adjectives depicting personality characteristics. These ratings were averaged across the sample to facilitate the exclusion of ambiguous words rated neither negative nor positive and produce a final list of 300 words (150 negative and 150 positive). In Phase 2 of data collection, we sought to further characterise these 300 words with three separate online surveys collecting ratings of self-referential valence, imageability and subjective frequency. A further 102 participants provided self-referential valence ratings, 200 participants provided imageability ratings and 202 participants provided subjective frequency ratings. Basic demographics and data on depressive symptoms and state anxiety were collected from all participants; see Tables 1a and 1b. The raw ratings collected in each of the four surveys are provided in the "Raw Datasets" folder, and the exact surveys used are provided in Supplementary file 1.
We computed a series of statistics (mean, standard deviation, standard error, number of ratings received, median, minimum rating, maximum rating, range, skew, kurtosis) for each type of rating for each of the 300 personality descriptors. The statistics for self-referential valence, imageability, subjective frequency and word length were merged into a final dataset (see Positive and negative personality descriptor words dataset). We pooled scores from all participants for the reported statistical analyses, based on exploratory analyses showing age, gender and depression/anxiety symptoms had little effect on participant ratings (see Figures 2-8). However, if greater stratification is desired, specific population statistics can be re-calculated from the raw datasets. The R script we used for data cleaning and analysis is provided in Supplementary file 2.
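If the statistics need to be re-calculated from the raw datasets for a specific subgroup, a sketch in R (with hypothetical column names, and skew/kurtosis from the moments package) is:

library(dplyr)
library(moments)   # skewness(), kurtosis()

word_stats <- raw_ratings %>%        # long format: one rating per participant per word
  group_by(word) %>%
  summarise(
    n_ratings   = sum(!is.na(rating)),
    mean_rating = mean(rating, na.rm = TRUE),
    sd_rating   = sd(rating, na.rm = TRUE),
    se_rating   = sd_rating / sqrt(n_ratings),
    median_r    = median(rating, na.rm = TRUE),
    min_r       = min(rating, na.rm = TRUE),
    max_r       = max(rating, na.rm = TRUE),
    range_r     = max_r - min_r,
    skew        = skewness(rating, na.rm = TRUE),
    kurt        = kurtosis(rating, na.rm = TRUE)
  )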
We also explored the relationship between the initial self-referential valence ratings collected in Phase 1 (first Qualtrics survey) and those collected during Phase 2 (second Qualtrics survey) for our final list of 300 words. We found the mean ratings for each word to be highly correlated between the two surveys (Spearman’s rho = 0.97, p < .01; see Figure 9). Additionally, we conducted a mixed effects analysis of variance to statistically assess the effects of data collection phase on the self-referential valence ratings acquired for each personality descriptor (see Self-referential valence reliability dataset).
Only fully anonymised data is provided – all pseudonymous variables have been removed by the research team prior to sharing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2. R program and examples. The program code of RNAdeNoise in the R language, with examples of its use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code for manuscript submitted to Biostatistics. File 1 produces synthetic data and runs a minimal example of the process. File 2 contains full analysis code with data cleaning removed.
The Bangladesh Forest Inventory (BFI) was developed to support sustainable forest management and to promote a forest monitoring system. The BFI comprises a biophysical inventory and a socio-economic survey. The design and analysis of these components were supported by remote sensing-based land cover mapping. The inventory methodology was prepared in technical consultation with national and international forest inventory and land monitoring experts and was employed by the Forest Department under the Ministry of Environment, Forests and Climate Change to establish the BFI as an accurate and replicable national forest assessment.
The biophysical inventory involved visits to 1,781 field plots, and the socioeconomic survey covered 6,400 households. Semi-automated segmented land cover mapping was used for object-based land characterisation, a mobile application for onsite tree species identification, Open Foris tools for data collection and processing, the R statistical package for analysis, and differential GPS for plot referencing. Seven major criteria and relevant indicators were developed to monitor sustainable forest management, informing both management decisions and national and international reporting. Biophysical and socioeconomic data were integrated to estimate these indicators. The BFI provided data and information on tree and forest resources, land use, and ecosystem services valuations for the country. The BFI established the sample plots as permanent plots for the continuous assessment of forest resources and monitoring over time.
National
Fields/plots
The country was divided into five distinct strata/zones for the allocation of sample plots, to properly represent forest and trees outside forest. The interaction of communities with forest and trees, and their dependency on them, were also considered. For monitoring purposes, the samples were made permanent, and the boundaries of the zones were defined in such a way that they would not change easily. The five zones of the Bangladesh Forest Inventory are the Sundarbans (natural mangrove forest) Zone, Coastal (coastal plantations including mangrove plantation) Zone, Hill (evergreen and semi-evergreen hilly forest areas) Zone, Sal (deciduous forest) Zone and Village (mainly trees outside forest and social forestry) Zone. The universe is the tree population across the country, including trees in and outside forest land in all five subpopulations.
Sample survey data [ssd]
The sampling strategy for the National Forest Inventory (NFI) in Bangladesh comprises multiple steps, including Zoning, Land Cover development, Biophysical Inventory, and Socioeconomic Survey.
Zone: The country is divided into five zones (Sal, Sundarbans, Village, Hill, and Coastal) based on geographical conditions, species diversity, forest types, and human interaction. The socioeconomic survey zones correspond to the biophysical zones, with the Sundarbans zone referred to as the Sundarbans periphery zone. Land Cover: The Land Representation System of Bangladesh (LRSB) was developed using an object-based classification approach with the Land Cover Classification System (LCCS v3) and satellite imagery. The 33 land cover classes from the 2015 Land Cover Map were aggregated into Forest or Other Land categories following FRA definitions. Biophysical Inventory Design: The biophysical component employs a pre-stratified systematic sampling design with variable intensities for each zone. Sample intensity was determined by a 5% confidence interval target for tree resource estimates, utilizing Neyman allocation for plot distribution. Plots were randomly placed within a hexagonal grid, with grid distances ranging from 5,900 to 10,400 meters, resulting in 2,245 plot locations, of which 1,858 required field visits. Each plot included subplots of 19 m radius in the Sundarbans and 5 subplots in other zones. Trees with DBH ≥ 30 cm, 10-30 cm, and 2-10 cm were measured in 19 m, 8 m, and 2.5 m radius plots, respectively. Soil samples were collected 8 m from the subplot centre at a 270° bearing.
Socioeconomic Survey Design: The socioeconomic survey utilized a multi-stage random sampling method. It was based on the hypothesis that tree and forest ecosystem services correlate with tree cover per household. Tree cover data from 2014 Landsat images and household data from the 2011 Census were used to calculate Household Tree Availability. The five zones were divided into four strata each, based on tree cover availability. In each pre-selected union, 20 households were surveyed (totalling 6400 households) by navigating to random GPS points. Additionally, 100 qualitative surveys were conducted through Focus Group Discussions across the zones, involving community leaders and special forest user groups.
This comprehensive sampling strategy ensures robust data collection on forest resources and their socioeconomic interactions.
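The socioeconomic design stratified unions within each zone by Household Tree Availability, i.e. tree cover (2014 Landsat) divided by household counts (2011 Census), with four strata per zone. The sketch below illustrates that derivation only; the union identifiers, figures, column names, and the pandas-based approach are hypothetical assumptions for illustration, not BFI data or code.

```python
# Minimal sketch: compute Household Tree Availability (HTA) per union and
# assign four tree-cover strata within each zone. All values are hypothetical.
import pandas as pd

unions = pd.DataFrame({
    "zone":          ["Village"] * 4 + ["Hill"] * 4,
    "union_id":      [101, 102, 103, 104, 201, 202, 203, 204],
    "tree_cover_ha": [120.0, 40.0, 260.0, 75.0, 800.0, 650.0, 900.0, 300.0],  # hypothetical
    "households":    [1500, 1400, 1300, 1250, 900, 1100, 1000, 950],          # hypothetical
})

# Household Tree Availability: tree cover available per household.
unions["hta"] = unions["tree_cover_ha"] / unions["households"]

# Within each zone, split unions into four strata by HTA quartile (1 = lowest).
unions["stratum"] = unions.groupby("zone")["hta"].transform(
    lambda s: pd.qcut(s, q=4, labels=False) + 1
)
print(unions[["zone", "union_id", "hta", "stratum"]])
```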
Biophysical inventory: around 4% deviation from the sample design, mostly because of inaccessibility issues in hill regions. Socioeconomic survey: no deviation from the sample design.
Field measurement [field]
In BFI field data collection, data cleaning, quality control, and data archiving were part of a simultaneous process performed both in the field and in the central office. Open Foris Collect was used for data collection and processing. Collected data were submitted to the central office unit for management, cleaning, archiving, and further processing. As field data were collected, they were checked for outliers or suspect entries, both manually and with R scripts. If an obvious correction was needed, it was applied in the Open Foris database; otherwise the field teams were consulted about the suspect data to understand the problem and decide on further action.
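The screening scripts themselves are not distributed with this entry (the BFI used R against the Open Foris database). As a minimal, hedged illustration of what such an automated plausibility check can look like, the Python sketch below flags records whose measurements fall outside fixed ranges; the field names and thresholds are assumptions, not BFI values.

```python
# Minimal sketch of an automated plausibility check on incoming tree records.
# Field names and thresholds are hypothetical illustrations only.
import csv

DBH_RANGE_CM = (2.0, 300.0)     # trees below 2 cm DBH were not measured
HEIGHT_RANGE_M = (1.3, 70.0)

def flag_suspect_trees(path):
    """Return rows whose DBH or height falls outside plausible ranges."""
    suspects = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            dbh = float(row["dbh_cm"])
            height = float(row["height_m"])
            problems = []
            if not DBH_RANGE_CM[0] <= dbh <= DBH_RANGE_CM[1]:
                problems.append(f"DBH out of range: {dbh}")
            if not HEIGHT_RANGE_M[0] <= height <= HEIGHT_RANGE_M[1]:
                problems.append(f"height out of range: {height}")
            if problems:
                suspects.append((row["plot_id"], row["tree_id"], problems))
    return suspects

# Flagged records would be referred back to the field team rather than
# silently corrected, mirroring the workflow described above.
```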
At the same time, four QA/QC teams performed quality assessments of data collection directly in the field through hot and cold checks. Hot checks provided the opportunity to improve data while still in the field. Cold checks identified issues to be considered, fed into the check list, and determined whether the data were acceptable; where data were unacceptable, the plot was remeasured. In the biophysical inventory, 39 hot checks and 54 cold checks were conducted, about 5% of the sampled plots, and 52 plots were remeasured in total. For the socioeconomic survey, 254 hot checks and 13 cold checks were conducted, about 4% of the total number of households sampled. A Microsoft Access database was prepared from the data exported from Collect, which also enabled generation of reports with images collected in the field; the database was updated as new data arrived. Data cleansing was conducted using Collect desktop and the R statistical tool. Manual record checks were done in Collect, and R-generated quality control checks were used to identify possible inconsistencies. Inconsistencies were confirmed and corrected in consultation with the field crews or data collectors and subsequently updated in the Collect database.
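For reference, the quoted coverage figures follow directly from the check counts; the short sketch below simply reproduces the arithmetic using the plot and household totals given in this entry.

```python
# Quick check of the QA/QC coverage figures quoted above.
bio_checks = 39 + 54          # hot + cold checks, biophysical inventory
bio_plots = 1858              # field-visit plots, as stated in the sampling design
socio_checks = 254 + 13       # hot + cold checks, socioeconomic survey
socio_households = 6400

print(f"Biophysical QA coverage: {bio_checks / bio_plots:.1%}")          # ~5%
print(f"Socioeconomic QA coverage: {socio_checks / socio_households:.1%}")  # ~4.2%
```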
Biophysical inventory: 1781 sample plots were inventoried out of a total of 1857 plots, around 96%. Socioeconomic survey: 100% of the 6400 targeted households were surveyed.
Please refer to Table 5.10 of the Report on the Bangladesh Forest Inventory for more information on the estimates of sampling error.
The socioeconomic survey had 0% nonresponse, whereas the biophysical inventory had around 4% nonresponse. Because nonresponse can introduce bias, estimation methods were developed to minimise it:
1. Treat inaccessible plots as zeros but report the area by inaccessible class. This is transparent, so it is recommended to always present the proportion of inaccessible plots.
2. Partition inaccessible zones and report them as such. This clearly identifies regions that could not be sampled.
3. Drop the plots from estimation. This treats the inaccessible plots as if they had the stratum mean. For partially accessible plots, a special estimator must be used, such as the ratio-to-size estimator or the estimator used by FIA (Bechtold and Scott 2005), to account for the missing portion of the plot.
Mostly methods 1 and 3 were used; method 2 was used where there were inaccessible regions.
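As an illustration of treatments 1 and 3 above, the minimal sketch below computes both estimates for a single stratum; the per-plot volumes are hypothetical and this is not the BFI estimation code.

```python
# Minimal sketch of two of the nonresponse treatments described above,
# using hypothetical per-plot volumes for one stratum.

plots = [
    # (plot_id, accessible, volume_m3_per_ha or None if not visited)
    ("P01", True, 120.0),
    ("P02", True, 95.5),
    ("P03", False, None),   # inaccessible plot
    ("P04", True, 140.2),
    ("P05", True, 88.0),
]

n_total = len(plots)
accessible = [v for _, ok, v in plots if ok]
n_inaccessible = n_total - len(accessible)

# Method 1: treat inaccessible plots as zeros, and always report the
# proportion of inaccessible plots alongside the estimate.
mean_method1 = sum(accessible) / n_total
prop_inaccessible = n_inaccessible / n_total

# Method 3: drop inaccessible plots from estimation, which implicitly
# assigns them the stratum mean of the accessible plots.
mean_method3 = sum(accessible) / len(accessible)

print(f"Method 1: {mean_method1:.1f} m3/ha ({prop_inaccessible:.0%} plots inaccessible)")
print(f"Method 3: {mean_method3:.1f} m3/ha")
```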
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This is the supplementary material accompanying the manuscript "Daily life in the Open Biologist’s second job, as a Data Curator", published in Wellcome Open Research.
It contains:
- Python_scripts.zip: Python scripts used for data cleaning and organization:
  - add_headers.py: adds specified headers automatically to a list of csv files, creating new output files containing a "_with_headers" suffix.
  - count_NaN_values.py: counts the total number of rows containing null values in a csv file and prints the location of null values in the (row, column) format.
  - remove_rowsNaN_file.py: removes rows containing null values in a single csv file and saves the modified file with a "_dropNaN" suffix.
  - remove_rowsNaN_list.py: removes rows containing null values in a list of csv files and saves the modified files with a "_dropNaN" suffix (a minimal sketch of this behaviour appears after this list).
- README_template.txt: a template for a README file to be used to describe and accompany a dataset.
- template_for_source_data_information.xlsx: a spreadsheet to help manuscript authors to keep track of data used for each figure (e.g., information about data location and links to dataset description).
- Supplementary_Figure_1.tif: Example of a dataset shared by us on Zenodo. The elements that make the dataset FAIR are indicated by the respective letters. Findability (F) is achieved by the dataset's unique and persistent identifier (DOI), as well as by the related identifiers for the publication and dataset on GitHub. Additionally, the dataset is described with rich metadata (e.g., keywords). Accessibility (A) is achieved by the ease of visualization and downloading using a standardised communications protocol (https). Also, the metadata are publicly accessible and licensed under the public domain. Interoperability (I) is achieved by the open formats used (CSV; R), and metadata are harvestable using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a low-barrier mechanism for repository interoperability. Reusability (R) is achieved by the complete description of the data with metadata in README files and links to the related publication (which contains more detailed information, as well as links to protocols on protocols.io). The dataset has a clear and accessible data usage license (CC-BY 4.0).
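The scripts themselves are distributed in Python_scripts.zip and are not reproduced here. As a hedged, minimal sketch of the row-dropping behaviour described for remove_rowsNaN_list.py above, the Python snippet below reimplements that behaviour based only on the description; it is not the authors' script, and the pandas dependency is an assumption.

```python
# Minimal sketch of the row-dropping behaviour described for
# remove_rowsNaN_list.py: for each CSV in a list, drop rows containing
# null values and save a copy with a "_dropNaN" suffix.
from pathlib import Path
import pandas as pd

def drop_nan_rows(csv_paths):
    for path in map(Path, csv_paths):
        df = pd.read_csv(path)
        cleaned = df.dropna()  # remove rows with any null value
        out_path = path.with_name(path.stem + "_dropNaN" + path.suffix)
        cleaned.to_csv(out_path, index=False)
        print(f"{path.name}: dropped {len(df) - len(cleaned)} rows -> {out_path.name}")

# Example usage (hypothetical file names):
# drop_nan_rows(["experiment_1.csv", "experiment_2.csv"])
```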
BSD 2-Clause License (http://opensource.org/licenses/BSD-2-Clause)
Source code accompanying the paper: Koopman, P., Lubbers, M. & Plasmeijer, R. (2018). A Task-Based DSL for Microcomputers. In R. Stewart (Ed.), RWDSL2018: Proceedings of the Real World Domain Specific Languages Workshop 2018, Vienna, Austria, February 24, 2018 (pp. 1-11). New York: ACM. doi: 10.1145/3183895.3183902
This is a snapshot of the mTask git repository: https://gitlab.science.ru.nl/mlubbers/mTask
Abstract: The Internet of Things, IoT, makes small connected computing devices almost omnipresent. These devices typically have very limited computing power and severe memory restrictions to make them cheap and power efficient. They can interact with the environment via special sensors and actuators. Since each device controls several peripherals running interleaved, the control software is quite complicated and hard to maintain. Task Oriented Programming, TOP, offers lightweight communicating threads that can inspect each other's intermediate results. This makes it well suited for the IoT. This paper presents a functional task-based domain specific language for these IoT devices. We show that it yields concise control programs. By restricting the datatypes and using strict evaluation, these programs fit within the restrictions of microcontrollers.
Contents:
README.md: contains a brief description of the files
mTaskExamples.icl: contains the example mTask programs
*.icl, *.dcl: contain the mTask library Clean (https://clean.cs.ru.nl/Clean) source files
DESCRIPTION
This repository contains analysis scripts (with outputs), figures from the manuscript, and supplementary files for the HIV Pain (HIP) Intervention Study. All analysis scripts (and their outputs, in the /outputs subdirectory) are found in HIP-study.zip, while PDF copies of the analysis outputs that are cited in the manuscript as supplementary material are found in the relevant supplement-*.pdf file.
Note: Participant consent did not provide for the publication of their data, and hence neither the original nor the cleaned data have been made available. However, we do not wish to bar access to the data unnecessarily, and we will judge requests to access the data on a case-by-case basis. Examples of potential use cases include independent assessments of our analyses and secondary data analyses. Please contact Peter Kamerman (peter.kamerman@gmail.com) or Dr Tory Madden (torymadden@gmail.com), or open an issue on the GitHub repo (https://github.com/kamermanpr/HIP-study/issues).
BIBLIOGRAPHIC INFORMATION
Repository citation: Kamerman PR, Madden VJ, Parker R, Devan D, Cameron S, Jackson K, Reardon C, Wadley A. Analysis scripts and supplementary files: Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. DOI: 10.6084/m9.figshare.7654637.
Manuscript citation: Parker R, Madden VJ, Devan D, Cameron S, Jackson K, Kamerman P, Reardon C, Wadley A. Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. Pain Reports [submitted 2019-01-31].
Manuscript abstract:
Introduction: Pain affects over half of people living with HIV/AIDS (LWHA) and pharmacological treatment has limited efficacy. Preliminary evidence supports non-pharmacological interventions. We previously piloted a multimodal intervention in amaXhosa women LWHA and chronic pain in South Africa, with improvements seen in all outcomes in both intervention and control groups.
Methods: A multicentre, single-blind randomised controlled trial with 160 participants recruited was conducted to determine whether the multimodal peer-led intervention reduced pain in different populations of both male and female South Africans LWHA. Participants were followed up at Weeks 4, 8, 12, 24 and 48 to evaluate effects on the primary outcome of pain, and on depression, self-efficacy and health-related quality of life.
Results: We were unable to assess the efficacy of the intervention due to a 58% loss to follow-up (LTFU). Secondary analysis of the LTFU found that sociocultural factors were not predictive of LTFU. Depression, however, did associate with LTFU, with greater severity of depressive symptoms predicting LTFU at week 8 (p=0.01).
Discussion: We were unable to evaluate the effectiveness of the intervention due to the high LTFU and the risk of retention bias. The different sociocultural context in South Africa may warrant a different approach to interventions for pain in HIV compared to resource-rich countries, including a concurrent strategy to address barriers to health care service delivery. We suggest that assessment of pain and depression need to occur simultaneously in those with pain in HIV. We suggest investigation of the effect of social inclusion on pain and depression.
USING DOCKER TO RUN THE HIP-STUDY ANALYSIS SCRIPTS
These instructions are for running the analysis on your local machine. You need to have Docker installed on your computer.
To do so, go to docker.com (https://www.docker.com/community-edition#/download) and follow the instructions for downloading and installing Docker for your operating system. Once Docker has been installed, follow the steps below, noting that Docker commands are entered in a terminal window (Linux and OSX/macOS) or a command prompt window (Windows). Windows users may also wish to install GNU Make (http://gnuwin32.sourceforge.net/downlinks/make.php) (required for the make method of running the scripts) and Git (https://gitforwindows.org/) version control software (not essential).
Download the latest image
Enter: docker pull kamermanpr/docker-hip-study:v2.0.0
Run the container
Enter: docker run -d -p 8787:8787 -v <PATH>:/home/rstudio --name threshold -e USER=hip -e PASSWORD=study kamermanpr/docker-hip-study:v2.0.0
Where <PATH> refers to the path to the HIP-study directory on your computer, which you either cloned from GitHub (https://github.com/kamermanpr/HIP-study.git), git clone https://github.com/kamermanpr/HIP-study, or downloaded and extracted from figshare (https://doi.org/10.6084/m9.figshare.7654637).
Login to RStudio Server
- Open a web browser window and navigate to: localhost:8787
- Use the following login credentials: Username: hip, Password: study
Prepare the HIP-study directory
The HIP-study directory comes with the outputs for all the analysis scripts in the /outputs directory (html and md formats). However, should you wish to run the scripts yourself, there are several preparatory steps that are required:
1. Acquire the data. The data required to run the scripts have not been included in the repo because participants in the studies did not consent to public release of their data. However, the data are available on request from Peter Kamerman (peter.kamerman@gmail.com). Once the data have been obtained, the files should be copied into a subdirectory named /data-original.
2. Clean the /outputs directory by entering make clean in the Terminal tab in RStudio.
Run the HIP-study analysis scripts
To run all the scripts (including the data cleaning scripts), enter make all in the Terminal tab in RStudio.
To run individual RMarkdown scripts (*.Rmd files):
1. Generate the cleaned data using one of the following methods:
- Enter make data-cleaned/demographics.rds in the Terminal tab in RStudio.
- Enter source('clean-data-script.R') in the Console tab in RStudio.
- Open the clean-data-script.R script through the File tab in RStudio, and then click the 'Source' button on the right of the Script console in RStudio for each script.
2. Run the individual script by:
- Entering make outputs/<script-name>.html in the Terminal tab in RStudio, OR
- Opening the relevant *.Rmd file through the File tab in RStudio, and then clicking the 'knit' button on the left of the Script console in RStudio.
Shutting down
Once done, log out of RStudio Server and enter the following into a terminal to stop the Docker container: docker stop hip. If you then want to remove the container, enter: docker rm threshold. If you also want to remove the Docker image you downloaded, enter: docker rmi kamermanpr/docker-hip-study:v2.0.0