Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
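For readers who want a quick impression of the kind of conversion the tutorial covers, here is a minimal sketch (not the tutorial's own script) that reshapes a wide, spreadsheet-style table into tidy long format with tidyr; the column names and values are invented for illustration:

library(tidyr)
# hypothetical wide table: one row per sample, one column per measurement day
wide <- data.frame(sample = c("A", "B"),
                   day1 = c(0.12, 0.08),
                   day2 = c(0.45, 0.30))
# pivot to tidy format: one observation per row
tidy <- pivot_longer(wide, cols = starts_with("day"),
                     names_to = "day", values_to = "value")
tidy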
This file contains the Fourier Transform Infrared (FTIR) Spectroscopy data from the NOAA R/V Ronald H. Brown during VOCALS-REx 2008.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
The data are archived here: https://doi.org/10.5281/zenodo.4818011

This data and code archive provides all the files that are necessary to replicate the empirical analyses presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.

This data and code archive contains the following files and folders:

* README
  Description: text file with this description
* flowchart.pdf
  Description: a PDF file with a flow chart that illustrates how the R scripts transform the raw data files into files that contain generated data sets and intermediate results and, finally, into the tables and figures that are presented in the paper
* runAll.sh
  Description: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell, such as most computers with MacOS, GNU/Linux, or Unix operating systems)
* Folder "DataRaw"
  Description: folder for raw data files. This folder contains the following files:
  - DataRaw/COWS.xlsx
    Description: MS-Excel file with the number of cows per county
    Source: USDA NASS Quickstats
    Observations: all available counties and years from 2002 to 2012
  - DataRaw/milk_state.xlsx
    Description: MS-Excel file with average monthly milk yields per cow
    Source: USDA NASS Quickstats
    Observations: all available states from 1981 to 2018
  - DataRaw/TMAX.csv
    Description: CSV file with daily maximum temperatures
    Source: PRISM Climate Group (spatially averaged)
    Observations: all counties from 1981 to 2018
  - DataRaw/VPD.csv
    Description: CSV file with daily maximum vapor pressure deficits
    Source: PRISM Climate Group (spatially averaged)
    Observations: all counties from 1981 to 2018
  - DataRaw/countynamesandID.csv
    Description: CSV file with county names, state FIPS codes, and county FIPS codes
    Source: US Census Bureau
    Observations: all counties
  - DataRaw/statecentroids.csv
    Description: CSV file with latitudes and longitudes of state centroids
    Source: generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function
    Observations: all states
* Folder "DataGenerated"
  Description: folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Results"
  Description: folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Figures"
  Description: folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.
* Folder "Tables"
  Description: folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.
* Folder "logFiles"
  Description: the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.
* PrepareCowsData.R
  Description: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses
* PrepareWeatherData.R
  Description: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses
* PrepareMilkData.R
  Description: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses
* CalcFrequenciesTHI_Temp.R
  Description: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state
* CalcAvgTHI.R
  Description: R script that calculates the average THI in each state
* PreparePanelTHI.R
  Description: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins
* PreparePanelTemp.R
  Description: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins
* PreparePanelFinal.R
  Description: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses
* EstimateTrendsTHI.R
  Description: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set
* EstimateModels.R
  Description: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications
* CalcCoefStateYear.R
  Description: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification
* SearchWeightMonths.R
  Description: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term
* TestModelSpec.R
  Description: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10
* CreateFigure1a.R
  Description: R script that creates subfigure a of Figure 1
* CreateFigure1b.R
  Description: R script that creates subfigure b of Figure 1
* CreateFigure2a.R
  Description: R script that creates subfigure a of Figure 2
* CreateFigure2b.R
  Description: R script that creates subfigure b of Figure 2
* CreateFigure2c.R
  Description: R script that creates subfigure c of Figure 2
* CreateFigure3.R
  Description: R script that creates the subfigures of Figure 3
* CreateFigure4.R
  Description: R script that creates the subfigures of Figure 4
* CreateFigure5_TableS6.R
  Description: R script that creates the subfigures of Figure 5 and Table S6
* CreateFigureS1.R
  Description: R script that creates Figure S1
* CreateFigureS2.R
  Description: R script that creates Figure S2
* CreateTableS2_S3_S7.R
  Description: R script that creates Tables S2, S3, and S7
* CreateTableS4_S5.R
  Description: R script that creates Tables S4 and S5
* CreateTableS8.R
  Description: R script that creates Table S8
* CreateTableS9.R
  Description: R script that creates Table S9
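On systems without a bash shell, the sequential execution that runAll.sh performs can be approximated from within R; the sketch below simply source()s the preparation and estimation scripts in order. The script names are taken from the list above, but the exact ordering of the later scripts is an assumption and should be checked against the flow chart:

# run the data preparation and estimation scripts sequentially from R
scripts <- c("PrepareCowsData.R", "PrepareWeatherData.R", "PrepareMilkData.R",
             "CalcFrequenciesTHI_Temp.R", "CalcAvgTHI.R",
             "PreparePanelTHI.R", "PreparePanelTemp.R", "PreparePanelFinal.R",
             "EstimateTrendsTHI.R", "EstimateModels.R")
for (s in scripts) {
  message("Running ", s)
  source(s, echo = TRUE)  # echo the code as it is executed, similar to the provided log files
}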
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry to enable real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional enterprise data warehouses (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R Markdown file (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
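As an illustration of the expected input, the following sketch builds a small mock data frame in that 'untidy' four-column layout and hands it to ehr_format(); the column names and values below are invented for illustration and are not taken from the published mock dataset:

# mock input: one row per collection, several panel results packed into a single cell
dt <- data.frame(
  patient_name    = "DOE,JANE (1234567)",
  collection_date = "2020-01-15",
  collection_time = "08:30",
  lab_results     = "Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL",
  stringsAsFactors = FALSE
)
# eLAB's single-line reformatting command (requires the eLAB source code to be loaded)
# labs_long <- ehr_format(dt)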
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
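The remapping idea can be sketched as a small key-value lookup joined to the raw pull; the subtype strings below come from the example above, while the DD code 'potassium' and the table layout are illustrative assumptions rather than the actual eLAB lookup table:

library(dplyr)
# key-value lookup: raw EHR lab name -> Data Dictionary code
lab_lookup <- data.frame(
  raw_name = c("Potassium", "Potassium-External", "Potassium(POC)",
               "Potassium,whole-bld", "Potassium,venous"),
  dd_code  = "potassium",
  stringsAsFactors = FALSE
)
# raw pull with mixed subtypes; rows whose name is not in the lookup are dropped
raw_labs <- data.frame(raw_name = c("Potassium(POC)", "Potassium-External", "Unlisted test"),
                       value = c(4.1, 3.9, 140))
mapped <- inner_join(raw_labs, lab_lookup, by = "raw_name")
mapped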
Data Dictionary (DD)
EHR clinical laboratory data are captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as string or numeric values. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
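Because every site exports data coded against the same DD, multi-site aggregation reduces to stacking the per-site exports; a minimal sketch, where the file names are hypothetical:

library(dplyr)
site_files <- c("site_A_labs.csv", "site_B_labs.csv")  # hypothetical per-site DD-coded exports
all_sites <- bind_rows(lapply(site_files, read.csv, stringsAsFactors = FALSE))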
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
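A minimal sketch of this kind of univariable Cox model, using the survival package listed among eLAB's dependencies; the toy data frame and the variable names (os_months, death_event, lab_value) are placeholders, not the registry's field names:

library(survival)
# toy cohort: one row per patient with follow-up time, event indicator, and a baseline lab
cohort <- data.frame(os_months   = c(12, 30, 7, 45, 18, 60),
                     death_event = c(1, 0, 1, 0, 1, 0),
                     lab_value   = c(4.1, 3.8, 5.2, 4.4, 4.9, 3.7))
# univariable Cox proportional hazards model for one lab predictor
fit <- coxph(Surv(os_months, death_event) ~ lab_value, data = cohort)
summary(fit)  # hazard ratio and exploratory p-value for that lab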
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R. It loads the data for the three phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transforms the data, and builds the model (starting at line 225). The final consumption data can be found in one file per year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.
To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data and to optimize for the different values (PV, MBC, ...).
To reproduce the figures one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.
To calculate the solar radiation that is needed in the Production Data section, follow the steps in the file calculate_total_radiation.R.
To reproduce the radiation data from ERA5 that can be found in data.zip, do the following steps:
1. ERA5: download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
2. Convert GRIB to csv with the file era5toGRID.sh.
3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
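Put together, the R side of the workflow above can be run roughly as follows; this is a sketch that assumes the scripts sit in the working directory and that the ERA5 pre-processing has already produced the required csv input:

# 1. build the consumption model and write MEGA_CONS_list.Rdata
source("consumption_data.R")
# 2. compare production and consumption data and run the optimization
source("analyze_data.R")
# 3. reproduce the figures
source("visualize_results.R")
# solar radiation for the Production Data section
source("calculate_total_radiation.R")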
This data package is associated with the publication “Meta-metabolome ecology reveals that geochemistry and microbial functional potential are linked to organic matter development across seven rivers” submitted to Science of the Total Environment. This data package includes the data necessary to replicate the analyses presented within the manuscript to investigate dissolved organic matter (DOM) development across broad spatial distances and within divergent biomes. Specifically, we included the Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS) data, geochemistry data, annotated metagenomic data, and results from ecological null modeling analyses in this data package. Additionally, we included the scripts necessary to generate the figures from the manuscript. Complete metagenomic data associated with this data package can be found at the National Center for Biotechnology Information (NCBI) under Bioproject PRJNA946291.

This dataset consists of (1) four folders; (2) a file-level metadata (flmd) file; (3) a data dictionary (dd) file; (4) a factor sheet describing samples; and (5) a readme.

The FTICR Data folder contains (1) the processed FTICR-MS data; (2) a transformation-weighted characteristics dendrogram generated from the FTICR-MS data; and (3) the script used to generate all FTICR-MS related figures. The Geochemical Data folder contains (1) the single geochemistry data file and (2) the R script responsible for generating associated figures. The Metagenomic Data folder contains (1) annotation information across different levels; (2) carbohydrate-active enzyme (CAZyme) information from the dbCAN database (Yin et al., 2012); (3) phylogenetic tree data (FASTAs, alignments, and tree file); and (4) the scripts necessary to analyze all of these data and generate figures. The Null Modeling Data folder contains (1) data generated during null modeling for each river and all rivers combined and (2) the R scripts necessary to process the data. All files are .csv, .pdf, .tsv, .tre, .faa, .afa, .tree, or .R.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project abstract: Many situations involve processing social and non-social information simultaneously. However, it is not known how performance is affected in such situations. Here, we examined how our ability to process social information is affected by the need to keep track of non-social information. Participants were instructed to carry out two tasks within each trial. The social task involved referential communication, requiring participants to use social cues to guide their decisions. At the same time, cognitive load was manipulated by requiring participants to remember non-social information in the form of either one or three two-digit numbers visually presented before each social task stimulus. Results indicate that the cognitive demands of simultaneously processing social and non-social information impair social information processing. Specifically, keeping in mind three numbers slowed participants' ability to use another person's perspective to guide decisions. These results suggest that social information processing requires domain-general resources that are depleted under cognitive load.

Data: These files include our dataset, as well as the scripts used to analyze the data and create graphs of the results. You will need to download R (http://www.r-project.org/) to use these files. Data are from 29 adult participants. Participants completed an adapted version of the “Director Task” (Dumontheil, Hillebrandt, Apperly, & Blakemore, 2012) with an embedded working memory (WM) task component. Afterwards, participants completed a verbal reverse digit-span task as a measure of WM capacity and the Interpersonal Reactivity Index questionnaire to assess individual differences in trait perspective taking (Davis, 1980).

Data Analysis: We used the lme4 package in R (Bates, Maechler, & Bolker, 2013) to perform a linear mixed effects analysis on the relationship between our factors of interest and accuracy and RT for both tasks. RT data from correct trials only were analyzed. To create approximately normally distributed residuals, we used a log or reciprocal function to transform RT data. We performed a two-step procedure: first, we created a global model including main and interactive effects of cognitive load (low vs. high), condition (Director Present vs. Director Absent), trial type (1-object vs. 3-object), and perspective (same vs. different) as fixed effects, and each model included a random intercept for each participant. We then compared all possible combinations[1] of the variables within our global model using an automated model selection procedure (MuMIn 1.9.0; Barton, 2013). Models were ranked using second-order Akaike Information Criterion (AICc; Burnham & Anderson, 2002). Second, after determining the best fitting model for each outcome of interest, we tested whether WM capacity or trait perspective taking explained any additional variance through likelihood ratio tests. All p-values were obtained by likelihood ratio tests comparing the best fitting model against a baseline model.

[1] Interactions were always accompanied by their respective main effects and all lower order terms.
Update (August 8, 2013): There was a minor error in the original SocialDualTaskData.R file, which has now been corrected.
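A condensed sketch of the two-step procedure described above, using simulated data; the column names (rt, load, condition, trial_type, perspective, subject) are illustrative and are not the names used in SocialDualTaskData.R:

library(lme4)
library(MuMIn)
# simulated RT data from correct trials: 29 participants x all factor combinations
set.seed(1)
d <- expand.grid(subject = factor(1:29),
                 load = c("low", "high"),
                 condition = c("present", "absent"),
                 trial_type = c("1-object", "3-object"),
                 perspective = c("same", "different"))
d$rt <- rlnorm(nrow(d), meanlog = 7, sdlog = 0.3)
# global model: main and interactive effects, random intercept per participant
global <- lmer(log(rt) ~ load * condition * trial_type * perspective + (1 | subject),
               data = d, REML = FALSE, na.action = na.fail)
# automated model selection over all sub-models, ranked by AICc
sel  <- dredge(global, rank = "AICc")
best <- get.models(sel, subset = 1)[[1]]
# likelihood ratio test of the best fitting model against a baseline model
baseline <- lmer(log(rt) ~ 1 + (1 | subject), data = d, REML = FALSE)
anova(baseline, best)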
This data package contains pumping data (.txt), parameter matrices, and R code (.R, .RData) to perform bootstrapping for parameter selection for the bioclogging model development. The pumping data were collected from the Russian River Riverbank Filtration site located in Sonoma County, California, from 2010-2017 from three riverbank collection wells located alongside the study site. The pumping data are directly correlated with water table oscillations, so the code performs these correlations and simulates stochastic versions of water table oscillations. See Metadata Description.pdf for full details on dataset production. This dataset must be used with the R programming language. This dataset and R code are associated with the publication "Influence of Hydrological Perturbations and Riverbed Sediment Characteristics on Hyporheic Zone Respiration of CO2 and N2". This research was supported by the Jane Lewis Fellowship from the University of California, Berkeley, the Sonoma County Water Agency (SCWA), the Roy G. Post Foundation Scholarship, the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) Program, the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-AC02-05CH11231, and the UFZ-Helmholtz Centre for Environmental Research, Leipzig, Germany.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.
An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again
If you use this dataset please cite the related article.
The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.
The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe's upper limit. It only has two tables: submissions and comments. It should be noted that content IDs are stored in base 10 (numeric integer), unlike the original base 36 (alphanumeric) IDs used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.
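For example, in R the base-36 Reddit/Pushshift IDs can be converted to the base-10 IDs used in these databases, and back again, as follows (the example ID is hypothetical):

# base-36 (alphanumeric) -> base-10 (integer)
id10 <- strtoi("c0x1ab", base = 36)
id10
# base-10 -> base-36, using a small helper (base R has no built-in inverse of strtoi)
to_base36 <- function(n) {
  if (n == 0) return("0")
  digits <- c(0:9, letters)
  out <- character(0)
  while (n > 0) {
    out <- c(digits[n %% 36 + 1], out)
    n <- n %/% 36
  }
  paste(out, collapse = "")
}
to_base36(id10)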
The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wide (i.e., both within and outside the subreddit) during the dataset timeframe. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.
The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.
A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy scores collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored in a Likert-like manner. One can associate submissions with MBFC scores by joining on the domain column.
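A sketch of such a join with RSQLite and dplyr, assuming both tables expose a 'domain' column as described above; the 'id' column name in the submissions table is an assumption:

library(DBI)
library(RSQLite)
library(dplyr)
con <- dbConnect(SQLite(), "the_donald.sqlite")
submissions <- dbGetQuery(con, "SELECT id, domain FROM submissions")
dbDisconnect(con)
mbfc <- read.csv("mbfc_scores.csv", stringsAsFactors = FALSE)
# attach bias and factual-reporting scores to each submission by domain
subs_scored <- left_join(submissions, mbfc, by = "domain")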
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the iid assumption is unrealistic, and fitting a single graphical model to all data results in a network that does not accurately represent the between-group differences. In addition, real-world observational data are typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, which leaves the model unable to adequately recover the underlying graph structure. Both problems are solved by fitting a different graph for each group, applying the fused group penalty to fuse similar graphs together, and by treating the observed data as transformed latent Gaussian data, respectively. The proposed model outperforms related models on learning partial correlations in a simulation study. Finally, the proposed model is applied to real on-farm maize yield data, showcasing the added value of the proposed method in generating new production-ecological hypotheses. An R package containing the proposed methodology can be found on https://CRAN.R-project.org/package=heteromixgm. Supplementary materials for this article are available online.
NaiveBayes_R.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given recidivism (P(x_ij | R)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
- P(Xi|R): This tab contains probabilities of feature attributes among recidivated offenders.
- NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
- Recidivated_Train: This tab contains re-coded features of recidivated offenders.
- Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|R) tab.

NaiveBayes_NR.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given non-recidivism (P(x_ij | N)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
- P(Xi|N): This tab contains probabilities of feature attributes among non-recidivated offenders.
- NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
- NonRecidivated_Train: This tab contains re-coded features of non-recidivated offenders.
- Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given non-recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|N) tab.

Training_LnTransformed.xlsx: Figures in each cell are log-transformed ratios of probabilities in NaiveBayes_R.xlsx (P(Xi|R)) to the probabilities in NaiveBayes_NR.xlsx (P(Xi|N)).

TestData.xlsx: This Excel file includes the following tabs based on the test data: P(Xi|R), P(Xi|N), NIJ_Recoded, and Test_LnTransformed (log-transformed P(Xi|R)/P(Xi|N)).

Training_LnTransformed.dta: We transformed Training_LnTransformed.xlsx to a Stata data set, using the Stat/Transfer 13 software package to transfer the file format.

StataLog.smcl: This file includes the results of the logistic regression analysis. Both the estimated intercept and the coefficient estimates in this Stata log correspond to the raw weights and standardized weights in Figure 1.

Brier Score_Re-Check.xlsx: This Excel file recalculates the Brier scores of the Relaxed Naïve Bayes Classifier in Table 3, showing evidence that the results displayed in Table 3 are correct.

Full list: NaiveBayes_R.xlsx; NaiveBayes_NR.xlsx; Training_LnTransformed.xlsx; TestData.xlsx; Training_LnTransformed.dta; StataLog.smcl; Brier Score_Re-Check.xlsx; Data for Weka (Training Set): Bayes_2022_NoID; Data for Weka (Test Set): BayesTest_2022_NoID; Weka output for machine learning models (Conventional naïve Bayes, AdaBoost, Multilayer Perceptron, Logistic Regression, and Random Forest).
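The core computation embedded in these spreadsheets, conditional feature probabilities and their log-transformed ratios, can also be expressed compactly in R; a minimal sketch with an invented binary feature, not the NIJ data themselves:

# toy training data: one categorical feature and a recidivism indicator
train <- data.frame(gender = c("M", "M", "F", "M", "F", "F"),
                    recid  = c(1, 1, 0, 0, 1, 0))
# P(x | R) and P(x | N) for each attribute of the feature
p_given_R <- prop.table(table(train$gender[train$recid == 1]))
p_given_N <- prop.table(table(train$gender[train$recid == 0]))
# log-transformed ratio used as the feature value in the downstream logistic regression
log_ratio <- log(p_given_R / p_given_N)
log_ratio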
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fisheries management is generally based on age-structured models. Thus, fish ageing data are collected by experts who analyze and interpret calcified structures (scales, vertebrae, fin rays, otoliths, etc.) according to a visual process. The otolith, in the inner ear of the fish, is the most commonly used calcified structure because it is metabolically inert and historically one of the first proxies developed. It contains information throughout the whole life of the fish and provides age structure data for stock assessments of all commercial species. The traditional human reading method to determine age is very time-consuming. Automated image analysis can be a low-cost alternative method; however, the first step is the transformation of routinely taken otolith images into standardized images within a database, so that machine learning techniques can be applied to the ageing data. Otolith shape, resulting from the synthesis of genetic heritage and environmental effects, is a useful tool to identify stock units; therefore, a database of standardized images could be used for this aim. Using the routinely measured otolith data of plaice (Pleuronectes platessa; Linnaeus, 1758) and striped red mullet (Mullus surmuletus; Linnaeus, 1758) in the eastern English Channel and north-east Arctic cod (Gadus morhua; Linnaeus, 1758), a greyscale image matrix was generated from the raw images in different formats. Contour detection was then applied to identify broken otoliths, the orientation of each otolith, and the number of otoliths per image. To finalize this standardization process, all images were resized and binarized. Several mathematical morphology tools were developed from these new images to align and orient the images, placing the otoliths in the same layout for each image. For this study, we used three databases from two different laboratories covering three species (cod, plaice and striped red mullet). The method was validated for these three species and could be applied to other species for age determination and stock identification.
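The standardization steps described here (greyscale conversion, binarization, resizing, and counting otoliths per image) can be sketched in R with the imager package; imager and the file name are used purely for illustration and are not necessarily the software or data used in the study:

library(imager)
im   <- load.image("otolith_raw.jpg")               # hypothetical raw otolith image
grey <- grayscale(im)                               # greyscale image matrix
bin  <- threshold(grey)                             # binarize with an automatic threshold
std  <- resize(grey, size_x = 256, size_y = 256)    # resize to a standard frame
# connected components of the binarized image, e.g. to count otoliths per image
parts <- split_connected(bin)
length(parts)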
https://spdx.org/licenses/etalab-2.0.html
Raw data and preparation R scripts of the firmness analysis on apple sticks (raw and cooked) studied in the Pomcuite project (INRAE TRANSFORM internal funding ANS). This dataset contains:
- the R Markdown file giving details on the raw dataset and the analysis scripts
- the PDF processed from the R Markdown
- the data set in CSV format
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains an R script and datasets to perform a Canonical Correspondence Analysis (CCA) in R. CCA was used to infer the underlying relationship between the benthic harmful dinoflagellate assemblages and benthic substrate characteristics, depths and irradiances. CCA is a constrained multivariate ordination technique that extracts major gradients among combinations of explanatory variables in a dataset and requires samples to be both random and independent. Cell abundance data were Hellinger-transformed prior to CCA to ensure the data met the statistical assumptions of normality and linearity. The analysis was performed using the vegan package. The significance of the variation in benthic harmful dinoflagellate assemblages explained by the explanatory variables was tested using an ANOVA-like Monte Carlo permutation test as implemented in vegan.
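A minimal sketch of this workflow with vegan; the abundance and environmental tables are simulated here and stand in for the datasets in this folder:

library(vegan)
set.seed(42)
# synthetic samples x species cell-abundance matrix and environmental table
abun <- matrix(rpois(100, 5), nrow = 10,
               dimnames = list(NULL, paste0("sp", 1:10)))
env  <- data.frame(depth = runif(10, 1, 20), irradiance = runif(10, 50, 500))
abun_hel <- decostand(abun, method = "hellinger")   # Hellinger transformation
cca_fit  <- cca(abun_hel ~ ., data = env)           # constrained ordination
anova(cca_fit, permutations = 999)                  # ANOVA-like Monte Carlo permutation test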
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Department of Planning and Environment – Water is working to make models and data publicly available. These can be grouped into three high-level categories:

1) Climate Data: The fundamental input for water models is climate data in the form of daily rainfall and potential evapotranspiration. This data is input to water models of varying types, purposes, and complexity. The water models transform this input data to produce a range of water-related modelled data. There are three sub-categories of climate data used in our water models:

- Observed data: The observed data is downloaded from the SILO database (https://www.longpaddock.qld.gov.au/silo/), which has data from 1889 to the present based on recorded rainfall at thousands of locations, plus derived data for various evapotranspiration data sets. We use patched-point rainfall, Morton's wet area potential evapotranspiration, and Morton's lake evaporation from SILO.

- Stochastic data: The stochastic data are 10,000-year daily data sets of rainfall and potential evapotranspiration generated using observed data sets combined with palaeo-climate data. This work has been undertaken by researchers at the University of Adelaide and the University of Newcastle and used in Regional Water Strategies.

- Stochastic data perturbed by results from climate models for projected greenhouse gas emission scenarios: The climate change perturbed data are 10,000-year daily data sets of rainfall and potential evapotranspiration developed by combining the stochastic data with changes in climate based on results of the NARCliM climate models that show the greatest reductions in rainfall. Note: The Department does not own the IP of NARCliM products to release any climate data (such as perturbed stochastic data) with NARCliM climate projections. NARCliM data is available in the public domain for users to download directly, such as from https://climatedata-beta.environment.nsw.gov.au/

2) Water Models: (Not yet released) There are three subcategories of water models that we develop and maintain, with catchment models as the fundamental unit. These can be linked to form pre-development models of river systems, which are further developed by adding water infrastructure, demands, and management arrangements to form a full unregulated or regulated river system model.

3) Modelled Data: (Partially released) The dataset comprises the outcomes generated by water models, encompassing a comprehensive array of findings pertaining to various aspects of the water balance. These findings encompass, but are not restricted to, factors such as flow, diversions, water storage, and allocations, with an initial emphasis on flow.
This dataset is a transformation of Greg Kolodziejzyk's remote viewing data (see Related datasets below). Greg used a "rapid-fire" technique whereby several short free-response remote viewing trials were completed in a single session. The trial-level data was transformed by Adrian Ryan into session-level Z-scores by exact binomial, in order that the data could be combined with those from other experiments, for the analyses reported here: Ryan, A., & Spottiswoode, J. (2015). Variation of ESP by Season, Local Sidereal Time, and Geomagnetic Activity. Extrasensory Perception, 377-394.
Three files are included:
- Data file
- Code book
- Transform Procedure: the R script used to transform the original trial-level data into session-level data
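One way to perform such a transformation in R (a sketch of the general idea, not Adrian Ryan's actual script; the hit count, number of trials, and chance probability are placeholders):

# exact binomial p-value for a session's hit count, converted to an equivalent Z-score
session_z <- function(hits, n, p0) {
  p_upper <- pbinom(hits - 1, size = n, prob = p0, lower.tail = FALSE)  # P(X >= hits)
  qnorm(p_upper, lower.tail = FALSE)                                    # Z with the same upper-tail probability
}
session_z(hits = 12, n = 40, p0 = 0.25)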
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ZIP folder containing data and script from Benson et al. (2021) "Reconstructed evolutionary patterns for crocodile-line archosaurs demonstrate impact of failure to log-transform body size data" Communications Biology.Contains almost all data required to replicate analyses of the paper using BayesTraits and R version 4.0.3, including the R packages ape 5.0 (Paradis & Schliep 2019) and caper 1.0.1 (Orme et al. 2018).
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The fundamental input data of work undertaken by the Water Modelling Team is climate data in the form of daily rainfall and potential evapotranspiration. This data is input to water models of varying types, purposes, and complexity. The water models transform this input data to produce a range of water-related modelled data.

The stochastic climate data and palaeo stochastic climate data include 10,000 replicates of 130-year daily data sets of rainfall and potential evapotranspiration generated using observed data sets without and with combined palaeo climate data. This work has been undertaken by researchers at the University of Newcastle and used in modelling for the Greater Sydney Water Strategy.

Stochastic climate data and palaeo stochastic climate data are available to download for the Greater Sydney region from the Related Datasets section below.

Note: If you would like to ask a question, make any suggestions, or tell us how you are using this dataset, please visit the NSW Water Hub, which has an online forum you can join.
Measuring natural selection through the use of multiple regression has transformed our understanding of selection, although the methods used remain sensitive to the effects of multicollinearity due to highly correlated traits. While measuring selection on principal component scores is an apparent solution to this challenge, this approach has been heavily criticized due to difficulties in interpretation and relating PC axes back to the original traits. We describe and illustrate how to transform selection gradients for PC scores back into selection gradients for the original traits, addressing issues of multicollinearity and biological interpretation. In addition to reducing multicollinearity, we suggest that this method may have promise for measuring selection on high-dimensional data such as volatiles or gene expression traits. We demonstrate this approach with empirical data and examples from the literature, highlighting how selection estimates for PC scores can be interpreted while r...
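The back-transformation itself is a single matrix multiplication; a sketch of the general idea (not the authors' code) using prcomp and lm on simulated traits:

set.seed(1)
# simulated correlated traits and relative fitness
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.2)            # induce multicollinearity
w <- 1 + 0.3 * X[, 1] + rnorm(100, sd = 0.5)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
# selection gradients estimated on PC scores
beta_pc <- coef(lm(w ~ pca$x))[-1]
# rotate the PC-score gradients back to gradients for the original (standardised) traits
beta_traits <- pca$rotation %*% beta_pc
beta_traits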
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
iSDAsoil dataset soil stone content / coarse fragments log-transformed predicted at 30 m resolution for 0–20 and 20–50 cm depth intervals. Data has been projected in WGS84 coordinate system and compiled as COG. Predictions have been generated using multi-scale Ensemble Machine Learning with 250 m (MODIS, PROBA-V, climatic variables and similar) and 30 m (DTM derivatives, Landsat, Sentinel-2 and similar) resolution covariates. For model training we use a pan-African compilations of soil samples and profiles (iSDA points, AfSPDB, and other national and regional soil datasets). Cite as:
Hengl, T., Miller, M.A.E., Križan, J. et al. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Sci Rep 11, 6130 (2021). https://doi.org/10.1038/s41598-021-85639-y
To open the maps in QGIS and/or directly compute with them, please use the Cloud-Optimized GeoTIFF version.
Layer description:
Model errors were derived using bootstrapping: md is derived as standard deviation of individual learners from 5-fold cross-validation (using spatial blocking). The model 5-fold cross-validation (mlr::makeStackedLearner) for this variable indicates:
Variable: log.wpg2
R-square: 0.709
Fitted values sd: 1.25
RMSE: 0.803
Ensemble model (meta-learner combining the base learners):
Call:
stats::lm(formula = f, data = d)
Residuals:
Min 1Q Median 3Q Max
-4.0555 -0.3113 -0.0222 0.2378 4.5794
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.008606 1.361982 -0.006 0.995
regr.ranger 0.972265 0.004443 218.854 < 2e-16 ***
regr.xgboost 0.034649 0.006404 5.411 6.3e-08 ***
regr.cubist 0.069589 0.005229 13.308 < 2e-16 ***
regr.nnet -0.012756 0.796535 -0.016 0.987
regr.cvglmnet -0.056645 0.005509 -10.283 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8032 on 92785 degrees of freedom
Multiple R-squared: 0.7092, Adjusted R-squared: 0.7092
F-statistic: 4.525e+04 on 5 and 92785 DF, p-value: < 2.2e-16
To back-transform values (y) to % use the following formula:
% = expm1( y / 10 )
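In R this back-transformation is simply expm1(y / 10); the raster example below uses the terra package and a hypothetical local file name for the downloaded GeoTIFF:

# back-transform a vector of predicted values to stone content in %
y <- c(5, 12, 23)
expm1(y / 10)
# back-transform a whole raster layer
library(terra)
r <- rast("isdasoil_log_wpg2_0_20cm.tif")   # hypothetical local copy of the COG layer
r_pct <- app(r, function(y) expm1(y / 10))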
To submit an issue or request support please visit https://isda-africa.com/isdasoil