100+ datasets found

f
Comparison of the Predictive Performance and Interpretability of Random...
acs.figshare.com
figshare.com
zip
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley (2023). Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [Dataset]. http://doi.org/10.1021/acs.jcim.6b00753.s006
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.6b00753.s006
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
Z
Simulation Data & R scripts for: "Introducing recurrent events analyses to...
data.niaid.nih.gov
Updated Apr 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ferry, Nicolas (2024). Simulation Data & R scripts for: "Introducing recurrent events analyses to assess species interactions based on camera trap data: a comparison with time-to-first-event approaches" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11085005
Explore at:
Dataset updated
Apr 29, 2024
Dataset authored and provided by
Ferry, Nicolas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Files descriptions:

All csv files refer to results from the different models (PAMM, AARs, Linear models, MRPPs) on each iteration of the simulation. One row being one iteration. "results_perfect_detection.csv" refers to the results from the first simulation part with all the observations."results_imperfect_detection.csv" refers to the results from the first simulation part with randomly thinned observations to mimick imperfect detection.

ID_run: identified of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).PAMM30: p-value of the PAMM running on the 30-days survey.PAMM7: p-value of the PAMM running on the 7-days survey.AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.AAR2: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).Karanth_permA: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). Karanth_permB: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021). MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).

"results_int_dir_perf_det.csv" refers to the results from the second simulation part, with all the observations."results_int_dir_imperf_det.csv" refers to the results from the second simulation part, with randomly thinned observations to mimick imperfect detection.ID_run: identified of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of A on B.p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of B on A.AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.AAR2_BAB: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.AAR2_ABA: ratio value for the Avoidance-Attraction-Ratio calculating ABA/AA.Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).Karanth_permA: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). Karanth_permB: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021). MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).

Scripts files description:1_Functions: R script containing the functions: - MRPP from Karanth et al. (2017) adapted here for time efficiency. - MRPP from Murphy et al. (2021) adapted here for time efficiency. - Version of the ct_to_recurrent() function from the recurrent package adapted to process parallized on the simulation datasets. - The simulation() function used to simulate two species observations with reciprocal effect on each other.2_Simulations: R script containing the parameters definitions for all iterations (for the two parts of the simulations), the simulation paralellization and the random thinning mimicking imperfect detection.3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.3_1_Real data comparison: R script containing the fit of the different models tested on the real data example from Murphy et al. 2021.4_Graphs: R script containing the code for plotting results from the simulation part and appendices.5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing Karanth et al. (2017) and Murphy et al. (2021) codes lines and the adapted version for time-efficiency matter and a comparison to verify similarity of results.5_2_Appendix - Multi-response procedure permutation difference: R script containing R code to test for difference of the MRPPs approaches according to the species on which permutation are done.
Data from: Comparison Data Sets for Benchmarking QSAR Methodologies in Lead...
figshare.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruchi R. Mittal; Ross A. McKinnon; Michael J. Sorich (2023). Comparison Data Sets for Benchmarking QSAR Methodologies in Lead Optimization [Dataset]. http://doi.org/10.1021/ci900117m.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/ci900117m.s001
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Ruchi R. Mittal; Ross A. McKinnon; Michael J. Sorich
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
2D and 3D QSAR techniques are widely used in lead optimization-like processes. A compilation of 40 diverse data sets is described. It is proposed that these can be used as a common benchmark sample for comparisons of QSAR methodologies, primarily in terms of predictive ability. Use of this benchmark set will be useful for both assessment of new methods and for optimization of existing methods.
d
Data from: R Manual for QCA
search.dataone.org
dataverse.harvard.edu
Updated Nov 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mello, Patrick A. (2023). R Manual for QCA [Dataset]. http://doi.org/10.7910/DVN/KYF7VJ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/KYF7VJ
Dataset updated
Nov 17, 2023
Dataset provided by
Harvard Dataverse
Authors
Mello, Patrick A.
Description
The R Manual for QCA entails a PDF file that describes all the steps and code needed to prepare and conduct a Qualitative Comparative Analysis (QCA) study in R. This is complemented by an R Script that can be customized as needed. The dataset further includes two files with sample data, for the set-theoretic analysis and the visualization of QCA results. The R Manual for QCA is the online appendix to "Qualitative Comparative Analysis: An Introduction to Research Design and Application", Georgetown University Press, 2021.
e
Optimization and Evaluation Datasets for PiMine - Dataset - B2FIND
b2find.eudat.eu
Updated Jun 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Optimization and Evaluation Datasets for PiMine - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bf8b9da8-6bb8-580d-99fe-bcebea0d4d3f
Explore at:
Dataset updated
Jun 7, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1] The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities. In addition, we added the results of the case studies analyzed in [1] to enable readers to follow the discussion and investigate the results individually. Data Set description: The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine. The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated. The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when the interfaces of apparently unrelated chains are available only. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains which can be used for alignment performance assessments. Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities. References: [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted) [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012. [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265. [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055. This work was supported by the German Federal Ministry of Education and Research as part of CompLS and de.NBI [031L0172, 031L0105]. C.E. is funded by Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter (Grant-ID: HIDSS-0002).
Z
Benchmark Multi-Omics Datasets for Methods Comparison
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Nov 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wang, Lily (2021). Benchmark Multi-Omics Datasets for Methods Comparison [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5683001
Explore at:
Dataset updated
Nov 14, 2021
Dataset provided by
Odom, Gabriel
Wang, Lily
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pathway Multi-Omics Simulated Data

These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".

There are 100 sets (stored as 100 sub-folders, the first 50 in "pt1" and the second 50 in "pt2") of random modifications to centred and scaled copy number, gene expression, and proteomics data saved as compressed data files for the R programming language. These data sets are stored in subfolders labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents: 1) "indicatorMatricesXXX_ls.RDS" is a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...), (2) "CNV_partitionA_deltaB.RDS" is the synthetically modified copy number variation data (where A represents the proportion of genes in each gene set to receive the synthetic treatment [partition 1 is 20%, 2 is 40%, 3 is 60% and 4 is 80%] and B is the signal strength in units of standard deviations), (3) "RNAseq_partitionA_deltaB.RDS" is the synthetically modified gene expression data (same parameter legend as CNV), and (4) "Prot_partitionA_deltaB.RDS" is the synthetically modified protein expression data (same parameter legend as CNV).

Supplemental Files

The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study in Gene Matrix Transpose format. Scripts to create and analyze these data sets available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement
d
Data from: A preliminary comparison of a songbird's song repertoire size and...
datadryad.org
search.dataone.org
zip
Updated Feb 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dustin Brewer; Adam Fudickar (2023). A preliminary comparison of a songbird's song repertoire size and other song measures between an urban and a rural site [Dataset]. http://doi.org/10.5061/dryad.h70rxwdkv
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.h70rxwdkv
Dataset updated
Feb 2, 2023
Dataset provided by
Dryad
Authors
Dustin Brewer; Adam Fudickar
Time period covered
Jan 19, 2022
Description
Characteristics of birdsong, especially minimum frequency, have been shown to vary for some species between urban and rural populations and along urban-rural gradients. However, few urban-rural comparisons of song complexity—and none that we know of based on the number of distinct song types in repertoires—have occurred. Given the potential ability of song repertoire size to indicate bird condition, we primarily sought to determine if number of distinct song types displayed by Song Sparrows (Melospiza melodia) varied between an urban and a rural site. We determined song repertoire size of 24 individuals; 12 were at an urban (‘human-dominated’) site and 12 were at a rural (‘agricultural’) site. Then, we compared song repertoire size, note rate, and peak frequency between these sites. Song repertoire size and note rate did not vary between our human-dominated and agricultural sites. Peak frequency was greater at the agricultural site. Our finding that peak frequency was higher at the agri...
f
ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1
figshare.com
application/gzip
Updated Jun 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1 [Dataset]. http://doi.org/10.6084/m9.figshare.12478571.v2
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12478571.v2
Dataset updated
Jun 29, 2023
Dataset provided by
figshare
Authors
Massimo Andreatta; Santiago Carmona
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from Array-Express. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3. For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of draws, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction=”umap”, k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under- expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat3 using respectively the prcomp function from basic R package “stats”, and the “umap” R package (https://github.com/tkonopka/umap).
r
Data from: Supplementary tables:MetaFetcheR: An R package for complete...
researchdata.se
Updated Jun 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti (2024). Supplementary tables:MetaFetcheR: An R package for complete mapping of small compound data [Dataset]. http://doi.org/10.57804/7sf1-fw75
Explore at:
(78625), (728116)Available download formats
Unique identifier
https://doi.org/10.57804/7sf1-fw75
Dataset updated
Jun 24, 2024
Dataset provided by
Uppsala University
Authors
Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti
Description
The dataset includes a PDF file containing the results and an Excel file with the following tables:

Table S1 Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al. Table S2 Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al. Table S3 Results of comparing the performance of MetaFetcheR to MetaboAnalyst 5.0 webtool using Diamanti et al. Table S4 Results of comparing the performance of MetaFetcheR to MetaboAnalyst 5.0 webtool for Priolo et al. Table S5 Data quality test results for running 100 iterations on HMDB database. Table S6 Data quality test results for running 100 iterations on KEGG database. Table S7 Data quality test results for running 100 iterations on ChEBI database. Table S8 Data quality test results for running 100 iterations on PubChem database. Table S9 Data quality test results for running 100 iterations on LIPID MAPS database. Table S10 The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al. Table S11 An example of an input matrix for MetaFetcheR. Table S12 Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al. Table S13 Data set from Diamanti et al. Table S14 Data set from Priolo et al. Table S15 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al. Table S16 Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al. Table S17 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al. Table S18 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al. (See the "index" tab in the Excel file for more information)

Small-compound databases contain a large amount of information for metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Lack of preventive establishment of means of data access at the infant stages of a project might lead to mislabelled compounds, reduced statistical power and large delays in delivery of results.

We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.

The dataset was originally published in DiVA and moved to SND in 2024.
H
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale...
dataverse.harvard.edu
Updated Jul 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Georgios Boumis; Brad Peter (2024). Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends [Dataset]. http://doi.org/10.7910/DVN/ZZDYM9
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ZZDYM9
Dataset updated
Jul 8, 2024
Dataset provided by
Harvard Dataverse
Authors
Georgios Boumis; Brad Peter
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart—that cell represents the slope for the temporal range 2004–2014. This publication entry also includes an excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624 TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006). TSMx R script: # import packages library(dplyr) library(readr) library(ggplot2) library(tibble) library(tidyr) library(forcats) library(Kendall) options(warn = -1) # disable warnings # read data (.csv file with "Year" and "Value" columns) data <- read_csv("EVI.csv") # prepare row/column names for output matrices years <- data %>% pull("Year") r.names <- years[-length(years)] c.names <- years[-1] years <- years[-length(years)] # initialize output matrices sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years)) pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years)) slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years)) # function to return remaining years given a start year getRemain <- function(start.year) { years <- data %>% pull("Year") start.ind <- which(data[["Year"]] == start.year) + 1 remain <- years[start.ind:length(years)] return (remain) } # function to subset data for a start/end year combination splitData <- function(end.year, start.year) { keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year) batch <- data[keep,] return(batch) } # function to fit linear regression and return slope direction fitReg <- function(batch) { trend <- lm(Value ~ Year, data = batch) slope <- coefficients(trend)[[2]] return(sign(slope)) } # function to fit linear regression and return slope magnitude fitRegv2 <- function(batch) { trend <- lm(Value ~ Year, data = batch) slope <- coefficients(trend)[[2]] return(slope) } # function to implement Mann-Kendall (MK) trend test and return significance # the test is implemented only for n>=8 getMann <- function(batch) { if (nrow(batch) >= 8) { mk <- MannKendall(batch[['Value']]) pval <- mk[['sl']] } else { pval <- NA } return(pval) } # function to return slope direction for all combinations given a start year getSign <- function(start.year) { remaining <- getRemain(start.year) combs <- lapply(remaining, splitData, start.year = start.year) signs <- lapply(combs, fitReg) return(signs) } # function to return MK significance for all combinations given a start year getPval <- function(start.year) { remaining <- getRemain(start.year) combs <- lapply(remaining, splitData, start.year = start.year) pvals <- lapply(combs, getMann) return(pvals) } # function to return slope magnitude for all combinations given a start year getMagn <- function(start.year) { remaining <- getRemain(start.year) combs <- lapply(remaining, splitData, start.year = start.year) magns <- lapply(combs, fitRegv2) return(magns) } # retrieve slope direction, MK significance, and slope magnitude signs <- lapply(years, getSign) pvals <- lapply(years, getPval) magns <- lapply(years, getMagn) # fill-in output matrices dimension <- nrow(sign.matrix) for (i in 1:dimension) { sign.matrix[i, i:dimension] <- unlist(signs[i]) pval.matrix[i, i:dimension] <- unlist(pvals[i]) slope.matrix[i, i:dimension] <- unlist(magns[i]) } sign.matrix <-...
o
Crop classification dataset for testing domain adaptation or distributional...
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Mar 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dan M. Kluger; Sherrie Wang; David B. Lobell (2022). Crop classification dataset for testing domain adaptation or distributional shift methods [Dataset]. http://doi.org/10.5281/zenodo.6376159
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6376159
Dataset updated
Mar 22, 2022
Authors
Dan M. Kluger; Sherrie Wang; David B. Lobell
Description
In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf). In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing for testing and comparing methods that are designed to address such distributional shifts. More details on the dataset and processing steps can be found in Kluger et. al. (2021). Much of the processing steps were taken to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et. al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload. All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) is described in the file "Metadata.rtf". Preferred Citation: -Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488. -URL to this Zenodo post https://zenodo.org/record/6376160 {"references": ["Kluger D.M., Wang S., and Lobell, D.B. (2021). Two shifts for crop mapping: leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sensing of Environment. 262, 112488", "C. Yeh, C. Meng, S. Wang, A. Driscoll, E. Rozi, P. Liu, J. Lee, M. Burke, D. Lobell, and S. Ermon, "SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning," in Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (Round 2), Dec. 2021."]}
d
Datasets for Comparison of Surrogate Models to Estimate Pesticide...
catalog.data.gov
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Datasets for Comparison of Surrogate Models to Estimate Pesticide Concentrations at Six U.S. Geological Survey National Water Quality Network Sites During Water Years 2013–2018 [Dataset]. https://catalog.data.gov/dataset/datasets-for-comparison-of-surrogate-models-to-estimate-pesticide-concentrations-at-six-u-
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data release is comprised of data tables of input variables for seawaveQ and surrogate models used to predict concentrations of select pesticides at six U.S. Geological Survey National Water Quality Network (NWQN) river sites (Fanno Creek at Durham, Oregon; White River at Hazleton, Indiana; Kansas River at DeSoto, Kansas; Little Arkansas River near Sedgwick, Kansas; Missouri River at Hermann, Missouri; Red River of the North at Grand Forks, North Dakota). Each data table includes discrete concentrations of one select pesticide (Atrazine, Azoxystrobin, Bentazon, Bromacil, Imidacloprid, Simazine, or Triclopyr) at one of the NWQN sites; daily mean streamflow; 30-day and 1-day flow anomalies; daily median values of pH and turbidity; daily mean values of dissolved oxygen, specific conductance, and water temperature; and 30-day and 1-day anomalies for pH, turbidity, dissolved oxygen, specific conductance, and water temperature. Two pesticides were modeled at each site with three types of regression models. Also included is a zip file with outputs from seawaveQ model summary. The processes for retrieving and preparing data for regression models followed those outlined in the SEAWAVE-Q R package documentation (Ryberg and Vecchia, 2013; Ryberg and York, 2020). The R package waterData (Ryberg and Vecchia, 2012) was used to import daily mean values for discharge and either daily mean or daily median values for continuous water-quality constituents directly into R depending on what data were available at each site. Pesticide concentration, streamflow, and surrogate data (continuously measured field parameters) were imported from and are available online from the USGS National Water Information System database (USGS, 2020). The waterData package was used to screen for missing daily mean discharge values (no missing values were found for the sites) and to calculate short-term (1 day) and mid-term (30 day) anomalies for flow and short-term anomalies (1 day) for each water-quality variable. A mid-term streamflow anomaly, for instance, is the deviation of concurrent daily streamflow from average conditions for the previous 30 days (Vecchia and others, 2008). Anomalies were calculated as additional potential model variables. Pesticide concentrations for select constituents from each site were pulled into R using the dataRetrieval package (De Cicco and others, 2018). Three of the six sites (Kansas River at DeSoto, Kansas; Missouri River at Hermann, Missouri; and White River at Hazleton, Indiana) pulled pesticide data for WY 2013–17 whereas the other three sites (Fanno Creek at Durham, Oregon; Little Arkansas River near Sedgwick, Kansas; and Red River of the North at Grand Forks, North Dakota) pulled pesticide data for WY 2013–18. Discrete pesticide data were matched with daily mean discharge and daily mean or median water-quality constituents and the associated calculated short-term (1-day) and mid-term (30-day) anomalies from the date of sampling. Pesticide concentrations were estimated using the SEAWAVE-Q (with surrogates) model using 19 combinations of surrogate variables (table 2 in the associated SIR, "Comparison of Surrogate Models to Estimate Pesticide Concentrations at Six U.S. Geological Survey National Water Quality Network Sites During Water Years 2013–18.") at each of 12 site-pesticide combinations (table 3 in the associated SIR). Three measures of model performance—the generalized coefficient of determination (R2), Akaike’s Information Criteria (AIC), and scale—were included in the output and used to select best-fit models (Table 4 of the associated SIR). The three to four best-fit SEAWAVE-Q (with surrogates) models with sample sizes at least five times the number of variables were selected for each site-pesticide combination based on generalized R2 values—the higher, the better. If generalized R2 values were the same, the model with the lower AIC value was used. The standard surrogate regression and base SEAWAVE-Q models were then applied using the same samples that were used for each of the best-fit SEAWAVE-Q (with surrogates) models so that direct comparisons could be made for each site-pesticide-surrogate instance. The input data used to estimate daily pesticide concentrations for each of the best fit models have been included in this data release. An example of one output file for each model type is included in a .zip file named "output_examples.zip". Each of the output files shows the three measures of model performance. (1) The output file for the standard regression model named "HAZ8_Atrazine_Standard_Regression_Output.txt" includes: Pseudo R-square (Allison) of 0.631, Model AIC of 174.0232, and a Scale of 0.961. (2) The output file for the base SEAWAVE-Q model named "HAZ8_Atrazine_Base_Seawave-Q_Output.txt" includes: Generalized r-squared of 0.82, AIC (Akaike's An Information Criterion) of 36.38, and a Scale of 0.288. (3) The output file for the SEAWAVE-Q w/Surrogates model named "HAZ8_Atrazine_Seawave-Q_w_Surrogates_Output.txt" includes: Generalized r-squared of 0.85, AIC (Akaike's An Information Criterion) of 33.76, and a Scale of 0.268. These values match those for Site ID = HAZ, Pesticide = Atrazine, and Surrogate variable group 8 for each model type in Table 4 of the associated SIR.
Global Thermosaliniograph Dataset
zenodo.org
bin
Updated Jul 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kyla Drushka; Kyla Drushka; Caitlin Whalen; Caitlin Whalen (2024). Global Thermosaliniograph Dataset [Dataset]. http://doi.org/10.5281/zenodo.12584849
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.12584849
Dataset updated
Jul 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Kyla Drushka; Kyla Drushka; Caitlin Whalen; Caitlin Whalen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 2020
Description
%%%%%%%%%%%%%%%%%%
Global Thermosaliniograph Dataset
%%%%%%%%%%%%%%%%%%
The dataset tsg_2020.mat is an updated version of the compiled ship thermosalinograph dataset that was originally used in Drushka et al. (2019). Details below.

%%%%%%%%
Variable Names
%%%%%%%%
t = time in matlab datenum format
T = temperature
S = salinity
x = longitude
y = latitude
callsign = ship callsign, numbers correspond to call signs in call_sign.m
dataset = number corresponds with the names of the original dataset (see list below)

%%%%%%%%%%%%
Original Dataset Names
%%%%%%%%%%%%
2: GOSUD database
4: SSS-OS database
5: PANGAEA database - R/V Polarstern
6: R/V Survostral
9: SAMOS database
10: PANGAEA database - R/V Poseidon
11: AODN database
12: M/V Oleander
15: Cruise data provided by Sophie Clayton
16: SAMOS database (files not QC'dby SAMOS, QC'd as described below)
22: JAMSTEC database
24: SOCAT 2020 database
25: LEGOS/SSSOS (update from dataset 4)
26: SAMOS database (update)
27: GOSUD database (update)

%%%%%%%%%%%%%%%
Updated Version Description
%%%%%%%%%%%%%%%
The following describes what has been updated compared to the data used in Drushka et al. (2019).

1. SOCAT: https://www.socat.info/index.php/data-access/
- Data up to 2019 included. Basic QC performed by comparing to Argo data and discarding outliers.

2. SSS-OS (LEGOS): http://sss.sedoo.fr/
- Data up to 12/2019 included. Salinity data QC’d by LEGOS; basic QC performed on temperature by comparing to Argo data and discarding outliers.

3. SAMOS https://samos.coaps.fsu.edu
- Data flatted as “good data with 0-5% flagged” " up to 12/2019 included

4. GOSUD - http://www.gosud.org/
- Delayed-mode data up to 12/2019 included
d
HUN Comparison of model variability and interannual variability
data.gov.au
researchdata.edu.au
Updated Nov 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2019). HUN Comparison of model variability and interannual variability [Dataset]. https://data.gov.au/data/dataset/activity/1c0a19f9-98c2-4d92-956d-dd764aaa10f9
Explore at:
Dataset updated
Nov 20, 2019
Dataset provided by
Bioregional Assessment Program
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

These are the data summarising the modelled Hydrological Response Variable (HRV) variability versus climate interannual variability which has been used as an indicator of risk. For example, to understand the significance of the modelled increases in low-flow days, it is useful to look at them in the context of the interannual variability in low-flow days due to climate. In other words, are the modelled increases due to additional coal resource development within the natural range of variability of the longer-term flow regime, or are they potentially moving the system outside the range of hydrological variability it experiences under the current climate? The maximum increase in the number of low-flow days due to additional coal resource development relative to the interannual variability in low-flow days under the baseline has been adopted to put some context around the modelled changes. If the maximum change is small relative to the interannual variability due to climate (e.g. an increase of 3 days relative to a baseline range of 20 to 50 days), then the risk of impacts from the changes in low-flow days is likely to be low. If the maximum change is comparable to or greater than the interannual variability due to climate (e.g. an increase of 200 days relative to a baseline range of 20 to 50 days), then there is a greater risk of impact on the landscape classes and assets that rely on this water source. Here changes comparable to or greater than interannual variability are interpreted as presenting a risk. However, the change due to the additional coal resource development is additive, so even a 'less than interannual variability' change is not free from risk. Results of the interannual variability comparison should be viewed as indicators of risk.

Dataset History

This dataset is generated using 1000 HRV simulations together with climate inputs. Ratios between the variability in HRVs, and the variability attributable interannual variability due to climate, were calculated for the HRVs. Results of the interannual variability comparison should be viewed as indicators of risk.

Dataset Citation

Bioregional Assessment Programme (2017) HUN Comparison of model variability and interannual variability. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/1c0a19f9-98c2-4d92-956d-dd764aaa10f9.

Dataset Ancestors

Derived From River Styles Spatial Layer for New South Wales

Derived From SYD ALL climate data statistics summary

Derived From HUN AWRA-R Observed storage volumes Glenbawn Dam and Glennies Creek Dam

Derived From Hunter River Salinity Scheme Discharge NSW EPA 2006-2012

Derived From HUN AWRA-R simulation nodes v01

Derived From Bioregional Assessment areas v06

Derived From Hunter AWRA Hydrological Response Variables (HRV)

Derived From GEODATA 9 second DEM and D8: Digital Elevation Model Version 3 and Flow Direction Grid 2008

Derived From HUN AWRA-L simulation nodes_v01

Derived From Bioregional Assessment areas v04

Derived From HUN AWRA-R Gauge Station Cross Sections v01

Derived From Gippsland Project boundary

Derived From Natural Resource Management (NRM) Regions 2010

Derived From BA All Regions BILO cells in subregions shapefile

Derived From Hunter Surface Water data v2 20140724

Derived From HUN AWRA-R River Reaches Simulation v01

Derived From HUN AWRA-L simulation nodes v02

Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)

Derived From HUN AWRA-R Irrigation Area Extents and Crop Types v01

Derived From GEODATA TOPO 250K Series 3

Derived From NSW Catchment Management Authority Boundaries 20130917

Derived From Geological Provinces - Full Extent

Derived From BA SYD selected GA TOPO 250K data plus added map features

Derived From HUN gridded daily PET from 1973-2102 v01

Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014

Derived From Bioregional Assessment areas v03

Derived From IQQM Model Simulation Regulated Rivers NSW DPI HUN 20150615

Derived From HUN AWRA-R calibration catchments v01

Derived From Bioregional Assessment areas v05

Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012

Derived From National Surface Water sites Hydstra

Derived From Selected streamflow gauges within and near the Hunter subregion

Derived From ASRIS Continental-scale soil property predictions 2001

Derived From Hunter Surface Water data extracted 20140718

Derived From Mean Annual Climate Data of Australia 1981 to 2012

Derived From HUN AWRA-R calibration nodes v01

Derived From HUN future climate rainfall v01

Derived From HUN AWRA-LR Model v01

Derived From HUN AWRA-L ASRIS soil properties v01

Derived From HUN AWRAR restricted input 01

Derived From Bioregional Assessment areas v01

Derived From Bioregional Assessment areas v02

Derived From Victoria - Seamless Geology 2014

Derived From HUN AWRA-L Site Station Cross Sections v01

Derived From HUN AWRA-R simulation catchments v01

Derived From HUN AWRA-R Simulation Node Cross Sections v01

Derived From Climate model 0.05x0.05 cells and cell centroids
d
Data from: Data and code from: Identification of a key target for...
catalog.data.gov
agdatacommons.nal.usda.gov
+1more
Updated Apr 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data and code from: Identification of a key target for elimination of nitrous oxide, a major greenhouse gas [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-identification-of-a-key-target-for-elimination-of-nitrous-oxide-a-major-c072f
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
Note: Data files will be made available upon manuscript publication This dataset contains all code and data needed to reproduce the analyses in the manuscript: IDENTIFICATION OF A KEY TARGET FOR ELIMINATION OF NITROUS OXIDE, A MAJOR GREENHOUSE GAS. Blake A. Oakley (1), Trevor Mitchell (2), Quentin D. Read (3), Garrett Hibbs (1), Scott E. Gold (2), Anthony E. Glenn (2) Department of Plant Pathology, University of Georgia, Athens, GA, USA. Toxicology and Mycotoxin Research Unit, U.S. National Poultry Research Center, United States Department of Agriculture-Agricultural Research Service, Athens, GA, USA Southeast Area, United States Department of Agriculture-Agricultural Research Service, Raleigh, NC, USA citation will be updated upon acceptance of manuscript Brief description of study aims Denitrification is a chemical process that releases nitrous oxide (N2O), a potent greenhouse gas. The NOR1 gene is part of the denitrification pathway in Fusarium. Three experiments were conducted for this study. (1) The N2O comparative experiment compares denitrification rates, as measured by N2O production, of a variety of Fusarium spp. strains with and without the NOR1 gene. (2) The N2O substrate experiment compares denitrification rates of selected strains on different growth media (substrates). For parts 1 and 2, linear models are fit comparing N2O production between strains and/or substrates. (3) The Bioscreen growth assay tests whether there is a pleiotropic effect of the NOR1 gene. In this portion of the analysis, growth curves are fit to assess differences in growth rate and carrying capacity between selected strains with and without the NOR1 gene. Code All code is included in a .zip archive generated from a private git repository on 2022-10-13 and archived as part of this dataset. The code is contained in R scripts and RMarkdown notebooks. There are two components to the analysis: the denitrification analysis (comprising parts 1 and 2 described above) and the Bioscreen growth analysis (part 3). The scripts for each are listed and described below. Analysis of results of denitrification experiments (parts 1 and 2) NOR1_denitrification_analysis.Rmd: The R code to analyze the experimental data comparing nitrous oxide emissions is all contained in a single RMarkdown notebook. This script analyzes the results from the comparative study and the substrate study. n2o_subgroup_figures.R: R script to create additional figures using the output from the RMarkdown notebook Analysis of results of Bioscreen growth assay (part 3) bioscreen_analysis.Rmd: This RMarkdown notebook contains all R code needed to analyze the results of the Bioscreen assay comparing growth of the different strains. It could be run as is. However, the model-fitting portion was run on a high-performance computing cluster with the following scripts: bioscreen_fit_simpler.R: R script containing only the model-fitting portion of the Bioscreen analysis, fit using the Stan modeling language interfaced with R through the brms and cmdstanr packages. job_bssimple.sh: Job submission shell script used to submit the model-fitting R job to be run on USDA SciNet high-performance computing cluster. Additional scripts developed as part of the analysis but that are not required to reproduce the analyses in the manuscript are in the deprecated/ folder. Also note the files nor1-denitrification.Rproj (RStudio project file) and gtstyle.css (stylesheet for formatting the tables in the notebooks) are included. Data Data required to run the analysis scripts are archived in this dataset, other than strain_lookup.csv, a lookup table of strain abbreviations and full names included in the code repository for convenience. They should be placed in a folder or symbolic link called project within the unzipped code repository directory. N2O_data_2022-08-03/N2O_Comparative_Study_Trial_(n)(date range).xlsx: These are the data from the N2O comparative study, where n is the trial number from 1-3 and date range is the begin and end date of the trial. N2O_data_2022-08-03/Nitrogen_Substrate_Study_Trial(n)(date range).xlsx: These are the data from the N2O substrate study, where n is the trial number from 1-3 and date range is the begin and end date of the trial. Outliers_NOR1_2022/Bioscreen_NOR1_Fungal_Growth_Assay(substrate)(oxygen level)_Outliers_BAO(date).xlsx: These are the raw Bioscreen data files in MS Excel format. The format of each file name includes the substrate (minimal medium with nitrite or nitrate and lysine), oxygen level (hypoxia or normoxia), and date of the run. This repository includes code to process these files, but the processed data are also included on Ag Data Commons, so it is not necessary to run the data processing portion of the code. clean_data/bioscreen_clean_data.csv: This is an intermediate output file in CSV format generated by bioscreen_analysis.Rmd. It includes all the data from the Bioscreen assays in a clean analysis-ready format.
m
Data from: A simple approach for maximizing the overlap of phylogenetic and...
figshare.mq.edu.au
borealisdata.ca
+6more
bin
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell (2023). Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data [Dataset]. http://doi.org/10.5061/dryad.5d3rq
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.5d3rq
Dataset updated
May 30, 2023
Dataset provided by
Macquarie University
Authors
Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online data bases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.

Usage Notes Land plant taxonomic lookup tableThis dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus.plant_lookup.csv
u
Data for: Climate impacts and adaptation in US dairy systems 1981–2018
agdatacommons.nal.usda.gov
bin
Updated May 30, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maria Gisbert-Queral; Nathan Mueller (2025). Data for: Climate impacts and adaptation in US dairy systems 1981–2018 [Dataset]. http://doi.org/10.5281/zenodo.4818011
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4818011
Dataset updated
May 30, 2025
Dataset provided by
Zenodo
Authors
Maria Gisbert-Queral; Nathan Mueller
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Area covered
United States
Description
Data is archived here: https://doi.org/10.5281/zenodo.4818011Data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.This data and code archive contains the following files and folders:* READMEDescription: text file with this description* flowchart.pdfDescription: a PDF file with a flow chart that illustrates how R scripts transform the raw data files to files that contain generated data sets and intermediate results and, finally, to the tables and figures that are presented in the paper.* runAll.shDescription: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell such as most computers with MacOS, GNU/Linux, or Unix operating systems)* Folder "DataRaw"Description: folder for raw data filesThis folder contains the following files:- DataRaw/COWS.xlsxDescription: MS-Excel file with the number of cows per countySource: USDA NASS QuickstatsObservations: All available counties and years from 2002 to 2012- DataRaw/milk_state.xlsxDescription: MS-Excel file with average monthly milk yields per cowSource: USDA NASS QuickstatsObservations: All available states from 1981 to 2018- DataRaw/TMAX.csvDescription: CSV file with daily maximum temperaturesSource: PRISM Climate Group (spatially averaged)Observations: All counties from 1981 to 2018- DataRaw/VPD.csvDescription: CSV file with daily maximum vapor pressure deficitsSource: PRISM Climate Group (spatially averaged)Observations: All counties from 1981 to 2018- DataRaw/countynamesandID.csvDescription: CSV file with county names, state FIPS codes, and county FIPS codesSource: US Census BureauObservations: All counties- DataRaw/statecentroids.csvDescriptions: CSV file with latitudes and longitudes of state centroidsSource: Generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" functionObservations: All states* Folder "DataGenerated"Description: folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).* Folder "Results"Description: folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).* Folder "Figures"Description: folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.* Folder "Tables"Description: folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.* Folder "logFiles"Description: the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.* PrepareCowsData.RDescription: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses* PrepareWeatherData.RDescription: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses* PrepareMilkData.RDescription: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses* CalcFrequenciesTHI_Temp.RDescription: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state* CalcAvgTHI.RDescription: R script that calculates the average THI in each state* PreparePanelTHI.RDescription: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins* PreparePanelTemp.RDescription: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins* PreparePanelFinal.RDescription: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses* EstimateTrendsTHI.RDescription: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set* EstimateModels.RDescription: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications* CalcCoefStateYear.RDescription: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification* SearchWeightMonths.RDescription: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term* TestModelSpec.RDescription: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10* CreateFigure1a.RDescription: R script that creates subfigure a of Figure 1* CreateFigure1b.RDescription: R script that creates subfigure b of Figure 1* CreateFigure2a.RDescription: R script that creates subfigure a of Figure 2* CreateFigure2b.RDescription: R script that creates subfigure b of Figure 2* CreateFigure2c.RDescription: R script that creates subfigure c of Figure 2* CreateFigure3.RDescription: R script that creates the subfigures of Figure 3* CreateFigure4.RDescription: R script that creates the subfigures of Figure 4* CreateFigure5_TableS6.RDescription: R script that creates the subfigures of Figure 5 and Table S6* CreateFigureS1.RDescription: R script that creates Figure S1* CreateFigureS2.RDescription: R script that creates Figure S2* CreateTableS2_S3_S7.RDescription: R script that creates Tables S2, S3, and S7* CreateTableS4_S5.RDescription: R script that creates Tables S4 and S5* CreateTableS8.RDescription: R script that creates Table S8* CreateTableS9.RDescription: R script that creates Table S9
d
Satellite Electric Vehicle Dataset (TESLA,LUCID, RIVIAN
datarade.ai
.csv
Updated Jan 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Space Know (2023). Satellite Electric Vehicle Dataset (TESLA,LUCID, RIVIAN [Dataset]. https://datarade.ai/data-products/satellite-electric-vehicle-dataset-tesla-lucid-rivian-space-know
Explore at:
.csvAvailable download formats
Dataset updated
Jan 21, 2023
Dataset authored and provided by
Space Know
Area covered
China, United States of America
Description
SpaceKnow uses satellite (SAR) data to capture activity in electric vehicles and automotive factories.

Data is updated daily, has an average lag of 4-6 days, and history back to 2017.

The insights provide you with level and change data that monitors the area which is covered with assembled light vehicles in square meters.

We offer 3 delivery options: CSV, API, and Insights Dashboard

Available companies Rivian (NASDAQ: RIVN) for employee parking, logistics, logistic centers, product distribution & product in the US. (See use-case write up on page 4) TESLA (NASDAQ: TSLA) indices for product, logistics & employee parking for Fremont, Nevada, Shanghai, Texas, Berlin, and Global level Lucid Motors (NASDAQ: LCID) for employee parking, logistics & product in US

Why get SpaceKnow's EV datasets?

Monitor the company’s business activity: Near-real-time insights into the business activities of Rivian allow users to better understand and anticipate the company’s performance.

Assess Risk: Use satellite activity data to assess the risks associated with investing in the company.

Types of Indices Available Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices. The first one is CFI-R which gives you level data, so it shows how many square meters are covered by metallic objects (for example assembled cars). The second one is CFI-S which gives you change data, so it shows you how many square meters have changed within the locations between two consecutive satellite images.

How to interpret the data SpaceKnow indices can be compared with the related economic indicators or KPIs. If the economic indicator is in monthly terms, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator. Each data point will reflect approximately the sum of the month. If the economic indicator is in quarterly terms, perform a 90-day rolling sum and pick the last day of the 90-day to compare with the economic indicator. Each data point will reflect approximately the sum of the quarter.

Product index This index monitors the area covered by manufactured cars. The larger the area covered by the assembled cars, the larger and faster the production of a particular facility. The index rises as production increases.

Product distribution index This index monitors the area covered by assembled cars that are ready for distribution. The index covers locations in the Rivian factory. The distribution is done via trucks and trains.

Employee parking index Like the previous index, this one indicates the area covered by cars, but those that belong to factory employees. This index is a good indicator of factory construction, closures, and capacity utilization. The index rises as more employees work in the factory.

Logistics index The index monitors the movement of materials supply trucks in particular car factories.

Logistics Centers index The index monitors the movement of supply trucks in warehouses.

Where the data comes from: SpaceKnow brings you information advantages by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company’s infrastructure searches and downloads new imagery every day, and the computations of the data take place within less than 24 hours.

In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the EV industry with just a 4-6 day lag, on average.

The EV data help you to estimate the performance of the EV sector and the business activity of the selected companies.

The backbone of SpaceKnow’s high-quality data is the locations from which data is extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.

Each individual location is precisely defined so that the resulting data does not contain noise such as surrounding traffic or changing vegetation with the season.

We use radar imagery and our own algorithms, so the final indices are not devalued by weather conditions such as rain or heavy clouds.

→ Reach out to get a free trial

Use Case - Rivian:

SpaceKnow uses the quarterly production and delivery data of Rivian as a benchmark. Rivian targeted to produce 25,000 cars in 2022. To achieve this target, the company had to increase production by 45% by producing 10,683 cars in Q4. However the production was 10,020 and the target was slightly missed by reaching total production of 24,337 cars for FY22.

SpaceKnow indices help us to observe the company’s operations, and we are able to monitor if the company is set to meet its forecasts or not. We deliver five different indices for Rivian, and these indices observe logistic centers, employee parking lot, logistics, product, and prod...
d
Data from: Data and code from: Comparison of in-field and laboratory-based...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data and code from: Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-comparison-of-in-field-and-laboratory-based-phenotyping-methods-for-eva
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains all data and R code, in RMarkdown notebook format, needed to reproduce all statistical analysis, figures, and tables in the manuscript:Jeffers, D., J. S. Smith, E. D. Womack, Q. D. Read, and G. L. Windham. 2024. Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines. Plant Disease. (citation to be updated upon final acceptance of MS)There is a critical need to quickly and reliably identify corn genotypes that are resistant to accumulating aflatoxin in their kernels. We compared three methods of determining how resistant different corn genotypes are to aflatoxin accumulation: a field-based assay (side-needle inoculation) and two different lab-based assays (wounding and non-wounding kernel screening assays; KSA). In this data object, we present the data from the lab and field assays, statistical models that are fit to the data, procedures for comparing model fit of different variants of the model, and model predictions. This includes how reliably each assay identifies resistant and susceptible check varieties, and how well correlated the assay methods are with one another. Statistical analyses are done using R software, including Bayesian models fit with Stan software.The following files are included:ksa_analysis_revised.Rmd: RMarkdown notebook with all code needed to reproduce analyses and create figures and tables in manuscriptksa_analysis_revised.html: HTML rendered output of notebookstep1_ksa.tsv: tab-separated data file with data from the lab assay. Columns include sample ID, genotype ID and entry code, year, treatment (wound or no-wound), subsample ID, replicate ID, aflatoxin concentration (in units of ng/g), logarithm of aflatoxin concentration, and a column indicating genotypes that are susceptible or resistant checksstep1_ksa_field.tsv: tab-separated data file with data from the field assay. Columns similar to the lab assay data file with an additional column for row in which the sample was planted.ksa_cov_mod.tsv: tab-separated data file with secondary infection covariate data from the lab assay. Columns similar to the lab assay data file with columns for secondary Asp, Fus, and NI infections and their logarithms.brmfits.zip: zip archive with 12 .rds files. These are model output files for the Bayesian mixed effect models presented in the MS that were fitted using the R function brm(). You may download these to reproduce output without having to compile and run the models yourself.The three .tsv data files should be placed in a subdirectory called "data" in the same directory where the .Rmd notebook is located.
96 wells fluorescence reading and R code statistic for analysis
zenodo.org
bin, csv, doc, pdf
Updated Aug 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JVD Molino; JVD Molino (2024). 96 wells fluorescence reading and R code statistic for analysis [Dataset]. http://doi.org/10.5281/zenodo.1119285
Explore at:
doc, csv, pdf, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1119285
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
JVD Molino; JVD Molino
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Data points present in this dataset were obtained following the subsequent steps: To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured in 400 μL TAP medium for 7 days in Deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker, set to 150 rpm, under constant illumination (50 μmol photons/m²s). Then 100 μL sample were transferred clear bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland). Fluorescence was measured at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning Deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to the clear bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R Statistic version 3.3.3 was used to perform one-way ANOVA (with Tukey's test), and to test statistical hypotheses, the significance level was set at 0.05. Graphs were generated in RStudio v1.0.136. The codes are deposit herein.

Info

ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R statistic 3.3.3

barplot_R.R -> code to generate bar plot in R statistic 3.3.3

boxplotv2.R -> code to generate boxplot in R statistic 3.3.3

pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.

who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.

who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.

Anova_Output_Summary_Guide.pdf -> Explain the ANOVA files content

ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.

Consider citing our work.

Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal. pone.0192433

Facebook

Twitter

Click to copy link

Link copied

Cite

Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley (2023). Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [Dataset]. http://doi.org/10.1021/acs.jcim.6b00753.s006

Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets

Explore at:

160 scholarly articles cite this dataset (View in Google Scholar)

zipAvailable download formats

Unique identifier

https://doi.org/10.1021/acs.jcim.6b00753.s006

Dataset updated

Jun 5, 2023

Dataset provided by

ACS Publications

Authors

Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.

Clear search

Close search

Google apps

Main menu

Comparison of the Predictive Performance and Interpretability of Random...

Simulation Data & R scripts for: "Introducing recurrent events analyses to...

Data from: Comparison Data Sets for Benchmarking QSAR Methodologies in Lead...

Data from: R Manual for QCA

Optimization and Evaluation Datasets for PiMine - Dataset - B2FIND

Benchmark Multi-Omics Datasets for Methods Comparison

Data from: A preliminary comparison of a songbird's song repertoire size and...

ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1

Data from: Supplementary tables:MetaFetcheR: An R package for complete...

Time-Series Matrix (TSMx): A visualization tool for plotting multiscale...

Crop classification dataset for testing domain adaptation or distributional...

Datasets for Comparison of Surrogate Models to Estimate Pesticide...

Global Thermosaliniograph Dataset

HUN Comparison of model variability and interannual variability

Abstract

Dataset History

Dataset Citation

Dataset Ancestors

Data from: Data and code from: Identification of a key target for...

Data from: A simple approach for maximizing the overlap of phylogenetic and...

Data for: Climate impacts and adaptation in US dairy systems 1981–2018

Satellite Electric Vehicle Dataset (TESLA,LUCID, RIVIAN

Data from: Data and code from: Comparison of in-field and laboratory-based...

96 wells fluorescence reading and R code statistic for analysis

Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets