Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
File descriptions:
All CSV files contain results from the different models (PAMM, AARs, linear models, MRPPs) for each iteration of the simulation, with one row per iteration.
"results_perfect_detection.csv" contains the results from the first simulation part with all the observations.
"results_imperfect_detection.csv" contains the results from the first simulation part with randomly thinned observations to mimic imperfect detection.
Columns:
ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
PAMM30: p-value of the PAMM run on the 30-day survey.
PAMM7: p-value of the PAMM run on the 7-day survey.
AAR1: value of the Avoidance-Attraction Ratio calculated as AB/BA.
AAR2: value of the Avoidance-Attraction Ratio calculated as BAB/BB.
Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).
Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
Karanth_permA: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
"results_int_dir_perf_det.csv" contains the results from the second simulation part with all the observations.
"results_int_dir_imperf_det.csv" contains the results from the second simulation part with randomly thinned observations to mimic imperfect detection.
Columns:
ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of A on B.
p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of B on A.
AAR1: value of the Avoidance-Attraction Ratio calculated as AB/BA.
AAR2_BAB: value of the Avoidance-Attraction Ratio calculated as BAB/BB.
AAR2_ABA: value of the Avoidance-Attraction Ratio calculated as ABA/AA.
Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).
Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
Karanth_permA: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
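The MRPP columns above store the rank of an observed median within a randomized distribution. As a rough illustration only, the following Python sketch shows one way such a rank could be computed. The function name, the ascending rank convention, and the example numbers are assumptions for illustration and are not taken from the authors' R scripts.

```python
def permutation_rank(observed_median, randomized_medians):
    """Rank an observed median within a list of randomized medians.

    Returns (rank, fraction), where rank is 1-based and counts how many
    randomized medians are strictly smaller than the observed one, and
    fraction is the share of randomized medians <= the observed median.
    The ascending convention is an assumption made for this sketch.
    """
    if not randomized_medians:
        raise ValueError("randomized_medians must not be empty")
    n_smaller = sum(1 for m in randomized_medians if m < observed_median)
    n_not_larger = sum(1 for m in randomized_medians if m <= observed_median)
    return n_smaller + 1, n_not_larger / len(randomized_medians)


# Hypothetical interval-duration medians (in days) from four permutations.
rank, fraction = permutation_rank(6.0, [5.0, 7.0, 9.0, 11.0])
print(rank, fraction)
```

A low rank would mean the observed intervals are shorter than most randomized ones, which is the kind of signal an attraction effect would produce under this convention.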
Script file descriptions:
1_Functions: R script containing the functions:
 - MRPP from Karanth et al. (2017), adapted here for time efficiency.
 - MRPP from Murphy et al. (2021), adapted here for time efficiency.
 - Version of the ct_to_recurrent() function from the recurrent package, adapted to run in parallel on the simulation datasets.
 - The simulation() function used to simulate observations of two species with reciprocal effects on each other.
2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization and the random thinning mimicking imperfect detection.
3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.
3_1_Real data comparison: R script containing the fit of the different models tested on the real data example from Murphy et al. 2021.
4_Graphs: R script containing the code for plotting results from the simulation part and appendices.
5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the original code of Karanth et al. (2017) and Murphy et al. (2021), the versions adapted for time efficiency, and a comparison verifying that the results are similar.
5_2_Appendix - Multi-response procedure permutation difference: R script testing whether the MRPP approaches differ according to the species on which the permutations are done.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
2D and 3D QSAR techniques are widely used in lead optimization-like processes. A compilation of 40 diverse data sets is described. It is proposed that these can be used as a common benchmark sample for comparisons of QSAR methodologies, primarily in terms of predictive ability. This benchmark set will be useful both for assessing new methods and for optimizing existing ones.
The R Manual for QCA consists of a PDF file that describes all the steps and code needed to prepare and conduct a Qualitative Comparative Analysis (QCA) study in R. This is complemented by an R script that can be customized as needed. The dataset further includes two files with sample data, for the set-theoretic analysis and the visualization of QCA results. The R Manual for QCA is the online appendix to "Qualitative Comparative Analysis: An Introduction to Research Design and Application", Georgetown University Press, 2021.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions [1].
The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities. In addition, we added the results of the case studies analyzed in [1] to enable readers to follow the discussion and investigate the results individually.
Data set description:
The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.
The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine's performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.
The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of performance when only the interfaces of the apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments.
Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and is also an interesting dataset to screen for interface similarities.
References:
[1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted).
[2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
[3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
[4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.
This work was supported by the German Federal Ministry of Education and Research as part of CompLS and de.NBI [031L0172, 031L0105]. C.E. is funded by Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter (Grant-ID: HIDSS-0002).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Pathway Multi-Omics Simulated Data
These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".
There are 100 sets (stored as 100 subfolders, the first 50 in "pt1" and the second 50 in "pt2") of random modifications to centred and scaled copy number, gene expression, and proteomics data saved as compressed data files for the R programming language. These data sets are stored in subfolders labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents:
(1) "indicatorMatricesXXX_ls.RDS" is a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...).
(2) "CNV_partitionA_deltaB.RDS" is the synthetically modified copy number variation data (where A represents the proportion of genes in each gene set to receive the synthetic treatment [partition 1 is 20%, 2 is 40%, 3 is 60% and 4 is 80%] and B is the signal strength in units of standard deviations).
(3) "RNAseq_partitionA_deltaB.RDS" is the synthetically modified gene expression data (same parameter legend as CNV).
(4) "Prot_partitionA_deltaB.RDS" is the synthetically modified protein expression data (same parameter legend as CNV).
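The naming scheme above can be decoded programmatically. The Python sketch below is an illustration only: it assumes that A appears as a single digit and B as a plain decimal number in the file name (for example the hypothetical "CNV_partition2_delta0.3.RDS"), which the description does not confirm, and the function name is invented for this example.

```python
import re

# Partition label -> proportion of genes per gene set that received the
# synthetic treatment (taken from the description above).
PARTITION_TO_PROPORTION = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}

# Assumed file-name layout: <omics>_partition<A>_delta<B>.RDS
_NAME_PATTERN = re.compile(r"^(CNV|RNAseq|Prot)_partition([1-4])_delta([0-9.]+)\.RDS$")


def parse_sim_filename(name):
    """Split a simulated-data file name into its documented parameters."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    omics, partition, delta = match.groups()
    return {
        "omics": omics,
        "proportion": PARTITION_TO_PROPORTION[int(partition)],
        "delta_sd": float(delta),  # signal strength in standard deviations
    }


# Hypothetical file name used only to exercise the parser.
print(parse_sim_filename("CNV_partition2_delta0.3.RDS"))
```

Reading the .RDS contents themselves still requires R (or an RDS reader); the sketch only handles the file names.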
Supplemental Files
The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study in Gene Matrix Transpose format. Scripts to create and analyze these data sets are available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement
Characteristics of birdsong, especially minimum frequency, have been shown to vary for some species between urban and rural populations and along urban-rural gradients. However, few urban-rural comparisons of song complexity—and none that we know of based on the number of distinct song types in repertoires—have occurred. Given the potential ability of song repertoire size to indicate bird condition, we primarily sought to determine if number of distinct song types displayed by Song Sparrows (Melospiza melodia) varied between an urban and a rural site. We determined song repertoire size of 24 individuals; 12 were at an urban (‘human-dominated’) site and 12 were at a rural (‘agricultural’) site. Then, we compared song repertoire size, note rate, and peak frequency between these sites. Song repertoire size and note rate did not vary between our human-dominated and agricultural sites. Peak frequency was greater at the agricultural site. Our finding that peak frequency was higher at the agri...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space as the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.
To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from ArrayExpress. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.
Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non-T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3.
For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat 3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction="umap", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using, respectively, the prcomp function from the base R package "stats" and the "umap" R package (https://github.com/tkonopka/umap).
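The gene prioritization rule described above (prefer genes that are variable in more datasets, break ties in favour of the largest dataset) can be sketched as follows. This Python snippet is a simplified illustration under stated assumptions, not the STACAS or Seurat implementation; the function name, the alphabetical final tie-break, and the toy inputs are hypothetical.

```python
from collections import Counter


def select_integration_genes(variable_genes, dataset_sizes, n_genes=800):
    """Pick integration genes from per-dataset variable-gene lists.

    variable_genes: dict mapping dataset name -> list of variable genes
    dataset_sizes:  dict mapping dataset name -> number of cells
    Genes are ranked by (1) the number of datasets in which they are
    variable and (2) the size of the largest dataset containing them.
    Remaining ties are broken alphabetically (an assumption of this sketch).
    """
    n_datasets = Counter()
    largest_dataset = {}
    for dataset, genes in variable_genes.items():
        for gene in set(genes):
            n_datasets[gene] += 1
            largest_dataset[gene] = max(
                largest_dataset.get(gene, 0), dataset_sizes[dataset]
            )
    ranked = sorted(
        n_datasets,
        key=lambda g: (-n_datasets[g], -largest_dataset[g], g),
    )
    return ranked[:n_genes]


# Toy example with made-up gene lists and cell counts.
toy_genes = {
    "A": ["Cd8a", "Pdcd1", "Tox"],
    "B": ["Cd8a", "Tcf7"],
    "C": ["Pdcd1", "Cd8a", "Gzmb"],
}
toy_sizes = {"A": 1000, "B": 5000, "C": 200}
print(select_integration_genes(toy_genes, toy_sizes, n_genes=3))
```

In the toy example, "Tcf7" outranks "Tox" and "Gzmb" because all three occur in a single dataset and "Tcf7" comes from the largest one.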
The dataset includes a PDF file containing the results and an Excel file with the following tables:
Table S1 Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al.
Table S2 Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al.
Table S3 Results of comparing the performance of MetaFetcheR to MetaboAnalyst 5.0 webtool using Diamanti et al.
Table S4 Results of comparing the performance of MetaFetcheR to MetaboAnalyst 5.0 webtool for Priolo et al.
Table S5 Data quality test results for running 100 iterations on HMDB database.
Table S6 Data quality test results for running 100 iterations on KEGG database.
Table S7 Data quality test results for running 100 iterations on ChEBI database.
Table S8 Data quality test results for running 100 iterations on PubChem database.
Table S9 Data quality test results for running 100 iterations on LIPID MAPS database.
Table S10 The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al.
Table S11 An example of an input matrix for MetaFetcheR.
Table S12 Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al.
Table S13 Data set from Diamanti et al.
Table S14 Data set from Priolo et al.
Table S15 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al.
Table S16 Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al.
Table S17 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
Table S18 Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
(See the "index" tab in the Excel file for more information)
Small-compound databases contain a large amount of information on metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish means of data access in the early stages of a project can lead to mislabelled compounds, reduced statistical power and large delays in the delivery of results.
We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.
The dataset was originally published in DiVA and moved to SND in 2024.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends
TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014.
This publication entry also includes an Excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel.
Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624
TSMx sample chart from the supplied Excel template.
Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).
TSMx R script:
# import packages
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(tidyr)
library(forcats)
library(Kendall)

options(warn = -1) # disable warnings

# read data (.csv file with "Year" and "Value" columns)
data <- read_csv("EVI.csv")

# prepare row/column names for output matrices
years <- data %>% pull("Year")
r.names <- years[-length(years)]
c.names <- years[-1]
years <- years[-length(years)]

# initialize output matrices
sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

# function to return remaining years given a start year
getRemain <- function(start.year) {
  years <- data %>% pull("Year")
  start.ind <- which(data[["Year"]] == start.year) + 1
  remain <- years[start.ind:length(years)]
  return (remain)
}

# function to subset data for a start/end year combination
splitData <- function(end.year, start.year) {
  keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
  batch <- data[keep,]
  return(batch)
}

# function to fit linear regression and return slope direction
fitReg <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(sign(slope))
}

# function to fit linear regression and return slope magnitude
fitRegv2 <- function(batch) {
  trend <- lm(Value ~ Year, data = batch)
  slope <- coefficients(trend)[[2]]
  return(slope)
}

# function to implement Mann-Kendall (MK) trend test and return significance
# the test is implemented only for n>=8
getMann <- function(batch) {
  if (nrow(batch) >= 8) {
    mk <- MannKendall(batch[['Value']])
    pval <- mk[['sl']]
  } else {
    pval <- NA
  }
  return(pval)
}

# function to return slope direction for all combinations given a start year
getSign <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  signs <- lapply(combs, fitReg)
  return(signs)
}

# function to return MK significance for all combinations given a start year
getPval <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  pvals <- lapply(combs, getMann)
  return(pvals)
}

# function to return slope magnitude for all combinations given a start year
getMagn <- function(start.year) {
  remaining <- getRemain(start.year)
  combs <- lapply(remaining, splitData, start.year = start.year)
  magns <- lapply(combs, fitRegv2)
  return(magns)
}

# retrieve slope direction, MK significance, and slope magnitude
signs <- lapply(years, getSign)
pvals <- lapply(years, getPval)
magns <- lapply(years, getMagn)

# fill-in output matrices
dimension <- nrow(sign.matrix)
for (i in 1:dimension) {
  sign.matrix[i, i:dimension] <- unlist(signs[i])
  pval.matrix[i, i:dimension] <- unlist(pvals[i])
  slope.matrix[i, i:dimension] <- unlist(magns[i])
}
sign.matrix <-...
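For readers who do not use R, the core idea of the script (one regression slope per start/end year combination, arranged with start years as rows and end years as columns) can be sketched in plain Python. This is a minimal illustration, not a port of TSMx: the function names and the example values are hypothetical, and the Mann-Kendall test and plotting steps are omitted.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_matrix(years, values):
    """Upper-triangular matrix of slopes for every start/end year pair.

    Row i corresponds to start year years[i]; column j corresponds to
    end year years[j + 1]. Cells with end year <= start year stay None.
    """
    n = len(years)
    matrix = [[None] * (n - 1) for _ in range(n - 1)]
    for i in range(n - 1):           # index of the start year
        for j in range(i + 1, n):    # index of the end year
            matrix[i][j - 1] = ols_slope(years[i:j + 1], values[i:j + 1])
    return matrix


# Made-up values for four consecutive years.
slopes = slope_matrix([2001, 2002, 2003, 2004], [1.0, 3.0, 2.0, 5.0])
for row in slopes:
    print(row)
```

The sign of each entry gives the slope direction shown in the first TSMx matrix, and the top-right cell covers the full temporal range.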
In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).
In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.
More details on the dataset and processing steps can be found in Kluger et al. (2021). Many of the processing steps were taken to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.
All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) are described in the file "Metadata.rtf".
Preferred Citation:
- Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.
- URL to this Zenodo post: https://zenodo.org/record/6376160
References:
- Kluger, D.M., Wang, S., and Lobell, D.B. (2021). Two shifts for crop mapping: leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sensing of Environment, 262, 112488.
- Yeh, C., Meng, C., Wang, S., Driscoll, A., Rozi, E., Liu, P., Lee, J., Burke, M., Lobell, D., and Ermon, S. (2021). SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning. In Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (Round 2), Dec. 2021.
This data release is comprised of data tables of input variables for seawaveQ and surrogate models used to predict concentrations of select pesticides at six U.S. Geological Survey National Water Quality Network (NWQN) river sites (Fanno Creek at Durham, Oregon; White River at Hazleton, Indiana; Kansas River at DeSoto, Kansas; Little Arkansas River near Sedgwick, Kansas; Missouri River at Hermann, Missouri; Red River of the North at Grand Forks, North Dakota). Each data table includes discrete concentrations of one select pesticide (Atrazine, Azoxystrobin, Bentazon, Bromacil, Imidacloprid, Simazine, or Triclopyr) at one of the NWQN sites; daily mean streamflow; 30-day and 1-day flow anomalies; daily median values of pH and turbidity; daily mean values of dissolved oxygen, specific conductance, and water temperature; and 30-day and 1-day anomalies for pH, turbidity, dissolved oxygen, specific conductance, and water temperature. Two pesticides were modeled at each site with three types of regression models. Also included is a zip file with outputs from seawaveQ model summary. The processes for retrieving and preparing data for regression models followed those outlined in the SEAWAVE-Q R package documentation (Ryberg and Vecchia, 2013; Ryberg and York, 2020). The R package waterData (Ryberg and Vecchia, 2012) was used to import daily mean values for discharge and either daily mean or daily median values for continuous water-quality constituents directly into R depending on what data were available at each site. Pesticide concentration, streamflow, and surrogate data (continuously measured field parameters) were imported from and are available online from the USGS National Water Information System database (USGS, 2020). 
The waterData package was used to screen for missing daily mean discharge values (no missing values were found for the sites) and to calculate short-term (1 day) and mid-term (30 day) anomalies for flow and short-term anomalies (1 day) for each water-quality variable. A mid-term streamflow anomaly, for instance, is the deviation of concurrent daily streamflow from average conditions for the previous 30 days (Vecchia and others, 2008). Anomalies were calculated as additional potential model variables. Pesticide concentrations for select constituents from each site were pulled into R using the dataRetrieval package (De Cicco and others, 2018). For three of the six sites (Kansas River at DeSoto, Kansas; Missouri River at Hermann, Missouri; and White River at Hazleton, Indiana), pesticide data were retrieved for WY 2013–17, whereas for the other three sites (Fanno Creek at Durham, Oregon; Little Arkansas River near Sedgwick, Kansas; and Red River of the North at Grand Forks, North Dakota), pesticide data were retrieved for WY 2013–18. Discrete pesticide data were matched with daily mean discharge and daily mean or median water-quality constituents and the associated calculated short-term (1-day) and mid-term (30-day) anomalies from the date of sampling. Pesticide concentrations were estimated using the SEAWAVE-Q (with surrogates) model using 19 combinations of surrogate variables (table 2 in the associated SIR, "Comparison of Surrogate Models to Estimate Pesticide Concentrations at Six U.S. Geological Survey National Water Quality Network Sites During Water Years 2013–18.") at each of 12 site-pesticide combinations (table 3 in the associated SIR). Three measures of model performance (the generalized coefficient of determination (R2), Akaike's information criterion (AIC), and scale) were included in the output and used to select best-fit models (Table 4 of the associated SIR). 
The three to four best-fit SEAWAVE-Q (with surrogates) models with sample sizes at least five times the number of variables were selected for each site-pesticide combination based on generalized R2 values—the higher, the better. If generalized R2 values were the same, the model with the lower AIC value was used. The standard surrogate regression and base SEAWAVE-Q models were then applied using the same samples that were used for each of the best-fit SEAWAVE-Q (with surrogates) models so that direct comparisons could be made for each site-pesticide-surrogate instance. The input data used to estimate daily pesticide concentrations for each of the best fit models have been included in this data release. An example of one output file for each model type is included in a .zip file named "output_examples.zip". Each of the output files shows the three measures of model performance. (1) The output file for the standard regression model named "HAZ8_Atrazine_Standard_Regression_Output.txt" includes: Pseudo R-square (Allison) of 0.631, Model AIC of 174.0232, and a Scale of 0.961. (2) The output file for the base SEAWAVE-Q model named "HAZ8_Atrazine_Base_Seawave-Q_Output.txt" includes: Generalized r-squared of 0.82, AIC (Akaike's An Information Criterion) of 36.38, and a Scale of 0.288. (3) The output file for the SEAWAVE-Q w/Surrogates model named "HAZ8_Atrazine_Seawave-Q_w_Surrogates_Output.txt" includes: Generalized r-squared of 0.85, AIC (Akaike's An Information Criterion) of 33.76, and a Scale of 0.268. These values match those for Site ID = HAZ, Pesticide = Atrazine, and Surrogate variable group 8 for each model type in Table 4 of the associated SIR.
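The selection rule described above (sample size at least five times the number of variables, highest generalized R2 first, lower AIC as the tie-breaker) can be sketched in Python as follows. The dictionary keys, the function name, and all numeric values are hypothetical and are not taken from the SIR tables; this is only one possible reading of the rule.

```python
def select_best_fit(models, n_samples, max_models=4):
    """Rank candidate models by generalized R2 (descending), then AIC (ascending).

    Only models whose sample size is at least five times their number of
    variables are eligible. Field names are assumptions for this sketch.
    """
    eligible = [m for m in models if n_samples >= 5 * m["n_variables"]]
    ranked = sorted(eligible, key=lambda m: (-m["gen_r2"], m["aic"]))
    return ranked[:max_models]

# Hypothetical candidates for one site-pesticide combination.
candidates = [
    {"group": 1, "n_variables": 3, "gen_r2": 0.85, "aic": 40.1},
    {"group": 2, "n_variables": 4, "gen_r2": 0.85, "aic": 33.8},
    {"group": 3, "n_variables": 12, "gen_r2": 0.90, "aic": 30.0},
    {"group": 4, "n_variables": 2, "gen_r2": 0.80, "aic": 36.4},
]
best = select_best_fit(candidates, n_samples=50)
best_groups = [m["group"] for m in best]
```

In this example, group 3 is excluded because 50 samples is fewer than five times its 12 variables, and groups 1 and 2 tie on generalized R2, so group 2 ranks first on its lower AIC.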
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
%%%%%%%%%%%%%%%%%%
Global Thermosalinograph Dataset
%%%%%%%%%%%%%%%%%%
The dataset tsg_2020.mat is an updated version of the compiled ship thermosalinograph dataset that was originally used in Drushka et al. (2019). Details below.
%%%%%%%%
Variable Names
%%%%%%%%
t = time in matlab datenum format
T = temperature
S = salinity
x = longitude
y = latitude
callsign = ship callsign, numbers correspond to call signs in call_sign.m
dataset = number corresponds with the names of the original dataset (see list below)
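Because t is stored as a MATLAB datenum, users working outside MATLAB need to convert it. The Python sketch below shows one common conversion. The function name is an assumption for this example, and loading the .mat file itself (for instance with a third-party reader) is left out.

```python
from datetime import datetime, timedelta

def datenum_to_datetime(datenum):
    """Convert a MATLAB datenum (days since year 0) to a Python datetime.

    Python ordinals start at 0001-01-01, so the 366 days of MATLAB's
    year 0 are subtracted.
    """
    days = int(datenum)
    fraction = datenum - days
    return datetime.fromordinal(days) + timedelta(days=fraction) - timedelta(days=366)

# MATLAB datenum 730486 corresponds to 2000-01-01; .5 adds twelve hours.
example = datenum_to_datetime(730486.5)
```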
%%%%%%%%%%%%
Original Dataset Names
%%%%%%%%%%%%
2: GOSUD database
4: SSS-OS database
5: PANGAEA database - R/V Polarstern
6: R/V Survostral
9: SAMOS database
10: PANGAEA database - R/V Poseidon
11: AODN database
12: M/V Oleander
15: Cruise data provided by Sophie Clayton
16: SAMOS database (files not QC'd by SAMOS, QC'd as described below)
22: JAMSTEC database
24: SOCAT 2020 database
25: LEGOS/SSSOS (update from dataset 4)
26: SAMOS database (update)
27: GOSUD database (update)
%%%%%%%%%%%%%%%
Updated Version Description
%%%%%%%%%%%%%%%
The following describes what has been updated compared to the data used in Drushka et al. (2019).
1. SOCAT: https://www.socat.info/index.php/data-access/
- Data up to 2019 included. Basic QC performed by comparing to Argo data and discarding outliers.
2. SSS-OS (LEGOS): http://sss.sedoo.fr/
- Data up to 12/2019 included. Salinity data QC’d by LEGOS; basic QC performed on temperature by comparing to Argo data and discarding outliers.
3. SAMOS https://samos.coaps.fsu.edu
- Data flagged as “good data with 0-5% flagged” up to 12/2019 included
4. GOSUD - http://www.gosud.org/
- Delayed-mode data up to 12/2019 included
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
These are the data summarising the modelled Hydrological Response Variable (HRV) variability versus climate interannual variability which has been used as an indicator of risk. For example, to understand the significance of the modelled increases in low-flow days, it is useful to look at them in the context of the interannual variability in low-flow days due to climate. In other words, are the modelled increases due to additional coal resource development within the natural range of variability of the longer-term flow regime, or are they potentially moving the system outside the range of hydrological variability it experiences under the current climate? The maximum increase in the number of low-flow days due to additional coal resource development relative to the interannual variability in low-flow days under the baseline has been adopted to put some context around the modelled changes. If the maximum change is small relative to the interannual variability due to climate (e.g. an increase of 3 days relative to a baseline range of 20 to 50 days), then the risk of impacts from the changes in low-flow days is likely to be low. If the maximum change is comparable to or greater than the interannual variability due to climate (e.g. an increase of 200 days relative to a baseline range of 20 to 50 days), then there is a greater risk of impact on the landscape classes and assets that rely on this water source. Here changes comparable to or greater than interannual variability are interpreted as presenting a risk. However, the change due to the additional coal resource development is additive, so even a 'less than interannual variability' change is not free from risk. Results of the interannual variability comparison should be viewed as indicators of risk.
This dataset is generated using 1000 HRV simulations together with climate inputs. For each HRV, the ratio between the modelled variability in the HRV and the variability attributable to interannual variability due to climate was calculated.
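The comparison can be illustrated with the two examples given above (increases of 3 and 200 low-flow days against a baseline range of 20 to 50 days). The Python sketch below is only an illustration: measuring interannual variability as the baseline range (maximum minus minimum) and the function name are assumptions, and the dataset itself may use a different measure of variability.

```python
def variability_ratio(max_change, baseline_values):
    """Ratio of the maximum modelled change to the baseline interannual range.

    The range (max minus min) is an assumed measure of interannual
    variability, chosen to match the worked examples in the text.
    """
    baseline_range = max(baseline_values) - min(baseline_values)
    if baseline_range == 0:
        raise ValueError("baseline shows no interannual variability")
    return max_change / baseline_range

# Hypothetical baseline low-flow days per year spanning 20 to 50 days.
baseline = [20, 35, 50]
small = variability_ratio(3, baseline)    # change well inside the baseline range
large = variability_ratio(200, baseline)  # change far beyond the baseline range
```

A ratio well below 1 corresponds to the lower-risk case described above, while a ratio near or above 1 corresponds to a change comparable to or greater than the interannual variability.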
Bioregional Assessment Programme (2017) HUN Comparison of model variability and interannual variability. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/1c0a19f9-98c2-4d92-956d-dd764aaa10f9.
Derived From River Styles Spatial Layer for New South Wales
Derived From SYD ALL climate data statistics summary
Derived From HUN AWRA-R Observed storage volumes Glenbawn Dam and Glennies Creek Dam
Derived From Hunter River Salinity Scheme Discharge NSW EPA 2006-2012
Derived From HUN AWRA-R simulation nodes v01
Derived From Bioregional Assessment areas v06
Derived From Hunter AWRA Hydrological Response Variables (HRV)
Derived From GEODATA 9 second DEM and D8: Digital Elevation Model Version 3 and Flow Direction Grid 2008
Derived From HUN AWRA-L simulation nodes_v01
Derived From Bioregional Assessment areas v04
Derived From HUN AWRA-R Gauge Station Cross Sections v01
Derived From Gippsland Project boundary
Derived From Natural Resource Management (NRM) Regions 2010
Derived From BA All Regions BILO cells in subregions shapefile
Derived From Hunter Surface Water data v2 20140724
Derived From HUN AWRA-R River Reaches Simulation v01
Derived From HUN AWRA-L simulation nodes v02
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From HUN AWRA-R Irrigation Area Extents and Crop Types v01
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From BA SYD selected GA TOPO 250K data plus added map features
Derived From HUN gridded daily PET from 1973-2102 v01
Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014
Derived From Bioregional Assessment areas v03
Derived From IQQM Model Simulation Regulated Rivers NSW DPI HUN 20150615
Derived From HUN AWRA-R calibration catchments v01
Derived From Bioregional Assessment areas v05
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From National Surface Water sites Hydstra
Derived From Selected streamflow gauges within and near the Hunter subregion
Derived From ASRIS Continental-scale soil property predictions 2001
Derived From Hunter Surface Water data extracted 20140718
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From HUN AWRA-R calibration nodes v01
Derived From HUN future climate rainfall v01
Derived From HUN AWRA-LR Model v01
Derived From HUN AWRA-L ASRIS soil properties v01
Derived From HUN AWRAR restricted input 01
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From Victoria - Seamless Geology 2014
Derived From HUN AWRA-L Site Station Cross Sections v01
Derived From HUN AWRA-R simulation catchments v01
Derived From HUN AWRA-R Simulation Node Cross Sections v01
Derived From Climate model 0.05x0.05 cells and cell centroids
Note: Data files will be made available upon manuscript publication.
This dataset contains all code and data needed to reproduce the analyses in the manuscript:
IDENTIFICATION OF A KEY TARGET FOR ELIMINATION OF NITROUS OXIDE, A MAJOR GREENHOUSE GAS. Blake A. Oakley (1), Trevor Mitchell (2), Quentin D. Read (3), Garrett Hibbs (1), Scott E. Gold (2), Anthony E. Glenn (2)
(1) Department of Plant Pathology, University of Georgia, Athens, GA, USA
(2) Toxicology and Mycotoxin Research Unit, U.S. National Poultry Research Center, United States Department of Agriculture-Agricultural Research Service, Athens, GA, USA
(3) Southeast Area, United States Department of Agriculture-Agricultural Research Service, Raleigh, NC, USA
(Citation will be updated upon acceptance of manuscript.)
Brief description of study aims
Denitrification is a chemical process that releases nitrous oxide (N2O), a potent greenhouse gas. The NOR1 gene is part of the denitrification pathway in Fusarium. Three experiments were conducted for this study. (1) The N2O comparative experiment compares denitrification rates, as measured by N2O production, of a variety of Fusarium spp. strains with and without the NOR1 gene. (2) The N2O substrate experiment compares denitrification rates of selected strains on different growth media (substrates). For parts 1 and 2, linear models are fit comparing N2O production between strains and/or substrates. (3) The Bioscreen growth assay tests whether there is a pleiotropic effect of the NOR1 gene. In this portion of the analysis, growth curves are fit to assess differences in growth rate and carrying capacity between selected strains with and without the NOR1 gene.
Code
All code is included in a .zip archive generated from a private git repository on 2022-10-13 and archived as part of this dataset. The code is contained in R scripts and RMarkdown notebooks.
There are two components to the analysis: the denitrification analysis (comprising parts 1 and 2 described above) and the Bioscreen growth analysis (part 3). The scripts for each are listed and described below.
Analysis of results of denitrification experiments (parts 1 and 2)
NOR1_denitrification_analysis.Rmd: The R code to analyze the experimental data comparing nitrous oxide emissions is all contained in a single RMarkdown notebook. This script analyzes the results from the comparative study and the substrate study.
n2o_subgroup_figures.R: R script to create additional figures using the output from the RMarkdown notebook.
Analysis of results of Bioscreen growth assay (part 3)
bioscreen_analysis.Rmd: This RMarkdown notebook contains all R code needed to analyze the results of the Bioscreen assay comparing growth of the different strains. It could be run as is. However, the model-fitting portion was run on a high-performance computing cluster with the following scripts:
bioscreen_fit_simpler.R: R script containing only the model-fitting portion of the Bioscreen analysis, fit using the Stan modeling language interfaced with R through the brms and cmdstanr packages.
job_bssimple.sh: Job submission shell script used to submit the model-fitting R job to be run on USDA SciNet high-performance computing cluster.
Additional scripts developed as part of the analysis but that are not required to reproduce the analyses in the manuscript are in the deprecated/ folder. Also note the files nor1-denitrification.Rproj (RStudio project file) and gtstyle.css (stylesheet for formatting the tables in the notebooks) are included.
Data
Data required to run the analysis scripts are archived in this dataset, other than strain_lookup.csv, a lookup table of strain abbreviations and full names included in the code repository for convenience. They should be placed in a folder or symbolic link called project within the unzipped code repository directory.
N2O_data_2022-08-03/N2O_Comparative_Study_Trial_(n)(date range).xlsx: These are the data from the N2O comparative study, where n is the trial number from 1-3 and date range is the begin and end date of the trial.
N2O_data_2022-08-03/Nitrogen_Substrate_Study_Trial(n)(date range).xlsx: These are the data from the N2O substrate study, where n is the trial number from 1-3 and date range is the begin and end date of the trial.
Outliers_NOR1_2022/Bioscreen_NOR1_Fungal_Growth_Assay(substrate)(oxygen level)_Outliers_BAO(date).xlsx: These are the raw Bioscreen data files in MS Excel format. The format of each file name includes the substrate (minimal medium with nitrite or nitrate and lysine), oxygen level (hypoxia or normoxia), and date of the run. This repository includes code to process these files, but the processed data are also included on Ag Data Commons, so it is not necessary to run the data processing portion of the code.
clean_data/bioscreen_clean_data.csv: This is an intermediate output file in CSV format generated by bioscreen_analysis.Rmd. It includes all the data from the Bioscreen assays in a clean analysis-ready format.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online data bases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.
Usage Notes
Land plant taxonomic lookup table
This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus.
plant_lookup.csv
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Data is archived here: https://doi.org/10.5281/zenodo.4818011
Data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours.
Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.
This data and code archive contains the following files and folders:
* README
  Description: text file with this description
* flowchart.pdf
  Description: a PDF file with a flow chart that illustrates how R scripts transform the raw data files to files that contain generated data sets and intermediate results and, finally, to the tables and figures that are presented in the paper.
* runAll.sh
  Description: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell such as most computers with MacOS, GNU/Linux, or Unix operating systems)
* Folder "DataRaw"
  Description: folder for raw data files
  This folder contains the following files:
  - DataRaw/COWS.xlsx
    Description: MS-Excel file with the number of cows per county
    Source: USDA NASS Quickstats
    Observations: All available counties and years from 2002 to 2012
  - DataRaw/milk_state.xlsx
    Description: MS-Excel file with average monthly milk yields per cow
    Source: USDA NASS Quickstats
    Observations: All available states from 1981 to 2018
  - DataRaw/TMAX.csv
    Description: CSV file with daily maximum temperatures
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018
  - DataRaw/VPD.csv
    Description: CSV file with daily maximum vapor pressure deficits
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018
  - DataRaw/countynamesandID.csv
    Description: CSV file with county names, state FIPS codes, and county FIPS codes
    Source: US Census Bureau
    Observations: All counties
  - DataRaw/statecentroids.csv
    Description: CSV file with latitudes and longitudes of state centroids
    Source: Generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function
    Observations: All states
* Folder "DataGenerated"
  Description: folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Results"
  Description: folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Figures"
  Description: folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.
* Folder "Tables"
  Description: folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.
* Folder "logFiles"
  Description: the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.
* PrepareCowsData.R
  Description: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses
* PrepareWeatherData.R
  Description: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses
* PrepareMilkData.R
  Description: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses
* CalcFrequenciesTHI_Temp.R
  Description: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state
* CalcAvgTHI.R
  Description: R script that calculates the average THI in each state
* PreparePanelTHI.R
  Description: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins
* PreparePanelTemp.R
  Description: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins
* PreparePanelFinal.R
  Description: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses
* EstimateTrendsTHI.R
  Description: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set
* EstimateModels.R
  Description: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications
* CalcCoefStateYear.R
  Description: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification
* SearchWeightMonths.R
  Description: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term
* TestModelSpec.R
  Description: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10
* CreateFigure1a.R
  Description: R script that creates subfigure a of Figure 1
* CreateFigure1b.R
  Description: R script that creates subfigure b of Figure 1
* CreateFigure2a.R
  Description: R script that creates subfigure a of Figure 2
* CreateFigure2b.R
  Description: R script that creates subfigure b of Figure 2
* CreateFigure2c.R
  Description: R script that creates subfigure c of Figure 2
* CreateFigure3.R
  Description: R script that creates the subfigures of Figure 3
* CreateFigure4.R
  Description: R script that creates the subfigures of Figure 4
* CreateFigure5_TableS6.R
  Description: R script that creates the subfigures of Figure 5 and Table S6
* CreateFigureS1.R
  Description: R script that creates Figure S1
* CreateFigureS2.R
  Description: R script that creates Figure S2
* CreateTableS2_S3_S7.R
  Description: R script that creates Tables S2, S3, and S7
* CreateTableS4_S5.R
  Description: R script that creates Tables S4 and S5
* CreateTableS8.R
  Description: R script that creates Table S8
* CreateTableS9.R
  Description: R script that creates Table S9
SpaceKnow uses satellite (SAR) data to capture activity in electric vehicles and automotive factories.
Data is updated daily, has an average lag of 4-6 days, and history back to 2017.
The insights provide you with level and change data that monitors the area which is covered with assembled light vehicles in square meters.
We offer 3 delivery options: CSV, API, and Insights Dashboard
Available companies
Rivian (NASDAQ: RIVN) for employee parking, logistics, logistic centers, product distribution & product in the US. (See use-case write up on page 4)
TESLA (NASDAQ: TSLA) indices for product, logistics & employee parking for Fremont, Nevada, Shanghai, Texas, Berlin, and Global level
Lucid Motors (NASDAQ: LCID) for employee parking, logistics & product in US
Why get SpaceKnow's EV datasets?
Monitor the company’s business activity: Near-real-time insights into the business activities of Rivian allow users to better understand and anticipate the company’s performance.
Assess Risk: Use satellite activity data to assess the risks associated with investing in the company.
Types of Indices Available
Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices. The first one is CFI-R which gives you level data, so it shows how many square meters are covered by metallic objects (for example assembled cars). The second one is CFI-S which gives you change data, so it shows you how many square meters have changed within the locations between two consecutive satellite images.
How to interpret the data
SpaceKnow indices can be compared with the related economic indicators or KPIs. If the economic indicator is in monthly terms, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator. Each data point will reflect approximately the sum of the month. If the economic indicator is in quarterly terms, perform a 90-day rolling sum and pick the last day of the quarter to compare with the economic indicator. Each data point will reflect approximately the sum of the quarter.
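The monthly comparison described above can be sketched in Python as follows. This is a minimal illustration, not SpaceKnow tooling: the function name and the input layout (a dictionary of date to index value with no missing days) are assumptions, and the values are hypothetical.

```python
from datetime import date, timedelta

def monthly_from_daily(daily, window=30):
    """Return {(year, month): rolling sum} sampled on the last day of each month.

    `daily` maps date -> index value and is assumed to contain every
    calendar day in its span (no gaps).
    """
    days = sorted(daily)
    result = {}
    for i, day in enumerate(days):
        if i + 1 < window:
            # Not enough history yet for a full rolling window.
            continue
        if (day + timedelta(days=1)).month != day.month:
            # `day` is the last calendar day of its month.
            window_days = days[i + 1 - window:i + 1]
            result[(day.year, day.month)] = sum(daily[d] for d in window_days)
    return result

# Hypothetical daily index: a constant value of 1.0 for 90 days from 2023-01-01.
start = date(2023, 1, 1)
daily = {start + timedelta(days=n): 1.0 for n in range(90)}
monthly = monthly_from_daily(daily)
```

For a quarterly indicator, the same idea applies with window=90 and sampling on the last day of each quarter instead of each month.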
Product index: This index monitors the area covered by manufactured cars. The larger the area covered by the assembled cars, the larger and faster the production of a particular facility. The index rises as production increases.
Product distribution index: This index monitors the area covered by assembled cars that are ready for distribution. The index covers locations in the Rivian factory. The distribution is done via trucks and trains.
Employee parking index: Like the previous index, this one indicates the area covered by cars, but those that belong to factory employees. This index is a good indicator of factory construction, closures, and capacity utilization. The index rises as more employees work in the factory.
Logistics index: The index monitors the movement of materials supply trucks in particular car factories.
Logistics Centers index: The index monitors the movement of supply trucks in warehouses.
Where the data comes from: SpaceKnow brings you information advantages by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company’s infrastructure searches and downloads new imagery every day, and the computations of the data take place within less than 24 hours.
In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the EV industry with just a 4-6 day lag, on average.
The EV data help you to estimate the performance of the EV sector and the business activity of the selected companies.
The backbone of SpaceKnow’s high-quality data is the locations from which data is extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.
Each individual location is precisely defined so that the resulting data does not contain noise such as surrounding traffic or changing vegetation with the season.
We use radar imagery and our own algorithms, so the final indices are not devalued by weather conditions such as rain or heavy clouds.
Use Case - Rivian:
SpaceKnow uses the quarterly production and delivery data of Rivian as a benchmark. Rivian aimed to produce 25,000 cars in 2022. To achieve this target, the company had to increase production by 45% by producing 10,683 cars in Q4. However, Q4 production was 10,020 cars, so the target was narrowly missed, with total production reaching 24,337 cars for FY22.
SpaceKnow indices help us to observe the company’s operations, and we are able to monitor if the company is set to meet its forecasts or not. We deliver five different indices for Rivian, and these indices observe logistic centers, employee parking lot, logistics, product, and prod...
This dataset contains all data and R code, in RMarkdown notebook format, needed to reproduce all statistical analysis, figures, and tables in the manuscript:
Jeffers, D., J. S. Smith, E. D. Womack, Q. D. Read, and G. L. Windham. 2024. Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines. Plant Disease. (citation to be updated upon final acceptance of MS)
There is a critical need to quickly and reliably identify corn genotypes that are resistant to accumulating aflatoxin in their kernels. We compared three methods of determining how resistant different corn genotypes are to aflatoxin accumulation: a field-based assay (side-needle inoculation) and two different lab-based assays (wounding and non-wounding kernel screening assays; KSA). In this data object, we present the data from the lab and field assays, statistical models that are fit to the data, procedures for comparing model fit of different variants of the model, and model predictions. This includes how reliably each assay identifies resistant and susceptible check varieties, and how well correlated the assay methods are with one another. Statistical analyses are done using R software, including Bayesian models fit with Stan software.
The following files are included:
ksa_analysis_revised.Rmd: RMarkdown notebook with all code needed to reproduce analyses and create figures and tables in manuscript
ksa_analysis_revised.html: HTML rendered output of notebook
step1_ksa.tsv: tab-separated data file with data from the lab assay. Columns include sample ID, genotype ID and entry code, year, treatment (wound or no-wound), subsample ID, replicate ID, aflatoxin concentration (in units of ng/g), logarithm of aflatoxin concentration, and a column indicating genotypes that are susceptible or resistant checks
step1_ksa_field.tsv: tab-separated data file with data from the field assay. Columns similar to the lab assay data file with an additional column for row in which the sample was planted.
ksa_cov_mod.tsv: tab-separated data file with secondary infection covariate data from the lab assay. Columns similar to the lab assay data file with columns for secondary Asp, Fus, and NI infections and their logarithms.
brmfits.zip: zip archive with 12 .rds files. These are model output files for the Bayesian mixed effect models presented in the MS that were fitted using the R function brm(). You may download these to reproduce output without having to compile and run the models yourself.
The three .tsv data files should be placed in a subdirectory called "data" in the same directory where the .Rmd notebook is located.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The data points in this dataset were obtained as follows. To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in Deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker set to 150 rpm, under constant illumination (50 μmol photons m⁻² s⁻¹). Then 100 μL of each sample was transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland) at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the Deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, one-way ANOVA (with Tukey's test) was performed in R version 3.3.3, with the significance level set at 0.05. Graphs were generated in RStudio v1.0.136. The code is deposited here.
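As a rough illustration of the one-way ANOVA step described above, the following Python sketch computes the F statistic from between-group and within-group sums of squares. It is not the deposited R code: the function name `one_way_anova_f`, the construct labels, and the fluorescence values are all hypothetical and chosen only to keep the arithmetic easy to follow. Turning the F statistic into a p-value for comparison against the 0.05 level, and running Tukey's post hoc test, would additionally need a statistics library (in R these steps are commonly done with `aov()` and `TukeyHSD()`; whether the deposited scripts use exactly these calls is not stated here).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of measurements per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual measurements around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical blanked fluorescence readings (arbitrary units), illustrative only
readings = {
    "construct_A": [1.0, 2.0, 3.0],
    "construct_B": [2.0, 3.0, 4.0],
    "construct_C": [5.0, 6.0, 7.0],
}

f_stat, df_b, df_w = one_way_anova_f(list(readings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with the given degrees of freedom indicates that at least one construct differs from the others; Tukey's test then identifies which pairs differ.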
Info
ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R 3.3.3
barplot_R.R -> code to generate bar plot in R 3.3.3
boxplotv2.R -> code to generate boxplot in R 3.3.3
pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.
who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.
who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.
Anova_Output_Summary_Guide.pdf -> explains the content of the ANOVA files
ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.
Consider citing our work.
Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.