Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
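To make the procedure concrete, here is a minimal R sketch of the REVS idea rather than the authors' implementation: it uses leaps::regsubsets() for the all-subsets step and a simple selection-frequency score as a stand-in for the paper's measure of empirical support.

```r
# Minimal sketch of the REVS idea (not the authors' implementation):
# rank predictors by how often they are selected across all-subsets models,
# then build nested models adding one predictor at a time and compare AIC.
library(leaps)

revs_sketch <- function(data, response, predictors) {
  f <- reformulate(predictors, response)
  all_sub <- summary(regsubsets(f, data = data, nvmax = length(predictors)))
  # empirical support proxy: selection frequency across subset sizes
  support <- sort(colSums(all_sub$which[, predictors, drop = FALSE]),
                  decreasing = TRUE)
  models <- lapply(seq_along(support), function(k) {
    lm(reformulate(names(support)[1:k], response), data = data)
  })
  data.frame(n_vars = seq_along(models), AIC = sapply(models, AIC))
}

# Example with simulated data:
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(100)
revs_sketch(d, "y", c("x1", "x2", "x3"))
```

The optimal model is then the nested model with the lowest AIC.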
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
This dataset is a collection of all the posts and comments made on Reddit's /r/datasets board, from the subreddit's inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.
In order to use this dataset, you will need software that can open CSV files, such as a spreadsheet application (e.g., LibreOffice Calc or Microsoft Excel), a plain-text editor, or an analysis environment such as R or Python. You will also need a web browser such as Google Chrome or Mozilla Firefox.
Once you have the necessary software installed, open the Reddit Dataset folder and open the the-reddit-dataset-dataset-posts.csv file in your preferred application.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on post titles to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes.
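As a minimal sketch of that kind of analysis (assuming the column names in the data dictionary below, and that created_utc parses as a Unix timestamp):

```r
# Read both files; column names follow the data dictionary below.
posts    <- read.csv("the-reddit-dataset-dataset-posts.csv")
comments <- read.csv("the-reddit-dataset-dataset-comments.csv")

# Average post score per year (created_utc assumed to be a Unix timestamp).
posts$year <- format(as.POSIXct(posts$created_utc, origin = "1970-01-01"), "%Y")
aggregate(score ~ year, data = posts, FUN = mean)

# Mean comment score by sentiment label.
aggregate(score ~ sentiment, data = comments, FUN = mean)
```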
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiment of posts and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: the-reddit-dataset-dataset-comments.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| body | The body of the post. (String) |
| sentiment | The sentiment of the post. (String) |
| score | The score of the post. (Integer) |
File: the-reddit-dataset-dataset-posts.csv

| Column name | Description |
|:---------------|:----------------------------------------------------|
| type | The type of post. (String) |
| subreddit.name | The name of the subreddit. (String) |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
| created_utc | The time the post was created, in UTC. (Timestamp) |
| permalink | The permalink for the post. (String) |
| score | The score of the post. (Integer) |
| domain | The domain of the post. (String) |
| url | The URL of the post. (String) |
| selftext | The self-text of the post. (String) |
| title | The title of the post. (String) |
If you use this dataset in your research, please credit the original authors and SocialGrep.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]
The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.
Data Set description:
The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.
The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.
The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes in which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables assessment of performance when only the interfaces of apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments.
Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables comparison of the typical run times of comparison methods and also represents an interesting dataset to screen for interface similarities.
References:
[1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
[2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
[3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
[4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pathway Multi-Omics Simulated Data
These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".
There are 100 sets of random modifications to centred and scaled copy number, gene expression, and proteomics data, saved as compressed data files for the R programming language. The sets are stored in sub-folders labelled "sim001", "sim002", ..., "sim100" (the first 50 in "pt1" and the second 50 in "pt2"). Each folder contains the following contents:
1. "indicatorMatricesXXX_ls.RDS": a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...).
2. "CNV_partitionA_deltaB.RDS": the synthetically modified copy number variation data, where A is the proportion of genes in each gene set to receive the synthetic treatment (partition 1 is 20%, 2 is 40%, 3 is 60% and 4 is 80%) and B is the signal strength in units of standard deviations.
3. "RNAseq_partitionA_deltaB.RDS": the synthetically modified gene expression data (same parameter legend as CNV).
4. "Prot_partitionA_deltaB.RDS": the synthetically modified protein expression data (same parameter legend as CNV).
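As a minimal sketch, one run's files can be loaded in R like so (the partition and delta values below are illustrative placeholders for the A and B labels above):

```r
# Load one simulation run; file names follow the naming scheme above.
sim_dir <- "pt1/sim001"

indicators <- readRDS(file.path(sim_dir, "indicatorMatrices001_ls.RDS"))
cnv  <- readRDS(file.path(sim_dir, "CNV_partition1_delta1.RDS"))    # A/B values illustrative
rna  <- readRDS(file.path(sim_dir, "RNAseq_partition1_delta1.RDS"))
prot <- readRDS(file.path(sim_dir, "Prot_partition1_delta1.RDS"))
```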
Supplemental Files
The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study, in Gene Matrix Transpose format. Scripts to create and analyze these data sets are available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File descriptions:
All CSV files contain results from the different models (PAMM, AARs, linear models, MRPPs) on each iteration of the simulation, one row per iteration. "results_perfect_detection.csv" contains the results from the first simulation part with all observations. "results_imperfect_detection.csv" contains the results from the first simulation part with observations randomly thinned to mimic imperfect detection.
- ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
- PAMM30: p-value of the PAMM run on the 30-day survey.
- PAMM7: p-value of the PAMM run on the 7-day survey.
- AAR1: ratio value for the Avoidance-Attraction-Ratio AB/BA.
- AAR2: ratio value for the Avoidance-Attraction-Ratio BAB/BB.
- Harmsen_P: p-value from the linear model with the Species1*Species2 interaction from Harmsen et al. (2009).
- Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
- Karanth_permA: rank of the observed interval-duration median (AB and BA undifferentiated) relative to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
- MurphyAB_permA: rank of the observed AB interval-duration median relative to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- MurphyBA_permA: rank of the observed BA interval-duration median relative to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- Karanth_permB: rank of the observed interval-duration median (AB and BA undifferentiated) relative to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
- MurphyAB_permB: rank of the observed AB interval-duration median relative to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
- MurphyBA_permB: rank of the observed BA interval-duration median relative to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
"results_int_dir_perf_det.csv" refers to the results from the second simulation part, with all the observations."results_int_dir_imperf_det.csv" refers to the results from the second simulation part, with randomly thinned observations to mimick imperfect detection.ID_run: identified of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of A on B.p_pamm7_AB: p-value of the PAMM running on the 7-days survey testing for the effect of B on A.AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.AAR2_BAB: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.AAR2_ABA: ratio value for the Avoidance-Attraction-Ratio calculating ABA/AA.Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).Karanth_permA: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021). Karanth_permB: rank of the observed interval duration median (AB and BA undifferenciated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021). MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
Script file descriptions:
- 1_Functions: R script containing the functions: the MRPP from Karanth et al. (2017), adapted here for time efficiency; the MRPP from Murphy et al. (2021), adapted here for time efficiency; a version of the ct_to_recurrent() function from the recurrent package adapted to run in parallel on the simulation datasets; and the simulation() function used to simulate observations of two species with reciprocal effects on each other.
- 2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization and the random thinning mimicking imperfect detection.
- 3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.
- 3_1_Real data comparison: R script containing the fit of the different models tested on the real data example from Murphy et al. (2021).
- 4_Graphs: R script containing the code for plotting results from the simulation part and appendices.
- 5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the Karanth et al. (2017) and Murphy et al. (2021) code, the version adapted for time efficiency, and a comparison verifying similarity of results.
- 5_2_Appendix - Multi-response procedure permutation difference: R script containing R code to test for differences between the MRPP approaches according to the species on which permutations are done.
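As a minimal sketch of how these result files can be summarised (column names per the lists above; the 0.05 threshold is illustrative):

```r
# Proportion of iterations in which each approach returns p < 0.05.
res <- read.csv("results_perfect_detection.csv")

alpha <- 0.05
colMeans(res[, c("PAMM30", "PAMM7", "Harmsen_P", "Niedballa_P")] < alpha,
         na.rm = TRUE)
```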
The dataset includes a PDF file containing the results and an Excel file with the following tables:
Table S1. Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al.
Table S2. Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al.
Table S3. Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 web tool using Diamanti et al.
Table S4. Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 web tool for Priolo et al.
Table S5. Data quality test results for running 100 iterations on the HMDB database.
Table S6. Data quality test results for running 100 iterations on the KEGG database.
Table S7. Data quality test results for running 100 iterations on the ChEBI database.
Table S8. Data quality test results for running 100 iterations on the PubChem database.
Table S9. Data quality test results for running 100 iterations on the LIPID MAPS database.
Table S10. The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al.
Table S11. An example of an input matrix for MetaFetcheR.
Table S12. Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al.
Table S13. Data set from Diamanti et al.
Table S14. Data set from Priolo et al.
Table S15. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al.
Table S16. Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al.
Table S17. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
Table S18. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
(See the "index" tab in the Excel file for more information.)
Small-compound databases contain a large amount of information about metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish a means of data access at the early stages of a project can lead to mislabelled compounds, reduced statistical power and long delays in the delivery of results.
We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.
The dataset was originally published in DiVA and moved to SND in 2024.
The R Manual for QCA comprises a PDF file that describes all the steps and code needed to prepare and conduct a Qualitative Comparative Analysis (QCA) study in R. It is complemented by an R script that can be customized as needed. The dataset further includes two files with sample data, for the set-theoretic analysis and the visualization of QCA results. The R Manual for QCA is the online appendix to "Qualitative Comparative Analysis: An Introduction to Research Design and Application", Georgetown University Press, 2021.
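For orientation, a typical set-theoretic analysis with the QCA package looks roughly like the sketch below. This is a generic illustration, not an excerpt from the manual; the data frame, condition names and inclusion cut-off are assumptions.

```r
# Generic QCA sketch (not from the manual). Assumes a calibrated data frame
# "d" with fuzzy-set conditions A, B, C and outcome OUT.
library(QCA)

tt  <- truthTable(d, outcome = "OUT", conditions = "A, B, C",
                  incl.cut = 0.8, show.cases = TRUE)
sol <- minimize(tt, details = TRUE)   # Boolean minimization of the truth table
sol
```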
By City of San Francisco [source]
This dataset covers the late-night departure runways used by aircraft at San Francisco International Airport (SFO). From 1:00 a.m. to 6:00 a.m., aircraft are directed to runways 10L/R, 01L/R or 28L/R (with an immediate right turn when safety and weather conditions permit) to reduce noise in the surrounding residential communities; these over-water departure procedures direct aircraft over the bay instead. The data are broken down by runway, month and year of departure, along with the percentage of each month's total departures from each runway, giving a comprehensive look at SFO's preferential late-night runway use.
This dataset can be used to analyze late-night departures from San Francisco International Airport in order to study the impact of runway usage on air and noise pollution in nearby residential communities. It contains the number of departures from each runway group (01L/R, 10L/R, 19L/R and 28L/R) for a given year and month. The percentage of total departures by runway shows which runways aircraft use most during late-night hours.
To use this dataset, first become familiar with the column names, such as Year, Month, 01L/R (number of departures from runway 01L/R) and 01L/R Percent of Departures (percentage of departures from runway 01L/R), as well as with terms such as "departure" and "late night", which are used prominently throughout.
Once familiar with these details, you can explore the data for insights into how specific runways are used for late-night flight operations and note any patterns or trends that emerge across months or years. Additionally, by comparing percentages between runways, you can measure which runways are preferred during periods of heavier traffic, such as holidays or summer months.
- To identify areas of the San Francisco Airport prone to noise pollution from aircraft and develop ways to limit it.
- To analyze the impacts of changing departure runway preferences on noise pollution levels over residential communities near the airport.
- To monitor seasonal trends in late-night aircraft departures by runway, along with identifying peak hours for each runway, in order to inform flight controllers and develop improved flight control regulations and procedures at the San Francisco Airport.
If you use this dataset in your research, please credit the original authors.
Data Source
See the dataset description for more information.
File: late-night-preferential-runway-use-1.csv

| Column name | Description |
|:----------------------------|:----------------------------------------------------------|
| Year | The year of the data. (Integer) |
| Month | The month of the data. (String) |
| 01L/R | The number of departures from runway 01L/R. (Integer) |
| 01L/R Percent of Departures | The percentage of departures from runway 01L/R. (Float) |
| 10L/R | The number of departures from runway 10L/R. (Integer) |
| 10L/R Percent of Departures | The percentage of departures from runway 10L/R. (Float) |
| 19L/R | The number of departures from runway 19L/R. (Integer) |
| 19L/R Percent of Departures | The percentage of departures from runway 19L/R. (Float) |
| 28L/R | The number of departures from runway 28L/R. (Integer) |
| 28L/R Percent of Departures | The percentage of departures from runway 28L/R. (Float) |
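A minimal sketch of working with these columns in R (file and column names per the table above):

```r
# Recompute each runway's monthly share of late-night departures.
rwy <- read.csv("late-night-preferential-runway-use-1.csv", check.names = FALSE)

counts <- rwy[, c("01L/R", "10L/R", "19L/R", "28L/R")]
share  <- 100 * counts / rowSums(counts)   # percent of total departures per row
head(cbind(rwy[, c("Year", "Month")], round(share, 1)))
```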
If you use this dataset in your research, please credit the original authors and the City of San Francisco.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The four datasets 'phone', 'game', 'social', and 'video' are the processed datasets used as input files for the Mplus models (in .csv rather than .dat format). The dataset 'phone' contains all data related to the main analyses of daytime, pre-bedtime and post-bedtime smartphone use. The datasets 'game', 'social', and 'video' contain the data for the exploratory analyses of game app, social media app, and video player app use, respectively. The dataset 'timeframes' contains information about respondents' bedtime and wake-up time, which is required to calculate the three timeframes (daytime, pre-bedtime, and post-bedtime).
The materials used, including the R and Mplus syntaxes (https://osf.io/tpj98/) and the preregistration of the study (https://osf.io/kxw2h/), can be found on OSF. For more information, please contact the authors via t.siebers@uva.nl or info@project-awesome.nl.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Creation of the 2021/2 Output Area Classification [External GitHub Repo LINK]
Scripts
- Metadata.R – creating reference tables and glossaries for Census data.
- Downloading_data.R – downloading 2011 and 2021 Census data from NOMIS using the nomisr package.
- Comparing_Censuses.R – data cleaning and amalgamation.
- Transforming_Census_data.R – data manipulation and transformation for the classification.
- NI.R – modelling 2021 data for Northern Ireland.
- Correlation.R – testing correlation between the variables.
- Pre-clustering.R – preparing data for clustering.
- Clustering.R – clustering of the data.
- Post-clustering.R – creating maps and plots of the cluster solution.
- Testing_clustering.R
- Clustergrams.ipynb – creating clustergrams in Python. (Credits: Prof Alex Singleton)
- Industry.R – loading Industry data.
- Industry_classification.R – creating a geodemographic classification with Industry variables.
- Graph_comparisons.R – comparing data with graphs.
Data
List of folders (subfolders & files) in the project:
- API – Census data downloaded and saved with use of the nomisr package.
- Clean – amalgamated data ready for the analysis.
- Raw_counts – datasets with raw counts.
- Percentages – datasets transformed into percentages.
- Transformed – datasets transformed with IHS (analysis-ready).
- Final_variables – datasets with OAC variables only.
- All_data_clustering – results of the clustering for all investigated datasets.
- Clustering – datasets with cluster assignment for the UK and centroids.
- Lookups – reference tables for 2011 and 2021 Census variables.
- NISRA 2021 – 2021 Census data at LGD level for Northern Ireland.
- Objects – R objects created and stored to ensure consistency of the results or to load big files.
- SIR – contingency tables on disability counts by age, utilised for calculation of the Standardised Illness Ratio.
- shapefiles – folder containing shapefiles used for some of the calculations.
Plots
- Bar_plots – comparison of clusters to the UK (as well as Supergroup and Group averages).
- Clustergrams – plots used to establish the number of clusters at each classification level.
- Maps
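The download step relies on the nomisr package; a minimal sketch of that kind of query is below. The dataset id and geography code here are illustrative placeholders, not the ones used in Downloading_data.R.

```r
# Illustrative NOMIS query via nomisr; id/geography values are placeholders.
library(nomisr)

census <- nomis_get_data(id = "NM_2021_1", time = "latest",
                         geography = "TYPE499")
head(census)
```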
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
Cluster analysis is a popular machine learning technique for segmenting datasets so that similar data points fall in the same group. For those familiar with R, the R package "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) can be used for cluster evaluation. The package provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. It is compatible with k-means, fuzzy c-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, follow the instructions provided in the R documentation.
For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785
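As a minimal sketch of the kind of benchmark these datasets support (base R only; the UniversalCVI functions themselves are not shown here):

```r
# Generate a small Gaussian-mixture dataset with three known sub-groups,
# cluster it with k-means, and compare clusters to the true labels.
set.seed(42)
n <- 150
truth <- rep(1:3, each = n / 3)
x <- cbind(rnorm(n, mean = c(0, 4, 8)[truth]),
           rnorm(n, mean = c(0, 4, 0)[truth]))

km <- kmeans(x, centers = 3, nstart = 25)
table(truth, km$cluster)   # agreement between known classes and clusters
```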
The datasets are also available at https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weights (SWi) and standardized beta (ß*), to compare their performance with WiBB at ranking predictor importance under various scenarios. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.
Methods
To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference in the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r gave three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with the relative sum of weights (SWi) and standardized beta (ß*) methods, to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species comprised occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
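A minimal sketch of this simulation design, using the ∆r = 0.2 structure; only the basic rmvnorm() draw is shown, not the full data.simulation modifications, and the assumption that predictors are mutually uncorrelated is ours for illustration:

```r
# Draw y and x1..x4 with preset correlations via mvtnorm::rmvnorm(), then fit an LM.
library(mvtnorm)

r <- c(0.6, 0.4, 0.2, 0.0)            # correlations of x1..x4 with y
sigma <- diag(5)                      # columns: y, x1..x4
sigma[1, 2:5] <- sigma[2:5, 1] <- r   # predictors assumed mutually uncorrelated

set.seed(1)
d <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
names(d) <- c("y", paste0("x", 1:4))

fit <- lm(y ~ x1 + x2 + x3 + x4, data = d)
summary(fit)$coefficients   # the spurious x4 coefficient should be near zero
```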
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compare fastICA/InfoMax ICA/PGICA accuracies.
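As a minimal sketch of one such accuracy comparison (fastICA only; InfoMax ICA and PGICA are not shown, and the correlation-based accuracy proxy is our assumption):

```r
# Recover two mixed sources with fastICA and correlate them with ground truth.
library(fastICA)

set.seed(1)
S <- cbind(sin(seq(0, 8 * pi, length.out = 1000)),  # true sources
           runif(1000) - 0.5)
A <- matrix(c(1, 0.6, 0.4, 1), 2, 2)                # mixing matrix
X <- S %*% A

ica <- fastICA(X, n.comp = 2)
abs(cor(ica$S, S))   # accuracy proxy: |correlation| with the true sources
```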
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
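As a minimal sketch of two of the imputation families named above, using the Bioconductor pcaMethods package; the matrix name, orientation, and parameter values are assumptions, not the study's settings:

```r
# "m": numeric matrix with samples in rows, protein/fragment intensities in
# columns, and NA for missing values. Parameter choices are illustrative.
library(pcaMethods)

m_lls  <- completeObs(llsImpute(m, k = 10))               # local least squares (LLS)
m_bpca <- completeObs(pca(m, method = "bpca", nPcs = 3))  # Bayesian PCA (BPCA)
```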
SpaceKnow uses synthetic aperture radar (SAR) satellite data to capture activity at electric vehicle and automotive factories.
Data is updated daily, has an average lag of 4-6 days, and history back to 2017.
The insights provide level and change data monitoring the area, in square meters, covered by assembled light vehicles.
We offer three delivery options: CSV, API, and Insights Dashboard.
Available companies:
- Rivian (NASDAQ: RIVN): employee parking, logistics, logistics centers, product distribution & product in the US. (See use-case write-up on page 4.)
- Tesla (NASDAQ: TSLA): indices for product, logistics & employee parking for Fremont, Nevada, Shanghai, Texas, Berlin, and at the global level.
- Lucid Motors (NASDAQ: LCID): employee parking, logistics & product in the US.
Why get SpaceKnow's EV datasets?
Monitor the company’s business activity: Near-real-time insights into the business activities of Rivian allow users to better understand and anticipate the company’s performance.
Assess Risk: Use satellite activity data to assess the risks associated with investing in the company.
Types of Indices Available
Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices. The first is CFI-R, which gives you level data: it shows how many square meters are covered by metallic objects (for example, assembled cars). The second is CFI-S, which gives you change data: it shows how many square meters have changed within the locations between two consecutive satellite images.
How to interpret the data
SpaceKnow indices can be compared with related economic indicators or KPIs. If the economic indicator is reported monthly, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator; each data point will then reflect approximately the sum for the month. If the economic indicator is reported quarterly, perform a 90-day rolling sum and pick the last day of the 90-day window; each data point will then reflect approximately the sum for the quarter.
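A minimal sketch of that recipe in R, using the zoo package; idx_values and idx_dates are placeholders for a daily CFI series:

```r
library(zoo)

idx    <- zoo(idx_values, order.by = as.Date(idx_dates))  # daily CFI series
roll30 <- rollsumr(idx, k = 30)                           # trailing 30-day sum

# keep the value on the last available day of each month
last_of_month <- !duplicated(as.yearmon(index(roll30)), fromLast = TRUE)
monthly <- roll30[last_of_month]   # comparable to a monthly economic indicator
```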
Product index
This index monitors the area covered by manufactured cars. The larger the area covered by assembled cars, the larger and faster the production of a particular facility. The index rises as production increases.
Product distribution index
This index monitors the area covered by assembled cars that are ready for distribution. The index covers locations in the Rivian factory. The distribution is done via trucks and trains.
Employee parking index
Like the previous index, this one indicates the area covered by cars, but those that belong to factory employees. This index is a good indicator of factory construction, closures, and capacity utilization. The index rises as more employees work in the factory.
Logistics index
The index monitors the movement of materials supply trucks at particular car factories.
Logistics Centers index
The index monitors the movement of supply trucks at warehouses.
Where the data comes from: SpaceKnow brings you information advantages by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company’s infrastructure searches and downloads new imagery every day, and the computations of the data take place within less than 24 hours.
In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the EV industry with just a 4-6 day lag, on average.
The EV data help you to estimate the performance of the EV sector and the business activity of the selected companies.
The backbone of SpaceKnow’s high-quality data is the locations from which data is extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.
Each individual location is precisely defined so that the resulting data does not contain noise such as surrounding traffic or changing vegetation with the season.
We use radar imagery and our own algorithms, so the final indices are not devalued by weather conditions such as rain or heavy clouds.
→ Reach out to get a free trial
Use Case - Rivian:
SpaceKnow uses Rivian's quarterly production and delivery data as a benchmark. Rivian targeted production of 25,000 cars in 2022. To achieve this target, the company had to increase production by 45% in Q4, to 10,683 cars. However, actual Q4 production was 10,020 cars, so the target was narrowly missed, with total FY22 production reaching 24,337 cars.
SpaceKnow indices help us to observe the company’s operations, and we are able to monitor if the company is set to meet its forecasts or not. We deliver five different indices for Rivian, and these indices observe logistic centers, employee parking lot, logistics, product, and prod...
Note: This dataset is superseded by https://doi.org/10.15482/USDA.ADC/30210112
Note: Data files will be made available upon manuscript publication.
This dataset contains all code and data needed to reproduce the analyses in the manuscript: IDENTIFICATION OF A KEY TARGET FOR ELIMINATION OF NITROUS OXIDE, A MAJOR GREENHOUSE GAS. Blake A. Oakley (1), Trevor Mitchell (2), Quentin D. Read (3), Garrett Hibbs (1), Scott E. Gold (2), Anthony E. Glenn (2). (1) Department of Plant Pathology, University of Georgia, Athens, GA, USA. (2) Toxicology and Mycotoxin Research Unit, U.S. National Poultry Research Center, United States Department of Agriculture-Agricultural Research Service, Athens, GA, USA. (3) Southeast Area, United States Department of Agriculture-Agricultural Research Service, Raleigh, NC, USA. The citation will be updated upon acceptance of the manuscript.
Brief description of study aims
Denitrification is a chemical process that releases nitrous oxide (N2O), a potent greenhouse gas. The NOR1 gene is part of the denitrification pathway in Fusarium. Three experiments were conducted for this study. (1) The N2O comparative experiment compares denitrification rates, as measured by N2O production, of a variety of Fusarium spp. strains with and without the NOR1 gene. (2) The N2O substrate experiment compares denitrification rates of selected strains on different growth media (substrates). For parts 1 and 2, linear models are fit comparing N2O production between strains and/or substrates. (3) The Bioscreen growth assay tests whether there is a pleiotropic effect of the NOR1 gene. In this portion of the analysis, growth curves are fit to assess differences in growth rate and carrying capacity between selected strains with and without the NOR1 gene.
Code
All code is included in a .zip archive generated from a private git repository on 2022-10-13 and archived as part of this dataset. The code is contained in R scripts and RMarkdown notebooks. There are two components to the analysis: the denitrification analysis (comprising parts 1 and 2 described above) and the Bioscreen growth analysis (part 3). The scripts for each are listed and described below.
Analysis of results of denitrification experiments (parts 1 and 2):
- NOR1_denitrification_analysis.Rmd: RMarkdown notebook containing all R code to analyze the experimental data comparing nitrous oxide emissions; it covers the results of both the comparative study and the substrate study.
- n2o_subgroup_figures.R: R script to create additional figures using the output from the RMarkdown notebook.
Analysis of results of Bioscreen growth assay (part 3):
- bioscreen_analysis.Rmd: RMarkdown notebook containing all R code needed to analyze the results of the Bioscreen assay comparing growth of the different strains. It could be run as is; however, the model-fitting portion was run on a high-performance computing cluster with the following scripts:
- bioscreen_fit_simpler.R: R script containing only the model-fitting portion of the Bioscreen analysis, fit using the Stan modeling language interfaced with R through the brms and cmdstanr packages.
- job_bssimple.sh: job submission shell script used to submit the model-fitting R job to the USDA SciNet high-performance computing cluster.
Additional scripts developed as part of the analysis but not required to reproduce the analyses in the manuscript are in the deprecated/ folder.
Also note that the files nor1-denitrification.Rproj (RStudio project file) and gtstyle.css (stylesheet for formatting the tables in the notebooks) are included.
Data
Data required to run the analysis scripts are archived in this dataset, other than strain_lookup.csv, a lookup table of strain abbreviations and full names included in the code repository for convenience. The data files should be placed in a folder or symbolic link called "project" within the unzipped code repository directory.
- N2O_data_2022-08-03/N2O_Comparative_Study_Trial_(n)_(date range).xlsx: data from the N2O comparative study, where n is the trial number from 1-3 and date range is the begin and end date of the trial.
- N2O_data_2022-08-03/Nitrogen_Substrate_Study_Trial_(n)_(date range).xlsx: data from the N2O substrate study, with the same naming convention.
- Outliers_NOR1_2022/Bioscreen_NOR1_Fungal_Growth_Assay_(substrate)_(oxygen level)_Outliers_BAO_(date).xlsx: the raw Bioscreen data files in MS Excel format. Each file name includes the substrate (minimal medium with nitrite or nitrate and lysine), oxygen level (hypoxia or normoxia), and date of the run. The repository includes code to process these files, but the processed data are also included on Ag Data Commons, so it is not necessary to run the data-processing portion of the code.
- clean_data/bioscreen_clean_data.csv: an intermediate output file in CSV format generated by bioscreen_analysis.Rmd. It includes all the data from the Bioscreen assays in a clean, analysis-ready format.
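For the growth-curve portion, a minimal brms sketch of a logistic growth model with strain-level parameters is below; the variable names, priors, and settings are illustrative assumptions, not the study's actual model (which lives in bioscreen_fit_simpler.R):

```r
# Illustrative logistic growth-curve model fit with brms (nonlinear syntax).
library(brms)

growth_formula <- bf(
  od ~ K / (1 + exp(-r * (time - t0))),   # od: optical density readings
  K + r + t0 ~ strain,                    # capacity, rate, midpoint vary by strain
  nl = TRUE
)

fit <- brm(
  growth_formula,
  data  = bioscreen_data,                 # placeholder data frame
  prior = c(prior(normal(1, 1), nlpar = "K", lb = 0),
            prior(normal(0.1, 0.1), nlpar = "r", lb = 0),
            prior(normal(24, 12), nlpar = "t0")),
  chains = 4, cores = 4
)
```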
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
These are the data summarising the modelled Hydrological Response Variable (HRV) variability versus climate interannual variability which has been used as an indicator of risk. For example, to understand the significance of the modelled increases in low-flow days, it is useful to look at them in the context of the interannual variability in low-flow days due to climate. In other words, are the modelled increases due to additional coal resource development within the natural range of variability of the longer-term flow regime, or are they potentially moving the system outside the range of hydrological variability it experiences under the current climate? The maximum increase in the number of low-flow days due to additional coal resource development relative to the interannual variability in low-flow days under the baseline has been adopted to put some context around the modelled changes. If the maximum change is small relative to the interannual variability due to climate (e.g. an increase of 3 days relative to a baseline range of 20 to 50 days), then the risk of impacts from the changes in low-flow days is likely to be low. If the maximum change is comparable to or greater than the interannual variability due to climate (e.g. an increase of 200 days relative to a baseline range of 20 to 50 days), then there is a greater risk of impact on the landscape classes and assets that rely on this water source. Here changes comparable to or greater than interannual variability are interpreted as presenting a risk. However, the change due to the additional coal resource development is additive, so even a 'less than interannual variability' change is not free from risk. Results of the interannual variability comparison should be viewed as indicators of risk.
This dataset was generated using 1000 HRV simulations together with climate inputs. Ratios between the variability in HRVs and the variability attributable to interannual climate variability were calculated for the HRVs. Results of the interannual variability comparison should be viewed as indicators of risk.
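A minimal sketch of the risk-indicator logic described above, using the illustrative numbers from the text:

```r
# Compare the maximum modelled change in low-flow days with the baseline
# interannual range (numbers taken from the example above).
baseline_range <- c(20, 50)   # baseline interannual range of low-flow days
max_increase   <- 3           # maximum modelled increase due to development

risk_ratio <- max_increase / diff(baseline_range)
risk_ratio   # well below 1: within natural variability; >= 1 flags greater risk
```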
Bioregional Assessment Programme (2017) HUN Comparison of model variability and interannual variability. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/1c0a19f9-98c2-4d92-956d-dd764aaa10f9.
Derived From River Styles Spatial Layer for New South Wales
Derived From SYD ALL climate data statistics summary
Derived From HUN AWRA-R Observed storage volumes Glenbawn Dam and Glennies Creek Dam
Derived From Hunter River Salinity Scheme Discharge NSW EPA 2006-2012
Derived From HUN AWRA-R simulation nodes v01
Derived From Bioregional Assessment areas v06
Derived From Hunter AWRA Hydrological Response Variables (HRV)
Derived From GEODATA 9 second DEM and D8: Digital Elevation Model Version 3 and Flow Direction Grid 2008
Derived From HUN AWRA-L simulation nodes_v01
Derived From Bioregional Assessment areas v04
Derived From HUN AWRA-R Gauge Station Cross Sections v01
Derived From Gippsland Project boundary
Derived From Natural Resource Management (NRM) Regions 2010
Derived From BA All Regions BILO cells in subregions shapefile
Derived From Hunter Surface Water data v2 20140724
Derived From HUN AWRA-R River Reaches Simulation v01
Derived From HUN AWRA-L simulation nodes v02
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From HUN AWRA-R Irrigation Area Extents and Crop Types v01
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From BA SYD selected GA TOPO 250K data plus added map features
Derived From HUN gridded daily PET from 1973-2102 v01
Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014
Derived From Bioregional Assessment areas v03
Derived From IQQM Model Simulation Regulated Rivers NSW DPI HUN 20150615
Derived From HUN AWRA-R calibration catchments v01
Derived From Bioregional Assessment areas v05
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From National Surface Water sites Hydstra
Derived From Selected streamflow gauges within and near the Hunter subregion
Derived From ASRIS Continental-scale soil property predictions 2001
Derived From Hunter Surface Water data extracted 20140718
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From HUN AWRA-R calibration nodes v01
Derived From HUN future climate rainfall v01
Derived From HUN AWRA-LR Model v01
Derived From HUN AWRA-L ASRIS soil properties v01
Derived From HUN AWRAR restricted input 01
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From Victoria - Seamless Geology 2014
Derived From HUN AWRA-L Site Station Cross Sections v01
Derived From HUN AWRA-R simulation catchments v01
Derived From HUN AWRA-R Simulation Node Cross Sections v01
Derived From Climate model 0.05x0.05 cells and cell centroids
This dataset contains all data and R code, in RMarkdown notebook format, needed to reproduce all statistical analysis, figures, and tables in the manuscript:
Jeffers, D., J. S. Smith, E. D. Womack, Q. D. Read, and G. L. Windham. 2024. Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines. Plant Disease. (Citation to be updated upon final acceptance of the manuscript.)
There is a critical need to quickly and reliably identify corn genotypes that are resistant to accumulating aflatoxin in their kernels. We compared three methods of determining how resistant different corn genotypes are to aflatoxin accumulation: a field-based assay (side-needle inoculation) and two different lab-based assays (wounding and non-wounding kernel screening assays; KSA). In this data object, we present the data from the lab and field assays, statistical models that are fit to the data, procedures for comparing model fit of different variants of the model, and model predictions. This includes how reliably each assay identifies resistant and susceptible check varieties, and how well correlated the assay methods are with one another. Statistical analyses are done using R software, including Bayesian models fit with Stan software.
The following files are included:
- ksa_analysis_revised.Rmd: RMarkdown notebook with all code needed to reproduce the analyses and create the figures and tables in the manuscript.
- ksa_analysis_revised.html: rendered HTML output of the notebook.
- step1_ksa.tsv: tab-separated data file with data from the lab assay. Columns include sample ID, genotype ID and entry code, year, treatment (wound or no-wound), subsample ID, replicate ID, aflatoxin concentration (in units of ng/g), logarithm of aflatoxin concentration, and a column indicating genotypes that are susceptible or resistant checks.
- step1_ksa_field.tsv: tab-separated data file with data from the field assay. Columns are similar to the lab assay data file, with an additional column for the row in which the sample was planted.
- ksa_cov_mod.tsv: tab-separated data file with secondary-infection covariate data from the lab assay. Columns are similar to the lab assay data file, with columns for secondary Asp, Fus, and NI infections and their logarithms.
- brmfits.zip: zip archive with 12 .rds files. These are model output files for the Bayesian mixed-effects models presented in the manuscript, fitted using the R function brm(). You may download these to reproduce output without having to compile and run the models yourself.
The three .tsv data files should be placed in a subdirectory called "data" in the same directory where the .Rmd notebook is located.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online data bases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.
Usage Notes
Land plant taxonomic lookup table
This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus.
plant_lookup.csv
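A minimal sketch of the intended workflow; the exact function signatures are assumptions to be checked against the phyndr README, and phy and trait_species are placeholders:

```r
# Swap tips that lack data for phylogenetically equivalent tips that have data.
library(phyndr)
library(taxonlookup)

lookup <- plant_lookup()   # genus-level taxonomy for land plants

# phy: an ape "phylo" tree; trait_species: character vector of species with data
new_phy <- phyndr_taxonomy(phy, trait_species, lookup)
```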
Characteristics of birdsong, especially minimum frequency, have been shown to vary for some species between urban and rural populations and along urban-rural gradients. However, few urban-rural comparisons of song complexity—and none that we know of based on the number of distinct song types in repertoires—have occurred. Given the potential ability of song repertoire size to indicate bird condition, we primarily sought to determine if number of distinct song types displayed by Song Sparrows (Melospiza melodia) varied between an urban and a rural site. We determined song repertoire size of 24 individuals; 12 were at an urban (‘human-dominated’) site and 12 were at a rural (‘agricultural’) site. Then, we compared song repertoire size, note rate, and peak frequency between these sites. Song repertoire size and note rate did not vary between our human-dominated and agricultural sites. Peak frequency was greater at the agricultural site. Our finding that peak frequency was higher at the agri...