100+ datasets found
  1. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
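The nested-model procedure described in this abstract can be sketched in a few lines of Python. This is a rough illustration of the REVS idea only, not the authors' R implementation: the summed-Akaike-weight support measure and the plain OLS/Gaussian-AIC formulas are my assumptions.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    # Ordinary least squares with an intercept; returns the residual
    # sum of squares and the number of fitted coefficients.
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    return rss, Xd.shape[1]

def aic(rss, n, k):
    # Gaussian-likelihood AIC; k coefficients plus the error variance.
    return n * np.log(rss / n) + 2 * (k + 1)

def revs(X, y):
    # 1) All-subsets regression: fit every non-empty predictor subset.
    n, p = X.shape
    subsets, aics = [], []
    for r in range(1, p + 1):
        for cols in itertools.combinations(range(p), r):
            rss, k = fit_ols(X[:, list(cols)], y)
            subsets.append(cols)
            aics.append(aic(rss, n, k))
    aics = np.array(aics)
    # 2) Empirical support per variable: summed Akaike weights of the
    #    subsets containing it (one plausible support measure).
    w = np.exp(-0.5 * (aics - aics.min()))
    w /= w.sum()
    support = np.zeros(p)
    for cols, wi in zip(subsets, w):
        support[list(cols)] += wi
    order = np.argsort(-support)  # most-supported variable first
    # 3) Nested sequence: model i holds the i most-supported variables;
    #    the comparatively small sequence is compared post hoc by AIC.
    sequence = []
    for i in range(1, p + 1):
        cols = order[:i].tolist()
        rss, k = fit_ols(X[:, cols], y)
        sequence.append((cols, aic(rss, n, k)))
    best = min(sequence, key=lambda m: m[1])
    return order.tolist(), sequence, best
```

Note that only p nested models are compared at the final step, which is the point of the method: post-hoc comparison stays quick even when the all-subsets stage fits 2^p − 1 models.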

  2. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Available download formats: zip (9619636 bytes)
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    This dataset is a collection of posts and comments made on Reddit's /r/datasets board, covering everything posted from the subreddit's inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.


    How to use the dataset

    To use this dataset, you need software that can open CSV files: a spreadsheet application or plain-text editor (for example, LibreOffice), or a data-analysis environment such as Python or R.

    Once you have suitable software installed, open the Reddit Dataset folder and open the the-reddit-dataset-dataset-posts.csv file.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on post titles to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes.
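The average-score comparison suggested above can be sketched with pandas. This is a hypothetical helper, assuming the posts file uses the sentiment and score columns listed in this description:

```python
import pandas as pd

def score_by_sentiment(posts: pd.DataFrame) -> pd.DataFrame:
    # Mean score and post count per sentiment label, using the
    # "sentiment" and "score" columns described in this listing.
    grouped = posts.groupby("sentiment")["score"]
    return pd.DataFrame({"mean_score": grouped.mean(),
                         "n_posts": grouped.size()})

# Typical usage on the posts file shipped with the dataset:
# posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
# print(score_by_sentiment(posts))
```

The same groupby pattern works for per-subreddit averages by grouping on the subreddit.name column instead.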

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiment of posts and comments on Reddit's /r/datasets board


    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name | Description |
    |:---------------|:---------------------------------------------------|
    | type | The type of post. (String) |
    | subreddit.name | The name of the subreddit. (String) |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
    | created_utc | The time the post was created, in UTC. (Timestamp) |
    | permalink | The permalink for the post. (String) |
    | body | The body of the post. (String) |
    | sentiment | The sentiment of the post. (String) |
    | score | The score of the post. (Integer) |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name | Description |
    |:---------------|:---------------------------------------------------|
    | type | The type of post. (String) |
    | subreddit.name | The name of the subreddit. (String) |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) |
    | created_utc | The time the post was created, in UTC. (Timestamp) |
    | permalink | The permalink for the post. (String) |
    | score | The score of the post. (Integer) |
    | domain | The domain of the post. (String) |
    | url | The URL of the post. (String) |
    | selftext | The self-text of the post. (String) |
    | title | The title of the post. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  3. Optimization and Evaluation Datasets for PiMine

    • fdr.uni-hamburg.de
    zip
    Updated Sep 11, 2023
    + more versions
    Cite
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias (2023). Optimization and Evaluation Datasets for PiMine [Dataset]. http://doi.org/10.25592/uhhfdm.13228
    Dataset provided by
    ZBH Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
    Authors
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]

    The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.

    Dataset descriptions:

    The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.

    The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.

    The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when the interfaces of apparently unrelated chains are available only. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains which can be used for alignment performance assessments.

    Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.

    References:

    [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
    [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
    [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
    [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.

  4. Benchmark Multi-Omics Datasets for Methods Comparison

    • zenodo.org
    • resodate.org
    • +1 more
    bin, zip
    Updated Nov 14, 2021
    Cite
    Gabriel Odom; Gabriel Odom; Lily Wang; Lily Wang (2021). Benchmark Multi-Omics Datasets for Methods Comparison [Dataset]. http://doi.org/10.5281/zenodo.5683002
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Odom; Lily Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pathway Multi-Omics Simulated Data

    These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".

    There are 100 sets of random modifications to centred and scaled copy number, gene expression, and proteomics data, saved as compressed data files for the R programming language. They are stored as 100 sub-folders (the first 50 in "pt1", the second 50 in "pt2") labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents:

    • "indicatorMatricesXXX_ls.RDS": a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment, where XXX is the simulation run label (001, 002, ...).
    • "CNV_partitionA_deltaB.RDS": the synthetically modified copy number variation data, where A is the proportion of genes in each gene set receiving the synthetic treatment (partition 1 is 20%, 2 is 40%, 3 is 60%, 4 is 80%) and B is the signal strength in units of standard deviations.
    • "RNAseq_partitionA_deltaB.RDS": the synthetically modified gene expression data (same parameter legend as CNV).
    • "Prot_partitionA_deltaB.RDS": the synthetically modified protein expression data (same parameter legend as CNV).

    Supplemental Files

    The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study, in Gene Matrix Transpose format. Scripts to create and analyze these data sets are available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement

  5. Simulation Data & R scripts for: "Introducing recurrent events analyses to...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 29, 2024
    Cite
    Ferry, Nicolas (2024). Simulation Data & R scripts for: "Introducing recurrent events analyses to assess species interactions based on camera trap data: a comparison with time-to-first-event approaches" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11085005
    Dataset provided by
    Department of National Park Monitoring and Animal Management, Bavarian Forest National Park
    Authors
    Ferry, Nicolas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Files descriptions:

    All csv files contain results from the different models (PAMM, AARs, linear models, MRPPs) for each iteration of the simulation, one row per iteration. "results_perfect_detection.csv" contains the results from the first simulation part with all observations; "results_imperfect_detection.csv" contains the results from the first simulation part with randomly thinned observations to mimic imperfect detection.

    • ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
    • PAMM30: p-value of the PAMM run on the 30-day survey.
    • PAMM7: p-value of the PAMM run on the 7-day survey.
    • AAR1: Avoidance-Attraction Ratio AB/BA.
    • AAR2: Avoidance-Attraction Ratio BAB/BB.
    • Harmsen_P: p-value from the linear model with the Species1*Species2 interaction (Harmsen et al. 2009).
    • Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
    • Karanth_permA / Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) against the randomized median distribution, permuting on species A and B respectively (Karanth et al. 2017).
    • MurphyAB_permA / MurphyAB_permB: rank of the observed AB interval duration median against the randomized median distribution, permuting on species A and B respectively (Murphy et al. 2021).
    • MurphyBA_permA / MurphyBA_permB: rank of the observed BA interval duration median against the randomized median distribution, permuting on species A and B respectively (Murphy et al. 2021).

    "results_int_dir_perf_det.csv" contains the results from the second simulation part with all observations; "results_int_dir_imperf_det.csv" contains the results from the second simulation part with randomly thinned observations to mimic imperfect detection.

    • ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
    • p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of A on B.
    • p_pamm7_BA: p-value of the PAMM run on the 7-day survey testing for the effect of B on A.
    • AAR1: Avoidance-Attraction Ratio AB/BA.
    • AAR2_BAB: Avoidance-Attraction Ratio BAB/BB.
    • AAR2_ABA: Avoidance-Attraction Ratio ABA/AA.
    • Harmsen_P: p-value from the linear model with the Species1*Species2 interaction (Harmsen et al. 2009).
    • Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
    • Karanth_permA / Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) against the randomized median distribution, permuting on species A and B respectively (Karanth et al. 2017).
    • MurphyAB_permA / MurphyAB_permB: rank of the observed AB interval duration median against the randomized median distribution, permuting on species A and B respectively (Murphy et al. 2021).
    • MurphyBA_permA / MurphyBA_permB: rank of the observed BA interval duration median against the randomized median distribution, permuting on species A and B respectively (Murphy et al. 2021).
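A typical first summary of these per-iteration result files is each method's rejection rate, i.e. the share of iterations with a p-value below a threshold. A minimal pandas sketch, assuming the p-value column names from the first-part results files described above (PAMM30, PAMM7, Harmsen_P, Niedballa_P):

```python
import pandas as pd

def rejection_rates(results: pd.DataFrame, alpha: float = 0.05) -> pd.Series:
    # Proportion of simulation iterations (rows) where each method's
    # p-value falls below alpha; column names follow the file description.
    p_cols = ["PAMM30", "PAMM7", "Harmsen_P", "Niedballa_P"]
    return (results[p_cols] < alpha).mean()

# Typical usage:
# print(rejection_rates(pd.read_csv("results_perfect_detection.csv")))
```

For the rank-based columns (Karanth/Murphy MRPPs), the analogous summary would compare each rank to its permutation-distribution cutoff rather than to alpha directly.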

    Script files description:

    • 1_Functions: R script containing the functions: the MRPP from Karanth et al. (2017), adapted here for time efficiency; the MRPP from Murphy et al. (2021), adapted here for time efficiency; a version of the ct_to_recurrent() function from the recurrent package adapted to process the simulation datasets in parallel; and the simulation() function used to simulate observations of two species with reciprocal effects on each other.
    • 2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization, and the random thinning mimicking imperfect detection.
    • 3_Approaches comparison: R script fitting the different models tested on the simulated data.
    • 3_1_Real data comparison: R script fitting the different models tested on the real data example from Murphy et al. (2021).
    • 4_Graphs: R script containing the code for plotting results from the simulation part and appendices.
    • 5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the Karanth et al. (2017) and Murphy et al. (2021) code, the version adapted for time efficiency, and a comparison verifying similarity of results.
    • 5_2_Appendix - Multi-response procedure permutation difference: R script testing for differences between the MRPP approaches according to the species on which permutations are done.

  6. Data from: Supplementary tables: MetaFetcheR: An R package for complete...

    • researchdata.se
    Updated Jun 24, 2024
    + more versions
    Cite
    Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti (2024). Supplementary tables:MetaFetcheR: An R package for complete mapping of small compound data [Dataset]. http://doi.org/10.57804/7sf1-fw75
    Available download formats: (78625), (728116)
    Dataset provided by
    Uppsala University
    Authors
    Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti
    Description

    The dataset includes a PDF file containing the results and an Excel file with the following tables:

    • Table S1: Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al.
    • Table S2: Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al.
    • Table S3: Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool using Diamanti et al.
    • Table S4: Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool for Priolo et al.
    • Table S5: Data quality test results for running 100 iterations on the HMDB database.
    • Table S6: Data quality test results for running 100 iterations on the KEGG database.
    • Table S7: Data quality test results for running 100 iterations on the ChEBI database.
    • Table S8: Data quality test results for running 100 iterations on the PubChem database.
    • Table S9: Data quality test results for running 100 iterations on the LIPID MAPS database.
    • Table S10: The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al.
    • Table S11: An example of an input matrix for MetaFetcheR.
    • Table S12: Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al.
    • Table S13: Data set from Diamanti et al.
    • Table S14: Data set from Priolo et al.
    • Table S15: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al.
    • Table S16: Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al.
    • Table S17: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
    • Table S18: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.

    (See the "index" tab in the Excel file for more information.)

    Small-compound databases contain a large amount of information for metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Lack of preventive establishment of means of data access at the infant stages of a project might lead to mislabelled compounds, reduced statistical power and large delays in delivery of results.

    We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies and covers a variety of use-cases of data fetching. We showed that the performance of MetaFetcheR was superior to existing approaches and databases by benchmarking the performance of the algorithm in three independent case studies based on two published datasets.

    The dataset was originally published in DiVA and moved to SND in 2024.

  7. Data from: R Manual for QCA

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 17, 2023
    Cite
    Mello, Patrick A. (2023). R Manual for QCA [Dataset]. http://doi.org/10.7910/DVN/KYF7VJ
    Dataset provided by
    Harvard Dataverse
    Authors
    Mello, Patrick A.
    Description

    The R Manual for QCA consists of a PDF file that describes all the steps and code needed to prepare and conduct a Qualitative Comparative Analysis (QCA) study in R, complemented by an R script that can be customized as needed. The dataset further includes two files with sample data, for the set-theoretic analysis and the visualization of QCA results. The R Manual for QCA is the online appendix to "Qualitative Comparative Analysis: An Introduction to Research Design and Application", Georgetown University Press, 2021.

  8. San Francisco Airport Runway Use

    • kaggle.com
    zip
    Updated Jan 20, 2023
    Cite
    The Devastator (2023). San Francisco Airport Runway Use [Dataset]. https://www.kaggle.com/datasets/thedevastator/san-francisco-airport-runway-use
    Available download formats: zip (2199 bytes)
    Authors
    The Devastator
    Area covered
    San Francisco
    Description

    San Francisco Airport Runway Use

    Late Night Departure Preferences

    By City of San Francisco [source]

    About this dataset

    This dataset explores the late-night departure runways used by aircraft at San Francisco International Airport (SFO). From 1:00 a.m. to 6:00 a.m., when safety and weather conditions permit, aircraft are directed to runways 10L/R, 01L/R or 28L/R with an immediate right turn, following over-water departure procedures that route aircraft over the bay rather than over the surrounding residential communities, in order to reduce noise. The data are broken down by runway, month, and year of departure, along with the percentage of each month's total departures from each runway, giving a comprehensive look at SFO's preferential late-night runway use.


    How to use the dataset

    This dataset can be used to analyze late-night departures from San Francisco Airport and to study the impact of runway usage on air and noise pollution in residential communities. It contains information about departures from each runway (01L/R, 10L/R, 19L/R and 28L/R) at San Francisco Airport for a given year and month. By studying the percentage of total departures by runway, we can understand which runways aircraft use most during late-night hours.

    To use this dataset, first become familiar with the column names, such as Year, Month, 01L/R (number of departures from runway 01L/R), and 01L/R Percent of Departures (percentage of departures from runway 01L/R). It also helps to be familiar with terms such as departure and late-night, which are used prominently in this dataset.

    Once familiar with these details, you can explore the data for insights into how specific runways are used for late-night flight operations at San Francisco Airport, and note any patterns or trends that emerge across multiple months or years. Additionally, by comparing percentages between runways, you can measure which runways are preferred during periods of heavier traffic, such as holidays or summer months when residents travel more often.
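The runway comparison described above takes only a few lines of pandas. This is a hypothetical snippet, assuming the column names given in this listing's Columns section:

```python
import pandas as pd

def preferred_runways(df: pd.DataFrame) -> pd.Series:
    # For each Year/Month row, return the runway with the most
    # late-night departures, using the count columns from the CSV.
    runway_cols = ["01L/R", "10L/R", "19L/R", "28L/R"]
    return df.set_index(["Year", "Month"])[runway_cols].idxmax(axis=1)

# Typical usage:
# df = pd.read_csv("late-night-preferential-runway-use-1.csv")
# print(preferred_runways(df))
```

Switching `runway_cols` to the "Percent of Departures" columns gives the same comparison in relative terms, which is more robust when total traffic varies by month.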

    Research Ideas

    • To identify areas of the San Francisco Airport prone to noise pollution from aircraft and develop ways to limit it.
    • To analyze the impacts of changing departure runway preferences on noise pollution levels over residential communities near the airport.
    • To monitor seasonal trends in aircraft late night departures by runways, along with identifying peak hours for each runway, in order to inform flight controllers and develop improved flight control regulations and procedures at the San Francisco Airport

    Data Source

    License

    See the dataset description for more information.

    Columns

    File: late-night-preferential-runway-use-1.csv

    | Column name | Description |
    |:----------------------------|:---------------------------------------------------------|
    | Year | The year of the data. (Integer) |
    | Month | The month of the data. (String) |
    | 01L/R | The number of departures from runway 01L/R. (Integer) |
    | 01L/R Percent of Departures | The percentage of departures from runway 01L/R. (Float) |
    | 10L/R | The number of departures from runway 10L/R. (Integer) |
    | 10L/R Percent of Departures | The percentage of departures from runway 10L/R. (Float) |
    | 19L/R | The number of departures from runway 19L/R. (Integer) |
    | 19L/R Percent of Departures | The percentage of departures from runway 19L/R. (Float) |
    | 28L/R | The number of departures from runway 28L/R. (Integer) |
    | 28L/R Percent of Departures | The percentage of departures from runway 28L/R. (Float) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the City of San Francisco.

  9. Dataset belonging to Siebers et al. (2024) Adolescents' digital nightlife:...

    • uvaauas.figshare.com
    csv
    Updated Jul 29, 2024
    Cite
    T. Siebers; Ine Beyens; Susanne E. Baumgartner; Patti Valkenburg (2024). Dataset belonging to Siebers et al. (2024) Adolescents' digital nightlife: The comparative effects of day- and nighttime smartphone use on sleep quality [Dataset]. http://doi.org/10.21942/uva.26395903.v2
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    T. Siebers; Ine Beyens; Susanne E. Baumgartner; Patti Valkenburg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The four datasets 'phone', 'game', 'social', and 'video' are the processed datasets used as input files for the Mplus models (in .csv instead of .dat format). The dataset 'phone' contains all data related to the main analyses of daytime, pre-bedtime and post-bedtime smartphone use. The datasets 'game', 'social', and 'video' contain the data for the exploratory analyses of game app, social media app, and video player app use, respectively. The dataset 'timeframes' contains information about respondents' bedtime and wake-up time, which is required to calculate the three timeframes (daytime, pre-bedtime, and post-bedtime).

    The materials used, including the R and Mplus syntaxes (https://osf.io/tpj98/) and the preregistration of the current study (https://osf.io/kxw2h/), can be found on OSF. For more information, please contact the authors via t.siebers@uva.nl or info@project-awesome.nl.

  10. OAC2021-2 Ancillary data and link to Github

    • rdr.ucl.ac.uk
    zip
    Updated Dec 8, 2025
    Cite
    jakub wyszomerski (2025). OAC2021-2 Ancillary data and link to Github [Dataset]. http://doi.org/10.5522/04/28485338.v1
    Dataset provided by
    University College London
    Authors
    jakub wyszomerski
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Creation of the 2021/2 Output Area Classification [External Github Repo LINK]

    Scripts
    Metadata.R - creating reference tables and glossaries for Census data.
    Downloading_data.R - downloading 2011 and 2021 Census data from NOMIS using the nomisr package.
    Comparing_Censuses.R - data cleaning and amalgamation.
    Transforming_Census_data.R - data manipulation and transformation for the classification.
    NI.R - modelling 2021 data for Northern Ireland.
    Correlation.R - testing correlation between the variables.
    Pre-clustering.R - preparing data for clustering.
    Clustering.R - clustering of the data.
    Post-clustering.R - creating maps and plots of the cluster solution.
    Testing_clustering.R
    Clustergrams.ipynb - creating clustergrams in Python. (Credits: Prof Alex Singleton)
    Industry.R - loading Industry data.
    Industry_classification.R - creating a geodemographic classification with Industry variables.
    Graph_comparisons.R - comparing data with graphs.

    Data
    List of folders (subfolders & files) in the project:
    API - Census data downloaded and saved with use of the nomisr package.
    Clean - amalgamated data ready for the analysis.
    Raw_counts - datasets with raw counts.
    Percentages - datasets transformed into percentages.
    Transformed - datasets transformed with IHS (analysis-ready).
    Final_variables - datasets with OAC variables only.
    All_data_clustering - results of the clustering for all investigated datasets.
    Clustering - datasets with cluster assignment for the UK and centroids.
    Lookups - reference tables for 2011 and 2021 Census variables.
    NISRA 2021 - 2021 Census data at LGD level for Northern Ireland.
    Objects - R objects created and stored to ensure consistency of the results or to load big files.
    SIR - contingency tables on disability counts by age, used to calculate the Standardised Illness Ratio.
    shapefiles - folder containing shapefiles used for some of the calculations.

    Plots
    Bar_plots - comparison of clusters to the UK (as well as Supergroup and Group averages).
    Clustergrams - plots used to establish the number of clusters at each classification level.
    Maps

  11. Benchmarks datasets for cluster analysis

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
    Explore at:
    zip(608532 bytes)Available download formats
    Dataset updated
    Nov 15, 2023
    Authors
    Onthada Preedasawakul
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    25 Artificial Datasets

    The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.

    Cluster analysis is a popular machine learning technique for segmenting a dataset so that similar data points fall in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, follow the instructions provided in the R documentation.
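    Although the package above is written in R, the evaluation idea (scoring a clustering result against known classes) can be sketched in a few lines. The Python below is purely illustrative and is not the UniversalCVI implementation: it generates two Gaussian sub-groups like the artificial datasets described, clusters them with a minimal k-means, and scores the result with a simple Rand index.

```python
import random

random.seed(1)

# Two well-separated Gaussian "blobs", mimicking the artificial datasets
# described above (known sub-groups for testing centroid-based clustering).
data = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)] + \
       [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(50)]
truth = [0] * 50 + [1] * 50

def kmeans(points, k=2, iters=20):
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                              (p[1] - centroids[j][1]) ** 2)
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

def rand_index(a, b):
    # Fraction of point pairs on which the two labelings agree
    # (same-cluster vs. different-cluster); 1.0 means perfect agreement.
    n, agree = 0, 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            n += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / n

labels = kmeans(data)
score = rand_index(truth, labels)
```

    On clearly separated groups like these, the Rand index is close to 1; indices of this family are what a cluster-validity package automates across algorithms and cluster counts.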

    For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785

    All the datasets are also available on GitHub at https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git.


  12. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +2more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weights (SWi) and standardized beta (ß*), to compare their performance with the WiBB method in ranking predictor importance under various scenarios. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.

    Methods

    To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which uses the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference in the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, and 0.3, giving three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedures with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, relative sum of weights (SWi) and standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
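    The core of the simulation, predictors with preset correlations to the response, can be illustrated without rmvnorm. The Python sketch below is a simplified construction (not the data.simulation function from Galipaud et al.): each predictor is built as x_i = r_i*y + sqrt(1 - r_i^2)*noise, which gives corr(x_i, y) close to r_i when both are standard normal.

```python
import math
import random

random.seed(42)

n = 1000
target_r = [0.6, 0.4, 0.2, 0.0]  # preset correlations with the response

y = [random.gauss(0, 1) for _ in range(n)]
# x_i = r*y + sqrt(1 - r^2)*noise has correlation ~r with y when both are
# standard normal; cross-correlations among the x's are induced as r_i*r_j.
X = [[r * yi + math.sqrt(1 - r * r) * random.gauss(0, 1) for yi in y]
     for r in target_r]

def pearson(a, b):
    # Plain Pearson correlation coefficient.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    vb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (va * vb)

observed = [pearson(x, y) for x in X]
```

    With n = 1000, the observed correlations land close to the preset (0.6, 0.4, 0.2, 0.0) structure; any importance-ranking method can then be checked against this known ordering.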

  13. Compare fastICA/InfoMax ICA/PGICA accuracies.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 5, 2023
    Cite
    Shaojie Chen; Lei Huang; Huitong Qiu; Mary Beth Nebel; Stewart H. Mostofsky; James J. Pekar; Martin A. Lindquist; Ani Eloyan; Brian S. Caffo (2023). Compare fastICA/InfoMax ICA/PGICA accuracies. [Dataset]. http://doi.org/10.1371/journal.pone.0173496.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Shaojie Chen; Lei Huang; Huitong Qiu; Mary Beth Nebel; Stewart H. Mostofsky; James J. Pekar; Martin A. Lindquist; Ani Eloyan; Brian S. Caffo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compare fastICA/InfoMax ICA/PGICA accuracies.

  14. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 11, 2023
    + more versions
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate the different imputation strategies available in the literature, we established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
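    The mask-and-score strategy generalizes beyond proteomics. In the hypothetical Python sketch below, global-mean and column-mean imputation stand in for methods like BPCA, LLS, or RF (which are far more sophisticated), and all names and scales are invented: entries with a known ground truth are hidden, imputed two ways, and compared by RMSE.

```python
import math
import random

random.seed(7)

# Toy "protein x sample" matrix with per-sample (column) shifts, standing in
# for a quantification table; values and dimensions are illustrative only.
n_prot, n_samp = 200, 6
col_shift = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
full = [[10 + col_shift[j] + random.gauss(0, 1) for j in range(n_samp)]
        for i in range(n_prot)]

# Mask ~10% of entries to create "missing" values with a known ground truth.
mask = [(i, j) for i in range(n_prot) for j in range(n_samp)
        if random.random() < 0.1]
obs = [row[:] for row in full]
for i, j in mask:
    obs[i][j] = None

def impute_global_mean(m):
    vals = [v for row in m for v in row if v is not None]
    mean = sum(vals) / len(vals)
    return [[mean if v is None else v for v in row] for row in m]

def impute_column_mean(m):
    out = [row[:] for row in m]
    for j in range(len(m[0])):
        col = [row[j] for row in m if row[j] is not None]
        mean = sum(col) / len(col)
        for row in out:
            if row[j] is None:
                row[j] = mean
    return out

def rmse(imputed):
    # Error only over the masked entries, where the truth is known.
    return math.sqrt(sum((imputed[i][j] - full[i][j]) ** 2
                         for i, j in mask) / len(mask))

err_global = rmse(impute_global_mean(obs))
err_column = rmse(impute_column_mean(obs))
```

    Because the columns are systematically shifted, column-mean imputation recovers the hidden values with lower RMSE than a single global mean, the same kind of comparison the study runs between local- and global-structure methods.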

  15. Satellite Electric Vehicle Dataset (TESLA, LUCID, RIVIAN)

    • datarade.ai
    .csv
    Updated Jan 21, 2023
    Cite
    Space Know (2023). Satellite Electric Vehicle Dataset (TESLA,LUCID, RIVIAN [Dataset]. https://datarade.ai/data-products/satellite-electric-vehicle-dataset-tesla-lucid-rivian-space-know
    Explore at:
    .csvAvailable download formats
    Dataset updated
    Jan 21, 2023
    Dataset authored and provided by
    Space Know
    Area covered
    China, United States of America
    Description

    SpaceKnow uses satellite synthetic aperture radar (SAR) data to capture activity at electric vehicle and automotive factories.

    Data is updated daily, has an average lag of 4-6 days, and history back to 2017.

    The insights provide level and change data monitoring the area, in square meters, covered by assembled light vehicles.

    We offer 3 delivery options: CSV, API, and Insights Dashboard

    Available companies:
    • Rivian (NASDAQ: RIVN): indices for employee parking, logistics, logistic centers, product distribution & product in the US (see the use-case write-up below)
    • Tesla (NASDAQ: TSLA): indices for product, logistics & employee parking for Fremont, Nevada, Shanghai, Texas, Berlin, and at the global level
    • Lucid Motors (NASDAQ: LCID): indices for employee parking, logistics & product in the US

    Why get SpaceKnow's EV datasets?

    Monitor the company’s business activity: Near-real-time insights into the business activities of Rivian allow users to better understand and anticipate the company’s performance.

    Assess Risk: Use satellite activity data to assess the risks associated with investing in the company.

    Types of Indices Available

    The Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects, in square meters. There are two types of CFI indices. CFI-R provides level data: how many square meters are covered by metallic objects (for example, assembled cars). CFI-S provides change data: how many square meters have changed at the monitored locations between two consecutive satellite images.

    How to interpret the data

    SpaceKnow indices can be compared with related economic indicators or KPIs. If the economic indicator is reported monthly, perform a 30-day rolling sum and pick the last day of each month to compare with the indicator; each data point then reflects approximately the sum for that month. If the indicator is reported quarterly, perform a 90-day rolling sum and pick the last day of each quarter; each data point then reflects approximately the sum for that quarter.
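    The rolling-sum recipe above can be written out directly. In this sketch the daily index values are made up for illustration; a real series would come from the SpaceKnow feed:

```python
from datetime import date, timedelta

# Hypothetical daily index values (e.g., a CFI-R level series).
start = date(2023, 1, 1)
days = [start + timedelta(d) for d in range(90)]   # Jan 1 - Mar 31
index = [100.0 + d * 0.5 for d in range(90)]       # made-up daily readings

# 30-day rolling sum, defined once 30 observations are available.
rolling = {days[i]: sum(index[i - 29:i + 1]) for i in range(29, len(days))}

# To compare with a monthly indicator, keep only the last day of each month.
monthly = {d: v for d, v in rolling.items()
           if (d + timedelta(1)).month != d.month}
```

    Each retained point approximates that month's total; the quarterly case is identical with a 90-day window and quarter-end dates.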

    Product index

    This index monitors the area covered by manufactured cars. The larger the area covered by the assembled cars, the larger and faster the production of a particular facility. The index rises as production increases.

    Product distribution index

    This index monitors the area covered by assembled cars that are ready for distribution. The index covers locations in the Rivian factory. The distribution is done via trucks and trains.

    Employee parking index

    Like the previous index, this one indicates the area covered by cars, but those that belong to factory employees. This index is a good indicator of factory construction, closures, and capacity utilization. The index rises as more employees work in the factory.

    Logistics index

    The index monitors the movement of materials supply trucks in particular car factories.

    Logistics Centers index

    The index monitors the movement of supply trucks in warehouses.

    Where the data comes from

    SpaceKnow brings you an information advantage by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company's infrastructure searches for and downloads new imagery every day, and computations on the data are completed within 24 hours.

    In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the EV industry with just a 4-6 day lag, on average.

    The EV data help you to estimate the performance of the EV sector and the business activity of the selected companies.

    The backbone of SpaceKnow’s high-quality data is the locations from which data is extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.

    Each individual location is precisely defined so that the resulting data does not contain noise such as surrounding traffic or changing vegetation with the season.

    We use radar imagery and our own algorithms, so the final indices are not devalued by weather conditions such as rain or heavy clouds.

    → Reach out to get a free trial

    Use Case - Rivian:

    SpaceKnow uses the quarterly production and delivery data of Rivian as a benchmark. Rivian targeted production of 25,000 cars in 2022. To achieve this target, the company had to increase production by 45% by producing 10,683 cars in Q4. However, actual Q4 production was 10,020 cars, so the target was narrowly missed, with total production reaching 24,337 cars for FY22.

    SpaceKnow indices help us to observe the company’s operations, and we are able to monitor if the company is set to meet its forecasts or not. We deliver five different indices for Rivian, and these indices observe logistic centers, employee parking lot, logistics, product, and prod...

  16. Data from: Data and code from: Identification of a key target for...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Dec 2, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Identification of a key target for elimination of nitrous oxide, a major greenhouse gas [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-identification-of-a-key-target-for-elimination-of-nitrous-oxide-a-major-c072f
    Explore at:
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service: https://www.ars.usda.gov/
    Description

    Note: This dataset is superseded by https://doi.org/10.15482/USDA.ADC/30210112
    Note: Data files will be made available upon manuscript publication

    This dataset contains all code and data needed to reproduce the analyses in the manuscript: "Identification of a key target for elimination of nitrous oxide, a major greenhouse gas." Blake A. Oakley (1), Trevor Mitchell (2), Quentin D. Read (3), Garrett Hibbs (1), Scott E. Gold (2), Anthony E. Glenn (2). (1) Department of Plant Pathology, University of Georgia, Athens, GA, USA. (2) Toxicology and Mycotoxin Research Unit, U.S. National Poultry Research Center, USDA Agricultural Research Service, Athens, GA, USA. (3) Southeast Area, USDA Agricultural Research Service, Raleigh, NC, USA. The citation will be updated upon acceptance of the manuscript.

    Brief description of study aims

    Denitrification is a chemical process that releases nitrous oxide (N2O), a potent greenhouse gas. The NOR1 gene is part of the denitrification pathway in Fusarium. Three experiments were conducted for this study. (1) The N2O comparative experiment compares denitrification rates, as measured by N2O production, of a variety of Fusarium spp. strains with and without the NOR1 gene. (2) The N2O substrate experiment compares denitrification rates of selected strains on different growth media (substrates). For parts 1 and 2, linear models are fit comparing N2O production between strains and/or substrates. (3) The Bioscreen growth assay tests whether there is a pleiotropic effect of the NOR1 gene. In this portion of the analysis, growth curves are fit to assess differences in growth rate and carrying capacity between selected strains with and without the NOR1 gene.

    Code

    All code is included in a .zip archive generated from a private git repository on 2022-10-13 and archived as part of this dataset. The code is contained in R scripts and RMarkdown notebooks. There are two components to the analysis: the denitrification analysis (parts 1 and 2 above) and the Bioscreen growth analysis (part 3). The scripts for each are listed and described below.

    Analysis of the denitrification experiments (parts 1 and 2):
    NOR1_denitrification_analysis.Rmd - RMarkdown notebook containing all the R code to analyze the experimental data comparing nitrous oxide emissions; covers the results of both the comparative study and the substrate study.
    n2o_subgroup_figures.R - R script to create additional figures using the output from the RMarkdown notebook.

    Analysis of the Bioscreen growth assay (part 3):
    bioscreen_analysis.Rmd - RMarkdown notebook containing all the R code needed to analyze the results of the Bioscreen assay comparing growth of the different strains. It can be run as is; however, the model-fitting portion was run on a high-performance computing cluster with the following scripts:
    bioscreen_fit_simpler.R - R script containing only the model-fitting portion of the Bioscreen analysis, fit using the Stan modeling language interfaced with R through the brms and cmdstanr packages.
    job_bssimple.sh - Job submission shell script used to submit the model-fitting R job to the USDA SciNet high-performance computing cluster.

    Additional scripts developed as part of the analysis, but not required to reproduce the analyses in the manuscript, are in the deprecated/ folder. Also note the files nor1-denitrification.Rproj (RStudio project file) and gtstyle.css (stylesheet for formatting the tables in the notebooks).

    Data

    Data required to run the analysis scripts are archived in this dataset, other than strain_lookup.csv, a lookup table of strain abbreviations and full names included in the code repository for convenience. The data should be placed in a folder or symbolic link called project within the unzipped code repository directory.
    N2O_data_2022-08-03/N2O_Comparative_Study_Trial_(n)_(date range).xlsx - Data from the N2O comparative study, where n is the trial number (1-3) and date range is the begin and end date of the trial.
    N2O_data_2022-08-03/Nitrogen_Substrate_Study_Trial_(n)_(date range).xlsx - Data from the N2O substrate study, with the same naming convention.
    Outliers_NOR1_2022/Bioscreen_NOR1_Fungal_Growth_Assay_(substrate)_(oxygen level)_Outliers_BAO_(date).xlsx - Raw Bioscreen data files in MS Excel format. Each file name encodes the substrate (minimal medium with nitrite, or nitrate and lysine), the oxygen level (hypoxia or normoxia), and the date of the run. The repository includes code to process these files, but the processed data are also included on Ag Data Commons, so running the data-processing portion of the code is not necessary.
    clean_data/bioscreen_clean_data.csv - Intermediate output file in CSV format generated by bioscreen_analysis.Rmd; it includes all the data from the Bioscreen assays in a clean, analysis-ready format.

  17. HUN Comparison of model variability and interannual variability

    • data.gov.au
    • researchdata.edu.au
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). HUN Comparison of model variability and interannual variability [Dataset]. https://data.gov.au/data/dataset/activity/1c0a19f9-98c2-4d92-956d-dd764aaa10f9
    Explore at:
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    These are the data summarising the modelled Hydrological Response Variable (HRV) variability versus climate interannual variability which has been used as an indicator of risk. For example, to understand the significance of the modelled increases in low-flow days, it is useful to look at them in the context of the interannual variability in low-flow days due to climate. In other words, are the modelled increases due to additional coal resource development within the natural range of variability of the longer-term flow regime, or are they potentially moving the system outside the range of hydrological variability it experiences under the current climate? The maximum increase in the number of low-flow days due to additional coal resource development relative to the interannual variability in low-flow days under the baseline has been adopted to put some context around the modelled changes. If the maximum change is small relative to the interannual variability due to climate (e.g. an increase of 3 days relative to a baseline range of 20 to 50 days), then the risk of impacts from the changes in low-flow days is likely to be low. If the maximum change is comparable to or greater than the interannual variability due to climate (e.g. an increase of 200 days relative to a baseline range of 20 to 50 days), then there is a greater risk of impact on the landscape classes and assets that rely on this water source. Here changes comparable to or greater than interannual variability are interpreted as presenting a risk. However, the change due to the additional coal resource development is additive, so even a 'less than interannual variability' change is not free from risk. Results of the interannual variability comparison should be viewed as indicators of risk.
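    Using the numbers from the example above, the screening logic reduces to a simple ratio (a sketch of the reasoning only, not the Programme's actual computation):

```python
# Compare the maximum modelled change against the interannual (baseline)
# range: ratios well below 1 suggest the change sits inside natural
# variability, while ratios near or above 1 flag a risk.
def risk_ratio(max_change_days, baseline_low_days, baseline_high_days):
    return max_change_days / (baseline_high_days - baseline_low_days)

low_risk = risk_ratio(3, 20, 50)     # small change vs. a 20-50 day range
high_risk = risk_ratio(200, 20, 50)  # change dwarfs the baseline range
```

    Even a low ratio is not risk-free, as the text notes, because the development-driven change is additive on top of natural variability.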

    Dataset History

    This dataset was generated using 1000 HRV simulations together with climate inputs. Ratios were calculated between the variability in the HRVs and the interannual variability attributable to climate. Results of the interannual variability comparison should be viewed as indicators of risk.

    Dataset Citation

    Bioregional Assessment Programme (2017) HUN Comparison of model variability and interannual variability. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/1c0a19f9-98c2-4d92-956d-dd764aaa10f9.

    Dataset Ancestors

  18. Data from: Data and code from: Comparison of in-field and laboratory-based...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Dec 2, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-comparison-of-in-field-and-laboratory-based-phenotyping-methods-for-eva
    Explore at:
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all data and R code, in RMarkdown notebook format, needed to reproduce all statistical analyses, figures, and tables in the manuscript: Jeffers, D., J. S. Smith, E. D. Womack, Q. D. Read, and G. L. Windham. 2024. Comparison of in-field and laboratory-based phenotyping methods for evaluation of aflatoxin accumulation in maize inbred lines. Plant Disease. (Citation to be updated upon final acceptance of the manuscript.)

    There is a critical need to quickly and reliably identify corn genotypes that are resistant to accumulating aflatoxin in their kernels. We compared three methods of determining how resistant different corn genotypes are to aflatoxin accumulation: a field-based assay (side-needle inoculation) and two different lab-based assays (wounding and non-wounding kernel screening assays; KSA). In this data object, we present the data from the lab and field assays, statistical models fit to the data, procedures for comparing model fit of different variants of the model, and model predictions. This includes how reliably each assay identifies resistant and susceptible check varieties, and how well correlated the assay methods are with one another. Statistical analyses are done using R software, including Bayesian models fit with Stan software.

    The following files are included:
    ksa_analysis_revised.Rmd - RMarkdown notebook with all code needed to reproduce the analyses and create the figures and tables in the manuscript.
    ksa_analysis_revised.html - HTML rendered output of the notebook.
    step1_ksa.tsv - Tab-separated data file with data from the lab assay. Columns include sample ID, genotype ID and entry code, year, treatment (wound or no-wound), subsample ID, replicate ID, aflatoxin concentration (ng/g), logarithm of aflatoxin concentration, and a column indicating genotypes that are susceptible or resistant checks.
    step1_ksa_field.tsv - Tab-separated data file with data from the field assay. Columns are similar to the lab assay data file, with an additional column for the row in which the sample was planted.
    ksa_cov_mod.tsv - Tab-separated data file with secondary-infection covariate data from the lab assay. Columns are similar to the lab assay data file, with columns for secondary Asp, Fus, and NI infections and their logarithms.
    brmfits.zip - Zip archive with 12 .rds files: model output files for the Bayesian mixed-effect models presented in the manuscript, fitted using the R function brm(). You may download these to reproduce the output without having to compile and run the models yourself.

    The three .tsv data files should be placed in a subdirectory called "data" in the same directory as the .Rmd notebook.

  19. Data from: A simple approach for maximizing the overlap of phylogenetic and...

    • figshare.mq.edu.au
    • borealisdata.ca
    • +5more
    bin
    Updated May 30, 2023
    Cite
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell (2023). Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data [Dataset]. http://doi.org/10.5061/dryad.5d3rq
    Explore at:
    binAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Macquarie University
    Authors
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient, such that taxon swaps can be quickly computed even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online databases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort, species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.
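The taxonomic half of the swap idea can be sketched simply: replace a tree tip that lacks trait data with an unused congeneric species that has data. The sketch below is a Python toy of that genus-level case only; phyndr's actual algorithms also use topological information and are implemented in R, so function and variable names here are illustrative assumptions.

```python
# Minimal genus-based sketch of taxon swapping: tips lacking data are
# replaced by a congener with data; exact matches are kept untouched.
def swap_tips(tree_tips, data_species):
    """Return a mapping from each resolvable tip to the species used for it."""
    def genus(name):
        return name.split("_")[0]

    # Index species with data by genus for quick candidate lookup.
    by_genus = {}
    for sp in data_species:
        by_genus.setdefault(genus(sp), []).append(sp)

    used = set(tree_tips) & set(data_species)  # reserved: exact matches
    mapping = {}
    for tip in tree_tips:
        if tip in data_species:
            mapping[tip] = tip               # exact match: keep the tip
            continue
        # Swap in an unused congener that has data, if one exists;
        # tips with no candidate are simply left out of the mapping.
        candidates = [sp for sp in by_genus.get(genus(tip), []) if sp not in used]
        if candidates:
            mapping[tip] = candidates[0]
            used.add(candidates[0])
    return mapping

tips = ["Quercus_robur", "Quercus_rubra", "Acer_saccharum"]
data = ["Quercus_robur", "Quercus_alba", "Acer_rubrum"]
print(swap_tips(tips, data))
# {'Quercus_robur': 'Quercus_robur', 'Quercus_rubra': 'Quercus_alba', 'Acer_saccharum': 'Acer_rubrum'}
```

Reserving exact matches before swapping ensures a species with data is never consumed as a stand-in when it is itself a tip in the tree.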

    Usage Notes

    Land plant taxonomic lookup table: This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus. File: plant_lookup.csv

  20. Data from: A preliminary comparison of a songbird's song repertoire size and...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Feb 2, 2023
    Cite
    Dustin Brewer; Adam Fudickar (2023). A preliminary comparison of a songbird's song repertoire size and other song measures between an urban and a rural site [Dataset]. http://doi.org/10.5061/dryad.h70rxwdkv
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Dryad
    Authors
    Dustin Brewer; Adam Fudickar
    Time period covered
    Jan 19, 2022
    Description

    Characteristics of birdsong, especially minimum frequency, have been shown to vary for some species between urban and rural populations and along urban-rural gradients. However, few urban-rural comparisons of song complexity—and none that we know of based on the number of distinct song types in repertoires—have occurred. Given the potential ability of song repertoire size to indicate bird condition, we primarily sought to determine if the number of distinct song types displayed by Song Sparrows (Melospiza melodia) varied between an urban and a rural site. We determined the song repertoire size of 24 individuals; 12 were at an urban (‘human-dominated’) site and 12 were at a rural (‘agricultural’) site. Then, we compared song repertoire size, note rate, and peak frequency between these sites. Song repertoire size and note rate did not vary between our human-dominated and agricultural sites. Peak frequency was greater at the agricultural site. Our finding that peak frequency was higher at the agri...
