Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. 
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
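As a rough illustration of how a distance-correlation-based dissimilarity between two variables could be computed, here is a minimal Python sketch. It is not the CluMix implementation (which is in R). The function names, the 0/1 mismatch distance for categorical variables, and the choice of 1 minus distance correlation as the dissimilarity are all assumptions made for this example.

```python
import math


def centered_distances(values, dist):
    """Return the double-centered matrix of pairwise distances."""
    n = len(values)
    d = [[dist(a, b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def numeric_dist(a, b):
    """Absolute difference for quantitative variables."""
    return abs(a - b)


def category_dist(a, b):
    """Assumption: 0/1 mismatch distance for unordered categories."""
    return 0.0 if a == b else 1.0


def distance_correlation(x, y, dist_x=numeric_dist, dist_y=numeric_dist):
    """Sample distance correlation between two variables of equal length."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = centered_distances(x, dist_x)
    b = centered_distances(y, dist_y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j]
                   for i in range(n) for j in range(n)) / n ** 2

    # Product of the two distance standard deviations.
    scale = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if scale == 0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(mean_product(a, b), 0.0) / scale)


def dissimilarity(x, y, **kwargs):
    """Hypothetical variable dissimilarity: 1 - distance correlation."""
    return 1.0 - distance_correlation(x, y, **kwargs)
```

Evaluating `dissimilarity` for every pair of variables would give a dissimilarity matrix that a hierarchical clustering routine or a heatmap ordering can consume.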
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The columns indicate whether hierarchical clustering is suitable (in contrast to partitioning), whether distance matrices can be retrieved, whether the functionality is available in R (to the authors’ knowledge) and whether ordinal variables are treated in a special way. Only clustering based on Gower’s similarity coefficient is applied throughout the manuscript.
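For readers unfamiliar with Gower's coefficient, the following Python sketch shows the basic idea for samples described by numeric and categorical variables. Each numeric variable contributes its range-scaled absolute difference, each categorical variable contributes 0 or 1, and the contributions are averaged. The function name and input layout are assumptions for this example. Missing values, variable weights and special handling of ordinal variables are deliberately left out.

```python
def gower_dissimilarity(rows, numeric):
    """Pairwise Gower dissimilarity matrix.

    rows    -- list of equal-length tuples, one per sample
    numeric -- list of booleans, True where the column is quantitative
    """
    n_cols = len(numeric)
    # Range of each numeric column, used to scale differences to [0, 1].
    ranges = []
    for k in range(n_cols):
        if numeric[k]:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)

    def pair(a, b):
        total = 0.0
        for k in range(n_cols):
            if numeric[k]:
                # A constant column contributes nothing.
                total += abs(a[k] - b[k]) / ranges[k] if ranges[k] else 0.0
            else:
                total += 0.0 if a[k] == b[k] else 1.0
        return total / n_cols

    return [[pair(a, b) for b in rows] for a in rows]


# Made-up samples: (expression value, cytogenetic group).
samples = [(1.0, "a"), (3.0, "a"), (5.0, "b")]
matrix = gower_dissimilarity(samples, numeric=[True, False])
```

Gower's similarity is one minus this dissimilarity.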
Stift_et_al_used_R_code: This package contains the R-script used for simulating genetic data of two (or more) populations of a species with multiple ploidy levels. The script is extensively annotated using comments. The script will also perform several clustering analyses on the produced genetic data files. For this, the programs Structure, Admixture, and Instruct should first be downloaded from their respective websites and placed in the R working directory. The package also contains the parameter files that are used for the Structure analysis; these should also be placed in the working directory.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The respective smallest BER is highlighted in bold.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The last four methods in the table are applied and compared throughout the manuscript.
The overarching effects and benefits of land management decisions, such as watershed restoration, are often not fully understood because the experimental design lacks a control. This can be addressed through the application of a paired watershed approach, allowing for comparison between treatment and control watersheds. To address this concern, we developed and applied a statistics-based hierarchical clustering analysis for watershed pairing within an experimental landscape consisting of numerous superficially structurally-similar sub-basins. Our three-step research approach is as follows: 1) We construct a comprehensive spatial database consisting of various biophysical, structural, and modeled hydrologic data for each watershed. 2) We apply a correlation analysis to reduce the dimensionality of the spatial datasets and select specific spatial variables using a mixed quantitative and qualitative approach. 3) We complete hierarchical clustering analyses to group watersheds based on their spatial properties. This data release consists of three primary products: 1) a vector shapefile, 2) an R software script, and 3) a Google Earth Engine (GEE) script. The vector shapefile displays the selected study sub-basins present within Smith Canyon Watershed. Within the vector shapefile, we included attribute information for each of the spatial variables included in the spatial database as well as the hierarchical cluster designation for the primary and secondary clusters. The R software script was used to complete the correlation analysis and hierarchical clustering. The Google Earth Engine (GEE) script was used to produce the mean Normalized Difference Vegetation Index (NDVI) image product.
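Steps 2 and 3 of the approach can be sketched in a few lines. The data release implements them in an R script. The Python below is only an illustrative stand-in, and the variable names, the 0.9 correlation cut-off, the standardization and the average-linkage rule are all assumptions for this example rather than details taken from the release.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Keep a variable only if |r| <= threshold with every variable kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def standardize(values):
    """Scale a variable to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd if sd else 0.0 for v in values]


def average_linkage(points, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        return sum(math.dist(points[i], points[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Made-up values for four sub-basins (hypothetical variables).
variables = {
    "slope": [1.0, 2.0, 3.0, 4.0],
    "slope_copy": [2.0, 4.0, 6.0, 8.0],   # redundant with "slope"
    "ndvi": [0.9, 0.1, 0.8, 0.2],
}
selected = drop_correlated(variables)
scaled = [standardize(variables[name]) for name in selected]
basins = list(zip(*scaled))               # one point per sub-basin
groups = average_linkage(basins, n_clusters=2)
```

Sub-basins that end up in the same group are candidates for a treatment/control pair. The qualitative part of the variable selection described above is not captured by the automatic filter.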
Please view the spreadsheet tab "ReadMe_tab" of the data file for usage notes.
https://spdx.org/licenses/CC0-1.0.html
Bayesian clustering methods have emerged as a popular tool for assessing hybridization using genetic markers. Simulation studies have shown these methods perform well under certain conditions; however, these methods have not been evaluated using empirical datasets with individuals of known ancestry. We evaluated the performance of two Bayesian clustering programs, BAPS and STRUCTURE, with genetic data from a reintroduced red wolf (Canis rufus) population in North Carolina, USA. Red wolves hybridize with coyotes (C. latrans), and a single hybridization event resulted in introgression of coyote genes into the red wolf population. A detailed pedigree has been reconstructed for the wild red wolf population that includes individuals of 50–100% red wolf ancestry, providing an ideal case study for evaluating the ability of these methods to estimate admixture. Using 17 microsatellite loci, we tested the programs using different training set compositions and varying numbers of loci. STRUCTURE was more likely than BAPS to detect an admixed genotype and correctly estimate an individual’s true ancestry composition. However, STRUCTURE was more likely to misclassify a pure individual as a hybrid. Both programs were outperformed by a maximum-likelihood-based test designed specifically for this system, which never misclassified a hybrid (50-75% red wolf) as a red wolf or vice versa. Both training set composition and the number of loci had an impact on accuracy but their relative importance varied depending on the program. Our findings demonstrate the importance of evaluating methods used for evaluating hybridization in the context of endangered species management.
https://spdx.org/licenses/CC0-1.0.html
Restriction site-Associated DNA sequencing (RADseq) has great potential for genome-wide systematics studies of non-model organisms. However, accurately assembling RADseq reads into orthologous loci remains a major challenge in the absence of a reference genome. Traditional assembly pipelines cluster putative orthologous sequences based on a user-defined clustering threshold. Because improper clustering of orthologs is expected to affect results in downstream analyses, it is crucial to design pipelines for empirically optimizing the clustering threshold. While this issue has been largely discussed from a population genomics perspective, it remains understudied in the context of phylogenomics and coalescent species delimitation. To address this issue, we generated RADseq assemblies of representatives of the amphibian genera Discoglossus, Rana, Lissotriton and Triturus using a wide range of clustering thresholds. Particularly, we studied the effects of the intra-sample Clustering Threshold (iCT) and between-sample Clustering Threshold (bCT) separately, as both are expected to differ in multi-species data sets. The obtained assemblies were used for downstream inference of concatenation-based phylogenies, and multi-species coalescent species trees and species delimitation. The results were evaluated in the light of a reference genome-wide phylogeny calculated from newly generated Hybrid-Enrichment markers, as well as extensive background knowledge on the species’ systematics. Overall, our analyses show that the inferred topologies and their resolution are resilient to changes of the iCT and bCT, regardless of the analytical method employed. Except for some extreme clustering thresholds, all assemblies yielded identical, well-supported inter-species relationships that were mostly congruent with those inferred from the reference Hybrid-Enrichment data set. Similarly, coalescent species delimitation was consistent among similarity threshold values. 
However, we identified a strong effect of the bCT on the branch lengths of concatenation and species trees, with higher bCTs yielding trees with shorter branches, which might be a pitfall for downstream inferences of evolutionary rates. Our results suggest that the choice of assembly parameters for RADseq data in the context of shallow phylogenomics might be less challenging than previously thought. Finally, we propose a pipeline for empirical optimization of the iCT and bCT, implemented in optiRADCT, a series of scripts readily usable for future RADseq studies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains genetic sequences obtained from Hybrid-Enrichment and RAD sequencing protocols of the amphibian genera Discoglossus, Lissotriton, Rana and Triturus, as well as phylogenetic trees inferred from the RADseq data. This data was generated for the manuscript "Exploring the impact of read clustering thresholds on RADseq-based systematics: an empirical example from European amphibians.", in which we tested the influence of the clustering threshold used to assemble RADseq data on downstream phylogenetic inferences. Details on the data generation and analyses can be found in the manuscript and related supplementary materials.
The repository is organised as follows:
-> Hybrid-Enrichment: alignments of the Hybrid-Enrichment markers in phylip/fasta format (with one subdirectory for each of the four datasets assembled: Discoglossus, Lissotriton, Rana, Triturus)
-> RADseq: assemblies and phylogenetic trees obtained from a RADseq protocol
--> Assemblies: RADseq assemblies (complete loci sequences and SNP matrices, spreadsheets with assembly metrics). Divided into "iCT" (assemblies produced with 23 different intra-sample Clustering Thresholds [iCT] and a fixed between-sample Clustering Threshold [bCT]) and "bCT" (assemblies produced with a fixed iCT and 23 different bCTs). Both iCT and bCT are further divided into four sub-directories corresponding to the four datasets: Discoglossus, Lissotriton, Rana, Triturus.
--> Trees: Phylogenetic trees inferred from the aforementioned assemblies. Divided into "iCT" (RAxML concatenation trees inferred from the assemblies with different iCTs) and "bCT" (RAxML concatenation trees and Tetrad species trees inferred from the assemblies with different bCTs).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of underlying subpopulations to account for unobserved heterogeneity in the population is a challenging statistical problem, mainly because no explicit information about the latent classes is available. Although latent class analysis via finite mixture models is often used successfully to probabilistically identify subpopulations in applications, it often fails with data for which such subpopulations exhibit high latency. Borrowing strength from readily accessible auxiliary classifiers, even when subject to misclassification, may yield improved results in such settings. We develop in this article a joint modeling approach that combines data from multiple sources, including observed characteristics that are often used alone for clustering and classification, as well as results based on imperfect surrogate classifiers, to better identify the latent classes for more accurate classification and prediction. We outline maximum likelihood estimation for the joint model using the EM algorithm, and we show empirically via simulations that our methodology yields better estimates of the underlying latent class distributions than those obtained by ignoring the auxiliary information, while providing joint assessments of the surrogate classifiers. The advantages are significant when there is high latency and the surrogate classifiers are at least moderately accurate. We use real diagnostic data on dry eye disease, for which no gold standard is available, to illustrate our methodology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
KEGG group clusters were only extracted from the differentially-regulated protein list if at least two proteins were present in each group with a compound probability of p≥0.05. To account for both KEGG enrichment (R: Observed expression frequency/Expected expression frequency) and compound probability (p) of clustered proteins, a hybrid score was used. Hybrid score = R * −log10(p).
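The hybrid score defined above is simple enough to compute directly. A minimal Python sketch follows. The function and argument names are hypothetical, and the input checks are an added assumption, not part of the original description.

```python
import math


def hybrid_score(observed_freq, expected_freq, p):
    """Hybrid score = R * -log10(p), with R = observed / expected frequency."""
    if expected_freq <= 0:
        raise ValueError("expected frequency must be positive")
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    enrichment = observed_freq / expected_freq  # R
    return enrichment * -math.log10(p)


# Made-up example: a KEGG group observed four times as often as expected,
# with a compound probability of 0.01.
score = hybrid_score(observed_freq=0.2, expected_freq=0.05, p=0.01)
```

Because of the logarithm, a tenfold smaller p adds one unit of R to the score, so a strongly enriched group with a moderate p can rank alongside a weakly enriched group with a very small p.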
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Understanding human use of public lands is essential for management of natural and cultural resources. However, compiling consistently reliable visitation data across large spatial and temporal scales and across different land managing entities is challenging. Cellular device locations have been demonstrated as a source to map human activity patterns and may offer a viable solution to overcome some of the challenges that traditional on-the-ground visitation counts face on public lands. Yet, large-scale applicability of human mobility data derived from cell phone device locations for estimating visitation counts to public lands remains unclear. This study aims to address this knowledge gap by examining the efficacy and limitations of using commercially available cellular data to estimate visitation to public lands. We used the United States’ National Park Service’s (NPS) 2018 and 2019 monthly visitor use counts as a ground-truth and developed visitation models using cellular device location-derived monthly visitor counts as a predictor variable. Other covariates, including park unit type, porousness, and park setting (i.e., urban vs. non-urban, iconic vs. local), were included in the model to examine the impact of park attributes on the relationship between NPS and cell phone-derived counts. We applied Pearson’s correlation and generalized linear mixed model with adjustment of month and accounting for potential clustering by the individual park units to evaluate the reliability of using cell data to estimate visitation counts. Of the 38 parks in our study, 20 parks had a correlation of greater than 0.8 between monthly NPS and cell data counts and 8 parks had a correlation of less than 0.5. Regression modeling showed that the cell data could explain a great amount of the variability (conditional R-squared = 0.96) of NPS counts. However, these relationships varied across parks, with better associations generally observed for iconic parks. 
While our study increased our confidence in using cell phone data to estimate visitation, we also became aware of some of the limitations and challenges which we present in the Discussion.
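The per-park correlation step described above can be illustrated with a short Python sketch. The record layout, the park names and the counts are made up for this example, and the generalized linear mixed model is not covered here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_by_park(records):
    """records: iterable of (park, nps_count, cell_count), one row per month."""
    by_park = {}
    for park, nps, cell in records:
        nps_counts, cell_counts = by_park.setdefault(park, ([], []))
        nps_counts.append(nps)
        cell_counts.append(cell)
    return {park: pearson(xs, ys) for park, (xs, ys) in by_park.items()}


# Hypothetical monthly counts for two parks.
monthly = [
    ("Park A", 100, 10), ("Park A", 200, 20), ("Park A", 300, 30),
    ("Park B", 100, 30), ("Park B", 200, 10), ("Park B", 300, 20),
]
r_by_park = correlation_by_park(monthly)
strong = [park for park, r in r_by_park.items() if r > 0.8]
weak = [park for park, r in r_by_park.items() if r < 0.5]
```

Tallying `strong` and `weak` across all parks gives the kind of summary reported in the abstract (parks with correlation above 0.8 and below 0.5).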
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview of data collection conditions and dataset partition.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of operating modes in the validation and test data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of operating modes in the training data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optimized hyperparameter configurations for baseline models.