Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R codes for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t- and the AD tests were implemented using the stats package and the adk package, respectively. Note that the updated package for implementing the AD test is kSamples. For the bootstrap selection and the egg-shaped plot, we modified the doBS and the importance igraph functions, respectively, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12 kb)
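A rough Python analogue of the class-comparison and class-prediction steps above (the original is R code built on the stats, adk/kSamples and e1071 packages): per-feature t- and k-sample Anderson-Darling tests via scipy, and a linear SVM via scikit-learn. Caliper matching and the bootstrap selection are not reproduced; the toy data and the feature count are assumptions for illustration.

```python
# Python analogue only -- not the authors' R code.
import numpy as np
from scipy.stats import ttest_ind, anderson_ksamp
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 100))   # toy data: 60 samples x 100 features
y = rng.integers(0, 2, size=60)      # two classes

# Class comparison: univariate t-test and k-sample Anderson-Darling test per feature
t_p = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue for j in range(X.shape[1])])
ad_p = np.array([anderson_ksamp([X[y == 0, j], X[y == 1, j]])[2]   # approximate p-value
                 for j in range(X.shape[1])])

# Class prediction: linear SVM (analogous to e1071::svm) on the top-ranked features.
# For an unbiased estimate the feature selection would belong inside the CV loop.
top = np.argsort(t_p)[:10]
scores = cross_val_score(SVC(kernel="linear"), X[:, top], y, cv=5)
print(scores.mean())
```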
Species and environmental data
This compiled (zip) file consists of 7 matrices of data: one species data matrix, with abundance observations per visited plot; and 6 environmental data matrices, consisting of a land cover classification (Class), simulated EnMAP and Landsat data (April and August), and a 6-time-step Landsat time series (January, March, May, June, July and September). All data are compiled to the 125 m radius plots, as described in the paper. File: Leitaoetal_Mapping beta diversity from space_Data.zip
1. Spatial patterns of community composition turnover (beta diversity) may be mapped through Generalised Dissimilarity Modelling (GDM). While remote sensing data are adequate to describe these patterns, the often high-dimensional nature of these data poses some analytical challenges, potentially resulting in loss of generality. This may hinder the use of such data for mapping and monitoring beta-diversity patterns.
2. This study presents Sparse Generalised Dissimilarity Modelling (SGDM), a methodological framework designed to improve the use of high-dimensional data to predict community turnover with GDM. SGDM is a two-stage approach: the environmental data are first transformed with a sparse canonical correlation analysis (SCCA), aimed at dealing with high-dimensional datasets, and the transformed data are then fitted with GDM. The SCCA penalisation parameters are chosen through a grid search procedure that optimises the predictive performance of a GDM fit on the resulting components. The proposed method was illustrated on a case study with a clear environmental gradient of shrub encroachment following cropland abandonment, and subsequent turnover in the bird communities. Bird community data, collected on 115 plots located along the described gradient, were used to fit composition dissimilarity as a function of several remote sensing datasets, including a time series of Landsat data as well as simulated EnMAP hyperspectral data.
3. The proposed approach always outperformed GDM models when fit on high-dimensional datasets. Its usage on low-dimensional data was not consistently advantageous. Models using high-dimensional data, on the other hand, always outperformed those using low-dimensional data, such as single-date multispectral imagery.
4. This approach improves the direct use of high-dimensional remote sensing data, such as time series or hyperspectral imagery, for community dissimilarity modelling, resulting in better-performing models. The good performance of models using high-dimensional datasets further highlights the relevance of dense time series and data from new and forthcoming satellite sensors for ecological applications such as mapping species beta diversity.
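A minimal sketch of the SCCA-plus-grid-search idea, under stated assumptions: the sparse canonical pair is computed with a simplified soft-thresholded power iteration, and, because GDM has no standard Python implementation, the held-out canonical correlation stands in for the GDM-based criterion the paper uses to choose the penalisation parameters. This is not the SGDM implementation from the study.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam_x, lam_y, n_iter=200, seed=0):
    """First sparse canonical pair via penalised power iteration (simplified sketch)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    C = Xc.T @ Yc
    v = np.random.default_rng(seed).standard_normal(Y.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam_x); u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, lam_y); v /= np.linalg.norm(v) + 1e-12
    return u, v   # sparse weights for the environmental (X) and species (Y) data

rng = np.random.default_rng(1)
X = rng.standard_normal((115, 300))   # toy: 115 plots x many spectral/temporal bands
Y = rng.standard_normal((115, 40))    # toy: 115 plots x species abundances
train, test = slice(0, 80), slice(80, 115)

# Grid search over the penalisation parameters (held-out correlation as stand-in criterion)
best, best_score = None, -np.inf
for lx in (0.5, 1.0, 2.0):
    for ly in (0.5, 1.0, 2.0):
        u, v = sparse_cca(X[train], Y[train], lx, ly)
        r = abs(np.corrcoef(X[test] @ u, Y[test] @ v)[0, 1])
        if r > best_score:
            best, best_score = (lx, ly), r
print("selected penalties:", best)
```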
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of genomic biomarkers is an important area of research in the context of drug discovery experiments. These experiments typically consist of several high-dimensional datasets that contain information about a set of drugs (compounds) under development. This type of data structure introduces the challenge of multi-source data integration. High-Performance Computing (HPC) has become an important tool for everyday research tasks. In the context of drug discovery, high-dimensional multi-source data need to be analyzed to identify the biological pathways related to the new set of drugs under development. Processing all the information contained in these datasets requires HPC techniques. Even though R packages for parallel computing are available, they are not optimized for a specific setting and data structure. In this article, we propose a new framework for data analysis with R on a computer cluster. The proposed data analysis workflow is applied to a multi-source high-dimensional drug discovery dataset and compared with a few existing R packages for parallel computing.
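The article's framework is R-based; the sketch below merely illustrates the same embarrassingly parallel per-gene analysis pattern in Python with the standard multiprocessing module, on toy data.

```python
# Illustration only -- not the article's R framework.
import numpy as np
from multiprocessing import Pool
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
expression = rng.standard_normal((5000, 30))   # toy: 5000 genes x 30 samples
response = rng.standard_normal(30)             # toy bioactivity readout

def test_gene(row):
    """Correlate one gene's expression profile with the bioactivity readout."""
    r, p = pearsonr(row, response)
    return p

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        pvalues = pool.map(test_gene, expression)   # one task per gene
    print(sum(p < 0.05 for p in pvalues), "genes pass the nominal 0.05 threshold")
```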
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. The raw data file used in this study.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called 'high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences.
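A minimal sketch (not the authors' implementation) of the idea: the distance between group mean phenotype vectors is standardised against its permutation distribution, so the effect size is not constrained by the number of variables. The data, group sizes and injected dimorphism signal below are toy assumptions.

```python
import numpy as np

def permutation_effect_size(X, groups, n_perm=999, seed=0):
    """Standardised multivariate effect size for a two-group comparison."""
    rng = np.random.default_rng(seed)
    g = np.asarray(groups)
    obs = np.linalg.norm(X[g == 0].mean(0) - X[g == 1].mean(0))
    perm = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(g)
        perm[i] = np.linalg.norm(X[p == 0].mean(0) - X[p == 1].mean(0))
    z = (obs - perm.mean()) / perm.std(ddof=1)        # permutation-standardised effect size
    pval = (np.sum(perm >= obs) + 1) / (n_perm + 1)   # one-sided permutation p-value
    return z, pval

rng = np.random.default_rng(1)
shapes = rng.standard_normal((40, 60))   # toy: 40 specimens x 60 landmark coordinates
sex = np.repeat([0, 1], 20)
shapes[sex == 1] += 0.3                  # inject a small dimorphism signal
print(permutation_effect_size(shapes, sex))
```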
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).
The scripts contained herein are: SOAP-Component-Analysis.py, PCA-Analysis.py, OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py, OnionClustering-plot.py, and UMAP.py.
To reproduce the data of this work, start from SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components of interest; then calculate the PCA with PCA-Analysis.py and apply the clustering that suits your needs (OnionClustering-1d.py, OnionClustering-2d.py, or Hierarchical-Clustering.py). Further modifications of the Onion plot can be made with OnionClustering-plot.py. UMAP embeddings can be computed with UMAP.py.
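A minimal Python analogue of the downstream steps (not the repository's scripts): PCA on a descriptor matrix followed by hierarchical clustering of the reduced components; the UMAP.py step would correspond to the optional umap-learn package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
descriptor = rng.standard_normal((500, 120))   # toy stand-in for a SOAP descriptor matrix

components = PCA(n_components=3).fit_transform(descriptor)              # cf. PCA-Analysis.py
labels = AgglomerativeClustering(n_clusters=4).fit_predict(components)  # cf. Hierarchical-Clustering.py
print(np.bincount(labels))
```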
Additional data contained herein are:
The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many economic and scientific problems involve the analysis of high-dimensional functional time series, where the number of functional variables p diverges as the number of serially dependent observations n increases. In this paper, we present a novel functional factor model for high-dimensional functional time series that maintains and makes use of the functional and dynamic structure to achieve great dimension reduction and find the latent factor structure. To estimate the number of functional factors and the factor loadings, we propose a fully functional estimation procedure based on an eigenanalysis for a nonnegative definite and symmetric matrix. Our proposal involves a weight matrix to improve the estimation efficiency and tackle the issue of heterogeneity, the rationale of which is illustrated by formulating the estimation from a novel regression perspective. Asymptotic properties of the proposed method are studied when p diverges at some polynomial rate as n increases. To provide a parsimonious model and enhance interpretability for near-zero factor loadings, we impose sparsity assumptions on the factor loading space and then develop a regularized estimation procedure with theoretical guarantees when p grows exponentially fast relative to n. Finally, we demonstrate the superiority of our proposed estimators over the alternatives/competitors through simulations and applications to a U.K. temperature data set and a Japanese mortality data set.
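A simplified illustration of the eigen-analysis step only: the paper's fully functional estimator eigen-decomposes a nonnegative definite matrix built from lagged autocovariances of functional data, whereas here a plain sample covariance and an eigenvalue-ratio rule stand in for that construction on toy multivariate data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 3
factors = rng.standard_normal((n, r))
loadings = rng.standard_normal((r, p))
Y = factors @ loadings + 0.5 * rng.standard_normal((n, p))   # toy factor model

S = np.cov(Y, rowvar=False)                       # symmetric, nonnegative definite
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
ratios = eigvals[:-1] / eigvals[1:]
k_hat = int(np.argmax(ratios[: p // 2])) + 1      # eigenvalue-ratio estimate of the factor number
print("estimated number of factors:", k_hat)
```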
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This technical note proposes a novel profile likelihood method for estimating the covariance parameters in exploratory factor analysis of high-dimensional Gaussian datasets with fewer observations than number of variables. An implicitly restarted Lanczos algorithm and a limited-memory quasi-Newton method are implemented to develop a matrix-free framework for likelihood maximization. Simulation results show that our method is substantially faster than the expectation-maximization solution without sacrificing accuracy. Our method is applied to fit factor models on data from suicide attempters, suicide ideators, and a control group. Supplementary materials for this article are available online.
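Not the article's estimator, but a sketch of its two matrix-free ingredients as exposed by scipy: Lanczos-type extreme eigenvalues computed through a LinearOperator (ARPACK's implicitly restarted Lanczos), and limited-memory quasi-Newton (L-BFGS-B) optimisation of a smooth stand-in objective.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 500                       # fewer observations than variables
X = rng.standard_normal((n, p))

# Leading eigenvalues of the p x p sample covariance without ever forming it:
# each matvec costs O(np) because S v = X^T (X v) / (n - 1).
S = LinearOperator((p, p), matvec=lambda v: X.T @ (X @ v) / (n - 1), dtype=float)
top_eigvals = eigsh(S, k=5, return_eigenvectors=False)
print("top eigenvalues:", np.sort(top_eigvals)[::-1])

# Limited-memory quasi-Newton step on a toy smooth objective (a stand-in for the
# profile likelihood of the factor model).
f = lambda theta: np.sum((theta - 1.0) ** 2) + 0.1 * np.sum(theta ** 4)
res = minimize(f, x0=np.zeros(10), method="L-BFGS-B")
print("minimiser:", res.x.round(2)[:3])
```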
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
In biomedicine, as well as in many other areas, experimental data consist of topographically ordered multidimensional data arrays or images.
In our collaboration, multi-parameter fluorescence microscopy data of immunofluorescently labeled lymphocytes have to be analysed. One experimental data set consists of n intensity images of the sample. As a result of a specific immunolabeling technique, different subsets of the lymphocytes appear with high intensity values in each image, expressing the existence of a specific cell surface protein. Because the positions of the cells are not affected by the labeling process, the n fluorescence signals of a cell can be traced through the image stack at constant coordinates.
The analysis of such stacks of images by an expert user is limited to two strategies in most laboratories: the images are analyzed one after the other, or up to three images are written into the RGB channels of a color map. Obviously, these techniques are not suitable for the analysis of higher-dimensional data.
Here, sonification of the stack of images allows the complete pattern of all markers to be perceived. The biomedical expert may probe specific cells on an auditory map and listen to their fluorescence patterns. The sonification was designed to satisfy specific requirements:
Such sonifications can be derived using several strategies. One is to play a tone for each marker whose fluorescence intensity exceeds a threshold; a rhythmic pattern thus emerges for each cell. Another strategy is to use frequency to distinguish markers: each cell is then a superposition of tones with different pitches, and a chord or tone cluster is the result. This leads to a harmonic presentation of each cell. Using both time and pitch, however, the result is a rhythmic sequence of tones and thus a specific melody for each cell. As our ability to memorize and recognize melodies or musical structures is better than our ability to recognize visually presented histograms, this yields a promising approach for the inspection of such data by an expert. An example sonification is presented here using only five-dimensional image data. However, the results remain good at much higher dimensionality; we tested the method with a stack of 12 images. The following demonstration uses only 5 markers. A map is rendered to show all cells for browsing.
[Audio examples: Cell 1 and Cell 2 show identical patterns (markers cd-02, cd-08); Cell 3 shows a similar pattern (marker cd-03; superposition); Cell 4 shows a very different pattern (markers cd-04, hla-dr).]
A specific advantage of this method is that it allows the high-dimensional data vectors to be examined without the need to change the viewing direction. However, there are many other ways to present such data acoustically, e.g. by using different timbre classes for the markers, such as percussive instruments, fluid sounds, musical instruments or the human voice. These alternatives and their applicability are currently being investigated.
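A minimal sketch of the time-plus-pitch strategy described in this entry, with hypothetical pitches, slot length and threshold: each marker gets a fixed time slot and pitch, and a tone is added whenever that marker's intensity exceeds the threshold, so every cell yields a short characteristic melody.

```python
import numpy as np
from scipy.io import wavfile

RATE = 22050
PITCHES = [262, 330, 392, 523, 659]   # Hz, one per marker (5 markers; assumed values)
SLOT = 0.25                           # seconds per marker time slot (assumed)

def sonify_cell(intensities, threshold=0.5):
    """Render one cell's marker pattern as a melody: one time slot per marker."""
    t = np.arange(int(RATE * SLOT)) / RATE
    melody = []
    for value, freq in zip(intensities, PITCHES):
        tone = np.sin(2 * np.pi * freq * t) if value > threshold else np.zeros_like(t)
        melody.append(tone * np.hanning(t.size))   # fade in/out to avoid clicks
    return np.concatenate(melody)

cell = [0.9, 0.1, 0.8, 0.7, 0.2]      # toy fluorescence pattern of one cell
signal = sonify_cell(cell)
wavfile.write("cell_melody.wav", RATE, (signal * 32767).astype(np.int16))
```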
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Network of 44 papers and 69 citation links related to "Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV: https://mdv.molbiol.ox.ac.uk), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published study “Systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system” (https://doi.org/10.1083/jcb.202205129). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "High-dimensional MRI data analysis using a large-scale manifold learning approach".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mapper and Ball Mapper are Topological Data Analysis tools used for exploring high dimensional point clouds and visualizing scalar-valued functions on those point clouds. Inspired by open questions in knot theory, new features are added to Ball Mapper that enable encoding of the structure, internal relations and symmetries of the point cloud. Moreover, the strengths of Mapper and Ball Mapper constructions are combined to create a tool for comparing high dimensional data descriptors of a single dataset. This new hybrid algorithm, Mapper on Ball Mapper, is applicable to high dimensional lens functions. As a proof of concept we include applications to knot and game theory.
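A simplified Ball Mapper sketch (not the authors' implementation): a greedy epsilon-net picks landmark points, every data point is assigned to all balls of radius eps that contain it, and two balls are joined when they share a point, giving a coarse combinatorial image of the point cloud.

```python
import numpy as np

def ball_mapper(X, eps, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    landmarks = []
    for i in order:                                   # greedy epsilon-net of landmarks
        if all(np.linalg.norm(X[i] - X[j]) > eps for j in landmarks):
            landmarks.append(i)
    # membership: which points fall into which ball
    balls = [set(np.flatnonzero(np.linalg.norm(X - X[l], axis=1) <= eps))
             for l in landmarks]
    # connect two balls whenever they share at least one point
    edges = {(a, b) for a in range(len(balls)) for b in range(a + 1, len(balls))
             if balls[a] & balls[b]}
    return landmarks, balls, edges

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
circle = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))
landmarks, balls, edges = ball_mapper(circle, eps=0.4)
print(len(landmarks), "balls,", len(edges), "edges")   # expect a cycle-like graph
```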
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper presents an improved robust Fisher discriminant method designed to handle high-dimensional data, particularly in the presence of outliers. Traditional Fisher discriminant methods are sensitive to outliers, which can significantly degrade their performance. To address this issue, we integrate the Minimum Regularized Covariance Determinant (MRCD) algorithm into the Fisher discriminant framework, resulting in the MRCD-Fisher discriminant model. The MRCD algorithm enhances robustness by regularizing the covariance matrix, making it suitable for high-dimensional data where the number of variables exceeds the number of observations. We conduct comparative experiments with other robust discriminant methods; the results demonstrate that the MRCD-Fisher discriminant outperforms these methods in terms of robustness and accuracy, especially when dealing with data contaminated by outliers. The MRCD-Fisher discriminant maintains high data cleanliness and computational stability, making it a reliable choice for high-dimensional data analysis. This study provides a valuable contribution to the field of robust statistical analysis, offering a practical solution for handling complex, outlier-prone datasets.
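A hedged sketch only: scikit-learn does not ship MRCD, so the (unregularised) MCD estimator plus an explicit ridge term stands in here for the regularised robust covariance, and the Fisher direction is then computed as w = Sigma^-1 (mu1 - mu0) with robust centres.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
n, p = 120, 10
X0 = rng.standard_normal((n, p))
X1 = rng.standard_normal((n, p)) + 1.0
X1[:5] += 15.0                                  # a few gross outliers in class 1

def robust_fisher_direction(X0, X1, ridge=0.1):
    """MCD-plus-ridge stand-in for an MRCD covariance, then the Fisher direction."""
    centred = np.vstack([X0 - np.median(X0, axis=0), X1 - np.median(X1, axis=0)])
    cov = MinCovDet(random_state=0).fit(centred).covariance_ + ridge * np.eye(X0.shape[1])
    return np.linalg.solve(cov, np.median(X1, axis=0) - np.median(X0, axis=0))

w = robust_fisher_direction(X0, X1)
scores0, scores1 = X0 @ w, X1 @ w
threshold = (np.median(scores0) + np.median(scores1)) / 2
accuracy = (np.mean(scores0 < threshold) + np.mean(scores1 > threshold)) / 2
print(f"balanced accuracy: {accuracy:.2f}")
```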
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Feature Extraction and Uncorrelated Discriminant Analysis for High-Dimensional Data".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data have to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just do not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
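A small sketch of the pipeline discussed above, on toy data with hypothetical parameters: k-means is fit on a subset of the features, the distances to the centroids are appended as new features, and a classifier is cross-validated with and without them; on data that do not cluster well the gain is typically negligible, as observed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 20))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(400) > 0).astype(int)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X[:, :5])  # cluster a feature subset
X_aug = np.hstack([X, km.transform(X[:, :5])])    # append distances to the 5 centroids

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("baseline :", cross_val_score(clf, X, y, cv=5).mean().round(3))
print("augmented:", cross_val_score(clf, X_aug, y, cv=5).mean().round(3))
```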
Using complementary forms of high dimensional profiling we define differences in CD45+ cells from syngeneic mouse tumors that either grow progressively or eventually reject following immune checkpoint therapy (ICT). Unbiased assessment of gene expression of tumor infiltrating cells by single cell RNA sequencing (scRNAseq) and longitudinal assessment of cellular protein expression by mass cytometry (CyTOF) revealed significant remodeling of both the lymphoid and myeloid intratumoral compartments. Surprisingly, we observed multiple subpopulations of monocytes/macrophages distinguishable by the combinatorial presence or absence of CD206, CX3CR1, CD1d and iNOS, markers of different macrophage activation states that change over time during ICT in a manner partially dependent on IFNγ. Both the CyTOF data and additional analysis of scRNAseq data support the hypothesis that macrophage polarization/activation results from effects on circulatory monocytes/early macrophages entering tumors rather than on pre-polarized mature intratumoral macrophages. Thus, ICT induces transcriptional and functional remodeling of both myeloid and lymphoid compartments. Droplet-based 3′ end massively parallel single-cell RNA sequencing was performed by encapsulating sorted live CD45+ tumor infiltrating cells into droplets and libraries were prepared using Chromium Single Cell 3′ Reagent Kits v1 according to manufacturer’s protocol (10x Genomics). The generated scRNAseq libraries were sequenced using an Illumina HiSeq2500.
MATLAB code + demo to reproduce results for "Sparse Principal Component Analysis with Preserved Sparsity". This code calculates the principal loading vectors for any given high-dimensional data matrix. The advantage of this method over existing sparse-PCA methods is that it can produce principal loading vectors with the same sparsity pattern for any number of principal components. Please see Readme.md for more information.
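For comparison only, and not the repository's MATLAB method: scikit-learn's SparsePCA produces sparse loading vectors, but unlike the "preserved sparsity" approach described above it does not enforce a common sparsity pattern across components.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))

spca = SparsePCA(n_components=3, alpha=2.0, random_state=0).fit(X)
loadings = spca.components_                      # shape (3, 50), with many exact zeros
print("nonzeros per component:", (loadings != 0).sum(axis=1))
```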
Clustering algorithms are at the basis of several technological applications, and are fueling the development of rapidly evolving fields such as machine learning. In the recent past, however, it has become apparent that they face challenges stemming from datasets that span more spatial dimensions. In fact, the best-performing clustering algorithms scale linearly in the number of points, but quadratically with respect to the local density of points. In this work, we introduce qCLUE, a quantum clustering algorithm that scales linearly in both the number of points and their density. qCLUE is inspired by CLUE, an algorithm developed to address the challenging time and memory budgets of Event Reconstruction (ER) in future High-Energy Physics experiments. As such, qCLUE marries decades of development with the quadratic speedup provided by quantum computers. We numerically test qCLUE in several scenarios, demonstrating its effectiveness and proving it to be a promising route to handle complex data analysis tasks – especially in high-dimensional datasets with high densities of points.