Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. Moreover, the unknown underlying distribution of the variables and the complicated dependencies among them, such as heteroscedasticity, increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. The requirement of sufficient reference samples further limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). More difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called 'high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation-to-variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively, such shape variables describe organismal shape, although analyzing each variable independently offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes and allow for stronger inferences.
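The standardized-effect-size idea lends itself to a short illustration. The sketch below is the generic permutation logic, not the authors' implementation; it assumes two groups stored as specimen-by-variable arrays and uses the distance between group means as the test statistic.

```python
import numpy as np

def permutation_effect_size(shape_a, shape_b, n_perm=999, seed=0):
    """Illustrative sketch: standardize the observed between-group distance in a
    multivariate (e.g., landmark-based shape) space against its permutation
    distribution, giving a Z-score-style effect size and a permutation p-value."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([shape_a, shape_b])   # rows = specimens, columns = shape variables
    n_a = len(shape_a)

    def between_group_distance(data):
        return np.linalg.norm(data[:n_a].mean(axis=0) - data[n_a:].mean(axis=0))

    observed = between_group_distance(pooled)
    null = np.array([between_group_distance(pooled[rng.permutation(len(pooled))])
                     for _ in range(n_perm)])
    z = (observed - null.mean()) / null.std(ddof=1)
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return z, p_value
```

Because the statistic is a distance between mean vectors, the procedure is unaffected by the number of shape variables, which is the property the abstract emphasizes.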
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).
The scripts contained herein are used as follows:
To reproduce the data of this work, start from SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components of interest, then calculate the PCA with PCA-Analysis.py and apply the clustering that suits your needs (OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py). Further modifications of the Onion plot can be made with the script OnionClustering-plot.py. UMAP embeddings can be calculated with UMAP.py.
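A minimal driver for that workflow might look like the sketch below. The script names come from this repository, but any command-line arguments (and whether the scripts take any) are assumptions; check each script before running it.

```python
import subprocess

# Hypothetical reproduction sketch: run the analysis scripts in the order
# described above. Adjust the list (e.g., OnionClustering-2d.py or
# Hierarchical-Clustering.py) and add arguments as the scripts require.
steps = [
    "SOAP-Component-Analysis.py",   # compute SOAP descriptors, select components
    "PCA-Analysis.py",              # PCA on the selected components
    "OnionClustering-1d.py",        # clustering (choose the variant you need)
    "OnionClustering-plot.py",      # optional: refine the Onion plots
    "UMAP.py",                      # optional: UMAP embedding
]

for script in steps:
    subprocess.run(["python", script], check=True)
```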
Additional data contained herein are:
The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R codes for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t- and AD tests were implemented using the stats package and the adk package, respectively. Note that the updated package for implementing the AD test is kSamples. For the bootstrap selection and the egg-shaped plot, we modified the doBS and the importance igraph functions, respectively, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task in many science and engineering fields, but it still remains an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample-size extension is developed using refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible when evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest can then be reconstructed. By introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, a flexible mixture distribution model is proposed, and its fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained straightforwardly from the recovered EVD. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
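For context, a generic form of the fractional moment of the extreme value $Z$ of the response (our notation, not necessarily the paper's) makes the high-dimensional integral explicit:

$$ M_\alpha \;=\; \mathbb{E}\!\left[Z^{\alpha}\right] \;=\; \int_{0}^{\infty} z^{\alpha} f_{Z}(z)\,\mathrm{d}z \;=\; \int_{\mathbb{R}^{n}} \big[z(\boldsymbol{\theta})\big]^{\alpha}\, p_{\boldsymbol{\Theta}}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}, $$

where $z(\boldsymbol{\theta}) = \max_{t\in[0,T]} \lvert h(\mathbf{X}(t;\boldsymbol{\theta}))\rvert$ is the extreme value of the response quantity of interest over the $n$-dimensional random input $\boldsymbol{\Theta}$ and $\alpha$ may be non-integer. This is the integral the RLSS-based parallel adaptive sampling scheme is designed to estimate efficiently.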
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
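For reference, the standard between-/total-sum-of-squares form of this statistic (the note's exact formulation may differ) is

$$ R^2 \;=\; \frac{\sum_{k=1}^{K} n_k\,\lVert \bar{\mathbf{x}}_k - \bar{\mathbf{x}} \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert^2} \;=\; 1 \;-\; \frac{\sum_{k=1}^{K}\sum_{i \in C_k} \lVert \mathbf{x}_i - \bar{\mathbf{x}}_k \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert^2}, $$

where $\bar{\mathbf{x}}_k$ is the mean of cluster $C_k$ (of size $n_k$) and $\bar{\mathbf{x}}$ is the overall mean. Stretching or projecting the data changes both sums, which is why the ratio can be inflated without any improvement in the clustering itself.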
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the information necessary for reproducing the tables in the paper "A machine learning approach to portfolio pricing and risk management for high dimensional problems".
The raw benchmark data can be found in the Zenodo dataset "Benchmark and training data for replicating financial and insurance examples" (https://zenodo.org/record/3837381). To the extent necessary, only summary data from that dataset is used in this dataset.
The dataset includes a Jupyter notebook file that explains what the different files contain and provides sample code to analyze the information and reproduce the tables.
https://www.immport.org/agreement
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has emphasized the importance and challenges of correctly interpreting antibody test results. Identification of positive and negative samples requires a classification strategy with low error rates, which is hard to achieve when the corresponding measurement values overlap. Additional uncertainty arises when classification schemes fail to account for complicated structure in the data. We address these problems through a mathematical framework that combines high-dimensional data modeling and optimal decision theory. Specifically, we show that appropriately increasing the dimension of data better separates positive and negative populations and reveals nuanced structure that can be described in terms of mathematical models. We combine these models with optimal decision theory to yield a classification scheme that better separates positive and negative samples relative to traditional methods such as confidence intervals (CIs) and receiver operating characteristics. We validate the usefulness of this approach in the context of a multiplex salivary SARS-CoV-2 immunoglobulin G assay dataset. This example illustrates how our analysis: (i) improves the assay accuracy (e.g., lowers classification errors by up to 42% compared to CI methods); (ii) reduces the number of indeterminate samples when an inconclusive class is permissible (e.g., by 40% compared to the original analysis of the example multiplex dataset); and (iii) decreases the number of antigens needed to classify samples. Our work showcases the power of mathematical modeling in diagnostic classification and highlights a method that can be adopted broadly in public health and clinical settings.
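To make the optimal-decision-theory component concrete, here is a minimal, hedged sketch of a cost-weighted Bayes decision rule. The densities, prevalence, and costs are illustrative assumptions; the paper's multidimensional antibody models and its indeterminate class are not reproduced here.

```python
from scipy.stats import norm

def classify(x, pdf_pos, pdf_neg, prevalence=0.5, cost_fp=1.0, cost_fn=1.0):
    """Generic illustration: given class-conditional densities for positive and
    negative samples and an assumed prevalence, call a measurement positive when
    the expected cost of a false negative exceeds that of a false positive."""
    weight_pos = cost_fn * prevalence * pdf_pos(x)
    weight_neg = cost_fp * (1.0 - prevalence) * pdf_neg(x)
    return "positive" if weight_pos > weight_neg else "negative"

# toy usage with one-dimensional Gaussian models (purely illustrative)
pdf_pos = lambda x: norm.pdf(x, loc=3.0, scale=1.0)
pdf_neg = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
print(classify(2.1, pdf_pos, pdf_neg))
```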
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please also see the latest version of the repository:
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time-consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) (link), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP traps, which revealed common discordance between mRNA and protein across the nervous system (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean squared error (×10⁻²) for data as in Table 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It has become increasingly common to collect high-dimensional binary response data; for example, with the emergence of new sampling techniques in ecology. In smaller dimensions, multivariate probit (MVP) models are routinely used for inferences. However, algorithms for fitting such models face issues in scaling up to high dimensions due to the intractability of the likelihood, which involves an integral over a multivariate normal distribution having no analytic form. Although a variety of algorithms have been proposed to approximate this intractable integral, these approaches are difficult to implement and/or inaccurate in high dimensions. Our main focus is on accommodating high-dimensional binary response data with a small-to-moderate number of covariates. We propose a two-stage approach for inference on model parameters while taking care of uncertainty propagation between the stages. We use the special structure of latent Gaussian models to reduce the highly expensive computation involved in joint parameter estimation by focusing inference on marginal distributions of model parameters. This essentially makes the method embarrassingly parallel for both stages. We illustrate performance in simulations and applications to joint species distribution modeling in ecology. Supplementary materials for this article are available online.
Understanding what shapes species phenotypes over macroevolutionary timescales from comparative data often requires studying the relationship between phenotypes and putative explanatory factors or testing for differences in phenotypes across species groups. In phyllostomid bats, for example, is mandible morphology associated with diet preferences? Performing such analyses depends upon reliable phylogenetic regression techniques and associated tests (e.g., phylogenetic generalized least squares, pGLS, and phylogenetic analyses of variance and covariance, pANOVA, pANCOVA). While these tools are well established for univariate data, their multivariate counterparts are lagging behind. This is particularly true for high-dimensional phenotypic data, such as morphometric data. Here we implement much-needed likelihood-based multivariate pGLS, pMANOVA and pMANCOVA, and use a recently developed penalized likelihood framework to extend their application to the difficult case when the number of traits p...
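For reference, the multivariate pGLS point estimate takes the standard generalized-least-squares form (our notation; the penalized-likelihood machinery described above adds more than this):

$$ \hat{\mathbf{B}} \;=\; \left(\mathbf{X}^{\top}\mathbf{C}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{C}^{-1}\mathbf{Y}, $$

where $\mathbf{Y}$ is the $n \times p$ matrix of traits for $n$ species, $\mathbf{X}$ is the design matrix of explanatory factors, and $\mathbf{C}$ is the $n \times n$ phylogenetic covariance matrix implied by the tree and the assumed evolutionary model.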
https://choosealicense.com/licenses/cdla-permissive-2.0/
🕊️ DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
🌐 Project Website | 📄 Read our paper
Updates 📅
2025-06-11: Added Llama 70B evaluations with ~5,700 MMLU examples across 100 different prompt variations (= 570K new predictions!), based on data from ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
2025-04-12: Added MMLU predictions from dozens of models including OpenAI, Qwen, Mistral, Gemini…
See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/DOVE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a multiscale weighted principal component regression (MWPCR) framework for the use of high-dimensional features with strong spatial structure (e.g., smoothness and correlation) to predict an outcome variable, such as disease status. This development is motivated by identifying imaging biomarkers that could potentially aid detection, diagnosis, assessment of prognosis, prediction of response to treatment, and monitoring of disease status, among many others. The MWPCR can be regarded as a novel integration of principal components analysis (PCA), kernel methods, and regression models. In MWPCR, we introduce various weight matrices to prewhiten high-dimensional feature vectors, perform matrix decomposition for both dimension reduction and feature extraction, and build a prediction model by using the extracted features. Examples of such weight matrices include an importance score weight matrix for the selection of individual features at each location and a spatial weight matrix for the incorporation of the spatial pattern of feature vectors. We integrate the importance score weights with the spatial weights to recover the low-dimensional structure of high-dimensional features. We demonstrate the utility of our methods through extensive simulations and real data analyses of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Supplementary materials for this article are available online.
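As a rough, generic illustration of the weight-then-decompose-then-regress pipeline described above (not the authors' MWPCR implementation; the weight matrices are taken as given inputs and the multiscale aspects are omitted):

```python
import numpy as np

def weighted_pcr_sketch(X, y, w_importance, W_spatial, n_components=10):
    """Generic sketch: apply importance and spatial weights to the features,
    extract principal components, and regress the outcome on the scores.
    X: n x p features, y: length-n outcome, w_importance: length-p weights,
    W_spatial: p x p spatial weight matrix (all hypothetical inputs)."""
    Xw = (X * w_importance) @ W_spatial           # weighted, spatially smoothed features
    Xw = Xw - Xw.mean(axis=0)                     # center before decomposition
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    scores = Xw @ Vt[:n_components].T             # low-dimensional extracted features
    Z = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # simple linear prediction model
    return beta, Vt[:n_components]
```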
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
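The abstract describes the algorithm precisely enough to sketch. Below is a hedged, simplified Python version; the score definition (distance to the k-th nearest neighbour), parameter names, and defaults are assumptions, not the authors' exact code.

```python
import numpy as np

def top_n_outliers(X, k=5, n=10, seed=0):
    """Randomized nested-loop outlier search with the simple pruning rule:
    keep the n highest scores seen so far, and prune a candidate as soon as
    its k-th nearest-neighbour distance drops below the current cutoff."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # random order drives the near-linear behaviour
    X = X[order]
    cutoff = 0.0                         # score of the weakest current top-n outlier
    top = []                             # list of (score, original_index)

    for i, x in enumerate(X):
        knn = np.full(k, np.inf)         # k nearest-neighbour distances found so far
        pruned = False
        for j, y in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(x - y)
            if d < knn[-1]:
                knn[-1] = d
                knn.sort()
                # pruning rule: this point can no longer be a top-n outlier
                if knn[-1] < cutoff:
                    pruned = True
                    break
        if not pruned:
            top.append((knn[-1], order[i]))
            top.sort(reverse=True)
            top = top[:n]
            if len(top) == n:
                cutoff = top[-1][0]
    return top
```

The worst case is still quadratic, but once the cutoff is nonzero most non-outliers are pruned after a handful of distance computations, which is the source of the near-linear average-case behaviour the abstract reports.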
https://spdx.org/licenses/MIT.html
Software package for data-driven identification of latent port-Hamiltonian systems.
Abstract
Conventional physics-based modeling techniques involve high effort, e.g. time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system-theoretic properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as the input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional manifold. In the resulting latent space, a pH system is identified by considering the unknown matrix entries as weights of a neural network. The matrices strictly satisfy the pH matrix properties through Cholesky factorizations. In a joint optimization process over the loss term, the pH matrices are adjusted to match the dynamics observed in the data, while defining a linear pH system in the latent space by construction. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper and a nonlinear pendulum example, as well as the high-dimensional model of a disc brake with linear thermoelastic behavior.
Features
This package implements neural networks that identify linear port-Hamiltonian systems from (potentially high-dimensional) data [1].
Autoencoders (AEs) for dimensionality reduction
pH layer to identify system matrices that fulfill the definition of a linear pH system
pHIN: identify a (parametric) low-dimensional port-Hamiltonian system directly
ApHIN: identify a (parametric) low-dimensional latent port-Hamiltonian system based on coordinate representations found using an autoencoder
Examples for the identification of linear pH systems from data
One-dimensional mass-spring-damper chain
Pendulum
Disc brake model
See documentation for more details.
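One concrete piece of the description above is how Cholesky-type factorizations can enforce the pH matrix properties by construction. The sketch below is a hypothetical NumPy illustration, not the package's actual layer API; the weight names, shapes, and regularization constant are assumptions.

```python
import numpy as np

def ph_matrices_from_weights(w_J, w_L, w_Q, n):
    """Map unconstrained weight vectors (each of length n*n) to matrices that
    satisfy linear port-Hamiltonian structure by construction:
      J = A - A^T          -> skew-symmetric interconnection matrix
      R = L L^T            -> symmetric positive semidefinite dissipation matrix
      Q = M M^T + eps * I  -> symmetric positive definite energy matrix"""
    A = w_J.reshape(n, n)
    J = A - A.T
    L = np.tril(w_L.reshape(n, n))
    R = L @ L.T
    M = np.tril(w_Q.reshape(n, n))
    Q = M @ M.T + 1e-6 * np.eye(n)
    return J, R, Q

# latent linear pH dynamics: dz/dt = (J - R) Q z  (+ input terms, omitted here)
```

Because passivity and stability follow from these structural properties, any weight values produce a valid pH system, and the optimization only has to fit the observed dynamics.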
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bayes’ linear analysis and approximate Bayesian computation (ABC) are techniques commonly used in the Bayesian analysis of complex models. In this article, we connect these ideas by demonstrating that regression-adjustment ABC algorithms produce samples for which first- and second-order moment summaries approximate adjusted expectation and variance for a Bayes’ linear analysis. This gives regression-adjustment methods a useful interpretation and role in exploratory analysis in high-dimensional problems. As a result, we propose a new method for combining high-dimensional, regression-adjustment ABC with lower-dimensional approaches (such as using Markov chain Monte Carlo for ABC). This method first obtains a rough estimate of the joint posterior via regression-adjustment ABC, and then estimates each univariate marginal posterior distribution separately in a lower-dimensional analysis. The marginal distributions of the initial estimate are then modified to equal the separately estimated marginals, thereby providing an improved estimate of the joint posterior. We illustrate this method with several examples. Supplementary materials for this article are available online.
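As background for the regression-adjustment step discussed above, here is a minimal sketch of Beaumont-style linear regression adjustment applied to accepted ABC simulations. Kernel weighting, local regression refinements, and the paper's Bayes' linear connection and marginal-modification step are omitted; the names are illustrative.

```python
import numpy as np

def regression_adjust(theta, summaries, s_obs):
    """Linear regression-adjustment ABC sketch: fit theta ~ summaries by least
    squares on the accepted simulations, then shift each accepted draw towards
    the value predicted at the observed summary s_obs:
        theta_adj_i = theta_i - beta^T (s_i - s_obs)."""
    S = np.column_stack([np.ones(len(summaries)), summaries])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(S, theta, rcond=None)            # (d_s + 1) x d_theta coefficients
    s0 = np.concatenate([[1.0], np.atleast_1d(s_obs)])
    return theta - (S - s0) @ beta                              # adjusted posterior draws
```

The first- and second-order moments of these adjusted draws are what the article relates to the adjusted expectation and variance of a Bayes' linear analysis.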
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: it is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable; a stability check along these lines is sketched below. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
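The stability check referenced above can be done generically as follows; this is an illustrative sketch, not the project's original code, and the cluster count and number of runs are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_stability(X, n_clusters=5, n_runs=10):
    """Rerun k-means with different seeds and compare the labelings pairwise;
    a consistently low adjusted Rand index suggests the data does not cluster
    well with this method, matching the run-to-run variability described above."""
    labelings = [
        KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
        for seed in range(n_runs)
    ]
    scores = [
        adjusted_rand_score(labelings[i], labelings[j])
        for i in range(n_runs) for j in range(i + 1, n_runs)
    ]
    return float(np.mean(scores))
```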
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Increases in the scale and complexity of behavioral data pose an increasing challenge for data analysis. A common strategy involves replacing entire behaviors with small numbers of handpicked, domain-specific features, but this approach suffers from several crucial limitations. For example, handpicked features may miss important dimensions of variability, and correlations among them complicate statistical testing. Here, by contrast, we apply the variational autoencoder (VAE), an unsupervised learning method, to learn features directly from data and quantify the vocal behavior of two model species: the laboratory mouse and the zebra finch. The VAE converges on a parsimonious representation that outperforms handpicked features on a variety of common analysis tasks, enables the measurement of moment-by-moment vocal variability on the timescale of tens of milliseconds in the zebra finch, provides strong evidence that mouse ultrasonic vocalizations do not cluster as is commonly believed, and captures the similarity of tutor and pupil birdsong with qualitatively higher fidelity than previous approaches. In all, we demonstrate the utility of modern unsupervised learning approaches to the quantification of complex and high-dimensional vocal behavior.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.