Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
row sparse (Sparse Model B)
Regression problems on massive data sets are ubiquitous in many application domains, including the Internet, earth and space sciences, and finance. Gaussian Process regression is a popular technique for modeling the input-output relations of a set of variables under the assumption that the weight vector has a Gaussian prior. However, it is challenging to apply Gaussian Process regression to large data sets, since prediction based on the learned model requires inversion of an n x n kernel matrix. Approximate sparse Gaussian Process solutions have been proposed for such problems. However, in almost all cases these solution techniques are agnostic to the input domain and do not preserve the similarity structure in the data. As a result, although these solutions sometimes provide excellent accuracy, the resulting models are not interpretable, and interpretable sparsity patterns are very important for many applications. We propose a new technique for sparse Gaussian Process regression that computes a parsimonious model while preserving the interpretability of the sparsity structure in the data. We discuss how the inverse kernel matrix used in Gaussian Process prediction carries valuable domain information and then adapt inverse covariance estimation from Gaussian graphical models to estimate the Gaussian kernel. We solve the resulting optimization problem using the alternating direction method of multipliers (ADMM), which is amenable to parallel computation. We demonstrate the performance of our method in terms of accuracy, scalability and interpretability on a climate data set.
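To illustrate the covariance-selection step described above, the sketch below shows a generic graphical-lasso-style ADMM iteration for estimating a sparse inverse of a kernel matrix in NumPy. This is only an illustration of that standard ADMM scheme, not the authors' released code; the kernel matrix K, the sparsity weight lam, the penalty parameter rho and the iteration count are assumed inputs.

    import numpy as np

    def sparse_inverse_kernel_admm(K, lam=0.1, rho=1.0, n_iter=100):
        """Estimate a sparse inverse of the kernel matrix K with ADMM
        (graphical-lasso-style splitting: Theta is the smooth block,
        Z the l1-penalized copy, U the scaled dual variable)."""
        p = K.shape[0]
        Z = np.eye(p)
        U = np.zeros((p, p))
        for _ in range(n_iter):
            # Theta update: minimize -logdet(Theta) + <K, Theta> + (rho/2)||Theta - Z + U||^2
            # via the eigendecomposition of rho*(Z - U) - K.
            w, Q = np.linalg.eigh(rho * (Z - U) - K)
            theta_eig = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
            Theta = (Q * theta_eig) @ Q.T
            # Z update: element-wise soft-thresholding produces the sparsity pattern.
            A = Theta + U
            Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
            # Dual update.
            U = U + Theta - Z
        return Z

The soft-thresholding step is what yields the interpretable zero pattern in the estimated inverse kernel, and each update parallelizes naturally, which is the appeal of ADMM noted in the abstract.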
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for the preparation of the manuscript "Lagrangian analysis of submesoscale flows from sparse data using Gaussian Process Regression for field reconstruction".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a widely used unsupervised learning technique that groups data into homogeneous clusters. However, when dealing with real-world data that contain categorical values, existing algorithms can be computationally costly in high dimensions and can struggle with noisy data that have missing values. Furthermore, with one exception, existing algorithms provide no theoretical guarantees of clustering accuracy. In this article, we propose a general categorical data encoding method and a computationally efficient spectral-based algorithm to cluster high-dimensional noisy categorical data (nominal or ordinal). Under a statistical model for data on m attributes from n subjects in r clusters with missing probability ϵ, we show that our algorithm exactly recovers the true clusters with high probability when mn(1−ϵ) ≥ CMr² log³ M, with M = max(n, m) and a fixed constant C. In addition, we show that mn(1−ϵ)² ≥ rδ/2 with 0
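The encode-then-spectral-cluster idea can be sketched generically: one-hot encode each categorical attribute, leave missing entries as all-zero blocks, and run k-means on the leading singular vectors. This is a rough illustration under those assumptions, not the authors' algorithm; the function name, the float-coded inputs and the use of scikit-learn's KMeans are choices made here.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_categorical(X, r):
        """X: (n, m) array of categorical codes as floats, np.nan for missing.
        r: number of clusters. Returns cluster labels for the n subjects."""
        n, m = X.shape
        blocks = []
        for j in range(m):
            col = X[:, j]
            cats = np.unique(col[~np.isnan(col)])
            B = np.zeros((n, len(cats)))
            for c, cat in enumerate(cats):
                B[col == cat, c] = 1.0        # missing entries stay all-zero
            blocks.append(B)
        E = np.hstack(blocks)                  # one-hot encoding of all attributes
        U, s, _ = np.linalg.svd(E, full_matrices=False)
        top = U[:, :r] * s[:r]                 # leading spectral embedding
        return KMeans(n_clusters=r, n_init=10).fit_predict(top)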
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Pre-trained models and associated processed datasets for all analyses in the paper
Species and environmental data: This compiled (zip) file consists of 7 matrices of data: one species data matrix, with abundance observations per visited plot; and 6 environmental data matrices, consisting of land cover classification (Class), simulated EnMAP and Landsat data (April and August), and a 6 time-step Landsat time series (January, March, May, June, July and September). All data are compiled to the 125 m radius plots, as described in the paper. (Leitaoetal_Mapping beta diversity from space_Data.zip)

1. Spatial patterns of community composition turnover (beta diversity) may be mapped through Generalised Dissimilarity Modelling (GDM). While remote sensing data are adequate to describe these patterns, the often high-dimensional nature of these data poses some analytical challenges, potentially resulting in loss of generality. This may hinder the use of such data for mapping and monitoring beta-diversity patterns.

2. This study presents Sparse Generalised Dissimilarity Modelling (SGDM), a methodological framework designed to improve the use of high-dimensional data to predict community turnover with GDM. SGDM consists of a two-stage approach: first transforming the environmental data with a sparse canonical correlation analysis (SCCA), aimed at dealing with high-dimensional datasets, and secondly fitting the transformed data with GDM. The SCCA penalisation parameters are chosen according to a grid search procedure in order to optimise the predictive performance of a GDM fit on the resulting components. The proposed method was illustrated on a case study with a clear environmental gradient of shrub encroachment following cropland abandonment, and subsequent turnover in the bird communities. Bird community data, collected on 115 plots located along the described gradient, were used to fit composition dissimilarity as a function of several remote sensing datasets, including a time series of Landsat data as well as simulated EnMAP hyperspectral data.

3. The proposed approach always outperformed GDM models when fit on high-dimensional datasets. Its usage on low-dimensional data was not consistently advantageous. Models using high-dimensional data, on the other hand, always outperformed those using low-dimensional data, such as single-date multispectral imagery.

4. This approach improved the direct use of high-dimensional remote sensing data, such as time series or hyperspectral imagery, for community dissimilarity modelling, resulting in better performing models. The good performance of models using high-dimensional datasets further highlights the relevance of dense time series and data coming from new and forthcoming satellite sensors for ecological applications such as mapping species beta diversity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many problems in classification involve huge numbers of irrelevant features. Variable selection reveals the crucial features, reduces the dimensionality of feature space, and improves model interpretation. In the support vector machine literature, variable selection is achieved by ℓ1 penalties. These convex relaxations seriously bias parameter estimates toward 0 and tend to admit too many irrelevant features. The current article presents an alternative that replaces penalties by sparse-set constraints. Penalties still appear, but serve a different purpose. The proximal distance principle takes a loss function L(β) and adds the penalty (ρ/2) dist(β, S_k)², capturing the squared Euclidean distance of the parameter vector β to the sparsity set S_k where at most k components of β are nonzero. If β_ρ represents the minimum of the objective f_ρ(β) = L(β) + (ρ/2) dist(β, S_k)², then β_ρ tends to the constrained minimum of L(β) over S_k as ρ tends to ∞. We derive two closely related algorithms to carry out this strategy. Our simulated and real examples vividly demonstrate how the algorithms achieve better sparsity without loss of classification power. Supplementary materials for this article are available online.
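A minimal sketch of the proximal distance idea follows, using a logistic loss as a stand-in for the article's classification loss and plain gradient steps rather than the paper's two algorithms: the projection onto S_k keeps the k largest-magnitude coefficients, and the penalty gradient is ρ(β − P_k(β)). The learning rate, iteration count and loss choice are assumptions made for the illustration.

    import numpy as np

    def project_sparse(beta, k):
        """Project beta onto S_k: keep the k largest-magnitude entries."""
        out = np.zeros_like(beta)
        idx = np.argsort(np.abs(beta))[-k:]
        out[idx] = beta[idx]
        return out

    def proximal_distance_logistic(X, y, k, rho=1.0, lr=0.01, n_iter=500):
        """Minimize logistic loss + (rho/2)*dist(beta, S_k)^2 by gradient descent.
        y is coded as +/-1."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            margin = y * (X @ beta)
            grad_loss = -(X.T @ (y / (1.0 + np.exp(margin)))) / n
            grad_pen = rho * (beta - project_sparse(beta, k))
            beta -= lr * (grad_loss + grad_pen)
        return project_sparse(beta, k)   # final hard projection onto S_k

In practice ρ would be increased along a schedule so that β_ρ approaches the constrained minimum over S_k, as described in the abstract.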
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Factor models have been applied extensively for forecasting when high-dimensional datasets are available. In this case, the number of variables can be very large; for instance, the dynamic factor models commonly used in central banks handle over 100 variables. However, there is a growing body of literature indicating that more variables do not necessarily lead to estimated factors with lower uncertainty or better forecasting results. This paper investigates the usefulness of partial least squares techniques that take into account the variable to be forecast when reducing the dimension of the problem from a large number of variables to a smaller number of factors. We propose different approaches of dynamic sparse partial least squares as a means of improving forecast efficiency by simultaneously taking into account the variable to be forecast while forming an informative subset of predictors, instead of using all the available ones to extract the factors. We use the well-known Stock and Watson database to check the forecasting performance of our approach. The proposed dynamic sparse models show good performance in improving efficiency compared to widely used factor methods in macroeconomic forecasting.
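A one-component sparse PLS step illustrates the core idea of forming a factor only from predictors relevant to the target. This is a generic sketch, not the paper's dynamic formulation; the soft-threshold lam is an assumed tuning parameter and the inputs are assumed to be standardized.

    import numpy as np

    def sparse_pls_one_component(X, y, lam=0.1):
        """X: (T, N) standardized predictors, y: (T,) target.
        Returns the sparse weight vector w, the factor t = X @ w, and the
        coefficient from regressing y on the factor."""
        w = X.T @ y                                           # covariance-based weights
        w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)     # soft-threshold: drop weak predictors
        if np.all(w == 0):
            raise ValueError("lam too large: all weights shrunk to zero")
        w = w / np.linalg.norm(w)
        t = X @ w                                             # sparse factor
        coef = (t @ y) / (t @ t)                              # regress y on the factor
        return w, t, coef

Because the weights are built from the covariance with the target and then thresholded, only an informative subset of predictors enters the factor, which is the contrast with standard factor extraction described above.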
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can also be found in non-text (e.g., numeric) data; however, it is most prevalent in text data. It also places the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. The high and sparse dimensionality of text data has also been considered during classification. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while exploiting the correlation among the multiple labels in the data set. Our text classification technique, called pseudo-LSC (pseudo-Label Based Subspace Clustering), is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
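The pseudo-label idea (treating each combination of existing class labels as one label) can be sketched with a generic label-powerset mapping. This only illustrates the label-combination step, not the pseudo-LSC subspace clustering algorithm itself; the TF-IDF and logistic-regression pipeline is an assumption for the sketch.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def fit_pseudo_label_classifier(texts, label_sets):
        """texts: list of documents; label_sets: list of sets of class labels.
        Each distinct label combination becomes one pseudo label."""
        combos = sorted({frozenset(s) for s in label_sets}, key=sorted)
        combo_id = {c: i for i, c in enumerate(combos)}
        y = [combo_id[frozenset(s)] for s in label_sets]      # pseudo labels
        vec = TfidfVectorizer(min_df=2)                       # high, sparse dimensionality
        Xs = vec.fit_transform(texts)
        clf = LogisticRegression(max_iter=1000).fit(Xs, y)
        return vec, clf, combos

    # Predictions map back to the original label subset:
    # vec, clf, combos = fit_pseudo_label_classifier(train_texts, train_labels)
    # predicted_subset = combos[clf.predict(vec.transform([new_text]))[0]]

Mapping the prediction back through combos recovers a subset of the original class labels, which is how pseudo labels let a single-label learner exploit label correlations.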
This is a scikit-learn compatible Python implementation of Stabl, coupled with useful functions and example notebooks to rerun the analyses on the different use cases located in the Sample data folder of the code library and in the data.zip folder of this repository.
Python version: from 3.7 up to 3.10
Python packages:
Julia package for noise generation (version 1.9.2):
To install Julia, please follow these instructions:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This directory contains R code and required data to run the full data augmentation described in, "An Imputation-Based Approach for Augmenting Sparse Epidemiological Signals."
"aug_pipeline.R" runs through all component steps and calls individual functions and data files within the directory. "plots_for_pipeline.R" uses data created during the aug_pipeline script to visualize individual steps in the augmentation process.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Docker image containing installed netDx software in Ubuntu to reproduce examples from the published manuscript. The R implementation of netDx is hosted at: https://github.com/BaderLab/netDx
Publication abstract: Patient classification has widespread biomedical and clinical applications, including diagnosis, prognosis and treatment response prediction. A clinically useful prediction algorithm should be accurate and generalizable, integrate diverse data types, and handle sparse data. A clinical predictor based on genomic data needs to be easily interpretable to drive hypothesis-driven research into new treatments. We describe netDx, a novel supervised patient classification framework based on patient similarity networks. netDx meets the above criteria and particularly excels at data integration and model interpretability. We compared the classification performance of this method against other machine-learning algorithms, using a cancer survival benchmark with four cancer types, each requiring integration of up to six genomic and clinical data types. In these tests, netDx has significantly higher average performance than most other machine-learning approaches across most cancer types. In comparison to traditional machine learning-based patient classifiers, netDx results are more interpretable, visualizing the decision boundary in the context of patient similarity space. When patient similarity is defined by pathway-level gene expression, netDx identifies biological pathways important for outcome prediction, as demonstrated in diverse data sets of breast cancer and asthma. Thus, netDx can serve both as a patient classifier and as a tool for discovery of biological features characteristic of disease. We provide a freely available software implementation of netDx along with sample files and automation workflows in R.
In situations where the cost/benefit analysis of using physics-based damage propagation algorithms is not favorable and where sufficient test data are available to map out the damage space, one can employ data-driven approaches. In this investigation, we evaluate different algorithms for their suitability in those circumstances. We are interested in assessing the trade-off between the ability to support uncertainty management and the accuracy of the predictions. We compare a Relevance Vector Machine (RVM), Gaussian Process Regression (GPR), and a Neural Network-based approach, and employ them on relatively sparse training sets with very high noise content. Results show that all methods can provide remaining-life estimates, although different damage estimates of the data (diagnostic output) change the outcome considerably. In addition, we found a need for performance metrics that provide a comprehensive and objective assessment of prognostics algorithm performance.
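For the Gaussian Process Regression entry above, a minimal scikit-learn sketch shows how a GPR fit on a sparse, noisy training set yields both a prediction and an uncertainty band, which is the uncertainty-management aspect being traded off. The kernel choice, the synthetic (time, damage) values and the variable names are assumptions, not the study's configuration.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # Sparse, noisy training data: a few (time, damage) observations.
    t_train = np.array([[0.0], [5.0], [12.0], [20.0], [31.0]])
    d_train = np.array([0.02, 0.10, 0.22, 0.41, 0.78]) + 0.05 * np.random.randn(5)

    # RBF captures the smooth degradation trend; WhiteKernel absorbs the high noise content.
    kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=0.01)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t_train, d_train)

    t_query = np.linspace(0.0, 40.0, 200).reshape(-1, 1)
    d_mean, d_std = gpr.predict(t_query, return_std=True)   # prediction with uncertainty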
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset accompanies the following publication, please cite this publication if you use this dataset:
Fischer, T. and Milford, M., 2022. How Many Events Do You Need? Event-Based Visual Place Recognition Using Sparse But Varying Pixels. IEEE Robotics and Automation Letters, 7(4), pp.12275-12282.
@article{FischerRAL2022ICRA2023,
title={How Many Events do You Need? Event-based Visual Place Recognition Using Sparse But Varying Pixels},
author={Tobias Fischer and Michael Milford},
journal={IEEE Robotics and Automation Letters},
volume={7},
number={4},
pages={12275--12282},
year={2022},
doi={10.1109/LRA.2022.3216226},
}
The dataset contains seven sequences of recordings. For each recording, the following files are made available:
A rosbag (*.bag) file with the following contents:
/dvs/events (type: dvs_msgs/EventArray) with the event stream, see https://github.com/uzh-rpg/rpg_dvs_ros
/dvs/camera_info (type: sensor_msgs/CameraInfo) with the camera info of the DAVIS frame camera
/dvs/image_raw (type: sensor_msgs/Image) with the DAVIS frame camera images
/dvs/imu (type: sensor_msgs/Imu) with the IMU data of the event camera
A parquet file, converted from the bag file with a denoising algorithm applied, that can be read with pandas (see the loading sketch after this list).
A zip file containing the DAVIS frame camera images. Once extracted, the images have the timestamp as their filename.
Please see the associated code repository (https://github.com/Tobias-Fischer/sparse-event-vpr) for manually annotated ground-truth information.
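Since the parquet files above are pandas-readable, a minimal loading sketch follows; the file name is a placeholder (substitute the parquet file of the sequence you downloaded), and no column layout is assumed beyond what inspection reveals.

    import pandas as pd

    # Placeholder file name: use the parquet file from the chosen sequence.
    events = pd.read_parquet("sequence.parquet")

    # Inspect the converted, denoised event stream before further processing.
    print(events.columns.tolist())
    print(events.head())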
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the present work, a new artificial neural network (ANN) based model for predicting the curing characteristics of rubber blends (RBs) with different contents of carbon black (CB) filler cured at various temperatures has been developed. The variations of the 4 curing characteristics most commonly used in the rubber industry, namely the minimum and maximum elastic torque, scorch time and optimal cure time, with carbon black content in the rubber blend and cure temperature have been obtained from the analysis of 11 experimental isothermal rheological cure curves registered by an oscillating-disk rheometer at 10 cure temperatures. The computer implementation of the ANN model requires a special pre-processing of the raw experimental data, which is described in detail in the paper. The ANN model for predicting the curing characteristics of RBs with different contents of CB filler at various cure temperatures was implemented in the MATLAB® software package, Version 9.0.0.341360 R2016a 64-bit, equipped with the Neural Network Toolbox (MathWorks, Natick, MA, USA), which provides a number of built-in tools for sufficiently powerful and user-friendly work with ANNs of a wide range of types and architectures. A generalized regression neural network (GRNN) was used to solve the given function approximation problem, in particular for its extremely high learning rate and rapid convergence to optimal regression levels even in the case of sparse data. Satisfactory agreement between the experimental and modelled values has been found for all four curing characteristics, with the maximum prediction error for the modelled minimum and maximum elastic torque less than 3%, and for the modelled scorch time and optimal cure time not exceeding 5%, of their experimental values. It can be concluded that the generalized regression neural network is a very powerful tool for intelligent modelling of the curing process of rubber blends even in the case of a small training dataset, and it can find wide practical application in the rubber industry.
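A generalized regression neural network is essentially a Gaussian-kernel weighted average of the training targets, which the minimal NumPy sketch below reproduces. It mirrors the general GRNN idea rather than the paper's MATLAB toolbox configuration; the spread parameter sigma is an assumed tuning value.

    import numpy as np

    def grnn_predict(X_train, y_train, X_query, sigma=0.1):
        """Generalized regression neural network: Gaussian-kernel weighted
        average of training targets (Nadaraya-Watson form).
        X_train: (n, d), y_train: (n,), X_query: (q, d)."""
        d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)  # squared distances
        w = np.exp(-d2 / (2.0 * sigma ** 2))                             # pattern-layer activations
        return (w @ y_train) / w.sum(axis=1)                             # summation / output layer

In the setting above, X_train would hold (filler content, cure temperature) pairs and y_train one curing characteristic at a time, with inputs scaled as part of the pre-processing the paper describes.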
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data submitted to the 11th Society of Petroleum Engineers Comparative Solution Project. Contains the sparse and dense data files of 77 results submitted by 18 participating groups for the three cases SPE11A-C. Each zip file contains one such result, where the name speX_NAMEY.zip indicates Result Y for Case SPE11X of Participant NAME. Unpacking a result file yields one sparse data file speX_time_series.csv, several dense data files speX_spatial_map_TIME.csv, and, optionally, performance data files. A sparse data file contains the evolution of several scalar quantities over time, while a dense data file contains the spatial distribution of several scalar quantities at a particular reporting time step. For more information, see the related publication. The results can be processed by the scripts provided in the repository github.com/Simulation-Benchmarks/11thSPE-CSP. From the repository's website, a Jupyter Hub is accessible that allows the scripts to be run on the full dataset.
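A minimal pandas sketch for inspecting one unpacked result is shown below. The concrete file names are placeholders following the naming pattern described above (here for Case SPE11A), and no column names are assumed; the official processing scripts live in the linked repository.

    import glob
    import pandas as pd

    # Sparse data: evolution of scalar quantities over time.
    sparse = pd.read_csv("speA_time_series.csv")   # placeholder: actual name from the unpacked result
    print(sparse.columns.tolist())

    # Dense data: one spatial map per reporting time step.
    for path in sorted(glob.glob("speA_spatial_map_*.csv")):
        dense = pd.read_csv(path)
        print(path, dense.shape)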
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. The unknown underlying distribution of, and complicated dependency among, variables (such as heteroscedasticity) increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). Further difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point-based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.
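The general flavour of a change-point scan for sparse mean shifts can be sketched as follows; this is a plain illustration of the idea, not the two control charts proposed in the article. For each candidate change point, the before/after mean difference of every variable is standardized with its own variance estimate, and the largest standardized difference is monitored, which is sensitive when only a few variables shift.

    import numpy as np

    def sparse_shift_scan(X, min_seg=5):
        """X: (T, p) observations. Returns, for each candidate change point,
        the largest standardized per-variable mean difference."""
        T, p = X.shape
        stats = np.full(T, -np.inf)
        for t in range(min_seg, T - min_seg):
            a, b = X[:t], X[t:]
            # per-variable standard error; each variable uses its own variance estimate
            se = np.sqrt(a.var(axis=0, ddof=1) / t + b.var(axis=0, ddof=1) / (T - t))
            z = np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.maximum(se, 1e-12)
            stats[t] = z.max()         # max over variables targets sparse shifts
        return stats

    # The candidate change point is argmax(stats); a control limit would be set by simulation.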
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2018. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, potentially reaching 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5-19 per cent.
The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same time-series processing approach (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.
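The canopy-cover thresholds given above translate directly into a simple rule. The sketch below just encodes those stated thresholds and is not part of the product's actual time-series algorithm; note that the full forest definition also requires a potential height of 2 metres and a minimum area of 0.2 hectares, which a per-pixel rule like this does not capture.

    def woody_class(canopy_cover_percent):
        """Map percentage canopy cover to the three-class scheme described above."""
        if canopy_cover_percent >= 20:
            return "forest"          # forest also requires height and minimum-area criteria
        if canopy_cover_percent >= 5:
            return "sparse woody"
        return "non-woody"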
Matlab has a reputation for running slowly. Here are some pointers on how to speed computations, to an often unexpected degree. Subjects currently covered: Matrix Coding; Implicit Multithreading on a Multicore Machine; Sparse Matrices; Sub-Block Computation to Avoid Memory Overflow.

Matrix Coding - 1
Matlab documentation notes that efficient computation depends on using the matrix facilities, and that mathematically identical algorithms can have very different runtimes, but it is a bit coy about just what these differences are. A simple but telling example: the following is the core of the GD-CLS algorithm of Berry et al., copied from fig. 1 of Shahnaz et al., 2006, "Document clustering using nonnegative matrix factorization":

    for jj = 1:maxiter
        A = W'*W + lambda*eye(k);
        for ii = 1:n
            b = W'*V(:,ii);
            H(:,ii) = A \ b;
        end
        H = H .* (H>0);
        W = W .* (V*H') ./ (W*(H*H') + 1e-9);
    end

Replacing the column-wise update of H with a matrix update gives:

    for jj = 1:maxiter
        A = W'*W + lambda*eye(k);
        B = W'*V;
        H = A \ B;
        H = H .* (H>0);
        W = W .* (V*H') ./ (W*(H*H') + 1e-9);
    end

These were tested on an 8049 x 8660 sparse bag-of-words matrix V (.0083 non-zeros), with W of size 8049 x 50, H 50 x 8660, maxiter = 50, lambda = 0.1, and identical initial W. They were run consecutively, multithreaded on an 8-processor Sun server, starting at ~7:30 PM, and tic-toc timing was recorded. Runtimes were respectively 6586.2 and 70.5 seconds, a 93:1 difference. The maximum absolute pairwise difference between W matrix values was 6.6e-14. Similar speedups have been consistently observed in other cases. In one algorithm, combining matrix operations with efficient use of the sparse matrix facilities gave a 3600:1 speedup. For speed alone, C-style iterative programming should be avoided wherever possible. In addition, when a couple of lines of matrix code can substitute for an entire C-style function, program clarity is much improved.

Matrix Coding - 2
Applied to integration, the speed gains are not so great, largely due to the time taken to set up and deal with the boundaries; the anonymous function setup time is negligible. I demonstrate on a simple uniform-step linearly interpolated 1-D integration of cos() from 0 to pi, which should yield zero:

    tic;
    step = .00001;
    fun = @cos;
    start = 0;
    endit = pi;
    enda = floor((endit - start)/step)*step + start;
    delta = (endit - enda)/step;
    intF = fun(start)/2;
    intF = intF + fun(endit)*delta/2;
    intF = intF + fun(enda)*(delta+1)/2;
    for ii = start+step:step:enda-step
        intF = intF + fun(ii);
    end
    intF = intF*step
    toc;

    intF = -2.910164109692914e-14
    Elapsed time is 4.091038 seconds.

Replacing the inner summation loop with the matrix equivalent speeds things up a bit:

    tic;
    step = .00001;
    fun = @cos;
    start = 0;
    endit = pi;
    enda = floor((endit - start)/step)*step + start;
    delta = (endit - enda)/step;
    intF = fun(start)/2;
    intF = intF + fun(endit)*delta/2;
    intF = intF + fun(enda)*(delta+1)/2;
    intF = intF + sum(fun(start+step:step:enda-step));
    intF = intF*step
    toc;

    intF = -2.868419946011613e-14
    Elapsed time is 0.141564 seconds.

The core computation take
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages of this approach include: 1) it is completely data driven; 2) training and evaluation are faster than conventional methods; 3) it operates effectively on huge datasets (> 1 billion samples × > 1 million features); and 4) it has proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.
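The "highly quantized features plus ensembles of decision trees" idea can be sketched generically with scikit-learn. This is an illustrative stand-in, not MAC's SPADE implementation; the bin count, binning strategy and the ExtraTrees ensemble are assumptions.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.ensemble import ExtraTreesClassifier

    # Quantize each input feature into a small number of ordinal bins, then
    # train an ensemble of decision trees on the quantized representation.
    model = make_pipeline(
        KBinsDiscretizer(n_bins=8, encode="ordinal", strategy="quantile"),
        ExtraTreesClassifier(n_estimators=200, n_jobs=-1),
    )
    # model.fit(X_train, y_train); model.predict(X_test)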
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
row sparse (Sparse Model B)