Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison for the user cold-start problem on the BE dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistics of sample data from the Last.fm and Douban datasets.
Regression problems on massive data sets are ubiquitous in many application domains, including the Internet, earth and space sciences, and finance. Gaussian Process regression is a popular technique for modeling the input-output relations of a set of variables under the assumption that the weight vector has a Gaussian prior. However, it is challenging to apply Gaussian Process regression to large data sets, since prediction based on the learned model requires inverting an order-n kernel matrix. Approximate solutions for sparse Gaussian Processes have been proposed to address this. However, in almost all cases, these solution techniques are agnostic to the input domain and do not preserve the similarity structure in the data. As a result, although these solutions sometimes provide excellent accuracy, the resulting models are not interpretable. Such interpretable sparsity patterns are very important for many applications. We propose a new technique for sparse Gaussian Process regression that computes a parsimonious model while preserving the interpretability of the sparsity structure in the data. We discuss how the inverse kernel matrix used in Gaussian Process prediction gives valuable domain information and then adapt the inverse covariance estimation from Gaussian graphical models to estimate the Gaussian kernel. We solve the optimization problem using the alternating direction method of multipliers, which is amenable to parallel computation. We demonstrate the performance of our method in terms of accuracy, scalability, and interpretability on a climate data set.
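The inverse covariance estimation step described above can be illustrated with a standard ADMM graphical-lasso loop. The sketch below (Python/NumPy) is a generic illustration rather than the authors' implementation; the empirical covariance/kernel matrix S, the regularization weight lam, and the ADMM penalty rho are assumed placeholders.

import numpy as np

def soft_threshold(x, tau):
    """Elementwise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sparse_inverse_covariance_admm(S, lam=0.1, rho=1.0, n_iter=200, tol=1e-6):
    """Estimate a sparse precision matrix Theta from a covariance/kernel matrix S
    by solving the graphical-lasso problem
        min_Theta  -logdet(Theta) + tr(S Theta) + lam * ||Theta||_1
    with ADMM (consensus splitting Theta = Z)."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(n_iter):
        # Theta-update: closed form via eigendecomposition of rho*(Z - U) - S.
        w, V = np.linalg.eigh(rho * (Z - U) - S)
        theta_eig = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (V * theta_eig) @ V.T
        # Z-update: elementwise soft-thresholding enforces the sparsity pattern.
        Z_old = Z
        Z = soft_threshold(Theta + U, lam / rho)
        # Dual update.
        U = U + Theta - Z
        if np.linalg.norm(Z - Z_old) < tol:
            break
    return Z  # sparse, interpretable estimate of the inverse kernel/covariance

# Usage sketch: S = np.cov(X, rowvar=False); Theta_hat = sparse_inverse_covariance_admm(S)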
In this paper we propose an innovative learning algorithm, a variation of the one-class ν-Support Vector Machine (SVM) learning algorithm, that produces sparser solutions with much reduced computational complexity. The proposed technique returns an approximate solution, nearly as good as the solution set obtained by the classical approach, by minimizing the original risk function along with a regularization term. We introduce a bi-criterion optimization that helps guide the search towards the optimal set in much reduced time. The outcome of the proposed learning technique was compared with the benchmark one-class SVM algorithm, which more often leads to solutions with redundant support vectors. Throughout the analysis, the problem size for both optimization routines was kept consistent. We have tested the proposed algorithm on a variety of data sources under different conditions to demonstrate its effectiveness. In all cases the proposed algorithm closely preserves the accuracy of standard one-class ν-SVMs while reducing both training time and test time by several factors.
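For reference, the benchmark one-class ν-SVM that the proposed sparser variant is compared against can be run with scikit-learn as below. This is only the baseline: the data, kernel choice, and ν value are placeholders, and the bi-criterion sparsification itself is not part of this snippet; the support-vector count it prints is the quantity the proposed algorithm aims to reduce.

import numpy as np
from sklearn.svm import OneClassSVM

# Toy "normal" training data and a test set containing a few outliers (placeholders).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
X_test = np.vstack([rng.normal(size=(95, 5)), rng.normal(5.0, 1.0, size=(5, 5))])

# Standard one-class nu-SVM: nu upper-bounds the fraction of training outliers
# and lower-bounds the fraction of support vectors.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = clf.predict(X_test)  # +1 = inlier, -1 = outlier

print("support vectors:", clf.support_vectors_.shape[0])
print("flagged outliers:", int((pred == -1).sum()))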
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MetaFlux is a global, long-term carbon flux dataset of gross primary production and ecosystem respiration that is generated using meta-learning. The principle of meta-learning stems from the need to solve the problem of learning in the face of sparse data availability. Data sparsity is a prevalent challenge in climate and ecology science. For instance, in-situ observations tend to be spatially and temporally sparse. This issue can arise from sensor malfunctions, limited sensor locations, or non-ideal climate conditions such as persistent cloud cover. The lack of high-quality continuous data can make it difficult to understand many climate processes that are otherwise critical. The machine-learning community has attempted to tackle this problem by developing several learning approaches, including meta-learning, which learns how to learn broad features across tasks in order to better infer other, poorly sampled ones. In this work, we applied meta-learning to solve the problem of upscaling continuous carbon fluxes from sparse observations. Data scarcity in carbon flux applications is particularly problematic in the tropics and semi-arid regions, where only around 8–11% of long-term eddy covariance stations are currently operational. Unfortunately, these regions are important in modulating the global carbon cycle and its interannual variability. In general, we find that meta-trained machine learning models, including multi-layer perceptrons (MLP), long short-term memory (LSTM), and bi-directional LSTM (BiLSTM), have 9–16% lower validation errors on flux estimates when compared to their non-meta-trained counterparts. In addition, meta-trained models are more robust to extreme conditions, with 4–24% lower overall errors. Finally, we use an ensemble of meta-trained deep networks to upscale in-situ observations of ecosystem-scale photosynthesis and respiration fluxes to daily and monthly global products at a 0.25-degree spatial resolution from 2001 to 2023, a product we call "MetaFlux". We also checked the seasonality, interannual variability, and correlation to solar-induced fluorescence of the upscaled product and found that MetaFlux outperformed state-of-the-art machine learning upscaling models, especially in critical semi-arid and tropical regions.
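To make the meta-learning idea concrete, the sketch below applies a first-order, Reptile-style meta-update to a small PyTorch flux regressor, where each "task" is one sparsely sampled site. This is a generic illustration under assumed hyperparameters and input sizes; the exact meta-learning procedure, architectures, and predictors used for MetaFlux may differ.

import copy
import torch
import torch.nn as nn

# A small MLP flux regressor: meteorological drivers in, carbon flux out (placeholder sizes).
meta_model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def adapt_to_task(model, x, y, inner_lr=1e-3, inner_steps=5):
    """Clone the meta-model and take a few gradient steps on one task
    (e.g., one eddy-covariance site)."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    return adapted

def reptile_meta_step(meta_model, tasks, meta_lr=0.1):
    """Move the meta-parameters toward the task-adapted parameters (Reptile update)."""
    for x, y in tasks:  # each element: (features, target flux) tensors for one site
        adapted = adapt_to_task(meta_model, x, y)
        with torch.no_grad():
            for p_meta, p_task in zip(meta_model.parameters(), adapted.parameters()):
                p_meta.add_(meta_lr * (p_task - p_meta))

# Usage sketch with made-up data for four sites.
tasks = [(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(4)]
reptile_meta_step(meta_model, tasks)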
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison on the Last.fm dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameters used for the AD methods.
The efficiency of genomic selection methodologies can be increased by sparse testing, where only a subset of the materials is evaluated in each environment. Seven different multi-environment plant breeding datasets were used to evaluate four different methods for allocating lines to environments in a multi-trait genomic prediction problem. The results of the analysis are presented in the accompanying article.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training and testing sets for the e-commerce product images dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Factor models have been applied extensively for forecasting when high-dimensional datasets are available. In this case, the number of variables can be very large. For instance, typical dynamic factor models used in central banks handle over 100 variables. However, there is a growing body of literature indicating that more variables do not necessarily lead to estimated factors with lower uncertainty or better forecasting results. This paper investigates the usefulness of partial least squares techniques that take into account the variable to be forecast when reducing the dimension of the problem from a large number of variables to a smaller number of factors. We propose several dynamic sparse partial least squares approaches as a means of improving forecast efficiency by taking into account the variable to be forecast while forming an informative subset of predictors, instead of using all the available ones to extract the factors. We use the well-known Stock and Watson database to assess the forecasting performance of our approach. The proposed dynamic sparse models show good performance in improving efficiency compared to widely used factor methods in macroeconomic forecasting.
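A minimal sketch of the core idea, assuming a single sparse PLS component obtained by soft-thresholding the covariance-based weight vector; this is a simplified, static stand-in for the dynamic multi-factor approaches proposed in the paper, with placeholder data and threshold.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_pls_forecast(X, y, x_new, tau=0.5):
    """One-component sparse PLS: build a factor from a sparse, target-aware
    weight vector and regress the variable to be forecast on that factor.
    X: (T, N) predictors, y: (T,) target, x_new: (N,) latest predictor values."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc                                 # covariance of each predictor with the target
    w = soft_threshold(w, tau * np.abs(w).max())  # drop weakly related predictors
    w /= np.linalg.norm(w)
    t = Xc @ w                                    # sparse PLS factor
    beta = (t @ yc) / (t @ t)                     # regress the target on the factor
    return y_mean + beta * ((x_new - x_mean) @ w)

# Usage sketch: only the first 3 of 50 predictors actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = X[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=200)
print(sparse_pls_forecast(X[:-1], y[:-1], X[-1]))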
NOTE: This dataset is now outdated. Please see https://doi.org/10.11583/DTU.17091101 for the updated version with many more problems.
This is a collection of min-cut/max-flow problem instances that can be used for benchmarking min-cut/max-flow algorithms. The collection is released as a companion to the paper:
The problem instances are collected from a wide selection of sources to be as representative as possible. Specifically, this collection contains:
The reason for releasing this collection is to provide a single place to download all datasets used in our paper (and various previous papers) instead of having to scavenge them from multiple sources. Furthermore, several of the problem instances typically used for benchmarking min-cut/max-flow algorithms are no longer available at their original locations and may be difficult to find. By storing the data on Zenodo with a dedicated DOI we hope to avoid this problem. For license information, see below.
Files and formats
We provide all problem instances in two file formats: DIMACS and a custom binary format. Both are described below. Each file has been zipped, and similar files have then been grouped into their own zip file (i.e. it is a zip of zips). DIMACS files have been prefixed with `dimacs_` and binary files have been prefixed with `bin_`.
DIMACS
All problem instances are available in DIMACS format (explained here: http://lpsolve.sourceforge.net/5.5/DIMACS_maxf.htm).
For the larger problem instances, we have also published a partition of the graph nodes into blocks for block-based parallel min-cut/max-flow. The partition matches the one used in the companion review paper (Jensen et al., 2021). For a problem instance with filename `
An example partition listing (here, 32 nodes split into 4 blocks, one block index per node) looks like:
4
0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3
Binary files
While DIMACS has the advantage of being human-readable, storing everything as text requires a lot of space. This makes the files unnecessarily large and slow to parse. To overcome this, we also release all problem instances in a simple binary storage format. We have two formats: one for graphs and one for quadratic pseudo-boolean optimization (QPBO) problems. Code to convert to/from DIMACS is also available at: https://www.doi.org/10.5281/zenodo.4903946 or https://github.com/patmjen/maxflow_algorithms.
Binary BK (`.bbk`) files are for storing normal graphs for min-cut/max-flow. They closely follow the internal storage format used in the original implementation of the Boykov-Kolmogorov algorithm, meaning that terminal arcs are stored in a separate list from normal neighbor arcs. The format is:
Uncompressed:
Header: (3 x uint8) 'BBQ'
Type codes: (2 x uint8) captype, tcaptype
Sizes: (3 x uint64) num_nodes, num_terminal_arcs, num_neighbor_arcs
Terminal arcs: (num_terminal_arcs x BkTermArc)
Neighbor arcs: (num_neighbor_arcs x BkNborArc)
Compressed (using Google's snappy: https://github.com/google/snappy):
Header: (3 x uint8) 'bbq'
Type codes: (2 x uint8) captype, tcaptype
Sizes: (3 x uint64) num_nodes, num_terminal_arcs, num_neighbor_arcs
Terminal arcs: (1 x uint64) compressed_bytes_1
(compressed_bytes_1 x uint8) compressed num_terminal_arcs x BkTermArc
Neighbor arcs: (1 x uint64) compressed_bytes_2
(compressed_bytes_2 x uint8) compressed num_neighbor_arcs x BkNborArc
Where:
/** Enum for switching over POD types. */
enum TypeCode : uint8_t {
TYPE_UINT8,
TYPE_INT8,
TYPE_UINT16,
TYPE_INT16,
TYPE_UINT32,
TYPE_INT32,
TYPE_UINT64,
TYPE_INT64,
TYPE_FLOAT,
TYPE_DOUBLE,
TYPE_INVALID = 0xFF
};
/** Terminal arc with source and sink capacity for given node. */
template <class tcaptype>
struct BkTermArc {        // field layout is indicative; see the linked conversion code
    uint64_t node;        // node the terminal arc belongs to
    tcaptype source_cap;  // capacity of the arc from the source
    tcaptype sink_cap;    // capacity of the arc to the sink
};
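As a quick sanity check of the layout above, the fixed-size header of an uncompressed `.bbk` file can be parsed as in the sketch below. This is a hedged illustration, not the official reader (see the conversion code linked above); it assumes little-endian byte order and stops before the arc records, whose exact on-disk layout is defined in that code.

import struct

TYPE_CODES = {0: "uint8", 1: "int8", 2: "uint16", 3: "int16", 4: "uint32",
              5: "int32", 6: "uint64", 7: "int64", 8: "float", 9: "double"}

def read_bbk_header(path):
    """Read the header of an uncompressed binary BK (.bbk) file."""
    with open(path, "rb") as f:
        magic = f.read(3)
        if magic != b"BBQ":
            raise ValueError("not an uncompressed .bbk file (got %r)" % magic)
        captype, tcaptype = struct.unpack("<2B", f.read(2))
        num_nodes, num_term_arcs, num_nbor_arcs = struct.unpack("<3Q", f.read(24))
    return {"captype": TYPE_CODES.get(captype, "invalid"),
            "tcaptype": TYPE_CODES.get(tcaptype, "invalid"),
            "num_nodes": num_nodes,
            "num_terminal_arcs": num_term_arcs,
            "num_neighbor_arcs": num_nbor_arcs}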
Binary QPBO (`.bq`) files are for storing QPBO problems. Unary and binary terms are stored in separate lists. The format is:
Uncompressed:
Header: (5 x uint8) 'BQPBO'
Type code: (1 x uint8) captype
Sizes: (3 x uint64) num_nodes, num_unary_terms, num_binary_terms
Unary terms: (num_unary_terms x BkUnaryTerm)
Binary terms: (num_binary_terms x BkBinaryTerm)
Compressed (using Google's snappy: https://github.com/google/snappy):
Header: (5 x uint8) 'bqpbo'
Type code: (1 x uint8) captype
Sizes: (3 x uint64) num_nodes, num_unary_terms, num_binary_terms
Unary terms: (1 x uint64) compressed_bytes_1
(compressed_bytes_1 x uint8) compressed num_unary_terms x BkUnaryTerm
Binary terms: (1 x uint64) compressed_bytes_2
(compressed_bytes_2 x uint8) compressed num_binary_terms x BkBinaryTerm
Where:
/** Enum for switching over POD types. */
enum TypeCode : uint8_t {
TYPE_UINT8,
TYPE_INT8,
TYPE_UINT16,
TYPE_INT16,
TYPE_UINT32,
TYPE_INT32,
TYPE_UINT64,
TYPE_INT64,
TYPE_FLOAT,
TYPE_DOUBLE,
TYPE_INVALID = 0xFF
};
/** Unary term */
template <class captype>
struct BkUnaryTerm {      // field layout is indicative; see the linked conversion code
    uint64_t node;        // node the unary term belongs to
    captype e0;           // energy for label 0
    captype e1;           // energy for label 1
};
Block (`.blk`) files are for storing a partition of the graph nodes into disjoint blocks. The format is:
Nodes: uint64_t num_nodes
Blocks: uint16_t num_blocks
Data: (num_nodes x uint16_t) node_blocks
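A `.blk` file can be read in a few lines; the sketch below assumes little-endian byte order (an assumption, since the byte order is not stated above):

import numpy as np

def read_blk(path):
    """Read a block (.blk) file: the number of blocks and one block index per node."""
    with open(path, "rb") as f:
        num_nodes = int(np.fromfile(f, dtype="<u8", count=1)[0])
        num_blocks = int(np.fromfile(f, dtype="<u2", count=1)[0])
        node_blocks = np.fromfile(f, dtype="<u2", count=num_nodes)
    assert node_blocks.size == num_nodes and int(node_blocks.max()) < num_blocks
    return num_blocks, node_blocks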
We do not claim ownership over the problem instances from the University of Waterloo and those from (Verma and Batra, 2012). Please contact the original sources for additional information. We publish the datasets from (Jeppesen et al., 2020) and (Jensen et al., 2020) under the Creative Commons Attribution 4.0 International (CC BY 4.0), see https://creativecommons.org/licenses/by/4.0/. The specific files are (not counting their `dimacs_`/`bin_` prefix):
University of Waterloo (consult original sources)
Verma & Batra, 2012 (consult original sources):
Jeppesen et al., 2020 (CC BY 4.0):
Jensen et al., 2020 (CC BY 4.0):
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Wavefront shaping, which can manipulate light spatially and temporally to counter the effects of scattering, is gaining importance in complex photonics. Important applications include deep-tissue imaging, microendoscopy, optical communications, nanofabrication, and remote sensing. However, high-speed and high-fidelity wavefront shaping is fundamentally hindered by the dimensionality limitation of hardware devices, evinced by the competition between frame rate, pixel count, and modulation depth. To overcome the speed-fidelity tradeoff, we leverage complex media (e.g., diffusers or multimode fibers) as analogue random multiplexers for pattern compression to address the demand for high-dimensional spatiotemporal control. Sparsity-constrained wavefront optimization is designed to solve the problem by seeking a low-dimensional, robust representation of wavefronts with a carefully designed sparsity constraint. This optimization framework can achieve high-fidelity wavefront shaping through complex media using high-speed, yet relatively low-precision, spatial light modulation devices (e.g., digital micromirror devices) without compromising the frame rate.
Methods
The dataset contains an experimentally calibrated complex-field transmission matrix of a graded-index multimode fiber (GIF50C, Thorlabs) and 1000 preprocessed test images extracted from the Fashion-MNIST dataset. The script takes the preprocessed test images as the ground truth and carries out the sparsity-constrained wavefront optimization to solve for the wavefronts that generate those test images through the multimode fiber, given its transmission matrix.
In brief, the transmission matrix was measured by raster scanning the proximal end of the multimode fiber using a digital micromirror device (DMD) and recording the corresponding speckles at the distal end using off-axis holography. The test images were first downsampled and interpolated to match the coordinates of the distal end of the multimode fiber. Then, they were vectorized into one-dimensional vectors to comply with the format of the transmission matrix. The initial guesses were obtained by performing the Gerchberg-Saxton algorithm with 10 iterations. All the implementation details can be found in Section 4.3 of our paper https://arxiv.org/abs/2302.10254.
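The Gerchberg-Saxton initialization mentioned above can be sketched as follows. This is a generic transmission-matrix phase-retrieval loop, assuming a phase-only (unit-amplitude) input and a conjugate-transpose back-projection; the authors' exact implementation (and the subsequent sparsity-constrained optimization) is described in Section 4.3 of the paper.

import numpy as np

def gerchberg_saxton_init(T, target_amplitude, n_iter=10, seed=0):
    """Estimate a phase-only input wavefront x such that |T @ x| approximates the
    target output amplitude, using Gerchberg-Saxton iterations.
    T: (M, N) complex transmission matrix; target_amplitude: (M,) nonnegative."""
    rng = np.random.default_rng(seed)
    x = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, T.shape[1]))  # random phase-only start
    for _ in range(n_iter):
        y = T @ x                                        # forward propagation through the fiber
        y = target_amplitude * np.exp(1j * np.angle(y))  # impose the target amplitude
        x = T.conj().T @ y                               # back-project (approximate inverse)
        x = np.exp(1j * np.angle(x))                     # re-impose the phase-only constraint
    return x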
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed for Threat Assessment of satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages of this approach include: 1) it is completely data driven; 2) training and evaluation are faster than with conventional methods; 3) it operates effectively on huge datasets (> 1 billion samples x > 1 million features); and 4) it has proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Basic statistics for the PFNF algorithm on the Douban datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of a collection of approximately 50k research articles from the PubMed repository. These documents were originally annotated manually by biomedical experts with MeSH labels, and each article is described by 10-15 MeSH labels. The dataset contains a huge number of labels appearing as MeSH majors, which raises the issues of an extremely large output space and severe label sparsity. To address this, the labels have been processed and mapped to their roots as described below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The application of machine learning to theoretical chemistry has made it possible to combine the accuracy of quantum chemical energetics with the thorough sampling of finite-temperature fluctuations. To reach this goal, a diverse set of methods has been proposed, ranging from simple linear models to kernel regression and highly nonlinear neural networks. Here we apply two widely different approaches to the same, challenging problem - the sampling of the conformational landscape of polypeptides at finite temperature. We develop a Local Kernel Regression (LKR) coupled with a supervised sparsity method and compare it with a more established approach based on Behler-Parrinello type Neural Networks. In the context of the LKR, we discuss how the supervised selection of the reference pool of environments is crucial to achieve accurate potential energy surfaces at a competitive computational cost and leverage the locality of the model to infer which chemical environments are poorly described by the DFTB baseline. We then discuss the relative merits of the two frameworks and perform Hamiltonian-reservoir replica-exchange Monte Carlo sampling and metadynamics simulations, respectively, to demonstrate that both frameworks can achieve converged and transferable sampling of the conformational landscape of complex and flexible biomolecules with comparable accuracy and computational cost.
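To make the idea of a reference pool of environments concrete, the sketch below fits a sparse kernel model that predicts energies from a small set of reference points selected by farthest-point sampling. It is a generic subset-of-regressors sketch with a Gaussian kernel on plain feature vectors, not the LKR with supervised selection developed in the paper.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / sigma ** 2)

def farthest_point_sampling(X, n_ref, seed=0):
    """Greedy selection of a diverse reference pool (an unsupervised stand-in
    for the supervised selection used in the paper)."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(n_ref - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    return np.array(idx)

def fit_sparse_kernel_regression(X, y, n_ref=50, sigma=1.0, reg=1e-6):
    """Subset-of-regressors sparse kernel ridge regression on reference points."""
    ref = X[farthest_point_sampling(X, n_ref)]
    Knm = gaussian_kernel(X, ref, sigma)
    Kmm = gaussian_kernel(ref, ref, sigma)
    alpha = np.linalg.solve(Knm.T @ Knm + reg * Kmm, Knm.T @ y)
    return lambda Xq: gaussian_kernel(Xq, ref, sigma) @ alpha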
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting manuscript submitted to PNAS: Video-Rate Raman-based Metabolic Imaging by Airy Light-Sheet Illumination and Photon-Sparse Detection. The data set includes: [1] raw data and [2] related images used in the analyses described within the manuscript. Despite its massive potential, Raman imaging represents just a modest fraction of all research and clinical microscopy to date. This is due to the ultralow Raman scattering cross-sections of most biomolecules, which impose low-light or photon-sparse conditions. Bioimaging under such conditions is suboptimal, as it either results in ultralow frame rates or requires increased levels of irradiance. Here, we overcome this tradeoff by introducing Raman imaging that operates at both video rates and 1,000-fold lower irradiance than state-of-the-art methods. To accomplish this, we deployed a judiciously designed Airy light-sheet microscope to efficiently image large specimen regions. Further, we implemented subphoton-per-pixel image acquisition and reconstruction to confront issues arising from photon sparsity at just millisecond integrations. We demonstrate the versatility of our approach by imaging a variety of samples, including the three-dimensional (3D) metabolic activity of single microbial cells and the underlying cell-to-cell variability. To image such small-scale targets, we again harnessed photon sparsity to increase magnification without a field-of-view penalty, thus overcoming another key limitation in modern light-sheet microscopy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the proposed model with baseline algorithms in terms of RMSE as a function of SR.
Elsevier CPC user license: https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/
Abstract A program is presented for determining a few selected eigenvalues and their eigenvectors on either end of the spectrum of a large, real, symmetric matrix. Based on the Davidson method, which is extensively used in quantum chemistry/physics, the current implementation improves the power of the original algorithm by adopting several extensions. The matrix-vector multiplication routine that it requires is to be provided by the user. Different matrix formats and optimizations are thus feasible. E...
Title of program: DVDSON Catalogue Id: ACPZ_v1_0
Nature of problem Finding a few extreme eigenpairs of a real, symmetric matrix is of great importance in scientific computations. Examples abound in structural engineering, quantum chemistry and electronic structure physics [1,2]. The matrices involved are usually too large to be efficiently solved using standard methods. Moreover, their large size often prohibits full storage forcing various sparse representations. Even sparse representations cannot always be stored in main memory [3]. Thus, an iterative method ...
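For orientation, a bare-bones Davidson iteration for the lowest eigenpair of a large symmetric matrix looks roughly like the sketch below. It is a simplified, single-vector NumPy illustration without restarts or blocking, not the DVDSON implementation itself; as in DVDSON, only a user-supplied matrix-vector product is required.

import numpy as np

def davidson_lowest(matvec, diag, n, n_iter=50, tol=1e-8):
    """Simplified Davidson iteration for the lowest eigenpair of a symmetric matrix.
    matvec: function v -> A @ v; diag: diagonal of A (used as preconditioner); n: dimension."""
    V = np.zeros((n, 0))
    t = np.zeros(n)
    t[int(np.argmin(diag))] = 1.0                 # start from the smallest diagonal element
    for _ in range(n_iter):
        t -= V @ (V.T @ t)                        # orthogonalize against the current subspace
        t /= np.linalg.norm(t)
        V = np.column_stack([V, t])
        AV = np.column_stack([matvec(V[:, j]) for j in range(V.shape[1])])
        H = V.T @ AV                              # Rayleigh-Ritz projection
        w, S = np.linalg.eigh(H)
        theta, s = w[0], S[:, 0]
        u = V @ s
        r = AV @ s - theta * u                    # residual of the current Ritz pair
        if np.linalg.norm(r) < tol:
            break
        denom = diag - theta                      # Davidson diagonal preconditioner
        denom[np.abs(denom) < 1e-10] = 1e-10
        t = r / denom
    return theta, u

# Usage sketch: theta, u = davidson_lowest(lambda v: A @ v, np.diag(A), A.shape[0])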
Versions of this program held in the CPC repository in Mendeley Data ACPZ_v1_0; DVDSON; 10.1016/0010-4655(94)90073-6
This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. This property also places the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification and under what kind of circumstances. During classification, the high and sparse dimensionality of text data has also been taken into account. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
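A hedged sketch of the pseudo-label idea: frequently co-occurring label subsets are treated as additional "pseudo labels" that can then be exploited during classification. The thresholds and toy data are placeholders, and the subspace clustering step of pseudo-LSC itself is not shown.

from collections import Counter
from itertools import combinations

def generate_pseudo_labels(label_sets, min_support=2, max_size=3):
    """Return frequently co-occurring label combinations as pseudo labels.
    label_sets: one set of class labels per document."""
    counts = Counter()
    for labels in label_sets:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(labels), size):
                counts[combo] += 1
    return [combo for combo, c in counts.items() if c >= min_support]

def augment(label_sets, pseudo_labels):
    """Attach a pseudo label to every document whose label set contains it."""
    return [set(labels) | {p for p in pseudo_labels if set(p) <= set(labels)}
            for labels in label_sets]

# Toy example: three documents with overlapping topic labels.
docs = [{"sports", "politics"}, {"sports", "politics", "finance"}, {"finance"}]
pseudo = generate_pseudo_labels(docs)   # -> [("politics", "sports")]
print(augment(docs, pseudo))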