Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
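As an illustration of the sparse classification ingredient, the following sketch (not the authors' code; the toy documents and parameters are made up) uses L1-regularized logistic regression on TF-IDF features to pull out a short list of terms that distinguish one small corpus from another, which is the spirit of the comparative summarization step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical toy documents standing in for two report corpora.
docs_a = [
    "aircraft crossed the hold short line without clearance",
    "tower issued a go around due to runway incursion",
]
docs_b = [
    "turbulence encountered during descent into the terminal area",
    "altitude deviation after an autopilot mode confusion",
]
X_text = docs_a + docs_b
y = np.array([1] * len(docs_a) + [0] * len(docs_b))

vec = TfidfVectorizer()
X = vec.fit_transform(X_text)

# The L1 penalty drives most term weights to exactly zero, leaving a short
# list of discriminative terms that acts as a comparative "summary" of A vs. B.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, y)

terms = np.array(vec.get_feature_names_out())
weights = clf.coef_.ravel()
selected = weights != 0
print(sorted(zip(weights[selected], terms[selected]), reverse=True))
```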
The use of high-level intermediate representations (IRs) promises the generation of fast code from a high-level description, improving the productivity of developers while achieving the performance traditionally only reached with low-level programming approaches.
High-level IRs come in two flavors: 1) domain-specific IRs designed to express computations for a single application area only; or 2) generic high-level IRs that can be used to generate high-performance code across many domains. Developing generic IRs is more challenging but offers the advantage of reusing a common compiler infrastructure across various applications.
In this paper, we extend a generic high-level IR to enable efficient computation with sparse data structures. Crucially, we encode the sparse representation using reusable dense building blocks already present in the high-level IR. We use a form of dependent types to model sparse matrices in CSR format by explicitly expressing the relationship between the multiple dense arrays that separately store ...
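To make the "dense building blocks" idea concrete, here is a minimal Python sketch (not the paper's IR) of the CSR format as three dense arrays, plus a matrix-vector product that uses only those arrays; the matrix and vector are arbitrary examples.

```python
import numpy as np

# CSR stores a sparse matrix as three dense arrays: row pointers, column
# indices, and nonzero values. This mirrors the "dense building blocks" idea,
# although the paper expresses it inside a high-level IR rather than in Python.
dense = np.array([[0., 2., 0.],
                  [1., 0., 3.],
                  [0., 0., 4.]])

values  = np.array([2., 1., 3., 4.])   # nonzero entries, row-major
col_idx = np.array([1, 0, 2, 2])       # column of each nonzero
row_ptr = np.array([0, 1, 3, 4])       # row i occupies values[row_ptr[i]:row_ptr[i+1]]

def csr_matvec(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product using only the three dense arrays."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

x = np.array([1., 1., 1.])
assert np.allclose(csr_matvec(row_ptr, col_idx, values, x), dense @ x)
```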
In this paper we propose an innovative learning algorithm, a variation of the one-class ν-Support Vector Machine (SVM) learning algorithm, that produces sparser solutions with much reduced computational complexity. The proposed technique returns an approximate solution, nearly as good as the solution set obtained by the classical approach, by minimizing the original risk function along with a regularization term. We introduce a bi-criterion optimization that helps guide the search towards the optimal set in much reduced time. The outcome of the proposed learning technique was compared with the benchmark one-class Support Vector Machine algorithm, which more often leads to solutions with redundant support vectors. Throughout the analysis, the problem size for both optimization routines was kept consistent. We have tested the proposed algorithm on a variety of data sources under different conditions to demonstrate its effectiveness. In all cases the proposed algorithm closely preserves the accuracy of standard one-class ν-SVMs while reducing both training time and test time by several factors.
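The proposed sparser variant is not available in standard libraries, but the benchmark it is compared against is. The sketch below (synthetic data, hypothetical parameters) fits scikit-learn's one-class ν-SVM and reports the support-vector count, which is exactly the quantity the proposed method aims to reduce.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))   # synthetic "normal" training data
X_test = rng.normal(size=(100, 2))

# Benchmark one-class nu-SVM: nu upper-bounds the fraction of training outliers
# and lower-bounds the fraction of support vectors, so solutions are often dense.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)

n_sv = ocsvm.support_vectors_.shape[0]
print(f"support vectors: {n_sv} / {len(X_train)}")  # the count a sparser variant would cut
print("test predictions (+1 inlier / -1 outlier):", ocsvm.predict(X_test)[:10])
```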
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2020. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, at least 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same time series processing approach (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.

Earlier versions of this dataset were published by the Department of Environment and Energy.
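The class definitions above reduce to simple thresholds. The toy function below only encodes those thresholds for illustration; the actual product is derived from Landsat time series with conditional probability networks, not per-area rules like this.

```python
def woody_class(canopy_cover_pct, height_m, area_ha):
    """Illustrative encoding of the class definitions above (not the actual
    Landsat time-series algorithm, which uses conditional probability networks)."""
    if canopy_cover_pct >= 20 and height_m >= 2 and area_ha >= 0.2:
        return "forest"
    if 5 <= canopy_cover_pct <= 19:
        return "sparse woody"
    return "non-woody"

print(woody_class(35, 6.0, 1.2))   # forest
print(woody_class(12, 3.0, 0.5))   # sparse woody
print(woody_class(3, 0.5, 0.1))    # non-woody
```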
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Radiography is an imaging technique used in a variety of applications, such as medical diagnosis, airport security, and nondestructive testing. We present a deep learning system for extracting information from radiographic images. We perform various prediction tasks using our system, including material classification and regression on the dimensions of a given object that is being radiographed. Our system is designed to address the sparse-data issue for radiographic nondestructive testing applications. It uses a radiographic simulation tool for synthetic data augmentation, and it uses transfer learning with a pre-trained convolutional neural network model. Using this system, our preliminary results indicate that the object geometry regression task saw an improvement of 70% in the R-squared value when using a multi-regime model. In addition, we increase the performance of the object material classification tasks by utilizing data from different imaging systems. In particular, using neutron imaging improved the material classification accuracy by 20% when compared to x-ray imaging.
According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.
One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.
Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.
The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.
The development of High-Speed Hardware Compression Chip technology is revolutionizing the Sparse-Matrix Compression Engine market. These chips are designed to accelerate data compression processes, significantly enhancing the performance of high-performance computing systems. By integrating these chips, organizations can achieve faster data processing speeds, which is crucial for handling large-scale datasets in real-time analytics and AI applications. The chips offer a unique advantage by reducing latency and improving throughput, making them an essential component in modern data centers. As the demand for efficient data management solutions grows, the adoption of high-speed hardware compression chips is expected to rise, driving further innovation and competitiveness in the market.
From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The paper "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" compares various strategies for reordering sparse matrices. The purpose of reordering is to improve performance of sparse matrix operations, for example, by reducing fill-in resulting from sparse Cholesky factorisation or improving data locality in sparse matrix-vector multiplication (SpMV). Many reordering strategies have been proposed in the literature and the current paper provides a thorough comparison of several of the most popular methods.
This comparison is based on performance measurements that were collected on the eX3 cluster, a Norwegian experimental research infrastructure for the exploration of exascale computing. These performance measurements are gathered in the data set provided here; they cover the performance of two SpMV kernels with respect to 490 sparse matrices, 6 matrix orderings and 8 multicore CPUs.
Experimental results are provided in a human-readable, tabular format using plain-text ASCII. This format may be readily consumed by gnuplot to create plots or imported into commonly used spreadsheet tools for further analysis.
Performance measurements are provided based on an SpMV kernel using the compressed sparse row (CSR) storage format with 7 matrix orderings. One file is provided for each of 8 multicore CPU systems considered in the paper:
1. Skylake: csr_all_xeongold16q_032_threads_ss490.txt
2. Ice Lake: csr_all_habanaq_072_threads_ss490.txt
3. Naples: csr_all_defq_064_threads_ss490.txt
4. Rome: csr_all_rome16q_016_threads_ss490.txt
5. Milan A: csr_all_fpgaq_048_threads_ss490.txt
6. Milan B: csr_all_milanq_128_threads_ss490.txt
7. TX2: csr_all_armq_064_threads_ss490.txt
8. Hi1620: csr_all_huaq_128_threads_ss490.txt
A corresponding set of files and performance measurements are provided for a second SpMV kernel that is also studied in the paper.
Each file consists of 490 rows and 54 columns. Each row corresponds to a different matrix from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The first 5 columns specify some general information about the matrix, such as its group and name, as well as the number of rows, columns and nonzeros. Column 6 specifies the number of threads used for the experiment (which depends on the CPU). The remaining columns are grouped according to the 7 different matrix orderings that were studied, in the following order: original, Reverse Cuthill-McKee (RCM), Nested Dissection (ND), Approximate Minimum Degree (AMD), Graph Partitioning (GP), Hypergraph Partitioning (HP), and Gray ordering. For each ordering, the following 7 columns are given:
1. Minimum number of nonzeros processed by any thread by the SpMV kernel
2. Maximum number of nonzeros processed by any thread by the SpMV kernel
3. Mean number of nonzeros processed per thread by the SpMV kernel
4. Imbalance factor, which is the ratio of the maximum to the mean number of nonzeros processed per thread by the SpMV kernel
5. Time (in seconds) to perform a single SpMV iteration; this was measured by taking the minimum out of 100 SpMV iterations performed
6. Maximum performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the minimum time out of 100 SpMV iterations performed.
7. Mean performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the mean time of the last 97 SpMV iterations performed (i.e., the first 3 SpMV iterations are ignored). A short sketch of how these derived quantities follow from the raw counts and timings is given after this list.
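For clarity, the sketch below shows how the derived columns (items 4, 6 and 7) follow from per-thread nonzero counts and iteration timings; the numbers are made up and do not come from the data set.

```python
# Hypothetical per-thread nonzero counts and per-iteration timings (seconds)
# for one matrix and one ordering, just to show how items 4, 6 and 7 are derived.
nnz_per_thread = [120_000, 135_000, 128_000, 140_000]
times = [0.0042, 0.0040, 0.0041]

nnz = sum(nnz_per_thread)

# Item 4: imbalance factor = max / mean nonzeros per thread.
imbalance = max(nnz_per_thread) / (nnz / len(nnz_per_thread))

# Item 6: maximum performance = 2 * nnz / minimum iteration time, in Gflop/s.
gflops_max = 2 * nnz / min(times) / 1e9

# Item 7: mean performance = 2 * nnz / mean iteration time (over the last 97
# iterations in the actual data), in Gflop/s.
gflops_mean = 2 * nnz / (sum(times) / len(times)) / 1e9

print(f"imbalance {imbalance:.3f}, max {gflops_max:.3f} Gflop/s, mean {gflops_mean:.3f} Gflop/s")
```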
The results in Fig. 1 of the paper show speedup (or slowdown) resulting from reordering with respect to 3 reorderings and 3 selected matrices. These results can be reproduced by inspecting the performance results that were collected on the Milan B and Ice Lake systems for the three matrices Freescale/Freescale2, SNAP/com-Amazon and GenBank/kmer_V1r. Specifically, the numbers displayed in the figure are obtained by dividing the maximum performance measured for the respective orderings (i.e., RCM, ND and GP) by the maximum performance measured for the original ordering.
The results presented in Figs. 2 and 3 of the paper show the speedup of SpMV as a result of reordering for the two SpMV kernels considered in the paper. In this case, gnuplot scripts are provided to reproduce the figures from the data files described above.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Distance sampling is a common survey method in wildlife studies, because it allows accounting for imperfect detection. The framework has been extended to hierarchical distance sampling (HDS), which accommodates the modelling of abundance as a function of covariates, but rare and elusive species may not yield enough observations to fit such a model. We integrate HDS into a community modelling framework that accommodates multi-species spatially replicated distance sampling data. The model allows species-specific parameters, but these come from a common underlying distribution. This form of information sharing enables estimation of parameters for species with sparse data sets that would otherwise be discarded from analysis. We evaluate the performance of the model under varying community sizes with different species-specific abundances through a simulation study. We further fit the model to a seabird data set obtained from shipboard distance sampling surveys off the East Coast of the USA. Comparing communities comprised of 5, 15 or 30 species, bias of all community-level parameters and some species-level parameters decreased with increasing community size, while precision increased. Most species-level parameters were less biased for more abundant species. For larger communities, the community model increased precision in abundance estimates of rarely observed species when compared to single-species models. For the seabird application, we found a strong negative association of community and species abundance with distance to shore. Water temperature and prey density had weak effects on seabird abundance. Patterns in overall abundance were consistent with known seabird ecology. The community distance sampling model can be expanded to account for imperfect availability, imperfect species identification or other missing individual covariates. The model allowed us to make inference about ecology of species communities, including rarely observed species, which is particularly important in conservation and management. The approach holds great potential to improve inference on species communities that can be surveyed with distance sampling.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Abstract: These are the experimental data for the paper Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions" published on arXiv in 2024. You can find the paper here and the code here. See the README for details. The datasets used in our study (which we also provide here) originate from PMLB. The corresponding GitHub repository is MIT-licensed ((c) 2016 Epistasis Lab at UPenn). Please see the file LICENSE in the folder datasets/ for the license text. TechnicalRemarks: # Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions" These are the experimental data for the paper Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
This dataset contains supplementary code for the paper Fast Sparse Grid Operations using the Unidirectional Principle: A Generalized and Unified Framework. The code is also provided on GitHub. Here, we additionally provide the runtime measurement data generated by the code, which was used to generate the runtime plot in the paper. For more details, we refer to the file README.md.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
This repository is used to calibrate the underlying structure of a stylized supply chain simulation model of counterfeit Personal Protective Equipment (PPE). For this, we use four calibration techniques: Approximate Bayesian Computing using pydream, Bayesian Optimization using bayesian-optimization, Genetic Algorithms using Platypus, and Powell's Method using SciPy. The calibration is done with sparse data, which is generated by degrading the ground truth data on noise, bias, and missing values. We define the structure of a supply chain simulation model as a key value of a dictionary (sorted on betweenness centrality), which is a set of possible supply chain models. The integer is, thus, the decision variable of the calibration.
To use this repository, we need a simulation model developed in pydsol-core and pydsol-model. Additionally, we need a dictionary with various simulation structures as input, as well as the ground truth data. For this project, we use the repository complex_stylized_supply_chain_model_generator as the simulation model.
This repository is an extension of the celibration library, making it easy to plug in different calibration models, distance metrics and functions, and data.
This repository is also part of the Ph.D. thesis of Isabelle M. van Schilt, Delft University of Technology.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is the dataset we used to train the 2D and 3D Bicephalous Convolutional Autoencoders (BCAEs) described in "Fast 2D Bicephalous Convolutional Autoencoder for Compressing 3D Time Projection Chamber Data" published in the 9th International Workshop on Data Analysis and Reduction for Big Scientific Data (https://drbsd.github.io/).
To untar the file, run tar -xvzf outer.tgz.
The Time Projection Chamber (TPC) is a hollow cylinder. Along the radial dimension, the TPC is composed of 48 cylindrical layers of small sensors, which are grouped into three layer groups: inner, middle, and outer. Each layer group has 16 consecutive layers. On each TPC layer, the voxels are presented as a rectangular grid with rows along the z (or horizontal) direction and columns along the azimuthal direction. Within one layer group, all layers have the same number of rows and columns. This allows us to represent the ADC values from one layer group as a 3D array.
The data released here focuses on the outer layer group, where the array of ADC values has shape (16, 2304, 498) in radial, azimuthal, and horizontal order. The full voxel data are divided into 24 equal-size non-overlapping sections: 12 along the azimuthal direction (30 degrees per section) and 2 along the horizontal direction (divided by the transverse plane passing through the collision point). We call one such section a TPC wedge. The array of ADC values from each TPC wedge in the outer layer has shape (16, 192, 249), listed in radial, azimuthal, and horizontal directions, respectively. The TPC wedges are used as the direct input to the deep neural network compression algorithms.
We simulated 1310 events for central √s = 200 GeV Au-Au collisions with 170 kHz pile-up. The data were generated with the HIJING event generator and Geant4 Monte Carlo detector simulation package integrated with the sPHENIX software framework.
The simulated TPC readout (ADC values) from these events is represented as 10-bit unsigned integers in [0, 1023]. To reduce unnecessary data transmission between detector pixels and front-end electronics, a zero-suppression algorithm has been applied: ADC values below 64 are suppressed to zero, as most of them are noise. The zero suppression makes the TPC data sparse, at about 10% nonzero occupancy.

We divide the 1310 total events into 1048 events for training and 262 for testing. Each event contains 24 outer-layer wedges. Thus, the training partition contains 25152 TPC outer-layer wedges, while the testing portion has 6288 wedges. The compression algorithm compresses each wedge independently.

The dataset has the following structure:
24 subfolders with the name 12-2_[azimuthal section]-[horizontal section] where the [azimuthal section] is labeled by an integer in [0, 11] and the [horizontal section] is labeled by either 0 or 1. Each file in one of the subfolders has the name in the format "AuAu200_170kHz_10C_Iter2_[simulation id].xml_TPCMLDataInterface_[event id within simulation].npy". There are 131 simulations, and each simulation contains 10 independent events (and hence the 1310 total events as mentioned above). Each [event id within simulation] is an integer in [0, 9].
train.txt: a list of all TPC wedges for the training split.
text.txt: a list of all TPC wedges for the test split.
Note that the dataset is split by events. That is, if a TPC wedge from an event is in the train split, then all 24 wedges from that event are in the train split. The same holds for the test split.
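A minimal sketch for working with the files, assuming the naming pattern and properties described above (the chosen file name is only an example): load one outer-layer wedge, check its shape, and verify the zero-suppression and occupancy properties.

```python
import numpy as np

# Hypothetical path: one outer-layer wedge (azimuthal section 0, horizontal
# section 0, simulation id 0, event id 0), following the naming pattern above.
wedge = np.load(
    "12-2_0-0/AuAu200_170kHz_10C_Iter2_0.xml_TPCMLDataInterface_0.npy"
)

assert wedge.shape == (16, 192, 249)            # radial, azimuthal, horizontal
occupancy = np.count_nonzero(wedge) / wedge.size
print(f"nonzero occupancy: {occupancy:.1%}")    # roughly 10% after zero suppression

# Zero suppression per the description above: surviving ADC values lie in [64, 1023].
nonzero = wedge[wedge > 0]
print(int(nonzero.min()), int(nonzero.max()))
```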
According to our latest research, the global sparse models serving market size reached USD 1.35 billion in 2024, reflecting the rapid adoption of efficient AI model deployment across industries. With a robust compound annual growth rate (CAGR) of 31.7% projected between 2025 and 2033, the market is forecasted to achieve a value of USD 14.52 billion by 2033. This remarkable growth trajectory is primarily driven by the increasing demand for scalable, low-latency AI inference solutions, as enterprises seek to optimize resource utilization and reduce operational costs in an era of data explosion and AI-driven digital transformation.
A key growth factor for the sparse models serving market is the exponential increase in the deployment of artificial intelligence and machine learning solutions across diverse sectors. As organizations generate and process massive volumes of data, the need for highly efficient and scalable AI models has become paramount. Sparse models, which leverage techniques such as weight pruning and quantization, enable faster inference and reduced memory footprints without compromising accuracy. This capability is particularly valuable for real-time applications in industries such as finance, healthcare, and retail, where latency and resource efficiency are critical. The widespread adoption of edge computing and IoT devices further amplifies the demand for sparse model serving, as these environments often operate under stringent computational constraints.
Another significant driver is the ongoing advancements in hardware accelerators and AI infrastructure. The evolution of specialized hardware, such as GPUs, TPUs, and custom AI chips, has enabled the efficient execution of sparse models at scale. Leading technology providers are investing heavily in the development of optimized software frameworks and libraries that support sparse computation, making it easier for organizations to deploy and manage these models in production environments. The integration of sparse model serving with cloud-native platforms and container orchestration systems has further streamlined the operationalization of AI workloads, allowing enterprises to achieve seamless scalability, high availability, and cost-effectiveness. This technological synergy is accelerating the adoption of sparse models serving across both on-premises and cloud deployments.
The growing emphasis on sustainable AI and green computing is also propelling the market forward. Sparse models consume significantly less energy and computational resources compared to dense models, aligning with the global push towards environmentally responsible technology practices. Enterprises are increasingly prioritizing solutions that not only deliver high performance but also minimize their carbon footprint. Sparse model serving addresses this need by enabling efficient utilization of existing hardware, reducing the frequency of hardware upgrades, and lowering overall power consumption. This sustainability aspect is becoming a key differentiator for vendors in the sparse models serving market, as regulatory frameworks and corporate social responsibility initiatives place greater emphasis on eco-friendly AI deployments.
From a regional perspective, North America currently dominates the sparse models serving market, accounting for the largest share in 2024. The region’s leadership can be attributed to its advanced digital infrastructure, strong presence of AI technology providers, and early adoption of cutting-edge AI solutions across sectors such as BFSI, healthcare, and IT. Europe and Asia Pacific are also witnessing rapid growth, driven by increasing investments in AI research, government initiatives supporting digital transformation, and the proliferation of data-centric industries. Emerging markets in Latin America and the Middle East & Africa are gradually catching up, as enterprises in these regions recognize the value of efficient AI model deployment in enhancing competitiveness and operational efficiency.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Code, synthetic and empirical data for "Bayesian estimation of information-theoretic metrics for sparsely sampled distributions"
Abstract:
Estimating the Shannon entropy of a discrete distribution from which we have only observed a small sample is challenging. Estimating other information-theoretic metrics, such as the Kullback-Leibler divergence between two sparsely sampled discrete distributions, is even harder. Here, we propose a fast, semi-analytical estimator for sparsely sampled distributions. Its derivation is grounded in probabilistic considerations and uses a hierarchical Bayesian approach to extract as much information as possible from the few observations available. Our approach provides estimates of the Shannon entropy with precision at least comparable to the benchmarks we consider, and most often higher; it does so across diverse distributions with very different properties. Our method can also be used to obtain accurate estimates of other information-theoretic metrics, including the notoriously challenging Kullback-Leibler divergence. Here, again, our approach has less bias, overall, than the benchmark estimators we consider.
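The difficulty the estimator addresses can be seen with a naive baseline. The sketch below (not the paper's method) compares the true Shannon entropy of a randomly drawn distribution with the plug-in estimate from a sample much smaller than the number of categories, which is systematically biased low.

```python
import numpy as np

rng = np.random.default_rng(0)

# A discrete distribution with many categories, sampled sparsely.
K = 1000
p = rng.dirichlet(np.ones(K))
true_H = -np.sum(p * np.log(p))

sample = rng.choice(K, size=100, p=p)              # far fewer samples than categories
counts = np.bincount(sample, minlength=K)
q = counts / counts.sum()
plugin_H = -np.sum(q[q > 0] * np.log(q[q > 0]))    # naive plug-in estimate

print(f"true entropy:     {true_H:.3f} nats")
print(f"plug-in estimate: {plugin_H:.3f} nats (biased low for sparse samples)")
```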
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
In high-throughput genomics, large-scale designed experiments are becoming common, and analysis approaches based on highly multivariate regression and anova concepts are key tools. Shrinkage models of one form or another can provide comprehensive approaches to the problems of simultaneous inference that involve implicit multiple comparisons over the many, many parameters representing effects of design factors and covariates. We use such approaches here in a study of cardiovascular genomics. The primary experimental context concerns a carefully designed, and rich, gene expression study focused on gene-environment interactions, with the goals of identifying genes implicated in connection with disease states and known risk factors, and in generating expression signatures as proxies for such risk factors. A coupled exploratory analysis investigates cross-species extrapolation of gene expression signatures—how these mouse-model signatures translate to humans. The latter involves exploration of sparse latent factor analysis of human observational data and of how it relates to projected risk signatures derived in the animal models. The study also highlights a range of applied statistical and genomic data analysis issues, including model specification, computational questions and model-based correction of experimental artifacts in DNA microarray data.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Microbial communities are composed of functionally integrated taxa, and identifying which taxa contribute to a given ecosystem function is essential for predicting community behaviors. This study compares the effectiveness of a previously proposed method for identifying "functional taxa," Ensemble Quotient Optimization (EQO), to a potentially simpler approach based on the Least Absolute Shrinkage and Selection Operator (LASSO). In contrast to LASSO, EQO uses a binary prior on coefficients, assuming uniform contribution strength across taxa. Using synthetic datasets with increasingly realistic structure, we demonstrate that EQO's strong prior enables it to perform better in the low-data regime. However, LASSO's flexibility and efficiency can make it preferable as data complexity increases. Our results detail the favorable conditions for EQO and emphasize LASSO as a viable alternative.
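A minimal version of the LASSO alternative looks as follows (synthetic data, arbitrary regularization strength): regress the measured ecosystem function on taxon abundances and read the taxa with nonzero coefficients as candidate contributors. EQO itself, with its binary coefficient constraint, is not sketched here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic community: 200 samples x 50 taxa, with only 5 taxa truly
# contributing to the measured ecosystem function.
n_samples, n_taxa, n_true = 200, 50, 5
abundances = rng.lognormal(size=(n_samples, n_taxa))
true_idx = rng.choice(n_taxa, n_true, replace=False)
function = abundances[:, true_idx].sum(axis=1) + rng.normal(0, 0.5, n_samples)

# LASSO shrinks most coefficients to zero; the survivors are candidate
# "functional taxa" (possibly with a few false positives at this alpha).
lasso = Lasso(alpha=0.1).fit(abundances, function)
selected = np.flatnonzero(lasso.coef_)
print("true contributors: ", sorted(true_idx))
print("selected by LASSO: ", sorted(selected))
```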
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2018. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, potentially reaching 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same time series processing approach (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
This code is part of the Ph.D. thesis of Isabelle M. van Schilt, Delft University of Technology.
This code is used to calibrate a parameter of a stylized supply chain simulation model of counterfeit Personal Protective Equipment (PPE). For this, we use three calibration techniques: Approximate Bayesian Computing using pydream, Genetic Algorithms using Platypus, and Powell's Method using SciPy. The calibration is done with sparse data, which is generated by degrading the ground truth data on noise, bias, and missing values.
This code is an extension of the celibration library, making it easy to plug in different calibration models, distance metrics and functions, and data.
Note that this code uses an old version of pydsol, which is included in the zip file.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data supporting manuscript submitted to PNAS: Video-Rate Raman-based Metabolic Imaging by Airy Light-Sheet Illumination and Photon-Sparse Detection. The data set includes: [1] raw data and [2] related images used in the analyses described within the manuscript.
Despite its massive potential, Raman imaging represents just a modest fraction of all research and clinical microscopy to date. This is due to the ultralow Raman scattering cross-sections of most biomolecules that impose low-light or photon-sparse conditions. Bioimaging under such conditions is suboptimal, as it either results in ultralow frame rates or requires increased levels of irradiance. Here, we overcome this tradeoff by introducing Raman imaging that operates at both video rates and 1,000-fold lower irradiance than state-of-the-art methods. To accomplish this, we deployed a judiciously designed Airy light-sheet microscope to efficiently image large specimen regions. Further, we implemented subphoton per pixel image acquisition and reconstruction to confront issues arising from photon sparsity at just millisecond integrations. We demonstrate the versatility of our approach by imaging a variety of samples, including the three-dimensional (3D) metabolic activity of single microbial cells and the underlying cell-to-cell variability. To image such small-scale targets, we again harnessed photon sparsity to increase magnification without a field-of-view penalty, thus, overcoming another key limitation in modern light-sheet microscopy.
Data Use:
License: CC-BY 4.0
Recommended Citation: Vasdekis AE (2023) Data from: Video-rate raman-based metabolic imaging by airy light-sheet illumination and photon-sparse detection [Dataset]. University of Idaho. https://doi.org/10.11578/1908656