40 datasets found
  1. Data from: Sparse Machine Learning Methods for Understanding Large Text...

    • data.nasa.gov
    • gimi9.com
    • +3more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://data.nasa.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
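
    A minimal sketch (not the authors' code) of the comparative-summarization ingredient described above: an L1-penalized logistic regression on tf-idf features keeps only a handful of terms that separate two corpora. The toy documents and the regularization strength C are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Two tiny, hypothetical corpora standing in for ASRS-like report collections.
corpus_a = [
    "aircraft crossed hold short line during taxi",
    "tower issued go around after runway incursion",
]
corpus_b = [
    "smooth cruise with minor turbulence en route",
    "altitude deviation corrected during descent",
]
docs = corpus_a + corpus_b
labels = [1] * len(corpus_a) + [0] * len(corpus_b)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# The L1 penalty drives most term weights to exactly zero; the surviving terms
# act as a short comparative summary of corpus_a versus corpus_b.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

terms = np.array(tfidf.get_feature_names_out())
print("discriminative terms:", terms[clf.coef_.ravel() != 0])
```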

  2. Data from: Generating fast sparse matrix vector multiplication from a high...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Mar 19, 2020
    Cite
    Federico Pizzuti; Michel Steuwer; Christophe Dubach (2020). Generating fast sparse matrix vector multiplication from a high level generic functional IR [Dataset]. http://doi.org/10.5061/dryad.wstqjq2gs
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 19, 2020
    Dataset provided by
    Dryad
    Authors
    Federico Pizzuti; Michel Steuwer; Christophe Dubach
    Time period covered
    Mar 19, 2020
    Description

    Usage of high-level intermediate representations promises the generation of fast code from a high-level description, improving the productivity of developers while achieving the performance traditionally only reached with low-level programming approaches.

    High-level IRs come in two flavors: 1) domain-specific IRs designed to express programs only for a specific application area; or 2) generic high-level IRs that can be used to generate high-performance code across many domains. Developing generic IRs is more challenging but offers the advantage of reusing a common compiler infrastructure across various applications.

    In this paper, we extend a generic high-level IR to enable efficient computation with sparse data structures. Crucially, we encode sparse representations using reusable dense building blocks already present in the high-level IR. We use a form of dependent types to model sparse matrices in CSR format by expressing the relationship between multiple dense arrays explicitly, separately storing ...
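
    For readers unfamiliar with the CSR format mentioned above, here is a minimal sketch (independent of the paper's IR) of how a sparse matrix is stored as three separate dense arrays and how sparse matrix-vector multiplication walks them.

```python
import numpy as np

# A 3x4 sparse matrix:
# [[5, 0, 0, 1],
#  [0, 0, 2, 0],
#  [0, 3, 0, 4]]
values  = np.array([5.0, 1.0, 2.0, 3.0, 4.0])  # nonzero entries, row by row
col_idx = np.array([0, 3, 2, 1, 3])            # column index of each nonzero
row_ptr = np.array([0, 2, 3, 5])               # where each row starts in `values`

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

print(csr_spmv(values, col_idx, row_ptr, np.ones(4)))  # -> [6. 2. 7.]
```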

  3. Data from: Sparse Solutions for Single Class SVMs: A Bi-Criterion Approach

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 14, 2025
    + more versions
    Cite
    Dashlink (2025). Sparse Solutions for Single Class SVMs: A Bi-Criterion Approach [Dataset]. https://catalog.data.gov/dataset/sparse-solutions-for-single-class-svms-a-bi-criterion-approach
    Explore at:
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Dashlink
    Description

    In this paper we propose an innovative learning algorithm - a variation of the one-class ν Support Vector Machines (SVMs) learning algorithm - to produce sparser solutions with much reduced computational complexity. The proposed technique returns an approximate solution, nearly as good as the solution set obtained by the classical approach, by minimizing the original risk function along with a regularization term. We introduce a bi-criterion optimization that helps guide the search towards the optimal set in much reduced time. The outcome of the proposed learning technique was compared with the benchmark one-class Support Vector Machines algorithm, which more often leads to solutions with redundant support vectors. Throughout the analysis, the problem size for both optimization routines was kept consistent. We have tested the proposed algorithm on a variety of data sources under different conditions to demonstrate its effectiveness. In all cases the proposed algorithm closely preserves the accuracy of standard one-class ν-SVMs while reducing both training time and test time by several factors.
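
    A minimal sketch (not the paper's algorithm) of the standard one-class ν-SVM baseline, showing how the number of support vectors, and hence the test-time cost, grows with the parameter ν; the training data are synthetic.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # hypothetical "normal" training data

for nu in (0.5, 0.1, 0.02):
    model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X)
    n_sv = model.support_vectors_.shape[0]
    # nu is a lower bound on the fraction of support vectors, so larger nu
    # means more kernel evaluations at test time; sparser solutions are cheaper.
    print(f"nu={nu}: {n_sv} support vectors out of {X.shape[0]} samples")
```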

  4. National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020...

    • researchdata.edu.au
    Updated Aug 5, 2021
    + more versions
    Cite
    Australian Government Department of Climate Change, Energy, the Environment and Water (2021). National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020 Release) [Dataset]. https://researchdata.edu.au/national-forest-sparse-2020-release/2989276
    Explore at:
    Dataset updated
    Aug 5, 2021
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Australian Government Department of Climate Change, Energy, the Environment and Water
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2020. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, at least 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

    The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same approach in terms of time series processing (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.

    Earlier versions of this dataset were published by the Department of the Environment and Energy.

  5. Data from: Sparse-Data Deep Learning Strategies for Radiographic...

    • tandf.figshare.com
    pdf
    Updated Oct 9, 2025
    Cite
    Jacqueline Alvarez; Keith Henderson; Maurice B. Aufderheide; Brian Gallagher; Roummel F. Marcia; Ming Jiang (2025). Sparse-Data Deep Learning Strategies for Radiographic Non-Destructive Testing [Dataset]. http://doi.org/10.6084/m9.figshare.29480707.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Jacqueline Alvarez; Keith Henderson; Maurice B. Aufderheide; Brian Gallagher; Roummel F. Marcia; Ming Jiang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Radiography is an imaging technique used in a variety of applications, such as medical diagnosis, airport security, and nondestructive testing. We present a deep learning system for extracting information from radiographic images. We perform various prediction tasks using our system, including material classification and regression on the dimensions of a given object that is being radiographed. Our system is designed to address the sparse-data issue for radiographic nondestructive testing applications. It uses a radiographic simulation tool for synthetic data augmentation, and it uses transfer learning with a pre-trained convolutional neural network model. Using this system, our preliminary results indicate that the object geometry regression task saw an improvement of 70% in the R-squared value when using a multi-regime model. In addition, we increase the performance of the object material classification tasks by utilizing data from different imaging systems. In particular, using neutron imaging improved the material classification accuracy by 20% when compared to x-ray imaging.
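
    A minimal sketch (not the authors' system) of the transfer-learning ingredient: start from an ImageNet-pretrained CNN, freeze its features, and retrain a small head for a radiographic task such as material classification. It assumes a recent torchvision; the class count and the fake batch are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

num_materials = 4  # hypothetical number of material classes

# ImageNet-pretrained backbone with its feature extractor frozen.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_materials)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a fake batch of single-channel radiographs, replicated
# to three channels to match the pretrained network's expected input.
x = torch.rand(8, 1, 224, 224).repeat(1, 3, 1, 1)
y = torch.randint(0, num_materials, (8,))
loss = loss_fn(backbone(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```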

  6. Sparse-Matrix Compression Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Cite
    Growth Market Reports (2025). Sparse-Matrix Compression Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sparse-matrix-compression-engine-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Sparse-Matrix Compression Engine Market Outlook



    According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.




    One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.




    Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.




    The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.



    The development of High-Speed Hardware Compression Chip technology is revolutionizing the Sparse-Matrix Compression Engine market. These chips are designed to accelerate data compression processes, significantly enhancing the performance of high-performance computing systems. By integrating these chips, organizations can achieve faster data processing speeds, which is crucial for handling large-scale datasets in real-time analytics and AI applications. The chips offer a unique advantage by reducing latency and improving throughput, making them an essential component in modern data centers. As the demand for efficient data management solutions grows, the adoption of high-speed hardware compression chips is expected to rise, driving further innovation and competitiveness in the market.




    From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and

  7. Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix...

    • zenodo.org
    zip
    Updated Apr 17, 2023
    + more versions
    Cite
    James D. Trotter; James D. Trotter (2023). Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" [Dataset]. http://doi.org/10.5281/zenodo.7821491
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    James D. Trotter; James D. Trotter
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The paper "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" compares various strategies for reordering sparse matrices. The purpose of reordering is to improve performance of sparse matrix operations, for example, by reducing fill-in resulting from sparse Cholesky factorisation or improving data locality in sparse matrix-vector multiplication (SpMV). Many reordering strategies have been proposed in the literature and the current paper provides a thorough comparison of several of the most popular methods.

    This comparison is based on performance measurements that were collected on the eX3 cluster, a Norwegian, experimental research infrastructure for exploration of exascale computing. These performance measurements are gathered in the data set provided here, particularly related to the performance of two SpMV kernels with respect to 490 sparse matrices, 6 matrix orderings and 8 multicore CPUs.

    Experimental results are provided in a human-readable, tabular format using plain-text ASCII. This format may be readily consumed by gnuplot to create plots or imported into commonly used spreadsheet tools for further analysis.

    Performance measurements are provided based on an SpMV kernel using the compressed sparse row (CSR) storage format with 7 matrix orderings. One file is provided for each of 8 multicore CPU systems considered in the paper:

    1. Skylake: csr_all_xeongold16q_032_threads_ss490.txt
    2. Ice Lake: csr_all_habanaq_072_threads_ss490.txt
    3. Naples: csr_all_defq_064_threads_ss490.txt
    4. Rome: csr_all_rome16q_016_threads_ss490.txt
    5. Milan A: csr_all_fpgaq_048_threads_ss490.txt
    6. Milan B: csr_all_milanq_128_threads_ss490.txt
    7. TX2: csr_all_armq_064_threads_ss490.txt
    8. Hi1620: csr_all_huaq_128_threads_ss490.txt

    A corresponding set of files and performance measurements are provided for a second SpMV kernel that is also studied in the paper.

    Each file consists of 490 rows and 54 columns. Each row corresponds to a different matrix from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The first 5 columns specify some general information about the matrix, such as its group and name, as well as the number of rows, columns and nonzeros. Column 6 specifies the number of threads used for the experiment (which depends on the CPU). The remaining columns are grouped according to the 7 different matrix orderings that were studied, in the following order: original, Reverse Cuthill-McKee (RCM), Nested Dissection (ND), Approximate Minimum Degree (AMD), Graph Partitioning (GP), Hypergraph Partitioning (HP), and Gray ordering. For each ordering, the following 7 columns are given:


    1. Minimum number of nonzeros processed by any thread by the SpMV kernel
    2. Maximum number of nonzeros processed by any thread by the SpMV kernel
    3. Mean number of nonzeros processed per thread by the SpMV kernel
    4. Imbalance factor, which is the ratio of the maximum to the mean number of nonzeros processed per thread by the SpMV kernel
    5. Time (in seconds) to perform a single SpMV iteration; this was measured by taking the minimum out of 100 SpMV iterations performed
    6. Maximum performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the minimum time out of 100 SpMV iterations performed.
    7. Mean performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the mean time of the 97 last SpMV iterations performed (i.e., the first 3 SpMV iterations are ignored).

    The results in Fig. 1 of the paper show speedup (or slowdown) resulting from reordering with respect to 3 reorderings and 3 selected matrices. These results can be reproduced by inspecting the performance results that were collected on the Milan B and Ice Lake systems for the three matrices Freescale/Freescale2, SNAP/com-Amazon and GenBank/kmer_V1r. Specifically, the numbers displayed in the figure are obtained by dividing the maximum performance measured for the respective orderings (i.e., RCM, ND and GP) by the maximum performance measured for the original ordering.

    The results presented in Figs. 2 and 3 of the paper show the speedup of SpMV as a result of reordering for the two SpMV kernels considered in the paper. In this case, gnuplot scripts are provided to reproduce the figures from the data files described above.
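
    A minimal sketch (with layout assumptions noted in the comments) of reading one of the plain-text result files and computing the Fig. 1-style speedup described above, i.e. the maximum SpMV performance under a reordering divided by the maximum performance under the original ordering.

```python
import numpy as np

# Assumed column layout (adjust if the files differ): 5 matrix-info columns,
# 1 thread-count column, then 7 columns per ordering, with the maximum Gflop/s
# as the 6th column within each ordering's group.
FILENAME = "csr_all_milanq_128_threads_ss490.txt"   # Milan B system
ORDERINGS = ["original", "RCM", "ND", "AMD", "GP", "HP", "Gray"]
META_COLS = 6
COLS_PER_ORDERING = 7
MAX_PERF_OFFSET = 5

def max_perf_column(ordering):
    return META_COLS + ORDERINGS.index(ordering) * COLS_PER_ORDERING + MAX_PERF_OFFSET

# The leading columns are strings (matrix group and name), so read the whole
# table as text and convert only the columns we need.
table = np.loadtxt(FILENAME, dtype=str)

def speedup(ordering):
    perf = table[:, max_perf_column(ordering)].astype(float)
    base = table[:, max_perf_column("original")].astype(float)
    return perf / base

print("median RCM speedup over original ordering:", np.median(speedup("RCM")))
```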

  8. Data from: A hierarchical distance sampling model to estimate abundance and...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin
    Updated May 29, 2022
    Cite
    Rahel Sollmann; Beth Gardner; Kathryn A. Williams; Andrew T. Gilbert; Richard R. Veit; Rahel Sollmann; Beth Gardner; Kathryn A. Williams; Andrew T. Gilbert; Richard R. Veit (2022). Data from: A hierarchical distance sampling model to estimate abundance and covariate associations of species and communities [Dataset]. http://doi.org/10.5061/dryad.gb905
    Explore at:
    Available download formats: bin
    Dataset updated
    May 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rahel Sollmann; Beth Gardner; Kathryn A. Williams; Andrew T. Gilbert; Richard R. Veit; Rahel Sollmann; Beth Gardner; Kathryn A. Williams; Andrew T. Gilbert; Richard R. Veit
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Distance sampling is a common survey method in wildlife studies, because it allows accounting for imperfect detection. The framework has been extended to hierarchical distance sampling (HDS), which accommodates the modelling of abundance as a function of covariates, but rare and elusive species may not yield enough observations to fit such a model. We integrate HDS into a community modelling framework that accommodates multi-species spatially replicated distance sampling data. The model allows species-specific parameters, but these come from a common underlying distribution. This form of information sharing enables estimation of parameters for species with sparse data sets that would otherwise be discarded from analysis. We evaluate the performance of the model under varying community sizes with different species-specific abundances through a simulation study. We further fit the model to a seabird data set obtained from shipboard distance sampling surveys off the East Coast of the USA. Comparing communities comprised of 5, 15 or 30 species, bias of all community-level parameters and some species-level parameters decreased with increasing community size, while precision increased. Most species-level parameters were less biased for more abundant species. For larger communities, the community model increased precision in abundance estimates of rarely observed species when compared to single-species models. For the seabird application, we found a strong negative association of community and species abundance with distance to shore. Water temperature and prey density had weak effects on seabird abundance. Patterns in overall abundance were consistent with known seabird ecology. The community distance sampling model can be expanded to account for imperfect availability, imperfect species identification or other missing individual covariates. The model allowed us to make inference about ecology of species communities, including rarely observed species, which is particularly important in conservation and management. The approach holds great potential to improve inference on species communities that can be surveyed with distance sampling.

  9. Experimental data for the paper "using constraints to discover sparse and...

    • service.tib.eu
    Updated Nov 28, 2024
    + more versions
    Cite
    (2024). Experimental data for the paper "using constraints to discover sparse and alternative subgroup descriptions" [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-cakkjctokqgxyvqg
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: These are the experimental data for the paper Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions", published on arXiv in 2024. You can find the paper here and the code here. See the README for details. The datasets used in our study (which we also provide here) originate from PMLB. The corresponding GitHub repository is MIT-licensed ((c) 2016 Epistasis Lab at UPenn). Please see the file LICENSE in the folder datasets/ for the license text.

  10. Replication Data for: Fast Sparse Grid Operations using the Unidirectional...

    • darus.uni-stuttgart.de
    Updated Mar 29, 2022
    Cite
    David Holzmüller (2022). Replication Data for: Fast Sparse Grid Operations using the Unidirectional Principle: A Generalized and Unified Framework [Dataset]. http://doi.org/10.18419/DARUS-1779
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    DaRUS
    Authors
    David Holzmüller
    License

    https://www.apache.org/licenses/LICENSE-2.0

    Dataset funded by
    DFG
    Description

    This dataset contains supplementary code for the paper Fast Sparse Grid Operations using the Unidirectional Principle: A Generalized and Unified Framework. The code is also provided on GitHub. Here, we additionally provide the runtime measurement data generated by the code, which was used to generate the runtime plot in the paper. For more details, we refer to the file README.md.

  11. Code: Structural Calibration for Supply Chain Simulation Models with Sparse...

    • data.4tu.nl
    zip
    Updated Jul 19, 2024
    Cite
    Isabelle van Schilt (2024). Code: Structural Calibration for Supply Chain Simulation Models with Sparse Data [Dataset]. http://doi.org/10.4121/2a2a2677-4f73-4bd9-ac0d-e28099c3cc26.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Isabelle van Schilt
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This repository is used to calibrate the underlying structure of a stylized supply chain simulation model of counterfeit Personal Protective Equipment (PPE). For this, we use four calibration techniques: Approximate Bayesian Computing using pydream, Bayesian Optimization using bayesian-optimization, Genetic Algorithms using Platypus, and Powell's Method using SciPy. The calibration is done with sparse data, which is generated by degrading the ground truth data with noise, bias, and missing values. We define the structure of a supply chain simulation model as an integer key into a dictionary of possible supply chain models (sorted on betweenness centrality). This integer key is thus the decision variable of the calibration.
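
    A minimal sketch (not the repository's code) of how such sparse data can be produced by degrading ground-truth observations with noise, bias, and missing values; the series, rates, and scales below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
ground_truth = rng.poisson(lam=100, size=200).astype(float)  # e.g. daily shipment counts

def degrade(series, noise_sd=5.0, bias=-3.0, missing_rate=0.3):
    """Return a noisy, biased copy of `series` with a fraction of values missing."""
    noisy = series + rng.normal(0.0, noise_sd, size=series.shape) + bias
    noisy[rng.random(series.shape) < missing_rate] = np.nan
    return noisy

sparse_obs = degrade(ground_truth)
print(f"{np.isnan(sparse_obs).mean():.0%} of observations missing")
```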


    To use this repository, we need a simulation model developed in pydsol-core and pydsol-model. Additionally, we need a dictionary with various simulation structures as input, as well as the ground truth data. For this project, we use the repository complex_stylized_supply_chain_model_generator as the simulation model.


    This repository is an extension of the celibration library, making it easy to plug in different calibration models, distance metrics and functions, and data.


    This repository is also part of the Ph.D. thesis of Isabelle M. van Schilt, Delft University of Technology.

  12. Simulated sPHENIX Time-Projection Chamber (TPC) Data in Central Au-Au...

    • data.niaid.nih.gov
    Updated Nov 10, 2024
    Cite
    Huang, Yi; Ren, Yihui (Ray); Huang, Jin (2024). Simulated sPHENIX Time-Projection Chamber (TPC) Data in Central Au-Au Collisions at sqrt[s] = 200 GeV, outer layer group [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10028586
    Explore at:
    Dataset updated
    Nov 10, 2024
    Dataset provided by
    Brookhaven National Laboratory (http://www.bnl.gov/)
    Authors
    Huang, Yi; Ren, Yihui (Ray); Huang, Jin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset we used to train the 2D and 3D Bicephalous Convolutional Autoencoders (BCAEs) described in "Fast 2D Bicephalous Convolutional Autoencoder for Compressing 3D Time Projection Chamber Data", published in the 9th International Workshop on Data Analysis and Reduction for Big Scientific Data (https://drbsd.github.io/). To untar the file, run tar -xvzf outer.tgz.

    The Time Projection Chamber (TPC) is a hollow cylinder. Along the radial dimension, the TPC is composed of 48 cylindrical layers of small sensors, which are grouped into three layer groups: inner, middle, and outer. Each layer group has 16 consecutive layers. On each TPC layer, the voxels are presented as a rectangular grid with rows along the z (or horizontal) direction and columns along the azimuthal direction. Within one layer group, all layers have the same number of rows and columns, which allows us to represent the ADC values from one layer group as a 3D array. The data released here focus on the outer layer group, where the array of ADC values has shape (16, 2304, 498) in the radial, azimuthal, and horizontal order.

    The full voxel data are divided into 24 equal-size, non-overlapping sections: 12 along the azimuthal direction (30 degrees per section) and 2 along the horizontal direction (divided by the transverse plane passing through the collision point). We call one such section a TPC wedge. The array of ADC values from each TPC wedge in the outer layer has shape (16, 192, 249), listed in the radial, azimuthal, and horizontal directions, respectively. The TPC wedges are used as the direct input to the deep neural network compression algorithms.

    We simulated 1310 events for central sqrt[s] = 200 GeV Au-Au collisions with 170 kHz pile-up. The data were generated with the HIJING event generator and the Geant4 Monte Carlo detector simulation package integrated with the sPHENIX software framework. The simulated TPC readout (ADC values) from these events is represented as 10-bit unsigned integers in [0, 1023]. To reduce unnecessary data transmission between detector pixels and front-end electronics, a zero-suppression algorithm has been applied: ADC values below 64 are suppressed to zero, as most of them are noise. This zero suppression makes the TPC data sparse, at about 10% nonzero occupancy.

    We divide the 1310 total events into 1048 events for training and 262 for testing. Each event contains 24 outer-layer wedges, so the training partition contains 25152 TPC outer-layer wedges, while the testing portion has 6288 wedges. The compression algorithm compresses each wedge independently.

    The dataset has the following structure:

    24 subfolders with names of the form 12-2_[azimuthal section]-[horizontal section], where [azimuthal section] is an integer in [0, 11] and [horizontal section] is either 0 or 1. Each file in a subfolder has a name of the format "AuAu200_170kHz_10C_Iter2_[simulation id].xml_TPCMLDataInterface_[event id within simulation].npy". There are 131 simulations, and each simulation contains 10 independent events (hence the 1310 total events mentioned above); each [event id within simulation] is an integer in [0, 9].

    train.txt: a list of all TPC wedges for the training split.

    test.txt: a list of all TPC wedges for the test split.

    Note that the dataset is split by events: if a TPC wedge from an event is in the train split, all 24 wedges from that event are in the train split. The same holds for the test split.
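
    A minimal sketch of loading one wedge file and checking the properties described above: shape (16, 192, 249), 10-bit ADC values, and roughly 10% nonzero occupancy after zero suppression. The file path is hypothetical.

```python
import numpy as np

# Hypothetical path to one wedge of the outer layer group.
wedge = np.load("12-2_0-0/AuAu200_170kHz_10C_Iter2_0.xml_TPCMLDataInterface_0.npy")

print(wedge.shape)                            # expected (16, 192, 249): radial, azimuthal, horizontal
print(wedge.dtype, wedge.min(), wedge.max())  # 10-bit unsigned ADC values in [0, 1023]
occupancy = np.count_nonzero(wedge) / wedge.size
print(f"nonzero occupancy: {occupancy:.1%}")  # roughly 10% after zero suppression
```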

  13. Sparse Models Serving Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Sparse Models Serving Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sparse-models-serving-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Sparse Models Serving Market Outlook



    According to our latest research, the global sparse models serving market size reached USD 1.35 billion in 2024, reflecting the rapid adoption of efficient AI model deployment across industries. With a robust compound annual growth rate (CAGR) of 31.7% projected between 2025 and 2033, the market is forecasted to achieve a value of USD 14.52 billion by 2033. This remarkable growth trajectory is primarily driven by the increasing demand for scalable, low-latency AI inference solutions, as enterprises seek to optimize resource utilization and reduce operational costs in an era of data explosion and AI-driven digital transformation.




    A key growth factor for the sparse models serving market is the exponential increase in the deployment of artificial intelligence and machine learning solutions across diverse sectors. As organizations generate and process massive volumes of data, the need for highly efficient and scalable AI models has become paramount. Sparse models, which leverage techniques such as weight pruning and quantization, enable faster inference and reduced memory footprints without compromising accuracy. This capability is particularly valuable for real-time applications in industries such as finance, healthcare, and retail, where latency and resource efficiency are critical. The widespread adoption of edge computing and IoT devices further amplifies the demand for sparse model serving, as these environments often operate under stringent computational constraints.




    Another significant driver is the ongoing advancements in hardware accelerators and AI infrastructure. The evolution of specialized hardware, such as GPUs, TPUs, and custom AI chips, has enabled the efficient execution of sparse models at scale. Leading technology providers are investing heavily in the development of optimized software frameworks and libraries that support sparse computation, making it easier for organizations to deploy and manage these models in production environments. The integration of sparse model serving with cloud-native platforms and container orchestration systems has further streamlined the operationalization of AI workloads, allowing enterprises to achieve seamless scalability, high availability, and cost-effectiveness. This technological synergy is accelerating the adoption of sparse models serving across both on-premises and cloud deployments.




    The growing emphasis on sustainable AI and green computing is also propelling the market forward. Sparse models consume significantly less energy and computational resources compared to dense models, aligning with the global push towards environmentally responsible technology practices. Enterprises are increasingly prioritizing solutions that not only deliver high performance but also minimize their carbon footprint. Sparse model serving addresses this need by enabling efficient utilization of existing hardware, reducing the frequency of hardware upgrades, and lowering overall power consumption. This sustainability aspect is becoming a key differentiator for vendors in the sparse models serving market, as regulatory frameworks and corporate social responsibility initiatives place greater emphasis on eco-friendly AI deployments.




    From a regional perspective, North America currently dominates the sparse models serving market, accounting for the largest share in 2024. The region’s leadership can be attributed to its advanced digital infrastructure, strong presence of AI technology providers, and early adoption of cutting-edge AI solutions across sectors such as BFSI, healthcare, and IT. Europe and Asia Pacific are also witnessing rapid growth, driven by increasing investments in AI research, government initiatives supporting digital transformation, and the proliferation of data-centric industries. Emerging markets in Latin America and the Middle East & Africa are gradually catching up, as enterprises in these regions recognize the value of efficient AI model deployment in enhancing competitiveness and operational efficiency.





    Compone

  14. Bayesian estimation of information-theoretic metrics for sparsely sampled...

    • data.niaid.nih.gov
    Updated Jan 30, 2024
    Cite
    Piga, Angelo; Font-Pomarol, Lluc; Sales-Pardo, Marta; Guimerà, Roger (2024). Bayesian estimation of information-theoretic metrics for sparsely sampled distributions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10592746
    Explore at:
    Dataset updated
    Jan 30, 2024
    Dataset provided by
    Universitat de Barcelona
    Universidad Rovira i Virgili
    Authors
    Piga, Angelo; Font-Pomarol, Lluc; Sales-Pardo, Marta; Guimerà, Roger
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code, synthetic and empirical data for "Bayesian estimation of information-theoretic metrics for sparsely sampled distributions"

    Abstract:

    Estimating the Shannon entropy of a discrete distribution from which we have only observed a small sample is challenging. Estimating other information-theoretic metrics, such as the Kullback-Leibler divergence between two sparsely sampled discrete distributions, is even harder. Here, we propose a fast, semi-analytical estimator for sparsely sampled distributions. Its derivation is grounded in probabilistic considerations and uses a hierarchical Bayesian approach to extract as much information as possible from the few observations available. Our approach provides estimates of the Shannon entropy with precision at least comparable to the benchmarks we consider, and most often higher; it does so across diverse distributions with very different properties. Our method can also be used to obtain accurate estimates of other information-theoretic metrics, including the notoriously challenging Kullback-Leibler divergence. Here, again, our approach has less bias, overall, than the benchmark estimators we consider.
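
    A minimal sketch (not the authors' estimator) of the problem the dataset addresses: the naive plug-in estimate of Shannon entropy is strongly biased downward when the distribution is sparsely sampled.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 1000                                   # support size
p = rng.dirichlet(np.ones(K))              # a hypothetical "true" distribution
true_H = -np.sum(p * np.log(p))

for n in (50, 500, 50_000):                # from sparsely to amply sampled
    counts = rng.multinomial(n, p)
    q = counts[counts > 0] / n
    plugin_H = -np.sum(q * np.log(q))      # plug-in (maximum-likelihood) estimate
    print(f"n={n:>6}: plug-in {plugin_H:.2f} nats vs true {true_H:.2f} nats")
```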

  15. Mapping beta diversity from space: Sparse Generalized Dissimilarity...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Mar 18, 2016
    Cite
    Pedro J. Leitão; Stefan Suess; Marcel Schwieder; Inês Catry; Edward Milton; Francisco Moreira; Patrick E. Osborne; Manuel J. Pinto; Sebastian van der Linden; Patrick Hostert; Edward J. Milton (2016). Mapping beta diversity from space: Sparse Generalized Dissimilarity Modelling (SGDM) for analysing high-dimensional data [Dataset]. http://doi.org/10.5061/dryad.ns7pv
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 18, 2016
    Dataset provided by
    Humboldt-Universität zu Berlin
    University of Southampton
    University of Lisbon
    Authors
    Pedro J. Leitão; Stefan Suess; Marcel Schwieder; Inês Catry; Edward Milton; Francisco Moreira; Patrick E. Osborne; Manuel J. Pinto; Sebastian van der Linden; Patrick Hostert; Edward J. Milton
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Portugal, Castro Verde
    Description
    1. Spatial patterns of community composition turnover (beta diversity) may be mapped through Generalised Dissimilarity Modelling (GDM). While remote sensing data are adequate to describe these patterns, the often high-dimensional nature of these data poses some analytical challenges, potentially resulting in loss of generality. This may hinder the use of such data for mapping and monitoring beta-diversity patterns.

    2. This study presents Sparse Generalised Dissimilarity Modelling (SGDM), a methodological framework designed to improve the use of high-dimensional data to predict community turnover with GDM. SGDM consists of a two-stage approach, by first transforming the environmental data with a sparse canonical correlation analysis (SCCA), aimed at dealing with high-dimensional datasets, and secondly fitting the transformed data with GDM. The SCCA penalisation parameters are chosen according to a grid search procedure in order to optimise the predictive performance of a GDM fit on the resulting components. The proposed method was illustrated on a case study with a clear environmental gradient of shrub encroachment following cropland abandonment, and subsequent turnover in the bird communities. Bird community data, collected on 115 plots located along the described gradient, were used to fit composition dissimilarity as a function of several remote sensing datasets, including a time series of Landsat data as well as simulated EnMAP hyperspectral data.

    3. The proposed approach always outperformed GDM models when fit on high-dimensional datasets. Its usage on low-dimensional data was not consistently advantageous. Models using high-dimensional data, on the other hand, always outperformed those using low-dimensional data, such as single date multispectral imagery.

    4. This approach improved the direct use of high-dimensional remote sensing data, such as time series or hyperspectral imagery, for community dissimilarity modelling, resulting in better performing models. The good performance of models using high-dimensional datasets further highlights the relevance of dense time series and data coming from new and forthcoming satellite sensors for ecological applications such as mapping species beta diversity.

  16. Replication data for: Of Mice and Men: Sparse Statistical Modeling in...

    • dataverse.harvard.edu
    • datasetcatalog.nlm.nih.gov
    Updated Nov 28, 2007
    Cite
    David M. Seo; Pascal J. Goldschmidt-Clermont; Mike West (2007). Replication data for: Of Mice and Men: Sparse Statistical Modeling in Cardiovascular Genomics [Dataset]. http://doi.org/10.7910/DVN/OSQPDV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 28, 2007
    Dataset provided by
    Harvard Dataverse
    Authors
    David M. Seo; Pascal J. Goldschmidt-Clermont; Mike West
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In high-throughput genomics, large-scale designed experiments are becoming common, and analysis approaches based on highly multivariate regression and anova concepts are key tools. Shrinkage models of one form or another can provide comprehensive approaches to the problems of simultaneous inference that involve implicit multiple comparisons over the many, many parameters representing effects of design factors and covariates. We use such approaches here in a study of cardiovascular genomics. The primary experimental context concerns a carefully designed, and rich, gene expression study focused on gene-environment interactions, with the goals of identifying genes implicated in connection with disease states and known risk factors, and in generating expression signatures as proxies for such risk factors. A coupled exploratory analysis investigates cross-species extrapolation of gene expression signatures—how these mouse-model signatures translate to humans. The latter involves exploration of sparse latent factor analysis of human observational data and of how it relates to projected risk signatures derived in the animal models. The study also highlights a range of applied statistical and genomic data analysis issues, including model specification, computational questions and model-based correction of experimental artifacts in DNA microarray data.

  17. Data from: Comparing regression-based approaches for identifying microbial...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated May 6, 2025
    Cite
    Fang Yu; Mikhail Tikhonov (2025). Comparing regression-based approaches for identifying microbial functional groups [Dataset]. http://doi.org/10.5061/dryad.n8pk0p366
    Explore at:
    Available download formats: zip
    Dataset updated
    May 6, 2025
    Dataset provided by
    Washington University in St. Louis
    Authors
    Fang Yu; Mikhail Tikhonov
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Microbial communities are composed of functionally integrated taxa, and identifying which taxa contribute to a given ecosystem function is essential for predicting community behaviors. This study compares the effectiveness of a previously proposed method for identifying "functional taxa," Ensemble Quotient Optimization (EQO), to a potentially simpler approach based on the Least Absolute Shrinkage and Selection Operator (LASSO). In contrast to LASSO, EQO uses a binary prior on coefficients, assuming uniform contribution strength across taxa. Using synthetic datasets with increasingly realistic structure, we demonstrate that EQO's strong prior enables it to perform better in the low-data regime. However, LASSO's flexibility and efficiency can make it preferable as data complexity increases. Our results detail the favorable conditions for EQO and emphasize LASSO as a viable alternative.
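
    A minimal sketch (not the paper's code) of the LASSO alternative described above: regress a measured community function on per-taxon abundances and read the functional group off the nonzero coefficients. The data here are synthetic.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n_samples, n_taxa = 100, 40
abundances = rng.lognormal(size=(n_samples, n_taxa))

true_group = [3, 11, 27]  # hypothetical taxa that actually contribute to the function
function = abundances[:, true_group].sum(axis=1) + rng.normal(0.0, 0.5, n_samples)

lasso = LassoCV(cv=5).fit(abundances, function)
print("taxa with nonzero coefficients:", np.flatnonzero(lasso.coef_))
```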

  18. National Forest and Sparse Woody Vegetation Data (Version 3, 2018 Release)

    • researchdata.edu.au
    Updated Apr 3, 2019
    + more versions
    Cite
    Australian Government Department of Climate Change, Energy, the Environment and Water (2019). National Forest and Sparse Woody Vegetation Data (Version 3, 2018 Release) [Dataset]. https://researchdata.edu.au/national-forest-sparse-2018-release/2994529
    Explore at:
    Dataset updated
    Apr 3, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Australian Government Department of Climate Change, Energy, the Environment and Water
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2018. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, potentially reaching 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

    The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same approach in terms of time series processing (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.

  19. Code: Parametric Calibration for Supply Chain Simulation Models with Sparse...

    • data.4tu.nl
    zip
    Updated Mar 17, 2007
    Cite
    Isabelle van Schilt (2007). Code: Parametric Calibration for Supply Chain Simulation Models with Sparse Data [Dataset]. http://doi.org/10.4121/a772fd6f-ec0b-4038-8e54-5b9901f060ad.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 17, 2007
    Dataset provided by
    4TU.ResearchData
    Authors
    Isabelle van Schilt
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This code is part of the Ph.D. thesis of Isabelle M. van Schilt, Delft University of Technology.

    This code is used to calibrate a parameter of a stylized supply chain simulation model of counterfeit Personal Protective Equipment (PPE). For this, we use three calibration techniques: Approximate Bayesian Computing using pydream, Genetic Algorithms using Platypus, and Powell's Method using SciPy. The calibration is done with sparse data, which is generated by degrading the ground truth data on noise, bias, and missing values.
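
    A minimal sketch (not the repository's code) of parametric calibration with Powell's method in SciPy: choose the parameter value whose simulated output best matches degraded observations. The single-parameter "simulation" below is a hypothetical stand-in for the supply chain model.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

def simulate(delay):
    """Hypothetical single-parameter stand-in for the supply chain simulation."""
    t = np.arange(30)
    return 100.0 * np.exp(-t / max(delay, 1e-6))

# Degraded "ground truth": noise plus 30% missing values.
observed = simulate(7.0) + rng.normal(0.0, 5.0, 30)
observed[rng.random(30) < 0.3] = np.nan

def distance(params):
    sim = simulate(params[0])
    mask = ~np.isnan(observed)
    return float(np.mean((sim[mask] - observed[mask]) ** 2))

result = minimize(distance, x0=[3.0], method="Powell")
print("calibrated delay:", result.x[0])
```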

    This code is an extension of the celibration library, making it easy to plug in different calibration models, distance metrics and functions, and data.

    Note that this code uses an old version of pydsol, which is included in the zip file.

  20. Data from: Video-rate raman-based metabolic imaging by airy light-sheet...

    • verso.uidaho.edu
    • data.nkn.uidaho.edu
    • +1more
    txt, zip
    Updated Dec 20, 2022
    + more versions
    Cite
    Andreas Vasdekis (2022). Data from: Video-rate raman-based metabolic imaging by airy light-sheet illumination and photon-sparse detection [Dataset]. https://verso.uidaho.edu/esploro/outputs/dataset/Data-from-Video-rate-raman-based-metabolic-imaging/996765635801851
    Explore at:
    Available download formats: txt (6123 bytes), zip (144076388 bytes)
    Dataset updated
    Dec 20, 2022
    Dataset provided by
    University of Idaho
    Authors
    Andreas Vasdekis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 20, 2022
    Description

    Data supporting manuscript submitted to PNAS: Video-Rate Raman-based Metabolic Imaging by Airy Light-Sheet Illumination and Photon-Sparse Detection. The data set includes: [1] raw data and [2] related images used in the analyses described within the manuscript.

    Despite its massive potential, Raman imaging represents just a modest fraction of all research and clinical microscopy to date. This is due to the ultralow Raman scattering cross-sections of most biomolecules that impose low-light or photon-sparse conditions. Bioimaging under such conditions is suboptimal, as it either results in ultralow frame rates or requires increased levels of irradiance. Here, we overcome this tradeoff by introducing Raman imaging that operates at both video rates and 1,000-fold lower irradiance than state-of-the-art methods. To accomplish this, we deployed a judicially designed Airy light-sheet microscope to efficiently image large specimen regions. Further, we implemented subphoton per pixel image acquisition and reconstruction to confront issues arising from photon sparsity at just millisecond integrations. We demonstrate the versatility of our approach by imaging a variety of samples, including the three-dimensional (3D) metabolic activity of single microbial cells and the underlying cell-to-cell variability. To image such small-scale targets, we again harnessed photon sparsity to increase magnification without a field-of-view penalty, thus, overcoming another key limitation in modern light-sheet microscopy.

    Data Use:
    License: CC-BY 4.0
    Recommended Citation: Vasdekis AE (2023) Data from: Video-rate raman-based metabolic imaging by airy light-sheet illumination and photon-sparse detection [Dataset]. University of Idaho. https://doi.org/10.11578/1908656
