13 datasets found

d
Data from: Sparse Machine Learning Methods for Understanding Large Text...
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+3more
Updated Apr 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://catalog.data.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using parse regression or classifi?cation; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
f
Additional file 1 of Seagull: lasso, group lasso and sparse-group lasso...
springernature.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Klosa; Noah Simon; Pål Olof Westermark; Volkmar Liebscher; Dörte Wittenburg (2023). Additional file 1 of Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent [Dataset]. http://doi.org/10.6084/m9.figshare.12960734.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12960734.v1
Dataset updated
Jun 1, 2023
Dataset provided by
figshare
Authors
Jan Klosa; Noah Simon; Pål Olof Westermark; Volkmar Liebscher; Dörte Wittenburg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. An R script for downloading and processing the methylation data used in this study.
f
Additional file 3 of Seagull: lasso, group lasso and sparse-group lasso...
springernature.figshare.com
txt
Updated Jun 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Klosa; Noah Simon; Pål Olof Westermark; Volkmar Liebscher; Dörte Wittenburg (2023). Additional file 3 of Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent [Dataset]. http://doi.org/10.6084/m9.figshare.12960740.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12960740.v1
Dataset updated
Jun 4, 2023
Dataset provided by
figshare
Authors
Jan Klosa; Noah Simon; Pål Olof Westermark; Volkmar Liebscher; Dörte Wittenburg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 3. An R script for performing an exemplary genome-wide association study.
f
Consistent Sparse Deep Learning: Theory and Computation
tandf.figshare.com
txt
Updated Feb 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yan Sun; Qifan Song; Faming Liang (2024). Consistent Sparse Deep Learning: Theory and Computation [Dataset]. http://doi.org/10.6084/m9.figshare.14120235.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14120235.v1
Dataset updated
Feb 29, 2024
Dataset provided by
Taylor & Francis
Authors
Yan Sun; Qifan Song; Faming Liang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Deep learning has been the engine powering many successes of data science. However, the deep neural network (DNN), as the basic model of deep learning, is often excessively over-parameterized, causing many difficulties in training, prediction and interpretation. We propose a frequentist-like method for learning sparse DNNs and justify its consistency under the Bayesian framework: the proposed method could learn a sparse DNN with at most O(n/ log (n)) connections and nice theoretical guarantees such as posterior consistency, variable selection consistency and asymptotically optimal generalization bounds. In particular, we establish posterior consistency for the sparse DNN with a mixture Gaussian prior, show that the structure of the sparse DNN can be consistently determined using a Laplace approximation-based marginal posterior inclusion probability approach, and use Bayesian evidence to elicit sparse DNNs learned by an optimization method such as stochastic gradient descent in multiple runs with different initializations. The proposed method is computationally more efficient than standard Bayesian methods for large-scale sparse DNNs. The numerical results indicate that the proposed method can perform very well for large-scale network compression and high-dimensional nonlinear variable selection, both advancing interpretable machine learning.
Sparse-Matrix Compression Engine Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Sparse-Matrix Compression Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sparse-matrix-compression-engine-market
Explore at:
pdf, pptx, csvAvailable download formats
Dataset updated
Aug 4, 2025
Dataset provided by
Authors
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Sparse-Matrix Compression Engine Market Outlook

According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.

One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.

Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.

The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.

From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and early adopters of high-performance computing solutions. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, expanding AI research, and significant investments in data infrastructure across China, Japan, and India. Europe follows closely, with robust demand for advanced analytics and scientific computing in sectors such as automotive, healthcare, and finance. Latin America and Middle East & Africa are gradually emerging as promising markets, supported by increasing investments in IT modernization and digitalization initiatives.

"https://growthmarketreports.com/request-sample/27696">
<
f
Data from: Large-Scale Learning of Structure−Activity Relationships Using a...
acs.figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell (2023). Large-Scale Learning of Structure−Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics [Dataset]. http://doi.org/10.1021/ci100073w.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/ci100073w.s001
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support machine has an excellent performance if applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
f
DataSheet1_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...
figshare.com
frontiersin.figshare.com
docx
Updated Jun 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang (2023). DataSheet1_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker Screening.DOCX [Dataset]. http://doi.org/10.3389/fgene.2022.869906.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2022.869906.s001
Dataset updated
Jun 15, 2023
Dataset provided by
Frontiers
Authors
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Previous research shows that each type of cancer can be divided into multiple subtypes, which is one of the key reasons that make cancer difficult to cure. Under these circumstances, finding a new target gene of cancer subtypes has great significance on developing new anti-cancer drugs and personalized treatment. Due to the fact that gene expression data sets of cancer are usually high-dimensional and with high noise and have multiple potential subtypes’ information, many sparse principal component analysis (sparse PCA) methods have been used to identify cancer subtype biomarkers and subtype clusters. However, the existing sparse PCA methods have not used the known cancer subtype information as prior knowledge, and their results are greatly affected by the quality of the samples. Therefore, we propose the Dynamic Metadata Edge-group Sparse PCA (DM-ESPCA) model, which combines the idea of meta-learning to solve the problem of sample quality and uses the known cancer subtype information as prior knowledge to capture some gene modules with better biological interpretations. The experiment results on the three biological data sets showed that the DM-ESPCA model can find potential target gene probes with richer biological information to the cancer subtypes. Moreover, the results of clustering and machine learning classification models based on the target genes screened by the DM-ESPCA model can be improved by up to 22–23% of accuracies compared with the existing sparse PCA methods. We also proved that the result of the DM-ESPCA model is better than those of the four classic supervised machine learning models in the task of classification of cancer subtypes.
f
Table9_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...
figshare.com
datasetcatalog.nlm.nih.gov
+1more
xlsx
Updated May 9, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang (2022). Table9_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker Screening.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.869906.s013
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2022.869906.s013
Dataset updated
May 9, 2022
Dataset provided by
Frontiers
Authors
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Previous research shows that each type of cancer can be divided into multiple subtypes, which is one of the key reasons that make cancer difficult to cure. Under these circumstances, finding a new target gene of cancer subtypes has great significance on developing new anti-cancer drugs and personalized treatment. Due to the fact that gene expression data sets of cancer are usually high-dimensional and with high noise and have multiple potential subtypes’ information, many sparse principal component analysis (sparse PCA) methods have been used to identify cancer subtype biomarkers and subtype clusters. However, the existing sparse PCA methods have not used the known cancer subtype information as prior knowledge, and their results are greatly affected by the quality of the samples. Therefore, we propose the Dynamic Metadata Edge-group Sparse PCA (DM-ESPCA) model, which combines the idea of meta-learning to solve the problem of sample quality and uses the known cancer subtype information as prior knowledge to capture some gene modules with better biological interpretations. The experiment results on the three biological data sets showed that the DM-ESPCA model can find potential target gene probes with richer biological information to the cancer subtypes. Moreover, the results of clustering and machine learning classification models based on the target genes screened by the DM-ESPCA model can be improved by up to 22–23% of accuracies compared with the existing sparse PCA methods. We also proved that the result of the DM-ESPCA model is better than those of the four classic supervised machine learning models in the task of classification of cancer subtypes.
f
Data from: Scalable Hyperparameter Selection for Latent Dirichlet Allocation...
tandf.figshare.com
zip
Updated Feb 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wei Xia; Hani Doss (2024). Scalable Hyperparameter Selection for Latent Dirichlet Allocation [Dataset]. http://doi.org/10.6084/m9.figshare.11996595.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.11996595.v2
Dataset updated
Feb 28, 2024
Dataset provided by
Taylor & Francis
Authors
Wei Xia; Hani Doss
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Latent Dirichlet allocation (LDA) is a heavily used Bayesian hierarchical model used in machine learning for modeling high-dimensional sparse count data, for example, text documents. As a Bayesian model, it incorporates a prior on a set of latent variables. The prior is indexed by some hyperparameters, which have a big impact on inference regarding the model. The ideal estimate of the hyperparameters is the empirical Bayes estimate which is, by definition, the maximizer of the marginal likelihood of the data with all the latent variables integrated out. This estimate cannot be obtained analytically. In practice, the hyperparameters are chosen either in an ad-hoc manner, or through some variants of the EM algorithm for which the theoretical basis is weak. We propose an MCMC-based fully Bayesian method for obtaining the empirical Bayes estimate of the hyperparameter. We compare our method with other existing approaches both on synthetic and real data. The comparative experiments demonstrate that the LDA model with hyperparameters specified by our method outperforms models with the hyperparameters estimated by other methods. Supplementary materials for this article are available online.
f
Data from: FAStEN: An Efficient Adaptive Method for Feature Selection and...
tandf.figshare.com
pdf
Updated Nov 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tobia Boschi; Lorenzo Testa; Francesca Chiaromonte; Matthew Reimherr (2024). FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions [Dataset]. http://doi.org/10.6084/m9.figshare.27122532.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27122532.v1
Dataset updated
Nov 22, 2024
Dataset provided by
Taylor & Francis
Authors
Tobia Boschi; Lorenzo Testa; Francesca Chiaromonte; Matthew Reimherr
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex datasets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse high dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach to the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficients’ estimation. The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study. Complete FAStEN code is provided at https://github.com/IBM/funGCN. Supplementary materials for this article are available online.
f
Data from: Flexible Extensions to Structural Equation Models Using...
tandf.figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erik–Jan van Kesteren; Daniel L. Oberski (2023). Flexible Extensions to Structural Equation Models Using Computation Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.16862956.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.16862956.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Erik–Jan van Kesteren; Daniel L. Oberski
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Structural equation modeling (SEM) is being applied to ever more complex data types and questions, often requiring extensions such as regularization or novel fitting functions. To extend SEM, researchers currently need to completely reformulate SEM and its optimization algorithm – a challenging and time–consuming task. In this paper, we introduce the computation graph for SEM, and show that this approach can extend SEM without the need for bespoke software development. We show that both existing and novel SEM improvements follow naturally. To demonstrate, we introduce three SEM extensions: least absolute deviation estimation, Bayesian LASSO optimization, and sparse high–dimensional mediation analysis. We provide an implementation of SEM in PyTorch – popular software in the machine learning community – to accelerate development of structural equation models adequate for modern–day data and research questions.
f
Table4_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang (2023). Table4_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker Screening.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.869906.s008
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2022.869906.s008
Dataset updated
Jun 5, 2023
Dataset provided by
Frontiers
Authors
Rui Miao; Xin Dong; Xiao-Ying Liu; Sio-Long Lo; Xin-Yue Mei; Qi Dang; Jie Cai; Shao Li; Kuo Yang; Sheng-Li Xie; Yong Liang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Previous research shows that each type of cancer can be divided into multiple subtypes, which is one of the key reasons that make cancer difficult to cure. Under these circumstances, finding a new target gene of cancer subtypes has great significance on developing new anti-cancer drugs and personalized treatment. Due to the fact that gene expression data sets of cancer are usually high-dimensional and with high noise and have multiple potential subtypes’ information, many sparse principal component analysis (sparse PCA) methods have been used to identify cancer subtype biomarkers and subtype clusters. However, the existing sparse PCA methods have not used the known cancer subtype information as prior knowledge, and their results are greatly affected by the quality of the samples. Therefore, we propose the Dynamic Metadata Edge-group Sparse PCA (DM-ESPCA) model, which combines the idea of meta-learning to solve the problem of sample quality and uses the known cancer subtype information as prior knowledge to capture some gene modules with better biological interpretations. The experiment results on the three biological data sets showed that the DM-ESPCA model can find potential target gene probes with richer biological information to the cancer subtypes. Moreover, the results of clustering and machine learning classification models based on the target genes screened by the DM-ESPCA model can be improved by up to 22–23% of accuracies compared with the existing sparse PCA methods. We also proved that the result of the DM-ESPCA model is better than those of the four classic supervised machine learning models in the task of classification of cancer subtypes.
Data from: SyMANTIC: An Efficient Symbolic Regression Method for...
acs.figshare.com
zip
Updated Feb 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Madhav R. Muthyala; Farshud Sorourifar; You Peng; Joel A. Paulson (2025). SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond [Dataset]. http://doi.org/10.1021/acs.iecr.4c03503.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.iecr.4c03503.s001
Dataset updated
Feb 4, 2025
Dataset provided by
ACS Publications
Authors
Madhav R. Muthyala; Farshud Sorourifar; You Peng; Joel A. Paulson
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Symbolic regression (SR) is an emerging branch of machine learning focused on discovering simple and interpretable mathematical expressions from data. Although a wide-variety of SR methods have been developed, they often face challenges such as high computational cost, poor scalability with respect to the number of input dimensions, fragility to noise, and an inability to balance accuracy and complexity. This work introduces SyMANTIC, a novel SR algorithm that addresses these challenges. SyMANTIC efficiently identifies (potentially several) low-dimensional descriptors from a large set of candidates (from ∼105 to ∼1010 or more) through a unique combination of mutual information-based feature selection, adaptive feature expansion, and recursively applied l0-based sparse regression. In addition, it employs an information-theoretic measure to produce an approximate set of Pareto-optimal equations, each offering the best-found accuracy for a given complexity. Furthermore, our open-source implementation of SyMANTIC, built on the PyTorch ecosystem, facilitates easy installation and GPU acceleration. We demonstrate the effectiveness of SyMANTIC across a range of problems, including synthetic examples, scientific benchmarks, real-world material property predictions, and chaotic dynamical system identification from small datasets. Extensive comparisons show that SyMANTIC uncovers similar or more accurate models at a fraction of the cost of existing SR methods.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Dashlink (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://catalog.data.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora

Data from: Sparse Machine Learning Methods for Understanding Large Text Corpora

Explore at:

Dataset updated

Apr 10, 2025

Dataset provided by

Dashlink

Description

Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using parse regression or classifi?cation; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.

Clear search

Close search

Google apps

Main menu

Data from: Sparse Machine Learning Methods for Understanding Large Text...

Additional file 1 of Seagull: lasso, group lasso and sparse-group lasso...

Additional file 3 of Seagull: lasso, group lasso and sparse-group lasso...

Consistent Sparse Deep Learning: Theory and Computation

Sparse-Matrix Compression Engine Market Research Report 2033

Sparse-Matrix Compression Engine Market Outlook

Data from: Large-Scale Learning of Structure−Activity Relationships Using a...

DataSheet1_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...

Table9_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...

Data from: Scalable Hyperparameter Selection for Latent Dirichlet Allocation...

Data from: FAStEN: An Efficient Adaptive Method for Feature Selection and...

Data from: Flexible Extensions to Structural Equation Models Using...

Table4_Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker...

Data from: SyMANTIC: An Efficient Symbolic Regression Method for...

Data from: Sparse Machine Learning Methods for Understanding Large Text Corpora