Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
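As a hedged illustration of the sparse-regression ingredient (a conceptual sketch, not the authors' code or the ASRS corpus), an L1-penalized classifier trained on tf-idf features keeps only a few terms with nonzero weights, and those terms act as a comparative summary of what distinguishes the two corpora:

```python
# Hedged sketch: L1-regularized logistic regression as a comparative summarizer.
# Documents and labels below are placeholders, not the ASRS corpus used in the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["pilot crossed hold short line without clearance",
        "tower issued go-around due to runway incursion",
        "smooth departure, no anomalies reported",
        "routine climb to cruise altitude"]
labels = [1, 1, 0, 0]  # 1 = incursion-related corpus, 0 = comparison corpus

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# The L1 penalty drives most term weights to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

terms = np.array(vec.get_feature_names_out())
nonzero = clf.coef_[0] != 0
print(sorted(zip(clf.coef_[0][nonzero], terms[nonzero]), reverse=True))
```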
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
EMG data for classifier evaluation
MATLAB code + demo to reproduce results for "Sparse Principal Component Analysis with Preserved Sparsity". This code calculates the principal loading vectors for any given high-dimensional data matrix. The advantage of this method over existing sparse-PCA methods is that it can produce principal loading vectors with the same sparsity pattern for any number of principal components. Please see Readme.md for more information.
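The dataset itself ships MATLAB code; purely as a conceptual sketch of sparse principal components in Python (ordinary sparse PCA, not the preserved-sparsity method provided here), scikit-learn's SparsePCA shows how sparse loading vectors look:

```python
# Conceptual sketch of sparse PCA (not the preserved-sparsity method in this dataset):
# each loading vector has many exactly-zero entries, aiding interpretability.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # placeholder high-dimensional data matrix

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
spca.fit(X)

# Note: with ordinary sparse PCA the zero pattern generally differs across
# components; the method in this dataset enforces a shared sparsity pattern.
for i, comp in enumerate(spca.components_):
    print(f"component {i}: {np.count_nonzero(comp)} nonzero of {comp.size}")
```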
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Low-energy electron microscopy (LEEM) taken as intensity-voltage (I-V) curves provides hyperspectral images of surfaces, which can be used to identify the surface type but are difficult to analyze. Here, we demonstrate the use of an algorithm for factorizing the data into spectra and concentrations of characteristic components (FSC3) for identifying distinct physical surface phases. Importantly, FSC3 is an unsupervised and fast algorithm. As example data we use experiments on the growth of praseodymium oxide or ruthenium oxide on ruthenium single-crystal substrates, both featuring a complex distribution of coexisting surface components varying in both chemical composition and crystallographic structure. With the factorization result, a sparse sampling method is demonstrated, reducing the measurement time by 1-2 orders of magnitude, which is relevant for dynamic surface studies. The FSC3 concentrations provide the features for a support vector machine (SVM) based supervised classification of the surface types. Here, specific surface regions which have been identified structurally, via their diffraction pattern, as well as chemically by complementary spectro-microscopic techniques, are used as training sets. A reliable classification is demonstrated on both exemplary LEEM I-V datasets. Research results are published at https://arxiv.org/abs/2203.12353. The data available represents the concentration maps obtained by FSC3 in TIFF format, together with the associated spectra as ASCII. Similarly, the results of the classification algorithm are available as TIFF images, while the average concentrations and spectra calculated over the training and testing regions are given as ASCII data. The raw data are also given as TIFF images, which can be used to test the FSC3 and classification algorithms (available at https://langsrv.astro.cf.ac.uk/HIA/HIA.html and https://github.com/masiaf-cf/leem-svm-classify, respectively). Research results based upon these data are published at https://doi.org/10.1111/jmi.13155
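As a hedged, conceptual sketch of the overall pipeline (factorize the hyperspectral stack, then classify pixels from the concentration features), the snippet below uses scikit-learn's NMF as a generic stand-in for FSC3 and synthetic data in place of LEEM I-V measurements:

```python
# Hedged sketch of the factorize-then-classify pipeline, using NMF as a generic
# stand-in for FSC3 and synthetic data instead of LEEM I-V measurements.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pixels, energies, k = 500, 80, 3
spectra_true = np.abs(rng.normal(size=(k, energies)))
conc_true = rng.dirichlet(np.ones(k), size=pixels)
data = conc_true @ spectra_true                      # hyperspectral stack, one row per pixel

# Factorize into per-pixel concentrations and component spectra.
nmf = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
conc = nmf.fit_transform(data)                       # concentrations (features for the SVM)

# Supervised classification of surface type from the concentration features,
# with labels assumed to come from annotated training regions.
labels = conc_true.argmax(axis=1)                    # placeholder labels
svm = SVC(kernel="rbf").fit(conc[:400], labels[:400])
print("held-out accuracy:", svm.score(conc[400:], labels[400:]))
```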
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.
The dataset accompanies the publication "System Design for a Data-driven and Explainable Customer Sentiment Monitor" (submitted). We provide an anonymized version of data collected over a period of two years.
The dataset should fuel the research and development of new machine learning algorithms that better cope with real-world data challenges, including sparse and noisy labels and concept drift. An additional challenge is the optimal fusion of enterprise- and log-based features for the prediction task. Interpretability of the designed prediction models should also be ensured in order to maintain practical relevance.
Supporting software
Kindly use the corresponding GitHub repository (https://github.com/annguy/customer-sentiment-monitor) to design and benchmark your algorithms.
Citation and Contact
If you use this dataset, please cite the following publication:
@ARTICLE{9520354,
author={Nguyen, An and Foerstel, Stefan and Kittler, Thomas and Kurzyukov, Andrey and Schwinn, Leo and Zanca, Dario and Hipp, Tobias and Jun, Sun Da and Schrapp, Michael and Rothgang, Eva and Eskofier, Bjoern},
journal={IEEE Access},
title={System Design for a Data-Driven and Explainable Customer Sentiment Monitor Using IoT and Enterprise Data},
year={2021},
volume={9},
number={},
pages={117140-117152},
doi={10.1109/ACCESS.2021.3106791}}
If you would like to get in touch, please contact an.nguyen@fau.de.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MetaFlux is a global, long-term carbon flux dataset of gross primary production and ecosystem respiration that is generated using meta-learning. The principle of meta-learning stems from the need to solve the problem of learning in the face of sparse data availability. Data sparsity is a prevalent challenge in climate and ecology science. For instance, in-situ observations tend to be spatially and temporally sparse. This issue can arise from sensor malfunctions, limited sensor locations, or non-ideal climate conditions such as persistent cloud cover. The lack of high-quality continuous data can make it difficult to understand many climate processes that are otherwise critical. The machine-learning community has attempted to tackle this problem by developing several learning approaches, including meta-learning, which learns how to learn broad features across tasks to better infer other, poorly sampled ones. In this work, we applied meta-learning to the problem of upscaling continuous carbon fluxes from sparse observations. Data scarcity in carbon flux applications is particularly problematic in the tropics and semi-arid regions, where only around 8–11% of long-term eddy covariance stations are currently operational. Unfortunately, these regions are important in modulating the global carbon cycle and its interannual variability. In general, we find that meta-trained machine learning models, including multi-layer perceptrons (MLP), long short-term memory networks (LSTM), and bi-directional LSTMs (BiLSTM), have 9–16% lower validation errors on flux estimates when compared to their non-meta-trained counterparts. In addition, meta-trained models are more robust to extreme conditions, with 4–24% lower overall errors. Finally, we use an ensemble of meta-trained deep networks to upscale in-situ observations of ecosystem-scale photosynthesis and respiration fluxes into daily and monthly global products at a 0.25-degree spatial resolution from 2001 to 2023, called "MetaFlux". We also checked the seasonality, interannual variability, and correlation to solar-induced fluorescence of the upscaled product and found that MetaFlux outperformed state-of-the-art machine learning upscaling models, especially in critical semi-arid and tropical regions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Induced sparsity in the factor loading matrix identifies the factor basis, while rotational identification is obtained ex post by clustering methods closely related to machine learning. We extract meaningful economic concepts from a high-dimensional data set, which together with observed variables follow an unrestricted, reduced-form VAR process. Including a comprehensive set of economic concepts allows reliable, fundamental structural analysis, even of the factor augmented VAR itself. We illustrate this by combining two structural identification methods to further analyze the model. To account for the shift in monetary policy instruments triggered by the Great Recession, we follow separate strategies to identify monetary policy shocks. Comparing ours to other parametric and non-parametric factor estimates uncovers advantages of parametric sparse factor estimation in a high dimensional data environment. Besides meaningful factor extraction, we gain precision in the estimation of factor loadings.
According to our latest research, the AI Sparsity Engine market size reached USD 1.19 billion globally in 2024, with a robust year-on-year growth propelled by advancements in deep learning optimization and efficient neural network deployment. The market is forecasted to expand at a CAGR of 34.7% from 2025 to 2033, reaching an estimated USD 16.1 billion by 2033. This exceptional growth trajectory is primarily driven by the increasing demand for computational efficiency in AI workloads and the widespread adoption of AI sparsity engines across diverse industry verticals, as per our latest research findings.
The primary growth factor for the AI Sparsity Engine market is the surging need for high-performance and energy-efficient AI models, particularly in edge computing and data center environments. As organizations worldwide seek to deploy complex AI models on resource-constrained hardware, sparsity engines have emerged as essential tools for pruning redundant parameters and optimizing model size without sacrificing accuracy. This capability is vital for accelerating AI inference, reducing computational costs, and extending battery life in edge devices. Furthermore, the proliferation of AI-powered applications in sectors such as healthcare, automotive, and finance has intensified the demand for scalable and efficient AI solutions, thus fueling the adoption of AI sparsity engines.
Another significant driver is the rapid evolution of AI algorithms and neural network architectures, which increasingly rely on sparsity techniques to enhance model interpretability and scalability. The integration of AI sparsity engines with mainstream machine learning frameworks and hardware accelerators has simplified the deployment process, enabling enterprises to seamlessly integrate sparsity into their existing AI pipelines. Additionally, the growing focus on sustainable AI and green computing has positioned sparsity engines as a key enabler for reducing the energy footprint of large-scale AI deployments. As regulatory pressures and corporate sustainability goals intensify, organizations are prioritizing technologies that deliver both performance and energy efficiency, thereby boosting the AI sparsity engine market.
A further catalyst for market expansion is the increasing investment in AI research and development, particularly in emerging economies. Governments and private sector players are allocating substantial resources to advance AI infrastructure and foster innovation in AI model optimization. The availability of open-source AI sparsity toolkits and collaborative research initiatives has democratized access to cutting-edge sparsity techniques, accelerating market penetration across small and medium enterprises (SMEs) and large enterprises alike. The convergence of AI sparsity engines with complementary technologies such as federated learning and secure AI is also opening new avenues for market growth, especially in privacy-sensitive industries.
From a regional perspective, North America currently dominates the AI Sparsity Engine market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The United States, in particular, has witnessed significant adoption of AI sparsity engines across its technology, healthcare, and financial sectors, driven by a mature AI ecosystem and strong R&D investments. Meanwhile, Asia Pacific is poised for the fastest growth throughout the forecast period, fueled by rapid digital transformation, expanding AI infrastructure, and rising government initiatives to promote AI innovation in countries such as China, Japan, and South Korea. Europe is also experiencing steady growth, supported by robust regulatory frameworks and increasing focus on sustainable AI solutions.
The AI Sparsity Engine market by component is segmented into software, hardware, and services, each playing a pivotal role in the overall ecosystem. Software solutions currently dominate the market, accounting for the largest share in 202
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Work in progress...
This dataset was developed in the context of my master's thesis titled "Physics-Guided Deep Learning for Sparse Data-Driven Brain Shift Registration", which investigates the integration of physics-based biomechanical modeling into deep learning frameworks for the task of brain shift registration. The core objective of this project is to improve the accuracy and reliability of intraoperative brain shift prediction by enabling deep neural networks to interpolate sparse intraoperative data under biomechanical constraints. Such capabilities are critical for enhancing image-guided neurosurgery systems, especially when full intraoperative imaging is unavailable or impractical.
The dataset integrates and extends data from two publicly available sources: ReMIND and UPENN-GBM. A total of 207 patient cases (45 cases from ReMIND and 162 cases from UPENN-GBM), each represented as a separate folder with all relevant data grouped per case, are included in this dataset. It contains preoperative imaging (unstripped), synthetic ground truth displacement fields, anatomical segmentations, and keypoints, structured to support machine learning and registration tasks.
For details on the image acquisition and other topics related to the original datasets, see their original links above.
Each patient folder contains the following subfolders:
- images/: Preoperative MRI scans (T1ce, T2) in NIfTI format.
- segmentations/: Brain and tumor segmentations in NRRD format.
- simulations/: Biomechanically simulated displacement fields with initial and final point coordinates (LPS) in .npz and .txt formats, respectively (a loading sketch follows below).
- keypoints/: 3D SIFT-Rank keypoints and their descriptors in both voxel space and world coordinates (RAS?) as .key files.
The folder naming and organization are consistent across patients for ease of use and scripting.
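For the simulations/ subfolder referenced above, a minimal sketch for inspecting one displacement-field archive with NumPy; the case path is a placeholder, and the stored array names are simply listed rather than assumed:

```python
# Hedged sketch: inspect one simulated displacement field archive; the path below is
# a placeholder, and the array names are whatever the .npz file actually stores.
import numpy as np

arrays = np.load("case_001/simulations/displacement_field.npz")  # hypothetical path
print(arrays.files)                      # list the stored array names
for name in arrays.files:
    print(name, arrays[name].shape)
```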
ReMIND is a multimodal imaging dataset of 114 brain tumor patients who underwent image-guided surgical resection at Brigham and Women's Hospital, containing preoperative MRI, intraoperative MRI, and 3D intraoperative ultrasound data. It includes over 300 imaging series and 350 expert-annotated segmentations such as tumors, resection cavities, cerebrum, and ventricles. Demographic and clinico-pathological information (e.g., tumor type, grade, eloquence) is also provided.
UPENN-GBM comprises multi-parametric MRI scans from de novo glioblastoma (GBM) patients treated at the University of Pennsylvania Health System. It includes co-registered and skull-stripped T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR images. The dataset features high-quality tumor and brain segmentation labels, initially produced by automated methods and subsequently corrected and approved by board-certified neuroradiologists. Alongside imaging data, the collection provides comprehensive clinical metadata including patient demographics, genomic profiles, survival outcomes, and tumor progression indicators.
This dataset is tailored for researchers and developers working on:
It is especially well-suited for evaluating learning-based registration methods that incorporate physical priors or aim to generalize under sparse supervision.
MIT License: https://opensource.org/licenses/MIT
This repository contains data, code, and model weights for reproducing the main results of the paper, Insights on Galaxy Evolution from Interpretable Sparse Feature Networks (see arXiv preprint). Specifically, we provide data files (`images-sdss.tar.gz` and `galaxies.csv`), a snapshot of the code base (sparse-feature-networks v1.0.0), and model weights (`resnet18-topk_4-metallicity.pth`, `resnet18-topk_4-bpt_lines.pth`). These are described in detail below.
`galaxies.csv` is the main galaxy sample after we have applied the cuts described in the paper (250,224 rows). We include 30 columns queried from the SDSS galSpecInfo, galSpecLine, and galSpecExtra tables (a loading sketch follows the column list):
objID (int64)
DR7ObjID (int64)
specObjID (int64)
ra (float32)
dec (float32)
z (float32)
zErr (float32)
velDisp (float32)
velDispErr (float32)
modelMag_u (float32)
modelMag_g (float32)
modelMag_r (float32)
modelMag_i (float32)
modelMag_z (float32)
petroMag_r (float32)
petroR50_r (float32)
petroR90_r (float32)
bptclass (int32)
oh_p50 (float32)
lgm_tot_p50 (float32)
sfr_tot_p50 (float32)
nii_6584_flux (float32)
nii_6584_flux_err (float32)
h_alpha_flux (float32)
h_alpha_flux_err (float32)
oiii_5007_flux (float32)
oiii_5007_flux_err (float32)
h_beta_flux (float32)
h_beta_flux_err (float32)
reliable (int32)
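A minimal loading sketch for `galaxies.csv` using the columns listed above (the bptclass coding used in the last line is an assumption; check the SDSS galSpecExtra documentation):

```python
# Hedged sketch: load the table and use a few of the documented columns.
import pandas as pd

df = pd.read_csv("galaxies.csv")
print(df.shape)                          # expected (250224, 30)

reliable = df[df["reliable"] == 1]       # apply the quality flag
colors = df["modelMag_g"] - df["modelMag_r"]
star_forming = df[df["bptclass"] == 1]   # bptclass coding assumed; see SDSS docs
```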
`images-sdss.tar.gz` is a compressed directory containing 250,224 image cutouts from the DESI Legacy Imaging Surveys viewer. Each cutout was generated using the RESTful call `http://legacysurvey.org/viewer/cutout.jpg?ra={ra}&dec={dec}&pixscale=0.262&layer=sdss&size=160`, where `ra` and `dec` are taken directly from `galaxies.csv`. Each image is named using the format `{objID}.jpg`, with the `objID` again taken from `galaxies.csv`.
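A hedged sketch of how one cutout could be regenerated from the RESTful call quoted above, assuming the requests package is available:

```python
# Hedged sketch: rebuild one cutout URL from a row of galaxies.csv and download it.
import pandas as pd
import requests

df = pd.read_csv("galaxies.csv")
row = df.iloc[0]
url = ("http://legacysurvey.org/viewer/cutout.jpg"
       f"?ra={row['ra']}&dec={row['dec']}&pixscale=0.262&layer=sdss&size=160")

resp = requests.get(url, timeout=30)
resp.raise_for_status()
with open(f"{int(row['objID'])}.jpg", "wb") as f:
    f.write(resp.content)
```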
The code is a snapshot of https://github.com/jwuphysics/sparse-feature-networks at v1.0.0. After unpacking the images and moving them into the `./data` directory, the directory structure should look like:
./
├── data/
│ ├── images-sdss/
│ └── galaxies.csv
├── model/
├── results/
└── src/
├── config.py
├── dataloader.py
├── model.py
├── main.py
└── trainer.py
In order to run the analysis and reproduce the main results of the paper, you must first create the software environment with `pip install torch fastai numpy pandas matplotlib cmasher tqdm`, and then simply run `python src/main.py`.
The trained model weights (`resnet18-topk_4-metallicity.pth`, `resnet18-topk_4-bpt_lines.pth`) are provided here for reproducing the exact results from the paper. They are compatible with the `ResNet18TopK` class defined in `src/model.py`, and the weights can be stored in the `./model` directory. Alternatively, you can train your own models (e.g., by using the functions defined in `src/trainer.py`) and save them natively with PyTorch.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
EMG data for classifier evaluation
This is a scikit-learn compatible Python implementation of Stabl, coupled with useful functions and example notebooks to rerun the analyses on the different use cases located in the Sample data folder of the code library and in the data.zip folder of this repository.
Python version: from 3.7 up to 3.10
Python packages:
Julia package for noise generation (version 1.9.2):
To install Julia, please follow these instructions:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The application of machine learning to theoretical chemistry has made it possible to combine the accuracy of quantum chemical energetics with the thorough sampling of finite-temperature fluctuations. To reach this goal, a diverse set of methods has been proposed, ranging from simple linear models to kernel regression and highly nonlinear neural networks. Here we apply two widely different approaches to the same, challenging problem - the sampling of the conformational landscape of polypeptides at finite temperature. We develop a Local Kernel Regression (LKR) coupled with a supervised sparsity method and compare it with a more established approach based on Behler-Parrinello type Neural Networks. In the context of the LKR, we discuss how the supervised selection of the reference pool of environments is crucial to achieve accurate potential energy surfaces at a competitive computational cost and leverage the locality of the model to infer which chemical environments are poorly described by the DFTB baseline. We then discuss the relative merits of the two frameworks and perform Hamiltonian-reservoir replica-exchange Monte Carlo sampling and metadynamics simulations, respectively, to demonstrate that both frameworks can achieve converged and transferable sampling of the conformational landscape of complex and flexible biomolecules with comparable accuracy and computational cost.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains model parameters and protein structures described in the manuscript "Efficient protein structure generation with sparse denoising models".
"salad-0.1.0.tar.gz" contains the snapshot of the salad code-base used in the manuscript.
The parameters for the salad (sparse all-atom denoising) models described in the manuscript are contained in "salad_params.tar.gz". This unpacks to a directory "params/", which contains pickled parameter files for a number of model variants:
In addition to salad model parameters, we also provide the parameters for the autoencoder models described in the manuscript in "ae_params.tar.gz". This unpacks to a directory "ae_params/", which contains the following checkpoints:
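Since the parameter files are described as pickled, a minimal inspection sketch; the file name below is a placeholder, not an actual variant name from the archive:

```python
# Hedged sketch: inspect one pickled parameter file from params/ (placeholder name).
import pickle

with open("params/default_vp.pkl", "rb") as f:   # hypothetical file name
    params = pickle.load(f)

# Parameter containers are typically nested dicts of arrays; print the top-level keys.
print(type(params))
if isinstance(params, dict):
    print(list(params.keys())[:10])
```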
The protein structures generated using salad, as well as their corresponding sequences generated using ProteinMPNN and predicted structures using ESMfold are contained in "data_package.tar.gz". This archive unpacks to a directory "data_package/" which contains subdirectories for each protein design task described in the manuscript "Efficient protein structure generation with sparse denoising models":
This directory contains subdirectories named "
In addition, there are subdirectories with "random" in their name, instead of a number of steps, e.g. "default_vp_scaled-200-random-esm/". These subdirectories contain data generated using random secondary structure conditioning.
Each subdirectory has the same underlying structure:
Same as "monomers/", but contains data generated using RFdiffusion and Genie 2 for protein sizes between 50 and 400 amino acids.
This directory contains the subdirectories named "ve-seg-
This directory contains generated structures for the motif-scaffolding benchmark described by Lin et al., 2024 [1]. It contains two subdirectories:
Each of these subdirectories has the same structure as the directories "monomers/" and "shape/", with one subdirectory per motif PDB file in the motif-scaffolding benchmark, e.g. "cond/multimotif_vp-1bcf.pdb-esm/" or "nocond/default_vp-1bcf.pdb-esm/". These directories contain the usual "backbones/" and "predictions/" subdirectories, as well as a file "motif_scores.csv". This has fields analogous to "scores.csv", with the addition of two additional fields for motif-RMSD:
A designed sequence-structure pair is only considered successful if sc_rmsd < 2 Å, plddt > 70 and motif_rmsd_bb < 1 Å.
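A minimal sketch for applying these thresholds to a motif_scores.csv file, assuming its columns are named after the quantities above:

```python
# Hedged sketch: count designs passing the stated success thresholds, assuming the
# motif_scores.csv columns are named as in the criteria above.
import pandas as pd

scores = pd.read_csv("motif_scores.csv")
passed = scores[(scores["sc_rmsd"] < 2.0) &
                (scores["plddt"] > 70) &
                (scores["motif_rmsd_bb"] < 1.0)]
print(f"{len(passed)} / {len(scores)} designs successful")
```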
This directory contains generated structures for symmetric repeat proteins using both VP and VE models with structure-editing. Subdirectories are named by model type ("default_vp", "default_ve_minimal_timeless"), symmetry ("C
This directory contains generated structures for designed multi-state proteins. In our manuscript we compare two different approaches to multi-state design using salad which are reflected in two subdirectories of "confchange/":
Both share the same directory structure:
According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.
One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.
Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.
The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.
From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and early adopters of high-performance computing solutions. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, expanding AI research, and significant investments in data infrastructure across China, Japan, and India. Europe follows closely, with robust demand for advanced analytics and scientific computing in sectors such as automotive, healthcare, and finance. Latin America and Middle East & Africa are gradually emerging as promising markets, supported by increasing investments in IT modernization and digitalization initiatives.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Academic achievement is an important index to measure the quality of education and students' learning outcomes. Reasonable and accurate prediction of academic achievement can help improve teachers' educational methods, and it also provides corresponding data support for the formulation of education policies. However, traditional methods for classifying academic performance have many problems, such as low accuracy, limited ability to handle nonlinear relationships, and poor handling of data sparsity. Based on this, our study analyzes various characteristics of students, including personal information, academic performance, attendance rate, family background, extracurricular activities, etc. Our work offers a comprehensive view for understanding the various factors affecting students' academic performance. In order to improve the accuracy and robustness of student performance classification, we adopted a Gaussian Distribution based Data Augmentation technique (GDO), combined with multiple Deep Learning (DL) and Machine Learning (ML) models. We explored the application of different Machine Learning and Deep Learning models in classifying student grades, and different feature combinations and data augmentation techniques were used to evaluate the performance of multiple models in classification tasks. In addition, we also checked the synthetic data's effectiveness with variance homogeneity tests and P-values, and studied how the oversampling rate affects actual classification results. Research has shown that the RBFN model based on educational habit features performs best after using GDO data augmentation, with an accuracy of 94.12% and an F1 score of 94.46%. These results provide valuable references for the classification of student grades and the development of intervention strategies. New methods and perspectives in the field of educational data analysis are proposed in our study, which also promotes innovation and development in the intelligence of education systems.
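As a hedged sketch of Gaussian-distribution-based oversampling in the spirit of GDO (not the authors' exact procedure), synthetic minority-class samples can be drawn from a Gaussian fitted to the existing minority-class features:

```python
# Hedged sketch of Gaussian-based minority-class oversampling (not the exact GDO used in the study).
import numpy as np

def gaussian_oversample(X_min, rate=1.0, rng=None):
    """Draw synthetic minority samples from a Gaussian fitted to X_min."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])  # regularized
    n_new = int(rate * len(X_min))
    return rng.multivariate_normal(mean, cov, size=n_new)

X_minority = np.random.default_rng(1).normal(size=(40, 5))   # placeholder features
X_synth = gaussian_oversample(X_minority, rate=0.5)
print(X_synth.shape)                                          # (20, 5)
```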
The terms and conditions for using this dataset are specified in the [LICENCE](LICENCE) file included in this repository. Please review these terms carefully before accessing or using the data.
For additional information about the dataset, please contact:
- Name: Angela Lombardi
- Affiliation: Department of Electrical and Information Engineering, Politecnico di Bari
- Email: angela.lombardi@poliba.it
The dataset can be accessed through our dedicated web platform. To request access:
1. Visit the main dataset page at: https://sisinflab.poliba.it/neurosense-dataset-request/
2. Follow the instructions on the website to submit your access request
3. Upon approval, you will receive further instructions for downloading the data
Please ensure you have read and agreed to the terms in the data user agreement before requesting access.
The experiment consists of 40 sessions per user. During each session, users are asked to watch a music video with the aim of understanding their emotions.
Recordings are performed with a Muse EEG headset at a 256 Hz sampling rate.
Channels are recorded as follows:
- Channel 0: AF7
- Channel 1: TP9
- Channel 2: TP10
- Channel 3: AF8
The chosen songs have various Last.fm tags in order to elicit different feelings. The title of every track can be found in the "TaskName" field of sub-ID***_ses-S***_task-Default_run-001_eeg.json, while the author, the Last.fm tag, and additional information are given in "TaskDescription".
The subject pool consists of 30 college students aged between 18 and 35; 16 of them are male, 14 female.
The experiment was performed using the same procedures as those used to create the [Deap Dataset](https://www.eecs.qmul.ac.uk/mmv/datasets/deap/), a dataset for recognizing emotions via a Brain-Computer Interface (BCI).
Firstly, music videos were selected. Once 40 songs were picked, the protocol was chosen and the self-assessment
questionnaire was created.
In order to evaluate the stimuli, Russell's VAD (Valence-Arousal-Dominance) scale was used. In this scale, the valence-arousal space can be divided into four quadrants:
- Low Arousal/Low Valence (LALV);
- Low Arousal/High Valence (LAHV);
- High Arousal/Low Valence (HALV);
- High Arousal/High Valence (HAHV).
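A minimal sketch of mapping self-assessment ratings onto these quadrants, assuming a rating scale whose midpoint separates "low" from "high":

```python
# Hedged sketch: assign a rating pair to one of the four quadrants above,
# assuming the scale midpoint separates "low" from "high".
def vad_quadrant(valence, arousal, midpoint=5.0):
    a = "HA" if arousal > midpoint else "LA"
    v = "HV" if valence > midpoint else "LV"
    return a + v

print(vad_quadrant(7.2, 3.1))  # LAHV
```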
The experiment was performed in a laboratory located at DEI Department of
[Politecnico di Bari](https://www.poliba.it/).
Data recorded during two user sessions (S019 - Session 2 and ID021 - Session 23) was corrupted and is therefore missing.
Sessions S033 and S038 of user ID015 show a calculated effective sampling rate lower than 256 Hz:
- ID015_ses-S033 has 226.1320 Hz
- ID015_ses-S038 has 216.9549 Hz
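A minimal sketch of how such an effective sampling rate could be computed, assuming per-sample timestamps in seconds are available for a recording:

```python
# Hedged sketch: one way an effective sampling rate could be estimated,
# assuming per-sample timestamps (in seconds) are available for a recording.
import numpy as np

def effective_sampling_rate(timestamps):
    """Number of sample intervals divided by total recorded duration."""
    duration = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / duration

t = np.arange(0, 60, 1 / 256.0)               # ideal 256 Hz recording, 60 s
print(round(effective_sampling_rate(t), 4))   # ~256.0
```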
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Global measurements of ocean pCO2 are critical to monitor and understand changes in the global carbon cycle. However, pCO2 observations remain sparse, as they are mostly collected on opportunistic ship tracks. Several approaches, especially based on machine learning, have been used to upscale and extrapolate sparse point data to dense global estimates based on globally available input features. However, those estimates tend to exhibit spatially heterogeneous performance. As a result, we propose a physics-informed transfer learning workflow to generate dense pCO2 estimates. The model is initially trained on synthetic Earth system model data and then adjusted (using transfer learning) to the actual sparse SOCAT observational data, thus leveraging both the spatial and temporal correlations pre-learned on physics-informed Earth system ensembles. Compared to the benchmark upscaling of SOCAT point-wise data with baseline models, our transfer learning methodology shows a major improvement of 30-52%. Our strategy thus presents new monthly global pCO2 estimates spanning 35 years, from 1982 to 2017.
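A hedged sketch of the pretrain-then-fine-tune idea described here (a toy model, not the authors' architecture): pretrain on plentiful synthetic Earth-system-model output, then adapt to the sparse observational data, here by freezing early layers and lowering the learning rate, which is one common recipe:

```python
# Hedged sketch of transfer learning: pretrain on dense synthetic data, then
# fine-tune on sparse observations. Model size and recipe are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Stage 1: pretrain on (placeholder) dense synthetic Earth-system-model data.
X_syn, y_syn = torch.randn(5000, 8), torch.randn(5000, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad(); loss_fn(model(X_syn), y_syn).backward(); opt.step()

# Stage 2: transfer to sparse observations -- freeze the first layer and
# fine-tune the rest at a lower learning rate.
for p in model[0].parameters():
    p.requires_grad = False
X_obs, y_obs = torch.randn(300, 8), torch.randn(300, 1)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
for _ in range(100):
    opt.zero_grad(); loss_fn(model(X_obs), y_obs).backward(); opt.step()
```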
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense features" that represent chemical descriptors, such as molecular weight, solubility, or surface area, and 272,776 "sparse features" that represent chemical substructures (ECFP10, DFS6, DFS8; stored in Matrix Market format). Machine learning methods can use either the sparse or the dense data, or combine them. For each sample there are 12 binary labels that represent the outcome (active/inactive) of 12 different toxicological experiments. Note that the label matrix contains many missing values (NAs). The original data source and Tox21 challenge site is https://tripod.nih.gov/tox21/challenge/.
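A hedged loading sketch for combining the sparse Matrix Market features with the dense descriptors; the file names below are placeholders, not the official archive layout:

```python
# Hedged sketch: read sparse substructure features (Matrix Market) and stack them
# with dense descriptors; both file names are hypothetical placeholders.
import numpy as np
import scipy.sparse as sp
from scipy.io import mmread

X_sparse = mmread("tox21_sparse_train.mtx").tocsr()            # hypothetical file name
X_dense = np.loadtxt("tox21_dense_train.csv", delimiter=",")   # hypothetical file name

# Combine dense chemical descriptors with sparse substructure features.
X = sp.hstack([sp.csr_matrix(X_dense), X_sparse], format="csr")
print(X.shape)
```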
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/