Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
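As a hedged illustration of the sparse-regression ingredient (a conceptual sketch, not the authors' code or the ASRS corpus), an L1-penalized classifier trained on tf-idf features keeps only a few terms with nonzero weights, and those terms act as a comparative summary of what distinguishes the two corpora:

```python
# Hedged sketch: L1-regularized logistic regression as a comparative summarizer.
# Documents and labels below are placeholders, not the ASRS corpus used in the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["pilot crossed hold short line without clearance",
        "tower issued go-around due to runway incursion",
        "smooth departure, no anomalies reported",
        "routine climb to cruise altitude"]
labels = [1, 1, 0, 0]  # 1 = incursion-related corpus, 0 = comparison corpus

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# The L1 penalty drives most term weights to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

terms = np.array(vec.get_feature_names_out())
nonzero = clf.coef_[0] != 0
print(sorted(zip(clf.coef_[0][nonzero], terms[nonzero]), reverse=True))
```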
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
EMG data for classifier evaluation
MATLAB code + demo to reproduce results for "Sparse Principal Component Analysis with Preserved Sparsity". This code calculates the principal loading vectors for any given high-dimensional data matrix. The advantage of this method over existing sparse-PCA methods is that it can produce principal loading vectors with the same sparsity pattern for any number of principal components. Please see Readme.md for more information.
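The dataset itself ships MATLAB code; purely as a conceptual sketch of sparse principal components in Python (ordinary sparse PCA, not the preserved-sparsity method provided here), scikit-learn's SparsePCA shows how sparse loading vectors look:

```python
# Conceptual sketch of sparse PCA (not the preserved-sparsity method in this dataset):
# each loading vector has many exactly-zero entries, aiding interpretability.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # placeholder high-dimensional data matrix

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
spca.fit(X)

# Note: with ordinary sparse PCA the zero pattern generally differs across
# components; the method in this dataset enforces a shared sparsity pattern.
for i, comp in enumerate(spca.components_):
    print(f"component {i}: {np.count_nonzero(comp)} nonzero of {comp.size}")
```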
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Low-energy electron microscopy (LEEM) taken as intensity-voltage (I-V) curves provides hyperspectral images of surfaces, which can be used to identify the surface type but are difficult to analyze. Here, we demonstrate the use of an algorithm for factorizing the data into spectra and concentrations of characteristic components (FSC3) for identifying distinct physical surface phases. Importantly, FSC3 is an unsupervised and fast algorithm. As example data we use experiments on the growth of praseodymium oxide or ruthenium oxide on ruthenium single-crystal substrates, both featuring a complex distribution of coexisting surface components varying in both chemical composition and crystallographic structure. With the factorization result, a sparse sampling method is demonstrated, reducing the measurement time by 1-2 orders of magnitude, which is relevant for dynamic surface studies. The FSC3 concentrations provide the features for a support vector machine (SVM) based supervised classification of the surface types. Here, specific surface regions which have been identified structurally, via their diffraction pattern, as well as chemically by complementary spectro-microscopic techniques, are used as training sets. A reliable classification is demonstrated on both exemplary LEEM I-V datasets. Research results are published at https://arxiv.org/abs/2203.12353. The data available represents the concentration maps obtained by FSC3 in TIFF format, together with the associated spectra as ASCII. Similarly, the results of the classification algorithm are available as TIFF images, while the average concentrations and spectra calculated over the training and testing regions are given as ASCII data. The raw data are also given as TIFF images, which can be used to test the FSC3 and classification algorithms (available at https://langsrv.astro.cf.ac.uk/HIA/HIA.html and https://github.com/masiaf-cf/leem-svm-classify, respectively). Research results based upon these data are published at https://doi.org/10.1111/jmi.13155
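As a hedged, conceptual sketch of the overall pipeline (factorize the hyperspectral stack, then classify pixels from the concentration features), the snippet below uses scikit-learn's NMF as a generic stand-in for FSC3 and synthetic data in place of LEEM I-V measurements:

```python
# Hedged sketch of the factorize-then-classify pipeline, using NMF as a generic
# stand-in for FSC3 and synthetic data instead of LEEM I-V measurements.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pixels, energies, k = 500, 80, 3
spectra_true = np.abs(rng.normal(size=(k, energies)))
conc_true = rng.dirichlet(np.ones(k), size=pixels)
data = conc_true @ spectra_true                      # hyperspectral stack, one row per pixel

# Factorize into per-pixel concentrations and component spectra.
nmf = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
conc = nmf.fit_transform(data)                       # concentrations (features for the SVM)

# Supervised classification of surface type from the concentration features,
# with labels assumed to come from annotated training regions.
labels = conc_true.argmax(axis=1)                    # placeholder labels
svm = SVC(kernel="rbf").fit(conc[:400], labels[:400])
print("held-out accuracy:", svm.score(conc[400:], labels[400:]))
```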
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.
The dataset accompanies the publication "System Design for a Data-driven and Explainable Customer Sentiment Monitor" (submitted). We provide an anonymized version of data collected over a period of two years.
The dataset should fuel the research and development of new machine learning algorithms that better cope with real-world data challenges, including sparse and noisy labels and concept drift. An additional challenge is the optimal fusion of enterprise- and log-based features for the prediction task. Interpretability of the designed prediction models should also be ensured in order to maintain practical relevance.
Supporting software
Kindly use the corresponding GitHub repository (https://github.com/annguy/customer-sentiment-monitor) to design and benchmark your algorithms.
Citation and Contact
If you use this dataset, please cite the following publication:
@ARTICLE{9520354,
author={Nguyen, An and Foerstel, Stefan and Kittler, Thomas and Kurzyukov, Andrey and Schwinn, Leo and Zanca, Dario and Hipp, Tobias and Jun, Sun Da and Schrapp, Michael and Rothgang, Eva and Eskofier, Bjoern},
journal={IEEE Access},
title={System Design for a Data-Driven and Explainable Customer Sentiment Monitor Using IoT and Enterprise Data},
year={2021},
volume={9},
number={},
pages={117140-117152},
doi={10.1109/ACCESS.2021.3106791}}
If you would like to get in touch, please contact an.nguyen@fau.de.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MetaFlux is a global, long-term carbon flux dataset of gross primary production and ecosystem respiration that is generated using meta-learning. The principle of meta-learning stems from the need to solve the problem of learning in the face of sparse data availability. Data sparsity is a prevalent challenge in climate and ecology science. For instance, in-situ observations tend to be spatially and temporally sparse. This issue can arise from sensor malfunctions, limited sensor locations, or non-ideal climate conditions such as persistent cloud cover. The lack of high-quality continuous data can make it difficult to understand many climate processes that are otherwise critical. The machine-learning community has attempted to tackle this problem by developing several learning approaches, including meta-learning, which learns how to learn broad features across tasks to better infer other, poorly sampled ones. In this work, we applied meta-learning to the problem of upscaling continuous carbon fluxes from sparse observations. Data scarcity in carbon flux applications is particularly problematic in the tropics and semi-arid regions, where only around 8–11% of long-term eddy covariance stations are currently operational. Unfortunately, these regions are important in modulating the global carbon cycle and its interannual variability. In general, we find that meta-trained machine learning models, including multi-layer perceptrons (MLP), long short-term memory networks (LSTM), and bi-directional LSTMs (BiLSTM), have 9–16% lower validation errors on flux estimates when compared to their non-meta-trained counterparts. In addition, meta-trained models are more robust to extreme conditions, with 4–24% lower overall errors. Finally, we use an ensemble of meta-trained deep networks to upscale in-situ observations of ecosystem-scale photosynthesis and respiration fluxes into daily and monthly global products at a 0.25-degree spatial resolution from 2001 to 2023, called "MetaFlux". We also checked the seasonality, interannual variability, and correlation to solar-induced fluorescence of the upscaled product and found that MetaFlux outperformed state-of-the-art machine learning upscaling models, especially in critical semi-arid and tropical regions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Induced sparsity in the factor loading matrix identifies the factor basis, while rotational identification is obtained ex post by clustering methods closely related to machine learning. We extract meaningful economic concepts from a high-dimensional data set, which together with observed variables follow an unrestricted, reduced-form VAR process. Including a comprehensive set of economic concepts allows reliable, fundamental structural analysis, even of the factor augmented VAR itself. We illustrate this by combining two structural identification methods to further analyze the model. To account for the shift in monetary policy instruments triggered by the Great Recession, we follow separate strategies to identify monetary policy shocks. Comparing ours to other parametric and non-parametric factor estimates uncovers advantages of parametric sparse factor estimation in a high dimensional data environment. Besides meaningful factor extraction, we gain precision in the estimation of factor loadings.
According to our latest research, the AI Sparsity Engine market size reached USD 1.19 billion globally in 2024, with a robust year-on-year growth propelled by advancements in deep learning optimization and efficient neural network deployment. The market is forecasted to expand at a CAGR of 34.7% from 2025 to 2033, reaching an estimated USD 16.1 billion by 2033. This exceptional growth trajectory is primarily driven by the increasing demand for computational efficiency in AI workloads and the widespread adoption of AI sparsity engines across diverse industry verticals, as per our latest research findings.
The primary growth factor for the AI Sparsity Engine market is the surging need for high-performance and energy-efficient AI models, particularly in edge computing and data center environments. As organizations worldwide seek to deploy complex AI models on resource-constrained hardware, sparsity engines have emerged as essential tools for pruning redundant parameters and optimizing model size without sacrificing accuracy. This capability is vital for accelerating AI inference, reducing computational costs, and extending battery life in edge devices. Furthermore, the proliferation of AI-powered applications in sectors such as healthcare, automotive, and finance has intensified the demand for scalable and efficient AI solutions, thus fueling the adoption of AI sparsity engines.
Another significant driver is the rapid evolution of AI algorithms and neural network architectures, which increasingly rely on sparsity techniques to enhance model interpretability and scalability. The integration of AI sparsity engines with mainstream machine learning frameworks and hardware accelerators has simplified the deployment process, enabling enterprises to seamlessly integrate sparsity into their existing AI pipelines. Additionally, the growing focus on sustainable AI and green computing has positioned sparsity engines as a key enabler for reducing the energy footprint of large-scale AI deployments. As regulatory pressures and corporate sustainability goals intensify, organizations are prioritizing technologies that deliver both performance and energy efficiency, thereby boosting the AI sparsity engine market.
A further catalyst for market expansion is the increasing investment in AI research and development, particularly in emerging economies. Governments and private sector players are allocating substantial resources to advance AI infrastructure and foster innovation in AI model optimization. The availability of open-source AI sparsity toolkits and collaborative research initiatives has democratized access to cutting-edge sparsity techniques, accelerating market penetration across small and medium enterprises (SMEs) and large enterprises alike. The convergence of AI sparsity engines with complementary technologies such as federated learning and secure AI is also opening new avenues for market growth, especially in privacy-sensitive industries.
From a regional perspective, North America currently dominates the AI Sparsity Engine market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The United States, in particular, has witnessed significant adoption of AI sparsity engines across its technology, healthcare, and financial sectors, driven by a mature AI ecosystem and strong R&D investments. Meanwhile, Asia Pacific is poised for the fastest growth throughout the forecast period, fueled by rapid digital transformation, expanding AI infrastructure, and rising government initiatives to promote AI innovation in countries such as China, Japan, and South Korea. Europe is also experiencing steady growth, supported by robust regulatory frameworks and increasing focus on sustainable AI solutions.
The AI Sparsity Engine market by component is segmented into software, hardware, and services, each playing a pivotal role in the overall ecosystem. Software solutions currently dominate the market, accounting for the largest share in 202
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Work in progress...
This dataset was developed in the context of my master's thesis titled "Physics-Guided Deep Learning for Sparse Data-Driven Brain Shift Registration", which investigates the integration of physics-based biomechanical modeling into deep learning frameworks for the task of brain shift registration. The core objective of this project is to improve the accuracy and reliability of intraoperative brain shift prediction by enabling deep neural networks to interpolate sparse intraoperative data under biomechanical constraints. Such capabilities are critical for enhancing image-guided neurosurgery systems, especially when full intraoperative imaging is unavailable or impractical.
The dataset integrates and extends data from two publicly available sources: ReMIND and UPENN-GBM. A total of 207 patient cases (45 cases from ReMIND and 162 cases from UPENN-GBM), each represented as a separate folder with all relevant data grouped per case, are included in this dataset. It contains preoperative imaging (unstripped), synthetic ground truth displacement fields, anatomical segmentations, and keypoints, structured to support machine learning and registration tasks.
For details on the image acquisition and other topics related to the original datasets, see their original links above.
Each patient folder contains the following subfolders:
- images/: Preoperative MRI scans (T1ce, T2) in NIfTI format.
- segmentations/: Brain and tumor segmentations in NRRD format.
- simulations/: Biomechanically simulated displacement fields with initial and final point coordinates (LPS) in .npz and .txt formats, respectively (a loading sketch follows below).
- keypoints/: 3D SIFT-Rank keypoints and their descriptors in both voxel space and world coordinates (RAS?) as .key files.
The folder naming and organization are consistent across patients for ease of use and scripting.
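For the simulations/ subfolder referenced above, a minimal sketch for inspecting one displacement-field archive with NumPy; the case path is a placeholder, and the stored array names are simply listed rather than assumed:

```python
# Hedged sketch: inspect one simulated displacement field archive; the path below is
# a placeholder, and the array names are whatever the .npz file actually stores.
import numpy as np

arrays = np.load("case_001/simulations/displacement_field.npz")  # hypothetical path
print(arrays.files)                      # list the stored array names
for name in arrays.files:
    print(name, arrays[name].shape)
```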
ReMIND is a multimodal imaging dataset of 114 brain tumor patients who underwent image-guided surgical resection at Brigham and Women's Hospital, containing preoperative MRI, intraoperative MRI, and 3D intraoperative ultrasound data. It includes over 300 imaging series and 350 expert-annotated segmentations such as tumors, resection cavities, cerebrum, and ventricles. Demographic and clinico-pathological information (e.g., tumor type, grade, eloquence) is also provided.
UPENN-GBM comprises multi-parametric MRI scans from de novo glioblastoma (GBM) patients treated at the University of Pennsylvania Health System. It includes co-registered and skull-stripped T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR images. The dataset features high-quality tumor and brain segmentation labels, initially produced by automated methods and subsequently corrected and approved by board-certified neuroradiologists. Alongside imaging data, the collection provides comprehensive clinical metadata including patient demographics, genomic profiles, survival outcomes, and tumor progression indicators.
This dataset is tailored for researchers and developers working on:
It is especially well-suited for evaluating learning-based registration methods that incorporate physical priors or aim to generalize under sparse supervision.
MIT License: https://opensource.org/licenses/MIT
This repository contains data, code, and model weights for reproducing the main results of the paper, Insights on Galaxy Evolution from Interpretable Sparse Feature Networks (see arXiv preprint). Specifically, we provide data files (`images-sdss.tar.gz` and `galaxies.csv`), a snapshot of the code base (sparse-feature-networks v1.0.0), and model weights (`resnet18-topk_4-metallicity.pth`, `resnet18-topk_4-bpt_lines.pth`). These are described in detail below.
`galaxies.csv` is the main galaxy sample after we have applied the cuts described in the paper (250,224 rows). We include 30 columns queried from the SDSS galSpecInfo, galSpecLine, and galSpecExtra tables (a loading sketch follows the column list):
objID (int64)
DR7ObjID (int64)
specObjID (int64)
ra (float32)
dec (float32)
z (float32)
zErr (float32)
velDisp (float32)
velDispErr (float32)
modelMag_u (float32)
modelMag_g (float32)
modelMag_r (float32)
modelMag_i (float32)
modelMag_z (float32)
petroMag_r (float32)
petroR50_r (float32)
petroR90_r (float32)
bptclass (int32)
oh_p50 (float32)
lgm_tot_p50 (float32)
sfr_tot_p50 (float32)
nii_6584_flux (float32)
nii_6584_flux_err (float32)
h_alpha_flux (float32)
h_alpha_flux_err (float32)
oiii_5007_flux (float32)
oiii_5007_flux_err (float32)
h_beta_flux (float32)
h_beta_flux_err (float32)
reliable (int32)
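A minimal loading sketch for `galaxies.csv` using the columns listed above (the bptclass coding used in the last line is an assumption; check the SDSS galSpecExtra documentation):

```python
# Hedged sketch: load the table and use a few of the documented columns.
import pandas as pd

df = pd.read_csv("galaxies.csv")
print(df.shape)                          # expected (250224, 30)

reliable = df[df["reliable"] == 1]       # apply the quality flag
colors = df["modelMag_g"] - df["modelMag_r"]
star_forming = df[df["bptclass"] == 1]   # bptclass coding assumed; see SDSS docs
```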
`images-sdss.tar.gz` is a compressed directory containing 250,224 image cutouts from the DESI Legacy Imaging Surveys viewer. Each cutout was generated using the RESTful call `http://legacysurvey.org/viewer/cutout.jpg?ra={ra}&dec={dec}&pixscale=0.262&layer=sdss&size=160`, where `ra` and `dec` are taken directly from `galaxies.csv`. Each image is named using the format `{objID}.jpg`, with the `objID` again taken from `galaxies.csv`.
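A hedged sketch of how one cutout could be regenerated from the RESTful call quoted above, assuming the requests package is available:

```python
# Hedged sketch: rebuild one cutout URL from a row of galaxies.csv and download it.
import pandas as pd
import requests

df = pd.read_csv("galaxies.csv")
row = df.iloc[0]
url = ("http://legacysurvey.org/viewer/cutout.jpg"
       f"?ra={row['ra']}&dec={row['dec']}&pixscale=0.262&layer=sdss&size=160")

resp = requests.get(url, timeout=30)
resp.raise_for_status()
with open(f"{int(row['objID'])}.jpg", "wb") as f:
    f.write(resp.content)
```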
The code is a snapshot of https://github.com/jwuphysics/sparse-feature-networks at v1.0.0. After unpacking the images and moving them into the `./data` directory, the directory structure should look like:
./
├── data/
│ ├── images-sdss/
│ └── galaxies.csv
├── model/
├── results/
└── src/
├── config.py
├── dataloader.py
├── model.py
├── main.py
└── trainer.py
In order to run the analysis and reproduce the main results of the paper, you must first create the software environment with `pip install torch fastai numpy pandas matplotlib cmasher tqdm`, and then simply run `python src/main.py`.
The trained model weights (`resnet18-topk_4-metallicity.pth`, `resnet18-topk_4-bpt_lines.pth`) are provided here for reproducing the exact results from the paper. They are compatible with the `ResNet18TopK` class defined in `src/model.py`, and the weights can be stored in the `./model` directory. Alternatively, you can train your own models (e.g., by using the functions defined in `src/trainer.py`) and save them natively with PyTorch.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
EMG data for classifier evaluation
This is a scikit-learn compatible Python implementation of Stabl, coupled with useful functions and example notebooks to rerun the analyses on the different use cases located in the Sample data folder of the code library and in the data.zip folder of this repository.
Python version: from 3.7 up to 3.10
Python packages:
Julia package for noise generation (version 1.9.2):
To install Julia, please follow these instructions:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The application of machine learning to theoretical chemistry has made it possible to combine the accuracy of quantum chemical energetics with the thorough sampling of finite-temperature fluctuations. To reach this goal, a diverse set of methods has been proposed, ranging from simple linear models to kernel regression and highly nonlinear neural networks. Here we apply two widely different approaches to the same, challenging problem - the sampling of the conformational landscape of polypeptides at finite temperature. We develop a Local Kernel Regression (LKR) coupled with a supervised sparsity method and compare it with a more established approach based on Behler-Parrinello type Neural Networks. In the context of the LKR, we discuss how the supervised selection of the reference pool of environments is crucial to achieve accurate potential energy surfaces at a competitive computational cost and leverage the locality of the model to infer which chemical environments are poorly described by the DFTB baseline. We then discuss the relative merits of the two frameworks and perform Hamiltonian-reservoir replica-exchange Monte Carlo sampling and metadynamics simulations, respectively, to demonstrate that both frameworks can achieve converged and transferable sampling of the conformational landscape of complex and flexible biomolecules with comparable accuracy and computational cost.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains model parameters and protein structures described in the manuscript "Efficient protein structure generation with sparse denoising models".
"salad-0.1.0.tar.gz" contains the snapshot of the salad code-base used in the manuscript.
The parameters for the salad (sparse all-atom denoising) models described in the manuscript are contained in "salad_params.tar.gz". This unpacks to a directory "params/", which contains pickled parameter files for a number of model variants:
In addition to salad model parameters, we also provide the parameters for the autoencoder models described in the manuscript in "ae_params.tar.gz". This unpacks to a directory "ae_params/", which contains the following checkpoints:
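Since the parameter files are described as pickled, a minimal inspection sketch; the file name below is a placeholder, not an actual variant name from the archive:

```python
# Hedged sketch: inspect one pickled parameter file from params/ (placeholder name).
import pickle

with open("params/default_vp.pkl", "rb") as f:   # hypothetical file name
    params = pickle.load(f)

# Parameter containers are typically nested dicts of arrays; print the top-level keys.
print(type(params))
if isinstance(params, dict):
    print(list(params.keys())[:10])
```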
The protein structures generated using salad, as well as their corresponding sequences generated using ProteinMPNN and predicted structures using ESMfold are contained in "data_package.tar.gz". This archive unpacks to a directory "data_package/" which contains subdirectories for each protein design task described in the manuscript "Efficient protein structure generation with sparse denoising models":
This directory contains subdirectories named "
In addition, there are subdirectories with "random" in their name, instead of a number of steps, e.g. "default_vp_scaled-200-random-esm/". These subdirectories contain data generated using random secondary structure conditioning.
Each subdirectory has the same underlying structure:
Same as "monomers/", but contains data generated using RFdiffusion and Genie 2 for protein sizes between 50 and 400 amino acids.
This directory contains the subdirectories named "ve-seg-
This directory contains generated structures for the motif-scaffolding benchmark described by Lin et al., 2024 [1]. It contains two subdirectories:
Each of these subdirectories has the same structure as the directories "monomers/" and "shape/", with one subdirectory per motif PDB file in the motif-scaffolding benchmark, e.g. "cond/multimotif_vp-1bcf.pdb-esm/" or "nocond/default_vp-1bcf.pdb-esm/". These directories contain the usual "backbones/" and "predictions/" subdirectories, as well as a file "motif_scores.csv". This has fields analogous to "scores.csv", with the addition of two additional fields for motif-RMSD:
A designed sequence-structure pair is only considered successful if sc_rmsd < 2 Å, plddt > 70 and motif_rmsd_bb < 1 Å.
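A minimal sketch for applying these thresholds to a motif_scores.csv file, assuming its columns are named after the quantities above:

```python
# Hedged sketch: count designs passing the stated success thresholds, assuming the
# motif_scores.csv columns are named as in the criteria above.
import pandas as pd

scores = pd.read_csv("motif_scores.csv")
passed = scores[(scores["sc_rmsd"] < 2.0) &
                (scores["plddt"] > 70) &
                (scores["motif_rmsd_bb"] < 1.0)]
print(f"{len(passed)} / {len(scores)} designs successful")
```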
This directory contains generated structures for symmetric repeat proteins using both VP and VE models with structure-editing. Subdirectories are named by model type ("default_vp", "default_ve_minimal_timeless"), symmetry ("C
This directory contains generated structures for designed multi-state proteins. In our manuscript we compare two different approaches to multi-state design using salad which are reflected in two subdirectories of "confchange/":
Both share the same directory structure:
According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.
One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.
Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.
The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.
From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and early adopters of high-performance computing solutions. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, expanding AI research, and significant investments in data infrastructure across China, Japan, and India. Europe follows closely, with robust demand for advanced analytics and scientific computing in sectors such as automotive, healthcare, and finance. Latin America and Middle East & Africa are gradually emerging as promising markets, supported by increasing investments in IT modernization and digitalization initiatives.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Academic achievement is an important index to measure the quality of education and students' learning outcomes. Reasonable and accurate prediction of academic achievement can help improve teachers' educational methods, and it also provides corresponding data support for the formulation of education policies. However, traditional methods for classifying academic performance have many problems, such as low accuracy, limited ability to handle nonlinear relationships, and poor handling of data sparsity. Based on this, our study analyzes various characteristics of students, including personal information, academic performance, attendance rate, family background, extracurricular activities, etc. Our work offers a comprehensive view for understanding the various factors affecting students' academic performance. In order to improve the accuracy and robustness of student performance classification, we adopted a Gaussian Distribution based Data Augmentation technique (GDO), combined with multiple Deep Learning (DL) and Machine Learning (ML) models. We explored the application of different Machine Learning and Deep Learning models in classifying student grades, and different feature combinations and data augmentation techniques were used to evaluate the performance of multiple models in classification tasks. In addition, we also checked the synthetic data's effectiveness with variance homogeneity tests and P-values, and studied how the oversampling rate affects actual classification results. Research has shown that the RBFN model based on educational habit features performs best after using GDO data augmentation, with an accuracy of 94.12% and an F1 score of 94.46%. These results provide valuable references for the classification of student grades and the development of intervention strategies. New methods and perspectives in the field of educational data analysis are proposed in our study, which also promotes innovation and development in the intelligence of education systems.
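As a hedged sketch of Gaussian-distribution-based oversampling in the spirit of GDO (not the authors' exact procedure), synthetic minority-class samples can be drawn from a Gaussian fitted to the existing minority-class features:

```python
# Hedged sketch of Gaussian-based minority-class oversampling (not the exact GDO used in the study).
import numpy as np

def gaussian_oversample(X_min, rate=1.0, rng=None):
    """Draw synthetic minority samples from a Gaussian fitted to X_min."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])  # regularized
    n_new = int(rate * len(X_min))
    return rng.multivariate_normal(mean, cov, size=n_new)

X_minority = np.random.default_rng(1).normal(size=(40, 5))   # placeholder features
X_synth = gaussian_oversample(X_minority, rate=0.5)
print(X_synth.shape)                                          # (20, 5)
```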
The terms and conditions for using this dataset are specified in the [LICENCE](LICENCE) file included in this repository. Please review these terms carefully before accessing or using the data.
For additional information about the dataset, please contact:
- Name: Angela Lombardi
- Affiliation: Department of Electrical and Information Engineering, Politecnico di Bari
- Email: angela.lombardi@poliba.it
The dataset can be accessed through our dedicated web platform. To request access:
1. Visit the main dataset page at: https://sisinflab.poliba.it/neurosense-dataset-request/
2. Follow the instructions on the website to submit your access request
3. Upon approval, you will receive further instructions for downloading the data
Please ensure you have read and agreed to the terms in the data user agreement before requesting access.
The experiment consists of 40 sessions per user. During each session, users are asked to watch a music video with the aim of understanding their emotions.
Recordings are performed with a Muse EEG headset at a 256 Hz sampling rate.
Channels are recorded as follows:
- Channel 0: AF7
- Channel 1: TP9
- Channel 2: TP10
- Channel 3: AF8
The chosen songs have various Last.fm tags in order to elicit different feelings. The title of every track can be found in the "TaskName" field of sub-ID***_ses-S***_task-Default_run-001_eeg.json, while the author, the Last.fm tag, and additional information are given in "TaskDescription".
The subject pool consists of 30 college students aged between 18 and 35; 16 of them are male, 14 female.
The experiment was performed using the same procedures as those used to create the [Deap Dataset](https://www.eecs.qmul.ac.uk/mmv/datasets/deap/), a dataset for recognizing emotions via a Brain-Computer Interface (BCI).
Firstly, music videos were selected. Once 40 songs were picked, the protocol was chosen and the self-assessment
questionnaire was created.
In order to evaluate the stimuli, Russell's VAD (Valence-Arousal-Dominance) scale was used. In this scale, the valence-arousal space can be divided into four quadrants:
- Low Arousal/Low Valence (LALV);
- Low Arousal/High Valence (LAHV);
- High Arousal/Low Valence (HALV);
- High Arousal/High Valence (HAHV).
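A minimal sketch of mapping self-assessment ratings onto these quadrants, assuming a rating scale whose midpoint separates "low" from "high":

```python
# Hedged sketch: assign a rating pair to one of the four quadrants above,
# assuming the scale midpoint separates "low" from "high".
def vad_quadrant(valence, arousal, midpoint=5.0):
    a = "HA" if arousal > midpoint else "LA"
    v = "HV" if valence > midpoint else "LV"
    return a + v

print(vad_quadrant(7.2, 3.1))  # LAHV
```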
The experiment was performed in a laboratory located at DEI Department of
[Politecnico di Bari](https://www.poliba.it/).
Data recorded during two user sessions (S019 - Session 2 and ID021 - Session 23) was corrupted and is therefore missing.
Sessions S033 and S038 of user ID015 show a calculated effective sampling rate lower than 256 Hz:
- ID015_ses-S033 has 226.1320 Hz
- ID015_ses-S038 has 216.9549 Hz
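A minimal sketch of how such an effective sampling rate could be computed, assuming per-sample timestamps in seconds are available for a recording:

```python
# Hedged sketch: one way an effective sampling rate could be estimated,
# assuming per-sample timestamps (in seconds) are available for a recording.
import numpy as np

def effective_sampling_rate(timestamps):
    """Number of sample intervals divided by total recorded duration."""
    duration = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / duration

t = np.arange(0, 60, 1 / 256.0)               # ideal 256 Hz recording, 60 s
print(round(effective_sampling_rate(t), 4))   # ~256.0
```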
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Global measurements of ocean pCO2 are critical to monitor and understand changes in the global carbon cycle. However, pCO2 observations remain sparse, as they are mostly collected on opportunistic ship tracks. Several approaches, especially based on machine learning, have been used to upscale and extrapolate sparse point data to dense global estimates based on globally available input features. However, those estimates tend to exhibit spatially heterogeneous performance. As a result, we propose a physics-informed transfer learning workflow to generate dense pCO2 estimates. The model is initially trained on synthetic Earth system model data and then adjusted (using transfer learning) to the actual sparse SOCAT observational data, thus leveraging both the spatial and temporal correlations pre-learned on physics-informed Earth system ensembles. Compared to the benchmark upscaling of SOCAT point-wise data with baseline models, our transfer learning methodology shows a major improvement of 30-52%. Our strategy thus presents new monthly global pCO2 estimates spanning 35 years, from 1982 to 2017.
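A hedged sketch of the pretrain-then-fine-tune idea described here (a toy model, not the authors' architecture): pretrain on plentiful synthetic Earth-system-model output, then adapt to the sparse observational data, here by freezing early layers and lowering the learning rate, which is one common recipe:

```python
# Hedged sketch of transfer learning: pretrain on dense synthetic data, then
# fine-tune on sparse observations. Model size and recipe are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Stage 1: pretrain on (placeholder) dense synthetic Earth-system-model data.
X_syn, y_syn = torch.randn(5000, 8), torch.randn(5000, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad(); loss_fn(model(X_syn), y_syn).backward(); opt.step()

# Stage 2: transfer to sparse observations -- freeze the first layer and
# fine-tune the rest at a lower learning rate.
for p in model[0].parameters():
    p.requires_grad = False
X_obs, y_obs = torch.randn(300, 8), torch.randn(300, 1)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
for _ in range(100):
    opt.zero_grad(); loss_fn(model(X_obs), y_obs).backward(); opt.step()
```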
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense features" that represent chemical descriptors, such as molecular weight, solubility, or surface area, and 272,776 "sparse features" that represent chemical substructures (ECFP10, DFS6, DFS8; stored in Matrix Market format). Machine learning methods can use either the sparse or the dense data, or combine them. For each sample there are 12 binary labels that represent the outcome (active/inactive) of 12 different toxicological experiments. Note that the label matrix contains many missing values (NAs). The original data source and Tox21 challenge site is https://tripod.nih.gov/tox21/challenge/.
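A hedged loading sketch for combining the sparse Matrix Market features with the dense descriptors; the file names below are placeholders, not the official archive layout:

```python
# Hedged sketch: read sparse substructure features (Matrix Market) and stack them
# with dense descriptors; both file names are hypothetical placeholders.
import numpy as np
import scipy.sparse as sp
from scipy.io import mmread

X_sparse = mmread("tox21_sparse_train.mtx").tocsr()            # hypothetical file name
X_dense = np.loadtxt("tox21_dense_train.csv", delimiter=",")   # hypothetical file name

# Combine dense chemical descriptors with sparse substructure features.
X = sp.hstack([sp.csr_matrix(X_dense), X_sparse], format="csr")
print(X.shape)
```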
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/