95 datasets found
  1. Data from: Sparse Machine Learning Methods for Understanding Large Text...

    • catalog.data.gov
    • s.cnmilf.com
    • +3 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://catalog.data.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
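
    The following minimal sketch (not the authors' code) illustrates the kind of sparse model the abstract refers to: an L1-penalized logistic classifier trained to separate two small hypothetical corpora, whose few surviving nonzero term weights act as a comparative summary. It assumes scikit-learn and NumPy are installed.

    # Hedged illustration: comparative summarization of two corpora via a sparse
    # (L1-penalized) classifier. corpus_a and corpus_b are hypothetical examples.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    corpus_a = ["runway incursion during taxi", "tower cleared the runway crossing"]
    corpus_b = ["altitude deviation in cruise", "autopilot disengaged unexpectedly"]
    docs = corpus_a + corpus_b
    labels = np.array([1] * len(corpus_a) + [0] * len(corpus_b))

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    # The L1 penalty drives most term weights to exactly zero; the remaining
    # nonzero terms summarize what distinguishes corpus A from corpus B.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)

    terms = np.array(vec.get_feature_names_out())
    weights = clf.coef_.ravel()
    for term, w in sorted(zip(terms[weights != 0], weights[weights != 0]),
                          key=lambda t: -abs(t[1])):
        print(f"{term}: {w:+.3f}")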

  2. Industrial Benchmark Dataset for Customer Escalation Prediction

    • zenodo.org
    • opendatalab.com
    • +1 more
    bin
    Updated Sep 6, 2021
    Cite
    An Nguyen; Stefan Foerstel; Thomas Kittler; Andrey Kurzyukov; Leo Schwinn; Dario Zanca; Tobias Hipp; Sun Da Jun; Michael Schrapp; Eva Rothgang; Bjoern Eskofier (2021). Industrial Benchmark Dataset for Customer Escalation Prediction [Dataset]. http://doi.org/10.5281/zenodo.4383145
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    An Nguyen; Stefan Foerstel; Thomas Kittler; Andrey Kurzyukov; Leo Schwinn; Dario Zanca; Tobias Hipp; Sun Da Jun; Michael Schrapp; Eva Rothgang; Bjoern Eskofier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.

    The dataset accompanies the publication "System Design for a Data-driven and Explainable Customer Sentiment Monitor" (submitted). We provide an anonymized version of data collected over a period of two years.

    The dataset should fuel the research and development of new machine learning algorithms that better cope with real-world data challenges, including sparse and noisy labels and concept drift. An additional challenge is the optimal fusion of enterprise and log-based features for the prediction task. Interpretability of the designed prediction models should also be ensured so that they have practical relevance.

    Supporting software

    Kindly use the corresponding GitHub repository (https://github.com/annguy/customer-sentiment-monitor) to design and benchmark your algorithms.

    Citation and Contact

    If you use this dataset, please cite the following publication:


    @ARTICLE{9520354,
     author={Nguyen, An and Foerstel, Stefan and Kittler, Thomas and Kurzyukov, Andrey and Schwinn, Leo and Zanca, Dario and Hipp, Tobias and Jun, Sun Da and Schrapp, Michael and Rothgang, Eva and Eskofier, Bjoern},
     journal={IEEE Access}, 
     title={System Design for a Data-Driven and Explainable Customer Sentiment Monitor Using IoT and Enterprise Data}, 
     year={2021},
     volume={9},
     number={},
     pages={117140-117152},
     doi={10.1109/ACCESS.2021.3106791}}

    If you would like to get in touch, please contact an.nguyen@fau.de.

  3. Data from: Sparse Principal Component Analysis with Preserved Sparsity...

    • researchdata.edu.au
    Updated 2019
    Cite
    Inge Koch; Navid Shokouhi; Abd-Krim Seghouane; Mathematics and Statistics (2019). Sparse Principal Component Analysis with Preserved Sparsity Pattern [Dataset]. http://doi.org/10.24433/CO.4593141.V1
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    Code Ocean
    The University of Western Australia
    Authors
    Inge Koch; Navid Shokouhi; Abd-Krim Seghouane; Mathematics and Statistics
    Description

    MATLAB code + demo to reproduce results for "Sparse Principal Component Analysis with Preserved Sparsity". This code calculates the principal loading vectors for any given high-dimensional data matrix. The advantage of this method over existing sparse-PCA methods is that it can produce principal loading vectors with the same sparsity pattern for any number of principal components. Please see Readme.md for more information.

  4. Data_Sheet_1_A Novel Transfer Learning Approach to Enhance Deep Neural...

    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Hailong Li; Nehal A. Parikh; Lili He (2023). Data_Sheet_1_A Novel Transfer Learning Approach to Enhance Deep Neural Network Classification of Brain Functional Connectomes.pdf [Dataset]. http://doi.org/10.3389/fnins.2018.00491.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Hailong Li; Nehal A. Parikh; Lili He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Early diagnosis remains a significant challenge for many neurological disorders, especially for rare disorders where studying large cohorts is not possible. A novel solution that investigators have undertaken is combining advanced machine learning algorithms with resting-state functional Magnetic Resonance Imaging to unveil hidden pathological brain connectome patterns to uncover diagnostic and prognostic biomarkers. Recently, state-of-the-art deep learning techniques are outperforming traditional machine learning methods and are hailed as a milestone for artificial intelligence. However, whole brain classification that combines brain connectome with deep learning has been hindered by insufficient training samples. Inspired by the transfer learning strategy employed in computer vision, we exploited previously collected resting-state functional MRI data for healthy subjects from existing databases and transferred this knowledge for new disease classification tasks. We developed a deep transfer learning neural network (DTL-NN) framework for enhancing the classification of whole brain functional connectivity patterns. Briefly, we trained a stacked sparse autoencoder (SSAE) prototype to learn healthy functional connectivity patterns in an offline learning environment. Then, the SSAE prototype was transferred to a DTL-NN model for a new classification task. To test the validity of our framework, we collected resting-state functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) repository. Using autism spectrum disorder (ASD) classification as a target task, we compared the performance of our DTL-NN approach with a traditional deep neural network and support vector machine models across four ABIDE data sites that enrolled at least 60 subjects. As compared to traditional models, our DTL-NN approach achieved an improved performance in accuracy, sensitivity, specificity and area under receiver operating characteristic curve. These findings suggest that DTL-NN approaches could enhance disease classification for neurological conditions, where accumulating large neuroimaging datasets has been challenging.
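
    As a rough sketch of the transfer-learning recipe described above (not the authors' DTL-NN implementation), the following PyTorch snippet pretrains a sparse autoencoder on synthetic stand-ins for healthy connectivity vectors and then reuses its encoder in a classifier for a new task; tensor shapes, the L1 sparsity term, and hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn

    n_features = 6105                              # e.g. flattened upper triangle of a connectivity matrix
    healthy_fc = torch.randn(512, n_features)      # placeholder "healthy" connectomes
    target_fc = torch.randn(120, n_features)       # placeholder target-task connectomes
    target_labels = torch.randint(0, 2, (120,))    # e.g. ASD vs. control

    encoder = nn.Sequential(nn.Linear(n_features, 256), nn.Sigmoid())
    decoder = nn.Linear(256, n_features)

    # Stage 1: offline pretraining with an L1 activity penalty as a simple
    # stand-in for the sparsity constraint of a stacked sparse autoencoder.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(50):
        hidden = encoder(healthy_fc)
        loss = nn.functional.mse_loss(decoder(hidden), healthy_fc) + 1e-3 * hidden.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: transfer the pretrained encoder and fine-tune with a classification head.
    classifier = nn.Sequential(encoder, nn.Linear(256, 2))
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
    for _ in range(50):
        loss = nn.functional.cross_entropy(classifier(target_fc), target_labels)
        opt.zero_grad(); loss.backward(); opt.step()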

  5. NeuriPhy - Neuroimaging Dataset for Physics-Informed Learning

    • zenodo.org
    Updated May 12, 2025
    Cite
    Tiago Assis; Tiago Assis (2025). NeuriPhy - Neuroimaging Dataset for Physics-Informed Learning [Dataset]. http://doi.org/10.5281/zenodo.15381866
    Explore at:
    Dataset updated
    May 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tiago Assis; Tiago Assis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 12, 2025
    Description

    Work in progress...

    NeuriPhy - Neuroimaging Dataset for Physics-Informed Learning

    This dataset was developed in the context of my master's thesis titled "Physics-Guided Deep Learning for Sparse Data-Driven Brain Shift Registration", which investigates the integration of physics-based biomechanical modeling into deep learning frameworks for the task of brain shift registration. The core objective of this project is to improve the accuracy and reliability of intraoperative brain shift prediction by enabling deep neural networks to interpolate sparse intraoperative data under biomechanical constraints. Such capabilities are critical for enhancing image-guided neurosurgery systems, especially when full intraoperative imaging is unavailable or impractical.

    The dataset integrates and extends data from two publicly available sources: ReMIND and UPENN-GBM. A total of 207 patient cases (45 cases from ReMIND and 162 cases from UPENN-GBM), each represented as a separate folder with all relevant data grouped per case, are included in this dataset. It contains preoperative imaging (unstripped), synthetic ground truth displacement fields, anatomical segmentations, and keypoints, structured to support machine learning and registration tasks.

    For details on the image acquisition and other topics related to the original datasets, see their original links above.

    Contents

    • Imaging Data:
      • T1ce: Preoperative contrast-enhanced T1-weighted MRI scans.
      • T2: Preoperative T2-weighted MRI scans, including mostly T2-SPACE, but also native T2 and T2-BLADE acquisitions depending on the case.
      • All MRI scans are in NIfTI format and have been resampled to the same isotropic resolution (1x1x1 mm). Intra-patient rigid coregistration was performed as part of preprocessing with the "General Registration (BRAINS)" extension of 3D Slicer.
    • Synthetic Displacement Fields:
      • Biomechanically simulated ground truth displacement fields were generated using a meshless approach and by solving differential equations of nonlinear elasticity using explicit methods, as described in 1, 2, 3, 4.
      • For each patient, 1 to 5 simulations were successfully performed, each with a different gravity vector orientation according to a plausible surgical entry point, creating variability in the deformations obtained. Overall, the dataset contains 394 simulations that aimed to predict the intraoperative state after tumor-resection-induced brain shift.
      • Includes the initial and displaced (final) coordinates of several points in the brain volume that were used to generate the displacement field using a multi-level BSpline interpolation algorithm.
      • These displacement fields were mainly intended for use as supervision in deep learning-based registration methods.
    • Keypoints:
      • Sparse 3D keypoints and their descriptors were generated using the 3D SIFT-Rank algorithm on the T1ce images (or T2 if T1ce was unavailable).
      • Keypoints are provided for each case in both voxel space and world coordinates (RAS?), being suitable for sparse registration or landmark-based evaluation.
    • Segmentations:
      • Brain segmentations were automatically generated using SynthSeg, a deep learning model capable of robust whole-brain segmentation with scans of any contrast and resolution.
      • Tumor segmentations are included from the original datasets.
      • All segmentations are provided in the NRRD format.

    Data Structure

    Each patient folder contains the following subfolders:

    images/: Preoperative MRI scans (T1ce, T2) in NIfTI format.

    segmentations/: Brain and tumor segmentations in NRRD format.

    simulations/: Biomechanically simulated displacement fields with initial and final point coordinates (LPS) in .npz and .txt formats, respectively.

    keypoints/: 3D SIFT-Rank keypoints and their descriptors in both voxel space and world coordinates (RAS?) as .key files.

    The folder naming and organization are consistent across patients for ease of use and scripting.
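
    A minimal loading sketch under stated assumptions: the exact file names inside each case folder are not listed above, so the paths below (t1ce.nii.gz, sim_001.npz) are hypothetical placeholders, and nibabel/numpy are assumed to be installed.

    import numpy as np
    import nibabel as nib
    from pathlib import Path

    case_dir = Path("NeuriPhy/case_001")                     # hypothetical case folder

    t1ce = nib.load(case_dir / "images" / "t1ce.nii.gz")     # hypothetical file name
    volume = t1ce.get_fdata()                                # 1 mm isotropic volume
    print("image shape:", volume.shape)

    sim = np.load(case_dir / "simulations" / "sim_001.npz")  # hypothetical file name
    print("arrays in simulation file:", sim.files)           # e.g. initial/displaced point coordinates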

    Source Datasets

    ReMIND is a multimodal imaging dataset of 114 brain tumor patients who underwent image-guided surgical resection at Brigham and Women’s Hospital, containing preoperative MRI, intraoperative MRI, and 3D intraoperative ultrasound data. It includes over 300 imaging series and 350 expert-annotated segmentations such as tumors, resection cavities, cerebrum, and ventricles. Demographic and clinico-pathological information (e.g., tumor type, grade, eloquence) is also provided.

    UPENN-GBM comprises multi-parametric MRI scans from de novo glioblastoma (GBM) patients treated at the University of Pennsylvania Health System. It includes co-registered and skull-stripped T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR images. The dataset features high-quality tumor and brain segmentation labels, initially produced by automated methods and subsequently corrected and approved by board-certified neuroradiologists. Alongside imaging data, the collection provides comprehensive clinical metadata including patient demographics, genomic profiles, survival outcomes, and tumor progression indicators.

    Use Cases

    This dataset is tailored for researchers and developers working on:

    • Deformable image registration
    • Physics-informed machine learning
    • Intraoperative brain shift modeling
    • Sparse data interpolation and deep learning
    • Multi-modal image alignment in neuroimaging

    It is especially well-suited for evaluating learning-based registration methods that incorporate physical priors or aim to generalize under sparse supervision.

  6. Data, code, and model weights for "Insights on Galaxy Evolution from...

    • zenodo.org
    application/gzip, bin, +2 more
    Updated Jan 23, 2025
    Cite
    John Wu; John Wu (2025). Data, code, and model weights for "Insights on Galaxy Evolution from Interpretable Sparse Feature Networks" [Dataset]. http://doi.org/10.5281/zenodo.14712542
    Explore at:
    Available download formats: zip, csv, bin, application/gzip
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    John Wu; John Wu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This repository contains data, code, and model weights for reproducing the main results of the paper, Insights on Galaxy Evolution from Interpretable Sparse Feature Networks (see arXiv preprint). Specifically, we provide data files (images-sdss.tar.gz and galaxies.csv), a snapshot of the code base (sparse-feature-networks v1.0.0), and model weights (resnet18-topk_4-metallicity.pth, resnet18-topk_4-bpt_lines.pth). These are described in detail below.

    Data

    galaxies.csv is the main galaxy sample after we have applied the cuts described in the paper (250,224 rows). We include 30 columns queried from the SDSS galSpecInfo, galSpecLine, and galSpecExtra tables:

    objID (int64)
    DR7ObjID (int64)
    specObjID (int64)
    ra (float32)
    dec (float32)
    z (float32)
    zErr (float32)
    velDisp (float32)
    velDispErr (float32)
    modelMag_u (float32)
    modelMag_g (float32)
    modelMag_r (float32)
    modelMag_i (float32)
    modelMag_z (float32)
    petroMag_r (float32)
    petroR50_r (float32)
    petroR90_r (float32)
    bptclass (int32)
    oh_p50 (float32)
    lgm_tot_p50 (float32)
    sfr_tot_p50 (float32)
    nii_6584_flux (float32)
    nii_6584_flux_err (float32)
    h_alpha_flux (float32)
    h_alpha_flux_err (float32)
    oiii_5007_flux (float32)
    oiii_5007_flux_err (float32)
    h_beta_flux (float32)
    h_beta_flux_err (float32)
    reliable (int32)

    images-sdss.tar.gz is a compressed directory containing 250,224 image cutouts from the DESI Legacy Imaging Surveys viewer. Each cutout was generated using the RESTful call http://legacysurvey.org/viewer/cutout.jpg?ra={ra}&dec={dec}&pixscale=0.262&layer=sdss&size=160 where the ra and dec are taken directly from galaxies.csv. Each image is named using the format {objID}.jpg, with objID again taken from galaxies.csv.
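
    As a hedged illustration (not part of the released code), a single cutout could be re-fetched from the URL template above using pandas and requests, with ra, dec, and objID read from galaxies.csv:

    import pandas as pd
    import requests

    galaxies = pd.read_csv("galaxies.csv")
    row = galaxies.iloc[0]

    url = (
        "http://legacysurvey.org/viewer/cutout.jpg"
        f"?ra={row['ra']}&dec={row['dec']}&pixscale=0.262&layer=sdss&size=160"
    )
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(f"{int(row['objID'])}.jpg", "wb") as f:        # file name format {objID}.jpg
        f.write(resp.content)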

    Code

    The code is a snapshot of https://github.com/jwuphysics/sparse-feature-networks at v1.0.0. After unpacking the images and moving them into the ./data directory, the directory structure should look like:

    ./
    ├── data/
    │  ├── images-sdss/
    │  └── galaxies.csv
    ├── model/
    ├── results/
    └── src/
      ├── config.py     
      ├── dataloader.py   
      ├── model.py     
      ├── main.py       
      └── trainer.py 

    In order to run the analysis and reproduce the main results of the paper, you must create the software environment first:

    pip install torch fastai numpy pandas matplotlib cmasher tqdm
    

    and then simply run python src/main.py.

    Models

    The trained model weights (resnet18-topk_4-metallicity.pth, resnet18-topk_4-bpt_lines.pth) are provided here for reproducing the exact results from the paper. These are compatible with the ResNet18TopK class defined in src/model.py, and the weights can be stored in the ./model directory.

    Alternatively, you can train your own models (i.e. by using the functions defined in src/trainer.py) and save them natively with PyTorch.
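
    A minimal sketch of loading the released weights, assuming the repository's src/ directory is importable; the ResNet18TopK constructor arguments are not documented here, so the call below is a placeholder:

    import torch
    from model import ResNet18TopK         # defined in src/model.py

    net = ResNet18TopK()                   # hypothetical default arguments
    state = torch.load("model/resnet18-topk_4-metallicity.pth", map_location="cpu")
    net.load_state_dict(state)             # assumes the file stores a plain state dict
    net.eval()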

  7. Data from: Sparse-Data Deep Learning Strategies for Radiographic...

    • tandf.figshare.com
    pdf
    Updated Jul 4, 2025
    Cite
    Jacqueline Alvarez; Keith Henderson; Maurice B. Aufderheide; Brian Gallagher; Roummel F. Marcia; Ming Jiang (2025). Sparse-Data Deep Learning Strategies for Radiographic Non-Destructive Testing [Dataset]. http://doi.org/10.6084/m9.figshare.29480707.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Jacqueline Alvarez; Keith Henderson; Maurice B. Aufderheide; Brian Gallagher; Roummel F. Marcia; Ming Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Radiography is an imaging technique used in a variety of applications, such as medical diagnosis, airport security, and nondestructive testing. We present a deep learning system for extracting information from radiographic images. We perform various prediction tasks using our system, including material classification and regression on the dimensions of a given object that is being radiographed. Our system is designed to address the sparse-data issue for radiographic nondestructive testing applications. It uses a radiographic simulation tool for synthetic data augmentation, and it uses transfer learning with a pre-trained convolutional neural network model. Using this system, our preliminary results indicate that the object geometry regression task saw an improvement of 70% in the R-squared value when using a multi-regime model. In addition, we increase the performance of the object material classification tasks by utilizing data from different imaging systems. In particular, using neutron imaging improved the material classification accuracy by 20% when compared to x-ray imaging.

  8. Data from: Low-energy electron microscopy intensity-voltage data –...

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    Francesco Masia; Wolfgang Langbein; Simon Fischer; Jon-Olaf Krisponeit; Jens Falta (2024). Low-energy electron microscopy intensity-voltage data – factorization, sparse sampling, and classification [Dataset]. http://doi.org/10.17035/d.2022.0153725100
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    Francesco Masia; Wolfgang Langbein; Simon Fischer; Jon-Olaf Krisponeit; Jens Falta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Low-energy electron microscopy (LEEM) taken as intensity-voltage (I-V) curves provides hyperspectral images of surfaces, which can be used to identify the surface type, but are difficult to analyze. Here, we demonstrate the use of an algorithm for factorizing the data into spectra and concentrations of characteristic components (FSC3) for identifying distinct physical surface phases. Importantly, FSC3 is an unsupervised and fast algorithm. As example data we use experiments on the growth of praseodymium oxide or ruthenium oxide on ruthenium single crystal substrates, both featuring a complex distribution of coexisting surface components, varying in both chemical composition and crystallographic structure. With the factorization result a sparse sampling method is demonstrated, reducing the measurement time by 1-2 orders of magnitude, relevant for dynamic surface studies. The FSC3 concentrations provide the features for a support vector machine (SVM) based supervised classification of the surface types. Here, specific surface regions which have been identified structurally, via their diffraction pattern, as well as chemically by complementary spectro-microscopic techniques, are used as training sets. A reliable classification is demonstrated on both exemplary LEEM I-V datasets. Research results are published at https://arxiv.org/abs/2203.12353. The data available represent the concentration maps obtained by FSC3 in tiff format, together with the associated spectra as ascii. Similarly, the results of the classification algorithm are available as tiff images, while the average concentrations and spectra calculated over the training and testing regions are given as ascii data. The raw data are also given as tiff images, which can be used to test the FSC3 and classification algorithms (available at https://langsrv.astro.cf.ac.uk/HIA/HIA.html and https://github.com/masiaf-cf/leem-svm-classify, respectively). Research results based upon these data are published at https://doi.org/10.1111/jmi.13155
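
    The following sketch (synthetic placeholder arrays, not the released code) illustrates the classification step described above, in which per-pixel FSC3 concentration values serve as features for an SVM trained on labelled surface regions:

    import numpy as np
    from sklearn.svm import SVC

    n_components = 3                                        # number of FSC3 components (assumed)
    train_features = np.random.rand(500, n_components)      # concentrations of labelled training pixels
    train_labels = np.random.randint(0, 4, 500)             # surface-phase labels of those pixels

    svm = SVC(kernel="rbf")
    svm.fit(train_features, train_labels)

    # Classify every pixel of a stack of concentration maps (H x W x n_components).
    conc_maps = np.random.rand(128, 128, n_components)
    phase_map = svm.predict(conc_maps.reshape(-1, n_components)).reshape(128, 128)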

  9. Data from: Overcoming Imperfect Data Challenges in Deep Learning-Based...

    • curate.nd.edu
    pdf
    Updated May 12, 2025
    Cite
    Dewen Zeng (2025). Overcoming Imperfect Data Challenges in Deep Learning-Based Medical Imaging [Dataset]. http://doi.org/10.7274/28786175.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 12, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Dewen Zeng
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    Deep learning (DL) techniques have demonstrated exceptional success in developing high-performing models for medical imaging applications. However, their effectiveness largely depends on access to extensive, high-quality labeled datasets, which are challenging to obtain in the medical field due to the high cost of annotation and privacy constraints. This dissertation introduces several novel deep-learning approaches aimed at addressing challenges associated with imperfect medical datasets, with the goal of reducing annotation efforts and enhancing the generalization capabilities of DL models. Specifically, two imperfect data challenges are studied in this dissertation. (1) Scarce annotation, where only a limited amount of labeled data is available for training. We propose several novel self-supervised learning techniques that leverage the inherent structure of medical images to improve representation learning. In addition, data augmentation with synthetic models is explored to generate synthetic images to improve self-supervised learning performance. (2) Weak annotation, in which the training data has only image-level annotation, noisy annotation, sparse annotation, or inconsistent annotation. We first introduce a novel self-supervised learning-based approach to better utilize image-level labels for medical image semantic segmentation. Motivated by the large inter-observer variation in myocardial annotations for ultrasound images, we further propose an extended dice metric that integrates multiple annotations into the loss function, allowing the model to focus on learning generalizable features while minimizing variations caused by individual annotators.

  10. Data from: MetaFlux: Meta-learning global carbon fluxes from sparse...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 14, 2024
    Cite
    Liu, Jiangong (2024). MetaFlux: Meta-learning global carbon fluxes from sparse spatiotemporal observations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7761880
    Explore at:
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    Nathaniel, Juan
    Liu, Jiangong
    Gentine, Pierre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MetaFlux is a global, long-term carbon flux dataset of gross primary production and ecosystem respiration that is generated using meta-learning. The principle of meta-learning stems from the need to solve the problem of learning in the face of sparse data availability. Data sparsity is a prevalent challenge in climate and ecology science. For instance, in-situ observations tend to be spatially and temporally sparse. This issue can arise from sensor malfunctions, limited sensor locations, or non-ideal climate conditions such as persistent cloud cover. The lack of high-quality continuous data can make it difficult to understand many climate processes that are otherwise critical. The machine-learning community has attempted to tackle this problem by developing several learning approaches, including meta-learning that learns how to learn broad features across tasks to better infer other poorly sampled ones. In this work, we applied meta-learning to solve the problem of upscaling continuous carbon fluxes from sparse observations. Data scarcity in carbon flux applications is particularly problematic in the tropics and semi-arid regions, where only around 8–11% of long-term eddy covariance stations are currently operational. Unfortunately, these regions are important in modulating the global carbon cycle and its interannual variability. In general, we find that meta-trained machine learning models, including multi-layer perceptrons (MLP), long short-term memory (LSTM), and bi-directional LSTM (BiLSTM), have lower validation errors on flux estimates by 9–16% when compared to their non-meta-trained counterparts. In addition, meta-trained models are more robust to extreme conditions, with 4–24% lower overall errors. Finally, we use an ensemble of meta-trained deep networks to upscale in-situ observations of ecosystem-scale photosynthesis and respiration fluxes into daily and monthly global products at a 0.25-degree spatial resolution from 2001 to 2023, called "MetaFlux". We also checked for the seasonality, interannual variability, and correlation to solar-induced fluorescence of the upscaled product and found that MetaFlux outperformed state-of-the-art machine learning upscaling models, especially in critical semi-arid and tropical regions.

  11. MIMIC-III-Ext-tPatchGNN

    • physionet.org
    Updated Apr 9, 2025
    Cite
    Chenlong Yin; Weijia Zhang (2025). MIMIC-III-Ext-tPatchGNN [Dataset]. http://doi.org/10.13026/ckn0-3868
    Explore at:
    Dataset updated
    Apr 9, 2025
    Authors
    Chenlong Yin; Weijia Zhang
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset is a curated subset of MIMIC-III (v1.4), specifically formatted to facilitate reproducibility of the experiments in the work t-PatchGNN. It serves as part of a benchmark designed for forecasting irregular multivariate clinical time series: given a set of historical Irregular Multivariate Time Series (IMTS) observations and a set of forecasting queries, the goal is to accurately forecast the values corresponding to those queries. This requires addressing key challenges such as missing data, variable sampling rates, and complex temporal dependencies. The dataset includes patient records with diverse physiological measurements, each sampled at irregular intervals, reflecting real-world clinical scenarios. It is structured to capture both short-term and long-term temporal patterns, making it well-suited for evaluating machine learning models in medical time series forecasting. By providing a standardized benchmark, this dataset aims to advance research in predictive modeling for healthcare, enabling the development of robust algorithms that can handle irregular and sparse clinical data. The dataset’s applications extend to critical areas such as early disease detection, patient risk stratification, and treatment outcome prediction, making it a valuable resource for the medical AI and machine learning communities.
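
    As an illustration of the forecasting setup (not the dataset's actual file format), an IMTS sample can be thought of as per-variable lists of irregularly timed observations plus a set of query points:

    from dataclasses import dataclass, field

    @dataclass
    class IMTSSample:
        # variable name -> list of (time in hours, observed value), irregularly spaced
        observations: dict = field(default_factory=dict)
        # forecasting queries: list of (variable name, query time in hours)
        queries: list = field(default_factory=list)

    sample = IMTSSample(
        observations={
            "heart_rate": [(0.0, 82.0), (1.5, 90.0), (4.2, 88.0)],   # irregular sampling
            "lactate": [(2.0, 1.9)],                                 # sparsely observed variable
        },
        queries=[("heart_rate", 6.0), ("lactate", 6.0)],
    )
    print(sample.queries)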

  12. Data from: Discovery of sparse, reliable omic biomarkers with Stabl

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Oct 12, 2023
    Cite
    Julien Hédou; Ivana Marić; Grégoire Bellan; Jakob Einhaus; Brice Gaudillière (2023). Discovery of sparse, reliable omic biomarkers with Stabl [Dataset]. http://doi.org/10.5061/dryad.stqjq2c7d
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Dryad
    Authors
    Julien Hédou; Ivana Marić; Grégoire Bellan; Jakob Einhaus; Brice Gaudillière
    Time period covered
    2023
    Description

    Stabl: sparse and reliable biomarker discovery in predictive modeling of high-dimensional omic data

    This is a scikit-learn-compatible Python implementation of Stabl, coupled with useful functions and example notebooks to rerun the analyses on the different use cases located in the Sample data folder of the code library and in the data.zip folder of this repository.

    Requirements

    Python version: 3.7 to 3.10

    Python packages:

    • joblib == 1.1.0
    • tqdm == 4.64.0
    • matplotlib == 3.5.2
    • numpy == 1.23.1
    • cmake == 3.27.1
    • knockpy == 1.2
    • scikit-learn == 1.1.2
    • seaborn == 0.12.0
    • groupyr == 0.3.2
    • pandas == 1.4.2
    • statsmodels == 0.14.0
    • openpyxl == 3.0.7
    • adjustText == 0.8
    • scipy == 1.10.1
    • julia == 0.6.1
    • osqp == 0.6.2

    Julia packages for noise generation (Julia version 1.9.2):

    • Bigsimr == 0.8.7
    • Distributions == 0.25.98
    • PyCall == 1.96.1

    Installation

    Julia installation

    To install Julia, please follow these instructions:

    1. Download Julia from [here](ht...
  13. Data from: netDx: Interpretable patient classification using integrated...

    • data.niaid.nih.gov
    Updated Jul 25, 2024
    Cite
    Shah, Muhammad A (2024). netDx: Interpretable patient classification using integrated patient similarity networks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2558451
    Explore at:
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Hui, Shirley
    Bader, Gary D
    Shah, Muhammad A
    Kaka, Hussam
    Isserlin, Ruth
    Pai, Shraddha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Docker image containing installed netDx software in Ubuntu to reproduce examples from the published manuscript. The R implementation of netDx is hosted at: https://github.com/BaderLab/netDx

    Publication abstract: Patient classification has widespread biomedical and clinical applications, including diagnosis, prognosis and treatment response prediction. A clinically useful prediction algorithm should be accurate, generalizable, be able to integrate diverse data types, and handle sparse data. A clinical predictor based on genomic data needs to be easily interpretable to drive hypothesis-driven research into new treatments. We describe netDx, a novel supervised patient classification framework based on patient similarity networks. netDx meets the above criteria and particularly excels at data integration and model interpretability. We compared classification performance of this method against other machine-learning algorithms, using a cancer survival benchmark with four cancer types, each requiring integration of up to six genomic and clinical data types. In these tests, netDx has significantly higher average performance than most other machine-learning approaches across most cancer types. In comparison to traditional machine learning-based patient classifiers, netDx results are more interpretable, visualizing the decision boundary in the context of patient similarity space. When patient similarity is defined by pathway-level gene expression, netDx identifies biological pathways important for outcome prediction, as demonstrated in diverse data sets of breast cancer and asthma. Thus, netDx can serve both as a patient classifier and as a tool for discovery of biological features characteristic of disease. We provide a freely available software implementation of netDx along with sample files and automation workflows in R.

  14. Raw data for "Efficient protein structure generation with sparse denoising...

    • zenodo.org
    application/gzip
    Updated Jan 31, 2025
    Cite
    Michael Jendrusch; Michael Jendrusch; Jan Korbel; Jan Korbel (2025). Raw data for "Efficient protein structure generation with sparse denoising models" [Dataset]. http://doi.org/10.5281/zenodo.14711580
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Jendrusch; Michael Jendrusch; Jan Korbel; Jan Korbel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 24, 2025
    Description

    This repository contains model parameters and protein structures described in the manuscript "Efficient protein structure generation with sparse denoising models".

    model source code

    "salad-0.1.0.tar.gz" contains the snapshot of the salad code-base used in the manuscript.

    model parameters

    The parameters for the salad (sparse all-atom denoising) models described in the manuscript are contained in "salad_params.tar.gz". This unpacks to a directory "params/", which contains pickled parameter files for a number of model variants:

    • "default_vp-200k.jax" (and "default_vp_timeless-200k.jax", "default_vp_minimal_timeless-200k.jax"):
      • checkpoints at 200,000 training steps for diffusion models with fixed standard deviation (10 Å) variance-preserving (VP) noise.
      • "timeless" and "minimal_timeless" files contain parameters for ablated models without diffusion time features and with reduced pair features as described in the manuscript.
    • "default_vp_scaled-200k.jax" (and "timeless" / "minimal_timeless" variants):
      • checkpoints at 200,000 training steps for input-dependent standard deviation VP noise (VP-scaled in the manuscript).
    • "default_ve_scaled-200k.jax" (and "timeless" / "minimal_timeless" variants):
      • checkpoints at 200,000 training steps for variance expanding (VE) noise.
    • "multimotif_vp-200k.jax":
      • checkpoint at 200,000 training steps for a model with multi-motif conditioning and VP noise.
    • "default_vp-pdb256-200k.jax":
      • checkpoint for a model trained on proteins with 50 to 256 amino acids from the PDB.
    • "default_vp-synthete-256-200k.jax":
      • checkpoint for a model trained on proteins with 50 to 256 amino acids generated using random secondary structure conditioning with the default_vp-200k.jax checkpoint and ProteinMPNN redesign.

    In addition to salad model parameters, we also provide the parameters for the autoencoder models described in the manuscript in "ae_params.tar.gz". This unpacks to a directory "ae_params/", which contains the following checkpoints:

    • "small_none-200k.jax": sparse decoder with neighbour selection based only on predicted coordinates.
    • "small_inner-200k.jax": sparse decoder with neighbour selection based on per-block distogram predictions and predicted coordinates.
    • "small_semiequivariant-200k.jax": same as "small_inner-200k.jax", with the addition of using non-equivariant features on top of the usual equivariant features (relative orientation / distance).
    • "small_nodist_vq-200k.jax": same as "small_none-200k.jax", with vector quantization (VQ).
    • "small_vq-200k.jax": same as "small_inner-200k.jax", with VQ
    • "small_semiequivariant_vq-200k.jax": same as "small_semiequivariant-200k.jax", with VQ
    • "small_vq_e2-500k.jax": same as "small_vq-200k.jax", with double encoder depth and 500k training steps.

    generated proteins

    The protein structures generated using salad, as well as their corresponding sequences generated using ProteinMPNN and predicted structures using ESMfold are contained in "data_package.tar.gz". This archive unpacks to a directory "data_package/" which contains subdirectories for each protein design task described in the manuscript "Efficient protein structure generation with sparse denoising models":

    monomers/

    This directory contains subdirectories named "

    • "ve_large_100": VE diffusion starting from noise with standard deviation 100 Å
    • "ve_large_80": VE diffusion starting from noise with standard deviation 80 Å
    • "ve_domain_100": VE diffusion starting from domain-shaped noise as described in the manuscript with standard deviation 100 Å
    • "ve_domain_80": VE diffusion starting from domain-shaped noise as described in the manuscript with standard deviation 80 Å

    In addition, there are subdirectories with "random" in their name, instead of a number of steps, e.g. "default_vp_scaled-200-random-esm/". These subdirectories contain data generated using random secondary structure conditioning.

    Each subdirectory has the same underlying structure:

    • "backbones/": directory containing PDB files of salad-generated backbones
    • "predictions/": directory containing PDB files of the predicted structure for the best sequence according to ESMfold pLDDT and scRMSD for each designable backbone (a backbone with at least 1 predicted structure with scRMSD < 2 Å and pLDDT > 70).
    • "scores.csv": comma-separated file of structure-prediction metrics for each ProteinMPNN sequence generated for each backbone in "backbones". This file has the following columns:
      • "name": base name of the backbone PDB file in "backbones"
      • "index": index of the sequence corresponding to this row (0th, 1st, etc.)
      • "sequence": amino acid sequence that was had its structure predicted in this row
      • "sc_rmsd": root mean square deviation (RMSD) between the salad backbone and the predicted structure for this row
      • "sc_tm": TM score between the salad backbone and the predicted structure for this row
      • "plddt": pLDDT of the predicted structure for this row
      • "ptm": pTM of the predicted structure for this row
      • "pae": predicted aligned error for this ro
      • for complexes (irrelevant for this study):
        • "ipae": mean interface pAE for this row
        • "mpae": minimum interface pAE for this row

    comparison/

    Same as "monomers/", but contains data generated using RFdiffusion and Genie 2 for protein sizes between 50 and 400 amino acids.

    shape/

    This directory contains the subdirectories named "ve-seg-

    motif/

    This directory contains generated structures for the motif-scaffolding benchmark described by Lin et al., 2024 [1]. It contains two subdirectories:

    • "cond/": contains results generated using motif-conditioned models with the checkpoint "multimotif_vp-200k.jax"
    • "nocond/": contains results generated using structure-editing for motif-scaffolding with the checkpoint "default_vp-200k.jax"

    Each of these subdirectories has the same structure as the directories "monomers/" and "shape/", with one subdirectory per motif PDB file in the motif-scaffolding benchmark, e.g. "cond/multimotif_vp-1bcf.pdb-esm/" or "nocond/default_vp-1bcf.pdb-esm/". These directories contain the usual "backbones/" and "predictions/" subdirectories, as well as a file "motif_scores.csv". This has fields analogous to "scores.csv", plus two additional fields for motif RMSD:

    • "motif_rmsd_ca": CA-only RMSD between the ESMfold predicted structure and the input motif
    • "motif_rmsd_bb": full-backbone (N, CA, C) RMSD between the ESMfold predicted structure and the input motif

    A designed sequence-structure pair is only considered successful if sc_rmsd < 2 Å, plddt > 70 and motif_rmsd_bb < 1 Å.

    sym/

    This directory contains generated structures for symmetric repeat proteins using both VP and VE models with structure-editing. Subdirectories are named by model type ("default_vp", "default_ve_minimal_timeless"), symmetry ("C

    confchange/

    This directory contains generated structures for designed multi-state proteins. In our manuscript we compare two different approaches to multi-state design using salad which are reflected in two subdirectories of "confchange/":

    • "default_vp-parent-split-af2": running independent denoising processes with distinct secondary structure constraints, followed by tied ProteinMPNN sequence design
    • "default_vp-parent-split-constrained-af2": running coupled denoising processes where shared substructures across states are kept aligned across states, followed by tied ProteinMPNN sequence design

    Both share the same directory structure:

    • "backbones/{parent, child1, child2, partial_parent, partial_child1, partial_child2}": generated successful backbones, and partially successful backbones ("partial_") for the three designed states (the full parent structure and two child structures resulting from splitting the parent sequence into its N and C terminal parts).
    • "predictions/{parent, child1, child2, partial_parent, partial_child1, partial_child2}": the best AlphaFold 2 predicted structures for each successful backbone.
    • "scores_{parent, child1, child2}.csv": scores files as above, generated using AlphaFold 2

  15. Data from: Algorithms for Sparse Support Vector Machines

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Alfonso Landeros; Kenneth Lange (2023). Algorithms for Sparse Support Vector Machines [Dataset]. http://doi.org/10.6084/m9.figshare.21554661.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Alfonso Landeros; Kenneth Lange
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many problems in classification involve huge numbers of irrelevant features. Variable selection reveals the crucial features, reduces the dimensionality of feature space, and improves model interpretation. In the support vector machine literature, variable selection is achieved by ℓ1 penalties. These convex relaxations seriously bias parameter estimates toward 0 and tend to admit too many irrelevant features. The current article presents an alternative that replaces penalties by sparse-set constraints. Penalties still appear, but serve a different purpose. The proximal distance principle takes a loss function L(β) and adds the penalty (ρ/2)·dist(β, S_k)², capturing the squared Euclidean distance of the parameter vector β to the sparsity set S_k where at most k components of β are nonzero. If β_ρ represents the minimum of the objective f_ρ(β) = L(β) + (ρ/2)·dist(β, S_k)², then β_ρ tends to the constrained minimum of L(β) over S_k as ρ tends to ∞. We derive two closely related algorithms to carry out this strategy. Our simulated and real examples vividly demonstrate how the algorithms achieve better sparsity without loss of classification power. Supplementary materials for this article are available online.
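
    To make the notation concrete, here is a small NumPy sketch (an illustration, not the authors' algorithms) of the proximal distance idea: project β onto the sparsity set S_k by keeping its k largest-magnitude components, penalize the squared distance to that projection, and minimize the penalized objective, here with a squared-hinge SVM loss and plain gradient descent at a fixed ρ:

    import numpy as np

    def project_Sk(beta, k):
        """Keep the k largest-magnitude components of beta, zero out the rest."""
        out = np.zeros_like(beta)
        idx = np.argsort(np.abs(beta))[-k:]
        out[idx] = beta[idx]
        return out

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))
    true_beta = np.zeros(50); true_beta[:3] = [2.0, -1.5, 1.0]
    y = np.sign(X @ true_beta + 0.1 * rng.standard_normal(100))

    beta, rho, k, lr = np.zeros(50), 10.0, 3, 1e-3
    for _ in range(500):
        margins = np.maximum(0.0, 1.0 - y * (X @ beta))     # squared-hinge SVM loss L(beta)
        grad_loss = -(X.T @ (y * margins))
        grad_penalty = rho * (beta - project_Sk(beta, k))   # gradient of (rho/2) * dist(beta, S_k)^2
        beta -= lr * (grad_loss + grad_penalty)

    # In the paper rho is increased so that beta_rho approaches the minimum of L over S_k.
    print("support after projection:", np.nonzero(project_Sk(beta, k))[0])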

  16. COVID-19 CNN MFCC classifier

    • data.mendeley.com
    • narcis.nl
    Updated Aug 20, 2020
    Cite
    Robert Dunne (2020). COVID-19 CNN MFCC classifier [Dataset]. http://doi.org/10.17632/ww5dfy53cw.1
    Explore at:
    Dataset updated
    Aug 20, 2020
    Authors
    Robert Dunne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High accuracy classification of COVID-19 coughs using Mel-frequency cepstral coefficients and a Convolutional Neural Network with a use case for smart home devices.

    Diagnosing COVID-19 early in domestic settings is possible through smart home devices that can classify audio input of coughs and determine whether they are COVID-19. Research is currently sparse in this area and data is difficult to obtain. However, a few small data collection projects have enabled audio classification research into the application of different machine learning classification algorithms, including Logistic Regression (LR), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN). We show here that a CNN using audio converted to Mel-frequency cepstral coefficient spectrogram images as input can achieve high accuracy results, with classification of validation data scoring an accuracy of 97.5% correct classification of COVID and non-COVID labelled audio. The work here provides a proof of concept that high accuracy can be achieved with a small dataset, which can have a significant impact in this area. The results are highly encouraging and provide further opportunities for research by the academic community on this important topic.
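
    A minimal preprocessing sketch (not the released classifier) of the MFCC step the description refers to, assuming librosa is installed and using a hypothetical file name:

    import numpy as np
    import librosa

    audio, sr = librosa.load("cough_sample.wav", sr=22050)       # hypothetical recording
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)       # shape (40, n_frames)

    # Normalise and add a channel axis so the MFCC spectrogram can be fed to a
    # CNN like a single-channel image.
    mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)
    cnn_input = mfcc[np.newaxis, ...]
    print(cnn_input.shape)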

    Preprint: https://www.researchgate.net/publication/343376336_High_accuracy_classification_of_COVID-19_coughs_using_Mel-frequency_cepstral_coefficients_and_a_Convolutional_Neural_Network_with_a_use_case_for_smart_home_devices

  17. Neural Network Software Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 24, 2025
    Cite
    Market Report Analytics (2025). Neural Network Software Market Report [Dataset]. https://www.marketreportanalytics.com/reports/neural-network-software-market-89614
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Neural Network Software market is experiencing explosive growth, projected to reach a substantial size, driven by the increasing adoption of AI across diverse sectors. A 35.20% CAGR from 2019 to 2024 suggests a significant market expansion, with this momentum expected to continue throughout the forecast period (2025-2033). Key drivers include the rising need for advanced analytics in fraud detection, financial forecasting, and image optimization across industries like BFSI (Banking, Financial Services, and Insurance), healthcare, and retail. The market's segmentation by application and end-user vertical highlights the versatility and wide-ranging applicability of neural network software. The growth is further fueled by ongoing advancements in hardware capabilities (like GPUs from NVIDIA and Intel) that enhance processing power and efficiency for complex neural network computations. However, challenges such as the high cost of implementation, the need for specialized expertise, and data privacy concerns represent potential restraints on market growth. Despite these challenges, the continuous innovation in algorithm development, expanding cloud computing infrastructure, and the increasing availability of large datasets are expected to overcome these limitations and fuel further expansion in the coming years. The competitive landscape is dominated by major players like IBM, NVIDIA, Intel, and Microsoft, alongside specialized companies like Clarifai and Alyuda Research. This blend of established tech giants and nimble specialists contributes to the market's dynamism and innovation. The North American market currently holds a significant share, but rapid technological adoption in Asia-Pacific and other regions indicates a potential shift in geographical market share over the forecast period. The increasing demand for automation and improved decision-making across various sectors promises sustained growth for the Neural Network Software market, making it a lucrative and rapidly evolving space. The market is poised to benefit from the growing trend of using AI for predictive maintenance and streamlining operations in areas such as logistics and defense. Recent developments include: May 2022: Google AI released GraphWorld, a tool to accelerate performance benchmarking in the area of graph neural networks (GNNs). By enabling artificial intelligence (AI) engineers and academics to test new GNN architectures on larger graph datasets, it allows a new approach to GNN architectural testing and design. August 2022: With the introduction of NVIDIA's NeuralVDB, the prestigious OpenVDB combined artificial intelligence (AI) and graphics processing unit (GPU) optimization to help professionals across scientific computing, visualization, and more, interact with large and complex volumetric data in real-time. NeuralVDB offers a 100x memory footprint reduction for sparse volumetric data such as smoke and clouds. Key drivers for this market are: Availability of Spatial Data and Analytical Tools, Increasing Demand for Predicting Solutions. Potential restraints include: Availability of Spatial Data and Analytical Tools, Increasing Demand for Predicting Solutions. Notable trends are: Healthcare Segment to Grow Significantly.

  18. Data from: NeuroSense: A Novel EEG Dataset Utilizing Low-Cost, Sparse...

    • zenodo.org
    Updated Oct 30, 2024
    Cite
    Tommaso Colafiglio; Angela Lombardi; Paolo Sorino; Elvira Brattico; Domenico Lofù; Danilo Danese; Eugenio Di Sciascio; Tommaso Di Noia; Fedelucio Narducci (2024). NeuroSense: A Novel EEG Dataset Utilizing Low-Cost, Sparse Electrode Devices for Emotion Exploration [Dataset]. http://doi.org/10.5281/zenodo.14003181
    Explore at:
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Zenodo
    Authors
    Tommaso Colafiglio; Angela Lombardi; Paolo Sorino; Elvira Brattico; Domenico Lofù; Danilo Danese; Eugenio Di Sciascio; Tommaso Di Noia; Fedelucio Narducci
    Time period covered
    Oct 28, 2024
    Description

    README

    Link to the Publication

    🔗 Read the Paper

    Details related to access to the data

    Data user agreement

    The terms and conditions for using this dataset are specified in the [LICENCE](LICENCE) file included in this repository. Please review these terms carefully before accessing or using the data.

    Contact person

    For additional information about the dataset, please contact:
    - Name: Angela Lombardi
    - Affiliation: Department of Electrical and Information Engineering, Politecnico di Bari
    - Email: angela.lombardi@poliba.it

    Practical information to access the data

    The dataset can be accessed through our dedicated web platform. To request access:

    1. Visit the main dataset page at: https://sisinflab.poliba.it/neurosense-dataset-request/
    2. Follow the instructions on the website to submit your access request
    3. Upon approval, you will receive further instructions for downloading the data

    Please ensure you have read and agreed to the terms in the data user agreement before requesting access.

    Overview

    EEG Emotion Recognition - Muse Headset
    2023-2024

    The experiment consists of 40 sessions per user. During each session, users are asked to watch a
    music video with the aim of understanding their emotions.
    Recordings are performed with a Muse EEG headset at a 256 Hz sampling rate.
    Channels are recorded as follows:
    - Channel 0: AF7
    - Channel 1: TP9
    - Channel 2: TP10
    - Channel 3: AF8

    The chosen songs have various Last.fm tags in order to create different feelings. The title of every track
    can be found in the "TaskName" field of sub-ID***_ses-S***_task-Default_run-001_eeg.json, while the author,
    the Last.fm tag, and additional information are in "TaskDescription".
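
    A minimal sketch (assuming standard sidecar JSON files as named above) for reading the track metadata of one session; the concrete subject and session numbers below are hypothetical placeholders for the *** parts of the naming scheme, and the folder layout is assumed to be BIDS-like:

    import json

    sidecar = "sub-ID001/ses-S001/eeg/sub-ID001_ses-S001_task-Default_run-001_eeg.json"  # hypothetical path
    with open(sidecar) as f:
        meta = json.load(f)

    print("track:", meta.get("TaskName"))
    print("details:", meta.get("TaskDescription"))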

    Methods

    Subjects

    The subject pool consists of 30 college students aged between 18 and 35; 16 of them are male and 14 are female.

    Apparatus

    The experiment was performed using the same procedures as those used to create the
    [DEAP Dataset](https://www.eecs.qmul.ac.uk/mmv/datasets/deap/), which is a dataset for recognizing emotions via a Brain
    Computer Interface (BCI).


    Task organization

    Firstly, music videos were selected. Once 40 songs were picked, the protocol was chosen and the self-assessment
    questionnaire was created.

    Task details

    In order to evaluate the stimulus, Russell's VAD (Valence-Arousal-Dominance) scale was used.
    In this scale, the valence-arousal space can be divided into four quadrants (a small mapping
    sketch follows the list):
    - Low Arousal/Low Valence (LALV);
    - Low Arousal/High Valence (LAHV);
    - High Arousal/Low Valence (HALV);
    - High Arousal/High Valence (HAHV).
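
    Below is a minimal sketch of how a self-assessment rating pair could be mapped to these quadrant labels. The numeric rating scale and its midpoint are assumptions, since the exact scale of the questionnaire is not specified here.

```python
def va_quadrant(valence, arousal, midpoint=5.0):
    """Map a (valence, arousal) rating pair to one of the four quadrant labels.

    The midpoint value is an assumption; adjust it to match the actual rating scale.
    """
    v = "HV" if valence >= midpoint else "LV"
    a = "HA" if arousal >= midpoint else "LA"
    return a + v  # e.g. "HAHV", "LALV"

print(va_quadrant(7.2, 3.1))  # -> "LAHV"
```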

    Experimental location

    The experiment was performed in a laboratory located at DEI Department of
    [Politecnico di Bari](https://www.poliba.it/).

    Missing data


    Data recorded during Session 2 of user ID019 and Session 23 of user ID021 was corrupted and is therefore missing.
    Sessions S033 and S038 of user ID015 show a calculated effective sampling rate lower than 256 Hz (a sketch of how such a rate can be estimated follows the list):
    - ID015_ses-S033 has 226.1320 Hz
    - ID015_ses-S038 has 216.9549 Hz
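
    The sketch below shows one way such an effective sampling rate could be estimated from sample timestamps, assuming NumPy is available; the timestamp array here is synthetic and purely illustrative, not taken from the dataset.

```python
import numpy as np

def effective_sampling_rate(timestamps_s):
    """Estimate the effective sampling rate (Hz) from sample timestamps given in seconds."""
    timestamps_s = np.asarray(timestamps_s, dtype=float)
    duration = timestamps_s[-1] - timestamps_s[0]
    return (len(timestamps_s) - 1) / duration

# Synthetic example: a nominal 256 Hz recording with ~10% dropped samples.
t = np.arange(0, 60, 1 / 256.0)
t = np.delete(t, np.arange(0, t.size, 10))
print(round(effective_sampling_rate(t), 2))  # noticeably below 256 Hz
```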

  19. MedalCare-XL

    • zenodo.org
    • paperswithcode.com
    zip
    Updated Aug 28, 2023
    Cite
    Karli Gillette*; Matthias A.F. Gsell*; Claudia Nagel*; Jule Bender; Benjamin Winkler; Steven E. Williams; Markus Bär; Tobias Schäffter; Olaf Dössel; Gernot Plank; Axel Loewe (2023). MedalCare-XL [Dataset]. http://doi.org/10.5281/zenodo.7293655
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Karli Gillette*; Matthias A.F. Gsell*; Claudia Nagel*; Jule Bender; Benjamin Winkler; Steven E. Williams; Markus Bär; Tobias Schäffter; Olaf Dössel; Gernot Plank; Axel Loewe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess precisely known ground truth labels of the underlying disease (model parameterization) and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECG signals were used to enrich sparse clinical data for machine learning or even replace them completely during training leading to good performance on real-world clinical test data.

    We thus generated a large synthetic database comprising a total of 16,900 12-lead ECGs based on multi-scale electrophysiological simulations equally distributed into 1 normal healthy control and 7 pathology classes. The pathological case of myocardial infarction had 6 sub-classes. A comparison of extracted timing and amplitude features between the virtual cohort and a large publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The novel dataset of simulated ECG signals is split into training, validation and test data folds for development of novel machine learning algorithms and their objective assessment.

    This folder WP2_largeDataset_Noise contains the 12-lead ECGs of 10 seconds length. Each ECG is stored in a separate CSV file with one row per lead (lead order: I, II, III, aVR, aVL, aVF, V1-V6) and one sample per column (sampling rate: 500Hz). Data are split by pathologies (avblock = AV block, lbbb = left bundle branch block, rbbb = right bundle branch block, sinus = normal sinus rhythm, lae = left atrial enlargement, fam = fibrotic atrial cardiomyopathy, iab = interatrial conduction block, mi = myocardial infarction). MI data are further split into subclasses depending on the occlusion site (LAD, LCX, RCA) and transmurality (0.3 or 1.0). Each pathology subclass contains training, validation and testing data (~ 70/15/15 split). Training, validation and testing datasets were defined according to the model with which QRST complexes were simulated, i.e., ECGs calculated with the same anatomical model but different electrophysiological parameters are only present in one of the test, validation and training datasets but never in multiple. Each subfolder also contains a "siginfo.csv" file specifying the respective simulation run for the P wave and the QRST segment that was used to synthesize the 10 second ECG segment. Each signal is available in three variations:
    run_*_raw.csv contains the synthesized ECG without added noise and without filtering
    run_*_noise.csv contains the synthesized ECG (unfiltered) with superimposed noise
    run_*_filtered.csv contains the filtered synthesized ECG (filter settings: highpass cutoff frequency 0.5 Hz, lowpass cutoff frequency 150 Hz, Butterworth filters of order 3).
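
    As a hedged illustration of these filter settings, the sketch below loads one raw ECG CSV (one row per lead, one sample per column, 500 Hz) and applies a 0.5 Hz highpass and a 150 Hz lowpass Butterworth filter of order 3 using SciPy. Whether the dataset used zero-phase (forward-backward) filtering is not stated, so sosfiltfilt is an assumption here, and the example file name is only a placeholder following the run_*_raw.csv pattern.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 500  # sampling rate stated in the dataset description
LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]

def load_ecg(csv_path):
    """Load one 12-lead ECG: one row per lead, one column per sample."""
    return np.loadtxt(csv_path, delimiter=",")

def filter_like_dataset(ecg, fs=FS):
    """Apply the documented settings: 0.5 Hz highpass, 150 Hz lowpass, Butterworth order 3."""
    sos_hp = butter(3, 0.5, btype="highpass", fs=fs, output="sos")
    sos_lp = butter(3, 150.0, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos_lp, sosfiltfilt(sos_hp, ecg, axis=1), axis=1)

# Placeholder file name following the run_*_raw.csv pattern described above.
# ecg_raw = load_ecg("run_0001_raw.csv")
# ecg_filtered = filter_like_dataset(ecg_raw)
```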

    The folder WP2_largeDataset_ParameterFiles contains the parameter files used to simulate the 12-lead ECGs. Parameters are split for atrial and ventricular simulations, which were run independently from one another.
    See Gillette*, Gsell*, Nagel* et al. "MedalCare-XL: 16,900 healthy and pathological electrocardiograms obtained through multi-scale electrophysiological models" for a description of the model parameters.

  20. Data from: Uncertainty-aware molecular dynamics from Bayesian active...

    • zenodo.org
    • data.niaid.nih.gov
    bin, sh, tar, zip
    Updated Mar 8, 2022
    Cite
    Yu Xie; Jonathan Vandermause; Senja Ramakers; Nakib H. Protik; Anders Johansson; Boris Kozinsky (2022). Uncertainty-aware molecular dynamics from Bayesian active learning: Phase Transformations and Thermal Transport in SiC [Dataset]. http://doi.org/10.5281/zenodo.5797177
    Explore at:
    Available download formats: tar, zip, bin, sh
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yu Xie; Jonathan Vandermause; Senja Ramakers; Nakib H. Protik; Anders Johansson; Boris Kozinsky
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning interatomic force fields are promising for combining high computational efficiency and accuracy in modeling quantum interactions and simulating atomic level processes. Active learning methods have been recently developed to train force fields efficiently and automatically. Among them, Bayesian active learning utilizes principled uncertainty quantification to make data acquisition decisions. In this work, we present an efficient Bayesian active learning workflow, where the force field is constructed from a sparse Gaussian process regression model based on atomic cluster expansion descriptors. To circumvent the high computational cost of the sparse Gaussian process uncertainty calculation, we formulate a high-performance approximate mapping of the uncertainty and demonstrate a speedup of several orders of magnitude. As an application, we train a model for silicon carbide (SiC), a wide-gap semiconductor with complex polymorphic structure and diverse technological applications in power electronics, nuclear physics and astronomy. We show that the high pressure phase transformation is accurately captured by the autonomous active learning workflow. The trained force field shows excellent agreement with both ab initio calculations and experimental measurements, and outperforms existing empirical models on vibrational and thermal properties. The active learning workflow is readily generalized to a wide range of systems and accelerates computational understanding and design.
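
    As a schematic illustration of the uncertainty-driven data acquisition idea described above (not the authors' sparse Gaussian process / ACE workflow), the sketch below runs a generic Bayesian active learning loop on a 1-D toy problem with scikit-learn's exact Gaussian process: the most uncertain candidate is labeled by a stand-in "expensive" function until the predictive uncertainty falls below a threshold. The toy labeling function, kernel, and threshold are all assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_label(x):
    """Toy 1-D stand-in for an expensive ab initio (DFT) calculation."""
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X_pool = np.linspace(-3, 3, 200).reshape(-1, 1)           # candidate configurations
X_train = rng.choice(X_pool.ravel(), 3, replace=False).reshape(-1, 1)
y_train = expensive_label(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
threshold = 0.05  # uncertainty threshold for requesting a new label

for _ in range(20):
    gp.fit(X_train, y_train)
    mean, std = gp.predict(X_pool, return_std=True)
    if std.max() < threshold:
        break                                              # confident everywhere: stop labeling
    x_new = X_pool[np.argmax(std)].reshape(1, -1)          # most uncertain candidate
    X_train = np.vstack([X_train, x_new])
    y_train = np.append(y_train, expensive_label(x_new).ravel())

print(f"labeled points: {len(X_train)}, max predictive std: {std.max():.3f}")
```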
