License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
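As an illustration of the strategies named above, the following Python sketch (not part of the guide; synthetic data and scikit-learn are assumptions) contrasts mean, regression, and stochastic regression imputation on a toy variable y with values missing completely at random.

```python
# Minimal sketch contrasting three imputation strategies on a toy dataset:
# mean imputation, regression imputation, and stochastic regression imputation.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)
y[rng.random(n) < 0.3] = np.nan          # ~30% of y missing (MCAR here)
df = pd.DataFrame({"x": x, "y": y})

# 1) Mean imputation: preserves the mean of y but shrinks its variance
#    and attenuates the x-y correlation.
df["y_mean"] = df["y"].fillna(df["y"].mean())

# 2) Regression imputation: fill missing y with predictions from x.
obs = df["y"].notna()
model = LinearRegression().fit(df.loc[obs, ["x"]], df.loc[obs, "y"])
pred = model.predict(df[["x"]])
df["y_reg"] = np.where(obs, df["y"], pred)

# 3) Stochastic regression imputation: add residual noise to the predictions
#    so the imputed values retain realistic spread.
resid_sd = np.std(df.loc[obs, "y"] - model.predict(df.loc[obs, ["x"]]))
noise = rng.normal(scale=resid_sd, size=n)
df["y_stoch"] = np.where(obs, df["y"], pred + noise)

print(df[["y_mean", "y_reg", "y_stoch"]].describe())
```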
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Multiple imputation (MI) is effective for handling missing data when the missingness mechanism is missing at random (MAR). However, MI may not be effective when the mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, a multiple imputation method that uses population information and provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two limitations. First, it can be used only when a single variable has missing values. Second, the theoretical properties of its variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method, and an estimator of that variance, are also given.
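To make the δ-adjustment idea concrete, here is a simplified Python sketch of the general approach: MAR-based imputations are shifted by a δ chosen so that the multiply imputed estimate matches known population information. This is an illustration of the concept only, not the authors' algorithm; the calibration target (a known population mean) and the grid search are assumptions.

```python
# Hedged, simplified illustration of delta-adjusted multiple imputation
# calibrated against external population information (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y_full = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
miss = rng.random(n) < 1 / (1 + np.exp(-(y_full - 1.0)))   # NMAR: larger y more likely missing
y = np.where(miss, np.nan, y_full)

known_pop_mean_y = y_full.mean()   # stand-in for external population information

obs = ~np.isnan(y)
reg = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])
resid_sd = np.std(y[obs] - reg.predict(x[obs].reshape(-1, 1)))

def mi_mean(delta, n_imp=20):
    """Mean of y over multiply imputed datasets, with imputations shifted by delta."""
    means = []
    for _ in range(n_imp):
        draw = reg.predict(x[~obs].reshape(-1, 1)) + rng.normal(scale=resid_sd, size=(~obs).sum())
        y_imp = y.copy()
        y_imp[~obs] = draw + delta
        means.append(y_imp.mean())
    return np.mean(means)

# "Calibrate" delta so the MI estimate matches the known population mean
# (simple grid search for illustration).
deltas = np.linspace(-2, 2, 81)
delta_hat = deltas[np.argmin([abs(mi_mean(d) - known_pop_mean_y) for d in deltas])]
print("calibrated delta:", delta_hat)
```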
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatments for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified for small study areas, for a narrow range of data loss, or for a single species, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial, GPS-derived, large-mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antenna, and we characterize home-range probability of GPS detection for four focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, constrained by neighboring elevations within a specified radial distance; a 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight, or viewshed, concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), with a USGS National Elevation Dataset DEM as input. Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix-interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here as fix attempts by hour. This table can be linked with the site location shapefile using the site field. Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition.
We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix-interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home ranges to the observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home ranges and the observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68. Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix-interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. Part 5, Cougar Home Ranges (shapefile): Cougar home ranges were calculated to compare the mean probability of GPS fix acquisition across the home range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home range affect observed FSR. We estimated home ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals whose collars were lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data. Comparisons with the home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs. Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data include only direct GPS download datasets.
Satellite-delivered data were omitted from the analysis for animals whose collars were lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data. Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial, GPS-derived, large-mammal habitat use.
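A simplified sketch of the positive-openness calculation described in Parts 1 and 7 is given below; this is an illustrative, unoptimized Python version assuming a NumPy DEM array with square cells, not the released script.

```python
# Simplified positive-openness sketch following Yokoyama et al. (2002):
# for each cell, find the maximum elevation angle to the surface along each
# of eight azimuths within a search radius, then average (90 deg - that angle).
import numpy as np

def positive_openness(dem, cell_size=30.0, radius=480.0):
    rows, cols = dem.shape
    steps = int(radius // cell_size)
    azimuths = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    out = np.full(dem.shape, np.nan)
    for r in range(rows):
        for c in range(cols):
            angles = []
            for dr, dc in azimuths:
                max_angle = -np.inf
                for k in range(1, steps + 1):
                    rr, cc = r + dr * k, c + dc * k
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        break
                    dist = k * cell_size * np.hypot(dr, dc)
                    angle = np.degrees(np.arctan2(dem[rr, cc] - dem[r, c], dist))
                    max_angle = max(max_angle, angle)
                if np.isfinite(max_angle):
                    angles.append(90.0 - max_angle)   # zenith angle along this azimuth
            if angles:
                out[r, c] = np.mean(angles)
    return out

# Example on a tiny synthetic DEM (a tilted plane)
dem = np.add.outer(np.linspace(0, 50, 20), np.linspace(0, 50, 20))
print(positive_openness(dem)[10, 10])
```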
Dataset: Recognizing molecular mechanisms of antitumor compounds from cytostatic activity patterns on the NCI-60 human cancer cell lines. Related to Ester K et al. (2012) doi:10.1007/s10637-010-9571-7. The NCI-DTP (Developmental Therapeutics Program) has tested thousands of small molecules for in vitro antiproliferative activity against 60 human cancer cell lines, the NCI-60 screen [1]. The mode of action for hundreds of these compounds is characterized well enough that they can be assigned to a mechanistic class. These data can be used to develop models that recognize the mechanism-of-action (MOA) class of a compound from its pattern of differential activity across the 60 cell lines [2, 3]. Such models could then be applied to a novel compound to find its MOA after the compound has been tested for growth inhibition on the 60 cell lines. In particular, the dataset contains 11,999/13,404/12,998 molecules (depending on preprocessing, see below), 475/552/615 of which are assigned to exactly one of 12 possible MOA classes. For each molecule, its cytostatic activity against the 60 cell lines is given, expressed as -log10 of the GI50 [also called IC50] concentrations. For example, an activity of 6.0 means the compound is active (inhibits growth by 50%) at a 10⁻⁶ M concentration (micromolar range). There are some missing values: 2.2%/6.3%/2.1% overall, varying substantially across cell lines. We have excluded compounds that are either completely inactive or active against very few (<10) cell lines, as well as those with many missing values or experimental results removed in preprocessing (see Methods); the unfiltered NCI-60 dataset would have >47,000 compounds. The possible MOA classes are: alkylating agent, DNA antimetabolite, nucleoside/nucleobase analog, cytoskeleton/mitotic agent, antineoplastic antibiotic, kinase targeting agent, membrane active agent, DNA intercalator, steroid compound, ion channel agent, topoisomerase I poison, or topoisomerase II poison. The MOA labels were filtered to retain only compounds whose GI50 patterns across cell lines were consistent with other compounds in the same MOA class, using a Random Forest analysis (see Methods). This dataset has been used to find the MOA of novel compounds by experimentally measuring cytostatic activity against a subset of NCI-60 cell lines and applying a Random Forest classifier to such measurements. This is described in: (i) Ester, Supek et al. 2012 Inv New Drugs, "Putative mechanisms of antitumor activity of cyano-substituted heteroaryles in HeLa cells", and (ii) Supek, Kralj et al. 2010 Inv New Drugs, "Atypical cytostatic mechanism of N-1-sulfonylcytosine derivatives determined by in vitro screening and computational analysis". If you found this data useful, please cite Ester et al. (2012) doi:10.1007/s10637-010-9571-7 and an original NCI publication with MOA analyses (for instance, [2]). Methods: Preprocessing, handling defaults. This dataset is replete with default GI50 values, meaning that the actual GI50 concentration falls outside the experimentally tested range. Given that compounds are often tested in the concentration range of 10⁻⁸ M to 10⁻⁴ M, inactive compounds will have the default GI50 >= 10⁻⁴ M, and (less frequently) very active compounds will have the default <= 10⁻⁸ M.
As a complicating circumstance, compounds are sometimes not tested in this typical 10⁻⁸ to 10⁻⁴ M range; moreover, the same compound may be tested in multiple experiments, which may have different ranges (and thus different default values). Data for some compounds were not provided in some cell lines. The three ways of preprocessing depend on how these defaults are handled: (1) Force typical range: all experiments whose range is not 10⁻⁸ to 10⁻⁴ M are fully discarded; some compounds will thus be left out of the final data. The high-activity (GI50 <= 10⁻⁸ M) and low-activity (GI50 >= 10⁻⁴ M) defaults are kept as observed measurements, although the actual activity may be a lot higher or lower than the given value. Multiple experiments per compound are averaged. A maximum of 3/60 missing values per compound is tolerated. At least 10/60 cell lines must have a non-default value or the compound is discarded. (2) Force non-defaults: all experiments are kept, regardless of tested range. All default values are discarded and represented by missing values. This results in many missing values for the inactive (or rarely, very active) compounds. Multiple experiments for the same compound are averaged. A maximum of 10/60 missing values per compound is tolerated (here this includes both the actual missing values and the defaults) or the compound is discarded. (3) Smart filter (RECOMMENDED): all experiments are kept regardless of range; the defaults are also kept and recorded as observations. If the same compound has multiple experiments and gives more than one default value, the most extreme default is kept. If some experiments record a default and others record non-defaults for the same compound, the default is completely discarded and the non-defaults are averaged. A maximum of 3/60 missing values per compound is tolerated. At least 10/60 cell lines must have a non-default value or the compound is discarded. Methods (1) and (3) have little missing data but may be noisier as they include defaults; method (3) has higher coverage. Option (2) has the most reliable data but also more missing values. These data are not normalized and the -log10 GI50 values are given as-is. However, in the Ester et al. and Supek et al. papers above, we used the data after standardizing (scaling) each compound (row in the table) to a mean of 0 and a standard deviation of 1 across the cell lines. Data sources for GI50 values and MOA class labels: The growth inhibition data are the Dec-2010 version downloaded from the NCI-DTP site [4] and preprocessed as above. The putative mechanism-of-action (MOA) class labels were collected from the DTP website [5]; a broader set derived from GI50 profile clustering analyses in [2,3] was kindly provided by the DTP (pers. comm.) and was further edited by removing and/or merging smaller classes, by manually curating further compounds from the literature, and by assigning each compound to at most one class. Importantly, these putative MOA labels were further filtered to retain only the compounds where the pattern of GI50 scores across cell lines was consistent within one MOA class. In particular, a Random Forest was used to classify the MOA classes based on their log GI50 values (dataset: no GI50 defaults allowed, max. 30/60 missing values tolerated per compound; GI50 scores of each compound were standardized to mean = 0 and sd = 1.0 across the cell lines; 938 compounds of known MOA meet these criteria).
The Random Forest cross-validation predictions (by the out-of-bag method) were examined and all compounds marked as classification errors were discarded. In other words, if a compound's differential GI50 pattern did not agree with other compounds sharing its putative MOA class, that MOA assignment was removed, leaving 714 compounds with known MOA at this step. The supplied SMILES molecular structures for the subset of compounds with known MOA are also from the DTP website, where the SMILES for the complete set of compounds in the NCI-60 screen can be found. References. [1] Shoemaker RH. The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6:813-823, 2006. [2] Rabow AA, Shoemaker RH, Sausville EA, Covell DG (2002) Mining the National Cancer Institute's tumor-screening database: identification of compounds with similar cellular activities. J Med Chem 45:818-840. [3] Huang R, Wallqvist A, Covell DG (2006) Assessment of in vitro and in vivo activities in the National Cancer Institute's anticancer screen with respect to chemical structure, target specificity, and mechanism of action. J Med Chem 49(6):1964-1979. [4] http://dtp.nci.nih.gov/webdata.html [5] http://www.dtp.nci.nih.gov/docs/cancer/searches/standard_mechanism.html
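The label-filtering step can be sketched as follows. This is a hedged illustration, not the authors' original code; it assumes a compounds-by-cell-lines table `gi50` and a label series `moa`, and uses scikit-learn's out-of-bag predictions to drop compounds whose predicted MOA disagrees with the putative label.

```python
# Hedged sketch of per-compound standardization plus out-of-bag label filtering.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def filter_moa_labels(gi50: pd.DataFrame, moa: pd.Series, seed: int = 0) -> pd.Index:
    # Row-wise standardization: each compound scaled to mean 0, sd 1 across cell lines.
    X = gi50.sub(gi50.mean(axis=1), axis=0).div(gi50.std(axis=1), axis=0)
    X = X.fillna(0.0)  # simple handling of remaining missing values, for the sketch only

    rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=seed, n_jobs=-1)
    rf.fit(X.values, moa.values)

    # Out-of-bag class prediction for each training compound.
    oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
    consistent = oob_pred == moa.values
    return gi50.index[consistent]   # compounds whose putative MOA label is retained
```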
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This article compares different missing data methods in randomized controlled trials, specifically addressing cases involving joint missingness in the outcome and covariates. In the existing literature, it is still unclear how advanced methods like the linear mixed model (LMM) and multiple imputation (MI) perform in comparison to simpler methods regarding the estimation of treatment effects and their standard errors. We therefore evaluate the performance of LMM and MI against simple alternatives across a wide range of simulation scenarios covering various realistic missingness mechanisms. The results show that no single method universally outperforms the others. However, LMM followed by MI demonstrates superior performance across most missingness scenarios. Interestingly, a simple method that combines complete case analysis for the missing outcome and mean imputation for the missing covariate (CCAME) performs similarly to LMM and MI. All methods are furthermore compared in the context of a randomized controlled trial on chronic obstructive pulmonary disease.
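A toy illustration of two of the compared approaches is sketched below. It is not the article's simulation code: it uses scikit-learn's IterativeImputer as a stand-in for multiple imputation, pools only the point estimates, and assumes a single-follow-up trial with one baseline covariate.

```python
# Hedged toy comparison of (a) complete-case analysis for missing outcome plus
# mean imputation for the missing covariate (CCA-ME) and (b) multiple imputation.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 500
treat = rng.integers(0, 2, n)
x = rng.normal(size=n)
y = 0.5 * treat + 0.8 * x + rng.normal(size=n)     # true treatment effect 0.5
x = np.where(rng.random(n) < 0.2, np.nan, x)       # 20% covariate missing (MCAR)
y = np.where(rng.random(n) < 0.2, np.nan, y)       # 20% outcome missing (MCAR)
df = pd.DataFrame({"y": y, "treat": treat, "x": x})

def ancova_effect(d):
    """Treatment effect from an ANCOVA-style regression of y on treatment and x."""
    X = sm.add_constant(d[["treat", "x"]])
    return sm.OLS(d["y"], X).fit().params["treat"]

# (a) CCA-ME: drop rows with missing outcome, mean-impute the covariate.
d = df.copy()
d["x"] = d["x"].fillna(d["x"].mean())
print("CCA-ME estimate:", ancova_effect(d.dropna(subset=["y"])))

# (b) MI: impute m datasets and average the point estimates.
ests = []
for m in range(10):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    d_imp = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    ests.append(ancova_effect(d_imp))
print("MI pooled estimate:", np.mean(ests))
```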
Phenotypic data from time-to-knock-down assays performed at different timepoints during the larval and adult stages of Drosophila melanogaster. Experimental insects were placed in wells within experimental arenas, which were then placed in an incubator set to a high temperature to record the knock-down time. See the Materials and Methods section of the associated publication for further details.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Summary
Facial expression is one of the most natural ways for human beings to communicate emotional information in daily life. While the neural mechanisms of facial expression have been widely studied using lab-controlled images and a small number of video stimuli, how the human brain processes naturalistic dynamic facial expressions still needs to be explored; this type of data is currently missing. Here, we describe the Naturalistic Facial Expressions Dataset (NFED), a large-scale dataset of whole-brain functional magnetic resonance imaging (fMRI) responses to 1,320 short (3 s) facial expression video clips. The video clips encompass three expressive categories: positive, neutral, and negative. We validated that the dataset has good quality within and across participants and, notably, can encode temporal and spatial stimulus features in the brain. NFED provides researchers with fMRI data that can be used not only to explore the neural mechanisms underlying the processing of emotional information conveyed by naturalistic dynamic facial expression stimuli, but also to more effectively characterize the relationship between perceived stimuli and brain responses through neural encoding and neural decoding.
Data record
The data were organized according to the Brain Imaging Data Structure (BIDS) specification, version 1.0.2, and can be accessed from the OpenNeuro public repository (accession number: XXX). In short, raw data of each subject were stored in “sub-
Stimulus: The stimuli for the different fMRI experiments are stored in different folders: “stimuli”, “stimuli/floc”, and “stimuli/prf”. The category labels and metadata corresponding to the video stimuli are stored in the “stimuli/video-category-labels” and “stimuli/video-metadata” directories, respectively.
Raw MRI data: Each participant's folder comprises 11 session folders: “sub-
FreeSurfer recon-all: The results of cortical surface reconstruction were saved as “derivatives/recon-all-FreeSurfer/sub-
Pre-processed volume data: The pre-processed volume-based fMRI data were saved as “pre-processed_volume_data/sub-
Pre-processed surface-based data: The pre-processed surface-based data were saved as “derivatives/volumetosurface/sub-
Brain activation data from surface-based GLM analyses: The GLM analysis data of the main experiment for each scan session were saved as “derivatives/GLM/sub-
Technical validation of the NFED: The results of the technical validation were saved in “derivatives/validation/results” for each participant. The code for the technical validation was saved in “derivatives/validation/code”.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Small-conductance Ca²⁺-activated K⁺ channels (SK, KCa2) are gated solely by intracellular microdomain Ca²⁺. The channel has emerged as a therapeutic target for cardiac arrhythmias. Calmodulin (CaM) interacts with the CaM-binding domain (CaMBD) of the SK channels, serving as the obligatory Ca²⁺ sensor to gate the channels. In heterologous expression systems, phosphatidylinositol 4,5-bisphosphate (PIP2) coordinates with CaM in regulating SK channels. However, the roles and mechanisms of PIP2 in regulating SK channels in cardiomyocytes remain unknown. Here, optogenetics and magnetic nanoparticles, combined with Rosetta structural modeling and molecular dynamics (MD) simulations, revealed the atomistic mechanisms of how PIP2 works in concert with Ca²⁺-CaM in SK channel activation. Our computational study affords evidence for the critical role of the amino acid residue R395 in the S6 transmembrane domain, which is localized in propinquity to the intracellular hydrophobic gate. This residue forms a salt bridge with residue E398 in the S6 transmembrane domain of the adjacent subunit. Both R395 and E398 are conserved in all known isoforms of SK channels. Our findings suggest that the binding of PIP2 to residue R395 disrupts the R395:E398 salt bridge, increasing the flexibility of the transmembrane segment S6 and the activation of the channel. Importantly, our findings serve as a new platform for testing structure-based drug designs for therapeutic inhibitors and activators of the SK channel family. The study is timely since inhibitors of SK channels are currently in clinical trials to treat atrial arrhythmias. Methods: See the PNAS publication for a full description and references. Computational modeling of the hSK2 channel to generate starting structures for MD simulations: The generation of structural models for the hSK2 channel was achieved in three stages using Rosetta molecular modeling (Online Supplemental Fig. S1), as described below. Electron density map refinement: In the first stage, we refined and converted three states of the hSK4-CaM cryo-EM structures (PDB IDs: 6CNM, 6CNN, and 6CNO) into Rosetta-optimized energy with Rosetta 2021 software. Cryo-EM refinement was performed with only side chains being optimized. This conversion aligned the energy terms of these structures with Rosetta's scoring function (Score.gd2), leading to improved homology modeling. We used a modified version of the Rosetta demo included in the Rosetta software (https://new.rosettacommons.org/demos/latest/public/electron_density_structure_refinement/structure_refinement) (7, 8). Homology modeling of hSK2 channels in closed, intermediate, and open states: In the second stage, we employed the modified Rosetta Comparative Modeling (RosettaCM) protocol to generate hSK2 homology models (8-15). This involved conducting sequence alignments between hSK4 and hSK2 channels and subsequently formatting the aligned sequences into a Grishin format suitable for RosettaCM. Regions in hSK2 that were not present in the hSK4 structures, primarily the S3-S4 linker, were modeled using the loop modeling protocol (https://www.rosettacommons.org/docs/latest/application_documentation/structure_prediction/loop_modeling/KIC_with_fragments) (16). The template protocol can be found at this link: (https://new.rosettacommons.org/demos/latest/Home). The top model from the cryo-EM refinement of the hSK4 step was then used as the template for homology modeling of hSK2.
The first attempt was to model only hSK2 and then dock CaM onto the channel. However, the absence of CaM in the models triggered large movements in the CaM-binding domain (CaMBD) in the C-terminal domain of hSK2 (main-text Fig. 2A) due to two factors: (A) the CaMBD is perpendicular to the S1-S6 transmembrane segments of hSK2, and (B) the linker region that connects the CaMBD and S1-S6 is very flexible. In contrast, the inclusion of CaM led to convergence and improved agreement in the top models for the CaMBD (main-text Fig. 2B-D), resulting in a model that is much closer to the template. In our attempt to create reliable homology models, several features were included in the protocol, namely an implicit lipid membrane environment to accurately model membrane-spanning protein segments, enforced symmetry to preserve the four-fold homotetrameric hSK2-CaM complex symmetry, explicit inclusion of metal ions to preserve the Ca²⁺ binding loops, treatment of multiple chains that include hSK2 and CaM, and loop modeling for the flexible regions of the protein missing from the cryo-EM structures. However, an error occurred between the symmetry function and the metal-binding feature; specifically, the Ca²⁺ ions assumed the exact coordinates of the first Cα atom of the first residue of CaM. This necessitated performing the homology modeling without the symmetry function. In addition, we meticulously monitored possible displacement of the backbone carbonyl oxygen atoms in the selectivity filter (SF) of hSK2 during the homology modeling, since deformation of the SF may lead to a non-conducting channel in the MD simulations. Indeed, repulsion of the oxygen atoms resulted in SF deformation at amino acid residue I359, with widening and shortening of the SF. This necessitated the inclusion of harmonic restraints that were determined empirically; a weighted value of 100.0 kcal/mol/Å² was used on the Cα atoms of the backbone of each amino acid residue in the SF. Additionally, the deformation was minimized with the explicit inclusion of K⁺ ions in the SF. Interestingly, we observed that the open state of hSK2 required the largest restraints to maintain the SF structure. A total of 50,000 models were created for each conformational state of hSK2-CaM, and a top model was selected with a standard clustering selection process. Clustering and top model selection: Standard Rosetta clustering was performed for each cryo-EM refinement and homology modeling step, with minor differences in the filtering process based on the models that were used as inputs. The cryo-EM refinement output structures were filtered first by sorting on the “r15” term (REF2015 terms: SCORE - elec_dens_fast) and keeping the lowest 50% of r15-ranked decoys. The resulting list of decoys was then sorted by the elec_dens_fast term, and the lowest 20% were kept. This final selection was then clustered with a radius determined empirically to provide a distribution in which the most decoys (~30%) were in the first cluster and subsequent clusters had progressively fewer decoys. An attempt was made to obtain upwards of 90% of all models in the first 20 clusters. The centers of the top 10 clusters were then evaluated for retention of the critical structural features described above. The top model of these 10 was used as a template for the homology modeling. The results of the homology modeling were sorted by the total score term, with the lowest 10% used for clustering.
Clustering of the homology models was performed in the same manner as for the cryo-EM refinement, with the top model used as the starting structure for MD simulations. Molecular dynamics (MD) simulations: The final stage involved molecular dynamics (MD) simulations. The hSK2 models derived from homology modeling were initially visualized without a membrane and then embedded into a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) lipid bilayer and solvated by 0.15 M KCl aqueous solution using CHARMM-GUI (17-19) (Fig. 3A). Online Supplemental Table S1 provides a summary of 18 5-µs-long MD simulations on the Anton 2 supercomputer (20) of the hSK2-CaM complex in a POPC membrane, with or without the mono-protonated state of phosphatidylinositol-(4,5)-bisphosphate (PIP2), protonated on the P4 oxygen atom (SAPI24), at 2.5, 5, and 10% in the lower leaflet of the lipid bilayer. The concentrations of PIP2 were chosen based on recent estimates that PIP2 in the lower leaflet of the lipid bilayer can be as high as 2-5% (21). In addition, we performed a total of 27 simulations of 1 µs each using either NAMD 3.0 alpha (22) or AMBER18 (23) on the high-performance computing (HPC) EXPANSE platform (San Diego Supercomputer Center at the University of California, San Diego), with computational time granted through the Extreme Science and Engineering Discovery Environment, XSEDE (now Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support, ACCESS). MD simulations were run in the NPT ensemble at 310 K and 1 atm pressure using tetragonal periodic boundary conditions, with a standard set of non-bonded cutoffs and other options as in our previous studies (24, 25). The all-atom CHARMM36m protein (26), C36 lipid (27, 28), and TIP3P water (29) force fields were used. Each MD simulation system was equilibrated for 2.27 ns with the suggested, gradually diminishing positional and dihedral restraints provided by CHARMM-GUI scripts. Due to the relatively large size of the hSK2-CaM protein complex in our MD simulations, and to ensure its conformational stability (as described above), a follow-up extended equilibration MD simulation was performed for 100 ns with gradually reduced restraints on the protein backbone atoms of the hSK2-CaM complex and its components, as shown in main-text Fig. 3B. The extended equilibration protocol was performed on the EXPANSE platform with AMBER18 (Fig. 3B). The protein backbone restraints were maintained with a force constant of 1.0 kcal·mol⁻¹·Å⁻² during the initial environment equilibration until the start of the extended equilibration stage, when the restraints were successively reduced in a 5 or 10 ns stepwise fashion, starting from the periphery of the protein and moving towards the center of the protein over a period of 100 ns (main-text Fig. 3B). The extended equilibration protocol was determined empirically to maintain the stability of essential structural features such as the pore domain or SF. Each successive step of either 5 or 10 ns was determined based on
License: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.7910/DVN/QRZOQ3
In this paper we lay the groundwork for a robust cross-device comparison of data-driven disruption prediction algorithms on the DIII-D and JET tokamaks. To consistently carry out a comparative analysis, we define physics-based indicators of disruption precursors based on temperature, density, and radiation profiles, which are currently missing for DIII-D data. These profile-based indicators are shown to describe well the impurity accumulation events in both DIII-D and JET discharges that eventually disrupt. Thanks to the univariate analysis of the features used in such data-driven applications on both tokamaks, we are able to statistically highlight differences in the dominant disruption precursors: JET, with its ITER-like wall, is more prone to impurity accumulation events, while DIII-D is more subject to edge-cooling mechanisms that destabilize dangerous MHD modes. Even though the analyzed datasets are characterized by such intrinsic differences, we show how data-driven algorithms trained on one device can be used to predict and interpret disruptive scenarios on the other. As long as the destabilizing precursors are diagnosed in a device-independent way, the knowledge that data-driven algorithms learn on one device can be used to explain disruptive behavior on another device.
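As an illustration of what a profile-based indicator can look like, the sketch below computes a simple core-versus-edge peaking factor from a 1-D radial profile. The names, core/edge split, and toy profiles are assumptions for illustration, not the paper's definitions.

```python
# Hedged illustration of a profile-based precursor indicator: a "peaking factor"
# computed from a 1-D radial profile, e.g. electron temperature Te(rho).
import numpy as np

def peaking_factor(rho: np.ndarray, profile: np.ndarray, core_rho: float = 0.3) -> float:
    """Ratio of the core average to the whole-profile average of a quantity."""
    core = profile[rho <= core_rho].mean()
    return core / profile.mean()

# Toy example: a hollow temperature profile (core cooler than mid-radius),
# as might accompany an impurity accumulation event, gives a lower peaking factor.
rho = np.linspace(0.0, 1.0, 101)
te_peaked = 3.0 * (1 - rho**2) + 0.1
te_hollow = 3.0 * (1 - (rho - 0.4)**2) + 0.1
print("peaked profile:", round(peaking_factor(rho, te_peaked), 2))
print("hollow profile:", round(peaking_factor(rho, te_hollow), 2))
```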
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Local adaptation is facilitated by loci clustered in relatively few regions of the genome, termed genomic islands of divergence. The mechanisms that create and maintain these islands, and how they contribute to adaptive divergence, are an active research topic. Here, we use sockeye salmon as a model to investigate both the mechanisms responsible for creating islands of divergence and the patterns of differentiation at these islands. Previous research suggested that multiple islands contributed to the adaptive radiation of sockeye salmon. However, the low-density genomic methods used by these studies made it difficult to fully elucidate the mechanisms responsible for the islands and to connect genotypes to adaptive variation. We used whole-genome resequencing to genotype millions of loci to investigate patterns of genetic variation at islands and the mechanisms that potentially created them. We discovered 64 islands, including 16 clustered in four genomic regions shared between two isolated populations. Characterization of these four regions suggested that three were likely created by structural variation, while one was created by processes not involving structural variation. All four regions were small (< 600 kb), suggesting low-recombination regions do not have to span megabases to be important for adaptive divergence. Differentiation at islands was not consistently associated with established population attributes. In sum, the landscape of adaptive divergence and the mechanisms that create it are complex; this complexity likely helps to facilitate fine-scale local adaptation unique to each population.
Methods
Sampling design
We resequenced genomes of sockeye salmon from seven populations in Southwest Alaska, USA (these samples are a subset of those analyzed in Larson et al., 2019). Fin clips from 27 individuals per population (189 individuals total) were obtained from three lake-type spawning populations in each of the Kvichak River and Wood River drainages, as well as one putatively ancestral sea/river population in the Nushagak River drainage. Lake-type samples were further subdivided into the following groups based on spawning habitat: mainland beaches, island beaches, creeks, and rivers. Mainland and island beaches are similar, except island beaches are found in the middle of lakes where they are highly affected by wind and wave action (Stewart et al., 2003). Creeks are narrow (< 5 m wide) and shallow (< 0.5 m deep on average), while rivers are wide (> 30 m), deep (> 0.5 m), and fast flowing (Quinn et al., 2001). All samples were collected from spawning adults by the Alaska Department of Fish and Game between 1999 and 2013 and provided as extracted DNA (extracted with Qiagen DNeasy Blood and Tissue Kits, Hilden, Germany).
Whole genome library preparation and sequencing
Libraries were prepared according to Baym et al. (2015) and Therkildsen and Palumbi (2017) with the following modifications. Input DNA was normalized to 10 ng for each individual. Steps for 96-well AMPure XP (Beckman Coulter; Brea, CA) purification; product quantification, normalization, and pooling; and size selection were replaced with a SequalPrep (ThermoFisher Scientific, Waltham, MA, USA) normalization and pooling protocol, similar to that used in GT-seq (Campbell et al., 2015). We used three SequalPrep plates for each of the two 96-well tagmented and adaptor-ligated DNA library plates and pooled the full eluate per individual DNA library to increase total yield. Normalized pooled libraries were subjected to a 0.6X size selection, purification, and volume concentration with AMPure XP following Therkildsen and Palumbi (2017). In-house QC consisted of visualization on a precast 2% agarose E-Gel (ThermoFisher Scientific) and quantification with a Qubit HS dsDNA Assay Kit (ThermoFisher Scientific). We constructed two libraries, each containing 96 individuals, and each library was sequenced on three NovaSeq S4 lanes (six lanes total) at Novogene (Sacramento, CA, USA).
Genotype calling and quality control
Variants and genotypes were called using the Genome Analysis Toolkit (GATK) version 4.1.7 (DePristo et al., 2011; McKenna et al., 2010) and a protocol that closely followed Christensen et al. (2020). Paired-end reads were aligned to the sockeye salmon genome (GCF_006149115.2; Christensen et al., 2020) with BWA-MEM v.0.7.17 (Li, 2013) and indexed and sorted with Samtools v.1.10 (Li et al., 2009). Next, read groups for each alignment file (bam file) were assigned using Picard v.2.22.6 (AddOrReplaceReadGroups; http://broadinstitute.github.io/picard). Individual bam files produced on separate sequencing lanes were merged, and PCR duplicates were marked using the MarkDuplicates function from Picard with stringency set to “LENIENT”. Individual genomic VCF files (gvcf) were generated from the alignments using HaplotypeCaller from GATK. A single database containing all individual gvcf files was created using GenomicsDBImport from GATK. Once the variants from all individuals had been added to the database, joint genotyping was conducted using the GenotypeGVCFs function. The resulting variant file (vcf) was then hard filtered using the VariantFiltration function (filter expression: QD < 2.0 || FS > 60.0 || SOR < 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0). All variants that passed the hard filter were used in conjunction with three datasets used previously as truth datasets by Christensen et al. (2020) for GATK's VariantRecalibrator function. The tranches file generated by VariantRecalibrator was subsequently used as the input for the ApplyVQSR function to produce a recalibrated vcf file, which was subjected to additional variant filtration in VCFtools v.0.1.16 (parameters: --maf 0.05, --max-alleles 2, --min-alleles 2, --max-missing 0.9, --remove-filtered-all, --remove-indels; Danecek et al., 2011). Finally, loci with an allele balance of less than 0.2 were marked. The resulting vcf file constituted our baseline file for all other analyses and downstream processing.
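The hard-filter expression quoted above can be read as the following predicate over per-site annotations. This is an illustrative sketch, not part of the study's pipeline; the field names are standard GATK annotations, and treating a missing annotation as not triggering its clause is an assumption that mirrors typical behavior.

```python
# Hedged sketch of the quoted hard-filter logic as a Python predicate.
from typing import Mapping, Optional

def fails_hard_filter(info: Mapping[str, Optional[float]]) -> bool:
    """Return True if a site would be flagged by the quoted filter expression."""
    def lt(key, thresh):
        v = info.get(key)
        return v is not None and v < thresh
    def gt(key, thresh):
        v = info.get(key)
        return v is not None and v > thresh
    return (lt("QD", 2.0) or gt("FS", 60.0) or lt("SOR", 3.0)
            or lt("MQRankSum", -12.5) or lt("ReadPosRankSum", -8.0))

# Example: a site with strong strand bias (FS = 75) fails the filter.
print(fails_hard_filter({"QD": 10.0, "FS": 75.0, "SOR": 3.2}))
```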
Creating a merged dataset
Because the islands of divergence we identified were consistent among spatially isolated drainages in Alaska, we hypothesized that these regions may be conserved in other sockeye populations. To test this, we merged the dataset generated in the present study with whole-genome data from 78 sockeye salmon (kokanee excluded) from Christensen et al. (2020). This dataset was sequenced to a similar depth of coverage and was processed using an almost identical GATK4 pipeline. The dataset included 16 spawning populations that we grouped into five drainage regions: Bristol Bay (N = 12 individuals), Fraser/Columbia river basins (N = 47), Gulf of Alaska (N = 8), Northern British Columbia (N = 9), and Russia (N = 2). The variants identified in Christensen et al. (2020) were merged with ours using bcftools v.1.11 (Danecek et al., 2021) by retaining variants that intersected between the two datasets, had a genotyping rate > 80%, and were positioned within one of the refined haploblock regions.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Heart failure (HF) is the final stage in the development of various heart diseases. The mortality rates of HF patients are highly variable, ranging from 5% to 75%. Evaluating the all-cause mortality of HF patients is an important means to avoid death and positively affect patient health. In practice, however, machine learning models struggle to achieve good results on HF data with missing values, high dimensionality, and class imbalance. A deep learning system is therefore proposed. In this system, we propose an indicator vector that marks whether each value is observed or padded, which quickly resolves missing values and helps expand the data dimensions. Then, we use a convolutional neural network with different kernel sizes to extract feature information, and a multi-head self-attention mechanism is applied to capture whole-channel information, which is essential for improving the system's performance. In addition, the focal loss function is introduced to better handle the class imbalance. The experimental data come from the public MIMIC-III database and contain valid records for 10,311 patients. The proposed system effectively and quickly predicts four death types: death within 30 days, death within 180 days, death within 365 days, and death after 365 days. Our study uses Deep SHAP to interpret the deep learning model and obtains the top 15 characteristics. These characteristics further confirm the effectiveness and rationality of the system and help provide better medical service.
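Two ingredients of the described system, the observed/padded indicator vector and the focal loss, can be sketched as follows. This NumPy illustration is not the authors' implementation, and the focal-loss parameters are common defaults rather than values from the paper.

```python
# Hedged sketch of (1) an indicator vector marking observed vs. padded values
# and (2) a binary focal loss that down-weights easy examples under imbalance.
import numpy as np

def pad_with_indicator(x: np.ndarray) -> np.ndarray:
    """Replace NaNs with 0 and append a 0/1 observed-value indicator,
    doubling the feature dimension as described in the abstract."""
    indicator = (~np.isnan(x)).astype(float)
    padded = np.nan_to_num(x, nan=0.0)
    return np.concatenate([padded, indicator], axis=-1)

def focal_loss(p: np.ndarray, y: np.ndarray, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: -alpha*(1-p)^gamma*y*log(p) - (1-alpha)*p^gamma*(1-y)*log(1-p)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    loss = -(alpha * (1 - p) ** gamma * y * np.log(p)
             + (1 - alpha) * p ** gamma * (1 - y) * np.log(1 - p))
    return float(loss.mean())

x = np.array([[1.2, np.nan, 0.4], [np.nan, 2.0, np.nan]])
print(pad_with_indicator(x))                     # 3 features become 6
print(focal_loss(np.array([0.9, 0.2, 0.6]), np.array([1, 0, 1])))
```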
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Full list of features obtained, including training features, identification, and label, with features containing missing values filtered out before training.