7 datasets found
  1. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Explore at:
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG lung cancer dataset: survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well a patient can perform usual daily activities.
    2. CNV measurements of GBM: this dataset records copy number variation (CNV) in glioblastoma (GBM).

    Abstract: In biology and medicine, conservative patient and data collection practices can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of cancer survival prediction. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called "imputation". Existing imputation methods work by modelling the mechanism behind the missing values, and they perform well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. Neither assumption typically holds for biomedical datasets such as the Cancer Genome Atlas glioblastoma copy-number variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group lung cancer dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: Last Observation Carried Forward (LOCF) and K-Nearest Neighbours (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, and failed when the dataset contained 45% missing data. The quality of values imputed with existing methods is poor because the two assumptions are not met. In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values.

    An RBM is an undirected, probabilistic, parameterized two-layer neural network model, often used for extracting abstract information from data, especially high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to the two cancer datasets, NCCTG and TCGA, and measured the running time and root mean squared error (RMSE) of the different methods. On the NCCTG dataset, our method performs better than the other methods when 5% of the data is missing, with an RMSE 4.64 lower than the best KNN result; on the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN result. In addition to imputation, the RBM can make predictions simultaneously. We compared the RBM model with four traditional prediction methods, measuring running time and area under the curve (AUC). Our RBM-based approach outperformed the traditional methods: the AUC was up to 19.8% higher than a multivariate logistic regression model on the NCCTG lung cancer dataset, and 28.1% higher than a Cox proportional hazards regression model on the TCGA dataset. Beyond imputation and prediction, RBM models can detect outliers in one pass, by reconstructing all inputs in the visible layer in a single backward pass. Our results show that RBM models achieve higher precision and recall in outlier detection than other methods.
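As a rough illustration of the general approach described above (a sketch of the technique, not the authors' implementation), a Bernoulli RBM can be trained with one-step contrastive divergence and then used for imputation by clamping the observed values and letting Gibbs sampling fill in the missing ones. A minimal NumPy sketch on toy binary data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible bias
        self.c = np.zeros(n_hidden)   # hidden bias
        self.lr = lr

    def _h_given_v(self, v):
        return sigmoid(v @ self.W + self.c)

    def _v_given_h(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def fit(self, V, epochs=200):
        for _ in range(epochs):
            ph0 = self._h_given_v(V)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            v1 = self._v_given_h(h0)          # one-step reconstruction
            ph1 = self._h_given_v(v1)
            self.W += self.lr * (V.T @ ph0 - v1.T @ ph1) / len(V)
            self.b += self.lr * (V - v1).mean(axis=0)
            self.c += self.lr * (ph0 - ph1).mean(axis=0)

    def impute(self, v, missing, n_steps=50):
        """Fill missing entries by Gibbs sampling while keeping observed values fixed."""
        v = v.copy()
        v[missing] = 0.5                      # neutral start for unknown entries
        for _ in range(n_steps):
            ph = self._h_given_v(v)
            h = (rng.random(ph.shape) < ph).astype(float)
            v[missing] = self._v_given_h(h)[missing]  # update only missing entries
        return v

# Toy data: three perfectly correlated binary columns (col1 = col0, col2 = 1 - col0)
base = (rng.random((500, 1)) < 0.5).astype(float)
X = np.hstack([base, base, 1.0 - base])

rbm = TinyRBM(n_visible=3, n_hidden=4)
rbm.fit(X)

x = np.array([1.0, 0.0, 0.0])
missing = np.array([False, True, False])      # middle value unknown
imputed = rbm.impute(x, missing)
print(imputed)                                # observed entries stay fixed; middle is in [0, 1]
```

The same clamped-reconstruction pass is what makes single-pass outlier detection possible: entries whose reconstruction differs strongly from their observed value are outlier candidates.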
  2. Wealth and Assets Survey, Waves 1-5 and Rounds 5-8, 2006-2022

    • beta.ukdataservice.ac.uk
    Updated 2025
    Cite
    Social Survey Division Office For National Statistics (2025). Wealth and Assets Survey, Waves 1-5 and Rounds 5-8, 2006-2022 [Dataset]. http://doi.org/10.5255/ukda-sn-7215-20
    Explore at:
    Dataset updated
    2025
    Dataset provided by
    UK Data Service: https://ukdataservice.ac.uk/
    datacite
    Authors
    Social Survey Division Office For National Statistics
    Description

    The Wealth and Assets Survey (WAS) is a longitudinal survey, which aims to address gaps identified in data about the economic well-being of households by gathering information on level of assets, savings and debt; saving for retirement; how wealth is distributed among households or individuals; and factors that affect financial planning. Private households in Great Britain were sampled for the survey (meaning that people in residential institutions, such as retirement homes, nursing homes, prisons, barracks or university halls of residence, and also homeless people were not included).

    The WAS commenced in July 2006, with a first wave of interviews carried out over two years, to June 2008. Interviews were achieved with 30,595 households at Wave 1. Those households were approached again for a Wave 2 interview between July 2008 and June 2010, and 20,170 households took part. Wave 3 covered July 2010 - June 2012, Wave 4 covered July 2012 - June 2014 and Wave 5 covered July 2014 - June 2016. Revisions to previous waves' data mean that small differences may occur between originally published estimates and estimates from the datasets held by the UK Data Service. Data are revised on a wave by wave basis, as a result of backwards imputation from the current wave's data. These revisions are due to improvements in the imputation methodology.

    Note from the WAS team - November 2023:

    “The Office for National Statistics has identified a very small number of outlier cases present in the seventh round of the Wealth and Assets Survey covering the period April 2018 to March 2020. Our current approach is to treat cases where we have reasonable evidence to suggest the values provided for specific variables are outliers. This approach did not occur for two individuals for several variables involved in the estimation of their pension wealth. While we estimate any impacts are very small overall and median pension wealth and median total wealth estimates are unaffected, this will affect the accuracy of the breakdowns of the pension wealth within the wealthiest decile, and data derived from them. We are urging caution in the interpretation of more detailed estimates.”

    Survey Periodicity - "Waves" to "Rounds"
    Due to the survey periodicity moving from “Waves” (starting in July and ending in June two years later) to “Rounds” (starting in April and ending in March two years later), interviews using the ‘Wave 6’ questionnaire started in July 2016 and were conducted for 21 months, finishing in March 2018. Data for Round 6 covers the period April 2016 to March 2018, comprising the last three months of Wave 5 (April to June 2016) and 21 months of Wave 6 (July 2016 to March 2018). The Round 5 and Round 6 datasets are therefore based on a mixture of the original wave-based datasets. Each wave of the survey has a unique questionnaire, so each of these round-based datasets is based on two questionnaires. While there may be some changes between the questionnaires, the derived variables for the key wealth estimates have not changed over this period; the aim is to collect the same data, though in some cases the exact questions asked may differ slightly. Detailed information on moving the Wealth and Assets Survey onto a financial-years basis was published on the ONS website in July 2019.

    A Secure Access version of the WAS, subject to more stringent access conditions, is available under SN 6709; it contains more detailed geographic variables than the EUL version. Users are advised to download the EUL version first (SN 7215) to see if it is suitable for their needs, before considering making an application for the Secure Access version.

    Further information and documentation may be found on the ONS Wealth and Assets Survey webpage (https://www.ons.gov.uk/economy/nationalaccounts/uksectoraccounts/methodologies/wealthandassetssurveywas). Users are advised to check the page for updates before commencing analysis.

    Occupation data for 2021 and 2022 data files

    The ONS have identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. For further information on this issue, please see: https://www.ons.gov.uk/news/statementsandletters/occupationaldatainonssurveys.

    The data dictionary for the Round 8 person file is not available.

    Latest edition information

    For the 20th edition (May 2025), the Round 8 data files were updated to include variables personr7, nounitsr8 and porage1tar8, and derived binary versions of multi-choice questions, their collected equivalents and imputed binary versions of these variables. Also, variables that were only collected for part of the round have been removed. Additional documentation for Round 8 was also added to the study, including an updated variable list and derived variable specifications.

  3. Data from: Tolerated outlier prediction method of excavation damaged zone thickness of drift based on interpretable SOA-QRF ensemble learning

    • tandf.figshare.com
    xlsx
    Updated Dec 2, 2024
    Cite
    Yaxi Shen; Shunchuan Wu; Yongbing Wang; Jiaxin Wang; Shuxian Wang; Shigui Huang (2024). Tolerated outlier prediction method of excavation damaged zone thickness of drift based on interpretable SOA-QRF ensemble learning [Dataset]. http://doi.org/10.6084/m9.figshare.25585923.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Yaxi Shen; Shunchuan Wu; Yongbing Wang; Jiaxin Wang; Shuxian Wang; Shigui Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drift excavation induces excavation damaged zones (EDZ) through stress redistribution, affecting drift stability and rock deformation support. Predicting EDZ thickness is crucial, but traditional machine learning models are susceptible to potential outliers in the dataset, and directly eliminating outliers impairs training effectiveness. This study introduces an EDZ thickness prediction model that combines quantile loss with a random forest (RF) optimised by the seagull optimisation algorithm (SOA), enabling median regression that tolerates outliers. A dataset of 209 samples from 34 mine boreholes was used to establish the prediction model. Evaluation using R2, explained variance score (EVS), mean absolute error (MAE), and mean squared error (MSE) demonstrates the superior accuracy of the proposed SOA-QRF model compared to traditional models. The discussion of outlier treatment indicates that the SOA-QRF model is better suited to datasets containing outliers and can deliver outlier-tolerant predictions. Additionally, three interpretation methods were used to explain the SOA-QRF model, enhancing the transparency of its prediction process and facilitating the analysis of dispatcher regulation.
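The "tolerated outlier" behaviour comes from the quantile (pinball) loss that underlies quantile regression forests: with τ = 0.5 it performs median regression, penalising an outlier residual linearly rather than quadratically. A small illustrative sketch (not the authors' code):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.5):
    """Quantile (pinball) loss: tau = 0.5 gives median regression,
    which penalises an outlier residual linearly, not quadratically."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last observation is an outlier
y_pred = np.array([1.0, 2.0, 3.0, 3.0])     # a fit that ignores the outlier

loss_q = pinball_loss(y_true, y_pred)             # 0.5 * mean(|residual|) = 12.125
loss_sq = float(np.mean((y_true - y_pred) ** 2))  # squared error blows up: 2352.25
print(loss_q, loss_sq)
```

Because the outlier's contribution grows only linearly, a median-regression model is not dragged toward it, which is why outliers can be kept in the training set rather than removed.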

  4. Relevant training splits of the VessMAP dataset

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 27, 2025
    Cite
    Matheus Viana da Silva; Natália de Carvalho Santos; Julie Ouellette; Baptiste Lacoste; Cesar H. Comin (2025). Relevant training splits of the VessMAP dataset. Each row shows the average Dice score obtained on the remaining 96 samples when training a neural network using the four images indicated in the training set column. The standard deviation obtained across five repetitions of the training runs is also shown. [Dataset]. http://doi.org/10.1371/journal.pone.0322048.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    May 27, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Matheus Viana da Silva; Natália de Carvalho Santos; Julie Ouellette; Baptiste Lacoste; Cesar H. Comin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Relevant training splits of the VessMAP dataset. Each row shows the average Dice score obtained on the remaining 96 samples when training a neural network using the four images indicated in the training set column. The standard deviation obtained across five repetitions of the training runs is also shown.
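The Dice score reported in each row measures overlap between a predicted and a reference segmentation mask, 2|A ∩ B| / (|A| + |B|). A minimal sketch of the metric:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|).
    eps avoids division by zero when both masks are empty."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Two small example masks agreeing on 2 foreground pixels (3 foreground each)
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
d = dice_score(a, b)
print(round(d, 3))   # 2*2 / (3+3) ≈ 0.667
```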

  5. Levels of obesity and inactivity related illnesses (physical illnesses): Summary (England)

    • data.catchmentbasedapproach.org
    Updated Apr 7, 2021
    + more versions
    Cite
    The Rivers Trust (2021). Levels of obesity and inactivity related illnesses (physical illnesses): Summary (England) [Dataset]. https://data.catchmentbasedapproach.org/datasets/levels-of-obesity-and-inactivity-related-illnesses-physical-illnesses-summary-england
    Explore at:
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of physical illnesses that are linked with obesity and inactivity. Please read the information below to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to:
    - Asthma (in persons of all ages)
    - Cancer (in persons of all ages)
    - Chronic kidney disease (in adults aged 18+)
    - Coronary heart disease (in persons of all ages)
    - Diabetes mellitus (in persons aged 17+)
    - Hypertension (in persons of all ages)
    - Stroke and transient ischaemic attack (in persons of all ages)
    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.
    For each of the above illnesses, the percentage of each MSOA’s population with that illness was estimated. This was achieved by calculating a weighted average based on:
    - the percentage of the MSOA area that was covered by each GP practice’s catchment area
    - of the GPs that covered part of that MSOA, the percentage of patients registered with each GP that have that illness
    The estimated percentage of each MSOA’s population with each illness was then combined with Office for National Statistics Mid-Year Population Estimates (2019) for MSOAs, to estimate the number of people in each MSOA with each illness, within the relevant age range.
    For each illness, each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:
    A) the PERCENTAGE of the population within that MSOA estimated to have that illness
    B) the NUMBER of people within that MSOA estimated to have that illness
    An average of scores A and B was taken and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer the score is to 1, the greater both the number and the percentage of the population in the MSOA predicted to have that illness, compared to other MSOAs. In other words, these are areas where a large number of people are predicted to suffer from an illness and where those people make up a large percentage of the population, indicating a real issue with that illness within the population, where the investment of resources to address it could have the greatest benefit.
    The scores for the 7 illnesses were added together and converted to a relative score between 1 and 0 (1 = worst, 0 = best), to give an overall score for each MSOA: a score close to 1 indicates that an area has high predicted levels of all obesity/inactivity-related illnesses, and these are the areas where the local population could benefit most from interventions to address those illnesses. A score close to 0 indicates very low predicted levels of obesity/inactivity-related illnesses, and therefore interventions might not be required.

    LIMITATIONS
    1. GPs do not have catchments that are mutually exclusive from each other: they overlap, with some geographic areas covered by 30+ practices. This dataset should be viewed in combination with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset to identify areas covered by multiple GP practices where at least one of those practices did not provide data. Results of the analysis in these areas should be interpreted with caution, particularly if the levels of obesity/inactivity-related illnesses appear significantly lower than in the immediate surrounding areas.
    2. GP data for the financial year 1st April 2018 - 31st March 2019 was used in preference to data for the financial year 1st April 2019 - 31st March 2020, as the onset of the COVID-19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics > mean +/- 1 st. dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further); data in these areas should be interpreted with caution. There are also some rural areas (with little or no population) that do not officially fall within any GP catchment area, although this does not affect the results of the analysis if no people live in those areas.
    3. Although all of the listed obesity/inactivity-related illnesses can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish the cause of the illnesses in patients from the data: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data on levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.
    4. It was not feasible to incorporate the ultra-fine-scale geographic distribution of populations registered with each GP practice or living within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse elsewhere. The dataset should therefore be used to identify general areas with high levels of obesity/inactivity-related illnesses, rather than interpreting the boundaries between areas as ‘hard’ divisions marking definite differences in the levels of these illnesses.

    TO BE VIEWED IN COMBINATION WITH
    This dataset should be viewed alongside the following dataset, which highlights areas of missing data and potential outliers in the data:
    - Health and wellbeing statistics (GP-level, England): Missing data and potential outliers

    DOWNLOADING THIS DATA
    To access this data in your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:
    - Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - GP Catchment Outlines: Copyright © 2020, Health and Social Care Information Centre (NHS Digital). Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement: © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
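The area-weighted averaging step in the methodology above can be sketched as follows. All column names and values here are hypothetical, purely to illustrate the calculation, not the Trust's actual code or data:

```python
import pandas as pd

# Hypothetical overlap table: one row per (MSOA, GP practice) pair.
#   area_frac   = fraction of the MSOA's area covered by that practice's catchment
#   illness_pct = % of that practice's registered patients with the illness
overlaps = pd.DataFrame({
    "msoa":        ["E02000001", "E02000001", "E02000002"],
    "area_frac":   [0.7, 0.3, 1.0],
    "illness_pct": [10.0, 20.0, 5.0],
})

# Area-weighted average of practice-level illness rates per MSOA
overlaps["weighted"] = overlaps["area_frac"] * overlaps["illness_pct"]
grouped = overlaps.groupby("msoa")
est = grouped["weighted"].sum() / grouped["area_frac"].sum()

# Relative 1-0 score across MSOAs (1 = worst, 0 = best), as in the analysis
score = (est - est.min()) / (est.max() - est.min())
print(est)    # E02000001: 0.7*10 + 0.3*20 = 13.0 ; E02000002: 5.0
print(score)
```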

  6. Code and dataset for the budget closure correction method Minimized Series Deviation (MSD)

    • figshare.com
    txt
    Updated Nov 30, 2022
    Cite
    Zengliang Luo; Huan Li (2022). Code and dataset for the budget closure correction method Minimized Series Deviation method (MSD) [Dataset]. http://doi.org/10.6084/m9.figshare.20208026.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 30, 2022
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Zengliang Luo; Huan Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Enforcing terrestrial water budget closure is critical for obtaining consistent datasets of budget components to understand the changes and availability of water resources over time. However, most existing budget closure correction methods (BCCs) are significantly affected by random errors and outliers in the budget-component products. Moreover, existing BCCs do not fully account for the preselection of high-precision input datasets before enforcing water budget closure, introducing uncertainty into the budget-corrected datasets. In this study, a two-step method is proposed to enforce the water budget closure of satellite-based hydrological products: first, high-precision budget-component datasets are selected; second, water budget closure is enforced on the selected datasets using an improved BCC strategy, the Minimized Series Deviation (MSD) method. The performance of the proposed two-step method was verified in 24 global basins by comparing it against three existing BCCs of varying complexity: Proportional Redistribution (PR), Constrained Kalman Filter (CKF), and Multiple Collocation (MCL). The results showed that, compared to the existing BCCs, the proposed two-step method improved the accuracy of the budget-corrected datasets by between 2% and 19% (statistical analysis based on root mean square error (RMSE) and mean absolute error (MAE)). The study also summarizes the main factors influencing the performance of the existing BCCs and their further development prospects, providing insight into the expansion of theories and methods for closing the terrestrial water budget.
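Of the baseline BCCs mentioned, Proportional Redistribution is the simplest to illustrate: the non-closure residual P − ET − R − ΔS is shared among the components in proportion to their magnitudes, so the corrected series closes exactly. A generic sketch of that idea (not the MSD code from this dataset):

```python
import numpy as np

def proportional_redistribution(P, ET, R, dS):
    """Close the water budget P - ET - R - dS = 0 by redistributing the
    residual among the components in proportion to their magnitudes."""
    signed = np.array([P, -ET, -R, -dS], dtype=float)  # signed contributions
    residual = signed.sum()                            # non-closure residual
    weights = np.abs(signed) / np.abs(signed).sum()
    corrected = signed - residual * weights            # corrected contributions sum to 0
    return corrected[0], -corrected[1], -corrected[2], -corrected[3]

# Example components (e.g. mm/month) that leave a 5-unit closure residual
P2, ET2, R2, dS2 = proportional_redistribution(P=100.0, ET=60.0, R=35.0, dS=0.0)
print(P2 - ET2 - R2 - dS2)   # ~0: the corrected budget closes
```

Because each component absorbs a share of the residual regardless of its actual error, a single outlying product contaminates all corrected components, which is the weakness the MSD method and the preselection step aim to address.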

  7. High frequency dataset for event-scale concentration-discharge analysis in a forested headwater 01/2018-08/2023

    • search.dataone.org
    • hydroshare.org
    Updated Sep 21, 2024
    Cite
    Andreas Musolff (2024). High frequency dataset for event-scale concentration-discharge analysis in a forested headwater 01/2018-08/2023 [Dataset]. http://doi.org/10.4211/hs.9be43573ba754ec1b3650ce233fc99de
    Explore at:
    Dataset updated
    Sep 21, 2024
    Dataset provided by
    Hydroshare
    Authors
    Andreas Musolff
    Time period covered
    Jan 1, 2018 - Aug 23, 2023
    Area covered
    Description

    This composite repository contains high-frequency data of discharge, electrical conductivity, nitrate-N, DOC and water temperature obtained in the Rappbode headwater catchment in the Harz mountains, Germany. This catchment was affected by a bark-beetle infestation and forest dieback from 2018 onwards. The data extend previous observations from the same catchment (RB) published as part of Musolff (2020). Details on the catchment can be found in Werner et al. (2019, 2021) and Musolff et al. (2021). The file RB_HF_data_2018_2023.txt states measurements for each timestep using the following columns:
    - "index" (number of observation)
    - "Date.Time" (timestamp in YYYY-MM-DD HH:MM:SS)
    - "WT" (water temperature in degrees Celsius)
    - "Q.smooth" (discharge in mm/d, smoothed using a moving average)
    - "NO3.smooth" (nitrate concentration in mg N/L, smoothed using a moving average)
    - "DOC.smooth" (dissolved organic carbon concentration in mg/L, smoothed using a moving average)
    - "EC.smooth" (electrical conductivity in µS/cm, smoothed using a moving average)
    NA indicates no data.

    Water quality data and discharge was measured at a high-frequency interval of 15 min in the time period between January 2018 and August 2023. Both, NO3-N and DOC were measured using an in-situ UV-VIS probe (s::can spectrolyser, scan Austria). EC was measured using an in-situ probe (CTD Diver, Van Essen Canada). Discharge measurements relied on an established stage-discharge relationship based on water level observations (CTD Diver, Van Essen Canada, see Werner et al. [2019]). Data loggers were maintained every two weeks, including manual cleaning of the UV-VIS probes and grab sampling for subsequent lab analysis, calibration and validation.

    Data preparation included five steps: drift correction, outlier detection, gap filling, calibration and moving averaging.
    - Drift was corrected by distributing the offset between mean values one hour before and after cleaning equally across the two-week maintenance interval as exponential growth.
    - Outliers were detected with a two-step procedure: first, values outside a physically plausible range were removed; second, the Grubbs test was applied to a moving window of 100 values to detect and remove outliers.
    - Data gaps smaller than two hours were filled using cubic spline interpolation.
    - The resulting time series were globally calibrated against lab-measured concentrations of NO3-N and DOC. EC was calibrated against field values obtained with a handheld WTW probe (WTW Multi 430, Xylem Analytics Germany).
    - Noise in the discharge and water-quality signals was reduced by a moving average with a window length of 2.5 hours.
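The windowed Grubbs step can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (non-overlapping windows, one outlier per window, NaN-free input), not the authors' processing code:

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Two-sided Grubbs test: return the index of the most extreme value
    if it is a significant outlier at level alpha, else None. Assumes no NaNs."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    i = int(np.argmax(np.abs(x - mean)))
    G = abs(x[i] - mean) / sd
    t = stats.t.ppf(1.0 - alpha / (2.0 * n), n - 2)
    G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return i if G > G_crit else None

def clean_series(x, window=100):
    """Run the Grubbs test over non-overlapping windows and mask outliers as NaN."""
    x = np.asarray(x, dtype=float).copy()
    for start in range(0, len(x) - window + 1, window):
        i = grubbs_outlier(x[start:start + window])
        if i is not None:
            x[start + i] = np.nan
    return x

rng = np.random.default_rng(1)
series = rng.normal(10.0, 0.5, 200)
series[50] = 30.0                 # inject an implausible spike
cleaned = clean_series(series)
print(np.isnan(cleaned[50]))      # the spike is flagged
```

Masked values would then be candidates for the cubic-spline gap filling described above (for gaps shorter than two hours).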

    References:
    Musolff, A. (2020). High frequency dataset for event-scale concentration-discharge analysis. http://www.hydroshare.org/resource/27c93a3f4ee2467691a1671442e047b8
    Musolff, A., Zhan, Q., Dupas, R., Minaudo, C., Fleckenstein, J. H., Rode, M., Dehaspe, J., & Rinke, K. (2021). Spatial and Temporal Variability in Concentration-Discharge Relationships at the Event Scale. Water Resources Research, 57(10).
    Werner, B. J., Musolff, A., Lechtenfeld, O. J., de Rooij, G. H., Oosterwoud, M. R., & Fleckenstein, J. H. (2019). High-frequency measurements explain quantity and quality of dissolved organic carbon mobilization in a headwater catchment. Biogeosciences, 16(22), 4497-4516.
    Werner, B. J., Lechtenfeld, O. J., Musolff, A., de Rooij, G. H., Yang, J., Grundling, R., Werban, U., & Fleckenstein, J. H. (2021). Small-scale topography explains patterns and dynamics of dissolved organic carbon exports from the riparian zone of a temperate, forested catchment. Hydrology and Earth System Sciences, 25(12), 6067-6086.
