9 datasets found
  1. Preprocessing steps.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 28, 2024
    Hyeong Jun Ahn; Kyle Ishikawa; Min-Hee Kim (2024). Preprocessing steps. [Dataset]. http://doi.org/10.1371/journal.pone.0304785.t001
    Dataset provided by: PLOS ONE
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best to identify the relationship between thyroid function and metabolic phenotypes, achieving an area under receiver operating characteristic curve (AUROC) of 0.818, followed closely by Neural Network (AUROC: 0.814). Following the above, the performance of Random Forest, Boosted Trees, and K Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In Random Forest, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and triiodothyronine/thyroxine ratio were positioned in the upper ranks of variable importance. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it’s important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.
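The AUROC values quoted in this description can be read as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The following is a minimal standard-library sketch of that rank-based definition for the binary case only; the study itself reports a multinomial setting, and the function name `auroc` and the toy data are illustrative assumptions, not the authors' code.

```python
def auroc(labels, scores):
    """Rank-based AUROC for binary labels (1 = positive, 0 = negative).

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; a sort-based implementation would be preferable for large survey datasets.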

  2. Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...

    • frontiersin.figshare.com
    docx
    Updated Jan 15, 2025
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s008
    Dataset provided by: Frontiers
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models. Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics, including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors. Results: XGBoost achieved the highest overall accuracy of 80.21% with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading. Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
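The 80%/20% division described in the methods can be illustrated with a small stratified split that keeps class proportions the same in both parts. The description states that Scikit-learn was used for the split; the sketch below is a standard-library stand-in rather than the authors' code, and the function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration. It does not attempt to reproduce SMOTE.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)  # randomise which members of the class go to the test set
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy, imbalanced labels (hypothetical): 80 samples of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```

Splitting per class matters for imbalanced data: a purely random 20% sample could leave the minority class under-represented in the test set, which would make the reported accuracy harder to interpret.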

  3. BCI competition III dataset 4a classification accuracy (%) with different...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Sep 8, 2023
    Rabia Avais Khan; Nasir Rashid; Muhammad Shahzaib; Umar Farooq Malik; Arshia Arif; Javaid Iqbal; Mubasher Saleem; Umar Shahbaz Khan; Mohsin Tiwana (2023). BCI competition III dataset 4a classification accuracy (%) with different classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0276133.t003
    Dataset provided by: PLOS ONE
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    BCI competition III dataset 4a classification accuracy (%) with different classifiers.

  4. Optimized hyper-parameters from subject ‘a’ of dataset 1.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Sep 8, 2023
    Rabia Avais Khan; Nasir Rashid; Muhammad Shahzaib; Umar Farooq Malik; Arshia Arif; Javaid Iqbal; Mubasher Saleem; Umar Shahbaz Khan; Mohsin Tiwana (2023). Optimized hyper-parameters from subject ‘a’ of dataset 1. [Dataset]. http://doi.org/10.1371/journal.pone.0276133.t001
    Dataset provided by: PLOS ONE
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Optimized hyper-parameters from subject ‘a’ of dataset 1.

  5. Table_1_Machine Learning in Modeling of Mouse Behavior.PDF

    • frontiersin.figshare.com
    pdf
    Updated Jun 8, 2023
    Marjan Gharagozloo; Abdelaziz Amrani; Kevin Wittingstall; Andrew Hamilton-Wright; Denis Gris (2023). Table_1_Machine Learning in Modeling of Mouse Behavior.PDF [Dataset]. http://doi.org/10.3389/fnins.2021.700253.s002
    Dataset provided by: Frontiers
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Mouse behavior is a primary outcome in evaluations of therapeutic efficacy. Exhaustive, continuous, multiparametric behavioral phenotyping is a valuable tool for understanding the pathophysiological status of mouse brain diseases. Automated home cage behavior analysis produces highly granulated data both in terms of number of features and sampling frequency. Previously, we demonstrated several ways to reduce feature dimensionality. In this study, we propose novel approaches for analyzing 33-Hz data generated by CleverSys software. We hypothesized that behavioral patterns within short time windows are reflective of physiological state, and that computer modeling of mouse behavioral routines can serve as a predictive tool in classification tasks. To remove bias due to researcher decisions, our data flow is indifferent to the quality, value, and importance of any given feature in isolation. To classify day and night behavior, as an example application, we developed a data preprocessing flow and utilized logistic regression (LG), support vector machines (SVM), random forest (RF), and one-dimensional convolutional neural networks paired with long short-term memory deep neural networks (1DConvBiLSTM). We determined that a 5-min video clip is sufficient to classify mouse behavior with high accuracy. LG, SVM, and RF performed similarly, predicting mouse behavior with 85% accuracy, and combining the three algorithms in an ensemble procedure increased accuracy to 90%. The best performance was achieved by combining the 1DConv and BiLSTM algorithms yielding 96% accuracy. Our findings demonstrate that computer modeling of the home-cage ethome can clearly define mouse physiological state. Furthermore, we showed that continuous behavioral data can be analyzed using approaches similar to natural language processing. These data provide proof of concept for future research in diagnostics of complex pathophysiological changes that are accompanied by changes in behavioral profile.
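The description reports that combining the three classical algorithms in an ensemble procedure raised accuracy, but it does not say how the predictions were combined. One common option is hard majority voting, sketched below under that assumption; the function name `majority_vote` and the toy day/night predictions are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the top-voted label
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three classifiers on four video clips.
lg_pred = ["day", "night", "night", "day"]
svm_pred = ["day", "day", "night", "night"]
rf_pred = ["night", "day", "night", "day"]
print(majority_vote([lg_pred, svm_pred, rf_pred]))  # ['day', 'day', 'night', 'day']
```

With three voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.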

  6. Data_Sheet_1_Machine Learning Prediction Models for Mechanically Ventilated...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 10, 2023
    Yibing Zhu; Jin Zhang; Guowei Wang; Renqi Yao; Chao Ren; Ge Chen; Xin Jin; Junyang Guo; Shi Liu; Hua Zheng; Yan Chen; Qianqian Guo; Lin Li; Bin Du; Xiuming Xi; Wei Li; Huibin Huang; Yang Li; Qian Yu (2023). Data_Sheet_1_Machine Learning Prediction Models for Mechanically Ventilated Patients: Analyses of the MIMIC-III Database.pdf [Dataset]. http://doi.org/10.3389/fmed.2021.662340.s001
    Dataset provided by: Frontiers
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Background: Mechanically ventilated patients in the intensive care unit (ICU) have high mortality rates. There are multiple prediction scores, such as the Simplified Acute Physiology Score II (SAPS II), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA), widely used in the general ICU population. We aimed to establish prediction scores on mechanically ventilated patients with the combination of these disease severity scores and other features available on the first day of admission. Methods: A retrospective administrative database study from the Medical Information Mart for Intensive Care (MIMIC-III) database was conducted. The exposures of interest consisted of the demographics, pre-ICU comorbidity, ICU diagnosis, disease severity scores, vital signs, and laboratory test results on the first day of ICU admission. Hospital mortality was used as the outcome. We used the machine learning methods of k-nearest neighbors (KNN), logistic regression, bagging, decision tree, random forest, Extreme Gradient Boosting (XGBoost), and neural network for model establishment. A sample of 70% of the cohort was used for the training set; the remaining 30% was used for testing. Areas under the receiver operating characteristic curves (AUCs) and calibration plots were constructed for the evaluation and comparison of the models' performance. The significance of the risk factors was identified through models and the top factors were reported. Results: A total of 28,530 subjects were enrolled through the screening of the MIMIC-III database. After data preprocessing, 25,659 adult patients with 66 predictors were included in the model analyses. With the training set, the models of KNN, logistic regression, decision tree, random forest, neural network, bagging, and XGBoost were established and the testing set obtained AUCs of 0.806, 0.818, 0.743, 0.819, 0.780, 0.803, and 0.821, respectively. The calibration curves of all the models, except for the neural network, performed well. The XGBoost model performed best among the seven models. The top five predictors were age, respiratory dysfunction, SAPS II score, maximum hemoglobin, and minimum lactate. Conclusion: The current study indicates that models using risk factors available on the first day could be successfully established for predicting mortality in ventilated patients. The XGBoost model performs best among the seven machine learning models.
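The seven test-set AUCs are listed in the same order as the model names, which makes the ranking easy to misread in running text. The short sketch below pairs each model with its reported AUC and sorts them; the numbers are copied from the description above, while the variable names are illustrative.

```python
# Test-set AUCs as reported in the description, keyed by model name.
test_auc = {
    "KNN": 0.806,
    "logistic regression": 0.818,
    "decision tree": 0.743,
    "random forest": 0.819,
    "neural network": 0.780,
    "bagging": 0.803,
    "XGBoost": 0.821,
}

# Sort from highest to lowest AUC.
ranked = sorted(test_auc.items(), key=lambda item: item[1], reverse=True)

for name, auc in ranked:
    print(f"{name:20s} {auc:.3f}")
```

The top three models (XGBoost, random forest, logistic regression) differ by only 0.003 in AUC, so the ordering among them should be read with that narrow margin in mind.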

  7. BCI competition IV dataset 1 classification accuracy (%) of dataset 1 with...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Sep 8, 2023
    Rabia Avais Khan; Nasir Rashid; Muhammad Shahzaib; Umar Farooq Malik; Arshia Arif; Javaid Iqbal; Mubasher Saleem; Umar Shahbaz Khan; Mohsin Tiwana (2023). BCI competition IV dataset 1 classification accuracy (%) of dataset 1 with proposed method compared with other methodologies. [Dataset]. http://doi.org/10.1371/journal.pone.0276133.t004
    Dataset provided by: PLOS ONE
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    BCI competition IV dataset 1 classification accuracy (%) of dataset 1 with proposed method compared with other methodologies.

  8. This is the code file which can be utilized to generate results shown in the...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    application/x-rar
    Updated Sep 8, 2023
    Rabia Avais Khan; Nasir Rashid; Muhammad Shahzaib; Umar Farooq Malik; Arshia Arif; Javaid Iqbal; Mubasher Saleem; Umar Shahbaz Khan; Mohsin Tiwana (2023). This is the code file which can be utilized to generate results shown in the research paper. [Dataset]. http://doi.org/10.1371/journal.pone.0276133.s001
    Dataset provided by: PLOS, http://plos.org/
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    This is the code file which can be utilized to generate results shown in the research paper.

  9. Statistical report containing details of all data pre-processing steps to...

    • plos.figshare.com
    html
    Updated May 15, 2024
    Richard Barrett-Jolley; Alexander J. German (2024). Statistical report containing details of all data pre-processing steps to create the dataset for all owners. [Dataset]. http://doi.org/10.1371/journal.pone.0280173.s016
    Dataset provided by: PLOS ONE
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Statistical report containing details of all data pre-processing steps to create the dataset for all owners.
