100+ datasets found
  1. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
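
    The EDA pipeline described above (dimensionality reduction followed by clustering in the reduced space) can be sketched with off-the-shelf tools. The snippet below is an illustrative Python sketch, not the authors' pipeline: the spectra matrix is synthetic, and UMAP (from the umap-learn package) could be swapped in for PCA, as the paper compares both.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a spectra matrix: rows = samples, columns = m/z intensities.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 200)) for m in (0.0, 2.0, 4.0)])

# Scale, then project to a low-dimensional space with PCA.
# (umap-learn's UMAP(n_components=2) would be a drop-in replacement here.)
X_scaled = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_scaled)

# Cluster in the reduced space and check cluster quality.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print("silhouette score:", round(silhouette_score(scores, labels), 3))
```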

  2. Multivariate Exploratory Data Analysis Toolbox Template

    • metadatacatalogue.lifewatch.eu
    Updated May 30, 2024
    Cite
    (2024). Multivariate Exploratory Data Analysis Toolbox Template [Dataset]. https://metadatacatalogue.lifewatch.eu/srv/search?keyword=Octave
    Explore at:
    Dataset updated
    May 30, 2024
    Description

    This workflow integrates the MEDA Toolbox for Matlab and Octave, focusing on data simulation, Principal Component Analysis (PCA), and result visualization. Key steps include simulating multivariate data, applying PCA for data modeling, and creating interactive visualizations. The MEDA Toolbox combines traditional and advanced methods, such as ANOVA Simultaneous Component Analysis (ASCA). The aim is to integrate the MEDA Toolbox into LifeWatch, providing tools for enhanced data analysis and visualization in research.

    Background: This workflow is a template for the integration of the Multivariate Exploratory Data Analysis Toolbox (MEDA Toolbox, https://github.com/codaslab/MEDA-Toolbox) in LifeWatch. The MEDA Toolbox for Matlab and Octave is a set of multivariate analysis tools for the exploration of data sets. There are several alternative tools in the market for that purpose, both commercial and free; the PLS_Toolbox from Eigenvector Inc. is a very nice example. The MEDA Toolbox is not intended to replace or compete with any of these toolkits. Rather, the MEDA Toolbox is a complementary tool that includes several contributions of the Computational Data Science Laboratory (CoDaS Lab) to the field of data analysis. Thus, traditional exploratory plots based on Principal Component Analysis (PCA) or Partial Least Squares (PLS), such as score, loading, and residual plots, are combined with new methods: MEDA, oMEDA, SVI plots, ADICOV, EKF & CKF cross-validation, CSP, GPCA, etc. A main tool in the MEDA Toolbox which has received a lot of attention lately is ANOVA Simultaneous Component Analysis (ASCA). The ASCA code in the MEDA Toolbox is one of the most advanced internationally.

    Introduction: The workflow integrates three examples of functionality within the MEDA Toolbox. First, there is a data simulation step, in which a matrix of random data is simulated with a user-defined correlation level. The output is sent to a modeling step, in which Principal Component Analysis (PCA) is computed. The PCA model is then sent to a visualization module.

    Aims: The main goal of this template is the integration of the MEDA Toolbox in LifeWatch, including data simulation, data modeling, and data visualization routines.

    Scientific Questions: This workflow only exemplifies the integration of the MEDA Toolbox. No specific questions are addressed.
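
    The three stages of the workflow (simulate correlated data, model it with PCA, visualize the scores) map onto MEDA Toolbox functions for Matlab/Octave. As a rough, language-shifted sketch of the same pipeline in plain Python (this is not the MEDA Toolbox API):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1) Simulate multivariate data with a user-defined correlation level.
rng = np.random.default_rng(42)
n_obs, n_vars, corr = 100, 10, 0.8
cov = np.full((n_vars, n_vars), corr) + (1 - corr) * np.eye(n_vars)
X = rng.multivariate_normal(mean=np.zeros(n_vars), cov=cov, size=n_obs)

# 2) Model the data with PCA.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# 3) Visualize the scores (a static stand-in for the interactive MEDA plots).
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("PCA score plot")
plt.show()
```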

  3. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mainly require scripting skills and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered a separate exercise from exploratory data analysis, but it should be treated as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is implemented in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
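
    ImputEHR itself is a graphical tool and its internals are not shown here. As a hedged illustration of one class of approach it mentions (gradient-boosted tree-based imputation), the scikit-learn sketch below imputes missing values with an iterative imputer driven by gradient-boosted trees; the data are synthetic and the code is not ImputEHR's.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy EHR-like matrix with roughly 20% of values missing at random.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# Iterative (round-robin) imputation, using gradient-boosted trees per feature.
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                           max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Simple quality check: error on the entries that were masked out.
print("RMSE on imputed entries:", np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2)))
```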

  4. Data from: Supplementary Material for "Sonification for Exploratory Data...

    • search.datacite.org
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. http://doi.org/10.4119/unibi/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Bielefeld University
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    #### Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented to give some examples for the framework of Model-Based Sonification, developed in Chapter 7. Sonification models determine the rendering of the sonification and possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    ##### 8.1 Data Sonograms
    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.
    * Table 8.2, page 87: Sound examples for Data Sonograms. Files: Iris dataset, started in plot (a) at S0, (b) at S1, (c) at S2; 10d noisy circle dataset, started in plot (c) at S0 (mean) and (d) at S1 (edge); 10d Gaussian, plot (d) started at S0; 3 clusters, Example 1; 3 clusters with invisible columns used as output variables, Example 2. Description: Data Sonogram sound examples for synthetic datasets and the Iris dataset. Duration: about 5 s.

    ##### 8.2 Particle Trajectory Sonification Model
    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.
    * Sound example, page 93, PTSM-Ex-1: Audification of 1 particle in the potential of phi(x).
    * Sound example, page 93, PTSM-Ex-2: Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    * Sound example, page 94, PTSM-Ex-3: Audification of 25 simultaneous particles in the potential of a dataset with 2 clusters.
    * Sound example, page 94, PTSM-Ex-4: Audification of 25 simultaneous particles in the potential of a dataset with 1 cluster.
    * Sound example, page 95, PTSM-Ex-5: sigma-step sequence for a mixture of three Gaussian clusters.
    * Sound example, page 95, PTSM-Ex-6: sigma-step sequence for a Gaussian cluster.
    * Sound example, page 96, PTSM-Iris-1: Sonification for the Iris dataset with 20 particles per step.
    * Sound example, page 96, PTSM-Iris-2: Sonification for the Iris dataset with 3 particles per step.
    * Sound example, page 96, PTSM-Tetra-1: Sonification for a 4d tetrahedron clusters dataset.

    ##### 8.3 Markov chain Monte Carlo Sonification
    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly concerning the modes of p, by sound.
    * Sound example, page 105, MCMC-Ex-1: McMC Sonification, stabilization of amplitudes.
    * Sound example, page 106, MCMC-Ex-2: Trajectory audification for 100 McMC steps in a 3-cluster dataset.
    * McMC Sonification for Cluster Analysis, dataset with three clusters, page 107: Stream 1 (MCMC-Ex-3.1), Stream 2 (MCMC-Ex-3.2), Stream 3 (MCMC-Ex-3.3), Mix (MCMC-Ex-3.4).
    * McMC Sonification for Cluster Analysis, dataset with three clusters, T = 0.002 s, page 107: Stream 1 (MCMC-Ex-4.1), Stream 2 (MCMC-Ex-4.2), Stream 3 (MCMC-Ex-4.3), Mix (MCMC-Ex-4.4).
    * McMC Sonification for Cluster Analysis, density with 6 modes, T = 0.008 s, page 107: Stream 1 (MCMC-Ex-5.1), Stream 2 (MCMC-Ex-5.2), Stream 3 (MCMC-Ex-5.3), Mix (MCMC-Ex-5.4).
    * McMC Sonification for the Iris dataset, page 108: MCMC-Ex-6.1 through MCMC-Ex-6.8.

    ##### 8.4 Principal Curve Sonification
    Principal Curve Sonification represents data by synthesizing the soundscape while a virtual listener moves along the principal curve of the dataset through the model space.
    * Noisy spiral dataset, PCS-Ex-1.1, page 113.
    * Noisy spiral dataset with variance modulation, PCS-Ex-1.2, page 114.
    * 9d tetrahedron cluster dataset (10 clusters), PCS-Ex-2, page 114.
    * Iris dataset, class label used as pitch of auditory grains, PCS-Ex-3, page 114.

    ##### 8.5 Data Crystallization Sonification Model
    * Table 8.6, page 122: Sound examples for Crystallization Sonification for a 5d Gaussian distribution. Files: DCS started at center, in tail, from far outside. Description: DCS for a dataset sampled from N(0, I_5) excited at different locations. Duration: 1.4 s.
    * Mixture of 2 Gaussians, page 122: DCS started at point A (DCS-Ex1A) and at point B (DCS-Ex1B).
    * Table 8.7, page 124: Sound examples for DCS on variation of the harmonics factor. Files: h_omega = 1, 2, 3, 4, 5, 6. Description: DCS for a mixture of two Gaussians with varying harmonics factor. Duration: 1.4 s.
    * Table 8.8, page 124: Sound examples for DCS on variation of the energy decay time. Files: tau_(1/2) = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2. Description: DCS for a mixture of two Gaussians varying the energy decay time tau_(1/2). Duration: 1.4 s.
    * Table 8.9, page 125: Sound examples for DCS on variation of the sonification time. Files: T = 0.2, 0.5, 1, 2, 4, 8. Description: DCS for a mixture of two Gaussians varying the duration T. Duration: 0.2 s to 8 s.
    * Table 8.10, page 125: Sound examples for DCS on variation of model space dimension. Files: selected columns of the dataset, (x0), (x0,x1), (x0,...,x2), (x0,...,x3), (x0,...,x4), (x0,...,x5). Description: DCS for a mixture of two Gaussians varying the dimension. Duration: 1.4 s.
    * Table 8.11, page 126: Sound examples for DCS for different excitation locations. Files: starting point C0, C1, C2. Description: DCS for a mixture of three Gaussians in 10d space with different rank(S) = {2, 4, 8}. Duration: 1.9 s.
    * Table 8.12, page 126: Sound examples for DCS for the mixture of a 2d distribution and a 5d cluster. Files: condensation nucleus in the (x0,x1)-plane at (-6,0)=C1, (-3,0)=C2, (0,0)=C0. Description: DCS for a mixture of a uniform 2d and a 5d Gaussian. Duration: 2.16 s.
    * Table 8.13, page 127: Sound examples for DCS for the cancer dataset. Files: condensation nucleus in the (x0,x1)-plane at benign 1, benign 2, malignant 1, malignant 2. Description: DCS for a mixture of a uniform 2d and a 5d Gaussian. Duration: 2.16 s.

    ##### 8.6 Growing Neural Gas Sonification
    * Table 8.14, page 133: Sound examples for GNGS probing. Files: Cluster C0 (2d): a, b, c; Cluster C1 (4d): a, b, c; Cluster C2 (8d): a, b, c. Description: GNGS for a mixture of 3 Gaussians in 10d space. Duration: 1 s.
    * Table 8.15, page 134: Sound examples for GNGS for the noisy spiral dataset. Files: (a) GNG with 3 neurons: 1, 2; (b) GNG with 20 neurons: end, middle, inner end; (c) GNG with 45 neurons: outer end, middle, close to inner end, at inner end; (d) GNG with 150 neurons: outer end, in the middle, inner end; (e) GNG with 20 neurons: outer end, in the middle, inner end; (f) GNG with 45 neurons: outer end, in the middle, inner end. Description: GNG probing sonification for the 2d noisy spiral dataset. Duration: 1 s.
    * Table 8.16, page 136: Sound examples for GNG Process Monitoring Sonification for different data distributions. Files: noisy spiral with 1 rotation, noisy spiral with 2 rotations, Gaussian in 5d, mixture of 5d and 2d distributions. Description: GNG process sonification examples. Duration: 5 s.

    #### Chapter 9: Extensions
    In this chapter, two extensions for Parameter Mapping

  5. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  6. The five-step co-duction cycle.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). The five-step co-duction cycle. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  7. Lending Club Loan Data Analysis - Deep Learning

    • kaggle.com
    Updated Aug 9, 2023
    Cite
    Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deependra Verma
    Description

    DESCRIPTION

    Create a model that predicts whether or not a loan will default, using the historical data.

    Problem Statement:

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.

    Domain: Finance

    Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

    Content:

    Dataset columns and definition:

    credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

    purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

    int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

    installment: The monthly installments owed by the borrower if the loan is funded.

    log.annual.inc: The natural log of the self-reported annual income of the borrower.

    dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

    fico: The FICO credit score of the borrower.

    days.with.cr.line: The number of days the borrower has had a credit line.

    revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

    revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

    inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

    delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

    pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

    Steps to perform:

    Perform exploratory data analysis and feature engineering, then build a deep learning model to predict whether or not a loan will default using the historical data.

    Tasks:

    1. Feature Transformation

    Transform categorical values into numerical (discrete) values.

    2. Exploratory Data Analysis

    Perform exploratory data analysis of the different factors of the dataset.

    3. Additional Feature Engineering

    You will check the correlation between features and drop those features which have a strong correlation.

    This will help reduce the number of features and leave you with the most relevant features.

    4. Modeling

    After applying EDA and feature engineering, you are now ready to build the predictive models.

    In this part, you will create a deep learning model using Keras with a TensorFlow backend.
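
    A minimal sketch of this modeling step, using Keras with the TensorFlow backend. The feature columns follow the definitions listed above; the file path, target column name, and network architecture are assumptions for illustration, not part of the dataset documentation.

```python
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan_data.csv")                                # placeholder path
df = pd.get_dummies(df, columns=["purpose"], drop_first=True)    # categorical -> numeric

target = "not.fully.paid"                                        # assumed target column name
X = df.drop(columns=[target]).astype("float32").values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Small feed-forward network for binary default prediction.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# class_weight compensates for the strong class imbalance noted in the problem statement.
n_pos = int(y_train.sum()); n_neg = len(y_train) - n_pos
model.fit(X_train, y_train, epochs=20, batch_size=256, validation_split=0.1,
          class_weight={0: 1.0, 1: n_neg / max(n_pos, 1)})
print(model.evaluate(X_test, y_test, verbose=0))
```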

  8. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
    ```bash
    git clone
    cd
    ```
    2. Create and activate a virtual environment (optional but recommended):
    ```bash
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```
    3. Install the required packages. All dependencies are listed in the `requirements.txt` file; install them using pip:
    ```bash
    pip install -r requirements.txt
    ```

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
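
    For readers unfamiliar with the multi-objective setup, the sketch below shows what a Pareto-optimal filter over the three objectives (maximize throughput, minimize energy, minimize cost) looks like in principle. It is a generic illustration with made-up configurations and column names, not code from the Optimization_Model.ipynb notebook.

```python
import pandas as pd

# Hypothetical candidate hardware configurations (values and column names are illustrative only).
configs = pd.DataFrame({
    "config":     ["A", "B", "C", "D"],
    "throughput": [1000, 1500, 1400, 900],          # higher is better
    "energy":     [2.0, 3.5, 3.8, 1.8],             # lower is better
    "cost":       [10_000, 15_000, 16_000, 8_000],  # lower is better
})

def is_dominated(row, others):
    """A config is dominated if another is at least as good on every objective
    and strictly better on at least one."""
    better_or_equal = (
        (others["throughput"] >= row["throughput"])
        & (others["energy"] <= row["energy"])
        & (others["cost"] <= row["cost"])
    )
    strictly_better = (
        (others["throughput"] > row["throughput"])
        | (others["energy"] < row["energy"])
        | (others["cost"] < row["cost"])
    )
    return bool((better_or_equal & strictly_better).any())

# Keep only the non-dominated (Pareto-optimal) configurations.
pareto = configs[[not is_dominated(r, configs.drop(i)) for i, r in configs.iterrows()]]
print(pareto)
```
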
  9. Exploratory Data Analysis (EDA) for COVIND-19

    • kaggle.com
    Updated Apr 9, 2024
    Cite
    Badea-Matei Iuliana (2024). Exploratory Data Analysis (EDA) for COVIND-19 [Dataset]. https://www.kaggle.com/datasets/mateiiuliana/exploratory-data-analysis-eda-for-covind-19
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Badea-Matei Iuliana
    Description

    Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.

    Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. The key objectives are:
    1. Data Collection and Cleaning: gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies); clean and preprocess the data to ensure accuracy and consistency.
    2. Descriptive Statistics: summarize key statistics (total cases, recoveries, deaths, and testing rates); visualize temporal trends using line charts, bar plots, and heat maps.
    3. Geospatial Analysis: map COVID-19 cases across countries, regions, or cities; identify hotspots and variations in infection rates.
    4. Demographic Insights: explore how age, gender, and pre-existing conditions impact vulnerability; investigate disparities in infection rates among different populations.
    5. Healthcare System Impact: analyze hospitalization rates, ICU occupancy, and healthcare resource allocation; assess the strain on medical facilities.
    6. Economic and Social Effects: investigate the relationship between lockdown measures, economic indicators, and infection rates; explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
    7. Predictive Modeling (optional): if data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
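
    As a hedged illustration of objectives 2 and 3 above (descriptive statistics and simple trend visualization), the pandas sketch below assumes a tidy CSV with date, country, and case-count columns; the file path and column names are placeholders, not part of this dataset's documentation.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder path and column names; adapt to the actual files in this dataset.
df = pd.read_csv("covid_cases.csv", parse_dates=["date"])

# Descriptive statistics: overall summary and per-country totals.
print(df[["new_cases", "new_deaths"]].describe())
totals = (df.groupby("country")[["new_cases", "new_deaths"]]
            .sum()
            .sort_values("new_cases", ascending=False))
print(totals.head(10))

# Temporal trend: global daily new cases, smoothed with a 7-day rolling mean.
daily = df.groupby("date")["new_cases"].sum()
daily.rolling(7).mean().plot(title="Global daily new cases (7-day mean)")
plt.ylabel("cases")
plt.show()
```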

    Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.

    Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.

    License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.

    Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.

  10. Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds...

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Jun 5, 2024
    + more versions
    Cite
    National Park Service (2024). Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds National Monument 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-florissant-fossil-beds-national-mon-2009
    Explore at:
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    National Park Service (http://www.nps.gov/)
    Area covered
    Florissant
    Description

    Wetlands Ecological Integrity depth-to-water logger data from 2009-2019 at Florissant Fossil Beds National Monument. This includes the raw dataset (primarily hourly) as well as daily, weekly, and monthly summaries. The data package also includes exploratory data analysis figures at the daily, weekly, and monthly time steps, along with the R code used to extract the depth-to-water logger data from the National Park Service Aquarius data system and to create the exploratory data analysis figures.
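
    The daily, weekly, and monthly summaries described above are straightforward aggregations of the hourly logger record; the package does this in R against the Aquarius export. The sketch below shows the equivalent idea in Python with placeholder file and column names, assuming an hourly CSV with a timestamp and a depth-to-water column.

```python
import pandas as pd

# Placeholder file and column names; the actual exports come from the NPS Aquarius system.
hourly = (pd.read_csv("depth_to_water_hourly.csv", parse_dates=["timestamp"])
            .set_index("timestamp"))

# Daily, weekly, and monthly (month-start) summaries of the depth-to-water measurements.
summaries = {
    "daily":   hourly["depth_to_water_m"].resample("D").agg(["mean", "min", "max"]),
    "weekly":  hourly["depth_to_water_m"].resample("W").agg(["mean", "min", "max"]),
    "monthly": hourly["depth_to_water_m"].resample("MS").agg(["mean", "min", "max"]),
}
for name, table in summaries.items():
    print(name, table.head(), sep="\n")
```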

  11. Analytic_Provenance

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Jan 17, 2018
    Cite
    Texas A&M University (2018). Analytic_Provenance [Dataset]. https://opendatalab.com/OpenDataLab/Analytic_Provenance
    Explore at:
    Available download formats: zip (321803532 bytes)
    Dataset updated
    Jan 17, 2018
    Dataset provided by
    Texas A&M University
    Description

    Analytic_Provenance is a data repository that can be used to study human analysis activity, thought processes, and software interaction with visual analysis tools during exploratory data analysis. It was collected during a series of user studies involving exploratory data analysis scenarios with textual and cyber security data. Interaction logs, think-alouds, videos, and all coded data from these studies are available online for research purposes. Analysis sessions are segmented into multiple sub-task steps based on user think-alouds, video, and audio captured during the studies. These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history.

  12. Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes...

    • catalog.data.gov
    Updated Jun 5, 2024
    + more versions
    Cite
    National Park Service (2024). Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes National Park 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-great-sand-dunes-national-park-2009-2019
    Explore at:
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    National Park Service (http://www.nps.gov/)
    Description

    Wetlands Ecological Integrity depth-to-water logger data from 2009-2019 at Great Sand Dunes National Park. This includes the raw dataset (primarily hourly) as well as daily, weekly, and monthly summaries. The data package also includes exploratory data analysis figures at the daily, weekly, and monthly time steps, along with the R code used to extract the depth-to-water logger data from the National Park Service Aquarius data system and to create the exploratory data analysis figures.

  13. HadISD: Global sub-daily, surface meteorological station data, 1931-2018,...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 24, 2021
    + more versions
    Cite
    (2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2018, v3.0.0.2018f [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
    Explore at:
    Dataset updated
    Jul 24, 2021
    Description

    This is version 3.0.0.2018f of the Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data that extend HadISD v2.0.2.2017f to include 2018, and so span 1931-2018. The quality-controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide the station listing with IDs, names, and location information.

    The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20181231_v3-0-0-2018f.nc. The station codes can be found under the docs tab or in the archive beside the station_data folder. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

    To keep informed about updates, news, and announcements, follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates, and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

    References: When using the dataset in a paper you must cite the following papers (see Docs for the link to the publications) and this dataset (using the "citable as" reference):
    Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.
    Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012.
    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1.
    For a homogeneity assessment of HadISD, please see the following reference:
    Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.
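
    Since each HadISD station is distributed as a single NetCDF file, a typical first exploratory step is to open one file and inspect its variables and QC flags. The sketch below uses xarray with a placeholder file name; the exact variable names should be taken from the file's own metadata rather than from this sketch.

```python
import matplotlib.pyplot as plt
import xarray as xr

# Placeholder file name following the naming pattern described above.
path = "station_code_HadISD_HadOBS_19310101-20181231_v3-0-0-2018f.nc"
ds = xr.open_dataset(path)

# Inspect what the station file actually contains before analysing it.
print(ds.data_vars)   # quality-controlled variables plus qc_flags, flagged_values, ...
print(ds.attrs)       # station metadata (ID, name, location)

# Quick sanity check: plot the first 1-D time-series variable found.
for name, var in ds.data_vars.items():
    if var.ndim == 1 and "time" in var.dims:
        var.plot()
        plt.title(name)
        break
plt.show()
```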

  14. HadISD: Global sub-daily, surface meteorological station data, 1931-2020,...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 24, 2021
    + more versions
    Cite
    (2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2020, v3.1.1.2020f [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
    Explore at:
    Dataset updated
    Jul 24, 2021
    Description

    This is version 3.1.1.2020f of the Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data that extend HadISD v3.1.0.2019f to include 2020, and so span 1931-2020. The quality-controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide the station listing with IDs, names, and location information.

    The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20210101_v3-1-1-2020f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

    To keep informed about updates, news, and announcements, follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates, and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

    References: When using the dataset in a paper you must cite the following papers (see Docs for the link to the publications) and this dataset (using the "citable as" reference):
    Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note.
    Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.
    Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012.
    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1.
    For a homogeneity assessment of HadISD, please see the following reference:
    Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.

  15. HadISD: Global sub-daily, surface meteorological station data, 1931-2022,...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 24, 2021
    + more versions
    Cite
    (2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2022, v3.3.0.2022f [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
    Explore at:
    Dataset updated
    Jul 24, 2021
    Description

    This is version v3.3.0.2022f of the Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data. The quality-controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide the station listing with IDs, names, and location information.

    The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20230101_v3.3.1.2022f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

    To keep informed about updates, news, and announcements, follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates, and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

    References: When using the dataset in a paper you must cite the following papers (see Docs for the link to the publications) and this dataset (using the "citable as" reference):
    Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note.
    Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.
    Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012.
    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1.
    For a homogeneity assessment of HadISD, please see the following reference:
    Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.

  16. Whole dataset experiment, kingdom level: Clustering results for kernels VH,...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni (2023). Whole dataset experiment, kingdom level: Clustering results for kernels VH, SP, WL and PM starting from the MDS data. [Dataset]. http://doi.org/10.1371/journal.pone.0281047.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole dataset experiment, kingdom level: Clustering results for kernels VH, SP, WL and PM starting from the MDS data.

  17. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include: 1. Data Simulation: Creating a realistic dataset with engine performance metrics. 2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior. 3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs. 4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
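
    As a hedged sketch of steps 3 and 4 above (model comparison and GridSearchCV tuning), the snippet below uses scikit-learn with placeholder file and column names; the actual schema of the simulated engine dataset may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder file and column names for the simulated engine data.
df = pd.read_csv("marine_engine_data.csv")
X = pd.get_dummies(df.drop(columns=["maintenance_status"]))
y = df["maintenance_status"]          # Normal / Requires Maintenance / Critical

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Hyperparameter tuning with GridSearchCV, as described in step 4.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5, scoring="accuracy", n_jobs=-1,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_test, y_test))
```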

    Tools Used 1. Python: Data processing, analysis and modeling 2. Pandas & NumPy: Data manipulation 3. Scikit-Learn & XGBoost: Machine learning model training 4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated ✔ Data Simulation & Preprocessing ✔ Exploratory Data Analysis (EDA) ✔ Feature Engineering & Encoding ✔ Supervised Machine Learning (Classification) ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings 📌 Engine Temperature & Vibration Level: Strong indicators of potential failures. 📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better. 📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training. 📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced 🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge. 🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction. 🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action 🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters. 📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques. 🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  18. Data from: Research and exploratory analysis driven - time-data...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Jan 30, 2022
    Cite
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko (2022). Research and exploratory analysis driven - time-data visualization (read-tv) software [Dataset]. http://doi.org/10.5061/dryad.d51c5b02g
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Medical University of South Carolina
    Authors
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    read-tv

    The main paper is about read-tv, open-source software for longitudinal data visualization. We uploaded sample use-case surgical flow disruption data to highlight read-tv's capabilities. We scrubbed the data of protected health information and uploaded it as a single CSV file. A description of the original data is given below.

    Data source

    Surgical workflow disruptions, defined as “deviations from the natural progression of an operation thereby potentially compromising the efficiency or safety of care”, provide a window on the systems of work through which it is possible to analyze mismatches between the work demands and the ability of the people to deliver the work. They have been shown to be sensitive to different intraoperative technologies, surgical errors, surgical experience, room layout, checklist implementation and the effectiveness of the supporting team. The significance of flow disruptions lies in their ability to provide a hitherto unavailable perspective on the quality and efficiency of the system. This allows for a systematic, quantitative and replicable assessment of risks in surgical systems, evaluation of interventions to address them, and assessment of the role that technology plays in exacerbation or mitigation.

    In 2014, Drs. Catchpole and Anger were awarded NIBIB R03 EB017447 to investigate flow disruptions in robotic surgery, which has resulted in the detailed, multi-level analysis of over 4,000 flow disruptions. Direct observation of 89 RAS (robotic-assisted surgery) cases found a mean of 9.62 flow disruptions per hour, which varies across different surgical phases, predominantly caused by coordination, communication, equipment, and training problems.

    Methods: This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). It describes the methods involved in the surgical workflow disruption data collection. A curated, PHI-free (protected health information) version of this dataset was used as a use case for this manuscript.

    Observer training

    Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of FDs. The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.

    Comprehensive observer training was ensured with both classroom and floor training. Observers were required to review relevant literature, understand general practice guidelines for observing in the OR (e.g., where to stand, what to avoid, who to speak to), and conduct practice observations. The practice observations were broken down into three phases, all performed under the direct supervision of an experienced observer. During phase one, the trainees oriented themselves to the real-time events of both the OR and the general steps in RAS. The trainee was also introduced to the OR staff and any other involved key personnel. During phase two, the trainer and trainee observed three RAS procedures together to practice collecting FDs and become familiar with the data collection tool. Phase three was dedicated to determining inter-rater reliability by having the trainer and trainee simultaneously, yet independently, conduct observations for at least three full RAS procedures. Observers were considered fully trained if, after three full case observations, intra-class correlation coefficients (based on number of observed disruptions per phase) were greater than 0.80, indicating good reliability.

    Data collection

    Following the completion of training, observers individually conducted observations in the OR. All relevant RAS cases were pre-identified on a monthly basis by scanning the surgical schedule and recording a list of procedures. All procedures observed were conducted with the Da Vinci Xi surgical robot, with the exception of one procedure at Site 2, which was performed with the Si robot. Observers attended those cases that fit within their allotted work hours and schedule. Observers used Microsoft Surface Pro tablets configured with a customized data collection tool developed using Microsoft Excel to collect data. The data collection tool divided procedures into five phases, as opposed to the four phases previously used in similar research, to more clearly distinguish between task demands throughout the procedure. Phases consisted of phase 1 - patient in the room to insufflation, phase 2 -insufflation to surgeon on console (including docking), phase 3 - surgeon on console to surgeon off console, phase 4 - surgeon off console to patient closure, and phase 5 - patient closure to patient leaves the operating room. During each procedure, FDs were recorded into the appropriate phase, and a narrative, time-stamp, and classification (based off of a robot-specific FD taxonomy) were also recorded.

    Each FD was categorized into one of ten categories: communication, coordination, environment, equipment, external factors, other, patient factors, surgical task considerations, training, or unsure. The categorization system is modeled after previous studies, as well as the examples provided for each FD category.

    Once in the OR, observers remained as unobtrusive as possible. They stood at an appropriate vantage point in the room without getting in the way of team members. Once an appropriate time presented itself, observers introduced themselves to the circulating nurse and informed them of the reason for their presence. Observers did not directly engage in conversations with operating room staff, however, if a staff member approached them with any questions/comments they would respond.

    Data Reduction and PHI (Protected Health Information) Removal

    This dataset uses 41 of the aforementioned surgeries. All columns have been removed except disruption type, a numeric timestamp for the number of minutes into the day, and surgical phase. In addition, each surgical case had its initial disruption set to 12 noon (720 minutes).
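
    Given the three retained columns described above (disruption type, a minutes-into-day timestamp, and surgical phase), a first look at the curated CSV might resemble the following. The file name and column headers are assumptions for illustration, since the exact headers are not listed here.

```python
import pandas as pd

# Assumed file and column names for the curated, PHI-free flow-disruption CSV.
df = pd.read_csv("flow_disruptions.csv")

# Disruption counts by surgical phase and by disruption category.
print(df.groupby("phase")["disruption_type"].count())
print(df["disruption_type"].value_counts())

# Each case's first disruption is anchored at 720 minutes (12 noon),
# so minutes since case start is just an offset.
df["minutes_into_case"] = df["timestamp_minutes"] - 720
print(df["minutes_into_case"].describe())
```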

  19. Impact of AI in Education Processes

    • dataverse.tdl.org
    Updated Feb 20, 2024
    Cite
    Saksham Adhikari; Saksham Adhikari (2024). Impact of AI in Education Processes [Dataset]. http://doi.org/10.18738/T8/RXUCHK
    Explore at:
    Available download formats: tsv (7079), application/x-ipynb+json (428065), pptx (80640)
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Texas Data Repository
    Authors
    Saksham Adhikari; Saksham Adhikari
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We analyzed an open dataset containing survey responses about how useful students find AI in the educational process. We cleaned and preprocessed the data, performed an EDA (exploratory data analysis), visualized the results and our findings, and interpreted the findings in our digital poster.

  20. Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National...

    • gimi9.com
    Updated Jun 14, 2007
    Cite
    (2007). Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National Park 2007-2019 | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_wetlands-ecological-integrity-depth-to-water-data-rocky-mountain-national-park-2007-2019
    Explore at:
    Dataset updated
    Jun 14, 2007
    Area covered
    Rocky Mountains
    Description

    Wetlands Ecological Integrity depth-to-water logger data from 2007-2019 at Rocky Mountain National Park. This includes the raw dataset (primarily hourly) as well as daily, weekly, and monthly summaries. The data package also includes exploratory data analysis figures at the daily, weekly, and monthly time steps, along with the R code used to extract the depth-to-water logger data from the National Park Service Aquarius data system and to create the exploratory data analysis figures.
