58 datasets found
  1. Advanced exploratory data analysis (EDA)

    • kaggle.com
    Updated Nov 18, 2023
    Cite
    Mustafa Ghzi (2023). Advanced exploratory data analysis (EDA) [Dataset]. https://www.kaggle.com/datasets/mustafaghzi/advanced-exploratory-data-analysis-eda/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mustafa Ghzi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Mustafa Ghzi

    Released under CC BY-NC-SA 4.0

    Contents

  2. Eda_all Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2024
    Cite
    cropperyash (2024). Eda_all Dataset [Dataset]. https://universe.roboflow.com/cropperyash/eda_all/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2024
    Dataset authored and provided by
    cropperyash
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    All Polygons
    Description

    Eda_all

    ## Overview
    
    Eda_all is a dataset for instance segmentation tasks - it contains All annotations for 1,314 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. EDA Signal Dataset Collected During Startle Events While Walking With a...

    • zenodo.org
    zip
    Updated Jun 23, 2025
    Cite
    Rafael Villalba-Bravo (2025). EDA Signal Dataset Collected During Startle Events While Walking With a Smart Cane [Dataset]. http://doi.org/10.5281/zenodo.15715155
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael Villalba-Bravo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EDA Signal Dataset Collected During Startle Events While Walking With a Smart Cane

    This dataset accompanies the publication (currently under review):

    Villalba-Bravo, R., Grande-Bueno, S., Trujillo-León, A., & Vidal-Verdú, F.
    Analysis of EDA signal features under motion artifacts for non-personalized detection of startle events using a smart cane
    IEEE SENSORS 2025, Vancouver, Canada.

    Description

    This dataset includes Electrodermal Activity (EDA) signals collected from seven participants during an experiment in which they walked on a treadmill at a constant speed of 1 km/h while using a smart cane. During the walking task, participants were exposed to auditory startle stimuli designed to elicit stress responses. The smart cane was equipped with a Galvanic Skin Response (GSR) sensor integrated into its handle to continuously record physiological signals in a natural walking context.

    The data is organized by participant. All participants provided written informed consent both to take part in the experiment and to allow their anonymized data to be publicly shared for research purposes. Furthermore, the experiment was approved by the Ethical Committee of the Universidad de Málaga (reference 46-2024-H).

    Folder Structure

    Each folder corresponds to a participant session (e.g., S0/, S2/, etc.) and contains the following files:

    S0/
    ├── S0_DataExperiment.mat
    ├── S0_audioEventVector.mat
    └── S0_SA_Score.mat

    ...

    S8/
    ├── S8_DataExperiment.mat
    ├── S8_audioEventVector.mat
    └── S8_SA_Score.mat

    In addition, the dataset includes a CSV file named caneFeatures_pre_post.csv, containing the extracted features from the GSR, tonic and phasic signals, allowing for the replication of the statistical analyses presented in the study.

    File Descriptions

    1. S*_DataExperiment.mat

    • Description: This file contains the EDA signals acquired at a 4 Hz sampling rate during the experiment, stored in MATLAB .mat format as a structured variable.

    • Format: MATLAB Struct (3 fields)

      • GSR: Contains the raw GSR signal along with associated time information: TimeStampDate (UTC date-time format) and TimeStampPosix (POSIX timestamp).

      • TONIC: Contains the tonic component of the EDA signal with the same timestamp fields.

      • PHASIC: Contains the phasic component of the EDA signal with the corresponding timestamps.

    2. S*_audioEventVector.mat

    • Description: This file contains information about the timing of the auditory startle stimuli presented during the experiment. The data is stored as a MATLAB struct sampled at 32 Hz.

    • Format: MATLAB Struct (3 fields)

      • data: A binary step signal indicating the presence of auditory events (0 = no stimulus, 1 = stimulus being played).

      • TimeStampDate: A vector of timestamps in MATLAB datetime format, corresponding to each sample in the data field.

    3. S*_SA_Score.mat

    • Description: This file contains the self-reported State Anxiety (STAI-State) scores provided by each participant before and after the experimental session. The data is stored as a MATLAB struct.

    • Format: MATLAB Struct (2 fields)

      • Training: Numeric score reported after the training session.

      • Experiment: Numeric score reported after the experimental session.
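
The file descriptions above imply that the 32 Hz event vector and the 4 Hz EDA signals have to be brought onto a common time base before analysis. The following is a minimal sketch of one way to do that, assuming the `data` field has already been loaded into a plain Python list and that both streams start at the same instant (the dataset description does not state this; in practice the timestamp fields should be used to check it). The function names `event_onsets` and `eda_sample_index` are hypothetical and not part of the dataset.

```python
import math


def event_onsets(data, fs_events=32.0):
    """Return onset times (seconds from start) where the binary
    step signal rises from 0 to 1."""
    onsets = []
    previous = 0
    for i, value in enumerate(data):
        if value == 1 and previous == 0:
            onsets.append(i / fs_events)
        previous = value
    return onsets


def eda_sample_index(t_seconds, fs_eda=4.0):
    """Index of the EDA sample at or just before the given time,
    assuming the EDA stream starts at t = 0."""
    return math.floor(t_seconds * fs_eda)


if __name__ == "__main__":
    # Synthetic event vector: silence, a stimulus, silence, a stimulus.
    events = [0] * 32 + [1] * 16 + [0] * 32 + [1] * 8
    for onset in event_onsets(events):
        print(onset, eda_sample_index(onset))
```

Because the event vector is eight times denser than the EDA signal, several event samples map to the same EDA index; the sketch only maps onsets, which is usually what a pre/post feature comparison needs.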

    Contact Information

    For any questions or further information regarding this dataset, please contact fvidal@uma.es.

  4. Solar Panel Eda Dataset

    • universe.roboflow.com
    zip
    Updated Aug 29, 2024
    Cite
    Ramkumar (2024). Solar Panel Eda Dataset [Dataset]. https://universe.roboflow.com/ramkumar/solar-panel-eda
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 29, 2024
    Dataset authored and provided by
    Ramkumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Solar Panel Bounding Boxes
    Description

    Solar Panel EDA

    ## Overview
    
    Solar Panel EDA is a dataset for object detection tasks - it contains Solar Panel annotations for 721 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. opencores

    • huggingface.co
    Updated Dec 16, 2024
    Cite
    nwang227 (2024). opencores [Dataset]. https://huggingface.co/datasets/LLM-EDA/opencores
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 16, 2024
    Authors
    nwang227
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Opencores

    We gathered high-quality specification-code pairs from Opencores, a community aimed at developing digital open-source hardware using electronic design automation (EDA). We then filtered out data instances exceeding 4096 characters in length and those that could not be parsed into Abstract Syntax Trees (AST). The final dataset comprises approximately 800 data instances.

      Dataset Features
    

    instruction (string): The natural language instruction for… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/opencores.

  6. Eda Export Data of HS Code 29212100 India – Seair.co.in

    • seair.co.in
    Updated Apr 20, 2016
    + more versions
    Cite
    Seair Exim (2016). Eda Export Data of HS Code 29212100 India – Seair.co.in [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Apr 20, 2016
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    Venezuela (Bolivarian Republic of), India, Colombia, Algeria, Estonia, Antarctica, Georgia, Croatia, Niue, Morocco
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  7. The Global EDA Market size was USD 14.9 billion in 2023!

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Apr 30, 2025
    Cite
    Cognitive Market Research (2025). The Global EDA Market size was USD 14.9 billion in 2023! [Dataset]. https://www.cognitivemarketresearch.com/eda-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global EDA market size was USD 14.9 billion in 2023 and will grow at a compound annual growth rate (CAGR) of 10.50% from 2023 to 2030.
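
For readers who want to see what the quoted CAGR implies, the sketch below applies the standard compound-growth formula, value × (1 + rate)^years, to the two figures stated above. The resulting projection is only an arithmetic consequence of those figures, not a number taken from the report, and the function name `project_market_size` is hypothetical.

```python
def project_market_size(base_value, cagr, years):
    """Compound a base value forward at a constant annual growth rate."""
    return base_value * (1.0 + cagr) ** years


if __name__ == "__main__":
    # Figures quoted in the listing: USD 14.9 billion in 2023, 10.50% CAGR to 2030.
    projected_2030 = project_market_size(14.9, 0.105, 2030 - 2023)
    print(f"Implied 2030 market size: USD {projected_2030:.2f} billion")
```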

    The demand for the EDA Market is rising due to the rise in outdoor and adventure activities.
    Changing consumer lifestyle trends are higher in the EDA market.
    The cat segment held the highest EDA Market revenue share in 2023.
    North American EDA will continue to lead, whereas the European EDA Market will experience the most substantial growth until 2030.
    

    Supply Chain and Risk Analysis to Provide Viable Market Output

    The industry is facing supply chain and logistics disruptions. EDA tools have been instrumental in analyzing supply chain data, identifying vulnerabilities, predicting risks, and developing disruption mitigation strategies. Consumer behavior has undergone drastic changes due to blockages and restrictions. EDA helps companies analyze changing trends in buying behavior, online shopping preferences, and demand patterns, enabling organizations to adjust their marketing and sales strategies accordingly.

    Health and Pharmaceutical Research to Propel Market Growth.
    

    EDA tools have played a key role in analyzing large amounts of data related to vaccine development, drug trials, patient records and epidemiological studies. These tools have helped researchers process and interpret complex medical data, leading to advances in the development of treatments and vaccines. The pandemic has created challenges in data collection, especially in sectors affected by lockdowns or blackouts. Rapidly changing conditions and incomplete data sets make effective EDA difficult due to data quality issues. The economic uncertainty caused by the pandemic has led to budget cuts in some sectors, impacting investment in new technologies. Some organizations have limited budgets that limit their ability to adopt or update EDA tools.

    Market Dynamics of the EDA

    Privacy and Data Security Issues to Restrict Market Growth.
    

    With the focus on data privacy regulations such as GDPR, CCPA, etc., organizations need to ensure compliance when handling sensitive data. These compliance requirements may limit the scope of the EDA by limiting the availability and use of certain data sets for information analysis. EDA often requires data analysts or data scientists who are skilled in statistical analysis and data visualization tools. A lack of professionals with these specialized skills can hinder an organization's ability to use EDA tools effectively, limiting adoption. Advanced EDA techniques can involve complex algorithms and statistical techniques that are difficult for non-technical users to understand. Interpreting results and deriving actionable insights from EDA results pose challenges that affect applicability to a wider audience.

    Key Opportunity of market.

    Growing miniaturization in various industries can be an opportunity.
    

    With the age of highly advanced electronics, miniaturization has become a trend that enabled organizations across diverse sectors such as healthcare, consumer electronics, aerospace and defense, automotive and others to design miniature electronic devices. The devices incorporate miniaturized semiconductor components, e.g., surgical instruments and blood glucose meters in healthcare, fitness bands in wearable devices, automotive modules in the automotive sector, and intelligent baggage labels. Miniaturization has a number of advantages such as freeing space for other features and better batteries. The increased consciousness among consumers towards fitness is fueling the demand for smaller fitness devices such as smartwatches and fitness trackers. This is motivating companies to come up with innovative products with improved features, while researchers are concentrating on cost-effective and efficient product development through electronic design tools. Besides, use of portable equipment has gained immense popularity among media professionals because of the increasing demand for live reporting of different events like riots, accidents, sports, and political rallies. As a result of the inconvenience in the use of cumbersome TV production vans to access such events, demand for portable handheld equipment has risen. Such devices are simply portable and can be quickly moved to the event venue if carried in backpacks. Therefore, the need for compact devices across various indust...

  8. Guns incident data

    • kaggle.com
    Updated Sep 7, 2020
    Cite
    Aman Miglani (2020). Guns incident data [Dataset]. https://www.kaggle.com/datasets/datatattle/guns-incident-data/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Miglani
    Description

    This data consists of incidents involving guns. Perform EDA to find hidden patterns. Columns: 1) Race: Race of individual 2) Date: Date of incident 3) Education 4) Police involvement

    Please leave an upvote if you find this relevant. P.S. I am new and it will help immensely. :)

  9. Data on EEG, EDA, BVP, psychological responses and audio files used for the...

    • figshare.com
    xlsx
    Updated Sep 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Norberto Emmanuel Naal-Ruiz; Hyunkook Lee; Luz Maria Alonso-Valerdi; David Isaac Ibarra Zarate (2024). Data on EEG, EDA, BVP, psychological responses and audio files used for the study of 3D Audio Immersive Experience [Dataset]. http://doi.org/10.6084/m9.figshare.25421464.v3
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    figshare
    Authors
    Norberto Emmanuel Naal-Ruiz; Hyunkook Lee; Luz Maria Alonso-Valerdi; David Isaac Ibarra Zarate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data described in this repository has five items:

    • DataSpecs: This Excel file has six worksheets with the following information: demographic data, biofiles available, Immersive Tendencies Questionnaire responses, immersive questionnaire responses, items of questionnaires, and EEG electrode positions in Theta/Phi coordinates.

    • LoudspeakerInformation: PDF file explaining the alignment and positions of loudspeakers for stereo, PCMA-3D, and ESMA-3D audio playback.

    • RawData: Folder with individual subfolders of participants labeled with assigned ID. Each folder has EEG, EDA, and BVP files in GDF format for three conditions: 1) resting state (Bl), 2) concert hall (Music), and 3) urban park (Park) soundscapes. Sample rates are: EEG = 500 Hz, BVP = 64 Hz, and EDA = 4 Hz. The assigned audio group (Stereo or 3D) is specified in file names. For example, file “01_Stereo_BVP_Bl” corresponds to BVP data in the resting state of participant 01, assigned to the Stereo group.

    • LatencyAdjustment: Folder with individual subfolders of participants labeled with assigned ID in SET/FDT format. The only difference is that the "condition 8" onset was adjusted according to the latency caused by the distance between the audio system and participants (2 m). Condition 8 indicates the moment a soundscape (Music or Park) was played.

    • AudioFiles: This folder contains two subfolders:

      • Music: 2-minute-long WAV audio files of concert hall recordings prepared to be heard on PCMA-3D and Stereo (Downmix files) loudspeaker arrays at a 48 kHz sample rate and 24-bit depth.

      • Park: 2-minute-long WAV audio files of urban park recordings prepared to be heard on ESMA-3D and Stereo (Downmix files) loudspeaker arrays at a 48 kHz sample rate and 24-bit depth.

      Stereo downmix files include the word “_Downmix_”.

    Note: In the worksheet Items of DataSpecs, the codes that the questionnaires provide are included. Just one item of the Immersive Tendencies Questionnaire and the items of the Self-assessment manikin test do not have codes in their original publications.
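
The file-naming convention described for the raw data can be unpacked programmatically. The sketch below splits a name such as “01_Stereo_BVP_Bl” into its parts and attaches the sample rate stated in the description. It assumes every raw file follows the same four-part, underscore-separated pattern as the single example given, which the description does not guarantee; `parse_recording_name` is a hypothetical helper, not something shipped with the dataset.

```python
# Sample rates and condition labels as stated in the dataset description.
SAMPLE_RATES_HZ = {"EEG": 500, "BVP": 64, "EDA": 4}
CONDITIONS = {"Bl": "resting state", "Music": "concert hall", "Park": "urban park"}


def parse_recording_name(name):
    """Split '<participant>_<group>_<signal>_<condition>' into a dict.

    Raises ValueError if the name does not have exactly four parts and
    KeyError if the signal or condition code is not one listed above.
    """
    participant, group, signal, condition = name.split("_")
    return {
        "participant": participant,
        "group": group,
        "signal": signal,
        "sample_rate_hz": SAMPLE_RATES_HZ[signal],
        "condition": CONDITIONS[condition],
    }


if __name__ == "__main__":
    print(parse_recording_name("01_Stereo_BVP_Bl"))
```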

  10. Eda Import Data in September - Seair.co.in

    • seair.co.in
    Updated Sep 29, 2016
    + more versions
    Cite
    Seair Exim (2016). Eda Import Data in September - Seair.co.in [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Sep 29, 2016
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    Haiti, Falkland Islands (Malvinas), Solomon Islands, Northern Mariana Islands, Brunei Darussalam, Western Sahara, Heard Island and McDonald Islands, Equatorial Guinea, Taiwan, Saint Helena
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  11. EDA Movies

    • kaggle.com
    Updated Oct 1, 2024
    Cite
    Rehenatun Jannat (2024). EDA Movies [Dataset]. https://www.kaggle.com/datasets/rehenatunjannat/eda-movies/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rehenatun Jannat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Rehenatun Jannat

    Released under CC0: Public Domain

    Contents

  12. vgen_cpp

    • huggingface.co
    Cite
    nwang227, vgen_cpp [Dataset]. https://huggingface.co/datasets/LLM-EDA/vgen_cpp
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    nwang227
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Opencores

    In the process of continual pre-training, we utilized the publicly available VGen dataset. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing `module` and `endmodule` statements. We also incorporated the CodeSearchNet dataset, which contains approximately 40 MB of function code and documentation.… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/vgen_cpp.

  13. DA-Code

    • huggingface.co
    Cite
    Jianwen Luo, DA-Code [Dataset]. https://huggingface.co/datasets/Jianwen2003/DA-Code
    Explore at:
    Authors
    Jianwen Luo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [EMNLP2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

    DA-Code is a comprehensive evaluation dataset designed to assess the data analysis and code generation capabilities of LLMs in agent-based data science tasks. Our papers and experiment reports have been published on arXiv.

      Dataset Overview
    

    500 complex real-world data analysis tasks across Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA). Tasks cover… See the full description on the dataset page: https://huggingface.co/datasets/Jianwen2003/DA-Code.

  14. Eda international inc USA Import & Buyer Data

    • seair.co.in
    + more versions
    Cite
    Seair Exim, Eda international inc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  15. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)
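
To make the trade-off between these three objectives concrete, here is a minimal sketch of Pareto filtering over candidate hardware configurations. It is not the thesis's implementation (that lives in Optimization_Model.ipynb); the `Config` type, its field names, and the example values are all hypothetical and chosen only to illustrate dominance with one maximized and two minimized objectives.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    throughput: float  # higher is better
    energy: float      # estimated energy per unit, lower is better
    cost: float        # estimated hardware cost, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on every objective
    and strictly better on at least one."""
    no_worse = (a.throughput >= b.throughput
                and a.energy <= b.energy
                and a.cost <= b.cost)
    strictly_better = (a.throughput > b.throughput
                       or a.energy < b.energy
                       or a.cost < b.cost)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


if __name__ == "__main__":
    candidates = [
        Config("A", 100.0, 5.0, 10.0),
        Config("B", 90.0, 6.0, 12.0),   # worse than A on all three objectives
        Config("C", 120.0, 8.0, 20.0),
        Config("D", 80.0, 4.0, 9.0),
    ]
    print([c.name for c in pareto_front(candidates)])
```

The pairwise check is quadratic in the number of candidates, which is adequate for a benchmark table of modest size; larger candidate sets would call for a sort-based method.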

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
         git clone
         cd
    2. Create and activate a virtual environment (optional but recommended):
         python -m venv venv
         source venv/bin/activate # On Windows, use `venv\Scripts\activate`
    3. Install the required packages:
       All dependencies are listed in the `requirements.txt` file. Install them using pip:
         pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
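
The "group-aware" cross-validation mentioned in step 1 means that all rows sharing a group label stay on the same side of each train/test split, so that closely related rows cannot leak between the two sides. The sketch below shows the idea in plain Python. It is an illustration only: the notebook may well rely on a library splitter instead, and the description does not say what defines a group, so the labels in the example are hypothetical.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_indices, test_indices) pairs such that no group
    label appears in both the train and the test side of a split."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    # Greedy balancing: place the largest groups first, always into
    # the fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for label in ordered:
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[label])

    for fold in folds:
        test = sorted(fold)
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, test


if __name__ == "__main__":
    # Hypothetical group labels, one per row.
    labels = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s6"]
    for train_idx, test_idx in group_k_fold(labels, n_splits=5):
        print(train_idx, test_idx)
```
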
  16. ML-Based RUL Prediction for NPP Transformers

    • kaggle.com
    Updated Apr 10, 2025
    Cite
    Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dmitry_Menyailov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    Notebooks

    1. Exploratory_Data_Analysis

    https://www.kaggle.com/code/idmitri/exploratory-data-analysis

    2. RUL_Prediction_Modeling

    https://www.kaggle.com/code/idmitri/rul-prediction-modeling

    About the project

    Power transformers at nuclear power plants (NPPs) can be operated beyond their design service life (25 years), which requires enhanced monitoring of their condition to ensure reliable and safe operation.

    Transformer condition is assessed using chromatographic dissolved gas analysis, which makes it possible to detect defects from gas concentrations in the oil and to predict the transformer's remaining useful life (RUL). Traditional monitoring systems are limited to fixed concentration thresholds, which reduces diagnostic accuracy and the degree of automation. Machine learning methods make it possible to uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/

    Results

    This project carries out an in-depth exploratory data analysis (EDA) and creates 12 feature groups:
    - gases (gas concentrations)
    - trend (trend components)
    - seasonal (seasonal components)
    - resid (residual components)
    - quantiles (distribution quantiles)
    - volatility (volatility of concentrations)
    - range (range of values)
    - coefficient of variation
    - standard deviation
    - skewness (asymmetry of the distribution)
    - kurtosis (peakedness of the distribution)
    - category (categorical fault features)

    Using statistical and decomposition features made it possible to match the accuracy of the RUL distribution silhouette with automatic outlier handling, which previously required manual correction.

    Machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used for modeling. The best accuracy was achieved by the LightGBM model with hyperparameters optimized using Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.
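
For reference, the three reported metrics (MAE, RMSE, R2) follow the standard definitions sketched below. This is a plain-Python illustration on made-up numbers, not the project's evaluation code, and the function names are hypothetical; the project itself may compute these through a library.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values, purely to exercise the functions.
    actual = [3.0, 5.0, 7.0]
    predicted = [2.0, 5.0, 9.0]
    print(mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

MAE and RMSE are expressed in the same units as the RUL target, whereas R2 is unitless, which is why the three are typically reported together.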

    Комментарий

    Код для проведения разведочного анализа данных (EDA) был разработан и протестирован локально в VSC Jupyter Notebook с использованием окружения Python 3.10.16. И на платформе Kaggle большинство графиков отображается корректно. Но некоторые сложные и комплексные визуализации (например, многомерные графики с цветовой шкалой) не адаптированы из-за ограничений среды. Несмотря на попытки оптимизировать код без существенных изменений, добиться полной совместимости не удалось. Основная проблема заключалась в конфликте версий библиотек и значительном снижении производительности — расчет занимал примерно в 10 раз больше времени по сравнению с локальной машиной MacBook M3 Pro. На Kaggle либо корректно выполнялись операции с использованием PyCaret, либо работали модели машинного обучения, но не обе части одновременно.

    A hybrid workflow is proposed:
    - Publishing and displaying the metrics on Kaggle to visualize the results.
    - Running the computations and model training locally in a preconfigured Python 3.10.16 environment.

    To reproduce the experiments, a Codes folder has been prepared containing the VSC EDA and RUL code and the libraries_for_modeling file, which lists the versions of all libraries used.
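
Before rerunning the notebooks, it can help to compare the local environment with the versions recorded by the author. The sketch below is an assumption-laden illustration: the format of libraries_for_modeling is not described in the source, so the code only reports what is installed locally; the helper name `report_versions` and the package list are chosen here.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package list assumed from the libraries named in the description.
print("Python", sys.version.split()[0])
for name, version in report_versions(
    ["lightgbm", "catboost", "optuna", "pycaret"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be checked by hand against the author's list to spot the kind of conflicts described in the comment.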

    I am happy to answer any questions about setting up and running the code in the comments, and I would appreciate advice on how to prevent similar problems.

  17. Final Project EDA Statprob

    • kaggle.com
    Updated Dec 13, 2023
    Cite
    Revalina F (2023). Final Project EDA Statprob [Dataset]. https://www.kaggle.com/datasets/revalinaf/final-project-eda-statprob/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Revalina F
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Revalina F

    Released under MIT

    Contents

  18. Physiological Data Collected from smartwatch: EDA, PPG, and Skin Temperature...

    • zenodo.org
    Updated May 26, 2025
    Cite
    Carlos Albarrán Morillo; John F. Suárez-Pérez; Camargo Salinas Mónica Andrea; Nasli Miranda Arandia (2025). Physiological Data Collected from smartwatch: EDA, PPG, and Skin Temperature and external factors in a pharmaceutical case study [Dataset]. http://doi.org/10.5281/zenodo.14891916
    Explore at:
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carlos Albarrán Morillo; John F. Suárez-Pérez; Camargo Salinas Mónica Andrea; Nasli Miranda Arandia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was collected in a pharmaceutical case study where participants performed repetitive packing tasks for approximately 20 minutes directly on the production line. The study aimed to assess physiological and ergonomic factors affecting workers during the task.

    Key Variables:

    • Participant Information:

      • ID participant: Unique identifier for each participant.
      • Age: Age of the participant.
      • Experience: Work experience in years.
    • Task Context:

      • Moment: Time of measurement during the shift (Start, Middle, End).
      • Turn: Work shift number.
      • Plant/Line: Identification of the production line.
      • Day: Day of the week.
      • Time: Exact timestamp of data collection.
      • LoTNum: Lot number for batch packing.
    • Physiological Measurements (from wearable devices):

      • eda_scl_usiemens: Electrodermal activity (EDA) in microsiemens.
      • pulse_rate_bpm: Heart rate in beats per minute.
      • temperature_celsius: Skin temperature in Celsius.
      • accelerometers_std_g: Standard deviation of accelerometer readings (movement intensity).
      • steps_count: Number of steps taken.
      • activity_counts: General activity level.
    • Ergonomic and Risk Indicators:

      • IndexRiskR: Risk index for the right hand.
      • IndexRiskL: Risk index for the left hand.
      • Borg Test: Subjective rating of perceived exertion (Borg scale).
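
As a rough illustration of how the variables above might be used, the sketch below averages a physiological measurement by the Moment of the shift. It assumes the data can be read as CSV with the column names listed above; the actual file format is not stated in the description, and the sample rows are invented placeholders, not values from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Invented placeholder rows that only mimic the documented column names.
SAMPLE_CSV = """ID participant,Moment,pulse_rate_bpm,eda_scl_usiemens
P01,Start,72,0.41
P01,End,80,0.55
P02,Start,68,0.30
P02,End,76,0.48
"""


def mean_by_moment(csv_text, column):
    """Average one numeric column for each value of the Moment column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Moment"]].append(float(row[column]))
    return {moment: fmean(vals) for moment, vals in groups.items()}


pulse_by_moment = mean_by_moment(SAMPLE_CSV, "pulse_rate_bpm")
print(pulse_by_moment)
```

The same grouping could be applied to eda_scl_usiemens or temperature_celsius to compare the start, middle, and end of a shift.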
  19. Titanic EDA

    • kaggle.com
    zip
    Updated Aug 3, 2021
    Cite
    Gourav Rohra (2021). Titanic EDA [Dataset]. https://www.kaggle.com/gouravrohra/titanic-eda
    Explore at:
    Available download formats: zip (58919 bytes)
    Dataset updated
    Aug 3, 2021
    Authors
    Gourav Rohra
    Description

    Dataset

    This dataset was created by Gourav Rohra

    Contents

  20. Eda Import Data in October - Seair.co.in

    • seair.co.in
    Updated Oct 28, 2016
    Cite
    Seair Exim (2016). Eda Import Data in October - Seair.co.in [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Oct 28, 2016
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    Cocos (Keeling) Islands, Honduras, Malawi, Guernsey, Kenya, Åland Islands, Saint Barthélemy, Myanmar, Svalbard and Jan Mayen, Central African Republic
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
