58 datasets found

Advanced exploratory data analysis (EDA)
kaggle.com
Updated Nov 18, 2023
Cite
Mustafa Ghzi (2023). Advanced exploratory data analysis (EDA) [Dataset]. https://www.kaggle.com/datasets/mustafaghzi/advanced-exploratory-data-analysis-eda/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 18, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mustafa Ghzi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset

This dataset was created by Mustafa Ghzi

Released under CC BY-NC-SA 4.0

Contents
Eda_all Dataset
universe.roboflow.com
zip
Updated May 24, 2024
Cite
cropperyash (2024). Eda_all Dataset [Dataset]. https://universe.roboflow.com/cropperyash/eda_all/model/1
Explore at:
Available download formats: zip
Dataset updated
May 24, 2024
Dataset authored and provided by
cropperyash
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
All Polygons
Description
Eda_all

## Overview
Eda_all is a dataset for instance segmentation tasks - it contains All annotations for 1,314 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
EDA Signal Dataset Collected During Startle Events While Walking With a...
zenodo.org
zip
Updated Jun 23, 2025
Cite
Rafael Villalba-Bravo (2025). EDA Signal Dataset Collected During Startle Events While Walking With a Smart Cane [Dataset]. http://doi.org/10.5281/zenodo.15715155
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15715155
Dataset updated
Jun 23, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rafael Villalba-Bravo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
EDA Signal Dataset Collected During Startle Events While Walking With a Smart Cane

This dataset accompanies the publication (currently under review):

Villalba-Bravo, R., Grande-Bueno, S., Trujillo-León, A., & Vidal-Verdú, F.
Analysis of EDA signal features under motion artifacts for non-personalized detection of startle events using a smart cane
IEEE SENSORS 2025, Vancouver, Canada.

Description

This dataset includes Electrodermal Activity (EDA) signals collected from seven participants during an experiment in which they walked on a treadmill at a constant speed of 1 km/h while using a smart cane. During the walking task, participants were exposed to auditory startle stimuli designed to elicit stress responses. The smart cane was equipped with a Galvanic Skin Response (GSR) sensor integrated into its handle to continuously record physiological signals in a natural walking context.

The data is organized by participant. All participants provided written informed consent both to take part in the experiment and to allow their anonymized data to be publicly shared for research purposes. Furthermore, the experiment was approved by the Ethical Committee of the Universidad de Málaga (reference 46-2024-H).

Folder Structure

Each folder corresponds to a participant session (e.g., S0/, S2/, etc.) and contains the following files:

S0/
├── S0_DataExperiment.mat
├── S0_audioEventVector.mat
└── S0_SA_Score.mat

...

S8/
├── S8_DataExperiment.mat
├── S8_audioEventVector.mat
└── S8_SA_Score.mat

In addition, the dataset includes a CSV file named caneFeatures_pre_post.csv, containing the extracted features from the GSR, tonic and phasic signals, allowing for the replication of the statistical analyses presented in the study.

File Descriptions

1. S*_DataExperiment.mat

Description: This file contains the EDA signals acquired at a 4 Hz sampling rate during the experiment, stored in MATLAB .mat format as a structured variable.

Format: MATLAB Struct (3 fields)

GSR: Contains the raw GSR signal along with associated time information: TimeStampDate (UTC date-time format) and TimeStampPosix (POSIX timestamp).

TONIC: Contains the tonic component of the EDA signal with the same timestamp fields.

PHASIC: Contains the phasic component of the EDA signal with the corresponding timestamps.

2. S*_audioEventVector.mat

Description: This file contains information about the timing of the auditory startle stimuli presented during the experiment. The data is stored as a MATLAB struct sampled at 32 Hz.

Format: MATLAB Struct (3 fields)

data: A binary step signal indicating the presence of auditory events (0 = no stimulus, 1 = stimulus being played).

TimeStampDate: A vector of timestamps in MATLAB datetime format, corresponding to each sample in the data field.
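
The description gives two different sampling rates: 4 Hz for the EDA signals and 32 Hz for the audio event vector. As a minimal sketch of how stimulus onsets could be located and mapped onto the EDA time base, the following assumes that both recordings start at the same instant; with the real files, the TimeStampDate fields would be the safer basis for alignment. The event vector below is hypothetical and is not taken from the dataset.

```python
from typing import List

AUDIO_RATE_HZ = 32  # sampling rate of the audio event vector (from the description)
EDA_RATE_HZ = 4     # sampling rate of the EDA signals (from the description)


def stimulus_onsets(events: List[int]) -> List[int]:
    """Return the indices where the binary step signal rises from 0 to 1."""
    onsets = []
    previous = 0
    for index, value in enumerate(events):
        if value == 1 and previous == 0:
            onsets.append(index)
        previous = value
    return onsets


def to_eda_index(audio_index: int) -> int:
    """Map a 32 Hz sample index to the 4 Hz EDA sample covering the same instant.

    Assumption: both signals share the same start time.
    """
    return audio_index * EDA_RATE_HZ // AUDIO_RATE_HZ


# Hypothetical 2-second event vector with one stimulus starting at 0.5 s.
events = [0] * 16 + [1] * 24 + [0] * 24
onsets = stimulus_onsets(events)
eda_indices = [to_eda_index(i) for i in onsets]
```

Integer division is used so that each onset falls on the EDA sample whose interval contains it, rather than on a rounded neighbor.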

3. S*_SA_Score.mat

Description: This file contains the self-reported State Anxiety (STAI-State) scores provided by each participant before and after the experimental session. The data is stored as a MATLAB struct.

Format: MATLAB Struct (2 fields)

Training: Numeric score reported after the training session.

Experiment: Numeric score reported after the experimental session.

Contact Information

For any questions or further information regarding this dataset, please contact fvidal@uma.es.
Solar Panel Eda Dataset
universe.roboflow.com
zip
Updated Aug 29, 2024
Cite
Ramkumar (2024). Solar Panel Eda Dataset [Dataset]. https://universe.roboflow.com/ramkumar/solar-panel-eda
Explore at:
Available download formats: zip
Dataset updated
Aug 29, 2024
Dataset authored and provided by
Ramkumar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Solar Panel Bounding Boxes
Description
Solar Panel EDA

## Overview
Solar Panel EDA is a dataset for object detection tasks - it contains Solar Panel annotations for 721 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
opencores
huggingface.co
Updated Dec 16, 2024
Cite
nwang227 (2024). opencores [Dataset]. https://huggingface.co/datasets/LLM-EDA/opencores
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 16, 2024
Authors
nwang227
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Opencores

We gathered high-quality specification-code pairs from Opencores, a community aimed at developing open-source digital hardware using electronic design automation (EDA). We then filtered out data instances exceeding 4096 characters in length and those that could not be parsed into Abstract Syntax Trees (ASTs). The final dataset comprises approximately 800 data instances.
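
The two filtering rules described above (a 4096-character limit and successful AST parsing) could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the `code` field name, the choice to apply the limit to the code field, and the `toy_can_parse` check are all assumptions, and a real setup would plug in an actual HDL parser.

```python
from typing import Callable, Dict, List

MAX_CHARS = 4096  # length limit stated in the dataset card


def toy_can_parse(code: str) -> bool:
    """Toy stand-in for a real AST parser (hypothetical helper).

    It only checks that the text starts with `module` and ends with `endmodule`.
    """
    stripped = code.strip()
    return stripped.startswith("module") and stripped.endswith("endmodule")


def filter_pairs(
    pairs: List[Dict[str, str]],
    can_parse: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Keep pairs whose code is within the length limit and can be parsed."""
    kept = []
    for pair in pairs:
        code = pair["code"]  # assumed field name
        if len(code) > MAX_CHARS:
            continue  # too long
        if not can_parse(code):
            continue  # parser rejected it
        kept.append(pair)
    return kept


samples = [
    {"instruction": "2-input AND gate",
     "code": "module and2(input a, b, output y); assign y = a & b; endmodule"},
    {"instruction": "truncated design", "code": "module broken("},
    {"instruction": "oversized design",
     "code": "module big; " + "x" * 5000 + " endmodule"},
]
kept = filter_pairs(samples, toy_can_parse)
```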

Dataset Features

instruction (string): The natural language instruction for… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/opencores.
Eda Export Data of HS Code 29212100 India – Seair.co.in
seair.co.in
Updated Apr 20, 2016
+ more versions
Cite
Seair Exim (2016). Eda Export Data of HS Code 29212100 India – Seair.co.in [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Apr 20, 2016
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
Venezuela (Bolivarian Republic of), India, Colombia, Algeria, Estonia, Antarctica, Georgia, Croatia, Niue, Morocco
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
The Global EDA Market size was USD 14.9 billion in 2023!
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Apr 30, 2025
Cite
Cognitive Market Research (2025). The Global EDA Market size was USD 14.9 billion in 2023! [Dataset]. https://www.cognitivemarketresearch.com/eda-market-report
Explore at:
Available download formats: pdf, excel, csv, ppt
Dataset updated
Apr 30, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global EDA market size was USD 14.9 billion in 2023 and will grow at a compound annual growth rate (CAGR) of 10.50% from 2023 to 2030.
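
To make the growth claim concrete, the compound-growth arithmetic can be written out as a short sketch. The base value, rate, and years are the ones quoted above; the resulting 2030 figure is only what those numbers imply and is not a value reported by the source.

```python
def project_value(base: float, annual_rate: float, years: int) -> float:
    """Compound a base value forward: base * (1 + rate) ** years."""
    return base * (1.0 + annual_rate) ** years


# USD 14.9 billion in 2023, growing at 10.50% per year until 2030.
market_2030_busd = project_value(14.9, 0.105, 2030 - 2023)
print(f"Implied 2030 market size: USD {market_2030_busd:.1f} billion")
```

Under these inputs the market roughly doubles over the seven-year window, landing at about USD 30 billion.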

The demand for the EDA Market is rising due to the rise in outdoor and adventure activities. Changing consumer lifestyle trends are higher in the EDA market. The cat segment held the highest EDA Market revenue share in 2023. North American EDA will continue to lead, whereas the European EDA Market will experience the most substantial growth until 2030.

Supply Chain and Risk Analysis to Provide Viable Market Output

The industry is facing supply chain and logistics disruptions. EDA tools have been instrumental in analyzing supply chain data, identifying vulnerabilities, predicting risks, and developing disruption mitigation strategies. Consumer behavior has undergone drastic changes due to blockages and restrictions. EDA helps companies analyze changing trends in buying behavior, online shopping preferences, and demand patterns, enabling organizations to adjust their marketing and sales strategies accordingly.

Health and Pharmaceutical Research to Propel Market Growth.

EDA tools have played a key role in analyzing large amounts of data related to vaccine development, drug trials, patient records and epidemiological studies. These tools have helped researchers process and interpret complex medical data, leading to advances in the development of treatments and vaccines. The pandemic has created challenges in data collection, especially in sectors affected by lockdowns or blackouts. Rapidly changing conditions and incomplete data sets make effective EDA difficult due to data quality issues. The economic uncertainty caused by the pandemic has led to budget cuts in some sectors, impacting investment in new technologies. Some organizations have limited budgets that limit their ability to adopt or update EDA tools.

Market Dynamics of the EDA

Privacy and Data Security Issues to Restrict Market Growth.

With the focus on data privacy regulations such as GDPR, CCPA, etc., organizations need to ensure compliance when handling sensitive data. These compliance requirements may limit the scope of the EDA by limiting the availability and use of certain data sets for information analysis. EDA often requires data analysts or data scientists who are skilled in statistical analysis and data visualization tools. A lack of professionals with these specialized skills can hinder an organization's ability to use EDA tools effectively, limiting adoption. Advanced EDA techniques can involve complex algorithms and statistical techniques that are difficult for non-technical users to understand. Interpreting results and deriving actionable insights from EDA results pose challenges that affect applicability to a wider audience.

Key Opportunity of market.

Growing miniaturization in various industries can be an opportunity.

With the age of highly advanced electronics, miniaturization has become a trend that enabled organizations across diverse sectors such as healthcare, consumer electronics, aerospace and defense, automotive and others to design miniature electronic devices. The devices incorporate miniaturized semiconductor components, e.g., surgical instruments and blood glucose meters in healthcare, fitness bands in wearable devices, automotive modules in the automotive sector, and intelligent baggage labels. Miniaturization has a number of advantages such as freeing space for other features and better batteries. The increased consciousness among consumers towards fitness is fueling the demand for smaller fitness devices such as smartwatches and fitness trackers. This is motivating companies to come up with innovative products with improved features, while researchers are concentrating on cost-effective and efficient product development through electronic design tools. Besides, use of portable equipment has gained immense popularity among media professionals because of the increasing demand for live reporting of different events like riots, accidents, sports, and political rallies. As a result of the inconvenience in the use of cumbersome TV production vans to access such events, demand for portable handheld equipment has risen. Such devices are simply portable and can be quickly moved to the event venue if carried in backpacks. Therefore, the need for compact devices across various indust...
Guns incident data
kaggle.com
Updated Sep 7, 2020
Cite
Aman Miglani (2020). Guns incident data [Dataset]. https://www.kaggle.com/datasets/datatattle/guns-incident-data/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 7, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aman Miglani
Description
This data consists of incidents involving guns. Perform EDA to find the hidden patterns. Columns: 1) Race: Race of individual 2) Date: Date of incident 3) Education 4) Police involvement

Please leave an upvote if you find this relevant. P.S. I am new and it will help immensely. :)
Data on EEG, EDA, BVP, psychological responses and audio files used for the...
figshare.com
xlsx
Updated Sep 10, 2024
Cite
Norberto Emmanuel Naal-Ruiz; Hyunkook Lee; Luz Maria Alonso-Valerdi; David Isaac Ibarra Zarate (2024). Data on EEG, EDA, BVP, psychological responses and audio files used for the study of 3D Audio Immersive Experience [Dataset]. http://doi.org/10.6084/m9.figshare.25421464.v3
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.25421464.v3
Dataset updated
Sep 10, 2024
Dataset provided by
figshare
Authors
Norberto Emmanuel Naal-Ruiz; Hyunkook Lee; Luz Maria Alonso-Valerdi; David Isaac Ibarra Zarate
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data described in this repository has five items:

DataSpecs: This Excel file has six worksheets with the following information: demographic data, biofiles available, Immersive Tendencies Questionnaire responses, immersive questionnaire responses, items of questionnaires, and EEG electrode positions in Theta/Phi coordinates.

LoudspeakerInformation: PDF file explaining the alignment and positions of loudspeakers for stereo, PCMA-3D, and ESMA-3D audio playback.

RawData: Folder with individual subfolders of participants labeled with assigned ID. Each folder has EEG, EDA, and BVP files in GDF format for three conditions: 1) resting state (Bl), 2) concert hall (Music), and 3) urban park (Park) soundscapes. The assigned audio group (Stereo or 3D) is specified in file names. Sample rates are: EEG = 500 Hz, BVP = 64 Hz, and EDA = 4 Hz. For example, file “01_Stereo_BVP_Bl” corresponds to BVP data in the resting state of the participant 01 assigned to the Stereo group.

LatencyAdjustment: Folder with individual subfolders of participants labeled with assigned ID in SET/FDT format. The only difference is that "condition 8" onset was adjusted according to the latency caused by the distance between the audio system and participants (2 m). Condition 8 indicates the moment a soundscape (Music or Park) was played.

AudioFiles: This folder contains two subfolders:

Music: 2-minute long WAV audio files of concert hall recordings prepared to be heard on PCMA-3D and Stereo (Downmix files) loudspeaker array at 48 kHz sample rate and 24-bit depth.

Park: 2-minute long WAV audio files of urban park recordings prepared to be heard on ESMA-3D and Stereo (Downmix files) loudspeaker array at 48 kHz sample rate and 24-bit depth.

Stereo downmix files include the word “_Downmix_”.

Note: In the worksheet Items of DataSpecs, the codes that the questionnaires provide are included. Just one item of the Immersive Tendencies Questionnaire and the items of the Self-assessment manikin test do not have codes in their original publications.
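
The description mentions a file naming scheme and an onset correction for the 2 m distance between the audio system and participants. The sketch below shows one way to split such a file name and to estimate the acoustic delay. It is not the authors' procedure: the speed of sound (343 m/s) is an assumed value, and the field order is inferred from the single example name given above.

```python
from typing import Dict

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees Celsius
EEG_RATE_HZ = 500           # EEG sample rate stated in the description


def parse_file_name(name: str) -> Dict[str, str]:
    """Split a name such as '01_Stereo_BVP_Bl' into its four parts."""
    participant, group, signal, condition = name.split("_")
    return {
        "participant": participant,
        "group": group,
        "signal": signal,
        "condition": condition,
    }


def acoustic_latency_s(distance_m: float) -> float:
    """Time for sound to travel the given distance, in seconds."""
    return distance_m / SPEED_OF_SOUND_M_S


info = parse_file_name("01_Stereo_BVP_Bl")
latency = acoustic_latency_s(2.0)
eeg_shift_samples = round(latency * EEG_RATE_HZ)
```

At 500 Hz a delay of this size amounts to only a few samples, which may explain why the adjusted files are described as differing from the raw data in the event onset alone.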
Eda Import Data in September - Seair.co.in
seair.co.in
Updated Sep 29, 2016
+ more versions
Cite
Seair Exim (2016). Eda Import Data in September - Seair.co.in [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Sep 29, 2016
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
Haiti, Falkland Islands (Malvinas), Solomon Islands, Northern Mariana Islands, Brunei Darussalam, Western Sahara, Heard Island and McDonald Islands, Equatorial Guinea, Taiwan, Saint Helena
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
EDA Movies
kaggle.com
Updated Oct 1, 2024
Cite
Rehenatun Jannat (2024). EDA Movies [Dataset]. https://www.kaggle.com/datasets/rehenatunjannat/eda-movies/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 1, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rehenatun Jannat
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Rehenatun Jannat

Released under CC0: Public Domain

Contents
vgen_cpp
huggingface.co
Cite
nwang227, vgen_cpp [Dataset]. https://huggingface.co/datasets/LLM-EDA/vgen_cpp
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Authors
nwang227
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Opencores

In the process of continual pre-training, we utilized the publicly available VGen dataset. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing module and endmodule statements. We also incorporated the CodeSearchNet dataset, which contains approximately 40 MB of function code and documentation.… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/vgen_cpp.
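
The VGen filtering steps named above (deduplication, a size cap, and a module/endmodule check) could look roughly like the sketch below. It is an illustration under stated assumptions, not VGen's actual code: exact-match hashing stands in for whatever duplicate detection VGen uses, and `MAX_BYTES` is a hypothetical threshold because the source does not give one.

```python
import hashlib
import re
from typing import Dict

MAX_BYTES = 100_000  # hypothetical size cap; the source does not state the value
MODULE_RE = re.compile(r"\bmodule\b")
ENDMODULE_RE = re.compile(r"\bendmodule\b")


def filter_verilog(files: Dict[str, str]) -> Dict[str, str]:
    """Drop exact duplicates, oversized files, and files without a module body."""
    seen_digests = set()
    kept = {}
    for path, text in files.items():
        raw = text.encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate of an earlier file
        seen_digests.add(digest)
        if len(raw) > MAX_BYTES:
            continue  # excessively large
        if not (MODULE_RE.search(text) and ENDMODULE_RE.search(text)):
            continue  # lacks module or endmodule
        kept[path] = text
    return kept


files = {
    "a.v": "module a; endmodule",
    "b.v": "module a; endmodule",  # duplicate of a.v
    "c.v": "// only a comment",
    "d.v": "endmodule",            # closing keyword without an opening one
}
kept = filter_verilog(files)
```

Word-boundary patterns are used so that the `module` check is not satisfied by the substring inside `endmodule`.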
DA-Code
huggingface.co
Cite
Jianwen Luo, DA-Code [Dataset]. https://huggingface.co/datasets/Jianwen2003/DA-Code
Explore at:
Authors
Jianwen Luo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
[EMNLP2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

DA-Code is a comprehensive evaluation dataset designed to assess the data analysis and code generation capabilities of LLMs in agent-based data science tasks. Our papers and experiment reports have been published on arXiv.

Dataset Overview

500 complex real-world data analysis tasks across Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA). Tasks cover… See the full description on the dataset page: https://huggingface.co/datasets/Jianwen2003/DA-Code.
Eda international inc USA Import & Buyer Data
seair.co.in
+ more versions
Cite
Seair Exim, Eda international inc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Replication Package for 'Data-Driven Analysis and Optimization of Machine...
zenodo.org
zip
Updated Jun 11, 2025
Cite
Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15643706
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Joel Castaño
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.

The framework considers the trade-offs between three key objectives:

1. Performance (maximizing throughput)

2. Energy Efficiency (minimizing estimated energy per unit)

3. Cost (minimizing estimated hardware cost)
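
The trade-off between these three objectives is usually expressed through Pareto dominance: a configuration is kept only if no other configuration is at least as good on every objective and strictly better on one. The following is a minimal sketch of that idea, not the thesis code; the `Config` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    throughput: float  # higher is better
    energy: float      # lower is better
    cost: float        # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on all objectives and better on at least one."""
    no_worse = (a.throughput >= b.throughput
                and a.energy <= b.energy
                and a.cost <= b.cost)
    strictly_better = (a.throughput > b.throughput
                       or a.energy < b.energy
                       or a.cost < b.cost)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


configs = [
    Config("A", throughput=1000.0, energy=2.0, cost=5000.0),
    Config("B", throughput=800.0, energy=1.5, cost=4000.0),
    Config("C", throughput=700.0, energy=2.5, cost=6000.0),  # dominated by A and B
    Config("D", throughput=1200.0, energy=3.0, cost=9000.0),
]
front = pareto_front(configs)
```

The pairwise check is quadratic in the number of configurations, which is acceptable for small candidate sets like this one.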

Repository Structure

This repository is organized as follows:

Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.

Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.

Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.

Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.

Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.

eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.

requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.

eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.

optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.

pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.

shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

Requirements and Installation

To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.

1. Clone the repository:

bash

git clone

cd

2. Create and activate a virtual environment (optional but recommended):

bash

python -m venv venv

source venv/bin/activate # On Windows, use `venv\Scripts\activate`

3. Install the required packages:

All dependencies are listed in the `requirements.txt` file. Install them using pip:

bash

pip install -r requirements.txt

Step-by-Step Reproduction Workflow

The notebooks are designed to be run in a logical sequence.

Step 1: Data Enrichment (Optional)

The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

Step 2: Exploratory Data Analysis (Optional)

All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

Step 3: Main Model Training, Validation, and Recommendation

This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:

It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.

It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.

It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.

It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
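
The "group-aware" cross-validation mentioned in the steps above means that all samples sharing a group label stay on the same side of each train/test split. A minimal pure-Python sketch of that constraint follows. The grouping key used in the thesis is not stated here, so the system labels are hypothetical; scikit-learn's GroupKFold offers the same guarantee for real use.

```python
from typing import Hashable, List, Sequence, Tuple


def group_k_fold(
    groups: Sequence[Hashable], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k folds so that no group spans train and test."""
    unique_groups = sorted(set(groups), key=str)
    if k > len(unique_groups):
        raise ValueError("k cannot exceed the number of distinct groups")
    # Assign whole groups to folds in round-robin order.
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    splits = []
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        splits.append((train, test))
    return splits


# Hypothetical group labels, one per benchmark row.
groups = ["sysA", "sysA", "sysB", "sysC", "sysC",
          "sysD", "sysE", "sysE", "sysF", "sysG"]
splits = group_k_fold(groups, k=5)
```

Round-robin assignment keeps the sketch short but does not balance fold sizes when groups differ a lot in size.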
ML-Based RUL Prediction for NPP Transformers
kaggle.com
Updated Apr 10, 2025
Cite
Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dmitry_Menyailov
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description

Notebooks

1. Exploratory_Data_Analysis

https://www.kaggle.com/code/idmitri/exploratory-data-analysis

2. RUL_Prediction_Modeling

https://www.kaggle.com/code/idmitri/rul-prediction-modeling

About the project

Power transformers at nuclear power plants (NPPs) can remain in operation beyond their design service life (25 years), which calls for enhanced condition monitoring to ensure reliable and safe operation.

Transformer condition is assessed using chromatographic analysis of dissolved gases, which makes it possible to detect defects from gas concentrations in the oil and to predict the transformer's remaining useful life (RUL). Traditional monitoring systems are limited to fixed concentration thresholds, which reduces diagnostic accuracy and the degree of automation. Machine learning methods can uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/

Results

This project carries out an in-depth exploratory data analysis (EDA) and builds 12 feature groups:
- gases (gas concentrations)
- trend (trend components)
- seasonal (seasonal components)
- resid (residual components)
- quantiles (distribution quantiles)
- volatility (volatility of concentrations)
- range (range of values)
- coefficient of variation
- standard deviation
- skewness (distribution skewness)
- kurtosis (distribution kurtosis)
- category (categorical fault features)
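
Several of the feature groups listed above are plain summary statistics of a concentration series. As a rough sketch of how a few of them could be computed with the standard library, the function below uses population moments, which is an assumption; the project notebooks may use different estimators or windowing, and the input values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summary_features(values: Sequence[float]) -> Dict[str, float]:
    """Compute range, spread, and shape statistics using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    third_moment = sum((v - mean) ** 3 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return {
        "range": max(values) - min(values),
        "standard_deviation": std,
        "coefficient_of_variation": std / mean,
        "skewness": third_moment / std ** 3,
        "kurtosis": fourth_moment / std ** 4 - 3.0,  # excess kurtosis
    }


# Hypothetical gas concentration readings.
features = summary_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

The function expects a non-constant series with a non-zero mean, since both the standard deviation and the mean appear as divisors.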

Using statistical and decomposition features made it possible to match the accuracy of the RUL distribution silhouette while handling outliers automatically, which previously required manual correction.

For modeling, machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used. The best accuracy was achieved by the LightGBM model with hyperparameters optimized using Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.
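
For readers unfamiliar with the three reported metrics (MAE, RMSE, and R2), their standard definitions can be written out as a short sketch. The sample values below are hypothetical and unrelated to the project's results.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical RUL targets and predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
scores = {"MAE": mae(y_true, y_pred),
          "RMSE": rmse(y_true, y_pred),
          "R2": r2(y_true, y_pred)}
```

RMSE is never smaller than MAE, and a gap between them, as in the reported 88.21 versus 61.85, indicates that some errors are considerably larger than the typical one.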

Comment

The code for the exploratory data analysis (EDA) was developed and tested locally in a VSC Jupyter Notebook using a Python 3.10.16 environment. On the Kaggle platform most plots render correctly, but some complex visualizations (for example, multidimensional plots with a color scale) have not been adapted because of environment limitations. Despite attempts to optimize the code without substantial changes, full compatibility could not be achieved. The main problems were conflicting library versions and a significant drop in performance: computation took roughly 10 times longer than on a local MacBook M3 Pro. On Kaggle, either the PyCaret operations ran correctly or the machine learning models worked, but not both at the same time.

A hybrid workflow is proposed:
- Publishing and displaying metrics on Kaggle to visualize the results.
- Local computation and model training using a preconfigured Python 3.10.16 environment.

To reproduce the experiments, a Codes folder has been prepared with the VSC EDA and RUL code and a libraries_for_modeling file listing the versions of all libraries used.

I am happy to answer any questions about setting up and running the code in the comments, and I would appreciate advice on how to prevent such problems.
Final Project EDA Statprob
kaggle.com
Updated Dec 13, 2023
Cite
Revalina F (2023). Final Project EDA Statprob [Dataset]. https://www.kaggle.com/datasets/revalinaf/final-project-eda-statprob/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 13, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Revalina F
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Revalina F

Released under MIT

Contents
Physiological Data Collected from smartwatch: EDA, PPG, and Skin Temperature...
zenodo.org
Updated May 26, 2025
Cite
Carlos Albarrán Morillo; John F. Suárez-Pérez; Camargo Salinas Mónica Andrea; Nasli Miranda Arandia (2025). Physiological Data Collected from smartwatch: EDA, PPG, and Skin Temperature and external factors in a pharmaceutical case study [Dataset]. http://doi.org/10.5281/zenodo.14891916
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14891916
Dataset updated
May 26, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Carlos Albarrán Morillo; John F. Suárez-Pérez; Camargo Salinas Mónica Andrea; Nasli Miranda Arandia
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was collected in a pharmaceutical case study where participants performed repetitive packing tasks for approximately 20 minutes directly on the production line. The study aimed to assess physiological and ergonomic factors affecting workers during the task.

Key Variables:

Participant Information:

ID participant: Unique identifier for each participant.

Age: Age of the participant.

Experience: Work experience in years.

Task Context:

Moment: Time of measurement during the shift (Start, Middle, End).

Turn: Work shift number.

Plant/Line: Identification of the production line.

Day: Day of the week.

Time: Exact timestamp of data collection.

LoTNum: Lot number for batch packing.

Physiological Measurements (from wearable devices):

eda_scl_usiemens: Electrodermal activity (EDA) in microsiemens.

pulse_rate_bpm: Heart rate in beats per minute.

temperature_celsius: Skin temperature in Celsius.

accelerometers_std_g: Standard deviation of accelerometer readings (movement intensity).

steps_count: Number of steps taken.

activity_counts: General activity level.

Ergonomic and Risk Indicators:

IndexRiskR: Risk index for the right hand.

IndexRiskL: Risk index for the left hand.

Borg Test: Subjective rating of perceived exertion (Borg scale).
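
The variable groups listed above can be turned into a small header check to run before any analysis. The following is a minimal sketch, not part of the dataset: it assumes the data is distributed as a delimited text file whose column names match the spellings in this description exactly, which the listing does not confirm. The function name `missing_columns` and the example header are hypothetical, and no measurement values are invented.

```python
import csv
import io

# Column names copied from the dataset description above. Whether the
# published files use exactly these spellings is an assumption.
EXPECTED_COLUMNS = {
    "participant": ["ID participant", "Age", "Experience"],
    "task_context": ["Moment", "Turn", "Plant/Line", "Day", "Time", "LoTNum"],
    "physiological": [
        "eda_scl_usiemens",
        "pulse_rate_bpm",
        "temperature_celsius",
        "accelerometers_std_g",
        "steps_count",
        "activity_counts",
    ],
    "ergonomic": ["IndexRiskR", "IndexRiskL", "Borg Test"],
}


def missing_columns(csv_text, delimiter=","):
    """Return {group: [missing column names]} for CSV content given as text.

    Only the header row is inspected; groups with no missing columns are
    left out of the result.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = {name.strip() for name in next(reader, [])}
    report = {}
    for group, columns in EXPECTED_COLUMNS.items():
        absent = [name for name in columns if name not in header]
        if absent:
            report[group] = absent
    return report


# Hypothetical header-only input, used purely to exercise the check.
example_header = "ID participant,Age,Experience,Moment,eda_scl_usiemens\n"
report = missing_columns(example_header)
for group, absent in report.items():
    print(f"{group}: missing {', '.join(absent)}")
```

An empty result from `missing_columns` would suggest that every documented variable is present under the assumed name; any reported name is worth comparing against the actual file header before renaming or dropping columns.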
Titanic EDA
kaggle.com
zip
Updated Aug 3, 2021
+ more versions
Cite
Gourav Rohra (2021). Titanic EDA [Dataset]. https://www.kaggle.com/gouravrohra/titanic-eda
Explore at:
Available download formats: zip (58919 bytes)
Dataset updated
Aug 3, 2021
Authors
Gourav Rohra
Description
Dataset

This dataset was created by Gourav Rohra

Contents
Eda Import Data in October - Seair.co.in
seair.co.in
Updated Oct 28, 2016
+ more versions
Cite
Seair Exim (2016). Eda Import Data in October - Seair.co.in [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Oct 28, 2016
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
Cocos (Keeling) Islands, Honduras, Malawi, Guernsey, Kenya, Åland Islands, Saint Barthélemy, Myanmar, Svalbard and Jan Mayen, Central African Republic
Description
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
