10 datasets found
  1. ML-Based RUL Prediction for NPP Transformers

    • kaggle.com
    Updated Apr 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Dmitry_Menyailov
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23516597%2F11309e6c4df1437ed2aa6a8fb121daa5%2FScreenshot%202025-04-10%20at%2004.17.42.png?generation=1744233480336962&alt=media" alt="">

    Notebooks

    1. Exploratory_Data_Analysis

    https://www.kaggle.com/code/idmitri/exploratory-data-analysis

    2. RUL_Prediction_Modeling

    https://www.kaggle.com/code/idmitri/rul-prediction-modeling

    О проекте

    Силовые трансформаторы на АЭС могут эксплуатироваться дольше расчетного срока службы (25 лет), что требует усиленного мониторинга их состояния для обеспечения надежности и безопасности эксплуатации.

    Для оценки состояния трансформаторов применяется хроматографический анализ растворенных газов, который позволяет выявлять дефекты по концентрациям газов в масле и прогнозировать остаточный срок службы трансформатора (RUL). Традиционные системы мониторинга ограничиваются фиксированными пороговыми значениями концентраций, снижая точность диагностики и автоматизацию. Методы машинного обучения позволяют выявлять скрытые зависимости и повышать точность прогнозирования. Подробнее: https://habr.com/ru/articles/743682/

    Результаты

    В данном проекте проводится глубокий анализ данных (EDA) с созданием 12 групп признаков:
    - gases (концентрации газов)
    - trend (трендовые компоненты)
    - seasonal (сезонные компоненты)
    - resid (остаточные компоненты)
    - quantiles (квантили распределений)
    - volatility (волатильность концентраций)
    - range (размах значений)
    - coefficient of variation (коэффициент вариации)
    - standard deviation (стандартное отклонение)
    - skewness (асимметрия распределения)
    - kurtosis (эксцесс распределения)
    - category (категориальные признаки неисправностей)

    Использование статистических и декомпозиционных признаков позволило достичь совпадения точности силуэта распределения RUL с автоматической обработкой выбросов, что ранее требовало ручной корректировки.

    Для моделирования использованы алгоритмы машинного обучения (LightGBM, CatBoost, Extra Trees) и их ансамбль. Лучшая точность достигнута моделью LightGBM с оптимизацией гиперпараметров с помощью Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.

    Комментарий

    Код для проведения разведочного анализа данных (EDA) был разработан и протестирован локально в VSC Jupyter Notebook с использованием окружения Python 3.10.16. И на платформе Kaggle большинство графиков отображается корректно. Но некоторые сложные и комплексные визуализации (например, многомерные графики с цветовой шкалой) не адаптированы из-за ограничений среды. Несмотря на попытки оптимизировать код без существенных изменений, добиться полной совместимости не удалось. Основная проблема заключалась в конфликте версий библиотек и значительном снижении производительности — расчет занимал примерно в 10 раз больше времени по сравнению с локальной машиной MacBook M3 Pro. На Kaggle либо корректно выполнялись операции с использованием PyCaret, либо работали модели машинного обучения, но не обе части одновременно.

    Предлагается гибридный вариант работы:
    - Публикация и вывод метрик на Kaggle для визуализации результатов. - Локальный расчет и обучение моделей с использованием предварительно настроенного окружения Python 3.10.16. Для воспроизведения экспериментов подготовлена папка Codes с кодами VSC EDA, RUL и файлом libraries_for_modeling, содержащим список версий всех используемых библиотек.

    Готов ответить в комментариях на все вопросы по настройке и запуску кода. И буду признателен за советы по предотвращению подобных проблем.

  2. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb : A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces the Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
    bash
    git clone
    cd
    2. **Create and activate a virtual environment (optional but recommended):
    bash
    python -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
    3. Install the required packages:
    All dependencies are listed in the `requirements.txt` file. Install them using pip:
    bash
    pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
  3. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include: 1. Data Simulation: Creating a realistic dataset with engine performance metrics. 2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior. 3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs. 4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.

    Tools Used 1. Python: Data processing, analysis and modeling 2. Pandas & NumPy: Data manipulation 3. Scikit-Learn & XGBoost: Machine learning model training 4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated ✔ Data Simulation & Preprocessing ✔ Exploratory Data Analysis (EDA) ✔ Feature Engineering & Encoding ✔ Supervised Machine Learning (Classification) ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings 📌 Engine Temperature & Vibration Level: Strong indicators of potential failures. 📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better. 📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training. 📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced 🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge. 🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction. 🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action 🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters. 📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques. 🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  4. A

    ‘COVID-19 dataset in Japan’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    1. Context

    This is a COVID-19 dataset in Japan. This does not include the cases in Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) and Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture). - Total number of cases in Japan - The number of vaccinated people (New/experimental) - The number of cases at prefecture level - Metadata of each prefecture

    Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

    This dataset can be retrieved with CovsirPhy (Python library).

    pip install covsirphy --upgrade
    
    import covsirphy as cs
    data_loader = cs.DataLoader()
    japan_data = data_loader.japan()
    # The number of cases (Total/each province)
    clean_df = japan_data.cleaned()
    # Metadata
    meta_df = japan_data.meta()
    

    Please refer to CovsirPhy Documentation: Japan-specific dataset.

    Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

    1.1 Total number of cases in Japan

    covid_jpn_total.csv Cumulative number of cases: - PCR-tested / PCR-tested and positive - with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020) - discharged - fatal

    The number of cases: - requiring hospitalization (from 09May2020) - hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020) - requiring hospitalization, but waiting in hotels or at home (to 08May2020)

    In primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.

    Manually collected the data from Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    The number of vaccinated people: - Vaccinated_1st: the number of vaccinated persons for the first time on the date - Vaccinated_2nd: the number of vaccinated persons with the second dose on the date - Vaccinated_3rd: the number of vaccinated persons with the third dose on the date

    Data sources for vaccination: - To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績(in Japanese) - 首相官邸 新型コロナワクチンについて - From 10APr2021: Twitter: 首相官邸(新型コロナワクチン情報)

    1.2 The number of cases at prefecture level

    covid_jpn_prefecture.csv Cumulative number of cases: - PCR-tested / PCR-tested and positive - discharged - fatal

    The number of cases: - requiring hospitalization (from 09May2020) - hospitalized with severe symptoms (from 09May2020)

    Using pdf-excel converter, manually collected the data from Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.

    1.3 Metadata of each prefecture

    covid_jpn_metadata.csv - Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表 - Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

    2. Acknowledgements

    To create this dataset, edited and transformed data of the following sites was used.

    厚生労働省 Ministry of Health, Labour and Welfare, Japan:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English) 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別 感染症病床数 (CC BY)

    Wikipedia: Wikipedia

    LinkData: LinkData (Public Domain)

    Inspiration

    1. Changes in number of cases over time
    2. Percentage of patients without symptoms / mild or severe symptoms
    3. What to do next to prevent outbreak

    License and how to cite

    Kindly cite this dataset under CC BY-4.0 license as follows. - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

    --- Original source retains full ownership of the source dataset ---

  5. All Lending Club loan data

    • kaggle.com
    Updated Apr 10, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nathan George (2019). All Lending Club loan data [Dataset]. https://www.kaggle.com/wordsforthewise/lending-club/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2019
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Nathan George
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Update: I probably won't be able to update the data anymore, as LendingClub now has a scary 'TOS' popup when downloading the data. Worst case, they will ask me/Kaggle to take it down from here.

    This dataset contains the full LendingClub data available from their site. There are separate files for accepted and rejected loans. The accepted loans also include the FICO scores, which can only be downloaded when you are signed in to LendingClub and download the data.

    See the Python and R getting started kernels to get started:

    I created a git repo for the code which is used to create this data: https://github.com/nateGeorge/preprocess_lending_club_data

    Background

    I wanted an easy way to share all the lending club data with others. Unfortunately, the data on their site is fragmented into many smaller files. There is another lending club dataset on Kaggle, but it wasn't updated in years. It seems like the "Kaggle Team" is updating it now. I think it also doesn't include the full rejected loans, which are included here. It seems like the other dataset confusingly has some of the rejected loans mixed into the accepted ones. Now there are a ton of other LendingClub datasets on here too, most of which seem to have no documentation or explanation of what the data actually is.

    Content

    The definitions for the fields are on the LendingClub site, at the bottom of the page. Kaggle won't let me upload the .xlsx file for some reason since it seems to be in multiple other data repos. This file seems to be in the other main repo, but again, it's better to get it directly from the source.

    Unfortunately, there is (maybe "was" now?) a limit of 500MB for dataset files, so I had to compress the files with gzip in the Python pandas package.

    I cleaned the data a tiny bit: I removed percent symbols (%) from int_rate and revol_util columns in the accepted loans and converted those columns to floats.

    Update

    The URL column is in the dataset for completeness, as of 2018 Q2.

  6. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.

  7. h

    Electrical-engineering

    • huggingface.co
    Updated Jan 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    mod (2024). Electrical-engineering [Dataset]. https://huggingface.co/datasets/STEM-AI-mtl/Electrical-engineering
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 16, 2024
    Authors
    mod
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    To the electrical engineering community

    This dataset contains Q&A prompts about electrical engineering, Kicad's EDA software features and scripting console Python codes.

      Authors
    

    STEM.AI: stem.ai.mtl@gmail.comWilliam Harbec

  8. f

    Attention and Cognitive Workload

    • figshare.com
    csv
    Updated Jun 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Attention and Cognitive Workload [Dataset]. http://doi.org/10.6084/m9.figshare.28184417.v3
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    figshare
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Attention and Cognitive Workload1.1. Experimental designTwo standard cognitive tasks, N-Back and mental subtraction, were conducted using PsychoPy. The N-Back task is a working memory task where participants are presented with a sequence of stimuli and are required to indicate when the current stimulus matched the one from 'n' steps earlier in the sequence, with 'n' varying across different levels. To avoid any interference from reading the instructions, rest periods of 60 seconds were incorporated before, in between, and after the two main tasks, along with a 20-second rest period between the explanation of tasks and the procedure. Additionally, a 10-second rest period was introduced between the different difficulty levels of the N-Back task and between the subtraction periods. The N-Back task was divided into 4 levels, each consisting of 60 trials. The mental subtraction task involved 20 periods of 10 seconds each, during which participants were required to continuously subtract a given number from the result of the previous subtraction, all while a visual cue was displayed.In the final stage, participants engaged in a practical learning task that required them to complete a Python tutorial, which included both theoretical concepts and practical examples. During this phase of the data collection process, not only were physiological sensors used, but HCI was also tracked.1.2. Data recordingData was collected from a group of 8 volunteers (including 4 females), who were all between the ages of 20 and 27 (average age=22.9, standard deviation=2.1). Each participant was right-handed and did not report any psychological or neurological conditions. None of them were on any medication, except for contraceptive pills.The data for subject 2 do not include the 2nd part of the acquisition (python task) because the equipment stopped acquiring; subject 3 has the 1st (N-Back task and mental subtraction) and the 2nd part (python tutorial) together in the First part folder (file D1_S3_PB_description.json indicates the start and end of each task); subject 4 only has the mental subtraction task in the 1st part acquisition and in subject 8, the subtraction task data is included in the 2nd part acquisition, along with python task.1.3. Data labellingData labeling can be performed in two ways: to categorize data into cognitive workload levels and baseline, either the PB description JSON files or the task_results.csv files can be used. Separately, the labelling of data into cognitive states was carried out every 10 seconds by researchers in biomedical engineering, in which they used image captures of the participants at various instants of the experiment, response time and signals from the respiration sensor to label the subjects’ state as bored, frustrated, interested and at rest. These cognitive state labels are stored in the cognitive_states_labels.txt files located within each subject's folder.1.4. Data descriptionBiosignals include EEG, fNIRS (not converted to oxi and deoxiHb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button (PB) data. All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. For the first dataset, the biosignals folder is split into two parts: part 1 corresponds to the mental n-back and subtraction tasks, and part 2 corresponds to the Python tutorial. The PB files can be inside each part of the Biosignals folder, in case there are 2 files instead of 1.HCI features encompass keyboard, mouse, and screenshot data. A Python code snippet for extracting screenshot files from the screenshots csv file can be found below.import base64from os import mkdirfrom os.path import joinfile = '...'with open(file, 'r') as f: lines = f.readlines()for line in lines[1:]: timestamp = line.split(',')[0] code = line.split(',')[-1][:-2] imgdata = base64.b64decode(code) filename = str(timestamp) + '.jpeg' mkdir('screenshot') with open(join('screenshot', filename), 'wb') as f: f.write(imgdata)A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D1_subject-info.csv). Other complementary files include (i) description of the pushbuttons to help segment the signals (e.g., D1_S2_PB_description.json) and (ii) labelling (e.g., D1_S2_cognitive_states_labels.txt). The D1_Sx_task_results.csv files show the results for the n-back task. A result of -1 means no answer, 0 wrong answer and 1 right answer. As for difficulty, 0 corresponds to baseline or rest periods, 1 corresponds to the 0-back task, 2 to 1-back, 3 to 2-back and 4 to 3-back. In the case of the mental subtraction task, we only distinguish between rest, represented with 0, and task, represented with 1. The response time refers to the time it takes the subject to respond and the key answer was the key the subject pressed ('y' corresponding to yes if, for example, for the 0-back task, the letter shown on the screen was identical to the previous one, 'n' corresponding to no if it wasn't and 'None' if there was no response). This file also provides the information needed to segment the signals into the different tasks and baselines.
  9. App Store Mobile Games 2008 - 2019

    • kaggle.com
    Updated Sep 11, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mayank Singh (2024). App Store Mobile Games 2008 - 2019 [Dataset]. https://www.kaggle.com/datasets/mayanksinghr/app-store-mobile-games-2008-2019
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Kaggle
    Authors
    Mayank Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset contains 1 excel workbook (.xlsx) with 2 sheets.

    • Sheet 1 - App Store Games contains the mobile games launched on App Store from 2008 - 2019.
    • Sheet 2 - Data Dictionary is just the explanation of columns in data.

    This data can be used to practice EDA and some data cleaning tasks. Can be used for Data visualization using python Matplotlib and Seaborn libraries.

    I used this dataset for a Power BI project also and created a Dashboard on it. Used python inside power query to clean and convert some encoded and Unicode characters from App URL, Name, and Description columns.

    Total Columns: 16

    • App URL
    • App ID
    • Name
    • Subtitle
    • Icon URL
    • Average User Rating
    • User Rating Count
    • Price per App (USD)
    • Description
    • Developer
    • Age Rating
    • Languages
    • Size in Bytes
    • Primary Genre
    • Genres
    • Release Date
  10. Sephora Skincare Reviews

    • kaggle.com
    Updated Dec 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Melissa Monfared (2024). Sephora Skincare Reviews [Dataset]. https://www.kaggle.com/datasets/melissamonfared/sephora-skincare-reviews
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 22, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Melissa Monfared
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the original dataset was uploaded by nadyinky on Kaggle and is accessible through the following link: https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews

    In this dataset, the skincare products have been separated from other products in the reviews datasets, such as cosmetics and makeup, for use in the intended project.

    This dataset was collected via Python scraper in March 2023 BY https://www.kaggle.com/nadyinky and contains:

    information about all beauty products (over 8,000) from the Sephora online store, including product and brand names, prices, ingredients, ratings, and all features. user reviews (about 1 million on over 2,000 products) of all products from the Skincare category, including user appearances, and review ratings by other users Dataset Usage Examples: - Exploratory Data Analysis (EDA): Explore product categories, regular and discount prices, brand popularity, the impact of different characteristics on price, and ingredient trends - Sentiment Analysis: Is the emotional tone of the review positive, negative, or neutral? Which brands or products have the most positive or negative reviews? - Text Analysis: What do customers say most often in their negative and positive reviews? Do customers have any common problems with their skincare? Recommender System: Analyzing the customer's past purchase history and reviews, suggest products that are likely to be of interest to them - Data Visualization: What are the most popular brands and products? What is the distribution of prices? Which products are closest to each other in ingredients? What does the cloud of the most frequently used words look like?

  11. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
Organization logo

ML-Based RUL Prediction for NPP Transformers

EDA and ML (LightGBM) with Optuna using data-driven features for RUL

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 10, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Dmitry_Menyailov
License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23516597%2F11309e6c4df1437ed2aa6a8fb121daa5%2FScreenshot%202025-04-10%20at%2004.17.42.png?generation=1744233480336962&alt=media" alt="">

Notebooks

1. Exploratory_Data_Analysis

https://www.kaggle.com/code/idmitri/exploratory-data-analysis

2. RUL_Prediction_Modeling

https://www.kaggle.com/code/idmitri/rul-prediction-modeling

О проекте

Силовые трансформаторы на АЭС могут эксплуатироваться дольше расчетного срока службы (25 лет), что требует усиленного мониторинга их состояния для обеспечения надежности и безопасности эксплуатации.

Для оценки состояния трансформаторов применяется хроматографический анализ растворенных газов, который позволяет выявлять дефекты по концентрациям газов в масле и прогнозировать остаточный срок службы трансформатора (RUL). Традиционные системы мониторинга ограничиваются фиксированными пороговыми значениями концентраций, снижая точность диагностики и автоматизацию. Методы машинного обучения позволяют выявлять скрытые зависимости и повышать точность прогнозирования. Подробнее: https://habr.com/ru/articles/743682/

Результаты

В данном проекте проводится глубокий анализ данных (EDA) с созданием 12 групп признаков:
- gases (концентрации газов)
- trend (трендовые компоненты)
- seasonal (сезонные компоненты)
- resid (остаточные компоненты)
- quantiles (квантили распределений)
- volatility (волатильность концентраций)
- range (размах значений)
- coefficient of variation (коэффициент вариации)
- standard deviation (стандартное отклонение)
- skewness (асимметрия распределения)
- kurtosis (эксцесс распределения)
- category (категориальные признаки неисправностей)

Использование статистических и декомпозиционных признаков позволило достичь совпадения точности силуэта распределения RUL с автоматической обработкой выбросов, что ранее требовало ручной корректировки.

Для моделирования использованы алгоритмы машинного обучения (LightGBM, CatBoost, Extra Trees) и их ансамбль. Лучшая точность достигнута моделью LightGBM с оптимизацией гиперпараметров с помощью Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.

Комментарий

Код для проведения разведочного анализа данных (EDA) был разработан и протестирован локально в VSC Jupyter Notebook с использованием окружения Python 3.10.16. И на платформе Kaggle большинство графиков отображается корректно. Но некоторые сложные и комплексные визуализации (например, многомерные графики с цветовой шкалой) не адаптированы из-за ограничений среды. Несмотря на попытки оптимизировать код без существенных изменений, добиться полной совместимости не удалось. Основная проблема заключалась в конфликте версий библиотек и значительном снижении производительности — расчет занимал примерно в 10 раз больше времени по сравнению с локальной машиной MacBook M3 Pro. На Kaggle либо корректно выполнялись операции с использованием PyCaret, либо работали модели машинного обучения, но не обе части одновременно.

Предлагается гибридный вариант работы:
- Публикация и вывод метрик на Kaggle для визуализации результатов. - Локальный расчет и обучение моделей с использованием предварительно настроенного окружения Python 3.10.16. Для воспроизведения экспериментов подготовлена папка Codes с кодами VSC EDA, RUL и файлом libraries_for_modeling, содержащим список версий всех используемых библиотек.

Готов ответить в комментариях на все вопросы по настройке и запуску кода. И буду признателен за советы по предотвращению подобных проблем.

Search
Clear search
Close search
Google apps
Main menu