Tailor-made data for applying machine learning models, on which newcomers can easily perform their EDA.
The data covers the features of the four-wheelers available on the market in 1985. The task is to predict the **price of the car** using Linear Regression, PCA, SVM-R, and similar methods.
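As a starting point, here is a minimal scikit-learn sketch (the file name `automobile.csv` and the target column name `price` are assumptions; adjust them to the actual dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumed file and target column names; adjust to the actual dataset.
df = pd.read_csv("automobile.csv").dropna(subset=["price"])

# One-hot encode categorical features so LinearRegression can consume them.
X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True).fillna(0)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```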
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
https://www.kaggle.com/code/idmitri/exploratory-data-analysis
https://www.kaggle.com/code/idmitri/rul-prediction-modeling
Power transformers at nuclear power plants can be operated beyond their design service life (25 years), which calls for intensified condition monitoring to ensure reliable and safe operation.
Transformer condition is assessed with chromatographic analysis of dissolved gases (DGA), which detects defects from gas concentrations in the oil and allows the transformer's remaining useful life (RUL) to be predicted. Traditional monitoring systems rely on fixed concentration thresholds, which limits diagnostic accuracy and automation. Machine learning methods can uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/
This project performs a deep exploratory data analysis (EDA) and builds 12 groups of features (a computation sketch follows the list):
- gases (gas concentrations)
- trend (trend components)
- seasonal (seasonal components)
- resid (residual components)
- quantiles (distribution quantiles)
- volatility (concentration volatility)
- range (range of values)
- coefficient of variation
- standard deviation
- skewness (distribution asymmetry)
- kurtosis (distribution excess)
- category (categorical fault features)
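As a rough illustration of how several of these groups can be derived from a single gas-concentration series, here is a minimal sketch (the series below is synthetic and the decomposition period is illustrative; this is not the project's actual code):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for one gas-concentration series (e.g. H2, ppm),
# indexed by daily timestamps; real values come from DGA measurements.
idx = pd.date_range("2020-01-01", periods=365, freq="D")
gas = pd.Series(50 + np.linspace(0, 20, 365) + np.random.randn(365) * 2, index=idx)

decomp = seasonal_decompose(gas, model="additive", period=30)  # trend / seasonal / resid

features = {
    "trend_mean": decomp.trend.mean(),
    "q90": gas.quantile(0.90),             # quantiles
    "volatility": gas.pct_change().std(),  # volatility
    "range": gas.max() - gas.min(),        # range of values
    "coef_var": gas.std() / gas.mean(),    # coefficient of variation
    "std": gas.std(),                      # standard deviation
    "skewness": gas.skew(),                # skewness (asymmetry)
    "kurtosis": gas.kurt(),                # kurtosis (excess)
}
print(features)
```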
Using statistical and decomposition features made it possible to match the silhouette of the RUL distribution while handling outliers automatically, something that previously required manual correction.
For modeling, machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used. The best accuracy was achieved by LightGBM with hyperparameters optimized via Optuna: MAE = 61.85, RMSE = 88.21, R² = 0.8634.
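For orientation, a hedged sketch of such a LightGBM + Optuna tuning loop (on synthetic stand-in data, with illustrative hyperparameter ranges rather than the project's actual search space):

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the engineered DGA features and RUL target.
X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    params = {
        "objective": "regression",
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMRegressor(**params).fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

study = optuna.create_study(direction="minimize")  # minimize validation MAE
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```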
The EDA code was developed and tested locally in VS Code Jupyter Notebooks with a Python 3.10.16 environment. On Kaggle, most plots render correctly, but some complex visualizations (for example, multidimensional plots with a color scale) are not adapted because of platform limitations. Despite attempts to optimize the code without major changes, full compatibility could not be achieved. The main problems were library version conflicts and a significant performance drop: computation took roughly 10 times longer than on a local MacBook M3 Pro. On Kaggle, either the PyCaret operations ran correctly or the machine learning models did, but not both at once.
A hybrid workflow is therefore suggested:
- Publish the notebook and report metrics on Kaggle to visualize the results.
- Run computation and model training locally with the pre-configured Python 3.10.16 environment. To reproduce the experiments, a Codes folder is provided with the VSC EDA and RUL notebooks and a libraries_for_modeling file listing the versions of all libraries used.
I am happy to answer questions about setting up and running the code in the comments, and I would appreciate advice on preventing such problems.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 synthetic records simulating the migratory behavior of various bird species across global regions. Each entry represents a single bird tagged with a tracking device and includes detailed information such as flight distance, speed, altitude, weather conditions, tagging information, and migration outcomes.
The data was entirely synthetically generated using randomized yet realistic values based on known ranges from ornithological studies. It is ideal for practicing data analysis and visualization techniques without privacy concerns or real-world data access restrictions. Because it’s artificial, the dataset can be freely used in education, portfolio projects, demo dashboards, machine learning pipelines, or business intelligence training.
With over 40 columns, this dataset supports a wide array of analysis types. Analysts can explore questions like “Do certain species migrate in larger flocks?”, “How does weather impact nesting success?”, or “What conditions lead to migration interruptions?”. Users can also perform geospatial mapping of start and end locations, cluster birds by behavior, or build time series models based on migration months and environmental factors.
For data visualization, tools like Power BI, Python (Matplotlib/Seaborn/Plotly), or Excel can be used to create insightful dashboards and interactive charts.
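For example, a minimal pandas/seaborn sketch for the flock-size question (the file name and the `species` and `flock_size` column names are assumptions about the schema):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to the actual 40+ columns.
df = pd.read_csv("bird_migration.csv")

# "Do certain species migrate in larger flocks?" -- flock size by species
order = df.groupby("species")["flock_size"].median().sort_values().index
sns.boxplot(data=df, x="species", y="flock_size", order=order)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```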
Join the Fabric Community DataViz Contest | May 2025: https://community.fabric.microsoft.com/t5/Power-BI-Community-Blog/%EF%B8%8F-Fabric-Community-DataViz-Contest-May-2025/ba-p/4668560
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- `Data_Analysis.ipynb`: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the `eda_plots/` directory.
- `Dataset_Extension.ipynb`: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
- `Optimization_Model.ipynb`: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
- `Inference_data.csv`: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
- `Inference_data_Extended.csv`: The final, enriched dataset used for all analysis and modeling. This is the output of the `Dataset_Extension.ipynb` notebook.
- `eda_log.txt`: A text log file containing summary statistics generated during the exploratory data analysis.
- `requirements.txt`: A list of all necessary Python libraries and their versions required to run the code in this repository.
- `eda_plots/`: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
- `optimization_models_final/`: A directory where the trained and saved final model files (`.joblib`) are stored after running the optimization notebook.
- `pareto_validation_plot_fold_0.png`: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
- `shap_waterfall_final_model.png`: The SHAP plot used for the model interpretability analysis, as presented in the thesis.
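The non-domination filter behind the Pareto-optimal recommendations can be illustrated with a small standalone sketch (a generic implementation assuming all objectives are minimized, e.g. cost and energy per inference; this is not the notebook's actual code):

```python
import numpy as np

def pareto_front(points: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; every column is minimized."""
    mask = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        # Row i is dominated if some row is <= in all objectives
        # and strictly < in at least one.
        dominated_by = np.all(points <= points[i], axis=1) & np.any(points < points[i], axis=1)
        if dominated_by.any():
            mask[i] = False
    return mask

# Example with two objectives (cost, energy), lower is better for both.
pts = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 3.5], [4.0, 1.0]])
print(pts[pareto_front(pts)])  # keeps [1,4], [2,3], [4,1]; drops dominated [3,3.5]
```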
```bash
git clone <repository-url>   # placeholder: the repository URL is omitted in the source
cd <repository-name>
```

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

```bash
pip install -r requirements.txt
```
The extended dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.
The EDA plots are already provided in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.
Running the **`Optimization_Model.ipynb`** notebook will execute the entire pipeline described in the paper: the trained models are saved to the `optimization_models_final/` directory, and the final figures are written to `pareto_validation_plot_fold_0.png` and `shap_waterfall_final_model.png`.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Olympics Data Analysis project explores historical Olympic data using Exploratory Data Analysis (EDA) techniques. By leveraging Python libraries such as pandas, seaborn, and matplotlib, the project uncovers patterns in medal distribution, athlete demographics, and country-wise performance.
Key findings reveal that most medalists are aged between 20 and 30, with the USA, China, and Russia leading in total medals. Over time, female participation has increased significantly, reflecting improved gender equality in sports. Additionally, athlete characteristics such as height and weight play a crucial role in certain sports, with basketball favoring taller players and gymnastics favoring younger athletes.
The project includes interactive visualizations such as heatmaps, medal trends, and gender-wise participation charts to provide a comprehensive understanding of Olympic history and trends. The insights can help sports analysts, researchers, and enthusiasts better understand performance patterns in the Olympics.
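A small sketch of one such chart, assuming the common `athlete_events.csv` schema with `Year`, `Sex`, and `ID` columns (these names are assumptions about the underlying file):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names (athlete_events.csv with Year/Sex/ID).
df = pd.read_csv("athlete_events.csv")

# Gender-wise participation: distinct athletes per Games year.
participation = df.groupby(["Year", "Sex"])["ID"].nunique().unstack()
participation.plot(marker="o")
plt.ylabel("Distinct athletes")
plt.title("Male vs. female participation over time")
plt.tight_layout()
plt.show()
```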
https://choosealicense.com/licenses/other/
To the electrical engineering community
This dataset contains Q&A prompts about electrical engineering, KiCad's EDA software features, and Python code for its scripting console.
Authors
STEM.AI: stem.ai.mtl@gmail.com
William Harbec
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description:
This dataset provides historical stock price data for selected ticker symbols ['AAPL', 'MSFT', 'JPM', 'GS', 'AMZN', 'PG', 'KO', 'JNJ', 'XOM', 'CAT'] from January 1, 2014, to December 31, 2023. It contains the daily opening, highest, lowest, closing, adjusted closing prices, and trading volume for each trading day. These tickers represent a diverse range of sectors to allow comprehensive financial analysis.
Purpose and Use Case:
This dataset is ideal for financial analysis, market trend assessments, and investment decision-making. Analysts and researchers can use this dataset to:
- Analyze price and market trends.
- Evaluate volatility by analyzing price fluctuations and trading volume.
- Use historical price movements to forecast and predict future trends.
- Assess investment opportunities and portfolio performance.
Acknowledgments:
Data was collected using Python and Yahoo Finance. This dataset supports visualization, exploratory data analysis (EDA), and in-depth analysis to develop a predictive model for forecasting stock prices, aiming to gain insights, identify patterns, and improve prediction accuracy.
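A hedged sketch of how such a collection could be reproduced, assuming the `yfinance` package was the Yahoo Finance interface used (the original collection script is not included with the dataset):

```python
import yfinance as yf

tickers = ["AAPL", "MSFT", "JPM", "GS", "AMZN", "PG", "KO", "JNJ", "XOM", "CAT"]

# Daily OHLC, adjusted close, and volume for 2014-2023
# (yfinance treats `end` as exclusive, hence 2024-01-01).
data = yf.download(tickers, start="2014-01-01", end="2024-01-01", auto_adjust=False)

# Simple derived series: daily returns and 21-day rolling volatility.
returns = data["Adj Close"].pct_change()
volatility = returns.rolling(21).std()
print(volatility.tail())
```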
Potential Research Questions and Inspiration:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains data obtained from Goodreads, a popular website for book lovers, to gain insights into the best books of the 21st century. The data was scraped from the Best Books of the 21st Century list on Goodreads using the Beautiful Soup and Requests libraries in Python. After obtaining the data, cleaning and exploratory data analysis (EDA) were performed using Pandas, Plotly, Seaborn, and Matplotlib.
The dataset contains top books of the 21st century, spanning from the 2000s to the present day. The data is scraped from a popular book website, Goodreads. Some notable books in the dataset include the Harry Potter series, A Thousand Splendid Suns, The Kite Runner, and The Fault in Our Stars.
The dataset consists of a total of 84,033 books and comprises 15 columns.
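A minimal sketch of the scraping step (the list URL and CSS selectors are assumptions based on Goodreads' classic list markup, which may have changed since the data was collected):

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL and selectors; Goodreads' markup may differ today.
url = "https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

titles = [t.get_text(strip=True) for t in soup.select("a.bookTitle span")]
ratings = [r.get_text(strip=True) for r in soup.select("span.minirating")]
for title, rating in zip(titles[:5], ratings[:5]):
    print(title, "|", rating)
```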
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 **BCG Data Science Job Simulation | Forage**
This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.
📊 What’s Inside?
✅ Data Cleaning: Removing irrelevant columns to reduce noise
✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
✅ New Predictive Features:
- consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
- total_gas_and_elec → Aggregates total energy consumption
✅ Final Processed Dataset: Ready for churn prediction modeling
📂 Datasets Used:
📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
📌 clean_data_with_new_features.csv → Final dataset after feature engineering
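A hedged sketch of the feature-engineering step (column names such as `date_activ`, `cons_last_month`, `cons_12m`, and `cons_gas_12m` are assumptions about the simulation's schema):

```python
import pandas as pd

df = pd.read_csv("clean_data_after_eda.csv")

# Date-based features: activation year, contract length, renewal month.
# Column names below are assumptions; adjust to the actual schema.
for col in ["date_activ", "date_end", "date_renewal"]:
    df[col] = pd.to_datetime(df[col])
df["activation_year"] = df["date_activ"].dt.year
df["contract_length_days"] = (df["date_end"] - df["date_activ"]).dt.days
df["renewal_month"] = df["date_renewal"].dt.month

# New predictive features described above.
df["consumption_trend"] = df["cons_last_month"] / df["cons_12m"].replace(0, pd.NA)
df["total_gas_and_elec"] = df["cons_12m"] + df["cons_gas_12m"]

df.to_csv("clean_data_with_new_features.csv", index=False)
```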
🛠 Technologies Used:
🔹 Python (Pandas, NumPy)
🔹 Data Preprocessing & Feature Engineering
🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.
🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!
📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/
🔍 Let’s explore churn prediction insights together! 🎯