Tailor-made data for applying machine learning models, on which newcomers can easily perform their EDA.
The data covers the features of the four-wheelers available on the market in 1985. The task is to predict the **price of the car** using Linear Regression, PCA, SVM-R, and similar methods.
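As a starting point, here is a minimal scikit-learn sketch (the file name `automobile.csv` and the target column name `price` are assumptions; adjust them to the actual dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumed file and target column names; adjust to the actual dataset.
df = pd.read_csv("automobile.csv").dropna(subset=["price"])

# One-hot encode categorical features so LinearRegression can consume them.
X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True).fillna(0)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```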
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
https://www.kaggle.com/code/idmitri/exploratory-data-analysis
https://www.kaggle.com/code/idmitri/rul-prediction-modeling
Power transformers at nuclear power plants can be operated beyond their design service life (25 years), which calls for intensified condition monitoring to ensure reliable and safe operation.
Transformer condition is assessed with chromatographic analysis of dissolved gases (DGA), which detects defects from gas concentrations in the oil and allows the transformer's remaining useful life (RUL) to be predicted. Traditional monitoring systems rely on fixed concentration thresholds, which limits diagnostic accuracy and automation. Machine learning methods can uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/
This project performs a deep exploratory data analysis (EDA) and builds 12 groups of features (a computation sketch follows the list):
- gases (gas concentrations)
- trend (trend components)
- seasonal (seasonal components)
- resid (residual components)
- quantiles (distribution quantiles)
- volatility (concentration volatility)
- range (range of values)
- coefficient of variation
- standard deviation
- skewness (distribution asymmetry)
- kurtosis (distribution excess)
- category (categorical fault features)
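As a rough illustration of how several of these groups can be derived from a single gas-concentration series, here is a minimal sketch (the series below is synthetic and the decomposition period is illustrative; this is not the project's actual code):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for one gas-concentration series (e.g. H2, ppm),
# indexed by daily timestamps; real values come from DGA measurements.
idx = pd.date_range("2020-01-01", periods=365, freq="D")
gas = pd.Series(50 + np.linspace(0, 20, 365) + np.random.randn(365) * 2, index=idx)

decomp = seasonal_decompose(gas, model="additive", period=30)  # trend / seasonal / resid

features = {
    "trend_mean": decomp.trend.mean(),
    "q90": gas.quantile(0.90),             # quantiles
    "volatility": gas.pct_change().std(),  # volatility
    "range": gas.max() - gas.min(),        # range of values
    "coef_var": gas.std() / gas.mean(),    # coefficient of variation
    "std": gas.std(),                      # standard deviation
    "skewness": gas.skew(),                # skewness (asymmetry)
    "kurtosis": gas.kurt(),                # kurtosis (excess)
}
print(features)
```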
Using statistical and decomposition features made it possible to match the silhouette of the RUL distribution while handling outliers automatically, something that previously required manual correction.
For modeling, machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used. The best accuracy was achieved by LightGBM with hyperparameters optimized via Optuna: MAE = 61.85, RMSE = 88.21, R² = 0.8634.
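For orientation, a hedged sketch of such a LightGBM + Optuna tuning loop (on synthetic stand-in data, with illustrative hyperparameter ranges rather than the project's actual search space):

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the engineered DGA features and RUL target.
X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    params = {
        "objective": "regression",
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMRegressor(**params).fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

study = optuna.create_study(direction="minimize")  # minimize validation MAE
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```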
The EDA code was developed and tested locally in VS Code Jupyter Notebooks with a Python 3.10.16 environment. On Kaggle, most plots render correctly, but some complex visualizations (for example, multidimensional plots with a color scale) are not adapted because of platform limitations. Despite attempts to optimize the code without major changes, full compatibility could not be achieved. The main problems were library version conflicts and a significant performance drop: computation took roughly 10 times longer than on a local MacBook M3 Pro. On Kaggle, either the PyCaret operations ran correctly or the machine learning models did, but not both at once.
A hybrid workflow is therefore suggested:
- Publish the notebook and report metrics on Kaggle to visualize the results.
- Run computation and model training locally with the pre-configured Python 3.10.16 environment. To reproduce the experiments, a Codes folder is provided with the VSC EDA and RUL notebooks and a libraries_for_modeling file listing the versions of all libraries used.
I am happy to answer questions about setting up and running the code in the comments, and I would appreciate advice on preventing such problems.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 synthetic records simulating the migratory behavior of various bird species across global regions. Each entry represents a single bird tagged with a tracking device and includes detailed information such as flight distance, speed, altitude, weather conditions, tagging information, and migration outcomes.
The data was entirely synthetically generated using randomized yet realistic values based on known ranges from ornithological studies. It is ideal for practicing data analysis and visualization techniques without privacy concerns or real-world data access restrictions. Because it’s artificial, the dataset can be freely used in education, portfolio projects, demo dashboards, machine learning pipelines, or business intelligence training.
With over 40 columns, this dataset supports a wide array of analysis types. Analysts can explore questions like “Do certain species migrate in larger flocks?”, “How does weather impact nesting success?”, or “What conditions lead to migration interruptions?”. Users can also perform geospatial mapping of start and end locations, cluster birds by behavior, or build time series models based on migration months and environmental factors.
For data visualization, tools like Power BI, Python (Matplotlib/Seaborn/Plotly), or Excel can be used to create insightful dashboards and interactive charts.
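For example, a minimal pandas/seaborn sketch for the flock-size question (the file name and the `species` and `flock_size` column names are assumptions about the schema):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to the actual 40+ columns.
df = pd.read_csv("bird_migration.csv")

# "Do certain species migrate in larger flocks?" -- flock size by species
order = df.groupby("species")["flock_size"].median().sort_values().index
sns.boxplot(data=df, x="species", y="flock_size", order=order)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```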
Join the Fabric Community DataViz Contest | May 2025: https://community.fabric.microsoft.com/t5/Power-BI-Community-Blog/%EF%B8%8F-Fabric-Community-DataViz-Contest-May-2025/ba-p/4668560
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- `Data_Analysis.ipynb`: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the `eda_plots/` directory.
- `Dataset_Extension.ipynb`: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
- `Optimization_Model.ipynb`: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
- `Inference_data.csv`: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
- `Inference_data_Extended.csv`: The final, enriched dataset used for all analysis and modeling. This is the output of the `Dataset_Extension.ipynb` notebook.
- `eda_log.txt`: A text log file containing summary statistics generated during the exploratory data analysis.
- `requirements.txt`: A list of all necessary Python libraries and their versions required to run the code in this repository.
- `eda_plots/`: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
- `optimization_models_final/`: A directory where the trained and saved final model files (`.joblib`) are stored after running the optimization notebook.
- `pareto_validation_plot_fold_0.png`: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
- `shap_waterfall_final_model.png`: The SHAP plot used for the model interpretability analysis, as presented in the thesis.
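The non-domination filter behind the Pareto-optimal recommendations can be illustrated with a small standalone sketch (a generic implementation assuming all objectives are minimized, e.g. cost and energy per inference; this is not the notebook's actual code):

```python
import numpy as np

def pareto_front(points: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; every column is minimized."""
    mask = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        # Row i is dominated if some row is <= in all objectives
        # and strictly < in at least one.
        dominated_by = np.all(points <= points[i], axis=1) & np.any(points < points[i], axis=1)
        if dominated_by.any():
            mask[i] = False
    return mask

# Example with two objectives (cost, energy), lower is better for both.
pts = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 3.5], [4.0, 1.0]])
print(pts[pareto_front(pts)])  # keeps [1,4], [2,3], [4,1]; drops dominated [3,3.5]
```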
```bash
git clone <repository-url>   # placeholder: the repository URL is omitted in the source
cd <repository-name>
```

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

```bash
pip install -r requirements.txt
```
The extended dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.
The EDA plots are already provided in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.
Running the **`Optimization_Model.ipynb`** notebook will execute the entire pipeline described in the paper: the trained models are saved to the `optimization_models_final/` directory, and the final figures are written to `pareto_validation_plot_fold_0.png` and `shap_waterfall_final_model.png`.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Olympics Data Analysis project explores historical Olympic data using Exploratory Data Analysis (EDA) techniques. By leveraging Python libraries such as pandas, seaborn, and matplotlib, the project uncovers patterns in medal distribution, athlete demographics, and country-wise performance.
Key findings reveal that most medalists are aged between 20 and 30, with the USA, China, and Russia leading in total medals. Over time, female participation has increased significantly, reflecting improved gender equality in sports. Additionally, athlete characteristics such as height and weight play a crucial role in certain sports, with basketball favoring taller players and gymnastics favoring younger athletes.
The project includes interactive visualizations such as heatmaps, medal trends, and gender-wise participation charts to provide a comprehensive understanding of Olympic history and trends. The insights can help sports analysts, researchers, and enthusiasts better understand performance patterns in the Olympics.
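A small sketch of one such chart, assuming the common `athlete_events.csv` schema with `Year`, `Sex`, and `ID` columns (these names are assumptions about the underlying file):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names (athlete_events.csv with Year/Sex/ID).
df = pd.read_csv("athlete_events.csv")

# Gender-wise participation: distinct athletes per Games year.
participation = df.groupby(["Year", "Sex"])["ID"].nunique().unstack()
participation.plot(marker="o")
plt.ylabel("Distinct athletes")
plt.title("Male vs. female participation over time")
plt.tight_layout()
plt.show()
```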
https://choosealicense.com/licenses/other/
To the electrical engineering community
This dataset contains Q&A prompts about electrical engineering, KiCad's EDA software features, and Python code for its scripting console.
Authors
STEM.AI: stem.ai.mtl@gmail.com
William Harbec
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description:
This dataset provides historical stock price data for selected ticker symbols ['AAPL', 'MSFT', 'JPM', 'GS', 'AMZN', 'PG', 'KO', 'JNJ', 'XOM', 'CAT'] from January 1, 2014, to December 31, 2023. It contains the daily opening, highest, lowest, closing, adjusted closing prices, and trading volume for each trading day. These tickers represent a diverse range of sectors to allow comprehensive financial analysis.
Purpose and Use Case:
This dataset is ideal for financial analysis, market trend assessments, and investment decision-making. Analysts and researchers can use this dataset to:
- Analyze price and market trends.
- Evaluate volatility by analyzing price fluctuations and trading volume.
- Use historical price movements to forecast and predict future trends.
- Assess investment opportunities and portfolio performance.
Acknowledgments:
Data was collected using Python and Yahoo Finance. This dataset supports visualization, exploratory data analysis (EDA), and in-depth analysis to develop a predictive model for forecasting stock prices, aiming to gain insights, identify patterns, and improve prediction accuracy.
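A hedged sketch of how such a collection could be reproduced, assuming the `yfinance` package was the Yahoo Finance interface used (the original collection script is not included with the dataset):

```python
import yfinance as yf

tickers = ["AAPL", "MSFT", "JPM", "GS", "AMZN", "PG", "KO", "JNJ", "XOM", "CAT"]

# Daily OHLC, adjusted close, and volume for 2014-2023
# (yfinance treats `end` as exclusive, hence 2024-01-01).
data = yf.download(tickers, start="2014-01-01", end="2024-01-01", auto_adjust=False)

# Simple derived series: daily returns and 21-day rolling volatility.
returns = data["Adj Close"].pct_change()
volatility = returns.rolling(21).std()
print(volatility.tail())
```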
Potential Research Questions and Inspiration:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains data obtained from Goodreads, a popular website for book lovers, to gain insights into the best books of the 21st century. The data was scraped from the Best Books of the 21st Century list on Goodreads using the Beautiful Soup and Requests libraries in Python. After obtaining the data, cleaning and exploratory data analysis (EDA) were performed using Pandas, Plotly, Seaborn, and Matplotlib.
The dataset contains top books of the 21st century, spanning from the 2000s to the present day. The data is scraped from a popular book website, Goodreads. Some notable books in the dataset include the Harry Potter series, A Thousand Splendid Suns, The Kite Runner, and The Fault in Our Stars.
The dataset consists of a total of 84,033 books and comprises 15 columns.
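A minimal sketch of the scraping step (the list URL and CSS selectors are assumptions based on Goodreads' classic list markup, which may have changed since the data was collected):

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL and selectors; Goodreads' markup may differ today.
url = "https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

titles = [t.get_text(strip=True) for t in soup.select("a.bookTitle span")]
ratings = [r.get_text(strip=True) for r in soup.select("span.minirating")]
for title, rating in zip(titles[:5], ratings[:5]):
    print(title, "|", rating)
```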
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 **BCG Data Science Job Simulation | Forage**
This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.
📊 What’s Inside?
✅ Data Cleaning: Removing irrelevant columns to reduce noise
✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
✅ New Predictive Features:
- consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
- total_gas_and_elec → Aggregates total energy consumption
✅ Final Processed Dataset: Ready for churn prediction modeling
📂 Datasets Used:
📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
📌 clean_data_with_new_features.csv → Final dataset after feature engineering
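A hedged sketch of the feature-engineering step (column names such as `date_activ`, `cons_last_month`, `cons_12m`, and `cons_gas_12m` are assumptions about the simulation's schema):

```python
import pandas as pd

df = pd.read_csv("clean_data_after_eda.csv")

# Date-based features: activation year, contract length, renewal month.
# Column names below are assumptions; adjust to the actual schema.
for col in ["date_activ", "date_end", "date_renewal"]:
    df[col] = pd.to_datetime(df[col])
df["activation_year"] = df["date_activ"].dt.year
df["contract_length_days"] = (df["date_end"] - df["date_activ"]).dt.days
df["renewal_month"] = df["date_renewal"].dt.month

# New predictive features described above.
df["consumption_trend"] = df["cons_last_month"] / df["cons_12m"].replace(0, pd.NA)
df["total_gas_and_elec"] = df["cons_12m"] + df["cons_gas_12m"]

df.to_csv("clean_data_with_new_features.csv", index=False)
```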
🛠 Technologies Used:
🔹 Python (Pandas, NumPy)
🔹 Data Preprocessing & Feature Engineering
🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.
🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!
📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/
🔍 Let’s explore churn prediction insights together! 🎯