10 datasets found

Cleaned Auto Dataset 1985
kaggle.com
Updated Oct 3, 2021
Cite
Faisal Moiz Hussain (2021). Cleaned Auto Dataset 1985 [Dataset]. https://www.kaggle.com/datasets/faisalmoizhussain/cleaned-auto-dataset-1985/discussion
Dataset updated
Oct 3, 2021
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Faisal Moiz Hussain
Description
Context

Tailor-made data for applying machine learning models, on which newcomers can easily perform their EDA.

The data consists of all the features of the four-wheelers available on the market in 1985. The goal is to predict the **price of the car** using techniques such as Linear Regression, PCA, or SVM-R.
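As a rough illustration of the price-prediction task, the sketch below fits a one-feature ordinary least squares line in plain Python. The feature name `horsepower` and all numeric values are made up for the example and are not taken from the dataset; a real analysis would use the dataset's own columns and more than one feature.

```python
from statistics import fmean

# Made-up illustrative values; NOT taken from the dataset.
horsepower = [70, 90, 110, 150, 200]
price = [6000, 8000, 10000, 14000, 19000]


def fit_line(x, y):
    """Fit y ~ slope * x + intercept by ordinary least squares."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def predict_price(x_value, slope, intercept):
    """Predict a price from a single feature value."""
    return slope * x_value + intercept


slope, intercept = fit_line(horsepower, price)
print(predict_price(120, slope, intercept))
```

The same fit-then-predict pattern carries over to multi-feature models; only the fitting step changes.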
ML-Based RUL Prediction for NPP Transformers
kaggle.com
Updated Apr 10, 2025
Cite
Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers/code
Dataset updated
Apr 10, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Dmitry_Menyailov
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23516597%2F11309e6c4df1437ed2aa6a8fb121daa5%2FScreenshot%202025-04-10%20at%2004.17.42.png?generation=1744233480336962&alt=media

Notebooks

1. Exploratory_Data_Analysis

https://www.kaggle.com/code/idmitri/exploratory-data-analysis

2. RUL_Prediction_Modeling

https://www.kaggle.com/code/idmitri/rul-prediction-modeling

About the project

Power transformers at nuclear power plants (NPPs) may be operated beyond their design service life (25 years), which calls for closer monitoring of their condition to ensure reliable and safe operation.

Transformer condition is assessed with chromatographic dissolved gas analysis, which detects defects from gas concentrations in the oil and makes it possible to predict the transformer's remaining useful life (RUL). Traditional monitoring systems are limited to fixed concentration thresholds, which reduces diagnostic accuracy and automation. Machine learning methods can uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/

Results

This project carries out an in-depth exploratory data analysis (EDA) and builds 12 groups of features:
- gases (gas concentrations)
- trend (trend components)
- seasonal (seasonal components)
- resid (residual components)
- quantiles (distribution quantiles)
- volatility (volatility of concentrations)
- range (range of values)
- coefficient of variation
- standard deviation
- skewness (asymmetry of the distribution)
- kurtosis (tailedness of the distribution)
- category (categorical fault features)
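Several of the statistical feature groups listed above can be computed from a window of gas concentration readings. The following is a minimal sketch using only the Python standard library; the function name `window_features`, the dictionary keys, and the choice of population (rather than sample) moments are assumptions for illustration and may differ from what the project's notebooks do.

```python
import statistics


def window_features(values):
    """Summary statistics for one window of gas concentration readings."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    q25, q50, q75 = statistics.quantiles(values, n=4)
    # Central moments (population form) used for the shape features.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "range": max(values) - min(values),
        "std": std,
        "cv": std / mean if mean else float("nan"),  # coefficient of variation
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / std ** 4 - 3.0 if std else 0.0,  # excess kurtosis
    }


features = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
print(features)
```

Applying such a function over a sliding window of each gas series would yield one feature row per time step; the trend, seasonal, and residual groups would come from a separate time-series decomposition.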

Using statistical and decomposition features made it possible to accurately match the shape (silhouette) of the RUL distribution with automatic outlier handling, which previously required manual correction.

Machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used for modeling. The best accuracy was achieved by the LightGBM model with hyperparameters optimized using Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.
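For readers unfamiliar with the three reported metrics, the sketch below shows how MAE, RMSE, and R2 are conventionally computed from true and predicted values. The function name `regression_metrics` and the sample numbers are illustrative assumptions; the project's own evaluation code is not reproduced here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy values chosen only to exercise the function.
mae, rmse, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(mae, rmse, r2)
```

MAE and RMSE are in the units of the target, so the values reported above are in whatever time unit the RUL is expressed in.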

Comment

The exploratory data analysis (EDA) code was developed and tested locally in a VSC Jupyter Notebook with a Python 3.10.16 environment. Most plots also render correctly on Kaggle, but some complex visualizations (for example, multidimensional plots with a color scale) have not been adapted because of environment limitations. Despite attempts to optimize the code without substantial changes, full compatibility could not be achieved. The main problems were library version conflicts and a significant drop in performance: computation took roughly 10 times longer than on a local MacBook M3 Pro. On Kaggle, either the PyCaret operations ran correctly or the machine learning models worked, but not both at once.

A hybrid workflow is proposed:
- Publishing and displaying metrics on Kaggle to visualize the results.
- Local computation and model training in a preconfigured Python 3.10.16 environment.

To reproduce the experiments, the Codes folder contains the VSC EDA and RUL code and the libraries_for_modeling file, which lists the versions of all libraries used.

I am happy to answer any questions about setting up and running the code in the comments, and would appreciate advice on how to avoid such problems.
Replication Package for 'Data-Driven Analysis and Optimization of Machine...
zenodo.org
zip
Updated Jun 11, 2025
Cite
Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.15643706
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Joel Castaño; Joel Castaño
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.

The framework considers the trade-offs between three key objectives:

1. Performance (maximizing throughput)

2. Energy Efficiency (minimizing estimated energy per unit)

3. Cost (minimizing estimated hardware cost)
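The trade-off between these three objectives is usually expressed as a Pareto front: the set of configurations that no other configuration beats on every objective at once. The sketch below is a minimal pure-Python illustration; the dictionary keys (`throughput`, `energy`, `cost`) and the sample configurations are hypothetical and are not taken from the thesis code or data.

```python
def dominates(a, b):
    """True if a is no worse than b on all objectives and strictly better on one."""
    no_worse = (
        a["throughput"] >= b["throughput"]  # maximize
        and a["energy"] <= b["energy"]      # minimize
        and a["cost"] <= b["cost"]          # minimize
    )
    strictly_better = (
        a["throughput"] > b["throughput"]
        or a["energy"] < b["energy"]
        or a["cost"] < b["cost"]
    )
    return no_worse and strictly_better


def pareto_front(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical hardware configurations for illustration only.
configs = [
    {"name": "A", "throughput": 100, "energy": 5, "cost": 10},
    {"name": "B", "throughput": 80, "energy": 6, "cost": 12},
    {"name": "C", "throughput": 120, "energy": 8, "cost": 20},
    {"name": "D", "throughput": 60, "energy": 3, "cost": 5},
]
print([c["name"] for c in pareto_front(configs)])
```

This brute-force check is quadratic in the number of configurations, which is usually acceptable for benchmark-sized tables.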

Repository Structure

This repository is organized as follows:

Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.

Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.

Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.

Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.

Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.

eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.

requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.

eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.

optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.

pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.

shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

Requirements and Installation

To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.

1. Clone the repository:

bash

git clone

cd

2. Create and activate a virtual environment (optional but recommended):

bash

python -m venv venv

source venv/bin/activate # On Windows, use `venv\Scripts\activate`

3. Install the required packages:

All dependencies are listed in the `requirements.txt` file. Install them using pip:

bash

pip install -r requirements.txt

Step-by-Step Reproduction Workflow

The notebooks are designed to be run in a logical sequence.

Step 1: Data Enrichment (Optional)

The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

Step 2: Exploratory Data Analysis (Optional)

All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

Step 3: Main Model Training, Validation, and Recommendation

This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:

It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.

It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.

It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.

It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
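The group-aware cross-validation mentioned above keeps all rows that share a group on the same side of each split, so a model is never validated on a group it saw during training. The following is a minimal sketch of the idea in plain Python; the function name `group_k_fold`, the greedy fold assignment, and the use of a system identifier as the grouping key are assumptions, and the notebook may well use a library splitter instead.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs with no group on both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda k: len(by_group[k]), reverse=True):
        min(folds, key=len).extend(by_group[g])
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train_idx, test_idx


# Hypothetical group labels (e.g. one label per benchmarked system).
groups = ["s1", "s1", "s2", "s2", "s3", "s4", "s5", "s5", "s6", "s7"]
splits = list(group_k_fold(groups, n_splits=5))
for train_idx, test_idx in splits:
    print(test_idx)
```

The sketch assumes there are at least as many distinct groups as folds; otherwise some test sets would be empty.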
‘COVID-19 dataset in Japan’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Japan
Description
Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

--- Dataset description provided by original source is as follows ---

1. Context

This is a COVID-19 dataset in Japan. It does not include the cases on the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) or the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).
- Total number of cases in Japan
- The number of vaccinated people (New/experimental)
- The number of cases at prefecture level
- Metadata of each prefecture

Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

This dataset can be retrieved with CovsirPhy (Python library).

pip install covsirphy --upgrade

import covsirphy as cs
data_loader = cs.DataLoader()
japan_data = data_loader.japan()
# The number of cases (Total/each province)
clean_df = japan_data.cleaned()
# Metadata
meta_df = japan_data.meta()

Please refer to CovsirPhy Documentation: Japan-specific dataset.

Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

1.1 Total number of cases in Japan

covid_jpn_total.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
- discharged
- fatal

The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
- requiring hospitalization, but waiting in hotels or at home (to 08May2020)

In primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.

Manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)

The number of vaccinated people:
- Vaccinated_1st: the number of vaccinated persons for the first time on the date
- Vaccinated_2nd: the number of vaccinated persons with the second dose on the date
- Vaccinated_3rd: the number of vaccinated persons with the third dose on the date

Data sources for vaccination:
- To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績(in Japanese)
- 首相官邸新型コロナワクチンについて
- From 10Apr2021: Twitter: 首相官邸（新型コロナワクチン情報）

1.2 The number of cases at prefecture level

covid_jpn_prefecture.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- discharged
- fatal

The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with severe symptoms (from 09May2020)

Using pdf-excel converter, manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)

Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.

1.3 Metadata of each prefecture

covid_jpn_metadata.csv
- Population (Total, Male, Female): 厚生労働省厚生統計要覧（2017年度）第１－５表
- Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

Hospital_bed: With the primary data of 厚生労働省感染症指定医療機関の指定状況（平成31年4月1日現在）, 厚生労働省第二種感染症指定医療機関の指定状況（平成31年4月1日現在）, 厚生労働省医療施設動態調査（令和２年１月末概数）, 厚生労働省感染症指定医療機関について and secondary data of COVID-19 Japan 都道府県別感染症病床数,

Specific: Hospital beds of medical institutions designated for specific infectious diseases

Type-I: Hospital beds of medical institutions designated for type I infectious diseases

Type-II: Hospital beds of medical institutions designated for type II infectious diseases

Tuberculosis: Hospital beds of medical institutions designated for tuberculosis (outpatient care)

Care: long term care bed of hospitals

Total: Beds of all hospitals

Clinic_bed: With the primary data of 医療施設動態調査（令和２年１月末概数） ,

Care: long term care beds of clinics

Total: Beds of all clinics

Location: Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).

Latitude

Longitude

Admin

Capital: Prefectural capital city. Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).

Region: Region name. Data is from Wikipedia (secondary data). "Kyushu-Okinawa region" was separated into "Kyushu" and "Okinawa" by this dataset's author.

Num: Prefecture code (JIS X 0401: Hokkaido=1,...Okinawa=47). Data is from 国土交通省 GIS HP Pref code. cf. (not source) Japan Visitor: Japan Prefectures Map.

2. Acknowledgements

To create this dataset, edited and transformed data of the following sites was used.

厚生労働省 Ministry of Health, Labour and Welfare, Japan:
- 厚生労働省 HP (in Japanese)
- Ministry of Health, Labour and Welfare HP (in English)
- 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan:
- 国土交通省 HP (in Japanese)
- 国土交通省 HP (in English)
- 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

Code for Japan / COVID-19 Japan:
- Code for Japan COVID-19 Japan Dashboard (CC BY 4.0)
- COVID-19 Japan 都道府県別感染症病床数 (CC BY)

Wikipedia: Wikipedia

LinkData: LinkData (Public Domain)

Inspiration

Changes in number of cases over time

Percentage of patients without symptoms / mild or severe symptoms

What to do next to prevent outbreak

License and how to cite

Kindly cite this dataset under CC BY-4.0 license as follows.
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

--- Original source retains full ownership of the source dataset ---
Attention and Cognitive Workload
figshare.com
csv
Updated Jun 4, 2025
Cite
Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Attention and Cognitive Workload [Dataset]. http://doi.org/10.6084/m9.figshare.28184417.v3
Available download formats
csv
Unique identifier
https://doi.org/10.6084/m9.figshare.28184417.v3
Dataset updated
Jun 4, 2025
Dataset provided by
figshare
Authors
Rui Varandas; Inês Silveira; Hugo Gamboa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Attention and Cognitive Workload

1.1. Experimental design

Two standard cognitive tasks, N-Back and mental subtraction, were conducted using PsychoPy. The N-Back task is a working memory task in which participants are presented with a sequence of stimuli and must indicate when the current stimulus matches the one from 'n' steps earlier in the sequence, with 'n' varying across levels. To avoid any interference from reading the instructions, rest periods of 60 seconds were incorporated before, in between, and after the two main tasks, along with a 20-second rest period between the explanation of tasks and the procedure. Additionally, a 10-second rest period was introduced between the different difficulty levels of the N-Back task and between the subtraction periods. The N-Back task was divided into 4 levels, each consisting of 60 trials. The mental subtraction task involved 20 periods of 10 seconds each, during which participants were required to continuously subtract a given number from the result of the previous subtraction, all while a visual cue was displayed.

In the final stage, participants engaged in a practical learning task that required them to complete a Python tutorial, which included both theoretical concepts and practical examples. During this phase of the data collection process, not only were physiological sensors used, but HCI was also tracked.

1.2. Data recording

Data was collected from a group of 8 volunteers (including 4 females), who were all between the ages of 20 and 27 (average age=22.9, standard deviation=2.1). Each participant was right-handed and did not report any psychological or neurological conditions. None of them were on any medication, except for contraceptive pills.

The data for subject 2 do not include the 2nd part of the acquisition (Python task) because the equipment stopped acquiring; subject 3 has the 1st part (N-Back task and mental subtraction) and the 2nd part (Python tutorial) together in the First part folder (file D1_S3_PB_description.json indicates the start and end of each task); subject 4 only has the mental subtraction task in the 1st part acquisition; and for subject 8, the subtraction task data is included in the 2nd part acquisition, along with the Python task.

1.3. Data labelling

Data labelling can be performed in two ways. To categorize data into cognitive workload levels and baseline, either the PB description JSON files or the task_results.csv files can be used. Separately, the labelling of data into cognitive states was carried out every 10 seconds by researchers in biomedical engineering, who used image captures of the participants at various instants of the experiment, response time, and signals from the respiration sensor to label the subjects’ state as bored, frustrated, interested, or at rest. These cognitive state labels are stored in the cognitive_states_labels.txt files located within each subject's folder.

1.4. Data description

Biosignals include EEG, fNIRS (not converted to oxy- and deoxy-Hb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button (PB) data. All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. For the first dataset, the biosignals folder is split into two parts: part 1 corresponds to the mental N-Back and subtraction tasks, and part 2 corresponds to the Python tutorial. The PB files can be inside each part of the Biosignals folder, in case there are 2 files instead of 1.

HCI features encompass keyboard, mouse, and screenshot data. A Python code snippet for extracting screenshot files from the screenshots csv file can be found below.

import base64
from os import makedirs
from os.path import join

file = '...'

# Create the output folder once; exist_ok avoids an error if it already exists.
makedirs('screenshot', exist_ok=True)

with open(file, 'r') as f:
    lines = f.readlines()

# Skip the header row; each remaining row holds a timestamp and a base64-encoded image.
for line in lines[1:]:
    timestamp = line.split(',')[0]
    code = line.split(',')[-1][:-2]
    imgdata = base64.b64decode(code)
    filename = str(timestamp) + '.jpeg'
    with open(join('screenshot', filename), 'wb') as f:
        f.write(imgdata)

A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D1_subject-info.csv). Other complementary files include (i) a description of the pushbuttons to help segment the signals (e.g., D1_S2_PB_description.json) and (ii) labelling (e.g., D1_S2_cognitive_states_labels.txt).

The D1_Sx_task_results.csv files show the results for the N-Back task. A result of -1 means no answer, 0 a wrong answer, and 1 a right answer. As for difficulty, 0 corresponds to baseline or rest periods, 1 to the 0-back task, 2 to 1-back, 3 to 2-back, and 4 to 3-back. In the case of the mental subtraction task, only rest (represented with 0) and task (represented with 1) are distinguished. The response time is the time the subject takes to respond, and the key answer is the key the subject pressed ('y' for yes if, for example, in the 0-back task the letter shown on the screen was identical to the previous one, 'n' for no if it was not, and 'None' if there was no response). This file also provides the information needed to segment the signals into the different tasks and baselines.
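The result and difficulty codes described above can be mapped to readable labels before analysis. The sketch below is one possible way to do that and to compute accuracy per N-Back level; the function names, the `(result, difficulty)` row layout, and the decision to count unanswered trials as incorrect are assumptions for illustration, and the actual column names in the task_results.csv files should be checked before use.

```python
RESULT_LABELS = {-1: "no answer", 0: "wrong", 1: "right"}
NBACK_DIFFICULTY = {0: "baseline/rest", 1: "0-back", 2: "1-back", 3: "2-back", 4: "3-back"}
SUBTRACTION_DIFFICULTY = {0: "rest", 1: "task"}


def decode_row(result, difficulty, task="n-back"):
    """Translate numeric codes into labels for one trial."""
    levels = NBACK_DIFFICULTY if task == "n-back" else SUBTRACTION_DIFFICULTY
    return RESULT_LABELS[int(result)], levels[int(difficulty)]


def accuracy_by_level(rows):
    """rows: iterable of (result, difficulty) pairs from the N-Back task.

    Baseline/rest rows are skipped; unanswered trials count as incorrect
    (an assumption made for this sketch).
    """
    totals, correct = {}, {}
    for result, difficulty in rows:
        if int(difficulty) == 0:
            continue
        level = NBACK_DIFFICULTY[int(difficulty)]
        totals[level] = totals.get(level, 0) + 1
        correct[level] = correct.get(level, 0) + (int(result) == 1)
    return {level: correct[level] / totals[level] for level in totals}


# Toy rows for illustration only.
rows = [(1, 2), (0, 2), (-1, 2), (1, 1), (0, 0)]
print(decode_row(1, 3))
print(accuracy_by_level(rows))
```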
Electrical-engineering
huggingface.co
Updated Jan 16, 2024
Cite
mod (2024). Electrical-engineering [Dataset]. https://huggingface.co/datasets/STEM-AI-mtl/Electrical-engineering
Dataset updated
Jan 16, 2024
Authors
mod
License
https://choosealicense.com/licenses/other/
Description
To the electrical engineering community

This dataset contains Q&A prompts about electrical engineering, the features of KiCad's EDA software, and Python code for its scripting console.

Authors

STEM.AI: stem.ai.mtl@gmail.com
William Harbec
Olympics game data analysis
kaggle.com
Updated Mar 2, 2025
Cite
sarita (2025). Olympics game data analysis [Dataset]. https://www.kaggle.com/datasets/saritas95/olympics-game-data-analysis/suggestions
Dataset updated
Mar 2, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
sarita
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The Olympics Data Analysis project explores historical Olympic data using Exploratory Data Analysis (EDA) techniques. By leveraging Python libraries such as pandas, seaborn, and matplotlib, the project uncovers patterns in medal distribution, athlete demographics, and country-wise performance.

Key findings reveal that most medalists are between 20 and 30 years old, with the USA, China, and Russia leading in total medals. Over time, female participation has increased significantly, reflecting improved gender equality in sports. Additionally, athlete characteristics such as height and weight play a crucial role in certain sports, such as basketball (favoring taller players) and gymnastics (favoring younger athletes).

The project includes interactive visualizations such as heatmaps, medal trends, and gender-wise participation charts to provide a comprehensive understanding of Olympic history and trends. The insights can help sports analysts, researchers, and enthusiasts better understand performance patterns in the Olympics.
Historical Stock Price Dataset
kaggle.com
Updated May 16, 2024
Cite
Anita Rostami (2024). Historical Stock Price Dataset [Dataset]. https://www.kaggle.com/datasets/anitarostami/historical-stock-price-dataset
Dataset updated
May 16, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Anita Rostami
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Description:

This dataset provides historical stock price data for selected ticker symbols ['AAPL', 'MSFT', 'JPM', 'GS', 'AMZN', 'PG', 'KO', 'JNJ', 'XOM', 'CAT'] from January 1, 2014, to December 31, 2023. It contains the daily opening, highest, lowest, closing, adjusted closing prices, and trading volume for each trading day. These tickers represent a diverse range of sectors to allow comprehensive financial analysis.

Purpose and Use Case:

This dataset is ideal for financial analysis, market trend assessments, and investment decision-making. Analysts and researchers can use this dataset to:
* Analyze price and market trends.
* Evaluate volatility by analyzing price fluctuations and trading volume.
* Use historical price movements to forecast and predict future trends.
* Assess investment opportunities and portfolio performance.
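As one example of the volatility use case, the sketch below computes simple daily returns from a list of closing prices and scales their standard deviation to an annual figure. The function names, the toy prices, and the 252-trading-day convention are assumptions for illustration; whether to use closing or adjusted closing prices depends on the analysis.

```python
import math
import statistics


def daily_returns(closes):
    """Simple day-over-day returns from a list of closing prices."""
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def annualized_volatility(closes, trading_days=252):
    """Sample standard deviation of daily returns, scaled to one year."""
    returns = daily_returns(closes)
    return statistics.stdev(returns) * math.sqrt(trading_days)


# Toy prices for illustration only; not taken from the dataset.
closes = [100.0, 110.0, 99.0]
print(daily_returns(closes))
print(annualized_volatility(closes))
```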

Acknowledgments:

Data was collected using Python and Yahoo Finance. This dataset supports visualization, exploratory data analysis (EDA), and in-depth analysis to develop a predictive model for forecasting stock prices, aiming to gain insights, identify patterns, and improve prediction accuracy.

Potential Research Questions and Inspiration:

What is the correlation between stock prices and trading volume over time?

How do corporate actions and adjustments affect adjusted closing prices?

How does volatility vary across different stocks and sectors?

What key factors influence stock price dynamics, such as earnings reports, industry news, regulatory changes, or global economic trends?
BCG Data Science Simulation
kaggle.com
Updated Feb 12, 2025
Cite
PAVITR KUMAR SWAIN (2025). BCG Data Science Simulation [Dataset]. https://www.kaggle.com/datasets/pavitrkumar/bcg-data-science-simulation
Dataset updated
Feb 12, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
PAVITR KUMAR SWAIN
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
**Feature Engineering for Churn Prediction**

🚀 **BCG Data Science Job Simulation | Forage**
This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.

📊 What’s Inside?
✅ Data Cleaning: Removing irrelevant columns to reduce noise
✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
✅ New Predictive Features:
- consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
- total_gas_and_elec → Aggregates total energy consumption
✅ Final Processed Dataset: Ready for churn prediction modeling
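The date-based and derived features described above could be built along the following lines. This is a sketch under stated assumptions: the input keys (`activation_date`, `end_date`, `renewal_date`, `elec_12m`, `gas_12m`, `elec_last_month`) are hypothetical names rather than the dataset's actual columns, and defining `consumption_trend` as the sign of last-month usage minus the 12-month monthly average is only one plausible reading of the description.

```python
from datetime import date


def engineer_features(row):
    """Derive churn-model features from one customer record (hypothetical keys)."""
    activation = date.fromisoformat(row["activation_date"])
    end = date.fromisoformat(row["end_date"])
    renewal = date.fromisoformat(row["renewal_date"])

    # Compare last month's usage with the average month over the past year.
    monthly_average = row["elec_12m"] / 12.0
    difference = row["elec_last_month"] - monthly_average
    trend = (difference > 0) - (difference < 0)  # +1 rising, -1 falling, 0 flat

    return {
        "activation_year": activation.year,
        "contract_length_days": (end - activation).days,
        "renewal_month": renewal.month,
        "consumption_trend": trend,
        "total_gas_and_elec": row["elec_12m"] + row["gas_12m"],
    }


# Toy record for illustration only.
example = {
    "activation_date": "2013-06-15",
    "end_date": "2016-06-15",
    "renewal_date": "2015-06-23",
    "elec_12m": 1200.0,
    "gas_12m": 300.0,
    "elec_last_month": 150.0,
}
print(engineer_features(example))
```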

📂 Dataset Used:
📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
📌 clean_data_with_new_features.csv → Final dataset after feature engineering

🛠 Technologies Used:
🔹 Python (Pandas, NumPy)
🔹 Data Preprocessing & Feature Engineering

🌟 Why Feature Engineering?
Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.

🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!

📩 Connect with Me:
🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation
💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/

🔍 Let’s explore churn prediction insights together! 🎯
Goodreads Best 21st Century Book List
kaggle.com
Updated Apr 11, 2024
Cite
Prakash Mahatra (2024). Goodreads Best 21st Century Book List [Dataset]. https://www.kaggle.com/datasets/prakashmahatra/goodreads-best-21st-century-book-list/versions/1
Dataset updated
Apr 11, 2024
Dataset provided by
Kaggle
Authors
Prakash Mahatra
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains data obtained from Goodreads, a popular website for book lovers, to gain insights into the best books of the 21st century. The data was scraped from the Best Books of the 21st Century list on Goodreads using the Beautiful Soup and Requests libraries in Python. After obtaining the data, cleaning and exploratory data analysis (EDA) were performed using Pandas, Plotly, Seaborn, and Matplotlib.

The dataset contains top books of the 21st century, spanning from the 2000s to the present day. The data is scraped from a popular book website, Goodreads. Some notable books in the dataset include the Harry Potter series, A Thousand Splendid Suns, The Kite Runner, and The Fault in Our Stars.

The dataset consists of a total of 84,033 books and comprises 15 columns.