Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
data/sample.csv: Sample dataset file.
data/train.csv: Training dataset.
data/test.csv: Testing dataset.
scripts/preprocess.py: Script for preprocessing the dataset.
scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:
import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing:
python scripts/preprocess.py…
See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚀 **BCG Data Science Job Simulation | Forage**
This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.
📊 What’s Inside?
✅ Data Cleaning: Removing irrelevant columns to reduce noise
✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
✅ New Predictive Features:
consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
total_gas_and_elec → Aggregates total energy consumption
✅ Final Processed Dataset: Ready for churn prediction modeling
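A minimal sketch of how these two features might be computed with Pandas (column names such as cons_last_month, cons_12m, and cons_gas_12m are illustrative assumptions, not the notebook's exact schema):

import pandas as pd

# Load the dataset produced by the EDA step
df = pd.read_csv('clean_data_after_eda.csv')

# consumption_trend: last-month usage relative to the average month of the
# past year; positive values suggest increasing usage (column names assumed)
df['consumption_trend'] = df['cons_last_month'] - df['cons_12m'] / 12

# total_gas_and_elec: aggregate total energy consumption (column names assumed)
df['total_gas_and_elec'] = df['cons_12m'] + df['cons_gas_12m']

df.to_csv('clean_data_with_new_features.csv', index=False)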
📂 Dataset Used:
📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
📌 clean_data_with_new_features.csv → Final dataset after feature engineering
🛠 Technologies Used:
🔹 Python (Pandas, NumPy)
🔹 Data Preprocessing & Feature Engineering
🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.
🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!
📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/
🔍 Let’s explore churn prediction insights together! 🎯
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.
Methods
Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.
Results
XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.
Conclusion
This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
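A condensed sketch of the modeling pipeline described above: an 80/20 split, SMOTE oversampling applied to the training set only, and an XGBoost classifier scored by accuracy and ROC AUC. Synthetic data stands in for the 4,244-patient dataset, so the names and shapes here are illustrative, not the study's actual code.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic stand-in: 4,244 samples, 80 features, imbalanced classes
X, y = make_classification(n_samples=4244, n_features=80, n_informative=20,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42)

# 80% training / 20% test split, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Address class imbalance during training with SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(eval_metric='mlogloss').fit(X_res, y_res)
print('accuracy:', accuracy_score(y_test, model.predict(X_test)))
print('ROC AUC (ovr):', roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr'))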
ISC License: https://www.isc.org/downloads/software-support-policy/isc-license/
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
1. Preprocessing
We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
- cumulatively at least 10,000 characters,
- cumulatively at most 49,410 characters,
- cumulatively at least 16 posts,
- cumulatively at most 40 posts, and
- each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).
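A hypothetical Pandas sketch of these selection filters, assuming the parent Kaggle corpus is loaded with columns 'id' and 'text' (the Koppel512 function-word check is omitted for brevity):

import pandas as pd

posts = pd.read_csv('blogtext.csv')  # Kaggle Blog Authorship Corpus; filename assumed

# Drop texts shorter than 1,000 characters
posts = posts[posts['text'].str.len() >= 1000]

# Per-author cumulative character and post counts
posts['n_chars'] = posts['text'].str.len()
stats = posts.groupby('id').agg(chars=('n_chars', 'sum'), n_posts=('text', 'count'))

# Keep authors within the character and post-count bounds
eligible = stats[stats.chars.between(10_000, 49_410) & stats.n_posts.between(16, 40)]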
Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.
2. Statistics
Its creation and statistics can be found in the Jupyter Notebook.
| Split | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
|-------|-----------|---------|--------------|-----------------------------------|---------------------------------|
| Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
| Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
| Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |
3. Usage
import pandas as pd
df = pd.read_csv('blog1000.csv.gz', compression='infer')
# read in training data
train_text, train_label = zip(*df.loc[df.split=='train'][['text', 'id']].itertuples(index=False))
4. License
All the materials are licensed under the ISC License.
5. Contact
Please contact its maintainer for questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Timeseries Data Processing
This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.
Dataset
The dataset used contains time series data with the following features:
id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
datetime: Timestamp indicating the date and time of the observation.
target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/Weijie1996/load_timeseries.
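A minimal sketch of that workflow (assuming the default split name is 'train'):

from datasets import load_dataset

# Load from the Hugging Face Hub and convert to a pandas DataFrame
ds = load_dataset('Weijie1996/load_timeseries')
df = ds['train'].to_pandas()
print(df.head())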
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This study investigates the application of machine learning (ML) models in stock market forecasting, with a focus on their integration using PineScript, a domain-specific language for algorithmic trading. Leveraging diverse datasets, including historical stock prices and market sentiment data, we developed and tested various ML models such as neural networks, decision trees, and linear regression. Rigorous backtesting over multiple timeframes and market conditions allowed us to evaluate their predictive accuracy and financial performance. The neural network model demonstrated the highest accuracy, achieving a 75% success rate, significantly outperforming traditional models. Additionally, trading strategies derived from these ML models yielded a return on investment (ROI) of up to 12%, compared to an 8% benchmark index ROI. These findings underscore the transformative potential of ML in refining trading strategies, providing critical insights for financial analysts, investors, and developers. The study draws on insights from 15 peer-reviewed articles, financial datasets, and industry reports, establishing a robust foundation for future exploration of ML-driven financial forecasting.
Tools and Technologies Used
- PineScript: a scripting language integrated within the TradingView platform, PineScript was the primary tool used to develop and implement the machine learning models. Its robust features allowed for custom indicator creation, strategy backtesting, and real-time market data analysis.
- Python: Python was utilized for data preprocessing, model training, and performance evaluation. Key libraries included: Pandas
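An illustrative sketch of the Pandas preprocessing step (file and column names are assumptions, not the study's actual pipeline):

import pandas as pd

# Build simple model features from historical prices
prices = pd.read_csv('historical_prices.csv', parse_dates=['date']).set_index('date')
prices['return'] = prices['close'].pct_change()       # daily return
prices['ma_20'] = prices['close'].rolling(20).mean()  # 20-day moving average
features = prices.dropna()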
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Terrestrial vessel automatic identification system (AIS) data was collected around Ålesund, Norway in 2020, from multiple receiving stations with unsynchronized clocks. Features are 'mmsi', 'imo', 'length', 'latitude', 'longitude', 'sog', 'cog', 'true_heading', 'datetime UTC', 'navigational status', and 'message number'. Compact parquet files can be turned into data frames with Python's pandas library. Data is irregularly sampled because of the navigational status. The preprocessing script for training the machine learning models can be found here. There you will find a dozen trainable models and hundreds of datasets. Visit this website for more information about the data. If you have additional questions, our contact information is in the links below:
Luka Grgičević
Ottar Laurits Osen
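As noted in the description, the parquet files load directly into pandas; a minimal sketch (the filename is illustrative, and pyarrow or fastparquet must be installed as the parquet engine):

import pandas as pd

df = pd.read_parquet('ais_alesund_2020.parquet')  # filename assumed
df['datetime UTC'] = pd.to_datetime(df['datetime UTC'])
print(df[['mmsi', 'latitude', 'longitude', 'sog', 'cog']].head())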
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Timeseries Data Processing
This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.
Dataset
The dataset used contains time series data with the following features:
id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
datetime: Timestamp indicating the date and time of the observation.
target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves in four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script uses matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.
- The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.
- The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes.
- The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
- The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
- The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.
Steps to use the script:
1. Dataset Preparation: Organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
2. Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
3. Run the Script
4. Analyze Results
5. Fine-Tuning
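A minimal sketch of such a Patches layer, following the common Keras Vision Transformer pattern (an assumption about the script's implementation, not a copy of it):

import tensorflow as tf

class Patches(tf.keras.layers.Layer):
    """Splits a batch of images into flattened, non-overlapping patches."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding='VALID',
        )
        # Shape: (batch, num_patches, patch_size * patch_size * channels)
        return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])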
import re
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
from sklearn.tree import export_text
from sklearn.metrics import accuracy_score
1. Mount Google Drive
drive.mount('/content/drive')
2. Read the Excel file
file_path = '/content/drive/MyDrive/Colab Notebooks/AI_GACOR_Cleaned.xlsx'
data = pd.read_excel(file_path)
3. Encode the 'Hari' column
label_encoder_hari =… See the full description on the dataset page: https://huggingface.co/datasets/jellysquish/Dataset-EfficientDrivingTimeDeterminationSystem.
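The snippet is truncated here; a hypothetical sketch of how such a script typically continues, reusing the imports and the data DataFrame from above (the 'target' column name is a pure placeholder):

# Hypothetical continuation; 'target' stands in for the actual label column
label_encoder_hari = LabelEncoder()
data['Hari'] = label_encoder_hari.fit_transform(data['Hari'])

X = data.drop(columns=['target'])
y = data['target']
model = DecisionTreeClassifier(random_state=42).fit(X, y)
print(accuracy_score(y, model.predict(X)))
print(export_text(model, feature_names=list(X.columns)))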