10 datasets found
  1. warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

    Contents

    data/sample.csv: Sample dataset file.
    data/train.csv: Training dataset.
    data/test.csv: Testing dataset.
    scripts/preprocess.py: Script for preprocessing the dataset.
    scripts/analyze.py: Script for data analysis.

    Usage

    Load the dataset using pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')

    Run preprocessing:

    python scripts/preprocess.py

    … See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  2. BCG Data Science Simulation

    • kaggle.com
    Updated Feb 12, 2025
    Cite
    PAVITR KUMAR SWAIN (2025). BCG Data Science Simulation [Dataset]. https://www.kaggle.com/datasets/pavitrkumar/bcg-data-science-simulation
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    PAVITR KUMAR SWAIN
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description
    Feature Engineering for Churn Prediction

    🚀 BCG Data Science Job Simulation | Forage

    This notebook focuses on feature engineering techniques to enhance a dataset for churn prediction modeling. As part of the BCG Data Science Job Simulation, I transformed raw customer data into valuable features to improve predictive performance.

    📊 What’s Inside?
    ✅ Data Cleaning: Removing irrelevant columns to reduce noise
    ✅ Date-Based Feature Extraction: Converting raw dates into useful insights like activation year, contract length, and renewal month
    ✅ New Predictive Features:
      • consumption_trend → Measures if a customer’s last-month usage is increasing or decreasing
      • total_gas_and_elec → Aggregates total energy consumption
    ✅ Final Processed Dataset: Ready for churn prediction modeling

    📂 Dataset Used:
    📌 clean_data_after_eda.csv → Original dataset after Exploratory Data Analysis (EDA)
    📌 clean_data_with_new_features.csv → Final dataset after feature engineering
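    For illustration, the two engineered features described above could be derived with pandas roughly as follows (a minimal sketch, not the notebook's actual code; the consumption column names are assumptions, while the two file names come from the dataset list above):

      import pandas as pd

      df = pd.read_csv('clean_data_after_eda.csv')

      # consumption_trend: last-month usage relative to the monthly average
      # (positive -> increasing, negative -> decreasing); column names assumed
      df['consumption_trend'] = df['cons_last_month'] - df['cons_12m'] / 12

      # total_gas_and_elec: aggregate consumption across both utilities
      df['total_gas_and_elec'] = df['cons_gas_12m'] + df['cons_elec_12m']

      df.to_csv('clean_data_with_new_features.csv', index=False)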

    🛠 Technologies Used:
    🔹 Python (Pandas, NumPy)
    🔹 Data Preprocessing & Feature Engineering

    🌟 Why Feature Engineering? Feature engineering is one of the most critical steps in machine learning. Well-engineered features improve model accuracy and uncover deeper insights into customer behavior.

    🚀 This notebook is a great reference for anyone learning data preprocessing, feature selection, and predictive modeling in Data Science!

    📩 Connect with Me: 🔗 GitHub Repo: https://github.com/Pavitr-Swain/BCG-Data-Science-Job-Simulation 💼 LinkedIn: https://www.linkedin.com/in/pavitr-kumar-swain-ab708b227/

    🔍 Let’s explore churn prediction insights together! 🎯

  3. Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...

    • frontiersin.figshare.com
    docx
    Updated Jan 15, 2025
    Cite
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s008
    Available download formats: docx
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.

    Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.

    Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.

    Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
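    A minimal sketch of the pipeline described above (not the authors' published code; it assumes a preprocessed feature matrix X and label vector y):

      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LogisticRegression
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import accuracy_score
      from imblearn.over_sampling import SMOTE
      from xgboost import XGBClassifier

      # 80% training / 20% test split, as in the study
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

      # Oversample minority classes in the training set only
      X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

      # Compare the three model families on held-out accuracy
      for name, model in [('XGBoost', XGBClassifier()),
                          ('Random forest', RandomForestClassifier()),
                          ('Logistic regression', LogisticRegression(max_iter=1000))]:
          model.fit(X_train, y_train)
          print(name, accuracy_score(y_test, model.predict(X_test)))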

  4. Blog-1K

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Dec 21, 2022
    Cite
    Haining Wang; Haining Wang (2022). Blog-1K [Dataset]. http://doi.org/10.5281/zenodo.7455623
    Available download formats: application/gzip
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Haining Wang; Haining Wang
    License

    ISC License: https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
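    The checksum can be verified before use (a short Python snippet; the file name matches the usage example in section 3):

      import hashlib

      # Compare the downloaded archive against the published MD5
      with open('blog1000.csv.gz', 'rb') as f:
          assert hashlib.md5(f.read()).hexdigest() == '0a9e38740af9f921b6316b7f400acf06'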

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
    - cumulatively at least 10,000 characters,
    - cumulatively at most 49,410 characters,
    - cumulatively at least 16 posts,
    - cumulatively at most 40 posts, and
    - each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

    Split        # Authors   # Posts   # Characters   Avg. Characters Per Author (Std.)   Avg. Characters Per Post (Std.)
    Train        1,000       16,132    30,092,057     30,092 (5,884)                      1,865 (1,007)
    Validation   935         2,017     3,755,362      4,016 (2,269)                       1,862 (999)
    Test         924         2,017     3,732,448      4,039 (2,188)                       1,850 (936)


    3. Usage

    import pandas as pd

    # 'infer' detects gzip compression from the file extension
    df = pd.read_csv('blog1000.csv.gz', compression='infer')

    # read in the training split as parallel tuples of texts and author ids
    train_text, train_label = zip(*df.loc[df.split == 'train'][['text', 'id']].itertuples(index=False))

    4. License
    All the materials are licensed under the ISC License.


    5. Contact
    Please contact its maintainer for questions.

  5. load_timeseries

    • huggingface.co
    Updated May 22, 2015
    Cite
    Weijie Xia (2015). load_timeseries [Dataset]. https://huggingface.co/datasets/Weijie1996/load_timeseries
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 22, 2015
    Authors
    Weijie Xia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timeseries Data Processing

    This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.

    Dataset

    The dataset used contains time series data with the following features:

    id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
    datetime: Timestamp indicating the date and time of the observation.
    target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/Weijie1996/load_timeseries.
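    Loading the data with the datasets library and converting it to pandas might look like this (a sketch; the 'train' split name is an assumption):

      import pandas as pd
      from datasets import load_dataset

      ds = load_dataset('Weijie1996/load_timeseries', split='train')
      df = ds.to_pandas()

      # Parse timestamps and inspect one household's series
      df['datetime'] = pd.to_datetime(df['datetime'])
      print(df[df['id'] == 'GE_1'].head())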

  6. Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven...

    • dataverse.harvard.edu
    Updated Nov 19, 2024
    Cite
    Gautam Narla (2024). Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven Approach [Dataset]. http://doi.org/10.7910/DVN/HF0PFX
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 19, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautam Narla
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This study investigates the application of machine learning (ML) models in stock market forecasting, with a focus on their integration using PineScript, a domain-specific language for algorithmic trading. Leveraging diverse datasets, including historical stock prices and market sentiment data, we developed and tested various ML models such as neural networks, decision trees, and linear regression. Rigorous backtesting over multiple timeframes and market conditions allowed us to evaluate their predictive accuracy and financial performance. The neural network model demonstrated the highest accuracy, achieving a 75% success rate, significantly outperforming traditional models. Additionally, trading strategies derived from these ML models yielded a return on investment (ROI) of up to 12%, compared to an 8% benchmark index ROI. These findings underscore the transformative potential of ML in refining trading strategies, providing critical insights for financial analysts, investors, and developers. The study draws on insights from 15 peer-reviewed articles, financial datasets, and industry reports, establishing a robust foundation for future exploration of ML-driven financial forecasting.

    Tools and Technologies Used

    PineScript
    PineScript, a scripting language integrated within the TradingView platform, was the primary tool used to develop and implement the machine learning models. Its robust features allowed for custom indicator creation, strategy backtesting, and real-time market data analysis.

    Python
    Python was utilized for data preprocessing, model training, and performance evaluation. Key libraries included: Pandas

  7. AIS data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 26, 2023
    Cite
    Luka Grgičević (2023). AIS data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8064487
    Dataset updated
    Jun 26, 2023
    Dataset authored and provided by
    Luka Grgičević
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Terrestrial vessel automatic identification system (AIS) data was collected around Ålesund, Norway in 2020, from multiple receiving stations with unsynchronized clocks. Features are 'mmsi', 'imo', 'length', 'latitude', 'longitude', 'sog', 'cog', 'true_heading', 'datetime UTC', 'navigational status', and 'message number'. The compact parquet files can be turned into data frames with Python's pandas library. Data is irregularly sampled because of the navigational status. The preprocessing script for training the machine learning models can be found here, along with dozens of trainable models and hundreds of datasets. Visit this website for more information about the data. If you have additional questions, our contact information is in the links below:
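    Loading one of the parquet files into a DataFrame is straightforward (a sketch; the file name is a placeholder):

      import pandas as pd

      df = pd.read_parquet('ais_alesund_2020.parquet')  # placeholder name

      # Receiver clocks are unsynchronized, so order per vessel and timestamp
      df = df.sort_values(['mmsi', 'datetime UTC'])
      print(df[['mmsi', 'latitude', 'longitude', 'sog', 'cog']].head())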

    Luka Grgičević

    Ottar Laurits Osen

  8. TUDelft-Electricity-Consumption-1.0

    • huggingface.co
    Cite
    OpenSynth-Energy, TUDelft-Electricity-Consumption-1.0 [Dataset]. https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    OpenSynth-Energy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timeseries Data Processing

    This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.

    Dataset

    The dataset used contains time series data with the following features:

    id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
    datetime: Timestamp indicating the date and time of the observation.
    target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0.

  9. Apple Leaf Disease Detection Using Vision Transformer

    • zenodo.org
    text/x-python
    Updated Jun 20, 2025
    Cite
    Amreen Batool; Amreen Batool (2025). Apple Leaf Disease Detection Using Vision Transformer [Dataset]. http://doi.org/10.5281/zenodo.15702007
    Available download formats: text/x-python
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amreen Batool; Amreen Batool
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Introduction

    The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Code Explanation

    1. Importing Libraries

    • The script starts by importing necessary libraries such as matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

    2. Visualizing the Dataset

    • The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.
    • The dataset is divided into Train, Val, and Test directories, each containing subdirectories for the four classes.

    3. Data Augmentation

    • The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
    • Separate generators are created for training, validation, and test datasets.
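    A sketch of such an augmentation setup (the parameter values and directory paths are illustrative assumptions, not the script's exact settings):

      from tensorflow.keras.preprocessing.image import ImageDataGenerator

      # Augment only the training images; values here are illustrative
      train_gen = ImageDataGenerator(rescale=1./255,
                                     rotation_range=20,
                                     horizontal_flip=True)
      # Validation/test images are only rescaled, never augmented
      plain_gen = ImageDataGenerator(rescale=1./255)

      train_data = train_gen.flow_from_directory('Train', target_size=(224, 224),
                                                 class_mode='categorical')
      val_data = plain_gen.flow_from_directory('Val', target_size=(224, 224),
                                               class_mode='categorical')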

    4. Patch Visualization

    • The script defines a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
    • The script visualizes these patches for different patch sizes (32x32, 16x16, 8x8) to understand how the image is divided.
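    A common Keras implementation of such a layer looks like this (a sketch following the standard ViT recipe, not necessarily the script's exact code):

      import tensorflow as tf

      class Patches(tf.keras.layers.Layer):
          """Split a batch of images into flattened square patches."""
          def __init__(self, patch_size):
              super().__init__()
              self.patch_size = patch_size

          def call(self, images):
              batch_size = tf.shape(images)[0]
              patches = tf.image.extract_patches(
                  images=images,
                  sizes=[1, self.patch_size, self.patch_size, 1],
                  strides=[1, self.patch_size, self.patch_size, 1],
                  rates=[1, 1, 1, 1],
                  padding='VALID')
              # Flatten the spatial grid into a (batch, num_patches, patch_dim) sequence
              return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])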

    5. Model Training

    • The script defines a Vision Transformer (ViT) model using TensorFlow and Keras. The model is compiled with the Adam optimizer and categorical cross-entropy loss.
    • The model is trained for a specified number of epochs, and the training history is stored for later analysis.

    6. Model Evaluation

    • After training, the model is evaluated on the test dataset. The script generates a confusion matrix and a classification report to assess the model's performance.
    • The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.
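    The evaluation step typically boils down to a few scikit-learn and seaborn calls (a sketch; y_true, y_pred, and class_names are assumed to come from the test run):

      import matplotlib.pyplot as plt
      import seaborn as sns
      from sklearn.metrics import confusion_matrix, classification_report

      cm = confusion_matrix(y_true, y_pred)
      sns.heatmap(cm, annot=True, fmt='d',
                  xticklabels=class_names, yticklabels=class_names)
      plt.xlabel('Predicted')
      plt.ylabel('Actual')
      plt.show()
      print(classification_report(y_true, y_pred, target_names=class_names))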

    7. Visualizing Misclassified Images

    • The script includes functionality to visualize misclassified images, which helps in understanding where the model is making errors.

    8. Fine-Tuning and Learning Rate Adjustment

    • The script demonstrates how to fine-tune the model by adjusting the learning rate and re-training the model.
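    In Keras this is often done with a learning-rate callback (a sketch; the settings are assumptions):

      from tensorflow.keras.callbacks import ReduceLROnPlateau

      # Halve the learning rate when validation loss stops improving
      reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                    patience=3, min_lr=1e-6)
      # model.fit(..., callbacks=[reduce_lr])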

    Steps for Implementation

    1. Dataset Preparation

      • Ensure that the dataset is organized into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
    2. Install Required Libraries

      • Install the necessary Python libraries using pip:
        pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
    3. Run the Script

      • Execute the script in a Python environment. The script will automatically:
        • Load and preprocess the dataset.
        • Apply data augmentation.
        • Train the Vision Transformer model.
        • Evaluate the model and generate performance metrics.
    4. Analyze Results

      • Review the confusion matrix and classification report to understand the model's performance.
      • Visualize misclassified images to identify potential areas for improvement.
    5. Fine-Tuning

      • Experiment with different patch sizes, learning rates, and data augmentation techniques to improve the model's accuracy.
  10. Dataset-EfficientDrivingTimeDeterminationSystem

    • huggingface.co
    Cite
    ACHMAD AKBAR, Dataset-EfficientDrivingTimeDeterminationSystem [Dataset]. https://huggingface.co/datasets/jellysquish/Dataset-EfficientDrivingTimeDeterminationSystem
    Authors
    ACHMAD AKBAR
    Description

    import re
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import LabelEncoder
    from google.colab import drive
    from sklearn.tree import export_text
    from sklearn.metrics import accuracy_score

      1. Mount Google Drive

    drive.mount('/content/drive')

      2. Read the Excel file

    file_path = '/content/drive/MyDrive/Colab Notebooks/AI_GACOR_Cleaned.xlsx'
    data = pd.read_excel(file_path)

      3. Encode the 'Hari' column

    label_encoder_hari =… See the full description on the dataset page: https://huggingface.co/datasets/jellysquish/Dataset-EfficientDrivingTimeDeterminationSystem.

