53 datasets found
  1. Wireless Sensor Network Dataset

    • kaggle.com
    zip
    Updated Jun 19, 2024
    Cite
    Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/code
    Available download formats: zip (258458 bytes)
    Authors
    Rehan Adil Abbasi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Basic Information:

    • Number of entries: 374,661
    • Number of features: 19
    • Data types: 15 integer columns, 3 float columns, 1 object column (label)

    Column Names:

    id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

    Explore the Dataset

    First Five Rows:

    | | id | Time | Is_CH | who CH | Dist_To_CH | ADV_S | ADV_R | JOIN_S | JOIN_R | SCH_S | SCH_R | Rank | DATA_S | DATA_R | Data_Sent_To_BS | dist_CH_To_BS | send_code | Consumed Energy | label |
    | 0 | 101000 | 50 | 1 | 101000 | 0.00000 | 1 | 0 | 0 | 25 | 1 | 0 | 0 | 0 | 1200 | 48 | 0.00000 | 1 | 0.00000 | Attack |
    | 1 | 101001 | 50 | 0 | 101044 | 75.32345 | 0 | 4 | 1 | 0 | 0 | 1 | 2 | 38 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
    | 2 | 101002 | 50 | 0 | 101010 | 46.95453 | 0 | 4 | 1 | 0 | 0 | 1 | 19 | 41 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
    | 3 | 101003 | 50 | 0 | 101044 | 64.85231 | 0 | 4 | 1 | 0 | 0 | 1 | 16 | 38 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
    | 4 | 101004 | 50 | 0 | 101010 | 4.83341 | 0 | 4 | 1 | 0 | 0 | 1 | 0 | 41 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |

    Missing Values: No missing values detected in the dataset.

    Statistical Summary:

    The dataset includes various features related to network operations such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal".

    Analyze Class Distribution

    Let's analyze the distribution of the classes within the dataset.

    class_distribution = dataset['label'].value_counts()
    class_distribution

    Handle Class Imbalance

    If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

    Next Steps:

    • Identify the class distribution.
    • Apply balancing techniques if necessary.
    • Continue with data preprocessing and feature engineering.

    We will perform the class distribution analysis and balancing in the subsequent step.

    I found some duplicate rows and dropped them:

    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)

    Duplicate Handling

    • Initial duplicate count: 8,873 duplicates found.
    • Action taken: all duplicate entries were dropped.
    • Verification: 0 duplicates remain after cleaning.

    The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.

    Analyze Class Distribution

    Let's analyze the distribution of the label column to understand the balance between the classes.

    class_distribution = dataset['label'].value_counts()
    class_distribution

    Class Distribution Analysis

    The distribution of the classes within the dataset is as follows:

    • Normal: 332,040
    • Grayhole: 13,909
    • Blackhole: 10,049
    • TDMA: 6,633
    • Flooding: 3,157

    Observations

    There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
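To make the size of the imbalance concrete, the sketch below takes the five class counts reported above and derives each class's share plus an inverse-frequency weight. Class weighting is not one of the techniques the description names (it lists undersampling, oversampling, and SMOTE); it is shown here only as a lightweight alternative, using the same `total / (n_classes * count)` heuristic that scikit-learn applies for `class_weight="balanced"`.

```python
# Class counts after duplicate removal, copied from the distribution above.
counts = {
    "Normal": 332_040,
    "Grayhole": 13_909,
    "Blackhole": 10_049,
    "TDMA": 6_633,
    "Flooding": 3_157,
}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the deduplicated dataset.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (n_classes * count).
# Rare classes get weights above 1, the majority class below 1.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label in counts:
    print(f"{label:10s} share={shares[label]:.4f} weight={weights[label]:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The counts add up to 365,788, which matches the 374,661 original entries minus the 8,873 dropped duplicates.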

  2. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    zip
    Updated Nov 18, 2025
    Cite
    Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Available download formats: zip (40219983 bytes)
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

    | Class | Number of Images | Description |
    | Cats | 1,000 | Includes multiple breeds and poses |
    | Dogs | 1,000 | Covers various breeds and backgrounds |
    | Snakes | 1,000 | Includes multiple species and natural settings |

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

    | Set | Percentage | Number of Images |
    | Training | 70% | 2,100 |
    | Validation | 15% | 450 |
    | Test | 15% | 450 |
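One way to realize the recommended 70/15/15 split while keeping the three classes balanced in every subset is to split each class separately. The following is a minimal sketch under stated assumptions: the file names and the `stratified_split` helper are hypothetical, and the real archive layout may differ.

```python
import random

def stratified_split(items_by_class, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split each class's items into train/val/test with the same ratios."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in sorted(items_by_class.items()):
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * ratios[0])
        n_val = round(len(shuffled) * ratios[1])
        splits["train"] += [(x, label) for x in shuffled[:n_train]]
        splits["val"] += [(x, label) for x in shuffled[n_train:n_train + n_val]]
        # Whatever is left after train and val goes to the test set.
        splits["test"] += [(x, label) for x in shuffled[n_train + n_val:]]
    return splits

# Hypothetical file names: 1,000 images for each of the three classes.
items_by_class = {
    name: [f"{name}/img_{i:04d}.jpg" for i in range(1000)]
    for name in ("cats", "dogs", "snakes")
}
splits = stratified_split(items_by_class)
print({k: len(v) for k, v in splits.items()})
```

Splitting per class gives 700/150/150 images for each class, which sums to the 2,100/450/450 totals in the table.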

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.
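The normalization and label-encoding steps listed above can be illustrated without any framework. This is a small sketch only: the `normalize` and `one_hot` helpers are hypothetical names, and the integer mapping follows the 0 = Cat, 1 = Dog, 2 = Snake convention stated in the list.

```python
# Integer encoding taken from the preprocessing list above.
CLASS_TO_INDEX = {"Cat": 0, "Dog": 1, "Snake": 2}

def normalize(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0, 1]."""
    return [p / 255.0 for p in pixels]

def one_hot(label, num_classes=len(CLASS_TO_INDEX)):
    """Return a one-hot vector for a class name."""
    vec = [0] * num_classes
    vec[CLASS_TO_INDEX[label]] = 1
    return vec

print(normalize([0, 128, 255]))
print(one_hot("Dog"))
```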

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
  3. Different estimator’s average best validation performance for the class...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 4, 2023
    Cite
    Ramona Leenings; Nils Ralf Winter; Lucas Plagwitz; Vincent Holstein; Jan Ernsting; Kelvin Sarink; Lukas Fisch; Jakob Steenweg; Leon Kleine-Vennekate; Julian Gebker; Daniel Emden; Dominik Grotegerd; Nils Opel; Benjamin Risse; Xiaoyi Jiang; Udo Dannlowski; Tim Hahn (2023). Different estimator’s average best validation performance for the class balancing pipeline. [Dataset]. http://doi.org/10.1371/journal.pone.0254062.t002
    Available download formats: xls
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ramona Leenings; Nils Ralf Winter; Lucas Plagwitz; Vincent Holstein; Jan Ernsting; Kelvin Sarink; Lukas Fisch; Jakob Steenweg; Leon Kleine-Vennekate; Julian Gebker; Daniel Emden; Dominik Grotegerd; Nils Opel; Benjamin Risse; Xiaoyi Jiang; Udo Dannlowski; Tim Hahn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Different estimator’s average best validation performance for the class balancing pipeline.

  4. balanced-accuracy

    • huggingface.co
    Updated Oct 6, 2025
    Cite
    Yuhao Tian (2025). balanced-accuracy [Dataset]. https://huggingface.co/datasets/OliverOnHF/balanced-accuracy
    Authors
    Yuhao Tian
    Description

    Balanced Accuracy Metrics for 🤗 Evaluate

    A minimal, production-ready set of balanced accuracy metrics for imbalanced vision/NLP tasks, implemented as plain Python scripts that you can load with evaluate from a dataset-type repo on the Hugging Face Hub.

    What this is

    Three drop‑in metrics that focus on fair evaluation under class imbalance:

    balanced_accuracy.py — binary & multiclass balanced accuracy with options for sample_weight, threshold="auto" (Youden’s J), ignore_index… See the full description on the dataset page: https://huggingface.co/datasets/OliverOnHF/balanced-accuracy.
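For readers unfamiliar with the metric, balanced accuracy is commonly defined as the mean of per-class recall. The sketch below implements that textbook definition in plain Python. It is not the code from the repository described above, and it omits the options mentioned there (sample weights, automatic thresholding, ignore index).

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1
    if not support:
        raise ValueError("empty input")
    recalls = [correct[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)

# 9 negatives and 1 positive, scored against a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The example shows why the metric matters under imbalance: plain accuracy rewards the majority-only predictor, while balanced accuracy scores it no better than chance.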

  5. Waste Classfication Dataset

    • kaggle.com
    Updated Jun 15, 2025
    Cite
    Kaan Çerkez (2025). Waste Classfication Dataset [Dataset]. https://www.kaggle.com/datasets/kaanerkez/waste-classfication-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kaan Çerkez
    License

    https://cdla.io/permissive-1-0/

    Description

    Balanced Waste Classification Dataset - E-Waste & Mixed Materials

    🎯 Dataset Overview

    This dataset contains a comprehensive collection of waste images designed for training machine learning models to classify different types of waste materials, with a strong focus on electronic waste (e-waste) and mixed materials. The dataset includes 7 electronic device categories alongside traditional recyclable materials, making it ideal for modern waste management challenges where electronic devices constitute a significant portion of waste streams. The dataset has been carefully curated and balanced to ensure optimal performance for multi-category waste classification tasks using deep learning approaches.

    📊 Dataset Statistics

    • Total Classes: 17 different waste categories
    • Images per Class: 400 (balanced)
    • Total Images: 6,800
    • Image Format: RGB (3 channels)
    • Recommended Input Size: 224×224 pixels
    • Data Structure: Single balanced dataset (not pre-split)

    🗂️ Waste Categories

    The dataset includes 17 distinct waste categories covering various types of materials commonly found in waste management scenarios:

    1. Battery - Various types of batteries
    2. Cardboard - Cardboard packaging and boxes
    3. Glass - Glass containers and bottles
    4. Keyboard - Computer keyboards and input devices
    5. Metal - Metal cans and metallic waste
    6. Microwave - Microwave ovens and similar appliances
    7. Mobile - Mobile phones and smartphones
    8. Mouse - Computer mice and peripherals
    9. Organic - Biodegradable organic waste
    10. Paper - Paper products and documents
    11. PCB - Printed Circuit Boards (electronic components)
    12. Plastic - Plastic containers and packaging
    13. Player - Media players and entertainment devices
    14. Printer - Printers and printing equipment
    15. Television - TV sets and display devices
    16. Trash - General mixed waste
    17. Washing Machine - Washing machines and large appliances

    🛠️ Data Processing Pipeline

    1. Data Balancing

    • Undersampling: Applied to classes with >400 images
    • Data Augmentation: Applied to classes with <400 images
    • Target: Exactly 400 images per class for balanced training

    2. Data Augmentation Techniques

    • Rotation: ±20 degrees
    • Width/Height Shift: ±20%
    • Shear Range: 20%
    • Zoom Range: 20%
    • Horizontal Flip: Enabled
    • Fill Mode: Nearest neighbor

    3. Quality Assurance

    • Consistent image dimensions
    • Proper file format validation
    • Balanced class distribution
    • Clean data structure
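The balancing rule in the pipeline (undersample classes above 400 images, augment classes below 400) can be expressed as a small planning step. This is a sketch only: `plan_balancing` is a hypothetical helper, and the raw counts are invented for illustration because the description does not list the pre-balancing class sizes.

```python
def plan_balancing(class_counts, target=400):
    """Decide, per class, how many images to drop or to synthesize."""
    plan = {}
    for label, n in class_counts.items():
        if n > target:
            plan[label] = ("undersample", n - target)
        elif n < target:
            plan[label] = ("augment", target - n)
        else:
            plan[label] = ("keep", 0)
    return plan

# Hypothetical raw counts; the source does not give the original sizes.
raw_counts = {"Battery": 945, "Keyboard": 210, "Glass": 400}
plan = plan_balancing(raw_counts)
for label, (action, amount) in plan.items():
    print(f"{label}: {action} {amount}")
```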

    🎯 Recommended Use Cases

    Primary Applications

    • E-Waste Classification: Specialized in electronic devices (Mobile, Keyboard, Mouse, PCB, etc.)
    • Mixed Waste Sorting: Traditional recyclables (Paper, Plastic, Glass, Metal, Cardboard)
    • Smart Recycling Systems: Automated waste sorting for both organic and electronic materials
    • Environmental Monitoring: Multi-category waste identification
    • Appliance Recycling: Large appliance classification (Microwave, TV, Washing Machine)

    Special Features

    • Electronic Waste Focus: Strong representation of e-waste categories (7 out of 17 classes)
    • Diverse Material Types: From organic waste to complex electronic devices
    • Real-world Categories: Practical classification for actual waste management scenarios
    • Appliance Recognition: Specialized in identifying large household appliances

    Model Architectures

    • Convolutional Neural Networks (CNN)
    • Transfer Learning with MobileNetV2, ResNet, EfficientNet
    • Vision Transformers (ViT)
    • Custom architectures for waste classification

    📁 Dataset Structure

    balanced_waste_images/
    ├── category_1/
    │  ├── image_001.jpg
    │  ├── image_002.jpg
    │  └── ... (400 images)
    ├── category_2/
    │  ├── image_001.jpg
    │  └── ... (400 images)
    └── ... (17 categories total)
    

    Note: Dataset is not pre-split. Users need to create train/validation/test splits as needed.

    🚀 Getting Started

    Step 1: Data Splitting

    Since the dataset is not pre-split, you'll need to create train/validation/test splits:

    import splitfolders
    
    # Split dataset: 80% train, 10% val, 10% test
    splitfolders.ratio(
      input='balanced_waste_images', 
      output='split_data',
      seed=42, 
      ratio=(.8, .1, .1),
      group_prefix=None,
      move=False
    )
    

    Step 2: Data Loading & Preprocessing

    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Data generators with preprocessing
    train_datagen = ImageDataGenerator(rescale=1./255)
    val_datagen = ImageDataGenerator(rescale=1./255)
    
    train_generator = train_datagen.flow_from_directory(
      'split_data/train/',
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical'
    )
    
    val_generator = val_datagen.flow_from_director...
    
  6. Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True...

    • openml.org
    Updated Nov 17, 2022
    Cite
    Eddie Bergman (2022). Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44712
    Authors
    Eddie Bergman
    Description

    Subsampling of the dataset Amazon_employee_access (4135) with

    seed=4
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

      def subsample(
        self,
        seed: int,
        nrows_max: int = 2_000,
        ncols_max: int = 100,
        nclasses_max: int = 10,
        stratified: bool = True,
      ) -> Dataset:
        rng = np.random.default_rng(seed)
    
        x = self.x
        y = self.y
    
        # Uniformly sample
        classes = y.unique()
        if len(classes) > nclasses_max:
          vcs = y.value_counts()
          selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
          )
    
          # Select the indices where one of these classes is present
          idxs = y.index[y.isin(classes)]
          x = x.iloc[idxs]
          y = y.iloc[idxs]
    
        # Uniformly sample columns if required
        if len(x.columns) > ncols_max:
          columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
          )
          sorted_column_idxs = sorted(columns_idxs)
          selected_columns = list(x.columns[sorted_column_idxs])
          x = x[selected_columns]
        else:
          sorted_column_idxs = list(range(len(x.columns)))
    
        if len(x) > nrows_max:
          # Stratify accordingly
          target_name = y.name
          data = pd.concat((x, y), axis="columns")
          _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
          )
          x = subset.drop(target_name, axis="columns")
          y = subset[target_name]
    
        # We need to convert categorical columns to string for openml
        categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
        columns = list(x.columns)
    
        return Dataset(
          # Technically this is not the same but it's where it was derived from
          dataset=self.dataset,
          x=x,
          y=y,
          categorical_mask=categorical_mask,
          columns=columns,
        )
    
  7. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Available download formats: zip (807146485 bytes)
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

    Figure: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for:

    • Developing and testing strategies for handling imbalanced datasets.
    • Investigating the effects of class imbalance on model performance.
    • Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised in a way such that the dataset can be integrated into PyTorch ImageFolder directly. You can load the dataset in Python using popular libraries like NumPy and PyTorch.
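Because the archive is described as ImageFolder-compatible, the per-class distribution can be checked by counting files in each class subdirectory before any training. The sketch below uses only the standard library and assumes the usual ImageFolder convention of one subdirectory per class; the `count_images_per_class` helper, the file extensions, and the demo class names are assumptions, not details taken from the dataset.

```python
import tempfile
from pathlib import Path

def count_images_per_class(root, extensions=(".png", ".jpg", ".jpeg")):
    """Count image files in each class subdirectory of `root`."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in extensions
            )
    return counts

# Demo on a throwaway directory tree; class names and sizes are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name, n in {"airplane": 5, "bird": 2}.items():
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        for i in range(n):
            (class_dir / f"{i}.png").touch()
    demo_counts = count_images_per_class(tmp)

print(demo_counts)
```

Pointing `count_images_per_class` at the extracted training directory would give the counts needed to compute class weights or sampling probabilities.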

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  8. Comparison of simulated raw data (4 classes) and oversampled data, repeated...

    • plos.figshare.com
    xls
    Updated Jun 29, 2023
    Cite
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Comparison of simulated raw data (4 classes) and oversampled data, repeated 100 times. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t002
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of simulated raw data (4 classes) and oversampled data, repeated 100 times. Displayed results are mean (s.d.).

  9. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 27, 2022
    Cite
    Hossein Keshavarz; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907847
    Available download formats: zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hossein Keshavarz; Meiyappan Nagappan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
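The train/test files above follow a chronological split: training data ends in 2016 and the test sets cover later commits. The sketch below shows that idea on toy records. It is an illustration under stated assumptions, not the authors' construction script: the `commit_id` and `year` field names are hypothetical, and only `buggy` is a column name mentioned in the description.

```python
def chronological_split(commits, last_train_year=2016):
    """Split commit records into train (<= cutoff year) and test (> cutoff)."""
    train = [c for c in commits if c["year"] <= last_train_year]
    test = [c for c in commits if c["year"] > last_train_year]
    return train, test

# Hypothetical records: "commit_id" and "year" are illustrative field names.
commits = [
    {"commit_id": "a1", "year": 2005, "buggy": True},
    {"commit_id": "b2", "year": 2016, "buggy": False},
    {"commit_id": "c3", "year": 2017, "buggy": False},
    {"commit_id": "d4", "year": 2019, "buggy": True},
]
train, test = chronological_split(commits)
print([c["commit_id"] for c in train], [c["commit_id"] for c in test])
```

Splitting by time rather than at random keeps future commits out of the training set, which is the evaluation scenario the description says the unbalanced test files are meant to represent.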

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token handles GitHub API token that is necessary for requests to GitHub API. Script collector performs GitHub search. Tracing changed lines and git annotate is done in gitminer using PyDriller. Finally, gumtree applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    • https://github.com/GumTreeDiff/gumtree

    • Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324

    2. PyDriller

    • https://pydriller.readthedocs.io/en/latest/

    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911

  10. gap

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). gap [Dataset]. https://www.tensorflow.org/datasets/catalog/gap
    Description

    GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('gap', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Description

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  12. Data from: Tea Leaf Dataset

    • data.mendeley.com
    Updated Jul 22, 2025
    Cite
    Megha Gupta (2025). Tea Leaf Dataset [Dataset]. http://doi.org/10.17632/94fzcdz8gz.1
    Authors
    Megha Gupta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Tea Leaf dataset consists of two folders: Healthy leaves and Diseased leaves. The “Healthy leaves” folder contains images of leaves that are free from infection. The “Diseased leaves” folder is divided into 5 classes: Blister Blight, Brown Blight, Tea Mosquito Bug, Leaf Red Rust, and Red Spider Mite. The dataset is balanced, with 1,500 images in each class. Using Python, raw images are first resized to 256×256 pixels and then augmented using zoom, flip, rotation, shift, and shear. Images are further enhanced using a median filter.

  13. Percentage of the number of pixels of each class on Maupiti data, based on...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 29, 2023
    Cite
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Percentage of the number of pixels of each class on Maupiti data, based on expert mapping. [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t001
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Maupiti Island
    Description

    Percentage of the number of pixels of each class on Maupiti data, based on expert mapping.

  14. Hyperparameters used in Scikit-learn package in Python [56], including both...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das (2023). Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space. [Dataset]. http://doi.org/10.1371/journal.pone.0285321.t003
    Available download formats: xls
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space.

  15. cmaterdb

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). cmaterdb [Dataset]. https://www.tensorflow.org/datasets/catalog/cmaterdb
    Description

    This dataset contains images of:

    • Handwritten Bangla numerals: a balanced dataset of 6,000 Bangla numerals (32x32 RGB coloured images), with 600 images per class (per digit).
    • Handwritten Devanagari numerals: a balanced dataset of 3,000 Devanagari numerals (32x32 RGB coloured images), with 300 images per class (per digit).
    • Handwritten Telugu numerals: a balanced dataset of 3,000 Telugu numerals (32x32 RGB coloured images), with 300 images per class (per digit).

    CMATERdb is the pattern recognition database repository created at the 'Center for Microprocessor Applications for Training Education and Research' (CMATER) research lab, Jadavpur University, India.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('cmaterdb', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cmaterdb-bangla-1.0.0.png

  16. Results comparing raw Tecator data (3 classes) and oversampled with a...

    • plos.figshare.com
    xls
    Updated Jun 29, 2023
    Cite
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Results comparing raw Tecator data (3 classes) and oversampled with a 10-fold cross validation, iterated 100 times. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results comparing raw Tecator data (3 classes) and oversampled with a 10-fold cross validation, iterated 100 times. Displayed results are mean (s.d.).

  17. Google's Audioset: Reformatted

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    tsv
    Updated Sep 21, 2022
    Cite
    Bakhtin; Bakhtin (2022). Google's Audioset: Reformatted [Dataset]. http://doi.org/10.5281/zenodo.7096702
    Explore at:
    Available download formats: tsv
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bakhtin; Bakhtin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Google's AudioSet consistently reformatted
    
    During my work with Google's AudioSet (https://research.google.com/audioset/index.html)
    I encountered some problems because the Weak (https://research.google.com/audioset/download.html) and
    Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different csv formatting for the data. The labels used in the two datasets also differ (https://github.com/audioset/ontology/issues/9) and are presented in files with different formatting.
    
    This dataset reformatting aims to unify the formats of the datasets so that it is possible
    to analyse them in the same pipelines, and also make the dataset files compatible
    with psds_eval, dcase_util and sed_eval Python packages used in Audio Processing.
    
    For better formatted documentation and source code of reformatting refer to https://github.com/bakhtos/GoogleAudioSetReformatted 
    
    -Changes in dataset
    
    All files are converted to tab-separated `*.tsv` files (i.e. `csv` files with `\t`
    as a separator). All files have a header as the first line.
    
    -New fields and filenames
    
    Fields are renamed according to the following table, to be compatible with psds_eval:
    
    Old field -> New field
    YTID -> filename
    segment_id -> filename
    start_seconds -> onset
    start_time_seconds -> onset
    end_seconds -> offset
    end_time_seconds -> offset
    positive_labels -> event_label
    label -> event_label
    present -> present
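
The renaming table above can be expressed as a plain dictionary lookup. The sketch below is illustrative only and is not the code from the GoogleAudioSetReformatted repository: `FIELD_MAP` and `rename_header` are hypothetical names, and it assumes that any field missing from the table is passed through unchanged.

```python
# Old field -> new field, copied from the table above.
FIELD_MAP = {
    "YTID": "filename",
    "segment_id": "filename",
    "start_seconds": "onset",
    "start_time_seconds": "onset",
    "end_seconds": "offset",
    "end_time_seconds": "offset",
    "positive_labels": "event_label",
    "label": "event_label",
    "present": "present",
}


def rename_header(fields):
    """Return the header with every known field renamed.

    Assumption: a field that is not in the table is kept as-is.
    """
    return [FIELD_MAP.get(name.strip(), name.strip()) for name in fields]


# Headers of the Weak and Strong versions, using the old field names.
weak_header = ["YTID", "start_seconds", "end_seconds", "positive_labels"]
strong_header = ["segment_id", "start_time_seconds", "end_time_seconds", "label"]
print(rename_header(weak_header))
print(rename_header(strong_header))
```

Both headers map to the same four names, which is what lets the two versions go through the same pipeline.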
    
    For class label files, `id` is now the name for the `mid` label (e.g. `/m/09xor`)
    and `label` for the human-readable label (e.g. `Speech`). Index of label indicated
    for Weak dataset labels (`index` field in `class_labels_indices.csv`) is not used.
    
    Files are renamed according to the following table to ensure consistent naming
    of the form `audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv`:
    
    Old name -> New name
    balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
    unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
    eval_segments.csv -> audioset_weak_eval.tsv
    audioset_train_strong.tsv -> audioset_strong_train.tsv
    audioset_eval_strong.tsv -> audioset_strong_eval.tsv
    audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
    class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
    mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)
    
    -Strong dataset changes
    
    The only changes to the Strong dataset are the renaming of fields and the reordering of columns,
    so that both the Weak and Strong versions have `filename` and `event_label` as the first
    two columns.
    
    -Weak dataset changes
    
    -- Labels are given one per line, instead of comma-separated and quoted list
    
    -- To make sure that `filename` format is the same as in Strong version, the following
    format change is made:
    The value of the `start_seconds` field is converted to milliseconds and appended to the `filename` with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the format of `filename` with the Strong version and makes `end_seconds` also redundant.
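
The two Weak dataset changes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `weak_filename` and `split_weak_row` are hypothetical helper names, the label string is assumed to arrive with its surrounding quotes already stripped by the CSV reader, and the sample row is made up.

```python
def weak_filename(ytid, start_seconds):
    """Append the segment start, converted to milliseconds, to the ID."""
    start_ms = int(round(float(start_seconds) * 1000))
    return f"{ytid}_{start_ms}"


def split_weak_row(ytid, start_seconds, positive_labels):
    """Turn one Weak row into (filename, event_label) pairs, one per label.

    Assumption: `positive_labels` is the comma-separated label list with
    the surrounding quotes already removed.
    """
    filename = weak_filename(ytid, start_seconds)
    labels = [part.strip() for part in positive_labels.split(",") if part.strip()]
    return [(filename, label) for label in labels]


# Hypothetical row: the ID and the second label are invented for illustration.
for pair in split_weak_row("exampleID", "30.000", "/m/09xor,/m/0example"):
    print("\t".join(pair))
```

Because every clip is assumed to be 10 seconds long, the start offset in the filename is enough to recover the end time.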
    
    -Class labels changes
    
    Class labels from both datasets are merged into one file and given in alphabetical order of `id`s. Since the same `id`s are present in both datasets, but sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak dataset. It is possible to regenerate `class_labels.tsv` while giving priority to the Weak version of labels by calling `convert_labels(False)` from convert.py in the GitHub repository.
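
The merge rule can be illustrated with two dictionaries. The sketch below is not `convert_labels` from the repository: `merge_class_labels` is a hypothetical function, its `strong_priority` flag only mirrors the behaviour described above, and every label except `/m/09xor` / `Speech` is invented.

```python
def merge_class_labels(weak, strong, strong_priority=True):
    """Merge two {id: label} mappings into (id, label) pairs sorted by id."""
    if strong_priority:
        merged = {**weak, **strong}  # later mapping wins: Strong overwrites Weak
    else:
        merged = {**strong, **weak}  # Weak overwrites Strong
    return sorted(merged.items())


# Hypothetical label sets that disagree on one shared id.
weak = {"/m/09xor": "Speech", "/m/0aaa": "Weak name"}
strong = {"/m/0aaa": "Strong name", "/m/0bbb": "Strong only"}

for class_id, label in merge_class_labels(weak, strong):
    print(f"{class_id}\t{label}")
```

Passing `strong_priority=False` keeps the Weak label for the shared id instead.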
    
    -License
    
    Google's AudioSet was published in two stages - first the Weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)
    
    Both the original dataset and this reworked version are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
    

    Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.

  18. Data from: Powerful significance testing for unbalanced clusters

    • tandf.figshare.com
    json
    Updated Feb 24, 2025
    Cite
    Thomas H. Keefe; J. S. Marron (2025). Powerful significance testing for unbalanced clusters [Dataset]. http://doi.org/10.6084/m9.figshare.28473706.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Thomas H. Keefe; J. S. Marron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is “are the clusters really there?” One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.

  19. Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold...

    • plos.figshare.com
    xls
    Updated Jun 29, 2023
    Cite
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold cross validation. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Maupiti Island
    Description

    Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold cross validation. Displayed results are mean (s.d.).

  20. imagenet2012_fewshot

    • tensorflow.org
    Updated Dec 10, 2022
    Cite
    (2022). imagenet2012_fewshot [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_fewshot
    Explore at:
    Dataset updated
    Dec 10, 2022
    Description

    Imagenet2012Fewshot is a subset of the original ImageNet ILSVRC 2012 dataset. It shares the same validation set as the original ImageNet ILSVRC 2012 dataset, but the training set is subsampled in a label-balanced fashion. The 5shot configuration samples 5 images per label (5000 images in total), and the 10shot configuration samples 10 images per label (10000 images in total).
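
Label-balanced subsampling means drawing the same number of examples from every label. The sketch below shows the idea on toy data; it is an assumption about the general technique, not the procedure tensorflow_datasets actually uses to build this dataset, and `sample_k_per_label` is a hypothetical helper.

```python
import random
from collections import defaultdict


def sample_k_per_label(examples, k, seed=0):
    """Pick k examples per label from (item, label) pairs.

    Assumption: every label has at least k examples; random.sample raises
    ValueError otherwise.
    """
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        for item in rng.sample(by_label[label], k):
            subset.append((item, label))
    return subset


# Toy data: 3 labels with 20 items each.
toy = [(f"img_{label}_{i}", label) for label in range(3) for i in range(20)]
five_shot = sample_k_per_label(toy, k=5)
print(len(five_shot))  # 3 labels * 5 items per label
```

With 5 images per label giving 5000 images, the description implies 1000 labels, so the same loop would return 5000 pairs on the full training set.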

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('imagenet2012_fewshot', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_fewshot-1shot-5.0.1.png

Cite
Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/code

Wireless Sensor Network Dataset


Explore at:
Available download formats: zip (258458 bytes)
Dataset updated
Jun 19, 2024
Authors
Rehan Adil Abbasi
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Basic Information:

Number of entries: 374,661
Number of features: 19

Data Types:

15 integer columns
3 float columns
1 object column (label)

Column Names:

id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

Explore the Dataset

First Five Rows:

   id      Time  Is_CH  who CH  Dist_To_CH  ADV_S  ADV_R  JOIN_S  JOIN_R  SCH_S  SCH_R  Rank  DATA_S  DATA_R  Data_Sent_To_BS  dist_CH_To_BS  send_code  Consumed Energy  label
0  101000  50    1      101000  0.00000     1      0      0       25      1      0      0     0       1200    48               0.00000        1          0.00000          Attack
1  101001  50    0      101044  75.32345    0      4      1       0       0      1      2     38      0       0                0.00000        1          0.09797          Normal
2  101002  50    0      101010  46.95453    0      4      1       0       0      1      19    41      0       0                0.00000        1          0.09797          Normal
3  101003  50    0      101044  64.85231    0      4      1       0       0      1      16    38      0       0                0.00000        1          0.09797          Normal
4  101004  50    0      101010  4.83341     0      4      1       0       0      1      0     41      0       0                0.00000        1          0.09797          Normal

Missing Values: No missing values detected in the dataset.

Statistical Summary:

The dataset includes various features related to network operations, such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal".

Analyze Class Distribution

Let's analyze the distribution of the classes within the dataset.

    class_distribution = dataset['label'].value_counts()
    class_distribution

Handle Class Imbalance

If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

Next Steps:

- Identify the class distribution.
- Apply balancing techniques if necessary.
- Continue with data preprocessing and feature engineering.

We will perform the class distribution analysis and balancing in the subsequent step.

I found some duplicate rows and dropped them:

    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)

ChatGPT

Duplicate Handling

Initial Duplicate Count
Duplicates found: 8,873

Action Taken
Duplicates dropped: The dataset has been cleaned by removing all duplicate entries.

Verification
Duplicates after cleaning: 0

The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.

Analyze Class Distribution

Let's analyze the distribution of the label column to understand the balance between the classes.

    class_distribution = dataset['label'].value_counts()
    class_distribution

I'll compute this now.

Class Distribution Analysis

The distribution of the classes within the dataset is as follows:

Normal: 332,040
Grayhole: 13,909
Blackhole: 10,049
TDMA: 6,633
Flooding: 3,157

Observations

There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
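
To put a number on the imbalance, the reported counts can be turned into class shares and majority-to-class ratios. The sketch below is a small illustration that uses only the figures quoted above; the variable names are arbitrary, and it does not touch the dataset itself. It also checks that the five counts add up to the original 374,661 rows minus the 8,873 dropped duplicates.

```python
# Counts reported above, after the duplicates were dropped.
class_counts = {
    "Normal": 332_040,
    "Grayhole": 13_909,
    "Blackhole": 10_049,
    "TDMA": 6_633,
    "Flooding": 3_157,
}

total = sum(class_counts.values())
# Consistency check: original row count minus the dropped duplicates.
assert total == 374_661 - 8_873

majority = max(class_counts.values())
for name, count in class_counts.items():
    share = 100 * count / total  # percentage of all rows in this class
    ratio = majority / count     # majority-class rows per row of this class
    print(f"{name:<10} {count:>8,} {share:6.2f}%  1:{ratio:.1f}")
```

Ratios like these are one way to decide how aggressively to undersample, oversample, or apply SMOTE.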
