53 datasets found

Wireless Sensor Network Dataset
kaggle.com
zip
Updated Jun 19, 2024
Cite
Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/code
Explore at:
Available download formats: zip (258458 bytes)
Dataset updated
Jun 19, 2024
Authors
Rehan Adil Abbasi
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Basic Information:

Number of entries: 374,661
Number of features: 19

Data Types:

15 integer columns
3 float columns
1 object column (label)

Column Names:

id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

Explore the Dataset

First Five Rows:

id Time Is_CH who CH Dist_To_CH ADV_S ADV_R JOIN_S JOIN_R SCH_S SCH_R Rank DATA_S DATA_R Data_Sent_To_BS dist_CH_To_BS send_code Consumed Energy label
0 101000 50 1 101000 0.00000 1 0 0 25 1 0 0 0 1200 48 0.00000 1 0.00000 Attack
1 101001 50 0 101044 75.32345 0 4 1 0 0 1 2 38 0 0 0.00000 1 0.09797 Normal
2 101002 50 0 101010 46.95453 0 4 1 0 0 1 19 41 0 0 0.00000 1 0.09797 Normal
3 101003 50 0 101044 64.85231 0 4 1 0 0 1 16 38 0 0 0.00000 1 0.09797 Normal
4 101004 50 0 101010 4.83341 0 4 1 0 0 1 0 41 0 0 0.00000 1 0.09797 Normal

Missing Values: No missing values detected in the dataset.

Statistical Summary:

The dataset includes various features related to network operations such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal".

Analyze Class Distribution

Let's analyze the distribution of the classes within the dataset.

    class_distribution = dataset['label'].value_counts()
    class_distribution

Handle Class Imbalance

If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

Next Steps:

Identify the class distribution.
Apply balancing techniques if necessary.
Continue with data preprocessing and feature engineering.

We will perform the class distribution analysis and balancing in the subsequent step.

Duplicate Handling

Duplicate rows were found and dropped:

    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)

Duplicates found: 8,873
Duplicates after cleaning: 0

The dataset is now free of duplicate entries, so the analysis can proceed to the class distribution and any steps needed to handle class imbalance.

Class Distribution Analysis

The distribution of the label column after duplicate removal is as follows:

Normal: 332,040
Grayhole: 13,909
Blackhole: 10,049
TDMA: 6,633
Flooding: 3,157

Observations

There is a significant class imbalance: the majority class "Normal" has 332,040 instances, while the minority classes "Grayhole", "Blackhole", "TDMA", and "Flooding" together account for 33,748.
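
As a rough illustration of what this imbalance implies, the sketch below derives the imbalance ratio and inverse-frequency class weights from the reported counts. The weighting formula, total / (n_classes * count), is one common convention and is an assumption here; the dataset description does not prescribe a particular balancing method.

```python
# Class counts as reported in the description (after duplicate removal).
counts = {
    "Normal": 332_040,
    "Grayhole": 13_909,
    "Blackhole": 10_049,
    "TDMA": 6_633,
    "Flooding": 3_157,
}

total = sum(counts.values())
n_classes = len(counts)

# Imbalance ratio: size of the largest class divided by size of the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency weights (assumed convention): total / (n_classes * count).
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

print(f"total rows: {total}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
for label, weight in class_weights.items():
    print(f"{label}: share={counts[label] / total:.3%}, weight={weight:.3f}")
```

Weights of this kind can be passed to a loss function or classifier as an alternative to resampling; which option works better depends on the model and is not something the counts alone can settle.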
Animals (Cats, Dogs, and Snakes)
kaggle.com
zip
Updated Nov 18, 2025
Cite
Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
Explore at:
Available download formats: zip (40219983 bytes)
Dataset updated
Nov 18, 2025
Authors
Omar Rehan
Description
Cats, Dogs, and Snakes Dataset

Dataset Overview

The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

Class Number of Images Description
Cats 1,000 Includes multiple breeds and poses
Dogs 1,000 Covers various breeds and backgrounds
Snakes 1,000 Includes multiple species and natural settings

Total Images: 3,000

Image Properties:

Resolution: 224×224 pixels (resized for consistency)

Color Mode: RGB

Format: JPEG/PNG

Cleaned: Duplicate, blurry, and irrelevant images removed

Data Split Recommendation

Set Percentage Number of Images
Training 70% 2,100
Validation 15% 450
Test 15% 450
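
One way to realize the recommended 70/15/15 split while keeping the three classes balanced in every subset is a stratified split. The following is a minimal sketch using only the standard library; the function name `stratified_split` and the stand-in label list are illustrative assumptions, not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Stand-in labels matching the table: 1,000 images per class.
labels = ["cat"] * 1000 + ["dog"] * 1000 + ["snake"] * 1000
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 2100 450 450
```

Splitting per class rather than over the whole index range guarantees that each subset contains exactly the same number of cats, dogs, and snakes.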

Preprocessing

Images in the dataset have been standardized to support machine learning pipelines:

Resizing to 224×224 pixels.

Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.

Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Path to dataset
    dataset_path = "path/to/dataset"

    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
        rescale=1./255,
        validation_split=0.15  # 15% for validation
    )

    # Load training data
    train_generator = datagen.flow_from_directory(
        dataset_path,
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='training',
        shuffle=True
    )

    # Load validation data
    validation_generator = datagen.flow_from_directory(
        dataset_path,
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='validation',
        shuffle=False
    )

    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape)  # (32, 224, 224, 3) (32, 3)

Key Features

Balanced: Equal number of samples per class reduces bias.

Cleaned: High-quality, relevant images improve model performance.

Diverse: Covers multiple breeds, species, and environments to ensure generalization.

Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
Different estimator’s average best validation performance for the class...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 4, 2023
Cite
Ramona Leenings; Nils Ralf Winter; Lucas Plagwitz; Vincent Holstein; Jan Ernsting; Kelvin Sarink; Lukas Fisch; Jakob Steenweg; Leon Kleine-Vennekate; Julian Gebker; Daniel Emden; Dominik Grotegerd; Nils Opel; Benjamin Risse; Xiaoyi Jiang; Udo Dannlowski; Tim Hahn (2023). Different estimator’s average best validation performance for the class balancing pipeline. [Dataset]. http://doi.org/10.1371/journal.pone.0254062.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254062.t002
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ramona Leenings; Nils Ralf Winter; Lucas Plagwitz; Vincent Holstein; Jan Ernsting; Kelvin Sarink; Lukas Fisch; Jakob Steenweg; Leon Kleine-Vennekate; Julian Gebker; Daniel Emden; Dominik Grotegerd; Nils Opel; Benjamin Risse; Xiaoyi Jiang; Udo Dannlowski; Tim Hahn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Different estimator’s average best validation performance for the class balancing pipeline.
balanced-accuracy
huggingface.co
Updated Oct 6, 2025
Cite
Yuhao Tian (2025). balanced-accuracy [Dataset]. https://huggingface.co/datasets/OliverOnHF/balanced-accuracy
Explore at:
Dataset updated
Oct 6, 2025
Authors
Yuhao Tian
Description
Balanced Accuracy Metrics for 🤗 Evaluate

A minimal, production-ready set of balanced accuracy metrics for imbalanced vision/NLP tasks, implemented as plain Python scripts that you can load with evaluate from a dataset-type repo on the Hugging Face Hub.

What this is Three drop‑in metrics that focus on fair evaluation under class imbalance:

balanced_accuracy.py — binary & multiclass balanced accuracy with options for sample_weight, threshold="auto" (Youden’s J), ignore_index… See the full description on the dataset page: https://huggingface.co/datasets/OliverOnHF/balanced-accuracy.
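
For reference, balanced accuracy is commonly defined as the mean of per-class recall. The sketch below is a plain-Python illustration of that definition only. It is not the code from the repository above and does not implement its sample_weight, threshold, or ignore_index options; the function name and toy labels are assumptions for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Toy example: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the majority-class predictor with 0.9, whereas balanced accuracy scores it at 0.5, the same as chance for two classes.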
Waste Classfication Dataset
kaggle.com
Updated Jun 15, 2025
Cite
Kaan Çerkez (2025). Waste Classfication Dataset [Dataset]. https://www.kaggle.com/datasets/kaanerkez/waste-classfication-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 15, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kaan Çerkez
License
https://cdla.io/permissive-1-0/
Description
Balanced Waste Classification Dataset - E-Waste & Mixed Materials

🎯 Dataset Overview

This dataset contains a comprehensive collection of waste images designed for training machine learning models to classify different types of waste materials, with a strong focus on electronic waste (e-waste) and mixed materials. The dataset includes 7 electronic device categories alongside traditional recyclable materials, making it ideal for modern waste management challenges where electronic devices constitute a significant portion of waste streams. The dataset has been carefully curated and balanced to ensure optimal performance for multi-category waste classification tasks using deep learning approaches.

📊 Dataset Statistics

Total Classes: 17 different waste categories

Images per Class: 400 (balanced)

Total Images: 6,800

Image Format: RGB (3 channels)

Recommended Input Size: 224×224 pixels

Data Structure: Single balanced dataset (not pre-split)

🗂️ Waste Categories

The dataset includes 17 distinct waste categories covering various types of materials commonly found in waste management scenarios:

Battery - Various types of batteries

Cardboard - Cardboard packaging and boxes

Glass - Glass containers and bottles

Keyboard - Computer keyboards and input devices

Metal - Metal cans and metallic waste

Microwave - Microwave ovens and similar appliances

Mobile - Mobile phones and smartphones

Mouse - Computer mice and peripherals

Organic - Biodegradable organic waste

Paper - Paper products and documents

PCB - Printed Circuit Boards (electronic components)

Plastic - Plastic containers and packaging

Player - Media players and entertainment devices

Printer - Printers and printing equipment

Television - TV sets and display devices

Trash - General mixed waste

Washing Machine - Washing machines and large appliances

🛠️ Data Processing Pipeline

1. Data Balancing

Undersampling: Applied to classes with >400 images

Data Augmentation: Applied to classes with <400 images

Target: Exactly 400 images per class for balanced training
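
The balancing rule above can be sketched as a per-class decision: sample down when a class has more than 400 images, otherwise keep everything and record how many augmented images are still needed. This is a minimal standard-library sketch; the helper name `balance_class` and the raw class sizes are hypothetical, since the description does not publish the original per-class counts.

```python
import random

TARGET = 400  # images per class, as stated in the description


def balance_class(image_paths, target=TARGET, seed=42):
    """Return (kept_paths, n_augmented_needed) for a single class."""
    rng = random.Random(seed)
    if len(image_paths) > target:
        # Undersample: keep a random subset of exactly `target` images.
        return rng.sample(image_paths, target), 0
    # Keep everything; the shortfall has to be filled with augmented copies.
    return list(image_paths), target - len(image_paths)


# Hypothetical raw class sizes, for illustration only.
raw = {
    "Plastic": [f"plastic_{i}.jpg" for i in range(650)],
    "PCB": [f"pcb_{i}.jpg" for i in range(280)],
}
for name, paths in raw.items():
    kept, to_generate = balance_class(paths)
    print(name, len(kept), to_generate)
```

The number returned as `to_generate` would then be produced by an augmentation step such as the one described in the next section.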

2. Data Augmentation Techniques

Rotation: ±20 degrees

Width/Height Shift: ±20%

Shear Range: 20%

Zoom Range: 20%

Horizontal Flip: Enabled

Fill Mode: Nearest neighbor

3. Quality Assurance

Consistent image dimensions

Proper file format validation

Balanced class distribution

Clean data structure

🎯 Recommended Use Cases

Primary Applications

E-Waste Classification: Specialized in electronic devices (Mobile, Keyboard, Mouse, PCB, etc.)

Mixed Waste Sorting: Traditional recyclables (Paper, Plastic, Glass, Metal, Cardboard)

Smart Recycling Systems: Automated waste sorting for both organic and electronic materials

Environmental Monitoring: Multi-category waste identification

Appliance Recycling: Large appliance classification (Microwave, TV, Washing Machine)

Special Features

Electronic Waste Focus: Strong representation of e-waste categories (7 out of 17 classes)

Diverse Material Types: From organic waste to complex electronic devices

Real-world Categories: Practical classification for actual waste management scenarios

Appliance Recognition: Specialized in identifying large household appliances

Model Architectures

Convolutional Neural Networks (CNN)

Transfer Learning with MobileNetV2, ResNet, EfficientNet

Vision Transformers (ViT)

Custom architectures for waste classification

📁 Dataset Structure

balanced_waste_images/
├── category_1/
│   ├── image_001.jpg
│   ├── image_002.jpg
│   └── ... (400 images)
├── category_2/
│   ├── image_001.jpg
│   └── ... (400 images)
└── ... (17 categories total)

Note: Dataset is not pre-split. Users need to create train/validation/test splits as needed.

🚀 Getting Started

Step 1: Data Splitting

Since the dataset is not pre-split, you'll need to create train/validation/test splits:

    import splitfolders

    # Split dataset: 80% train, 10% val, 10% test
    splitfolders.ratio(
        input='balanced_waste_images',
        output='split_data',
        seed=42,
        ratio=(.8, .1, .1),
        group_prefix=None,
        move=False
    )

Step 2: Data Loading & Preprocessing

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Data generators with preprocessing
    train_datagen = ImageDataGenerator(rescale=1./255)
    val_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        'split_data/train/',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical'
    )

    val_generator = val_datagen.flow_from_director...

Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True...

openml.org

Updated Nov 17, 2022

Cite

Eddie Bergman (2022). Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44712

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Nov 17, 2022

Authors

Eddie Bergman

Description

Subsampling of the dataset Amazon_employee_access (4135) with

seed=4
args.nrows=2000
args.ncols=100
args.nclasses=10
args.no_stratify=True

Generated with the following source code:

  def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
  ) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
      vcs = y.value_counts()
      selected_classes = rng.choice(
        classes,
        size=nclasses_max,
        replace=False,
        p=vcs / sum(vcs),
      )

      # Select the indices where one of these classes is present
      idxs = y.index[y.isin(classes)]
      x = x.iloc[idxs]
      y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
      columns_idxs = rng.choice(
        list(range(len(x.columns))), size=ncols_max, replace=False
      )
      sorted_column_idxs = sorted(columns_idxs)
      selected_columns = list(x.columns[sorted_column_idxs])
      x = x[selected_columns]
    else:
      sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
      # Stratify accordingly
      target_name = y.name
      data = pd.concat((x, y), axis="columns")
      _, subset = train_test_split(
        data,
        test_size=nrows_max,
        stratify=data[target_name],
        shuffle=True,
        random_state=seed,
      )
      x = subset.drop(target_name, axis="columns")
      y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
      # Technically this is not the same but it's where it was derived from
      dataset=self.dataset,
      x=x,
      y=y,
      categorical_mask=categorical_mask,
      columns=columns,
    )

Imbalanced Cifar-10
kaggle.com
zip
Updated Jun 17, 2023
Cite
Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
Explore at:
Available download formats: zip (807146485 bytes)
Dataset updated
Jun 17, 2023
Authors
Akhil Theerthala
Description
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

Class distribution figure: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media

The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

This dataset is beneficial for:

- Developing and testing strategies for handling imbalanced datasets.
- Investigating the effects of class imbalance on model performance.
- Comparing different machine learning algorithms' performance under class imbalance.

Usage Information:

The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised in a way such that the dataset can be integrated into PyTorch ImageFolder directly. You can load the dataset in Python using popular libraries like NumPy and PyTorch.

License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
Comparison of simulated raw data (4 classes) and oversampled data, repeated...
plos.figshare.com
xls
Updated Jun 29, 2023
Cite
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Comparison of simulated raw data (4 classes) and oversampled data, repeated 100 times. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0287705.t002
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS ONE
Authors
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of simulated raw data (4 classes) and oversampled data, repeated 100 times. Displayed results are mean (s.d.).
Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 27, 2022
Cite
Hossein Keshavarz; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907847
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5907847
Dataset updated
Jan 27, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hossein Keshavarz; Meiyappan Nagappan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

The datasets are available under directory dataset. There are 4 datasets in this directory.

apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.

apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).

apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.

apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
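
The train/test design described above (train on earlier years, test on the most recent years, balance only the training set) can be sketched as follows. This is a minimal illustration under stated assumptions: the `year` field, the cutoff parameter, and the toy records are hypothetical, and random undersampling is only one possible way to balance a set; the description does not say how the authors balanced apachejit_train.csv. Only the `buggy` column name comes from the description.

```python
import random


def chronological_split(commits, last_train_year=2016):
    """Split commits into (train, test) by year, with no shuffling across time."""
    train = [c for c in commits if c["year"] <= last_train_year]
    test = [c for c in commits if c["year"] > last_train_year]
    return train, test


def undersample_to_balance(rows, label_key="buggy", seed=0):
    """Randomly drop rows of the larger class until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)


# Toy records: `year` is a hypothetical field, `buggy` follows the description.
commits = [
    {"commit_id": f"c{i}", "year": 2003 + i % 17, "buggy": i % 4 == 0}
    for i in range(340)
]
train, test = chronological_split(commits)
balanced_train = undersample_to_balance(train)
print(len(train), len(test), len(balanced_train))
```

Leaving the test set unbalanced, as the description recommends, keeps the evaluation close to the class ratio a deployed model would actually encounter.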

In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token handles GitHub API token that is necessary for requests to GitHub API. Script collector performs GitHub search. Tracing changed lines and git annotate is done in gitminer using PyDriller. Finally, gumtree applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

1. GumTree

https://github.com/GumTreeDiff/gumtree

Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324.

2. PyDriller

https://pydriller.readthedocs.io/en/latest/

Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.
gap
tensorflow.org
Updated Dec 22, 2022
Cite
(2022). gap [Dataset]. https://www.tensorflow.org/datasets/catalog/gap
Explore at:
Dataset updated
Dec 22, 2022
Description
GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.

To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('gap', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.
ag_news_subset
tensorflow.org
Updated Dec 6, 2022
Cite
(2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
Explore at:
Unique identifier
https://identifiers.org/arxiv:1509.01626
Dataset updated
Dec 6, 2022
Description
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.
Data from: Tea Leaf Dataset
data.mendeley.com
Updated Jul 22, 2025
Cite
Megha Gupta (2025). Tea Leaf Dataset [Dataset]. http://doi.org/10.17632/94fzcdz8gz.1
Explore at:
Unique identifier
https://doi.org/10.17632/94fzcdz8gz.1
Dataset updated
Jul 22, 2025
Authors
Megha Gupta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Tea Leaf dataset consists of two folders: Healthy leaves and Diseased leaves. The "Healthy leaves" folder contains images of leaves that are free from infection. The "Diseased leaves" folder is divided into 5 classes: Blister Blight, Brown Blight, Tea Mosquito Bug, Leaf Red Rust and Red Spider Mite. The classes are balanced, with 1500 images in each class. Using Python, raw images are first resized to 256*256 pixels and then augmented using zoom, flip, rotation, shift and shear. Images are further enhanced using a median filter.
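
To make the final enhancement step concrete, the sketch below shows what a median filter does to a small grayscale image: each pixel is replaced by the median of its neighbourhood, which suppresses isolated outliers. The 3x3 window, the border handling, and the function name are assumptions for illustration; the description does not state the kernel size or the library the author used.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Apply a (2*radius+1) x (2*radius+1) median filter to a 2D list of pixels.

    Border pixels use only the neighbours that fall inside the image.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns an actual pixel value from the window.
            out[y][x] = median_low(window)
    return out


# A flat image with a single bright outlier pixel.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
print(median_filter(noisy))
```

In practice an image library routine would be used instead of nested Python loops; the sketch is only meant to show the operation itself.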
Percentage of the number of pixels of each class on Maupiti data, based on...
figshare.com
plos.figshare.com
xls
Updated Jun 29, 2023
Cite
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Percentage of the number of pixels of each class on Maupiti data, based on expert mapping. [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0287705.t001
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS ONE
Authors
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Maupiti Island
Description
Percentage of the number of pixels of each class on Maupiti data, based on expert mapping.
Hyperparameters used in Scikit-learn package in Python [56], including both...
plos.figshare.com
xls
Updated Jun 2, 2023
Cite
Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das (2023). Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space. [Dataset]. http://doi.org/10.1371/journal.pone.0285321.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0285321.t003
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Hyperparameters used in Scikit-learn package in Python [56], including both the default and customized values yielding robust classification on both the 15D and 7D feature space.
cmaterdb
tensorflow.org
Updated Jun 1, 2024
Cite
(2024). cmaterdb [Dataset]. https://www.tensorflow.org/datasets/catalog/cmaterdb
Explore at:
Dataset updated
Jun 1, 2024
Description
This dataset contains images of:

Handwritten Bangla numerals - balanced dataset of total 6000 Bangla numerals (32x32 RGB coloured, 6000 images), each having 600 images per class (per digit).

Handwritten Devanagari numerals - balanced dataset of total 3000 Devanagari numerals (32x32 RGB coloured, 3000 images), each having 300 images per class (per digit).

Handwritten Telugu numerals - balanced dataset of total 3000 Telugu numerals (32x32 RGB coloured, 3000 images), each having 300 images per class (per digit).

CMATERdb is the pattern recognition database repository created at the 'Center for Microprocessor Applications for Training Education and Research' (CMATER) research lab, Jadavpur University, India.

To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('cmaterdb', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cmaterdb-bangla-1.0.0.png
Results comparing raw Tecator data (3 classes) and oversampled with a...
plos.figshare.com
xls
Updated Jun 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Results comparing raw Tecator data (3 classes) and oversampled with a 10-fold cross validation, iterated 100 times. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0287705.t004
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results comparing raw Tecator data (3 classes) and oversampled with a 10-fold cross validation, iterated 100 times. Displayed results are mean (s.d.).

Google's Audioset: Reformatted

zenodo.org
data.niaid.nih.gov

tsv

Updated Sep 21, 2022

Cite

Bakhtin (2022). Google's Audioset: Reformatted [Dataset]. http://doi.org/10.5281/zenodo.7096702

Explore at:

Available download formats: tsv

Unique identifier

https://doi.org/10.5281/zenodo.7096702

Dataset updated

Sep 21, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Bakhtin

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Google's AudioSet consistently reformatted

While working with Google's AudioSet (https://research.google.com/audioset/index.html), I ran into problems because the Weak (https://research.google.com/audioset/download.html) and Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different CSV formatting. The labels used in the two datasets also differ (https://github.com/audioset/ontology/issues/9) and are provided in files with different formatting.

This reformatting aims to unify the formats of the two datasets so that they can be analysed in the same pipelines, and to make the dataset files compatible with the psds_eval, dcase_util and sed_eval Python packages used in audio processing.

For better-formatted documentation and the source code of the reformatting, refer to https://github.com/bakhtos/GoogleAudioSetReformatted

-Changes in dataset

All files are converted to tab-separated `*.tsv` files (i.e. `csv` files with `\t`
as a separator). All files have a header as the first line.

-New fields and filenames

Fields are renamed according to the following table, to be compatible with psds_eval:

Old field -> New field
YTID -> filename
segment_id -> filename
start_seconds -> onset
start_time_seconds -> onset
end_seconds -> offset
end_time_seconds -> offset
positive_labels -> event_label
label -> event_label
present -> present
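
As a rough illustration of the renaming table above, the sketch below rewrites a CSV header and emits tab-separated output using only the Python standard library. The names `FIELD_RENAMES`, `rename_header` and `csv_to_tsv` are hypothetical and are not taken from the reformatting repository, and the sample row uses a made-up ID. The actual conversion also reorders columns and reshapes the Weak rows, which this sketch does not attempt.

```python
import csv
import io

# Old field name -> new field name, copied from the table above.
FIELD_RENAMES = {
    "YTID": "filename",
    "segment_id": "filename",
    "start_seconds": "onset",
    "start_time_seconds": "onset",
    "end_seconds": "offset",
    "end_time_seconds": "offset",
    "positive_labels": "event_label",
    "label": "event_label",
    "present": "present",
}


def rename_header(header):
    """Map each old field name to its new name; unknown names pass through."""
    return [FIELD_RENAMES.get(name.strip(), name.strip()) for name in header]


def csv_to_tsv(csv_text):
    """Convert comma-separated text with a header row to tab-separated text."""
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(rename_header(rows[0]))
    writer.writerows(rows[1:])
    return out.getvalue()


# Made-up sample in the style of a Weak segments file (hypothetical ID "abc").
sample = (
    "YTID, start_seconds, end_seconds, positive_labels\n"
    'abc, 30.0, 40.0, "/m/09x0r,/t/dd00003"\n'
)
print(csv_to_tsv(sample))
```

Keeping the mapping in a single dictionary means both the Weak and Strong headers can be passed through the same function.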

For class label files, `id` is now the name of the field holding the `mid` label (e.g. `/m/09x0r`)
and `label` is the field holding the human-readable label (e.g. `Speech`). The label index given
for Weak dataset labels (the `index` field in `class_labels_indices.csv`) is not used.

Files are renamed according to the following table to ensure consistent naming
of the form `audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv`:

Old name -> New name
balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
eval_segments.csv -> audioset_weak_eval.tsv
audioset_train_strong.tsv -> audioset_strong_train.tsv
audioset_eval_strong.tsv -> audioset_strong_eval.tsv
audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)

-Strong dataset changes

The only changes to the Strong dataset are the renaming of fields and the reordering of columns,
so that both the Weak and Strong versions have `filename` and `event_label` as their first
two columns.

-Weak dataset changes

-- Labels are given one per line, instead of as a comma-separated, quoted list

-- To make sure that the `filename` format is the same as in the Strong version, the following
format change is made:
The value of the `start_seconds` field is converted to milliseconds and appended to the `filename` with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the format of `filename` with the Strong version and also makes `end_seconds` redundant.
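
The two Weak-dataset changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names `weak_filename` and `explode_weak_row` are hypothetical, the ID `abc` is made up, and the sketch assumes the millisecond value is written as a plain integer.

```python
def weak_filename(ytid, start_seconds):
    """Append the segment start, in whole milliseconds, to the ID.

    Assumption: milliseconds are written as a plain integer after an underscore.
    """
    start_ms = int(round(float(start_seconds) * 1000))
    return f"{ytid}_{start_ms}"


def explode_weak_row(ytid, start_seconds, positive_labels):
    """Turn one Weak row into one (filename, event_label) pair per label."""
    filename = weak_filename(ytid, start_seconds)
    # The original field is a quoted, comma-separated list of label ids.
    labels = [lab.strip() for lab in positive_labels.strip().strip('"').split(",")]
    return [(filename, lab) for lab in labels if lab]


# Hypothetical row: ID "abc", segment starting at 30 seconds, two labels.
for pair in explode_weak_row("abc", "30.000", '"/m/09x0r,/t/dd00003"'):
    print("\t".join(pair))
```

Because every segment is assumed to be 10 seconds long, the offset can be recovered from the onset and does not need to be stored in the filename.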

-Class labels changes

Class labels from both datasets are merged into one file and given in alphabetical order of `id`s. Since the same `id`s are present in both datasets, but sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak dataset. It is possible to regenerate `class_labels.tsv` while giving priority to the Weak version of the labels by calling `convert_labels(False)` from convert.py in the GitHub repository.
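
The merge rule described above (one label per `id`, with one dataset taking priority on conflicts, sorted by `id`) can be sketched as below. The function `merge_class_labels` and the example ids and labels are hypothetical; this is not the repository's `convert_labels` implementation, only an illustration of the priority logic.

```python
def merge_class_labels(weak, strong, strong_priority=True):
    """Merge two {id: label} mappings and return (id, label) pairs sorted by id.

    On conflicting ids, the Strong label wins when strong_priority is True,
    otherwise the Weak label wins.
    """
    low, high = (weak, strong) if strong_priority else (strong, weak)
    merged = {**low, **high}  # later mapping overwrites earlier one
    return sorted(merged.items())


# Made-up example data, for illustration only.
weak = {"/m/09x0r": "Speech", "/m/aaa": "Weak name"}
strong = {"/m/aaa": "Strong name", "/m/bbb": "Only in Strong"}

for label_id, label in merge_class_labels(weak, strong):
    print(f"{label_id}\t{label}")
```

Passing `strong_priority=False` in this sketch mirrors the idea of giving priority to the Weak labels.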

-License

Google's AudioSet was published in two stages - first the Weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)

Both the original dataset and this reworked version are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.

Data from: Powerful significance testing for unbalanced clusters
tandf.figshare.com
json
Updated Feb 24, 2025
Cite
Thomas H. Keefe; J. S. Marron (2025). Powerful significance testing for unbalanced clusters [Dataset]. http://doi.org/10.6084/m9.figshare.28473706.v1
Explore at:
json (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.28473706.v1
Dataset updated
Feb 24, 2025
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Thomas H. Keefe; J. S. Marron
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is “are the clusters really there?” One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold...
plos.figshare.com
xls
Updated Jun 29, 2023
Cite
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet (2023). Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold cross validation. Displayed results are mean (s.d.). [Dataset]. http://doi.org/10.1371/journal.pone.0287705.t003
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0287705.t003
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Teo Nguyen; Kerrie Mengersen; Damien Sous; Benoit Liquet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Maupiti Island
Description
Results comparing raw Maupiti data (4 classes) and oversampled with a 5-fold cross validation. Displayed results are mean (s.d.).
imagenet2012_fewshot
tensorflow.org
Updated Dec 10, 2022
Cite
(2022). imagenet2012_fewshot [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_fewshot
Dataset updated
Dec 10, 2022
Description
Imagenet2012Fewshot is a subset of the original ImageNet ILSVRC 2012 dataset. It shares the same validation set as the original ImageNet ILSVRC 2012 dataset, but the training set is subsampled in a label-balanced fashion. In the 5shot configuration, 5 images per label (5000 images in total) are sampled; in the 10shot configuration, 10 images per label (10000 images in total) are sampled.
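
To make "subsampled in a label-balanced fashion" concrete, here is a minimal sketch of drawing a fixed number of examples per label. It is an illustration only: the source does not describe the exact sampling procedure used to build this dataset, and `sample_k_per_label`, the seed, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def sample_k_per_label(examples, k, seed=0):
    """Return k randomly chosen (item, label) pairs for every label."""
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < k:
            raise ValueError(f"label {label!r} has fewer than {k} examples")
        subset.extend((item, label) for item in rng.sample(items, k))
    return subset


# Toy data: 3 labels with 20 items each (stand-ins for images).
toy = [(f"img_{label}_{i}", label) for label in range(3) for i in range(20)]
five_shot = sample_k_per_label(toy, k=5)
print(len(five_shot))  # 3 labels x 5 items each
```

With 1000 labels, the same rule yields 1000 x 5 = 5000 images for 5shot and 1000 x 10 = 10000 images for 10shot, matching the totals stated above.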

To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('imagenet2012_fewshot', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_fewshot-1shot-5.0.1.png

Cite

Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/code

Wireless Sensor Network Dataset

Explore at:

zip (258458 bytes) (available download formats)

Dataset updated

Jun 19, 2024

Authors

Rehan Adil Abbasi

License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Basic Information:

- Number of entries: 374,661
- Number of features: 19

Data Types:

- 15 integer columns
- 3 float columns
- 1 object column (label)

Column Names:

id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

Explore the Dataset

First Five Rows:

| | id | Time | Is_CH | who CH | Dist_To_CH | ADV_S | ADV_R | JOIN_S | JOIN_R | SCH_S | SCH_R | Rank | DATA_S | DATA_R | Data_Sent_To_BS | dist_CH_To_BS | send_code | Consumed Energy | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101000 | 50 | 1 | 101000 | 0.00000 | 1 | 0 | 0 | 25 | 1 | 0 | 0 | 0 | 1200 | 48 | 0.00000 | 1 | 0.00000 | Attack |
| 1 | 101001 | 50 | 0 | 101044 | 75.32345 | 0 | 4 | 1 | 0 | 0 | 1 | 2 | 38 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
| 2 | 101002 | 50 | 0 | 101010 | 46.95453 | 0 | 4 | 1 | 0 | 0 | 1 | 19 | 41 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
| 3 | 101003 | 50 | 0 | 101044 | 64.85231 | 0 | 4 | 1 | 0 | 0 | 1 | 16 | 38 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |
| 4 | 101004 | 50 | 0 | 101010 | 4.83341 | 0 | 4 | 1 | 0 | 0 | 1 | 0 | 41 | 0 | 0 | 0.00000 | 1 | 0.09797 | Normal |

Missing Values: No missing values detected in the dataset.

Statistical Summary:

The dataset includes various features related to network operations, such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable, label, contains categorical values such as "Attack" and "Normal".

Analyze Class Distribution

Let's analyze the distribution of the classes within the dataset.

    class_distribution = dataset['label'].value_counts()
    class_distribution

Handle Class Imbalance

If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

Next Steps:

- Identify the class distribution.
- Apply balancing techniques if necessary.
- Continue with data preprocessing and feature engineering.

We will perform the class distribution analysis and balancing in the subsequent step.

I found some duplicate rows and dropped them:

    # Count fully duplicated rows, then remove them in place.
    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)

Duplicate Handling

Initial Duplicate Count

- Duplicates found: 8,873

Action Taken

- Duplicates dropped: The dataset has been cleaned by removing all duplicate entries.

Verification

- Duplicates after cleaning: 0

The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.

Analyze Class Distribution

Let's analyze the distribution of the label column to understand the balance between the classes.

    class_distribution = dataset['label'].value_counts()
    class_distribution

I'll compute this now.

Class Distribution Analysis

The distribution of the classes within the dataset is as follows:

- Normal: 332,040
- Grayhole: 13,909
- Blackhole: 10,049
- TDMA: 6,633
- Flooding: 3,157

Observations

There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
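
To quantify the imbalance reported above, the sketch below computes each class's share of the rows and a set of inverse-frequency class weights from the listed counts. Class weighting is a different technique from the undersampling, oversampling, and SMOTE options mentioned in the description; it is shown here only as one simple way to express the imbalance numerically. The function names are hypothetical, and the weight formula used is total / (number of classes x class count).

```python
# Class counts as listed in the class distribution above.
counts = {
    "Normal": 332040,
    "Grayhole": 13909,
    "Blackhole": 10049,
    "TDMA": 6633,
    "Flooding": 3157,
}


def class_shares(counts):
    """Fraction of all rows that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


shares = class_shares(counts)
weights = inverse_frequency_weights(counts)
for label in counts:
    print(f"{label}: share={shares[label]:.4f} weight={weights[label]:.2f}")
```

The counts sum to 365,788, which equals the 374,661 original entries minus the 8,873 dropped duplicates.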
