4 datasets found
  1. Complete code and datasets for "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships"

    • zenodo.org
    bin, pdf, zip
    Updated Mar 13, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships

    This is the complete code, models, and datasets for the paper ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    Installation

    This repository is a Poetry project, so it can be installed by running the following command from a shell in the repository folder:

    poetry install

    As this repository is script-based, the README.md file contains all the commands used to generate the dataset and train the models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli, and all calls to it, with the parameters used, are listed in README.md.

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For the BERT-based models, all in PyTorch, two types of Hugging Face models were used for training; the same base model is also required when loading a dataset, because of its tokenizer:

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN-based models in PyTorch. You can load them with the model_path
    • xlmroberta: XLM-RoBERTa-based models in PyTorch. You can load them with the model_path

    Models with the suffix _annot were trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (ex: ) contains the test results for the test set and the stress test sets (ex: ).

    Load model

    Models are found in the folder model; all of them are PyTorch models that can be loaded with the Hugging Face interface:

    from transformers import AutoModel

    # Placeholder path: point this at one of the trained model folders inside `model`
    model = AutoModel.from_pretrained('<path_to_model_folder>')
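
    Because loading a dataset also requires the tokenizer of the corresponding base model (see Model types above), the matching tokenizer can be fetched through the same interface. A hedged sketch, assuming the publicly available BERTIN and XLM-RoBERTa base checkpoints; substitute the checkpoints referenced in the parameters folder:

    from transformers import AutoTokenizer

    # Assumed base checkpoints, shown for illustration only.
    roberta_tokenizer = AutoTokenizer.from_pretrained('bertin-project/bertin-roberta-base-spanish')
    xlmroberta_tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')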

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP that contains all other files. It holds the final test dataset: 974 examples whose human majority label matches the original linking-phrase label.

    Other datasets:

    The datasets can be found in the folder data, which is divided into the following subfolders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Train-val-test splits extracted from each corpus. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract the pair.
    • connector_type: NLI label, one of "contrasting", "entailment", "reasoning" or "neutral".
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment" and "reasoning"; "none" for "neutral".
    • distance: How many sentences before the connector sentence_1 appears.
    • sentence_1_position: Sentence index of sentence_1 in the source document.
    • sentence_1_paragraph: Paragraph index of sentence_1 in the source document.
    • sentence_2_position: Sentence index of sentence_2 in the source document.
    • sentence_2_paragraph: Paragraph index of sentence_2 in the source document.
    • id: Unique identifier for the example.
    • dataset: Source corpus of the pair. Corpus metadata, including the source, can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation or testing, use the custom dataset class:

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    # Placeholder arguments: the split file name and parameter values are elided here;
    # see README.md and the parameters folder for the values used in the paper.
    dataset = BERTDataset(
        os.path.join(dataset_folder, '<split_file>.jsonl'),
        max_len=...,
        model_type=...,
        only_premise=...,
        max_samples=...,
    )
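
    Assuming BERTDataset implements the standard torch.utils.data.Dataset interface, the loaded split can then be batched with a regular PyTorch DataLoader; a minimal sketch (batch size is illustrative):

    from torch.utils.data import DataLoader

    # Wrap the custom dataset in a standard DataLoader for training or evaluation loops.
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    for batch in loader:
        ...  # forward pass, loss computation, etc.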

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of Jupyter notebooks used to preprocess datasets and visualize results.

  2. Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation

    • redu.unicamp.br
    • data.niaid.nih.gov
    • +1more
    Updated Jul 15, 2024
    Cite
    Repositório de Dados de Pesquisa da Unicamp (2024). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.25824/redu/IH0AH0
    Explore at:
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Repositório de Dados de Pesquisa da Unicamp
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
    Description

    Source code and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2_S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we used the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1.

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split: the first 50,000 samples go through 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. We can thus evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF_MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF_MViT_oT), we apply oversampling only on the training data, keeping the original validation set. In the third model, Solar Flare MViT over Train and Validation (SF_MViT_oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folder structure

    In the root directory of the project, we have two folders:

    • magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar active regions (ARs). Note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
    • Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.

    M24/M48: both contain the following sub-folders: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test.

    There are also two files in the root:

    • inst_packages.sh: installs the packages and dependencies needed to run the models.
    • download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file lists a sequence of images pointing to the magnetogram_jpg folder.

    All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MViT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Running the model training codes generates the output, error and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.

    Naming pattern for the files:

    • magnetogram_jpg follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s...to.", where: hmi is the instrument that captured the image; sharp_720s is the database source of SDO/HMI; the SHARP region identifier can contain one or more solar ARs classified by NOAA; the capture date-time is in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds); the sequence start and end date-times follow the same format.
    • Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "flare_Mclass_.txt", where: the prefix is Seq16 if the file refers to a sequence, or empty if it refers directly to images; the horizon is "24h" or "48h"; the set is "TrainVal" or "Test", with a further placeholder referring to the Train/Val split; "_over" appended after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Model training codes: "SF_MViT_M+_", where: the oversampling tag is empty, "oT" (over Train), "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test); the horizon is "24h" or "48h"; the split mode is "oneSplit" for a specific split or "allSplits" if it runs all splits; the GPU tag is empty by default (1 GPU) or "2gpu" to run on 2-GPU systems.
    • Job submission files: "jobMViT_", where the suffix points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
    • Temporary inputs: "Seq16_flare_Mclass_.txt", where: the set is train or val; "_over" appended after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Outputs: "saida_MViT_Adam_10-7", where the fold tag is k0 to k4, indicating the corresponding split, or empty if the output is from all splits.
    • Error files: "err_MViT_Adam_10-7", where the fold tag is k0 to k4, indicating the corresponding split, or empty if the error file is from all splits.
    • Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where the placeholders are the epoch number of the checkpoint, the corresponding validation loss, and the fold (0 to 4).
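
    As a hedged sketch (assuming the standard torchvision distribution of the backbone, which is presumably what download_MViTS.py caches), the pre-trained MViTv2_S with Kinetics-400 weights can be loaded and fed a batch shaped like the inputs described above:

    import torch
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    # Load MViTv2_S pre-trained on Kinetics-400; torchvision stores the weights in the local cache.
    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
    model.eval()

    # Dummy batch matching the description: 10 samples, 3 channels, 16 frames, 224 x 224 pixels.
    x = torch.rand(10, 3, 16, 224, 224)
    with torch.no_grad():
        out = model(x)
    print(out.shape)  # (10, 400): the Kinetics-400 classification head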

  3. mini-imagenet

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    PyTorch Image Models (2024). mini-imagenet [Dataset]. https://huggingface.co/datasets/timm/mini-imagenet
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    PyTorch Image Models
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Description

    A mini version of ImageNet-1k with 100 of the 1000 classes present. Unlike some 'mini' variants, this one includes the original images at their original sizes; many such subsets downsample to 84x84 or other smaller resolutions.

    Data Splits

    • Train: 50,000 samples from the ImageNet-1k train split
    • Validation: 10,000 samples from the ImageNet-1k train split
    • Test: 5,000 samples from the ImageNet-1k validation split (all 50 samples per class)

    … See the full description on the dataset page: https://huggingface.co/datasets/timm/mini-imagenet.
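
    As a hedged illustration (not part of the dataset card), the dataset can be loaded with the Hugging Face datasets library using the repository id from the URL above:

    from datasets import load_dataset

    # Loads timm/mini-imagenet from the Hugging Face Hub (id taken from the dataset page URL).
    ds = load_dataset("timm/mini-imagenet")
    print({split: len(ds[split]) for split in ds})  # expected sizes: train 50000, validation 10000, test 5000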

  4. resisc45

    • huggingface.co
    Updated Jun 12, 2024
    + more versions
    Cite
    PyTorch Image Models (2024). resisc45 [Dataset]. https://huggingface.co/datasets/timm/resisc45
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2024
    Dataset authored and provided by
    PyTorch Image Models
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The RESISC45 dataset is a publicly available benchmark for Remote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). It contains 31,500 images covering 45 scene classes, with 700 images in each class. The dataset does not have any default splits; the train, validation, and test splits were based on these definitions here… See the full description on the dataset page: https://huggingface.co/datasets/timm/resisc45.

