61 datasets found
  1. DRIVE Train/Validation Split Dataset

    • kaggle.com
    Updated Feb 19, 2023
    Cite
    Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sovit Ranjan Rath
    Description

    This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

    The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

    This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

    Split sample numbers: training images and masks: 16; validation images and masks: 4; test images: 20.

  2. Water Bodies Segmentation Dataset with Split

    • kaggle.com
    zip
    Updated Jan 30, 2023
    Cite
    Sovit Ranjan Rath (2023). Water Bodies Segmentation Dataset with Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/water-bodies-segmentation-dataset-with-split
    Explore at:
    Available download formats: zip (258844554 bytes)
    Dataset updated
    Jan 30, 2023
    Authors
    Sovit Ranjan Rath
    Description

    Dataset for segmentation of water bodies in satellite imagery. Find the blog post using this dataset - Train PyTorch DeepLabV3 on Custom Dataset

    Acknowledgments

    Original data => https://www.kaggle.com/datasets/franciscoescobar/satellite-images-of-water-bodies

  3. VideoDD-ICLR-distill

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    kkk (2025). VideoDD-ICLR-distill [Dataset]. https://huggingface.co/datasets/turturtur250/VideoDD-ICLR-distill
    Explore at:
    Dataset updated
    Aug 31, 2025
    Authors
    kkk
    Description

    VideoDD-ICLR-distill

    This dataset contains pre-processed versions of HMDB51 and UCF101 video datasets, where each video has been cut into frames for use in distillation and video classification tasks.

      Structure
    

    data/HMDB51/ : HMDB51 dataset split into frames
    data/UCF101/ : UCF101 dataset split into frames
    dataset.py : PyTorch dataset loader
    resize_mydata.py : Frame resizing script
    show_img.py : Visualization script
    gpu123_logs/ : Logs from training runs
    … See the full description on the dataset page: https://huggingface.co/datasets/turturtur250/VideoDD-ICLR-distill.

  4. flowers102

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Pu Fanyi (2025). flowers102 [Dataset]. https://huggingface.co/datasets/pufanyi/flowers102
    Explore at:
    Dataset updated
    Nov 21, 2025
    Authors
    Pu Fanyi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford 102 Flowers (Custom Split)

    This dataset re-packages the Oxford 102 Flowers dataset with a custom train/validation/test split produced by src.data.upload_flowers102.

      Split Ratios
    

    Train: 80.00%; Validation: 10.00%; Test: 10.00%

      Source
    

    Original images and annotations come from the Oxford 102 Flowers dataset, as distributed on the Hugging Face Hub under pytorch/oxford-flowers.
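
    A minimal loading sketch (not part of the dataset card): it uses the Hugging Face datasets library with the repository id from the citation above; the split and column names are assumptions based on the ratios listed.

    from datasets import load_dataset

    # Repository id taken from the citation above; split/column names are assumptions.
    ds = load_dataset("pufanyi/flowers102")
    print(ds)  # inspect the available splits and their sizes

    sample = ds["train"][0]                            # assumed split name
    image, label = sample["image"], sample["label"]    # assumed column names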

  5. Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks...

    • zenodo.org
    bin, pdf, zip
    Updated Nov 12, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation

    This is the complete code, model and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation

    In case you cannot access the article this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    How to cite:

    Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23

    IMPORTANT UPDATE!!!

    It is strongly advised to work with the following links, instead of working directly from Zenodo:

    • CODE REPOSITORY: This repository contains the code used for the article.

    • SMALL EXAMPLE REPOSITORY: This repository contains a small code example showing you how to train, and predict using a very small toy dataset, with the same structure.

    • HUGGING FACE COLLECTION: Huggingface collection containing the dataset and models.

    If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.

    ----------------------------------------------------------------------------------------------

    Installation

    This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:

    poetry install

    As this repository is script based, the README.md file contains all the commands executed to generate the dataset and train models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli and all the calls to the core code with the parameters requested are found in README.md

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For BERT-based models (all in PyTorch), there are two types of Hugging Face models that were used for training; they are also required to load a dataset because of the tokenizer:

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN based models in pytorch. You can load them with the model_path
    • xlmroberta: XLMRoBERTa based models in pytorch. You can load them with the model_path

    Models with the suffix _annot are models trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder contains the test results for the test set and the stress test sets.

    Load model

    Models are found in the folder model; all of them are PyTorch models that can be loaded through the Hugging Face interface (the path below is a placeholder for one of the trained model folders):

    from transformers import AutoModel

    # placeholder path: point it at one of the trained model folders
    model = AutoModel.from_pretrained("<path/to/model/folder>")

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP containing all other files, and it contains the final test dataset with 974 examples selected by human majority label matching the original linking phrase label.

    Other datasets:

    The datasets can be found in the folder data that is divided in the following folders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Splits of train-val-test extracted for each corpora. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract pair.
    • connector_type: NLI label, between "contrasting", "entailment", "reasoning" or "neutral"
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment", "reasoning" and "none" for neutral.
    • distance: How many sentences before the connector is the sentence_1
    • sentence_1_position: Number of sentence for sentence_1 in the source document
    • sentence_1_paragraph: Number of paragraph for sentence_1 in the source document
    • sentence_2_position: Number of sentence for sentence_2 in the source document
    • sentence_2_paragraph: Number of paragraph for sentence_2 in the source document
    • id: Unique identifier for the example
    • dataset: Source corpus of the pair. Metadata of corpus, including source can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain genre of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation, or testing, you must use the custom dataset class (the argument values below are placeholders):

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    dataset_folder = "data/base_dataset"  # placeholder: folder holding the splits
    dataset = BERTDataset(
        os.path.join(dataset_folder, "<split_file>.jsonl"),  # placeholder split file
        max_len=128,                  # placeholder maximum sequence length
        model_type="<model_name>",    # placeholder Hugging Face model name (needed for the tokenizer)
        only_premise=False,           # True to train with the premise (first sentence) only
        max_samples=None,             # placeholder sample limit
    )

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of jupyter notebooks used to preprocess datasets and visualize results.

  6. cifar10

    • huggingface.co
    Updated Aug 5, 2025
    Cite
    Élie Goudout (2025). cifar10 [Dataset]. https://huggingface.co/datasets/ego-thales/cifar10
    Explore at:
    Dataset updated
    Aug 5, 2025
    Authors
    Élie Goudout
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Specifications

    Contains the entire CIFAR10 dataset, downloaded via PyTorch, then split and saved as .png files representing 32x32 images. There are three splits, perfectly balanced class-wise:

    train: 49,000 out of the original 50,000 samples from the training set of CIFAR10; calibration: 1,000 left-out samples from the training set; test: 10,000 samples, the entire original test set.

      File Structure
    

    Files are archives

  7. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Explore at:
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content

    The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.

    Data Split

    The dataset has been partitioned to create a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
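
    A minimal PyTorch loading sketch built on the manifest files described above (the column names image_path and label come from the description; the class itself, the transform handling, and the assumption that paths resolve from the working directory are illustrative only):

    import pandas as pd
    from PIL import Image
    from torch.utils.data import Dataset

    class CaltechManifestDataset(Dataset):
        """Reads images listed in train.csv/test.csv (columns: image_path, label)."""

        def __init__(self, manifest_csv, transform=None):
            self.frame = pd.read_csv(manifest_csv)
            self.transform = transform

        def __len__(self):
            return len(self.frame)

        def __getitem__(self, idx):
            row = self.frame.iloc[idx]
            image = Image.open(row["image_path"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, row["label"]

    # Usage (path is a placeholder):
    # train_ds = CaltechManifestDataset("train.csv")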

    Acknowledgements & Original Source

    This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G. Holub, A.D. Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  8. Data from: Solar flare forecasting based on magnetogram sequences learning...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Dec 4, 2023
    Cite
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon (2023). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10246576
    Explore at:
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Universidade Estadual de Campinas (UNICAMP)
    Authors
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation".

    Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each with 16 sequential 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1.

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples to apply 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may positively bias the results.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the Root directory of the project, we have two folders:

    magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).

    Seq_Magnetogram: contains the references for the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.

    M24/M48: both present the following sub-folders structure:

    Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in root:

    inst_packages.sh: installs the packages and dependencies to run the models.

    download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files, in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of the trained models.

    Naming pattern for the files:

    magnetogram_jpg files follow the format "hmi.sharp_720s.<region>.<capture time>.magnetogram.fits.jpg" and Seqs16 files follow the format "hmi.sharp_720s.<region>.<start time>.to.<end time>", where:

    hmi: the instrument that captured the image.
    sharp_720s: the database source of SDO/HMI.
    <region>: the identification of the SHARP region, which can contain one or more solar ARs classified by NOAA.
    <capture time>: the date-time the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds).
    <start time>: the date-time when the sequence starts, in the same format as <capture time>.
    <end time>: the date-time when the sequence ends, in the same format as <capture time>.

    Reference text files in M24 and M48, or inside the SF_MViT... folders, follow the format "flare_Mclass_<...>.txt", where:

    <prefix>: "Seq16" if the file refers to a sequence, or void if it refers directly to images.
    <horizon>: "24h" or "48h".
    <set>: "TrainVal" or "Test"; the split index refers to the split of Train/Val.
    void or "_over" after the extension (...txt_over): a temporary input reference that was over-sampled by a training model.

    All SF_MViT... folders:

    Model training codes: "SF_MViT_M+_<variant>_<horizon>_<splits>_<gpus>", where:

    <variant>: void, or "oT" (over Train), or "oTV" (over Train and Val), or "oTV_Test" (over Train, Val and Test);
    <horizon>: "24h" or "48h";
    <splits>: "oneSplit" for a specific split or "allSplits" if run on all splits;
    <gpus>: void (default, runs on 1 GPU) or "2gpu" to run on 2-GPU systems.

    Job submission files: "jobMViT_<queue>", where <queue> points to the queue in the Lovelace environment hosted on CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).

    Temporary inputs: "Seq16_flare_Mclass_<set>.txt", where:

    <set>: train or val;
    void or "_over" after the extension (...txt_over): a temporary input reference that was over-sampled by a training model.

    Outputs: "saida_MViT_Adam_10-7<k>", where <k> is k0 to k4 (the corresponding split of the output), or void if the output is from all splits.

    Error files: "err_MViT_Adam_10-7<k>", where <k> is k0 to k4 (the corresponding split of the error log file), or void if the error file is from all splits.

    Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=<epoch>-valid_loss=<loss>-Wloss_k=<k>.ckpt", where:

    <epoch>: epoch number of the checkpoint;
    <loss>: corresponding valid loss;
    <k>: 0 to 4.
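
    For reference, a minimal sketch of loading the pre-trained MViTv2-S with Kinetics-400 weights, which is what download_MViTS.py is described as doing above (the torchvision call and input shape shown here are assumptions for illustration, not the repository code):

    import torch
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    # Downloads and caches the Kinetics-400 pre-trained MViTv2-S, as described above.
    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
    model.eval()

    # The description mentions 16-frame, 3-channel, 224x224 input sequences.
    dummy = torch.randn(1, 3, 16, 224, 224)
    with torch.no_grad():
        out = model(dummy)
    print(out.shape)  # Kinetics-400 logits: (1, 400)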

  9. TecoGan Pytorch Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2021
    Cite
    Dwight Foster (2021). TecoGan Pytorch Dataset [Dataset]. https://www.kaggle.com/datasets/gtownfoster/ucf101-images-for-tecogan-pytorch
    Explore at:
    Available download formats: zip (4545427538 bytes)
    Dataset updated
    Mar 19, 2021
    Authors
    Dwight Foster
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is a dataset for the TecoGan Pytorch model. The Github repo can be found here.

    Content

    There are 400 scenes from the UCF101 dataset. Each video was split into photos with a maximum length of 120 photos. The photos were put into this dataset in the format that the TecoGan dataloader takes.

    Acknowledgements

    The original UCF101 dataset can be found here. And you can find the original TecoGan repo here.

    Inspiration

    Let's see how good your super resolution images can look. How close can you get to the original?

  10. Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution...

    • zenodo.org
    bin, xz
    Updated Aug 30, 2023
    Cite
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Patrick Maeder (2023). Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution Inpainting for Safety Critical Detect and Avoid [Dataset]. http://doi.org/10.5281/zenodo.8301120
    Explore at:
    Available download formats: bin, xz
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Patrick Maeder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and especially large data sets, since detection itself is the corner case. Most models suffer from limited ground truth in raw data, e.g. recorded air traffic or frontal flight with a small aircraft. It often leads to poor and critical detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset and present it to an independent object detector that was fully trained on real data.

    This dataset is represented in the following repository. The dataset is structured as follows:

    # Synthetic Airborne Intruder Dataset

    This dataset was synthetically generated using an adapted [Pix2Pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) with different background images and object segmentations. Each image contains one object instance.

    The annotations are in the COCO annotation format.

    ## Data Structure
    Synthetic Dataset Root:
    --train
    |--images
    |--instances.json
    --val
    |--images
    |--instances.json
    --test
    |--images
    |--instances.json
    --Background_Sources
    |--sources_train.csv
    |--sources_val.csv
    |--sources_test.csv
    --README.md
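
    A minimal loading sketch for one split, assuming the directory layout above and using torchvision's generic COCO loader rather than any code shipped with the dataset:

    from torchvision.datasets import CocoDetection

    # Paths follow the data structure above (train split shown); requires pycocotools.
    train_ds = CocoDetection(root="train/images", annFile="train/instances.json")

    image, targets = train_ds[0]   # PIL image and a list of COCO annotation dicts
    print(len(train_ds), targets[0]["category_id"] if targets else "no annotations")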

    ## Categories

    | Id | Name | Instances over all splits |
    | ---| --- | --- |
    | 0 | large airplane | 1695 |
    | 1 | small airplane | 1255 |
    | 2 | very small airplane | 46 |
    | 3 | helicopter | 2201 |
    | 4 | drone | 961 |
    | 5 | hot air balloon | 315 |
    | 6 | paraglider | 565 |
    | 7 | airship | 42 |
    | 8 | UFO | 0 |

    ### Note:
    UFO is a placeholder for future expansion of the dataset.

    ## Splits
    The dataset consists of 3 splits: train 5900 images, val 590 images, test 590 images.
    The Number of instances per class and per split can be seen in the table below:

    Class | train | val | test
    -------|-------|-----|--------
    large airplane | 1416 | 142 | 137
    small airplane | 1046 | 96 | 113
    very small airplane | 38 | 2 | 6
    helicopter | 1812 | 206 | 183
    drone | 800 | 86 | 75
    hot air balloon | 268 | 21 | 26
    paragliders | 492 | 32 | 41
    airship | 28 | 5 | 9
    UFO | 0 | 0 | 0

    ## Sources
    The sources of the background images can be found in the files [here](./Background_Sources/).

  11. Simulated datasets for detector and particle flow reconstruction: CLIC...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Mar 21, 2025
    Cite
    Mokhtar, Farouk (2025). Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8409591
    Explore at:
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kagan, Michael
    Garcia, Dolores
    Duarte, Javier
    Pata, Joosep
    Zhang, Mengke
    Wulff, Eric
    Mokhtar, Farouk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synopsis

    Machine-learning friendly format of tracks, clusters and target particles in electron-positron events, simulated with the CLIC detector. Ready to be used with jpata/particleflow:v2.3.0. Derived from the EDM4HEP ROOT files in https://zenodo.org/record/8260741.

    clic_edm_ttbar_pf.zip: e+e- -> ttbar, center of mass energy at 380 GeV

    clic_edm_qq_pf.zip: e+e- -> Z* -> qqbar, center of mass energy at 380 GeV

    clic_edm_ww_fullhad_pf.zip: e+e- -> WW -> W decaying hadronically, center of mass energy at 380 GeV

    clic-tfds.ipynb: an example notebook on how to load the files

    Contents

    Each .zip file contains the dataset in the tensorflow-datasets array_record format. We have split the full datasets into 10 subsets; due to space considerations on Zenodo, two subsets from each dataset are uploaded. Each dataset contains a train and test split of events.

    Dataset semantics (to be updated)

    Each dataset consists of events that can be iterated over using the tensorflow-datasets library and used in either tensorflow or pytorch. Each event has the following information available:

    X: the reconstruction input features, i.e. tracks and clusters

    ytarget: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py and https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/data/key4hep/postprocessing.py.
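
    A minimal loading sketch (an assumption based on the tensorflow-datasets/array_record format stated above; the unzipped directory name, version folder, and feature access shown here may differ from the actual archives):

    import tensorflow_datasets as tfds

    # Placeholder path: point this at an unzipped dataset directory.
    builder = tfds.builder_from_directory("clic_edm_ttbar_pf/1.0.0")
    ds = builder.as_data_source(split="train")  # array_record files support random access

    event = ds[0]
    print(event["X"].shape, event["ytarget"].shape)  # assumed feature keys, per the description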

  12. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Explore at:
    Available download formats: bin, zip, png
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    The FiN-2 dataset is split into two compressed CSV files: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node that is measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # skip the first 5 data rows, then read the next 10 (rows 6 to 15)
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed CSV data format was used to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, which is why the number of measurements for a specific timestamp differs. This, plus the high sparsity of the graph, leads to high inefficiency when using the CSV format for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Forecasting of the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. Thereby, a MinMax scaling was used as simple preprocessing and a PyTorch dataset class was created to handle the data. Furthermore, a vanilla autoencoder is utilized to process and forecast the voltage into the future.
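
    A minimal sketch of such a PyTorch dataset for voltage forecasting (the window length, horizon, per-node filtering and scaling details are assumptions for illustration; the notebook in the repository is the reference implementation):

    ```
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageForecastDataset(Dataset):
        """Sliding windows over the voltage columns (v1, v2, v3) of nodes.csv for one node."""

        def __init__(self, nodes_csv, node_id, window=60, horizon=10):
            df = pd.read_csv(nodes_csv, compression="gzip")
            df = df[df["id"] == node_id].sort_values("ts")
            values = torch.tensor(df[["v1", "v2", "v3"]].values, dtype=torch.float32)
            # Simple MinMax scaling to [0, 1], as in the example use case above.
            vmin, vmax = values.min(0).values, values.max(0).values
            self.values = (values - vmin) / (vmax - vmin)
            self.window, self.horizon = window, horizon

        def __len__(self):
            return len(self.values) - self.window - self.horizon + 1

        def __getitem__(self, idx):
            x = self.values[idx : idx + self.window]                                  # past window
            y = self.values[idx + self.window : idx + self.window + self.horizon]    # future target
            return x, y
    ```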

  13. MountainScape Segmentation Dataset

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 11, 2024
    Cite
    Mountain Legacy Project (2024). MountainScape Segmentation Dataset [Dataset]. http://doi.org/10.5683/SP3/CEYU10
    Explore at:
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Borealis
    Authors
    Mountain Legacy Project
    Time period covered
    Jan 1, 1870 - Aug 30, 2023
    Description

    This dataset contains the MountainScape Segmentation Dataset (MS2D), a collection of oblique mountain images from the Mountain Legacy Project and corresponding manually annotated land cover masks. The dataset is split into 144 historic grayscale images collected by early phototopographic surveyors and 140 modern repeat images captured by the Mountain Legacy Project. The image resolutions range from 16 to 80 megapixels and the corresponding masks are RGB images with 8 landcover classes. The image dataset was used to train and test the Python Landscape Classifier (PyLC), a trainable segmentation network and land cover classification tool for oblique landscape photography. The dataset also contains PyTorch models trained with PyLC using the collection of images and masks.

  14. imagenet-w21-wds

    • huggingface.co
    Updated Sep 19, 2025
    Cite
    PyTorch Image Models (2025). imagenet-w21-wds [Dataset]. https://huggingface.co/datasets/timm/imagenet-w21-wds
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset authored and provided by
    PyTorch Image Models
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Dataset Summary

    This is a copy of the full Winter21 release of ImageNet in webdataset tar format with JPEG images. This release consists of 19167 classes, 2674 fewer classes than the original 21841 class Fall11 release of the full ImageNet. The classes were removed due to these concerns: https://www.image-net.org/update-sep-17-2019.php

      Data Splits
    

    The full ImageNet dataset has no defined splits. This release follows that and leaves everything in the train split.… See the full description on the dataset page: https://huggingface.co/datasets/timm/imagenet-w21-wds.
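
    A minimal streaming sketch (an assumption: it relies on the datasets library's generic WebDataset support rather than anything documented on the card):

    from datasets import load_dataset

    # Stream the webdataset tars from the Hub without downloading everything up front.
    ds = load_dataset("timm/imagenet-w21-wds", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample.keys())  # e.g. the decoded JPEG plus its label/key fields (assumed)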

  15. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Explore at:
    Available download formats: zip (971125684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    Dataset Contents

    The dataset is organized into three standard splits:
    - Train set
    - Validation set
    - Test set

    Each split contains data in multiple formats:
    1. Original JPG images
    2. Segmentation mask JPG images
    3. Parquet files containing flattened image and mask data
    4. Pickle files containing serialized image and mask data

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
    • Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels (see the loading sketch below)
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files
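
    A minimal sketch of reading one of the parquet files directly and applying the split_at convention described above (the pandas/NumPy handling shown here is an assumption; the provided CatDataset class below is the supported loader):

    import numpy as np
    import pandas as pd

    image_size, image_channels = (224, 224), 3
    split_at = image_size[0] * image_size[1] * image_channels

    rows = pd.read_parquet("train_dataset.parquet").to_numpy()
    images = rows[:, :split_at].reshape(-1, 224, 224, 3)   # image pixel values
    masks = rows[:, split_at:].reshape(-1, 224, 224, 1)    # mask pixel values
    print(images.shape, masks.shape)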

    Image Preprocessing

    All images were preprocessed with the following operations:
    - Resized to 224×224 pixels using bilinear interpolation
    - Segmentation masks were also resized to match the images using nearest neighbor interpolation
    - Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

    When used with the provided PyTorch dataset class, images are normalized with:
    - Mean: [0.48235, 0.45882, 0.40784]
    - Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

    Loading time benchmarks from the original implementation:
    - Parquet format: ~1.29 seconds per iteration
    - Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  16. Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier

    • zenodo.org
    bin, zip
    Updated Nov 1, 2022
    Cite
    Emily Dunkel; Emily Dunkel (2022). Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier [Dataset]. http://doi.org/10.5281/zenodo.7041842
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emily Dunkel; Emily Dunkel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    We provide imagery used to train LROCNet -- our Convolutional Neural Network classifier of orbital imagery of the moon. Images are divided into train, validation, and test zip files, which contain class specific sub-folders. We have three classes: "fresh crater", "old crater", and "none". Classes are described in detail in the attached labeling guide.

    Directory Contents

    We include the labeling guide and training, testing, and validation data. Training data was split to avoid upload timeouts.

    • LROC_Labeling_Intro_for_release.ppt: Labeling guide
    • val: Validation images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • test: Testing images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • ejecta_train: Training images of "fresh crater" class
    • oldcrater_train: Training images of "old crater" class
    • none_train1-4: Training images of "none" class (divided into 4 just for uploading)

    Data Description

    We use CDR (Calibrated Data Record) browse imagery (50% resolution) from the Lunar Reconnaissance Orbiter's Narrow Angle Cameras (NACs). Data we get from the NACs are 5-km swaths, at nominal orbit, so we perform a saliency detection step to find surface features of interest. A detector developed for Mars HiRISE (Wagstaff et al.) worked well for our purposes, after updating based on LROC NAC image resolution. We use this detector to create a set of image chipouts (small 227x227 cutouts) from the larger image, sampling the lunar globe.

    Class Labeling

    We select classes of interest based on what is visible at the NAC resolution, consulting with scientists and performing a literature review. Initially, we have 7 classes: "fresh crater", "old crater", "overlapping craters", "irregular mare patches", "rockfalls and landfalls", "of scientific interest", and "none".

    Using the Zooniverse platform, we set up a labeling tool and labeled 5,000 images. We found that "fresh crater" make up 11% of the data, "old crater" 18%, with the vast majority "none". Due to limited examples of the other classes, we reduce our initial class set to: "fresh crater" (with impact ejecta), "old crater", and "none".

    We divide the images into train/validation/test sets making sure no image swaths span multiple sets.

    Data Augmentation

    Using PyTorch, we apply the following augmentation on the training set only: horizontal flip, vertical flip, rotation by 90/180/270 degrees, and brightness adjustment (0.5, 2). In addition, we use weighted sampling so that each class is weighted equally. The training set included here does not include augmentation since that was performed within PyTorch.
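
    A minimal sketch reproducing that augmentation and weighted-sampling setup in PyTorch (an approximation for illustration only: exact transform parameters are not specified beyond the description above, and the training images would first need to be reorganized into class sub-folders, as in the val/test sets):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler
    from torchvision import datasets, transforms

    train_tf = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomChoice([transforms.RandomRotation((a, a)) for a in (0, 90, 180, 270)]),
        transforms.ColorJitter(brightness=(0.5, 2.0)),
        transforms.ToTensor(),
    ])

    # Assumed layout: a "train" folder with ejecta/oldcrater/none class sub-folders.
    train_ds = datasets.ImageFolder("train", transform=train_tf)

    # Weight each sample inversely to its class frequency so classes are sampled equally.
    targets = torch.tensor(train_ds.targets)
    counts = torch.bincount(targets)
    weights = 1.0 / counts[targets].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

    loader = DataLoader(train_ds, batch_size=64, sampler=sampler)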

    Acknowledgements

    The author would like to thank the volunteers who provided annotations for this data set, as well as others who contributed to this work (as in the Contributor list). We would also like to thank the PDS Imaging Node for support of this work.

    The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    CL#22-4763

    © 2022 California Institute of Technology. Government sponsorship acknowledged.

  17. Trained Weights for DRIVE Train/Validation Split

    • kaggle.com
    zip
    Updated Feb 21, 2023
    Cite
    Sovit Ranjan Rath (2023). Trained Weights for DRIVE Train/Validation Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/trained-weights-for-drive-trainvalidation-split
    Explore at:
    Available download formats: zip (3795490926 bytes)
    Dataset updated
    Feb 21, 2023
    Authors
    Sovit Ranjan Rath
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Trained weights for the Retinal Vessel Segmentation dataset which can be found here.

    Trained weights: * DeepLabV3 ResNet50 (trained with 512x512 and 768x768 resolution images) * DeepLabV3 ResNet101 (trained with 512x512 and 640x640 resolution images)

    Accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

  18. Leaf Disease Segmentation with Train/Valid Split

    • kaggle.com
    zip
    Updated Feb 12, 2023
    Cite
    Sovit Ranjan Rath (2023). Leaf Disease Segmentation with Train/Valid Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/leaf-disease-segmentation-with-trainvalid-split/discussion
    Explore at:
    Available download formats: zip (528293844 bytes)
    Dataset updated
    Feb 12, 2023
    Authors
    Sovit Ranjan Rath
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a leaf disease dataset for semantic segmentation. The dataset contains a collection of different diseases, but the diseases have not been segregated into separate classes: diseased leaf regions are one class and the background is another.

    The dataset contains two folders. One with original images and another with augmented images. Both formats have a train/validation split for easier experimentation.

    Find the corresponding blog post here: Leaf Disease Segmentation using PyTorch DeepLabV3 (https://debuggercafe.com/leaf-disease-segmentation-using-pytorch-deeplabv3/)

    Roughly, in both cases, 15% is reserved for validation and the rest for training.

    The images have been taken from the PlantDoc dataset.

    Original dataset => https://www.kaggle.com/datasets/fakhrealam9537/leaf-disease-segmentation-dataset

  19. Sentence/Table Pair Data from Wikipedia for Pre-training with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Explore at:
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    Google Research
    The Ohio State University
    Authors
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

    table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)      # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)        # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them with your preferred data format

    Below we show how the data is organized with two examples.

    Text-only

    {
      's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
      's1_all_links': {
        'Sils,_Girona': [[0, 4]],
        'municipality': [[10, 22]],
        'Comarques_of_Catalonia': [[30, 37]],
        'Selva': [[41, 46]],
        'Catalonia': [[51, 60]]
      },  # list of entities and their mentions in the sentence (start, end location)
      'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pairs
        {
          'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
          's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
          's2s': [  # list of other sentences that contain the common entity pair, i.e. evidence
            {
              'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
              'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
              's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
              'pair_locs': [  # mentions of the entity pair in the evidence
                [[19, 27]],  # mentions of entity 1
                [[0, 5], [288, 293]]  # mentions of entity 2
              ],
              'all_links': {
                'Selva': [[0, 5], [288, 293]],
                'Comarques_of_Catalonia': [[19, 27]],
                'Catalonia': [[40, 49]]
              }
            },
            ...  # there are multiple evidence sentences
          ]
        },
        ...  # there are multiple entity pairs in the query
      ]
    }

    Hybrid

    {
      's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
      's1_all_links': {...},  # same as text-only
      'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
      'table_pairs': [
        'tid': 'Major_League_Baseball-1',
        'text': [
          ['World Series Records', 'World Series Records', ...],
          ['Team', 'Number of Series won', ...],
          ['St. Louis Cardinals (NL)', '11', ...],
          ...
        ],  # table content, list of rows
        'index': [
          [[0, 0], [0, 1], ...],
          [[1, 0], [1, 1], ...],
          ...
        ],  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
        'value_ranks': [
          [0, 0, ...],
          [0, 0, ...],
          [0, 10, ...],
          ...
        ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
        'value_inv_ranks': [],  # inverse rank
        'all_links': {
          'St._Louis_Cardinals': {
            '2': [
              [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
            ]  # list of mentions in the second row, the key is row_id
          },
          'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
        },
        'name': '',  # table name, if it exists
        'pairs': {
          'pair': ['American_League', 'National_League'],
          's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
          'table_pair_locs': {
            '17': [  # mention of entity pair in row 17
              [
                [[17, 0], [3, 18]],
                [[17, 1], [3, 18]],
                [[17, 2], [3, 18]],
                [[17, 3], [3, 18]]
              ],  # mentions of the first entity
              [
                [[17, 0], [21, 36]],
                [[17, 1], [21, 36]],
              ]  # mentions of the second entity
            ]
          }
        }
      ]
    }

  20. 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10925350
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    This repository contains the 3DO dataset proposed in [1].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

    Dataset Description

    The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

    The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

    Dataset Structure:

    /3DO
    ├── d1 <-- day 1 subdirectory
    │   └── w1 <-- sequence subdirectory
    │       ├── csiposreg.csv <-- raw WiFi packet time series
    │       └── csiposreg_complex.npy <-- CSI time series cache
    ├── d2 <-- day 2 subdirectory
    └── d3 <-- day 3 subdirectory
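
    A minimal sketch for inspecting one sequence directory (the array handling here is an assumption; the dataloader in the linked repository is the reference implementation):

    import numpy as np
    import pandas as pd

    seq_dir = "3DO/d1/w1"  # any sequence directory from the structure above

    packets = pd.read_csv(f"{seq_dir}/csiposreg.csv")   # raw WiFi packet time series with labels
    csi = np.load(f"{seq_dir}/csiposreg_complex.npy")   # complex CSI cache (regenerated by the dataloader if missing)
    print(packets.shape, csi.shape, csi.dtype)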

    In [1], we use the following training, validation, and test split:

    Subset | Day | Sequences
    -------|-----|----------
    Train  | 1   | w1, w2, w3, s1, s2, s3, l1, l2, l3
    Val    | 1   | w4, s4, l4
    Test   | 1   | w5, s5, l5
    Test   | 2   | w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
    Test   | 3   | w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

    w = walking, s = sitting, l = lying

    Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

    BibTeX citation:

    @inproceedings{strohmayerOn2025,
      author    = "Strohmayer, Julian and Kampel, Martin",
      title     = "On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
      booktitle = "Pattern Recognition",
      year      = "2025",
      publisher = "Springer Nature Switzerland",
      address   = "Cham",
      pages     = "194--211",
      isbn      = "978-3-031-78354-8"
    }
