61 datasets found
  1. DRIVE Train/Validation Split Dataset

    • kaggle.com
    Updated Feb 19, 2023
    Cite
    Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sovit Ranjan Rath
    Description

    This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

    The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

    This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

    Split sample numbers: training images and masks: 16; validation images and masks: 4; test images: 20.

  2. Water Bodies Segmentation Dataset with Split

    • kaggle.com
    zip
    Updated Jan 30, 2023
    Cite
    Sovit Ranjan Rath (2023). Water Bodies Segmentation Dataset with Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/water-bodies-segmentation-dataset-with-split
    Explore at:
    Available download formats: zip (258844554 bytes)
    Dataset updated
    Jan 30, 2023
    Authors
    Sovit Ranjan Rath
    Description

    Dataset for segmentation of water bodies in satellite imagery. Find the blog post using this dataset - Train PyTorch DeepLabV3 on Custom Dataset

    Acknowledgments

    Original data => https://www.kaggle.com/datasets/franciscoescobar/satellite-images-of-water-bodies

  3. VideoDD-ICLR-distill

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    kkk (2025). VideoDD-ICLR-distill [Dataset]. https://huggingface.co/datasets/turturtur250/VideoDD-ICLR-distill
    Explore at:
    Dataset updated
    Aug 31, 2025
    Authors
    kkk
    Description

    VideoDD-ICLR-distill

    This dataset contains pre-processed versions of HMDB51 and UCF101 video datasets, where each video has been cut into frames for use in distillation and video classification tasks.

      Structure
    

    data/HMDB51/ : HMDB51 dataset split into frames
    data/UCF101/ : UCF101 dataset split into frames
    dataset.py : PyTorch dataset loader
    resize_mydata.py : Frame resizing script
    show_img.py : Visualization script
    gpu123_logs/ : Logs from training runs
    … See the full description on the dataset page: https://huggingface.co/datasets/turturtur250/VideoDD-ICLR-distill.

  4. flowers102

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Pu Fanyi (2025). flowers102 [Dataset]. https://huggingface.co/datasets/pufanyi/flowers102
    Explore at:
    Dataset updated
    Nov 21, 2025
    Authors
    Pu Fanyi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford 102 Flowers (Custom Split)

    This dataset re-packages the Oxford 102 Flowers dataset with a custom train/validation/test split produced by src.data.upload_flowers102.

      Split Ratios
    

    Train: 80.00%; Validation: 10.00%; Test: 10.00%

      Source
    

    Original images and annotations come from the Oxford 102 Flowers dataset, as distributed on the Hugging Face Hub under pytorch/oxford-flowers.
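
    A minimal loading sketch (not part of the dataset card): it uses the Hugging Face datasets library with the repository id from the citation above; the split and column names are assumptions based on the ratios listed.

    from datasets import load_dataset

    # Repository id taken from the citation above; split/column names are assumptions.
    ds = load_dataset("pufanyi/flowers102")
    print(ds)  # inspect the available splits and their sizes

    sample = ds["train"][0]                            # assumed split name
    image, label = sample["image"], sample["label"]    # assumed column names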

  5. Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks...

    • zenodo.org
    bin, pdf, zip
    Updated Nov 12, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation

    This is the complete code, model and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation

    In case you cannot access the article this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    How to cite:

    Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23

    IMPORTANT UPDATE!!!

    It is strongly advised to work with the following links, instead of working directly from Zenodo:

    • CODE REPOSITORY: This repository contains the code used for the article.

    • SMALL EXAMPLE REPOSITORY: This repository contains a small code example showing you how to train, and predict using a very small toy dataset, with the same structure.

    • HUGGING FACE COLLECTION: Huggingface collection containing the dataset and models.

    If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.

    ----------------------------------------------------------------------------------------------

    Installation

    This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:

    poetry install

    As this repository is script based, the README.md file contains all the commands executed to generate the dataset and train models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli and all the calls to the core code with the parameters requested are found in README.md

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For BERT-based models (all in PyTorch), there are two types of Hugging Face models that were used for training; they are also required to load a dataset because of the tokenizer:

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN based models in pytorch. You can load them with the model_path
    • xlmroberta: XLMRoBERTa based models in pytorch. You can load them with the model_path

    Models with the suffix _annot are models trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder contains the test results for the test set and the stress test sets.

    Load model

    Models are found in the folder model; all of them are PyTorch models that can be loaded through the Hugging Face interface (the path below is a placeholder for one of the trained model folders):

    from transformers import AutoModel

    # placeholder path: point it at one of the trained model folders
    model = AutoModel.from_pretrained("<path/to/model/folder>")

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP containing all other files, and it contains the final test dataset with 974 examples selected by human majority label matching the original linking phrase label.

    Other datasets:

    The datasets can be found in the folder data that is divided in the following folders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Splits of train-val-test extracted for each corpora. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract pair.
    • connector_type: NLI label, between "contrasting", "entailment", "reasoning" or "neutral"
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment", "reasoning" and "none" for neutral.
    • distance: How many sentences before the connector is the sentence_1
    • sentence_1_position: Number of sentence for sentence_1 in the source document
    • sentence_1_paragraph: Number of paragraph for sentence_1 in the source document
    • sentence_2_position: Number of sentence for sentence_2 in the source document
    • sentence_2_paragraph: Number of paragraph for sentence_2 in the source document
    • id: Unique identifier for the example
    • dataset: Source corpus of the pair. Metadata of corpus, including source can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain genre of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation, or testing, you must use the custom dataset class (the argument values below are placeholders):

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    dataset_folder = "data/base_dataset"  # placeholder: folder holding the splits
    dataset = BERTDataset(
        os.path.join(dataset_folder, "<split_file>.jsonl"),  # placeholder split file
        max_len=128,                  # placeholder maximum sequence length
        model_type="<model_name>",    # placeholder Hugging Face model name (needed for the tokenizer)
        only_premise=False,           # True to train with the premise (first sentence) only
        max_samples=None,             # placeholder sample limit
    )

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of jupyter notebooks used to preprocess datasets and visualize results.

  6. cifar10

    • huggingface.co
    Updated Aug 5, 2025
    Cite
    Élie Goudout (2025). cifar10 [Dataset]. https://huggingface.co/datasets/ego-thales/cifar10
    Explore at:
    Dataset updated
    Aug 5, 2025
    Authors
    Élie Goudout
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Specifications

    Contains the entire CIFAR10 dataset, downloaded via PyTorch, then split and saved as .png files representing 32x32 images. There are three splits, perfectly balanced class-wise:

    train: 49,000 out of the original 50,000 samples from the training set of CIFAR10; calibration: 1,000 left-out samples from the training set; test: 10,000 samples, the entire original test set.

      File Structure
    

    Files are archives

  7. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Explore at:
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content

    The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.

    Data Split

    The dataset has been partitioned to create a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
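
    A minimal PyTorch loading sketch built on the manifest files described above (the column names image_path and label come from the description; the class itself, the transform handling, and the assumption that paths resolve from the working directory are illustrative only):

    import pandas as pd
    from PIL import Image
    from torch.utils.data import Dataset

    class CaltechManifestDataset(Dataset):
        """Reads images listed in train.csv/test.csv (columns: image_path, label)."""

        def __init__(self, manifest_csv, transform=None):
            self.frame = pd.read_csv(manifest_csv)
            self.transform = transform

        def __len__(self):
            return len(self.frame)

        def __getitem__(self, idx):
            row = self.frame.iloc[idx]
            image = Image.open(row["image_path"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, row["label"]

    # Usage (path is a placeholder):
    # train_ds = CaltechManifestDataset("train.csv")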

    Acknowledgements & Original Source

    This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G. Holub, A.D. Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  8. Data from: Solar flare forecasting based on magnetogram sequences learning...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Dec 4, 2023
    Cite
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon (2023). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10246576
    Explore at:
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Universidade Estadual de Campinas (UNICAMP)
    Authors
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation".

    Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each with 16 sequential 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1.

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples to apply 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may positively bias the results.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the Root directory of the project, we have two folders:

    magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).

    Seq_Magnetogram: contains the references for the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.

    M24/M48: both present the following sub-folders structure:

    Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in root:

    inst_packages.sh: installs the packages and dependencies to run the models.

    download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files, in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of the trained models.

    Naming pattern for the files:

    magnetogram_jpg files follow the format "hmi.sharp_720s.<region>.<capture time>.magnetogram.fits.jpg" and Seqs16 files follow the format "hmi.sharp_720s.<region>.<start time>.to.<end time>", where:

    hmi: the instrument that captured the image.
    sharp_720s: the database source of SDO/HMI.
    <region>: the identification of the SHARP region, which can contain one or more solar ARs classified by NOAA.
    <capture time>: the date-time the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds).
    <start time>: the date-time when the sequence starts, in the same format as <capture time>.
    <end time>: the date-time when the sequence ends, in the same format as <capture time>.

    Reference text files in M24 and M48, or inside the SF_MViT... folders, follow the format "flare_Mclass_<...>.txt", where:

    <prefix>: "Seq16" if the file refers to a sequence, or void if it refers directly to images.
    <horizon>: "24h" or "48h".
    <set>: "TrainVal" or "Test"; the split index refers to the split of Train/Val.
    void or "_over" after the extension (...txt_over): a temporary input reference that was over-sampled by a training model.

    All SF_MViT... folders:

    Model training codes: "SF_MViT_M+_<variant>_<horizon>_<splits>_<gpus>", where:

    <variant>: void, or "oT" (over Train), or "oTV" (over Train and Val), or "oTV_Test" (over Train, Val and Test);
    <horizon>: "24h" or "48h";
    <splits>: "oneSplit" for a specific split or "allSplits" if run on all splits;
    <gpus>: void (default, runs on 1 GPU) or "2gpu" to run on 2-GPU systems.

    Job submission files: "jobMViT_<queue>", where <queue> points to the queue in the Lovelace environment hosted on CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).

    Temporary inputs: "Seq16_flare_Mclass_<set>.txt", where:

    <set>: train or val;
    void or "_over" after the extension (...txt_over): a temporary input reference that was over-sampled by a training model.

    Outputs: "saida_MViT_Adam_10-7<k>", where <k> is k0 to k4 (the corresponding split of the output), or void if the output is from all splits.

    Error files: "err_MViT_Adam_10-7<k>", where <k> is k0 to k4 (the corresponding split of the error log file), or void if the error file is from all splits.

    Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=<epoch>-valid_loss=<loss>-Wloss_k=<k>.ckpt", where:

    <epoch>: epoch number of the checkpoint;
    <loss>: corresponding valid loss;
    <k>: 0 to 4.
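
    For reference, a minimal sketch of loading the pre-trained MViTv2-S with Kinetics-400 weights, which is what download_MViTS.py is described as doing above (the torchvision call and input shape shown here are assumptions for illustration, not the repository code):

    import torch
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    # Downloads and caches the Kinetics-400 pre-trained MViTv2-S, as described above.
    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
    model.eval()

    # The description mentions 16-frame, 3-channel, 224x224 input sequences.
    dummy = torch.randn(1, 3, 16, 224, 224)
    with torch.no_grad():
        out = model(dummy)
    print(out.shape)  # Kinetics-400 logits: (1, 400)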

  9. TecoGan Pytorch Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2021
    Cite
    Dwight Foster (2021). TecoGan Pytorch Dataset [Dataset]. https://www.kaggle.com/datasets/gtownfoster/ucf101-images-for-tecogan-pytorch
    Explore at:
    Available download formats: zip (4545427538 bytes)
    Dataset updated
    Mar 19, 2021
    Authors
    Dwight Foster
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is a dataset for the TecoGan Pytorch model. The Github repo can be found here.

    Content

    There are 400 scenes from the UCF101 dataset. Each video was split into photos with a maximum length of 120 photos. The photos were put into this dataset in the format that the TecoGan dataloader takes.

    Acknowledgements

    The original UCF101 dataset can be found here. And you can find the original TecoGan repo here.

    Inspiration

    Let's see how good your super resolution images can look. How close can you get to the original?

  10. Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution...

    • zenodo.org
    bin, xz
    Updated Aug 30, 2023
    Cite
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Patrick Maeder (2023). Synthetic Airborne Intruder Dataset: A dataset based on High-Resolution Inpainting for Safety Critical Detect and Avoid [Dataset]. http://doi.org/10.5281/zenodo.8301120
    Explore at:
    Available download formats: bin, xz
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonathan Lyhs; Lars Hinneburg; Florian Oelsner; Michael Fischer; Jeremy Tschirner; Stefan Milz; Patrick Maeder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and especially large data sets, since detection itself is the corner case. Most models suffer from limited ground truth in raw data, e.g. recorded air traffic or frontal flight with a small aircraft. It often leads to poor and critical detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset and present it to an independent object detector that was fully trained on real data.

    This dataset is represented in the following repository. The dataset is structured as follows:

    # Synthetic Airborne Intruder Dataset

    This dataset was synthetically generated using an adapted [Pix2Pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) with different background images and object segmentations. Each image contains one object instance.

    The annotations are in the COCO annotation format.

    ## Data Structure
    Synthetic Dataset Root:
    --train
    |--images
    |--instances.json
    --val
    |--images
    |--instances.json
    --test
    |--images
    |--instances.json
    --Background_Sources
    |--sources_train.csv
    |--sources_val.csv
    |--sources_test.csv
    --README.md
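
    A minimal loading sketch for one split, assuming the directory layout above and using torchvision's generic COCO loader rather than any code shipped with the dataset:

    from torchvision.datasets import CocoDetection

    # Paths follow the data structure above (train split shown); requires pycocotools.
    train_ds = CocoDetection(root="train/images", annFile="train/instances.json")

    image, targets = train_ds[0]   # PIL image and a list of COCO annotation dicts
    print(len(train_ds), targets[0]["category_id"] if targets else "no annotations")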

    ## Categories

    | Id | Name | Instances over all splits |
    | ---| --- | --- |
    | 0 | large airplane | 1695 |
    | 1 | small airplane | 1255 |
    | 2 | very small airplane | 46 |
    | 3 | helicopter | 2201 |
    | 4 | drone | 961 |
    | 5 | hot air balloon | 315 |
    | 6 | paraglider | 565 |
    | 7 | airship | 42 |
    | 8 | UFO | 0 |

    ### Note:
    UFO is a placeholder for future expansion of the dataset.

    ## Splits
    The dataset consists of 3 splits: train 5900 images, val 590 images, test 590 images.
    The Number of instances per class and per split can be seen in the table below:

    Class | train | val | test
    -------|-------|-----|--------
    large airplane | 1416 | 142 | 137
    small airplane | 1046 | 96 | 113
    very small airplane | 38 | 2 | 6
    helicopter | 1812 | 206 | 183
    drone | 800 | 86 | 75
    hot air balloon | 268 | 21 | 26
    paragliders | 492 | 32 | 41
    airship | 28 | 5 | 9
    UFO | 0 | 0 | 0

    ## Sources
    The sources of the background images can be found in the files [here](./Background_Sources/).

  11. Simulated datasets for detector and particle flow reconstruction: CLIC...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Mar 21, 2025
    Cite
    Mokhtar, Farouk (2025). Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8409591
    Explore at:
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kagan, Michael
    Garcia, Dolores
    Duarte, Javier
    Pata, Joosep
    Zhang, Mengke
    Wulff, Eric
    Mokhtar, Farouk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synopsis

    Machine-learning friendly format of tracks, clusters and target particles in electron-positron events, simulated with the CLIC detector. Ready to be used with jpata/particleflow:v2.3.0. Derived from the EDM4HEP ROOT files in https://zenodo.org/record/8260741.

    clic_edm_ttbar_pf.zip: e+e- -> ttbar, center of mass energy at 380 GeV

    clic_edm_qq_pf.zip: e+e- -> Z* -> qqbar, center of mass energy at 380 GeV

    clic_edm_ww_fullhad_pf.zip: e+e- -> WW -> W decaying hadronically, center of mass energy at 380 GeV

    clic-tfds.ipynb: an example notebook on how to load the files

    Contents

    Each .zip file contains the dataset in the tensorflow-datasets array_record format. We have split the full datasets into 10 subsets; due to space considerations on Zenodo, two subsets from each dataset are uploaded. Each dataset contains a train and test split of events.

    Dataset semantics (to be updated)

    Each dataset consists of events that can be iterated over using the tensorflow-datasets library and used in either tensorflow or pytorch. Each event has the following information available:

    X: the reconstruction input features, i.e. tracks and clusters

    ytarget: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py and https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/data/key4hep/postprocessing.py.
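
    A minimal loading sketch (an assumption based on the tensorflow-datasets/array_record format stated above; the unzipped directory name, version folder, and feature access shown here may differ from the actual archives):

    import tensorflow_datasets as tfds

    # Placeholder path: point this at an unzipped dataset directory.
    builder = tfds.builder_from_directory("clic_edm_ttbar_pf/1.0.0")
    ds = builder.as_data_source(split="train")  # array_record files support random access

    event = ds[0]
    print(event["X"].shape, event["ytarget"].shape)  # assumed feature keys, per the description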

  12. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Explore at:
    Available download formats: bin, zip, png
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    The FiN-2 dataset is split into two compressed CSV files: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node that is measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # skip the first 5 data rows, then read the next 10 (rows 6 to 15)
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed CSV data format was used to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, which is why the number of measurements for a specific timestamp differs. This, plus the high sparsity of the graph, leads to high inefficiency when using the CSV format for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Forecasting of the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. Thereby, a MinMax scaling was used as simple preprocessing and a PyTorch dataset class was created to handle the data. Furthermore, a vanilla autoencoder is utilized to process and forecast the voltage into the future.
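
    A minimal sketch of such a PyTorch dataset for voltage forecasting (the window length, horizon, per-node filtering and scaling details are assumptions for illustration; the notebook in the repository is the reference implementation):

    ```
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageForecastDataset(Dataset):
        """Sliding windows over the voltage columns (v1, v2, v3) of nodes.csv for one node."""

        def __init__(self, nodes_csv, node_id, window=60, horizon=10):
            df = pd.read_csv(nodes_csv, compression="gzip")
            df = df[df["id"] == node_id].sort_values("ts")
            values = torch.tensor(df[["v1", "v2", "v3"]].values, dtype=torch.float32)
            # Simple MinMax scaling to [0, 1], as in the example use case above.
            vmin, vmax = values.min(0).values, values.max(0).values
            self.values = (values - vmin) / (vmax - vmin)
            self.window, self.horizon = window, horizon

        def __len__(self):
            return len(self.values) - self.window - self.horizon + 1

        def __getitem__(self, idx):
            x = self.values[idx : idx + self.window]                                  # past window
            y = self.values[idx + self.window : idx + self.window + self.horizon]    # future target
            return x, y
    ```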

  13. MountainScape Segmentation Dataset

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 11, 2024
    Cite
    Mountain Legacy Project (2024). MountainScape Segmentation Dataset [Dataset]. http://doi.org/10.5683/SP3/CEYU10
    Explore at:
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Borealis
    Authors
    Mountain Legacy Project
    Time period covered
    Jan 1, 1870 - Aug 30, 2023
    Description

    This dataset contains the MountainScape Segmentation Dataset (MS2D), a collection of oblique mountain images from the Mountain Legacy Project and corresponding manually annotated land cover masks. The dataset is split into 144 historic grayscale images collected by early phototopographic surveyors and 140 modern repeat images captured by the Mountain Legacy Project. The image resolutions range from 16 to 80 megapixels and the corresponding masks are RGB images with 8 landcover classes. The image dataset was used to train and test the Python Landscape Classifier (PyLC), a trainable segmentation network and land cover classification tool for oblique landscape photography. The dataset also contains PyTorch models trained with PyLC using the collection of images and masks.

  14. imagenet-w21-wds

    • huggingface.co
    Updated Sep 19, 2025
    Cite
    PyTorch Image Models (2025). imagenet-w21-wds [Dataset]. https://huggingface.co/datasets/timm/imagenet-w21-wds
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset authored and provided by
    PyTorch Image Models
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Dataset Summary

    This is a copy of the full Winter21 release of ImageNet in webdataset tar format with JPEG images. This release consists of 19167 classes, 2674 fewer classes than the original 21841 class Fall11 release of the full ImageNet. The classes were removed due to these concerns: https://www.image-net.org/update-sep-17-2019.php

      Data Splits
    

    The full ImageNet dataset has no defined splits. This release follows that and leaves everything in the train split.… See the full description on the dataset page: https://huggingface.co/datasets/timm/imagenet-w21-wds.
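
    A minimal streaming sketch (an assumption: it relies on the datasets library's generic WebDataset support rather than anything documented on the card):

    from datasets import load_dataset

    # Stream the webdataset tars from the Hub without downloading everything up front.
    ds = load_dataset("timm/imagenet-w21-wds", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample.keys())  # e.g. the decoded JPEG plus its label/key fields (assumed)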

  15. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Explore at:
    Available download formats: zip (971125684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    Dataset Contents

    The dataset is organized into three standard splits:
    - Train set
    - Validation set
    - Test set

    Each split contains data in multiple formats:
    1. Original JPG images
    2. Segmentation mask JPG images
    3. Parquet files containing flattened image and mask data
    4. Pickle files containing serialized image and mask data

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
    • Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels (see the loading sketch below)
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files
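
    A minimal sketch of reading one of the parquet files directly and applying the split_at convention described above (the pandas/NumPy handling shown here is an assumption; the provided CatDataset class below is the supported loader):

    import numpy as np
    import pandas as pd

    image_size, image_channels = (224, 224), 3
    split_at = image_size[0] * image_size[1] * image_channels

    rows = pd.read_parquet("train_dataset.parquet").to_numpy()
    images = rows[:, :split_at].reshape(-1, 224, 224, 3)   # image pixel values
    masks = rows[:, split_at:].reshape(-1, 224, 224, 1)    # mask pixel values
    print(images.shape, masks.shape)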

    Image Preprocessing

    All images were preprocessed with the following operations:
    - Resized to 224×224 pixels using bilinear interpolation
    - Segmentation masks were also resized to match the images using nearest neighbor interpolation
    - Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

    When used with the provided PyTorch dataset class, images are normalized with:
    - Mean: [0.48235, 0.45882, 0.40784]
    - Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

    Loading time benchmarks from the original implementation:
    - Parquet format: ~1.29 seconds per iteration
    - Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  16. Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier

    • zenodo.org
    bin, zip
    Updated Nov 1, 2022
    Cite
    Emily Dunkel; Emily Dunkel (2022). Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier [Dataset]. http://doi.org/10.5281/zenodo.7041842
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emily Dunkel; Emily Dunkel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    We provide imagery used to train LROCNet -- our Convolutional Neural Network classifier of orbital imagery of the moon. Images are divided into train, validation, and test zip files, which contain class specific sub-folders. We have three classes: "fresh crater", "old crater", and "none". Classes are described in detail in the attached labeling guide.

    Directory Contents

    We include the labeling guide and training, testing, and validation data. Training data was split to avoid upload timeouts.

    • LROC_Labeling_Intro_for_release.ppt: Labeling guide
    • val: Validation images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • test: Testing images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • ejecta_train: Training images of "fresh crater" class
    • oldcrater_train: Training images of "old crater" class
    • none_train1-4: Training images of "none" class (divided into 4 just for uploading)

    Data Description

    We use CDR (Calibrated Data Record) browse imagery (50% resolution) from the Lunar Reconnaissance Orbiter's Narrow Angle Cameras (NACs). Data we get from the NACs are 5-km swaths, at nominal orbit, so we perform a saliency detection step to find surface features of interest. A detector developed for Mars HiRISE (Wagstaff et al.) worked well for our purposes, after updating based on LROC NAC image resolution. We use this detector to create a set of image chipouts (small 227x227 cutouts) from the larger image, sampling the lunar globe.

    Class Labeling

    We select classes of interest based on what is visible at the NAC resolution, consulting with scientists and performing a literature review. Initially, we have 7 classes: "fresh crater", "old crater", "overlapping craters", "irregular mare patches", "rockfalls and landfalls", "of scientific interest", and "none".

    Using the Zooniverse platform, we set up a labeling tool and labeled 5,000 images. We found that "fresh crater" make up 11% of the data, "old crater" 18%, with the vast majority "none". Due to limited examples of the other classes, we reduce our initial class set to: "fresh crater" (with impact ejecta), "old crater", and "none".

    We divide the images into train/validation/test sets making sure no image swaths span multiple sets.

    Data Augmentation

    Using PyTorch, we apply the following augmentation on the training set only: horizontal flip, vertical flip, rotation by 90/180/270 degrees, and brightness adjustment (0.5, 2). In addition, we use weighted sampling so that each class is weighted equally. The training set included here does not include augmentation since that was performed within PyTorch.
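
    A minimal sketch reproducing that augmentation and weighted-sampling setup in PyTorch (an approximation for illustration only: exact transform parameters are not specified beyond the description above, and the training images would first need to be reorganized into class sub-folders, as in the val/test sets):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler
    from torchvision import datasets, transforms

    train_tf = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomChoice([transforms.RandomRotation((a, a)) for a in (0, 90, 180, 270)]),
        transforms.ColorJitter(brightness=(0.5, 2.0)),
        transforms.ToTensor(),
    ])

    # Assumed layout: a "train" folder with ejecta/oldcrater/none class sub-folders.
    train_ds = datasets.ImageFolder("train", transform=train_tf)

    # Weight each sample inversely to its class frequency so classes are sampled equally.
    targets = torch.tensor(train_ds.targets)
    counts = torch.bincount(targets)
    weights = 1.0 / counts[targets].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

    loader = DataLoader(train_ds, batch_size=64, sampler=sampler)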

    Acknowledgements

    The author would like to thank the volunteers who provided annotations for this data set, as well as others who contributed to this work (as in the Contributor list). We would also like to thank the PDS Imaging Node for support of this work.

    The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    CL#22-4763

    © 2022 California Institute of Technology. Government sponsorship acknowledged.

  17. Trained Weights for DRIVE Train/Validation Split

    • kaggle.com
    zip
    Updated Feb 21, 2023
    Cite
    Sovit Ranjan Rath (2023). Trained Weights for DRIVE Train/Validation Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/trained-weights-for-drive-trainvalidation-split
    Explore at:
    Available download formats: zip (3795490926 bytes)
    Dataset updated
    Feb 21, 2023
    Authors
    Sovit Ranjan Rath
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Trained weights for the Retinal Vessel Segmentation dataset which can be found here.

    Trained weights: * DeepLabV3 ResNet50 (trained with 512x512 and 768x768 resolution images) * DeepLabV3 ResNet101 (trained with 512x512 and 640x640 resolution images)

    Accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

  18. Leaf Disease Segmentation with Train/Valid Split

    • kaggle.com
    zip
    Updated Feb 12, 2023
    Cite
    Sovit Ranjan Rath (2023). Leaf Disease Segmentation with Train/Valid Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/leaf-disease-segmentation-with-trainvalid-split/discussion
    Explore at:
    Available download formats: zip (528293844 bytes)
    Dataset updated
    Feb 12, 2023
    Authors
    Sovit Ranjan Rath
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a leaf disease dataset for semantic segmentation. The dataset contains a collection of different diseases, but the diseases have not been segregated into separate classes: diseased leaf regions are one class and the background is another.

    The dataset contains two folders. One with original images and another with augmented images. Both formats have a train/validation split for easier experimentation.

    Find the corresponding blog post here: Leaf Disease Segmentation using PyTorch DeepLabV3 (https://debuggercafe.com/leaf-disease-segmentation-using-pytorch-deeplabv3/)

    Roughly, in both cases, 15% is reserved for validation and the rest for training.

    The images have been taken from the PlantDoc dataset.

    Original dataset => https://www.kaggle.com/datasets/fakhrealam9537/leaf-disease-segmentation-dataset

  19. Sentence/Table Pair Data from Wikipedia for Pre-training with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Explore at:
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    Google Research
    The Ohio State University
    Authors
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

    table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)      # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)        # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them with your preferred data format

    Below we show how the data is organized with two examples.

    Text-only

    {
      's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
      's1_all_links': {
        'Sils,_Girona': [[0, 4]],
        'municipality': [[10, 22]],
        'Comarques_of_Catalonia': [[30, 37]],
        'Selva': [[41, 46]],
        'Catalonia': [[51, 60]]
      },  # list of entities and their mentions in the sentence (start, end location)
      'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pairs
        {
          'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
          's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
          's2s': [  # list of other sentences that contain the common entity pair, i.e. evidence
            {
              'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
              'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
              's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
              'pair_locs': [  # mentions of the entity pair in the evidence
                [[19, 27]],  # mentions of entity 1
                [[0, 5], [288, 293]]  # mentions of entity 2
              ],
              'all_links': {
                'Selva': [[0, 5], [288, 293]],
                'Comarques_of_Catalonia': [[19, 27]],
                'Catalonia': [[40, 49]]
              }
            },
            ...  # there are multiple evidence sentences
          ]
        },
        ...  # there are multiple entity pairs in the query
      ]
    }

    Hybrid

    {
      's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
      's1_all_links': {...},  # same as text-only
      'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
      'table_pairs': [
        'tid': 'Major_League_Baseball-1',
        'text': [
          ['World Series Records', 'World Series Records', ...],
          ['Team', 'Number of Series won', ...],
          ['St. Louis Cardinals (NL)', '11', ...],
          ...
        ],  # table content, list of rows
        'index': [
          [[0, 0], [0, 1], ...],
          [[1, 0], [1, 1], ...],
          ...
        ],  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
        'value_ranks': [
          [0, 0, ...],
          [0, 0, ...],
          [0, 10, ...],
          ...
        ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
        'value_inv_ranks': [],  # inverse rank
        'all_links': {
          'St._Louis_Cardinals': {
            '2': [
              [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
            ]  # list of mentions in the second row, the key is row_id
          },
          'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
        },
        'name': '',  # table name, if it exists
        'pairs': {
          'pair': ['American_League', 'National_League'],
          's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
          'table_pair_locs': {
            '17': [  # mention of entity pair in row 17
              [
                [[17, 0], [3, 18]],
                [[17, 1], [3, 18]],
                [[17, 2], [3, 18]],
                [[17, 3], [3, 18]]
              ],  # mentions of the first entity
              [
                [[17, 0], [21, 36]],
                [[17, 1], [21, 36]],
              ]  # mentions of the second entity
            ]
          }
        }
      ]
    }

  20. 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10925350
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    This repository contains the 3DO dataset proposed in [1].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

    Dataset Description

    The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

    The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

    Dataset Structure:

    /3DO
    ├── d1 <-- day 1 subdirectory
    │   └── w1 <-- sequence subdirectory
    │       ├── csiposreg.csv <-- raw WiFi packet time series
    │       └── csiposreg_complex.npy <-- CSI time series cache
    ├── d2 <-- day 2 subdirectory
    └── d3 <-- day 3 subdirectory
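
    A minimal sketch for inspecting one sequence directory (the array handling here is an assumption; the dataloader in the linked repository is the reference implementation):

    import numpy as np
    import pandas as pd

    seq_dir = "3DO/d1/w1"  # any sequence directory from the structure above

    packets = pd.read_csv(f"{seq_dir}/csiposreg.csv")   # raw WiFi packet time series with labels
    csi = np.load(f"{seq_dir}/csiposreg_complex.npy")   # complex CSI cache (regenerated by the dataloader if missing)
    print(packets.shape, csi.shape, csi.dtype)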

    In [1], we use the following training, validation, and test split:

    Subset | Day | Sequences
    -------|-----|----------
    Train  | 1   | w1, w2, w3, s1, s2, s3, l1, l2, l3
    Val    | 1   | w4, s4, l4
    Test   | 1   | w5, s5, l5
    Test   | 2   | w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
    Test   | 3   | w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

    w = walking, s = sitting, l = lying

    Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

    BibTeX citation:

    @inproceedings{strohmayerOn2025,
      author    = "Strohmayer, Julian and Kampel, Martin",
      title     = "On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
      booktitle = "Pattern Recognition",
      year      = "2025",
      publisher = "Springer Nature Switzerland",
      address   = "Cham",
      pages     = "194--211",
      isbn      = "978-3-031-78354-8"
    }
