47 datasets found
  1. Pytorch Models

    • kaggle.com
    zip
    Updated May 10, 2025
    Cite
    Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data
    Explore at:
    Available download formats: zip (21493 bytes)
    Dataset updated
    May 10, 2025
    Authors
    Sufian Othman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ✅ Step 1: Mount to Dataset

    Search for my dataset pytorch-models and add it — this will mount it at:

    /kaggle/input/pytorch-models/

    ✅ Step 2: Check file paths Once mounted, the four files will be available at:

    /kaggle/input/pytorch-models/base_models.py
    /kaggle/input/pytorch-models/ext_base_models.py
    /kaggle/input/pytorch-models/ext_hybrid_models.py
    /kaggle/input/pytorch-models/hybrid_models.py
    

    ✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):

    import shutil
    
    shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
    

    ✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:

    import base_models
    import ext_base_models
    import ext_hybrid_models
    import hybrid_models
    

    Or, if you only want to import specific classes or functions:

    from base_models import YourModelClass
    from ext_base_models import AnotherModelClass
    

    ✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:

    model = base_models.YourModelClass()
    output = model(input_data)
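
    As an alternative to copying in Step 3 (a minimal sketch, not part of the original instructions), you can append the input directory to sys.path so the modules are importable in place:

    import sys

    sys.path.append('/kaggle/input/pytorch-models')
    import base_models  # now resolvable without copying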
    
  2. Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST...

    • data.niaid.nih.gov
    Updated Jun 13, 2022
    Cite
    Schürholt, Konstantin; Taskiran, Diyar; Knyazev, Boris; Giró-i-Nieto, Xavier; Borth, Damian (2022). Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6632086
    Explore at:
    Dataset updated
    Jun 13, 2022
    Dataset provided by
    Image Processing Group, Universitat Politècnica de Catalunya
    AI Lab Montreal, Samsung Advanced Institute of Technology
    AIML Lab, University of St.Gallen
    Authors
    Schürholt, Konstantin; Taskiran, Diyar; Knyazev, Boris; Giró-i-Nieto, Xavier; Borth, Damian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    In recent years, neural networks have evolved from laboratory environments to the state of the art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. It follows that a population of such neural network models (referred to as a "model zoo") would form topological structures in weight space. We think that the geometry, curvature and smoothness of these structures contain information about the state of training and can reveal latent properties of individual models. With such zoos, one could investigate novel approaches for (i) model analysis, (ii) discovering unknown learning dynamics, (iii) learning rich representations of such populations, or (iv) exploiting the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research on populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total, the proposed model zoo dataset is based on six image datasets, consists of 24 model zoos generated with varying hyperparameter combinations, and includes 47,360 unique neural network models resulting in over 2,415,360 collected model states. In addition to the model zoo data, we provide an in-depth analysis of the zoos and benchmarks for multiple downstream tasks, as mentioned before.

    Dataset

    This dataset is part of a larger collection of model zoos and contains the zoos trained on the labelled samples from MNIST. All zoos with extensive information and code can be found at www.modelzoos.cc.

    This repository contains two types of files: the raw model zoos as collections of models (file names beginning with "mnist_"), as well as preprocessed model zoos wrapped in a custom pytorch dataset class (filenames beginning with "dataset"). Zoos are trained in three configurations varying the seed only (seed), varying hyperparameters with fixed seeds (hyp_fix) or varying hyperparameters with random seeds (hyp_rand). The index_dict.json files contain information on how to read the vectorized models.
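
    For instance, the index information can be inspected with the standard library (a minimal sketch; the exact key layout is documented at www.modelzoos.cc):

    import json

    # index_dict.json describes how to read the vectorized models
    with open('index_dict.json') as f:
        index_dict = json.load(f)
    print(list(index_dict.keys()))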

    For more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.

  3. Oxford 102 Flower Dataset

    • kaggle.com
    zip
    Updated May 26, 2021
    Cite
    Lalu Erfandi Maula Yusnu (2021). Oxford 102 Flower Dataset [Dataset]. https://www.kaggle.com/nunenuh/pytorch-challange-flower-dataset
    Explore at:
    Available download formats: zip (346507679 bytes)
    Dataset updated
    May 26, 2021
    Authors
    Lalu Erfandi Maula Yusnu
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    We have created a dataset of 102 flower categories. The flowers chosen are those commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images. The details of the categories and the number of images for each class can be found on this category statistics page.

    The images have large scale, pose and light variations. In addition, there are categories that have large variations within the category and several very similar categories. The dataset is visualized using isomap with shape and colour features.

    Directory Structure

    > dataset
      > train
      > valid
      > test
    - cat_to_name.json
    - README.md
    - sample_submission.csv
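
    Given this layout, a minimal PyTorch loading sketch (assuming train/ and valid/ contain one subfolder per class, and using cat_to_name.json to map category labels to flower names):

    import json

    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    train_data = datasets.ImageFolder('dataset/train', transform=transform)

    with open('cat_to_name.json') as f:
        cat_to_name = json.load(f)  # category id -> flower name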
    

    Visualization of the dataset

    We visualize the categories in the dataset using SIFT features as shape descriptors and HSV as colour descriptor. The images are randomly sampled from the category.

    [Isomap visualization: https://i.imgur.com/Tl6TKUC.png]

    Publications

    Nilsback, M-E. and Zisserman, A. Automated flower classification over a large number of classes
    Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing (2008)


  4. bigearthnet

    • huggingface.co
    Updated Jul 13, 2024
    Cite
    Luca Colomba (2024). bigearthnet [Dataset]. https://huggingface.co/datasets/lc-col/bigearthnet
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2024
    Authors
    Luca Colomba
    Description

    BigEarthNet - HDF5 version

    This repository contains an export of the existing BigEarthNet dataset in HDF5 format. All Sentinel-2 acquisitions are exported according to TorchGeo's dataset (120x120 pixel resolution). Sentinel-1 is not contained in this repository for the moment. For each satellite acquisition, the CSV files give the corresponding HDF5 file and index. A PyTorch dataset class which can be used to iterate over this dataset can be found here, as well as the script used… See the full description on the dataset page: https://huggingface.co/datasets/lc-col/bigearthnet.
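
    As an illustrative sketch only (the real file and key names must be taken from the CSV index described above, which maps each acquisition to its HDF5 file and index):

    import h5py

    # 'patches.h5' and the key layout are placeholders, not the actual file names
    with h5py.File('patches.h5', 'r') as f:
        print(list(f.keys()))  # inspect the stored datasets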

  5. cifar-100-python

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    ThanhTan (2024). cifar-100-python [Dataset]. https://www.kaggle.com/datasets/duongthanhtan/cifar-100-python
    Explore at:
    Available download formats: zip (168517675 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    ThanhTan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CIFAR-100 Dataset

    1. Overview

    • CIFAR-100 is an extension of the CIFAR-10 dataset, with more classes and finer-grained categorization.
    • It contains 100 classes, making it more challenging than CIFAR-10, which has only 10 classes.
    • Each image in CIFAR-100 is labeled with both a fine label (specific category) and a coarse label (broader category, such as animals or vehicles).

    2. Dataset Details

    • Number of Images: 60,000 color images in total.
      • 50,000 for training.
      • 10,000 for testing.
    • Image Size: Each image is a small 32x32 pixel RGB (color) image.
    • Classes: 100 classes, grouped into 20 superclasses.
      • Each superclass contains 5 related classes.

    3. Fine and Coarse Labels

    • Fine Labels: The dataset has specific categories, such as 'apple', 'bicycle', 'rose', etc.
    • Coarse Labels: These are broader categories, like 'fruit', 'flower', 'vehicle', etc.

    4. Applications

    • Image Classification: Used for training models to classify images into their respective categories.
    • Feature Extraction: Useful for benchmarking feature extraction techniques in computer vision.
    • Transfer Learning: Often used to pre-train models for other similar tasks.
    • Deep Learning Research: Commonly used to test architectures like CNNs (Convolutional Neural Networks).

    5. Challenges

    • The images are very small (32x32 pixels), making it harder for models to learn intricate details.
    • High class count (100) increases classification complexity.
    • Intra-class variability and inter-class similarity make it a challenging dataset for classification.

    6. File Format

    • The dataset is usually available in Python-friendly formats like .pkl or .npz.
    • It can also be downloaded and loaded using frameworks like TensorFlow or PyTorch.
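
    For instance, the pickled training split can be read directly (a sketch assuming the standard cifar-100-python archive layout):

    import pickle

    with open('cifar-100-python/train', 'rb') as f:
        batch = pickle.load(f, encoding='latin1')

    data = batch['data']                 # uint8 array of shape (50000, 3072)
    fine_labels = batch['fine_labels']   # 100 specific categories
    coarse_labels = batch['coarse_labels']  # 20 superclasses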

    7. Example Classes

    Some example classes include:
    • Animals: beaver, dolphin, otter, elephant, snake.
    • Plants: apple, orange, mushroom, palm tree, pine tree.
    • Vehicles: bicycle, bus, motorcycle, train, rocket.
    • Everyday Objects: clock, keyboard, lamp, table, chair.

  6. MIEDT dataset

    • kaggle.com
    Updated Jan 12, 2025
    Cite
    机关鸢鸟 (2025). MIEDT dataset [Dataset]. https://www.kaggle.com/datasets/lidang78/miedt-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    机关鸢鸟
    Description
    1. Dataset Overview: This dataset is organized around the edge detection task and aims to provide rich image resources with corresponding edge detection annotations for related research and applications; it can be used for testing edge detection algorithms. To comprehensively evaluate the performance of edge detection methods, we created the Medical Image Edge Detection Test (MIEDT) dataset. MIEDT contains 100 medical images, randomly selected from three publicly available datasets: Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000.
    2. Dataset Structure: Original image: This folder stores the original image data. It contains 15 Head CT images in PNG format with varying resolutions; 25 coronary heart disease images in JPG format with a resolution of 1024 x 1024; and 60 skin images in JPG format with a resolution of 600 x 450. It covers a variety of medical image materials with different imaging and contrast, providing diverse input data for edge detection algorithms. Ground truth: The data in this folder are the edge detection annotation images corresponding to the images in the "Originals" folder, in PNG format. In these images, white pixels represent the edges and black pixels represent non-edge areas. These annotations accurately outline the object contours and edge features in the original images.
    3. Usage Instructions: Users who process images in Python can use the cv2 (OpenCV) library to read the image data. Sample code follows:

    import cv2

    original_image = cv2.imread('Original image/IMG-001.png')  # Read original image
    ground_truth_image = cv2.imread('Ground truth/GT-001.png', cv2.IMREAD_GRAYSCALE)  # Read the corresponding Ground Truth image

    When performing model training with deep learning frameworks (such as TensorFlow or PyTorch), configure the dataset path in the corresponding dataset loading class, following the framework's data loading mechanism, so that the model can correctly read and process the images and their annotations.
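
    When wiring the dataset into a framework, a minimal PyTorch Dataset sketch could look like the following (MIEDTDataset is a hypothetical helper, not part of the dataset release; the IMG-/GT- naming follows the sample code above):

    import os
    import cv2
    from torch.utils.data import Dataset

    class MIEDTDataset(Dataset):
      def __init__(self, root):
        self.orig_dir = os.path.join(root, 'Original image')
        self.gt_dir = os.path.join(root, 'Ground truth')
        self.names = sorted(os.listdir(self.orig_dir))

      def __len__(self):
        return len(self.names)

      def __getitem__(self, idx):
        name = self.names[idx]  # e.g. 'IMG-001.png'
        image = cv2.imread(os.path.join(self.orig_dir, name))
        # assumes GT files mirror the originals as 'GT-001.png'
        gt_name = name.replace('IMG', 'GT')
        gt = cv2.imread(os.path.join(self.gt_dir, gt_name), cv2.IMREAD_GRAYSCALE)
        return image, gt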

    4. Data Sources and References: The original images are collected from the public datasets Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000, ensuring the quality and diversity of the images. If you use this dataset in academic research, please cite the following literature.

    References: [1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368

    [2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).

    [3] Classification of Brain Hemorrhage Using Deep Learning from CT Scan Images - https://link.springer.com/chapter/10.1007/978-981-19-7528-8_15

  7. Data from: Deep learning neural network derivation and testing to...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    png
    Updated Aug 8, 2023
    Cite
    Omid Mehrpour; Christopher Hoyte; Abdullah Al Masud; Ashis Biswas; Jonathan Schimmel; Samaneh Nakhaee; Mohammad Sadegh Nasr; Heather Delva-Clark; Foster Goss (2023). Deep learning neural network derivation and testing to distinguish acute poisonings [Dataset]. http://doi.org/10.6084/m9.figshare.23694504.v1
    Explore at:
    Available download formats: png
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Omid Mehrpour; Christopher Hoyte; Abdullah Al Masud; Ashis Biswas; Jonathan Schimmel; Samaneh Nakhaee; Mohammad Sadegh Nasr; Heather Delva-Clark; Foster Goss
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Acute poisoning is a significant global health burden, and the causative agent is often unclear. The primary aim of this pilot study was to develop a deep learning algorithm that predicts the most probable agent a poisoned patient was exposed to from a pre-specified list of drugs. Data were queried from the National Poison Data System (NPDS) from 2014 through 2018 for eight single-agent poisonings (acetaminophen, diphenhydramine, aspirin, calcium channel blockers, sulfonylureas, benzodiazepines, bupropion, and lithium). Two deep neural networks (PyTorch and Keras) designed for multi-class classification tasks were applied. There were 201,031 single-agent poisonings included in the analysis. For distinguishing among the selected poisonings, the PyTorch model had a specificity of 97%, accuracy of 83%, precision of 83%, recall of 83%, and an F1-score of 82%. The Keras model had a specificity of 98%, accuracy of 83%, precision of 84%, recall of 83%, and an F1-score of 83%. The best performance was achieved in diagnosing single-agent poisoning by lithium, sulfonylureas, diphenhydramine, calcium channel blockers, and acetaminophen, in PyTorch (F1-scores = 99%, 94%, 85%, 83%, and 82%, respectively) and Keras (F1-scores = 99%, 94%, 86%, 82%, and 82%, respectively). Deep neural networks can potentially help distinguish the causative agent of acute poisoning. This study used a small list of drugs, with polysubstance ingestions excluded. Reproducible source code and results can be obtained at https://github.com/ashiskb/npds-workspace.git.

  8. SimCATS_GaAs_v1_random_variations_v2

    • resodate.org
    Updated Oct 9, 2024
    Cite
    Fabian Hader; Fabian Fuchs; Sarah Fleitmann (2024). SimCATS_GaAs_v1_random_variations_v2 [Dataset]. http://doi.org/10.26165/JUELICH-DATA/5PB3GT
    Explore at:
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Forschungszentrum Jülich (http://www.fz-juelich.de/)
    Peter Grünberg Institute - Integrated Computing Architectures (ICA/PGI-4)
    Authors
    Fabian Hader; Fabian Fuchs; Sarah Fleitmann
    Description

    Dataset: SimCATS_GaAs_v1_random_variations_v2

    Simulated data from the geometric SimCATS model (GitHub Repository, Paper) for benchmarking of semiconductor quantum dot tuning algorithms. Generated using this Jupyter Notebook and used for the final evaluation in Automated Charge Transition Detection in Quantum Dot Charge Stability Diagrams.

    Key Facts

    • Contains pink, white & random telegraph noise, transition blurring, and dot jumps
    • Random variations of charge transitions, sensor, and distortions
    • 1,000 randomly sampled configurations with 100 CSDs each (in total: 100,000 CSDs)

    Usage

    To load the data, e.g. for calculating metrics, please have a look at SimCATS-Datasets (GitHub Repository, ReadTheDocs). The dataset can be loaded as numpy arrays using the function load_dataset or as a PyTorch Dataset (for machine learning purposes) using the class SimcatsDataset.

  9. GISE-51

    • zenodo.org
    application/gzip, txt
    Updated Apr 13, 2021
    Cite
    Sarthak Yadav; Sarthak Yadav; Mary Ellen Foster; Mary Ellen Foster (2021). GISE-51 [Dataset]. http://doi.org/10.5281/zenodo.4593514
    Explore at:
    Available download formats: application/gzip, txt
    Dataset updated
    Apr 13, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sarthak Yadav; Sarthak Yadav; Mary Ellen Foster; Mary Ellen Foster
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible using existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.

    Citation

    If you use the GISE-51 dataset and/or the released code, please cite our paper:

    Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021

    Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

    About GISE-51 and GISE-51-Mixtures

    The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.

    GISE-51

    • Three subsets: train, val and eval with 12,465, 1,716, and 2,176 utterances, respectively. Subsets are consistent with the FSD50K release.
    • Encompasses 51 sound classes from the FSD50K release
    • View meta/lbl_map.csv for the complete vocabulary.
    • The dataset was obtained from FSD50K using the following steps:
      • Unsmearing annotations to obtain single instances with a single label using the provided metadata and ground truth in FSD50K.
      • Manual inspection to qualitatively evaluate shortlisted utterances.
      • Volume-threshold based automated silence filtering using sox. Different volume thresholds are selected for various sound event class bins using trial-and-error. silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.
      • Re-evaluate sound event classes, removing ones with too few samples and merging those with high inter-class ambiguity.

    GISE-51-Mixtures

    • Synthetic 5-second soundscapes with up to 3 events created using Scaper.
    • Weighted sampling with replacement for sound event selection, effectively oversampling events with very few samples. The synthetic soundscapes thus generated have a near-equal number of annotations per sound event.
    • The number of soundscapes in val and eval set is 10000 each.
    • The number of soundscapes in the final train set is 60000. We do provide training sets with 5k-100k soundscapes.
    • GISE-51-Mixtures is our proposed subset that can be used to benchmark the performance of future works.

    LICENSE

    All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.

    Baselines

    Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.

    Files

    GISE-51 is available as a collection of several tar archives. All audio files are PCM 16-bit, 22050 Hz. The following lists the contents of these files in detail:

    • isolated_events.tar.gz: The core GISE-51 isolated events dataset containing train, val and eval subfolders.
    • meta.tar.gz: contains lbl_map.json
    • noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation
    • mixtures_jams.tar.gz: This file contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz, can be reused to generate the exact GISE-51-Mixtures soundscapes; see the sketch after this list. (Optional: we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)
    • train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.
    • val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.
    • eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.
    • train_*.tar.gz: These are tar archives containing training mixtures with varying numbers of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance versus the number of training soundscapes. A helper script, prepare_mixtures_lmdb.sh, is provided in the code release to prepare data for the experiments in Section 4.1.
    • pretrained-models.tar.gz: Contains model checkpoints for all experiments conducted in the paper. More information on these checkpoints can be found in the code release README.
      • experiments_60k_mixtures: model checkpoints from section 4.2 of the paper.
      • exported_weights_60k: ResNet-18 and EfficientNet-B1 exported as plain state_dicts for use with transfer learning experiments.
      • experiments_audioset: checkpoints from AudioSet Balanced (Sec 4.3.1) experiments
      • experiments_vggsound: checkpoints from Section 4.3.2 of the paper
      • experiments_esc50: ESC-50 dataset checkpoints, from Section 4.3.3
    • license.tar.gz: contains dataset license info.
    • silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.
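
    Regenerating mixtures from the .jams annotations can be sketched with Scaper (a hedged example: the file name and the fg/bg directories, standing for the extracted isolated_events and noises archives, are placeholders; the authors' exact invocation is in their code release):

    import scaper

    # regenerate one 5-second soundscape from its annotation file
    scaper.generate_from_jams('mixture_00001.jams', 'mixture_00001.wav',
                              fg_path='isolated_events/train', bg_path='noises')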

    Contact

    In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)

  10. Dataset for class comment analysis

    • data.niaid.nih.gov
    Updated Feb 22, 2022
    Cite
    Pooja Rani (2022). Dataset for class comment analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4311838
    Explore at:
    Dataset updated
    Feb 22, 2022
    Dataset provided by
    University of Bern
    Authors
    Pooja Rani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A list of different projects selected to analyze class comments (available in the source code) of various languages such as Java, Python, and Pharo. The projects vary in terms of size, contributors, and domain.

    Structure

    Projects/
      Java_projects/
        eclipse.zip
        guava.zip
        guice.zip
        hadoop.zip
        spark.zip
        vaadin.zip
    
      Pharo_projects/
        images/
          GToolkit.zip
          Moose.zip
          PetitParser.zip
          Pillar.zip
          PolyMath.zip
          Roassal2.zip
          Seaside.zip
    
        vm/
          70-x64/Pharo
    
        Scripts/
          ClassCommentExtraction.st
          SampleSelectionScript.st    
    
      Python_projects/
        django.zip
        ipython.zip
        Mailpile.zip
        pandas.zip
        pipenv.zip
        pytorch.zip   
        requests.zip 
      
    

    Contents of the Replication Package

    Projects/ contains the raw projects of each language that are used to analyze class comments.

    • Java_projects/
      • eclipse.zip - Eclipse project downloaded from GitHub. More detail about the project is available on GitHub Eclipse.
      • guava.zip - Guava project downloaded from GitHub. More detail about the project is available on GitHub Guava.
      • guice.zip - Guice project downloaded from GitHub. More detail about the project is available on GitHub Guice.
      • hadoop.zip - Apache Hadoop project downloaded from GitHub. More detail about the project is available on GitHub Apache Hadoop.
      • spark.zip - Apache Spark project downloaded from GitHub. More detail about the project is available on GitHub Apache Spark.
      • vaadin.zip - Vaadin project downloaded from GitHub. More detail about the project is available on GitHub Vaadin.

    • Pharo_projects/

      • images/ -

        • GToolkit.zip - Gtoolkit project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • Moose.zip - Moose project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • PetitParser.zip - Petit Parser project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • Pillar.zip - Pillar project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • PolyMath.zip - PolyMath project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • Roassal2.zip - Roassal2 project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
        • Seaside.zip - Seaside project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.
      • vm/ -

      • 70-x64/Pharo - Pharo 7 (version 7 of Pharo) virtual machine to instantiate the Pharo images given in the images/ folder. The user can run the vm on macOS and select any of the Pharo images.

      • Scripts/ - It contains the sample Smalltalk scripts to extract class comments from various projects.

      • ClassCommentExtraction.st - A Smalltalk script to show how class comments are extracted from various Pharo projects. This script is already provided in the respective project image.

      • SampleSelectionScript.st - A Smalltalk script to show how sample class comments of Pharo projects are selected. This script can be run in any of the Pharo images given in the images/ folder.

    • Python_projects/

      • django.zip - Django project downloaded from GitHub. More detail about the project is available on GitHub Django.
      • ipython.zip - IPython project downloaded from GitHub. More detail about the project is available on GitHub IPython.
      • Mailpile.zip - Mailpile project downloaded from GitHub. More detail about the project is available on GitHub Mailpile.
      • pandas.zip - pandas project downloaded from GitHub. More detail about the project is available on GitHub pandas.
      • pipenv.zip - Pipenv project downloaded from GitHub. More detail about the project is available on GitHub Pipenv.
      • pytorch.zip - PyTorch project downloaded from GitHub. More detail about the project is available on GitHub PyTorch.
      • requests.zip - Requests project downloaded from GitHub. More detail about the project is available on GitHub Requests.
  11. Genomics OOD

    • kaggle.com
    • tensorflow.org
    zip
    Updated Mar 31, 2021
    Cite
    Sven Elflein (2021). Genomics OOD [Dataset]. https://www.kaggle.com/svenel/genomics-ood
    Explore at:
    Available download formats: zip (2282016677 bytes)
    Dataset updated
    Mar 31, 2021
    Authors
    Sven Elflein
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Bacteria Genomics OOD dataset

    This dataset implements a PyTorch dataset for the Genomics OOD dataset proposed in

    J. Ren et al., “Likelihood Ratios for Out-of-Distribution Detection,” arXiv:1906.02845 [cs, stat], Available: http://arxiv.org/abs/1906.02845.

    Code can be found on GitHub.

    The dataset contains, for each input sample:
    • A sequence of 250 integers, where each number is from {0, 1, 2, 3} indicating {A, C, G, T}.
    • A class label, ranging from 0 to 129, for the bacteria class.
    • A string noting where the sequence comes from.
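
    For illustration, decoding an integer sequence back to bases is a one-liner:

    # map {0, 1, 2, 3} -> {A, C, G, T}
    bases = 'ACGT'
    sequence = [0, 2, 1, 3]  # toy example; real sequences have length 250
    print(''.join(bases[i] for i in sequence))  # prints 'AGCT'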

    In total there are 5 splits: train, validation, and test splits with 10 in-distribution classes, plus validation and test out-of-distribution sets with 60 classes each.

    The dataset with generated indices can be downloaded via the Releases.

    Attribution

    The original dataset was released by

    Jie Ren, Google Research, 05/23/2019, jjren@google.com

    Following the CC BY 4.0 International license of the original, this version is released and distributed under the CC BY 4.0 license as well. The original dataset can be found here.

  12. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Explore at:
    Available download formats: zip (807146485 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

    [Class distribution chart: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media]

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for: - Developing and testing strategies for handling imbalanced datasets. - Investigating the effects of class imbalance on model performance. - Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder. You can load the dataset in Python using popular libraries like NumPy and PyTorch.
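
    For example, a sketch of loading the data with ImageFolder and counteracting the imbalance with a weighted sampler (the 'train' path is a placeholder for the extracted training folder):

    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler
    from torchvision import datasets, transforms

    train_set = datasets.ImageFolder('train', transform=transforms.ToTensor())
    counts = Counter(train_set.targets)  # images per class
    weights = [1.0 / counts[t] for t in train_set.targets]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights))
    loader = DataLoader(train_set, batch_size=64, sampler=sampler)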

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  13. Data from: Self-Supervised Representation Learning on Neural Network Weights...

    • data.niaid.nih.gov
    Updated Nov 13, 2021
    Cite
    Schürholt, Kontantin; Kostadinov, Dimche; Borth, Damian (2021). Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction - Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5645137
    Explore at:
    Dataset updated
    Nov 13, 2021
    Dataset provided by
    University of St.Gallen
    Authors
    Schürholt, Kontantin; Kostadinov, Dimche; Borth, Damian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets to NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".

    Datasets are PyTorch files containing a dictionary with training, validation and test sets. The train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code can be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.
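
    A minimal sketch of that layout (the file name and dictionary keys are assumptions; the custom dataset classes from the linked repository must be importable before unpickling):

    import torch

    data = torch.load('dataset.pt')  # placeholder file name
    trainset = data['trainset']      # assumed keys for the train/val/test entries
    valset = data['valset']
    testset = data['testset']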

    Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al., 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy).

    Abstract: Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.

  14. Cleaned ISIC Skin Cancer Dataset (6 Classes)

    • kaggle.com
    zip
    Updated Feb 10, 2025
    Cite
    Aayyyyyyuuussshhh (2025). Cleaned ISIC Skin Cancer Dataset (6 Classes) [Dataset]. https://www.kaggle.com/datasets/aayyyyyyuuussshhh/cleaned-isic-skin-cancer-dataset-6-classes
    Explore at:
    Available download formats: zip (538458444 bytes)
    Dataset updated
    Feb 10, 2025
    Authors
    Aayyyyyyuuussshhh
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains dermatoscopic images of skin lesions organized into six classes:

    • Melanoma
    • Nevus (Mole)
    • Basal Cell Carcinoma
    • Actinic Keratosis
    • Benign Keratosis
    • Vascular Lesion

    The dataset has been preprocessed to remove duplicate images and ensure consistency between the training and test sets. It is structured into train and test folders, with subfolders for each class. This makes it ready for use in machine learning and deep learning projects.

    Key Features:
    • Total Images: 1888 (1820 train, 68 test)
    • Classes: 6
    • Image Size: Variable (can be resized during preprocessing)
    • Preprocessing: Duplicate images removed using perceptual hashing.

    Use Case: This dataset is ideal for training and evaluating models for skin cancer classification. It can be used with frameworks like TensorFlow, PyTorch, or scikit-learn. The cleaned structure ensures that the dataset is free from duplicates and ready for immediate use.

    Acknowledgments: The original dataset was sourced from the International Skin Imaging Collaboration (ISIC). Cleaning and preprocessing were performed to remove duplicates and prepare the dataset for modeling. Please refer to the ISIC website for more information about the original dataset: ISIC Archive.

    License: This dataset is derived from the ISIC dataset and is made available under the CC BY-NC-SA license. Any use of this dataset must comply with the original licensing terms, including non-commercial use and attribution.

  15. Sentence/Table Pair Data from Wikipedia for Pre-training with...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Oct 29, 2021
    Cite
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun; Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. http://doi.org/10.5281/zenodo.5612316
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun; Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)

    table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds
    
    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
      wds.Dataset(url)
      .shuffle(1000) # cache 1000 samples and shuffle
      .decode()
      .to_tuple("json")
      .batched(20) # group every 20 examples into a batch
    )
    
    # Please see the WebDataset documentation for more details about using it as a dataloader for PyTorch
    # You can also iterate through all examples and dump them with your preferred data format

    Below we show how the data is organized with two examples.

    Text-only

    {'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence
     's1_all_links': {
      'Sils,_Girona': [[0, 4]],
      'municipality': [[10, 22]],
      'Comarques_of_Catalonia': [[30, 37]],
      'Selva': [[41, 46]],
      'Catalonia': [[51, 60]]
     }, # list of entities and their mentions in the sentence (start, end location)
     'pairs': [ # other sentences that share a common entity pair with the query, grouped by shared entity pairs
      {
        'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair
        's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query
        's2s': [ # list of other sentences that contain the common entity pair, or evidence
         {
           'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [ # mentions of the entity pair in the evidence
            [[19, 27]], # mentions of entity 1
            [[0, 5], [288, 293]] # mentions of entity 2
           ],
           'all_links': {
            'Selva': [[0, 5], [288, 293]],
            'Comarques_of_Catalonia': [[19, 27]],
            'Catalonia': [[40, 49]]
           }
          }
        ,...] # there are multiple evidence sentences
       },
     ,...] # there are multiple entity pairs in the query
    }

    Hybrid

    {'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
     's1_all_links': {...}, # same as text-only
     'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only
     'table_pairs': [
      'tid': 'Major_League_Baseball-1',
      'text':[
        ['World Series Records', 'World Series Records', ...],
        ['Team', 'Number of Series won', ...],
        ['St. Louis Cardinals (NL)', '11', ...],
      ...] # table content, list of rows
      'index':[
        [[0, 0], [0, 1], ...],
        [[1, 0], [1, 1], ...],
      ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
      'value_ranks':[
        [0, 0, ...],
        [0, 0, ...],
        [0, 10, ...],
      ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS
      'value_inv_ranks': [], # inverse rank
      'all_links':{
        'St._Louis_Cardinals': {
         '2': [
          [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]]
         ] # list of mentions in the second row, the key is row_id
        },
        'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
      }
      'name': '', # table name, if exists
      'pairs': {
        'pair': ['American_League', 'National_League'],
        's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query
        'table_pair_locs': {
         '17': [ # mention of entity pair in row 17
           [
            [[17, 0], [3, 18]],
            [[17, 1], [3, 18]],
            [[17, 2], [3, 18]],
            [[17, 3], [3, 18]]
           ], # mention of the first entity
           [
            [[17, 0], [21, 36]],
            [[17, 1], [21, 36]],
           ] # mention of the second entity
         ]
        }
       }
     ]
    }

  16. phishing-email-classifier-bert

    • kaggle.com
    zip
    Updated Jun 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ivan Piiashev (2024). phishing-email-classifier-bert [Dataset]. https://www.kaggle.com/datasets/ivan314sh/phishing-email-classifier-bert
    Explore at:
    Available download formats: zip (439956757 bytes)
    Dataset updated
    Jun 28, 2024
    Authors
    Ivan Piiashev
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains a classifier save, a tokenizer save, and the encoded Phishing Email Dataset!

    Encoded dataset

    Directory: scam-email-classifier-bert-uncased

    Contains preprocessed data (special characters removed; URLs and email addresses encoded as special tokens) for phishing email classification using BERT. It has been tokenized with the bert-base-uncased tokenizer and split into three parts:
    • train.pth: 80% of the data for training
    • validation.pth: 10% of the data for validation
    • test.pth: 10% of the data for testing
    All of them contain serialized SpecialDataset objects, ready for immediate use in PyTorch; the definition of the dataset class can be found in the notebook, along with the text cleaning function.

    These files are derived from the Phishing Email Dataset and provide a quick start for training and evaluating models with BERT.

    Model and Tokenizer

    Directory: scam-email-classifier-bert-uncased
    • config.json: This file contains the configuration parameters for the BERT model architecture, including details about the model layers, attention heads, hidden size, etc. It ensures that the model structure can be correctly instantiated when loaded.
    • model.safetensors: This file contains the trained weights of the BERT model in the SafeTensors format. It is used to store and load the model parameters efficiently and safely.
    • training_args.bin: This file includes the arguments and hyperparameters used during the training of the BERT model, such as learning rate, batch size, number of training epochs, etc.

    Directory: scam-email-bert-tokenizer
    • special_tokens_map.json: This file maps special tokens (like [CLS], [SEP], [PAD], [UNK], and others) to their corresponding IDs used by the tokenizer.
    • tokenizer_config.json: This file contains the configuration parameters for the tokenizer, detailing how text should be processed and tokenized before being fed into the model.
    • vocab.txt: This file lists the vocabulary used by the tokenizer, mapping each token to a unique index.

    These files allow you to easily load the tokenizer and model using BertTokenizer.from_pretrained() and BertClassifier.from_pretrained(), respectively.
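
    A minimal loading sketch under the above layout (BertClassifier and SpecialDataset are defined in the author's notebook, so their definitions must be importable; paths are the directory names above):

    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('scam-email-bert-tokenizer')
    # unpickling the serialized SpecialDataset objects requires the class definition
    train_data = torch.load('scam-email-classifier-bert-uncased/train.pth')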

    Dataset Information

    The BERT model has been fine-tuned on the Phishing Email Dataset provided by Naser Abdullah Alam. This dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. The dataset includes a collection of phishing and legitimate emails, which has been used to train and evaluate the model. The actual training can be seen in the notebook.

    Citations

    Original BERT Model:

    Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805, 2018.

    Phishing Email Dataset:

    Naser Abdullah Alam. "Phishing Email Dataset." Kaggle, 2021.

  17. SynthCave: 3D Odometry Estimation

    • kaggle.com
    zip
    Updated Jan 28, 2024
    Cite
    Tim Bader (2024). SynthCave: 3D Odometry Estimation [Dataset]. https://www.kaggle.com/datasets/badertim/synthcave-3d-odometry-estimation
    Explore at:
    Available download formats: zip (22393814647 bytes)
    Dataset updated
    Jan 28, 2024
    Authors
    Tim Bader
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SynthCave is a synthetic dataset for 3D odometry estimation in cave-like environments, where GPS signals are unavailable and other sensors like cameras may be unreliable due to poor lighting. The dataset contains synthetic LiDAR data in three different forms: point clouds, depth-images, and graphs, along with IMU and ground-truth data.

    • Baseline Models Code & Dataset Code: https://github.com/BaderTim/SynthCave
    • Minecraft Measurement Mod: https://github.com/BaderTim/minecraft-measurement-mod
    • Paper: TBA

    [SynthCave demo image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F18671719%2F4bb40865aaf99eabd02424df978a5186%2Fmmm_demo.jpg?generation=1706469582201415&alt=media]

    The dataset is generated using a simulation environment with structured domain randomization to mimic real-world noise and variability. SynthCave is designed to facilitate the development and evaluation of novel deep learning methods for 3D odometry estimation, especially graph-based ones, which are underrepresented in the current literature. It is the first benchmark dataset of its kind for indoor 3D odometry estimation.

    1) Dataset Content

    Cave Section Type | Sequence Count | Duration (s) | XZ-Distance (m) | Y-Distance (m) | Avg. Phi (°/s) | Avg. Theta (°/s)
    Default
    Even Path | 20 | 274.40 | 462.42 | 0.00 | 59.42 | 5.25
    Even Path Upwards | 20 | 351.80 | 628.77 | 359.12 | 43.25 | 10.36
    Even Path Downwards | 20 | 338.20 | 578.29 | 261.00 | 22.40 | 9.84
    Advanced
    Entrance | 20 | 280.60 | 521.00 | 287.92 | 27.19 | 10.53
    Curvy Even Path | 20 | 348.60 | 588.01 | 0.82 | 86.72 | 8.24
    Curvy Path Upwards | 20 | 339.00 | 604.23 | 333.13 | 60.80 | 9.65
    Curvy Path Downwards | 20 | 350.60 | 628.10 | 230.94 | 57.20 | 13.20
    Miscellaneous
    Underwater | 20 | 432.80 | 1228.17 | 582.86 | 61.28 | 17.41
    Mineshaft | 20 | 402.20 | 653.61 | 45.64 | 81.91 | 8.49
    Roping Up Shaft | 20 | 360.00 | 246.84 | 932.17 | 69.15 | 24.46
    Roping Down Shaft | 20 | 164.40 | 242.89 | 1073.21 | 68.47 | 16.98
    Total | 220 | 3642.60 | 6382.31 | 4106.55 | 58.54 | 12.27

    [GT data distribution: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F18671719%2Fb7a75abffddb4269172c9e1fb332e90a%2Fdistribution.jpg?generation=1706469864428719&alt=media]
    (left) Histogram of the position changes of the GT values, rounded to 0.1 and limited to 1 and -1. Outside the limit are 3 X, 1 Z and 1035 Y values. (right) Histogram of the rotation changes of the GT values, converted to radians, rounded to 0.1 and limited to 1 and -1. Outside the limit are 3 theta and 361 phi values.

    2) Citation

    Please cite the following paper if you use this dataset or the code in your work:

    @article{bader2023synthcave,
      title={SynthCave: A Deep Learning Benchmark for 3D Odometry Estimation in Caves},
      author={Bader, Tim},
      journal={TBA},
      year={2023}
    }

    3) Usage

    The following classes are PyTorch datasets which can be used to process the data.

    3.1) Graph

    import os
    import torch
    import numpy as np
    from torch.utils.data import Dataset
    
    class GraphDataset(Dataset):
      def __init__(self, data_folder: str, frames: int, gt_as_rad: bool = True, gt_limit: None | list = [-1, 1], return_seq_name: bool = False):
        """
        Args:
          data_folder (string): Path to the graph dataset's train/val folder.
                     In each subfolder, there should be a labels.csv file and a folder for each sample. 
          frames (int): Number of frames in each sample.
          gt_as_rad (bool): Whether to return the ground truth as radians or not.
          gt_limit (None | list): If not None, the ground truth will be limited to the given range.
        """
        self.path = data_folder
        self.frames = frames
        self.gt_as_rad = gt_as_rad
        self.gt_limit = gt_limit
        self.return_seq_name = return_seq_name
        self.theta_rounded_hist = []
        self.phi_rounded_hist = []
        self.x_rounded_hist = []
        self.y_rounded_hist = []
        self.z_rounded_hist = []
        # the keys represent the cumulative number of samples
        self.index = {}
        self.id_name_map = {}
        self.samples = 0
        print(f"Initializing dataset from '{self.path}'...")
        # loop through folders
        for file in os.listdir(self.path):
          filename = os.fsdecode(file)
          if filename.endswith("_gt.npy"): # load each sequence set at once rather than one file after another
            sequence_id = int(filename.split("_")[0])
            sequence_name = "_".joi...
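
    The class definition above is truncated; the complete file ships with the dataset. A minimal usage sketch, assuming each item yields a (sample, ground truth) pair and that the train split lives in a folder like graph/train (both assumptions for illustration only):

    from torch.utils.data import DataLoader

    # Hypothetical path and frame count, for illustration only.
    dataset = GraphDataset(data_folder="graph/train", frames=8)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    sample, gt = next(iter(loader))  # tensor shapes depend on the graph representation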
    
  18. 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +1more
    Updated Dec 5, 2024
    Cite
    Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10925350
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Kampel, Martin
    Strohmayer, Julian
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    This repository contains the 3DO dataset proposed in [1].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

    Dataset Description

    The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

    The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

    Dataset Structure:

    /3DO

    ├── d1 <-- day 1 subdirectory

      └── w1 <-- sequence subdirectory
    
         └── csiposreg.csv <-- raw WiFi packet time series
    
         └── csiposreg_complex.npy <-- CSI time series cache
    

    ├── d2 <-- day 2 subdirectory

    ├── d3 <-- day 3 subdirectory
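
    A minimal sketch for loading the cached complex CSI of one sequence with NumPy (the sequence path is an example; the exact array layout is not specified here and should be checked against the reference dataloader):

    import numpy as np

    # Example sequence: day 1, walking sequence 1.
    csi = np.load("3DO/d1/w1/csiposreg_complex.npy")  # complex-valued CSI cache
    amplitude = np.abs(csi)   # amplitude features
    phase = np.angle(csi)     # phase features
    print(csi.shape, csi.dtype)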

    In [1], we use the following training, validation, and test split:

    | Subset | Day | Sequences |
    |----|----|----|
    | Train | 1 | w1, w2, w3, s1, s2, s3, l1, l2, l3 |
    | Val | 1 | w4, s4, l4 |
    | Test | 1 | w5, s5, l5 |
    | Test | 2 | w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5 |
    | Test | 3 | w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4 |

    w = walking, s = sitting, and l = lying

    Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

    BibTeX citation:

    @inproceedings{strohmayerOn2025,
      author="Strohmayer, Julian and Kampel, Martin",
      title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
      booktitle="Pattern Recognition",
      year="2025",
      publisher="Springer Nature Switzerland",
      address="Cham",
      pages="194--211",
      isbn="978-3-031-78354-8"
    }

  19. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Explore at:
    bin, zip, pngAvailable download formats
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    The FiN-2 dataset is split into two compressed `csv` files: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node that is measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifiers of the source and target nodes between which the spectrum is measured, and the timestamp of the measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # skip the first 5 data rows, then read the next 10 (data rows 6 to 15)
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed csv format was chosen to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. However, due to timeouts, noise, and other disturbances, nodes sometimes fail to collect data, so the number of measurements available for a specific timestamp varies. This, together with the high sparsity of the graph, makes the csv format highly inefficient for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Forecasting the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed, and used for ML training. MinMax scaling is applied as a simple preprocessing step, and a PyTorch dataset class handles the data. A vanilla autoencoder is then used to process the voltage and forecast it into the future.
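
    The notebook itself ships with the repository; the sketch below only illustrates the general pattern of such a dataset class (window length, horizon, and the per-node filtering are illustrative assumptions, not the notebook's actual parameters):

    ```
    import numpy as np
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageForecastDataset(Dataset):
        """Sliding-window voltage samples for one node (illustrative sketch)."""
        def __init__(self, csv_path, node_id, window=64, horizon=8):
            df = pd.read_csv(csv_path, compression="gzip")
            df = df[df["id"] == node_id].sort_values("ts")  # gaps in ts are ignored here for brevity
            v = df[["v1", "v2", "v3"]].to_numpy(dtype=np.float32)
            # MinMax scaling to [0, 1]; per-column bounds are assumed, not taken from the notebook
            self.v = (v - v.min(0)) / (v.max(0) - v.min(0) + 1e-8)
            self.window, self.horizon = window, horizon

        def __len__(self):
            return max(0, len(self.v) - self.window - self.horizon + 1)

        def __getitem__(self, i):
            x = self.v[i : i + self.window]                                # past voltages
            y = self.v[i + self.window : i + self.window + self.horizon]  # future voltages
            return torch.from_numpy(x), torch.from_numpy(y)
    ```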

  20. NTU60 Processed Skeleton Dataset

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Oucherif Mohammed Ouail (2025). NTU60 Processed Skeleton Dataset [Dataset]. https://www.kaggle.com/datasets/oucherifouail/ntu60-processed-skeleton-dataset
    Explore at:
    zip(3075187118 bytes)Available download formats
    Dataset updated
    Aug 29, 2025
    Authors
    Oucherif Mohammed Ouail
    Description

    NTU RGB+D 60 – Preprocessed Skeleton Dataset

    This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.

    The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.

    Preprocessing Steps

    Each skeleton sequence was processed by:

    • ✅ Removing NaN / invalid frames
    • ✅ Translating skeletons (centered spine base joint at origin)
    • ✅ Normalizing body scale using spine length
    • ✅ Aligning all sequences to 300 frames (padding or truncation)
    • ✅ Formatting sequences to include up to 2 persons per clip
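
    A minimal sketch of these steps on one raw sequence (the (T, 2, 25, 3) input layout and the joint indices are illustrative assumptions; the dataset's own preprocessing module is the reference):

    import numpy as np

    def preprocess(seq, max_frames=300, spine_base=1, spine_mid=20):
      """seq: (T, 2, 25, 3) raw skeletons (assumed layout: frames, persons, joints, xyz)."""
      # 1) remove frames containing NaN / invalid values
      seq = seq[~np.isnan(seq).any(axis=(1, 2, 3))]
      # 2) translate: center the first person's spine base (joint-2 in 1-based numbering) at the origin
      seq = seq - seq[:, 0:1, spine_base:spine_base + 1, :]
      # 3) normalize body scale by the spine length of the first person in the first frame
      spine_len = np.linalg.norm(seq[0, 0, spine_mid] - seq[0, 0, spine_base])
      seq = seq / (spine_len + 1e-8)
      # 4) align to a fixed length of 300 frames (zero-pad or truncate)
      out = np.zeros((max_frames,) + seq.shape[1:], dtype=np.float32)
      out[:min(len(seq), max_frames)] = seq[:max_frames]
      # 5) flatten persons x joints x coords -> (300, 150)
      return out.reshape(max_frames, -1)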

    Output Files

    Two .npz files are provided, following the standard evaluation protocols:

    1. NTU60_CS.npz → Cross-Subject split
    2. NTU60_CV.npz → Cross-View split

    Each file contains:

    • x_train → Training data, shape (N_train, 300, 150)
    • y_train → Training labels, shape (N_train, 60) (one-hot)
    • x_test → Testing data, shape (N_test, 300, 150)
    • y_test → Testing labels, shape (N_test, 60) (one-hot)

    Data Format

    • 300 = max frames per sequence (zero-padded)
    • 150 = 2 persons × 25 joints × 3 coordinates (x, y, z)
    • 60 = number of action classes

    If a sequence has only 1 person, the second person’s features are zero-filled.

    Skeleton Properties

    • Centered → Spine base joint (joint-2) at origin (0,0,0)
    • Normalized → Body size scaled consistently
    • Aligned → Fixed-length sequences (300 frames)
    • Two-person setting → Always represented with 150 features

    Evaluation Protocols

    • Cross-Subject (CS): Train and test sets split by different actors. The model is evaluated on unseen subjects to measure generalization across people.
    • Cross-View (CV): Train and test sets split by different camera views. The model is evaluated on unseen viewpoints to measure viewpoint invariance.

    Usage

    These .npz files can be directly loaded in PyTorch or NumPy-based pipelines. They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.

    Example:

    import numpy as np
    
    data = np.load("NTU60_CS.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    
    print(x_train.shape) # (N_train, 300, 150)
    print(y_train.shape) # (N_train, 60)
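
    Many GCN codebases expect inputs of shape (N, C, T, V, M). A small conversion sketch follows; the person-major ordering of the 150 features is inferred from the format description above and should be verified against your model's expected layout:

    import numpy as np
    import torch

    data = np.load("NTU60_CS.npz")
    x, y = data["x_train"], data["y_train"]

    # (N, 300, 150) -> (N, 300, 2, 25, 3): persons, joints, xyz
    x = x.reshape(x.shape[0], 300, 2, 25, 3)
    # -> (N, C=3, T=300, V=25, M=2)
    x = torch.from_numpy(x).permute(0, 4, 1, 3, 2).float()
    labels = torch.from_numpy(y.argmax(-1))  # one-hot -> class indices
    print(x.shape, labels.shape)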
    