47 datasets found

Pytorch Models
kaggle.com
zip
Updated May 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data
Explore at:
zip(21493 bytes)Available download formats
Dataset updated
May 10, 2025
Authors
Sufian Othman
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
✅ Step 1: Mount to Dataset

Search for my dataset pytorch-models and add it — this will mount it at:

/kaggle/input/pytorch-models/

✅ Step 2: Check file paths Once mounted, the four files will be available at:

/kaggle/input/pytorch-models/base_models.py /kaggle/input/pytorch-models/ext_base_models.py /kaggle/input/pytorch-models/ext_hybrid_models.py /kaggle/input/pytorch-models/hybrid_models.py

✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):

import shutil shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/') shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/') shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/') shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')

✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:

import base_models import ext_base_models import ext_hybrid_models import hybrid_models

Or, if you only want to import specific classes or functions:

from base_models import YourModelClass from ext_base_models import AnotherModelClass

✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:

model = base_models.YourModelClass() output = model(input_data)
Z
Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST...
data.niaid.nih.gov
Updated Jun 13, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Schürholt, Konstantin; Taskiran, Diyar; Knyazev, Boris; Giró-i-Nieto, Xavier; Borth, Damian (2022). Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6632086
Explore at:
Dataset updated
Jun 13, 2022
Dataset provided by
Image Processing Group, Universitat Politècnica de Catalunya
AI Lab Montreal, Samsung Advanced Institute of Technology
AIML Lab, University of St.Gallen
Authors
Schürholt, Konstantin; Taskiran, Diyar; Knyazev, Boris; Giró-i-Nieto, Xavier; Borth, Damian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

In the last years, neural networks have evolved from laboratory environments to the state-of-the-art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. Following, a population of such neural network models (refereed to as “model zoo”) would form topological structures in weight space. We think that the geometry, curvature and smoothness of these structures contain information about the state of training and can be reveal latent properties of individual models. With such zoos, one could investigate novel approaches for (i) model analysis, (ii) discover unknown learning dynamics, (iii) learn rich representations of such populations, or (iv) exploit the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research about populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total the proposed model zoo dataset is based on six image datasets, consist of 24 model zoos with varying hyperparameter combinations are generated and includes 47’360 unique neural network models resulting in over 2’415’360 collected model states. Additionally, to the model zoo data we provide an in-depth analysis of the zoos and provide benchmarks for multiple downstream tasks as mentioned before.

Dataset

This dataset is part of a larger collection of model zoos and contains the zoos trained on the labelled samples from MNIST. All zoos with extensive information and code can be found at www.modelzoos.cc.

This repository contains two types of files: the raw model zoos as collections of models (file names beginning with "mnist_"), as well as preprocessed model zoos wrapped in a custom pytorch dataset class (filenames beginning with "dataset"). Zoos are trained in three configurations varying the seed only (seed), varying hyperparameters with fixed seeds (hyp_fix) or varying hyperparameters with random seeds (hyp_rand). The index_dict.json files contain information on how to read the vectorized models.

For more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.
Oxford 102 Flower Dataset
kaggle.com
zip
Updated May 26, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lalu Erfandi Maula Yusnu (2021). Oxford 102 Flower Dataset [Dataset]. https://www.kaggle.com/nunenuh/pytorch-challange-flower-dataset
Explore at:
zip(346507679 bytes)Available download formats
Dataset updated
May 26, 2021
Authors
Lalu Erfandi Maula Yusnu
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

We have created a 102 category dataset, consisting of 102 flower categories. The flowers chosen to be flower commonly occuring in the United Kingdom. Each class consists of between 40 and 258 images. The details of the categories and the number of images for each class can be found on this category statistics page.

The images have large scale, pose and light variations. In addition, there are categories that have large variations within the category and several very similar categories. The dataset is visualized using isomap with shape and colour features.

Directory Structure

> dataset > train > valid > test - cat_to_name.json - README.md - sample_submission.csv

Visualization of the dataset

We visualize the categories in the dataset using SIFT features as shape descriptors and HSV as colour descriptor. The images are randomly sampled from the category.

https://i.imgur.com/Tl6TKUC.png" alt="">

Publications

Nilsback, M-E. and Zisserman, A. Automated flower classification over a large number of classes
Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing (2008)

Source

Original source of this data can be found in 102 Category Flower Dataset

Original readme from author can be found in AUTHOR README

Directory test is added from another kaggle dataset that can be found in Oxford 102 Flower Pytorch
h
bigearthnet
huggingface.co
Updated Jul 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Luca Colomba (2024). bigearthnet [Dataset]. https://huggingface.co/datasets/lc-col/bigearthnet
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 13, 2024
Authors
Luca Colomba
Description
BigEarthNet - HDF5 version

This repository contains an export of the existing BigEarthNet dataset in HDF5 format. All Sentinel-2 acquisitions are exported according to TorchGeo's dataset (120x120 pixels resolution). Sentinel-1 is not contained in this repository for the moment. CSV files contain for each satellite acquisition the corresponding HDF5 file and the index. A PyTorch dataset class which can be used to iterate over this dataset can be found here, as well as the script used… See the full description on the dataset page: https://huggingface.co/datasets/lc-col/bigearthnet.
cifar-100-python
kaggle.com
zip
Updated Dec 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ThanhTan (2024). cifar-100-python [Dataset]. https://www.kaggle.com/datasets/duongthanhtan/cifar-100-python
Explore at:
zip(168517675 bytes)Available download formats
Dataset updated
Dec 26, 2024
Authors
ThanhTan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
CIFAR-100 Dataset

1. Overview

CIFAR-100 is an extension of the CIFAR-10 dataset, with more classes and finer-grained categorization.

It contains 100 classes, making it more challenging than CIFAR-10, which has only 10 classes.

Each image in CIFAR-100 is labeled with both a fine label (specific category) and a coarse label (broader category, such as animals or vehicles).

2. Dataset Details

Number of Images: 60,000 color images in total.

50,000 for training.

10,000 for testing.

Image Size: Each image is a small 32x32 pixel RGB (color) image.

Classes: 100 classes, grouped into 20 superclasses.

Each superclass contains 5 related classes.

3. Fine and Coarse Labels

Fine Labels: The dataset has specific categories, such as 'apple', 'bicycle', 'rose', etc.

Coarse Labels: These are broader categories, like 'fruit', 'flower', 'vehicle', etc.

4. Applications

Image Classification: Used for training models to classify images into their respective categories.

Feature Extraction: Useful for benchmarking feature extraction techniques in computer vision.

Transfer Learning: Often used to pre-train models for other similar tasks.

Deep Learning Research: Commonly used to test architectures like CNNs (Convolutional Neural Networks).

5. Challenges

The images are very small (32x32 pixels), making it harder for models to learn intricate details.

High class count (100) increases classification complexity.

Intra-class variability and inter-class similarity make it a challenging dataset for classification.

6. File Format

The dataset is usually available in Python-friendly formats like .pkl or .npz.

It can also be downloaded and loaded using frameworks like TensorFlow or PyTorch.

7. Example Classes

Some example classes include: - Animals: beaver, dolphin, otter, elephant, snake. - Plants: apple, orange, mushroom, palm tree, pine tree. - Vehicles: bicycle, bus, motorcycle, train, rocket. - Everyday Objects: clock, keyboard, lamp, table, chair.
MIEDT dataset
kaggle.com
Updated Jan 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
机关鸢鸟 (2025). MIEDT dataset [Dataset]. https://www.kaggle.com/datasets/lidang78/miedt-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 12, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
机关鸢鸟
Description
Dataset Overview This dataset is organized based on the edge detection task, aiming to provide rich image resources and corresponding edge detection annotation information for related research and applications, which can be used for the testing of edge detection algorithms. In order to evaluate the performance of the edge detection method comprehensively, we created the Medical Image Edge Detection Test (MIEDT) dataset. The MIEDT contains 100 medical images, which were randomly selected from three publicly available datasets, Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000 .

Data Set Structure Original image: This folder stores the original image data. It contains 15 Head CT images in PNG format with varying image resolutions; 25 coronary heart disease images in JPG format and with an image resolution of [1024 * 1024]; 60 skin images in JPG format and with an image resolution of [600 * 450]. It covers a variety of medical image materials with different imaging and contrast, providing diverse input data for edge detection algorithms. Ground truth：The data in this folder are the edge detection annotation images corresponding to the images in the "Originals" folder. They are in PNG format. In these images, the white pixels represent the edge parts of the image, and the black pixels represent the non-edge areas. These annotation information accurately outlines the object contours and edge features in the original images.

Usage Instructions For users who conduct image processing using Python, they can utilize the cv2 (OpenCV) library to read image data. The sample code is as follows:

import cv2 original_image = cv2.imread('Original image/IMG-001.png') # Read original image ground_truth_image = cv2.imread('Ground truth/GT-001.png', cv2.IMREAD_GRAYSCALE) # Read the corresponding Ground Truth image When performing model training based on deep learning frameworks (such as TensorFlow, PyTorch), the dataset path can be configured into the corresponding dataset loading class according to the data loading mechanism of the framework to ensure that the model can correctly read and process the image and its annotation data.

4. Data Sources and References Data Sources: The original images are collected from public image datasets Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000 to ensure the quality and diversity of the images. If you are using this dataset in academic research, please cite the following literature.

References: [1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368

[2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).

[3] Classification of Brain Hemorrhage Using Deep Learning from CT Scan Images - https://link.springer.com/chapter/10.1007/978-981-19-7528-8_15
f
Data from: Deep learning neural network derivation and testing to...
tandf.figshare.com
datasetcatalog.nlm.nih.gov
png
Updated Aug 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Omid Mehrpour; Christopher Hoyte; Abdullah Al Masud; Ashis Biswas; Jonathan Schimmel; Samaneh Nakhaee; Mohammad Sadegh Nasr; Heather Delva-Clark; Foster Goss (2023). Deep learning neural network derivation and testing to distinguish acute poisonings [Dataset]. http://doi.org/10.6084/m9.figshare.23694504.v1
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23694504.v1
Dataset updated
Aug 8, 2023
Dataset provided by
Taylor & Francis
Authors
Omid Mehrpour; Christopher Hoyte; Abdullah Al Masud; Ashis Biswas; Jonathan Schimmel; Samaneh Nakhaee; Mohammad Sadegh Nasr; Heather Delva-Clark; Foster Goss
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Acute poisoning is a significant global health burden, and the causative agent is often unclear. The primary aim of this pilot study was to develop a deep learning algorithm that predicts the most probable agent a poisoned patient was exposed to from a pre-specified list of drugs. Data were queried from the National Poison Data System (NPDS) from 2014 through 2018 for eight single-agent poisonings (acetaminophen, diphenhydramine, aspirin, calcium channel blockers, sulfonylureas, benzodiazepines, bupropion, and lithium). Two Deep Neural Networks (PyTorch and Keras) designed for multi-class classification tasks were applied. There were 201,031 single-agent poisonings included in the analysis. For distinguishing among selected poisonings, PyTorch model had specificity of 97%, accuracy of 83%, precision of 83%, recall of 83%, and a F1-score of 82%. Keras had specificity of 98%, accuracy of 83%, precision of 84%, recall of 83%, and a F1-score of 83%. The best performance was achieved in the diagnosis of single-agent poisoning in diagnosing poisoning by lithium, sulfonylureas, diphenhydramine, calcium channel blockers, then acetaminophen, in PyTorch (F1-score = 99%, 94%, 85%, 83%, and 82%, respectively) and Keras (F1-score = 99%, 94%, 86%, 82%, and 82%, respectively). Deep neural networks can potentially help in distinguishing the causative agent of acute poisoning. This study used a small list of drugs, with polysubstance ingestions excluded.Reproducible source code and results can be obtained at https://github.com/ashiskb/npds-workspace.git.
SimCATS_GaAs_v1_random_variations_v2
resodate.org
Updated Oct 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabian Hader; Fabian Fuchs; Sarah Fleitmann (2024). SimCATS_GaAs_v1_random_variations_v2 [Dataset]. http://doi.org/10.26165/JUELICH-DATA/5PB3GT
Explore at:
Unique identifier
https://doi.org/10.26165/JUELICH-DATA/5PB3GT
Dataset updated
Oct 9, 2024
Dataset provided by
Forschungszentrum Jülichhttp://www.fz-juelich.de/
Peter Grünberg Institute - Integrated Computing Architectures (ICA/PGI-4)
Authors
Fabian Hader; Fabian Fuchs; Sarah Fleitmann
Description
Dataset: SimCATS_GaAs_v1_random_variations_v2 Simulated data from the geometric SimCATS model (GitHub Repository, Paper) for benchmarking of semiconductor quantum dot tuning algorithms. Generated using this Jupyter Notebook and used for the final evaluation in Automated Charge Transition Detection in Quantum Dot Charge Stability Diagrams. Key Facts Contains pink, white & random telegraph noise, transition blurring, and dot jumps Random variations of charge transitions, sensor, and distortions 1.000 randomly sampled configurations with 100 CSDs each (in total: 100.000 CSDs) Usage To load the data, e.g. for calculating metrics, please have a look at SimCATS-Datasets (GitHub Repository, ReadTheDocs). The dataset can be loaded as numpy arrays using the function load_dataset or as PyTorch Dataset class (for machine learning purposes) using the class SimcatsDataset.
GISE-51
zenodo.org
application/gzip, txt
Updated Apr 13, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sarthak Yadav; Sarthak Yadav; Mary Ellen Foster; Mary Ellen Foster (2021). GISE-51 [Dataset]. http://doi.org/10.5281/zenodo.4593514
Explore at:
application/gzip, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4593514
Dataset updated
Apr 13, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sarthak Yadav; Sarthak Yadav; Mary Ellen Foster; Mary Ellen Foster
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible using existing large-scale weakly labelled datasets. GISE-51 release also included accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.

Citation

If you use the GISE-51 dataset and/or the released code, please cite our paper:

Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021

Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

About GISE-51 and GISE-51-Mixtures

The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.

GISE-51

Three subsets: train, val and eval with 12465, 1716, and2176 utterances. Subsets are in coherence with the FSD50K release.

Encompasses 51 sound classes from the FSD50K release

View meta/lbl_map.csv for the complete vocabulary.

The dataset was obtained from FSD50K using the following steps:

Unsmearing annotations to obtain single instances with a single label using the provided metadata and ground truth in FSD50K.

Manual inspection to qualitatively evaluate shortlisted utterances.

Volume-threshold based automated silence filtering using sox. Different volume thresholds are selected for various sound event class bins using trial-and-error. silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.

Re-evaluate sound event classes, removing ones with too few samples and merging those with high inter-class ambiguity.

GISE-51-Mixtures

Synthetic 5-second soundscapes with up to 3 events created using Scaper.

Weighted sampling with replacement for sound event selection, effectively oversampling events with very few samples. Synthetic soundscapes generated thus have a near equal number of annotations per sound event.

The number of soundscapes in val and eval set is 10000 each.

The number of soundscapes in the final train set is 60000. We do provide training sets with 5k-100k soundscapes.

GISE-51-Mixtures is our proposed subset that can be used to benchmark the performance of future works.

LICENSE

All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.

GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.

Baselines

Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.

Files

GISE-51 is available as a collection of several tar archives. All audio files are PCM 16 bit, 22050 Hz. Following lists the contents of these files in detail:

isolated_events.tar.gz: The core GISE-51 isolated events dataset containing train, val and eval subfolders.

meta.tar.gz: contains lbl_map.json

noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation

mixtures_jams.tar.gz: This file contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz can be reused to generate exact GISE-51-Mixtures soundscapes. (Optional, we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)

train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.

val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.

eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.

train_*.tar.gz: These are tar archives containing training mixtures of a various number of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance v/s number of training soundscapes. A helper script is provided in the code release, prepare_mixtures_lmdb.sh, to prepare data for experiments in Section 4.1.

pretrained-models.tar.gz: Contains model checkpoints for all experiments conducted in the paper. More information on these checkpoints can be found in the code release README.

experiments_60k_mixtures: model checkpoints from section 4.2 of the paper.

exported_weights_60k: ResNet-18 and EfficientNet-B1 exported as plain state_dicts for use with transfer learning experiments.

experiments_audioset: checkpoints from AudioSet Balanced (Sec 4.3.1) experiments

experiments_vggsound: checkpoints from Section 4.3.2 of the paper

experiments_esc50: ESC-50 dataset checkpoints, from Section 4.3.3

license.tar.gz: contains dataset license info.

silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.

Contact

In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)
Z
Dataset for class comment analysis
data.niaid.nih.gov
Updated Feb 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pooja Rani (2022). Dataset for class comment analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4311838
Explore at:
Dataset updated
Feb 22, 2022
Dataset provided by
University of Bern
Authors
Pooja Rani
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A list of different projects selected to analyze class comments (available in the source code) of various languages such as Java, Python, and Pharo. The projects vary in terms of size, contributors, and domain.

Structure

Projects/ Java_projects/ eclipse.zip guava.zip guice.zip hadoop.zip spark.zip vaadin.zip Pharo_projects/ images/ GToolkit.zip Moose.zip PetitParser.zip Pillar.zip PolyMath.zip Roassal2.zip Seaside.zip vm/ 70-x64/Pharo Scripts/ ClassCommentExtraction.st SampleSelectionScript.st Python_projects/ django.zip ipython.zip Mailpile.zip pandas.zip pipenv.zip pytorch.zip requests.zip

Contents of the Replication Package

Projects/ contains the raw projects of each language that are used to analyze class comments. - Java_projects/ - eclipse.zip - Eclipse project downloaded from the GitHub. More detail about the project is available on GitHub Eclipse. - guava.zip - Guava project downloaded from the GitHub. More detail about the project is available on GitHub Guava. - guice.zip - Guice project downloaded from the GitHub. More detail about the project is available on GitHub Guice - hadoop.zip - Apache Hadoop project downloaded from the GitHub. More detail about the project is available on GitHub Apache Hadoop - spark.zip - Apache Spark project downloaded from the GitHub. More detail about the project is available on GitHub Apache Spark - vaadin.zip - Vaadin project downloaded from the GitHub. More detail about the project is available on GitHub Vaadin

Pharo_projects/

images/ -

GToolkit.zip - Gtoolkit project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

Moose.zip - Moose project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

PetitParser.zip - Petit Parser project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

Pillar.zip - Pillar project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

PolyMath.zip - PolyMath project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

Roassal2.zip - Roassal2 project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

Seaside.zip - Seaside project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.

vm/ -

70-x64/Pharo - Pharo7 (version 7 of Pharo) virtual machine to instantiate the Pharo images given in the images/ folder. The user can run the vm on macOS and select any of the Pharo image.

Scripts/ - It contains the sample Smalltalk scripts to extract class comments from various projects.

ClassCommentExtraction.st - A Smalltalk script to show how class comments are extracted from various Pharo projects. This script is already provided in the respective project image.

SampleSelectionScript.st - A Smalltalk script to show sample class comments of Pharo projects are selected. This script can be run in any of the Pharo images given in the images/ folder.

Python_projects/

django.zip - Django project downloaded from the GitHub. More detail about the project is available on GitHub Django

ipython.zip - IPython project downloaded from the GitHub. More detail about the project is available on GitHub on IPython

Mailpile.zip - Mailpile project downloaded from the GitHub. More detail about the project is available on GitHub on Mailpile

pandas.zip - pandas project downloaded from the GitHub. More detail about the project is available on GitHub on pandas

pipenv.zip - Pipenv project downloaded from the GitHub. More detail about the project is available on GitHub on Pipenv

pytorch.zip - PyTorch project downloaded from the GitHub. More detail about the project is available on GitHub on PyTorch

requests.zip - Requests project downloaded from the GitHub. More detail about the project is available on GitHub on Requests
Genomics OOD
kaggle.com
tensorflow.org
zip
Updated Mar 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sven Elflein (2021). Genomics OOD [Dataset]. https://www.kaggle.com/svenel/genomics-ood
Explore at:
zip(2282016677 bytes)Available download formats
Dataset updated
Mar 31, 2021
Authors
Sven Elflein
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Bacteria Genomics OOD dataset

This dataset implements a PyTorch dataset for the Genomics OOD dataset proposed in

J. Ren et al., “Likelihood Ratios for Out-of-Distribution Detection,” arXiv:1906.02845 [cs, stat], Available: http://arxiv.org/abs/1906.02845.

Code can be found at Github.

The dataset contains for each input sample - A sequence of 250 integers, where each number is from {0, 1, 2, 3} indicating {A, C, G, T}. - A class label, range from 0 to 129 for the bacteria class. - A a string notating where the sequence comes from.

In total there a 5 splits: Train, Validation, Test split with 10 in-distribution classes and a valdidation out-of-distribution dataset, as well as a out-of-distribution test set with 60 classes each.

The dataset with generated indices can be downloaded via the Releases.

Attribution

The original dataset was released by

Jie Ren, Google Research, 05/23/2019, jjren@google.com

Following CC BY 4.0 International license, this is released and distributed under the CC BY 4.0 license. The original dataset can be found here.
Imbalanced Cifar-10
kaggle.com
zip
Updated Jun 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
Explore at:
zip(807146485 bytes)Available download formats
Dataset updated
Jun 17, 2023
Authors
Akhil Theerthala
Description
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media" alt="">

The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

This dataset is beneficial for: - Developing and testing strategies for handling imbalanced datasets. - Investigating the effects of class imbalance on model performance. - Comparing different machine learning algorithms' performance under class imbalance.

Usage Information:

The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised in a way such that the dataset can be integrated into PyTorch ImageFolder directly. You can load the dataset in Python using popular libraries like NumPy and PyTorch.

License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
Z
Data from: Self-Supervised Representation Learning on Neural Network Weights...
data.niaid.nih.gov
Updated Nov 13, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Schürholt, Kontantin; Kostadinov, Dimche; Borth, Damian (2021). Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction - Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5645137
Explore at:
Dataset updated
Nov 13, 2021
Dataset provided by
University of St.Gallen
Authors
Schürholt, Kontantin; Kostadinov, Dimche; Borth, Damian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets to NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".

Datasets are pytorch files containing a dictionary with training, validation and test sets. Train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code an be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.

Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al, 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy)

Abstract: Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.
Cleaned ISIC Skin Cancer Dataset (6 Classes)
kaggle.com
zip
Updated Feb 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aayyyyyyuuussshhh (2025). Cleaned ISIC Skin Cancer Dataset (6 Classes) [Dataset]. https://www.kaggle.com/datasets/aayyyyyyuuussshhh/cleaned-isic-skin-cancer-dataset-6-classes
Explore at:
zip(538458444 bytes)Available download formats
Dataset updated
Feb 10, 2025
Authors
Aayyyyyyuuussshhh
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains dermatoscopic images of skin lesions organized into six classes:

Melanoma Nevus (Mole) Basal Cell Carcinoma Actinic Keratosis Benign Keratosis Vascular Lesion The dataset has been preprocessed to remove duplicate images and ensure consistency between the training and test sets. It is structured into train and test folders, with subfolders for each class. This makes it ready for use in machine learning and deep learning projects.

Key Features: Total Images: 1888 (1820 train, 68 test) Classes: 6 Image Size: Variable (can be resized during preprocessing) Preprocessing: Duplicate images removed using perceptual hashing.

profile Qwen2.5-Max 1:26 am Here’s a polished and professional description you can use for your Kaggle dataset. It’s concise, informative, and includes all the necessary details about the dataset, its purpose, preprocessing steps, and usage guidelines.

Dataset Description Title: Cleaned Skin Cancer Dataset (6 Classes)

Description:

This dataset contains dermatoscopic images of skin lesions organized into six classes:

Melanoma Nevus (Mole) Basal Cell Carcinoma Actinic Keratosis Benign Keratosis Vascular Lesion The dataset has been preprocessed to remove duplicate images and ensure consistency between the training and test sets. It is structured into train and test folders, with subfolders for each class. This makes it ready for use in machine learning and deep learning projects.

Key Features: Total Images: 1888 (1820 train, 68 test) Classes: 6 Image Size: Variable (can be resized during preprocessing) Preprocessing: Duplicate images removed using perceptual hashing. Use Case: This dataset is ideal for training and evaluating models for skin cancer classification. It can be used with frameworks like TensorFlow, PyTorch, or scikit-learn. The cleaned structure ensures that the dataset is free from duplicates and ready for immediate use.

Acknowledgments: The original dataset was sourced from the International Skin Imaging Collaboration (ISIC) . Cleaning and preprocessing were performed to remove duplicates and prepare the dataset for modeling. Please refer to the ISIC website for more information about the original dataset: ISIC Archive .

License: This dataset is derived from the ISIC dataset and is made available under the CC BY-NC-SA license. Any use of this dataset must comply with the original licensing terms, including non-commercial use and attribution.

Sentence/Table Pair Data from Wikipedia for Pre-training with...

zenodo.org
data.niaid.nih.gov

application/gzip

Updated Oct 29, 2021

Facebook

Twitter

Click to copy link

Link copied

Cite

Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun; Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. http://doi.org/10.5281/zenodo.5612316

Explore at:

application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.5612316

Dataset updated

Oct 29, 2021

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun; Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

# path to the uncompressed files, should be a directory with a set of tar files
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
  wds.Dataset(url)
  .shuffle(1000) # cache 1000 samples and shuffle
  .decode()
  .to_tuple("json")
  .batched(20) # group every 20 examples into a batch
)

# Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch
# You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence
 's1_all_links': {
  'Sils,_Girona': [[0, 4]],
  'municipality': [[10, 22]],
  'Comarques_of_Catalonia': [[30, 37]],
  'Selva': [[41, 46]],
  'Catalonia': [[51, 60]]
 }, # list of entities and their mentions in the sentence (start, end location)
 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs
  {
    'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair
    's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query
    's2s': [ # list of other sentences that contain the common entity pair, or evidence
     {
       'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
       'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
       's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
       'pair_locs': [ # mentions of the entity pair in the evidence
        [[19, 27]], # mentions of entity 1
        [[0, 5], [288, 293]] # mentions of entity 2
       ],
       'all_links': {
        'Selva': [[0, 5], [288, 293]],
        'Comarques_of_Catalonia': [[19, 27]],
        'Catalonia': [[40, 49]]
       }
      }
    ,...] # there are multiple evidence sentences
   },
 ,...] # there are multiple entity pairs in the query
}

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
 's1_all_links': {...}, # same as text-only
 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only
 'table_pairs': [
  'tid': 'Major_League_Baseball-1',
  'text':[
    ['World Series Records', 'World Series Records', ...],
    ['Team', 'Number of Series won', ...],
    ['St. Louis Cardinals (NL)', '11', ...],
  ...] # table content, list of rows
  'index':[
    [[0, 0], [0, 1], ...],
    [[1, 0], [1, 1], ...],
  ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
  'value_ranks':[
    [0, 0, ...],
    [0, 0, ...],
    [0, 10, ...],
  ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS
  'value_inv_ranks': [], # inverse rank
  'all_links':{
    'St._Louis_Cardinals': {
     '2': [
      [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]]
     ] # list of mentions in the second row, the key is row_id
    },
    'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
  }
  'name': '', # table name, if exists
  'pairs': {
    'pair': ['American_League', 'National_League'],
    's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query
    'table_pair_locs': {
     '17': [ # mention of entity pair in row 17
       [
        [[17, 0], [3, 18]],
        [[17, 1], [3, 18]],
        [[17, 2], [3, 18]],
        [[17, 3], [3, 18]]
       ], # mention of the first entity
       [
        [[17, 0], [21, 36]],
        [[17, 1], [21, 36]],
       ] # mention of the second entity
     ]
    }
   }
 ]
}

phishing-email-classifier-bert
kaggle.com
zip
Updated Jun 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivan Piiashev (2024). phishing-email-classifier-bert [Dataset]. https://www.kaggle.com/datasets/ivan314sh/phishing-email-classifier-bert
Explore at:
zip(439956757 bytes)Available download formats
Dataset updated
Jun 28, 2024
Authors
Ivan Piiashev
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset has: Classifier save, Tokenizer save, Encoded Phishing Email Dataset!

Encoded dataset

Directory: scam-email-classifier-bert-uncased

Contains preprocessed (removed special chars, encoded urls to special token, emails to special token) data for phishing email classification using BERT. It has been tokenized with the bert-base-uncased tokenizer and split into three parts: - train.pth: 80% of the data for training - validation.pth: 10% of the data for validation - test.pth: 10% of the data for testing All of them contain the serialized SpecialDataset objects, ready for immediate use in PyTorch, one can see the definition of dataset class in notebook. The text cleaning function is there as well!

These files are derived from the Phishing Email Dataset and provide a quick start for training and evaluating models with BERT.

Model and Tokenizer

Directory: scam-email-classifier-bert-uncased - config.json: This file contains the configuration parameters for the BERT model architecture, including details about the model layers, attention heads, hidden size, etc. It ensures that the model structure can be correctly instantiated when loaded. - model.safetensors: This file contains the trained weights of the BERT model in the SafeTensors format. It is used to store and load the model parameters efficiently and safely. - training_args.bin: This file includes the arguments and hyperparameters used during the training of the BERT model, such as learning rate, batch size, number of training epochs, etc.

Directory: scam-email-bert-tokenizer - special_tokens_map.json: This file maps special tokens (like [CLS], [SEP], [PAD], [UNK], and others) to their corresponding IDs used by the tokenizer. - tokenizer_config.json: This file contains the configuration parameters for the tokenizer, detailing how text should be processed and tokenized before being fed into the model. - vocab.txt: This file lists the vocabulary used by the tokenizer, mapping each token to a unique index.

These files allow to easily load the tokenizer and model using BertTokenizer.from_pretrained() and BertClassifier.from_pretrained() respectively.

Dataset Information

The BERT model has been fine-tuned on the Phishing Email Dataset provided by Naser Abdullah Alam. This dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. The dataset includes a collection of phishing and legitimate emails, which has been used to train and evaluate the model. The actual training can be seen in notebook.

Citations

Original BERT Model:

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805, 2018. Phishing Email Dataset:

Original Dataset:

Naser Abdullah Alam. "Phishing Email Dataset." Kaggle, 2021.

SynthCave: 3D Odometry Estimation

kaggle.com

zip

Updated Jan 28, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Tim Bader (2024). SynthCave: 3D Odometry Estimation [Dataset]. https://www.kaggle.com/datasets/badertim/synthcave-3d-odometry-estimation

Explore at:

zip(22393814647 bytes)Available download formats

Dataset updated

Jan 28, 2024

Authors

Tim Bader

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

SynthCave is a synthetic dataset for 3D odometry estimation in cave-like environments, where GPS signals are unavailable and other sensors like cameras may be unreliable due to poor lightning. The dataset contains synthetic LiDAR data in three different forms: point clouds, depth-images, and graphs, along with IMU and ground-truth data.

Baseline Models Code & Dataset Code: https://github.com/BaderTim/SynthCave
Minecraft Measurement Mod: https://github.com/BaderTim/minecraft-measurement-mod
Paper: TBA https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F18671719%2F4bb40865aaf99eabd02424df978a5186%2Fmmm_demo.jpg?generation=1706469582201415&alt=media" alt="Synthcave Demo Image"> The dataset is generated using a simulation environment with structured domain randomization to mimic the real-world noise and variability. SynthCave is designed to facilitate the development and evaluation of novel deep learning methods for 3D odometry estimation, especially graph-based ones, which are underrepresented in the current literature. It is the first benchmark dataset of its kind for indoor 3D odometry estimation.

1) Dataset Content

Cave Section Type	Sequence Count	Duration (in s)	XZ-Distance (in m)	Y-Distance (in m)	Avg. Phi (in °/s)	Avg. Theta (in °/s)
Default
Even Path	20	274.40	462.42	0.00	59.42	5.25
Even Path Upwards	20	351.80	628.77	359.12	43.25	10.36
Even Path Downwards	20	338.20	578.29	261.00	22.40	9.84
Advanced
Entrance	20	280.60	521.00	287.92	27.19	10.53
Curvy Even Path	20	348.60	588.01	0.82	86.72	8.24
Curvy Path Upwards	20	339.00	604.23	333.13	60.80	9.65
Curvy Path Downwards	20	350.60	628.10	230.94	57.20	13.20
Miscellaneous
Underwater	20	432.8	1228.17	582.86	61.28	17.41
Mineshaft	20	402.2	653.61	45.64	81.91	8.49
Roping Up Shaft	20	360.0	246.84	932.17	69.15	24.46
Roping Down Shaft	20	164.4	242.89	1073.21	68.47	16.98
Total	220	3642.60	6382.31	4106.55	58.54	12.27

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F18671719%2Fb7a75abffddb4269172c9e1fb332e90a%2Fdistribution.jpg?generation=1706469864428719&alt=media" alt="GT Data Distribution"> (left) Histogram of the position changes, rounded to 0.1 and limited to 1 and -1, of the GT values. Outside the limit are 3 X, 1 Z and 1035 Y values. (right) Histogram of the rotation changes, converted to radians, rounded to 0.1 and limited to 1 and -1, of the GT values. Outside the limit are 3 theta and 361 phi values.

2) Citation

Please cite the following paper if you use this dataset or the code in your work: latex @article{bader2023synthcave, title={SynthCave: A Deep Learning Benchmark for 3D Odometry Estimation in Caves}, author={Bader, Tim}, journal={TBA}, year={223} }

3) Usage

The following classes are PyTorch datasets which can be used to process the data.

3.1) Graph

import os
import torch
import numpy as np
from torch.utils.data import Dataset

class GraphDataset(Dataset):
  def _init_(self, data_folder: str, frames: int, gt_as_rad: bool = True, gt_limit: None | list = [-1, 1], return_seq_name: bool = False):
    """
    Args:
      data_folder (string): Path to the graph dataset's train/val folder.
                 In each subfolder, there should be a labels.csv file and a folder for each sample. 
      frames (int): Number of frames in each sample.
      gt_as_rad (bool): Whether to return the ground truth as radians or not.
      gt_limit (None | list): If not None, the ground truth will be limited to the given range.
    """
    self.path = data_folder
    self.frames = frames
    self.gt_as_rad = gt_as_rad
    self.gt_limit = gt_limit
    self.return_seq_name = return_seq_name
    self.theta_rounded_hist = []
    self.phi_rounded_hist = []
    self.x_rounded_hist = []
    self.y_rounded_hist = []
    self.z_rounded_hist = []
    # the keys represent the cumulative number of samples
    self.index = {}
    self.id_name_map = {}
    self.samples = 0
    print(f"Initializing dataset from '{self.path}'...")
    # loop through folders
    for file in os.listdir(self.path):
      filename = os.fsdecode(file)
      if filename.endswith("_gt.npy"): # load sequence set at once and not after another
        sequence_id = int(filename.split("_")[0])
        sequence_name = "_".joi...

Z
3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...
nde-dev.biothings.io
data.niaid.nih.gov
+1more
Updated Dec 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10925350
Explore at:
Dataset updated
Dec 5, 2024
Dataset provided by
Kampel, Martin
Strohmayer, Julian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

This repository contains the 3DO dataset proposed in [1].

PyTroch Dataloader

A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

Dataset Description

The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

Dataset Structure:

/3DO

├── d1 <-- day 1 subdirectory

└── w1 <-- sequence subdirectory └── csiposreg.csv <-- raw WiFi packet time series └── csiposreg_complex.npy <-- CSI time series cache

├── d2 <-- day 2 subdirectory

├── d3 <-- day 3 subdirectory

In [1], we use the following training, validation, and test split:

Subset Day Sequences

Train 1 w1, w2, w3, s1, s2, s3, l1, l2, l3

Val 1 w4, s4, l4

Test 1 w5 , s5, l5

Test 2 w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5

Test 3 w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

w = walking, s = sitting and l= lying

Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

Download and UseThis data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

BibTeX citation:

@inproceedings{strohmayerOn2025, author="Strohmayer, Julian and Kampel, Martin", title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios", booktitle="Pattern Recognition", year="2025", publisher="Springer Nature Switzerland", address="Cham", pages="194--211", isbn="978-3-031-78354-8" }
FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)
zenodo.org
bin, png, zip
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
Explore at:
bin, zip, pngAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8328113
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
# FiN-2 Large-Scale Real-World PLC-Dataset

## About
#### FiN-2 dataset in a nutshell:
FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more decent information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

* * *
## Content
FiN-2 dataset splits up into two compressed `csv-Files`: *nodes.csv* and *edges.csv*.

All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
- https://zenodo.org/record/8328105
- https://zenodo.org/record/8328108
- https://zenodo.org/record/8328111

### Node data

| id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
|112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
|112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

- id / ts: Unique identifier of the node that is measured and timestemp of the measurement
- v1/v2/v3: Voltage measurements of all three phases
- thd1/thd2/thd3: Total harmonic distortion of all three phases
- phase_angle1/2/3: Phase angle of all three phases
- temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

### Edge data
| src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
|----|----|----|----|----|----|----|----|
|62|94|1605528900|70|72|45|...|-53|
|62|32|1605529800|16|24|13|...|-51|
|17|94|1605530700|37|25|24|...|-55|

- src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
- snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

### Metadata
Metadata that is provided along with the data covers:

- Number of cable joints
- Cable properties (length, type, number of sections)
- Relative position of the nodes (location, zero-centered gps)
- Adjacent PV or wallbox installations
- Year of installation w.r.t. the nodes and cables

Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

* * *
## Usage
Simple data access using pandas:

```
import pandas as pd

nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

# read the first 10 rows
data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

# read the row number 5 to 15
data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

# ... same for the edges
```

Compressed csv-data format was used to make sharing as easy as possible, however it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail in collecting the data, wherefore the number of measurements for a specific timestamp differs. This, plus the high sparsity of the graph, leads to a high inefficiency when using the csv-format for an ML training.
To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).

### Example use case (voltage forecasting)

Forecasting of the voltage is one potential use cases. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. Thereby, a MinMax scaling was used as simple preprocessing and a PyTorch dataset class was created to handle the data. Furthermore, a vanilla autoencoder is utilized to process and forecast the voltage into the future.
NTU60 Processed Skeleton Dataset
kaggle.com
zip
Updated Aug 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oucherif Mohammed Ouail (2025). NTU60 Processed Skeleton Dataset [Dataset]. https://www.kaggle.com/datasets/oucherifouail/ntu60-processed-skeleton-dataset
Explore at:
zip(3075187118 bytes)Available download formats
Dataset updated
Aug 29, 2025
Authors
Oucherif Mohammed Ouail
Description
NTU RGB+D 60 – Preprocessed Skeleton Dataset

This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.

The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.

Preprocessing Steps

Each skeleton sequence was processed by:

✅ Removing NaN / invalid frames

✅ Translating skeletons (centered spine base joint at origin)

✅ Normalizing body scale using spine length

✅ Aligning all sequences to 300 frames (padding or truncation)

✅ Formatting sequences to include up to 2 persons per clip

Output Files

Two .npz files are provided, following the standard evaluation protocols:

NTU60_CS.npz → Cross-Subject split

NTU60_CV.npz → Cross-View split

Each file contains:

x_train → Training data, shape (N_train, 300, 150)

y_train → Training labels, shape (N_train, 60) (one-hot)

x_test → Testing data, shape (N_test, 300, 150)

y_test → Testing labels, shape (N_test, 60) (one-hot)

Data Format

300 = max frames per sequence (zero-padded)

150 = 2 persons × 25 joints × 3 coordinates (x, y, z)

60 = number of action classes

If a sequence has only 1 person, the second person’s features are zero-filled.

Skeleton Properties

Centered → Spine base joint (joint-2) at origin (0,0,0)

Normalized → Body size scaled consistently

Aligned → Fixed-length sequences (300 frames)

Two-person setting → Always represented with 150 features

Evaluation Protocols

Cross-Subject (CS): Train and test sets split by different actors. The model is evaluated on unseen subjects to measure generalization across people.

Cross-View (CV): Train and test sets split by different camera views. The model is evaluated on unseen viewpoints to measure viewpoint invariance.

Usage

These .npz files can be directly loaded in PyTorch or NumPy-based pipelines. They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.

Example:

import numpy as np data = np.load("NTU60_CS.npz") x_train, y_train = data["x_train"], data["y_train"] print(x_train.shape) # (N_train, 300, 150) print(y_train.shape) # (N_train, 60)

Facebook

Twitter

Click to copy link

Link copied

Cite

Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data

Pytorch Models

Deep learning class function

Explore at:

zip(21493 bytes)Available download formats

Dataset updated

May 10, 2025

Authors

Sufian Othman

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

✅ Step 1: Mount to Dataset

Search for my dataset pytorch-models and add it — this will mount it at:

/kaggle/input/pytorch-models/

✅ Step 2: Check file paths Once mounted, the four files will be available at:

/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py

✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):

import shutil

shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')

✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:

import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models

Or, if you only want to import specific classes or functions:

from base_models import YourModelClass
from ext_base_models import AnotherModelClass

✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:

model = base_models.YourModelClass()
output = model(input_data)

Clear search

Close search

Google apps

Main menu

Pytorch Models

Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST...

Oxford 102 Flower Dataset

Overview

Directory Structure

Visualization of the dataset

Publications

Source

bigearthnet

cifar-100-python

CIFAR-100 Dataset

1. Overview

2. Dataset Details

3. Fine and Coarse Labels

4. Applications

5. Challenges

6. File Format

7. Example Classes

MIEDT dataset

Data from: Deep learning neural network derivation and testing to...

SimCATS_GaAs_v1_random_variations_v2

GISE-51

Dataset for class comment analysis

Structure

Contents of the Replication Package

Genomics OOD

Bacteria Genomics OOD dataset

Attribution

Imbalanced Cifar-10

Data from: Self-Supervised Representation Learning on Neural Network Weights...

Cleaned ISIC Skin Cancer Dataset (6 Classes)

Sentence/Table Pair Data from Wikipedia for Pre-training with...

phishing-email-classifier-bert

This dataset has: Classifier save, Tokenizer save, Encoded Phishing Email Dataset!

Encoded dataset

Model and Tokenizer

Dataset Information

Citations

SynthCave: 3D Odometry Estimation

1) Dataset Content

2) Citation

3) Usage

3.1) Graph

3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

NTU60 Processed Skeleton Dataset

NTU RGB+D 60 – Preprocessed Skeleton Dataset

Preprocessing Steps

Output Files

Data Format

Skeleton Properties

Evaluation Protocols

Usage

Pytorch Models

Deep learning class function