84 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Dataset updated
    Aug 8, 2022
    Authors
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The Python code for splitting a dataset into train, test and validation sets.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The Python code for translating the cleaned dataset.
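
    For reference, a split like the one splitting.py performs can be sketched with pandas and scikit-learn (a minimal sketch; the 80/10/10 ratios and the random seed are assumptions, since the actual parameters of splitting.py are not documented here):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Cleaned_Dataset.csv")

    # Hold out 10% for testing, then 10% of the remainder for validation.
    train_val, test = train_test_split(df, test_size=0.10, random_state=42)
    train, validation = train_test_split(train_val, test_size=0.10, random_state=42)

    print(len(train), len(validation), len(test))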

  2. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset            AudioSet
    filename           train/---2_BBVHAA.mp3
    captions_visual    [a man in a black hat and glasses.]
    captions_auditory  [a man speaks and dishes clank.]
    tags               [Speech]

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
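
    Building on the example above, the documented fields can be used to filter the annotations directly in pandas (a minimal sketch; the AudioSet value is taken from the sample record above):

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

    # Keep only clips that have visual captions, then restrict to one source dataset.
    visual = df[df['captions_visual'].notna()]
    audioset = visual[visual['dataset'] == 'AudioSet']
    print(len(audioset))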

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Feb 28, 2021
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
    Explore at:
    unknown (395470535 bytes)
    Dataset updated
    Feb 28, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17th, 2020. It contains more than 5.4K Python repositories hosted on GitHub; see the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset is also de-duplicated using the CD4Py tool, and the list of duplicate files is provided in the duplicate_files.txt file. All of its Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation and test sets by source code files; the list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
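
    The per-split file lists can be recovered from dataset_split.csv along these lines (a minimal sketch using pandas; the two-column layout and the split labels "train"/"valid"/"test" are assumptions about the CSV):

    import pandas as pd

    # Assumed layout: one row per source file with its assigned split.
    splits = pd.read_csv("dataset_split.csv", names=["file", "split"])

    train_files = splits.loc[splits["split"] == "train", "file"].tolist()
    valid_files = splits.loc[splits["split"] == "valid", "file"].tolist()
    test_files = splits.loc[splits["split"] == "test", "file"].tolist()
    print(len(train_files), len(valid_files), len(test_files))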

  4. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Explore at:
    zip (971125684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    Dataset Contents

    The dataset is organized into three standard splits:

    • Train set
    • Validation set
    • Test set

    Each split contains data in multiple formats:

    1. Original JPG images
    2. Segmentation mask JPG images
    3. Parquet files containing flattened image and mask data
    4. Pickle files containing serialized image and mask data

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
    • Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files
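
    The split-at-index layout described above for the Parquet and CSV files can be read back without the provided dataset class (a minimal sketch with pandas; it assumes rows are laid out exactly as described):

    import pandas as pd

    IMAGE_SIZE = (224, 224)
    IMAGE_CHANNELS = 3
    split_at = IMAGE_SIZE[0] * IMAGE_SIZE[1] * IMAGE_CHANNELS  # image pixels come first

    df = pd.read_parquet('train_dataset.parquet', engine='pyarrow')
    data = df.to_numpy()

    images = data[:, :split_at].reshape(-1, 224, 224, 3)
    masks = data[:, split_at:].reshape(-1, 224, 224, 1)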

    Image Preprocessing

    All images were preprocessed with the following operations:

    • Resized to 224×224 pixels using bilinear interpolation
    • Segmentation masks resized to match the images using nearest neighbor interpolation
    • Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

    When used with the provided PyTorch dataset class, images are normalized with:

    • Mean: [0.48235, 0.45882, 0.40784]
    • Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098] (i.e., 1/255 per channel)

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

    Loading time benchmarks from the original implementation:

    • Parquet format: ~1.29 seconds per iteration
    • Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from cat_dataset import CatDataset  # see the CatDataset example above for the full constructor arguments
    
  5. codeparrot-train-more-filtering

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    CodeParrot (2022). codeparrot-train-more-filtering [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned and filtered (train)

      Dataset Description
    

    A dataset of Python files from Github. It is a more filtered version of the train split codeparrot-clean-train of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

    files with a mention of "test file" or "configuration file" or… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering.
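
    The dataset can be loaded with the Hugging Face datasets library (a minimal sketch; streaming avoids downloading the full train split up front, and the "content" field name is an assumption based on the codeparrot-clean family):

    from itertools import islice
    from datasets import load_dataset

    ds = load_dataset("codeparrot/codeparrot-train-more-filtering",
             split="train", streaming=True)

    for example in islice(ds, 2):
      print(example["content"][:200])  # field name assumed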

  6. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Dataset updated
    May 31, 2024
    Description

    A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets were collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which was enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects than RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset    partition   split   refs    images
    refcoco    google      train   40000   19213
    refcoco    google      val     5000    4559
    refcoco    google      test    5000    4527
    refcoco    unc         train   42404   16994
    refcoco    unc         val     3811    1500
    refcoco    unc         testA   1975    750
    refcoco    unc         testB   1810    750
    refcoco+   unc         train   42278   16992
    refcoco+   unc         val     3805    1500
    refcoco+   unc         testA   1975    750
    refcoco+   unc         testB   1798    750
    refcocog   google      train   44822   24698
    refcocog   google      val     5000    4650
    refcocog   umd         train   42226   21899
    refcocog   umd         val     2573    1300
    refcocog   umd         test    5023    2600

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  7. VegeNet - Image datasets and Codes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Explore at:
    zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original : Images of vegetables captured manually in data acquisition stage
    2. vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
    3. non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
    5. food_image_dataset_split : Image dataset (4) split into train and test sets
    6. process : Images created when cropping (pre-processing step) to create dataset (2).
  8. Data from: Keyword extraction datasets for Croatian, Estonian, Latvian and...

    • live.european-language-grid.eu
    binary format
    Updated Jun 3, 2021
    Cite
    (2021). Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8369
    Explore at:
    binary format
    Dataset updated
    Jun 3, 2021
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Estonia
    Description

    EACL Hackashop Keyword Challenge Datasets

    In this repository you can find the ids of articles used for the keyword extraction challenge at the EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to generate the train-test split used in the paper:

    Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.

    Train and test splits are provided for Latvian, Estonian, Russian and Croatian.

    The articles with the corresponding ID-s can be extracted from the following datasets:

    - Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408

    - Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409

    - Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410

    dataset_ids folder is organized in the following way:

    - latvian – containing latvian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the latvian_test.json: a json file with ids from test articles to replicate the data

    - estonian – containing estonian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the estonian_test.json: a json file with ids from test articles to replicate the data

    - russian – containing russian_train.json: a json file with ids from train articles to replicate the train data used in Koloski et al. (2020), the russian_test.json: a json file with ids from test articles to replicate the data

    - croatian - containing croatian_id_train.tsv file with sites and ids (note that just ids are not unique across dataset, therefore site information also needs to be included to obtain a unique article identifier) of articles in the train set, and the croatian_id_test.tsv file with sites and ids of articles in the test set.

    In addition, scripts are provided for extracting the articles (see the folder parse, containing the scripts parse.py and build_croatian_dataset.py; the scripts require the pandas and bs4 Python libraries):

    parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:

    Instructions:

    ESTONIAN-RUSSIAN

    1) Retrieve the data ee_articles_2015_2019.zip

    2) Create a folder 'data' and subfolder 'ee'

    3) Unzip them in the 'data/ee' folder

    To extract train/test Estonian articles:

    run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script

    To extract train/test Russian articles:

    run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script

    LATVIAN:

    1) Retrieve the latvian data

    2) Unzip it in 'data/lv' folder

    3) To extract train/test Latvian articles:

    run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script
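
    The three extraction runs above can also be scripted in one place (a minimal sketch; it assumes parse.py is importable from the working directory, with the data/ee and data/lv folders prepared as described):

    from parse import build_dataset

    build_dataset(lang="ee", opt="nat")  # Estonian train/test articles
    build_dataset(lang="ee", opt="rus")  # Russian train/test articles
    build_dataset(lang="lv", opt="nat")  # Latvian train/test articles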

    build_croatian_dataset.py is used for extraction of Croatian train and test datasets:

    Instructions:

    CROATIAN:

    1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')

    2) put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).

    For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si

  9. Surrogate flood model comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jan 19, 2024
    Cite
    Niels Fraehr (2024). Surrogate flood model comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/24312658.v1
    Explore at:
    bin
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Niels Fraehr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared based on accuracy and computational speed for three distinct case studies, namely Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia).

    The dataset is structured in 5 files - "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum, to run the models the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV" are needed. We suggest using the "Carlisle" case study for initial testing given its small size and small data requirement.

    "Carlisle", "Chowilla", and "BurnettRV" files

    These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between each folder, depending on the hydrodynamic model being emulated and the input boundary conditions (input features). Each case study file has the following folders:

    • Geometry_data: DEM files, .npz files containing the high-fidelity model's grid (XYZ-coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model), and .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
    • XXX_modeldata: Folder for storing trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.
    • HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
    • HF_EOF_analysis: Storage of data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Results_data: Storage of the results of running the evaluation of the surrogate models.
    • Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

    And the following Python files:

    • YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models, for each YYY case study.
    • EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate model for that case study and saving the results for each cross-validation fold.
    • train_test_split: Script for splitting the flood datasets for each cross-validation fold, so all surrogate models train on the same data.
    • XXX_training: Script for training each XXX surrogate model.
    • XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training; these scripts perform that step.

    "Comparison_results" file

    Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Figures are also included.

    "Python_data" file

    Folder containing Python scripts with utility functions for setting up, training, and running the surrogate models, as well as for evaluating them. This folder also contains a python_environment.yml file with all Python package versions and dependencies. It also contains two sub-folders:

    • LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also utilized when working with the other surrogate models.
    • SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have been made for speed and for use in this study.

  10. Waste Classfication Dataset

    • kaggle.com
    Updated Jun 15, 2025
    Cite
    Kaan Çerkez (2025). Waste Classfication Dataset [Dataset]. https://www.kaggle.com/datasets/kaanerkez/waste-classfication-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kaan Çerkez
    License

    CDLA Permissive 1.0: https://cdla.io/permissive-1-0/

    Description

    Balanced Waste Classification Dataset - E-Waste & Mixed Materials

    🎯 Dataset Overview

    This dataset contains a comprehensive collection of waste images designed for training machine learning models to classify different types of waste materials, with a strong focus on electronic waste (e-waste) and mixed materials. The dataset includes 7 electronic device categories alongside traditional recyclable materials, making it ideal for modern waste management challenges where electronic devices constitute a significant portion of waste streams. The dataset has been carefully curated and balanced to ensure optimal performance for multi-category waste classification tasks using deep learning approaches.

    📊 Dataset Statistics

    • Total Classes: 17 different waste categories
    • Images per Class: 400 (balanced)
    • Total Images: 6,800
    • Image Format: RGB (3 channels)
    • Recommended Input Size: 224×224 pixels
    • Data Structure: Single balanced dataset (not pre-split)

    🗂️ Waste Categories

    The dataset includes 17 distinct waste categories covering various types of materials commonly found in waste management scenarios:

    1. Battery - Various types of batteries
    2. Cardboard - Cardboard packaging and boxes
    3. Glass - Glass containers and bottles
    4. Keyboard - Computer keyboards and input devices
    5. Metal - Metal cans and metallic waste
    6. Microwave - Microwave ovens and similar appliances
    7. Mobile - Mobile phones and smartphones
    8. Mouse - Computer mice and peripherals
    9. Organic - Biodegradable organic waste
    10. Paper - Paper products and documents
    11. PCB - Printed Circuit Boards (electronic components)
    12. Plastic - Plastic containers and packaging
    13. Player - Media players and entertainment devices
    14. Printer - Printers and printing equipment
    15. Television - TV sets and display devices
    16. Trash - General mixed waste
    17. Washing Machine - Washing machines and large appliances

    🛠️ Data Processing Pipeline

    1. Data Balancing

    • Undersampling: Applied to classes with >400 images
    • Data Augmentation: Applied to classes with <400 images
    • Target: Exactly 400 images per class for balanced training

    2. Data Augmentation Techniques

    • Rotation: ±20 degrees
    • Width/Height Shift: ±20%
    • Shear Range: 20%
    • Zoom Range: 20%
    • Horizontal Flip: Enabled
    • Fill Mode: Nearest neighbor

    3. Quality Assurance

    • Consistent image dimensions
    • Proper file format validation
    • Balanced class distribution
    • Clean data structure

    🎯 Recommended Use Cases

    Primary Applications

    • E-Waste Classification: Specialized in electronic devices (Mobile, Keyboard, Mouse, PCB, etc.)
    • Mixed Waste Sorting: Traditional recyclables (Paper, Plastic, Glass, Metal, Cardboard)
    • Smart Recycling Systems: Automated waste sorting for both organic and electronic materials
    • Environmental Monitoring: Multi-category waste identification
    • Appliance Recycling: Large appliance classification (Microwave, TV, Washing Machine)

    Special Features

    • Electronic Waste Focus: Strong representation of e-waste categories (7 out of 17 classes)
    • Diverse Material Types: From organic waste to complex electronic devices
    • Real-world Categories: Practical classification for actual waste management scenarios
    • Appliance Recognition: Specialized in identifying large household appliances

    Model Architectures

    • Convolutional Neural Networks (CNN)
    • Transfer Learning with MobileNetV2, ResNet, EfficientNet
    • Vision Transformers (ViT)
    • Custom architectures for waste classification

    📁 Dataset Structure

    balanced_waste_images/
    ├── category_1/
    │  ├── image_001.jpg
    │  ├── image_002.jpg
    │  └── ... (400 images)
    ├── category_2/
    │  ├── image_001.jpg
    │  └── ... (400 images)
    └── ... (17 categories total)
    

    Note: Dataset is not pre-split. Users need to create train/validation/test splits as needed.

    🚀 Getting Started

    Step 1: Data Splitting

    Since the dataset is not pre-split, you'll need to create train/validation/test splits:

    import splitfolders
    
    # Split dataset: 80% train, 10% val, 10% test
    splitfolders.ratio(
      input='balanced_waste_images', 
      output='split_data',
      seed=42, 
      ratio=(.8, .1, .1),
      group_prefix=None,
      move=False
    )
    

    Step 2: Data Loading & Preprocessing

    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Data generators with preprocessing
    train_datagen = ImageDataGenerator(rescale=1./255)
    val_datagen = ImageDataGenerator(rescale=1./255)
    
    train_generator = train_datagen.flow_from_directory(
      'split_data/train/',
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical'
    )
    
    val_generator = val_datagen.flow_from_directory(
      'split_data/val/',
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical'
    )
    
  11. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
      x_train = np.array(f["/x_train"], dtype=np.float32)
      x_val = np.array(f["/x_val"], dtype=np.float32)
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_train = np.array(f["/y_train"], dtype=np.int32)
      y_val = np.array(f["/y_val"], dtype=np.int32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', 'r') as f:  # or any other test file listed above
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', '/x_test');  % or any other test file listed above

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
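
    To feed the arrays into PyTorch after the permutation above, a minimal sketch (normalising to [0, 1] is an assumption; the dataset itself leaves intensities in [0, 255]):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # x_train, y_train as loaded and permuted above.
    train_ds = TensorDataset(torch.from_numpy(x_train / 255.0),
                 torch.from_numpy(y_train).long())
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)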

  12. wikihow

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Dataset updated
    Dec 6, 2022
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:

    • text: WikiHow answer texts.
    • headline: bold lines as summary.

    There are two separate versions:

    • all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
    • sep: consisting of each paragraph and its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5', 'r') as f:
      x_train = np.array(f["/x_train"], dtype=np.float32)
      x_val = np.array(f["/x_val"], dtype=np.float32)
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_train = np.array(f["/y_train"], dtype=np.int32)
      y_val = np.array(f["/y_val"], dtype=np.int32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', 'r') as f:  # or any other test file listed above
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/x_test');  % or any other test file listed above

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  14. wiki_table_questions

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wiki_table_questions [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki_table_questions
    Dataset updated
    Dec 6, 2022
    Description

    The dataset contains table-question pairs and the respective answers. The questions require multi-step reasoning and various data operations such as comparison, aggregation, and arithmetic computation. The tables were randomly selected among Wikipedia tables with at least 8 rows and 5 columns.

    (As per the documentation usage notes)

    • Dev: Mean accuracy over three (not five) splits of the training data. In other words, train on 'split-{1,2,3}-train' and test on 'split-{1,2,3}-dev', respectively, then average the accuracy.

    • Test: Train on 'train' and test on 'test'.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wiki_table_questions', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  15. Fruits Classification 🍇

    • kaggle.com
    zip
    Updated Apr 9, 2023
    Cite
    DeepNets (2023). Fruits Classification 🍇 [Dataset]. https://www.kaggle.com/datasets/utkarshsaxenadn/fruits-classification/suggestions
    Explore at:
    zip (88954615 bytes)
    Dataset updated
    Apr 9, 2023
    Authors
    DeepNets
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The fruit classification dataset is a collection of images of various fruits used for training and testing computer vision models. The dataset includes five different types of fruit:

    • Apples
    • Bananas
    • Grapes
    • Mangoes
    • Strawberries

    Each class contains 2000 images, resulting in a total of 10,000 images in the dataset.

    The images in the dataset are of various shapes, sizes, and colors, and have been captured under different lighting conditions. The dataset is useful for training and testing models that perform tasks such as object detection, image classification, and segmentation.

    The dataset can be used for various research projects, such as developing and testing new image classification algorithms, and for benchmarking existing algorithms. The dataset can also be used to train machine learning models that can be used in real-world applications, such as in the agricultural industry for fruit grading and sorting.

    Overall, the fruit classification dataset is a valuable resource for researchers and developers working in the field of computer vision, and its availability will help advance the development of new algorithms and technologies for image analysis and classification.

    Data Structure

    The data is split into three sets: training, validation, and testing. The training set is used to train the model, while the validation set is used to evaluate the model's performance during training and make adjustments as necessary. The testing set is used to evaluate the final performance of the model after training is complete.

    The dataset is split based on a ratio of 97% for training, 2% for validation, and 1% for testing. This means that the training set contains 9700 images (97% of the total), the validation set contains 200 images (2% of the total), and the testing set contains 100 images (1% of the total).

    Each class in the dataset is split into three sets based on the ratio. For example, for the "Apple" class, 97% (1940 images) are used for training, 2% (40 images) are used for validation, and 1% (20 images) are used for testing. This ensures that the distribution of classes is consistent across all three sets and that the model is trained on a representative sample of all classes.

    Overall, the split of the dataset into training, validation, and testing sets ensures that the model is robust and generalizes well to new, unseen data.

    Python Script

    The script provided creates train, validation, and test sets from a fruit image dataset by splitting the dataset into predetermined ratios, shuffling the images, and moving them to their respective directories.
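
    A script of that shape might look as follows (a minimal sketch; the source folder name, the .jpg extension, and the seed are assumptions, while the 97/2/1 ratios follow the description above):

    import random
    import shutil
    from pathlib import Path

    random.seed(42)

    SRC = Path("fruits")          # hypothetical source folder: one sub-folder per class
    RATIOS = (0.97, 0.02, 0.01)   # train / validation / test

    for class_dir in sorted(SRC.iterdir()):
      if not class_dir.is_dir():
        continue
      images = sorted(class_dir.glob("*.jpg"))
      random.shuffle(images)
      n_train = int(len(images) * RATIOS[0])
      n_valid = int(len(images) * RATIOS[1])
      splits = {
        "train": images[:n_train],
        "valid": images[n_train:n_train + n_valid],
        "test": images[n_train + n_valid:],
      }
      for split_name, files in splits.items():
        dest = Path(split_name) / class_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files:
          shutil.copy2(f, dest / f.name)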

  16. Egohands Dataset

    • universe.roboflow.com
    zip
    Updated Apr 22, 2022
    Cite
    Brad Dwyer (2022). Egohands Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/egohands-public/model/5
    Explore at:
    zip
    Dataset updated
    Apr 22, 2022
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Hands Bounding Boxes
    Description

    EgoHands Dataset (image): https://i.imgur.com/eEWi4PT.png

    About this dataset

    The EgoHands dataset is a collection of 4800 annotated images of human hands from a first-person view originally collected and labeled by Sven Bambach, Stefan Lee, David Crandall, and Chen Yu of Indiana University.

    The dataset was captured via frames extracted from video recorded through head-mounted cameras on a Google Glass headset while performing four activities: building a puzzle, playing chess, playing Jenga, and playing cards. There are 100 labeled frames for each of 48 video clips.

    Our modifications

    The original EgoHands dataset was labeled with polygons for segmentation and released in a Matlab binary format. We converted it to an object detection dataset using a modified version of this script from @molyswu and have archived it in many popular formats for use with your computer vision models.

    After converting to bounding boxes for object detection, we noticed that there were several dozen unlabeled hands. We added these by hand and improved several hundred of the other labels that did not fully encompass the hands (usually to include omitted fingertips, knuckles, or thumbs). In total, 344 images' annotations were edited manually.

    We chose a new random train/test split of 80% training, 10% validation, and 10% testing. Notably, this is not the same split as in the original EgoHands paper.

    There are two versions of the converted dataset available: * specific is labeled with four classes: myleft, myright, yourleft, yourright representing which hand of which person (the viewer or the opponent across the table) is contained in the bounding box. * generic contains the same boxes but with a single hand class.

    Using this dataset

    The authors have graciously allowed Roboflow to re-host this derivative dataset. It is released under a Creative Commons by Attribution 4.0 license. You may use it for academic or commercial purposes but must cite the original paper.

    Please use the following BibTeX:

    @inproceedings{egohands2015iccv,
     title = {Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions},
     author = {Sven Bambach and Stefan Lee and David Crandall and Chen Yu},
     booktitle = {IEEE International Conference on Computer Vision (ICCV)},
     year = {2015}
    }

  17. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
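
    The described pipeline can be sketched end-to-end with the libraries listed above (a minimal sketch; it assumes place_or_event_id is numeric and uses only the two documented columns, so the real feature handling may differ):

    import joblib
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("user_ratings_dataset.csv")

    X = df[["place_or_event_id"]]  # documented feature column
    y = df["rating"]

    # 80/20 train/test split, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42)

    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_train, y_train)
    print("R^2 on the test set:", model.score(X_test, y_test))

    joblib.dump(model, "tour_recommendation_model.pkl")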

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  18. IMDB_from_torchtext

    • kaggle.com
    zip
    Updated Dec 12, 2021
    Cite
    Andrew Tu (2021). IMDB_from_torchtext [Dataset]. https://www.kaggle.com/datasets/tusonggao/imdb-from-torchtext/discussion
    Explore at:
    Available download formats: zip (25,846,530 bytes)
    Dataset updated
    Dec 12, 2021
    Authors
    Andrew Tu
    Description

    Context

    This is the IMDB dataset from torchtext, with its original train/test split: 25,000 reviews for training and 25,000 for testing.

    NOTE

    There are 96 duplicated rows in imdb_train.csv. If you want to split a dev set off the training data, deduplicate first:

    import pandas as pd

    df_train = pd.read_csv('./imdb_train.csv')
    df_train = df_train.drop_duplicates()  # removes the 96 duplicated rows
    print('after drop_duplicates, df_train.shape:', df_train.shape)
    

    There are also duplicated rows in imdb_test.csv, but those are left as-is here.
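
    For example, a dev set could be carved off the deduplicated training data; the 10% ratio below is an arbitrary choice, not part of the dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df_train = pd.read_csv('./imdb_train.csv').drop_duplicates()

    # Hold out 10% of the deduplicated training rows as a dev set.
    df_train, df_dev = train_test_split(df_train, test_size=0.1, random_state=42)
    print('train:', df_train.shape, 'dev:', df_dev.shape)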

    The script that created this data

    https://www.kaggle.com/tusonggao/get-imdb-data-from-torchtext/notebook

  19. wiki40b

    • tensorflow.org
    • opendatalab.com
    • +1 more
    Updated Aug 30, 2023
    + more versions
    Cite
    (2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
    Explore at:
    Dataset updated
    Aug 30, 2023
    Description

    Cleaned-up text for the pages of entities in 40+ Wikipedia language editions. The dataset has train/dev/test splits per language. It was cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after processing that removes non-content sections and structured objects. The language models trained on this corpus (41 monolingual models and 2 multilingual models) can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wiki40b', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
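
    Each language edition is exposed as its own config, so a single language and its dev split can be loaded directly; a minimal sketch, assuming the 'en' config name from the TFDS catalog:

    import tensorflow_datasets as tfds

    # Load the validation (dev) split of the English edition.
    ds = tfds.load('wiki40b/en', split='validation')
    for ex in ds.take(1):
        print(ex['wikidata_id'].numpy(), ex['text'].numpy()[:200])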

  20. Skin Cancer Classification Images

    • kaggle.com
    zip
    Updated Dec 1, 2024
    Cite
    Rik (2024). Skin Cancer Classification Images [Dataset]. https://www.kaggle.com/datasets/rimkomatic/skin-cancer/discussion
    Explore at:
    Available download formats: zip (5,195,283,401 bytes)
    Dataset updated
    Dec 1, 2024
    Authors
    Rik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Skin Cancer Classification Dataset

    Overview

    The Skin Cancer Classification Dataset is designed to support the development and evaluation of machine learning models for classifying skin cancer images into 8 distinct classes. This dataset provides a robust foundation for training, validating, and testing image classification models, particularly for deep learning frameworks.

    Features

    • Total Classes: 8 types of skin cancer.
    • Image Data: Preprocessed and standardized for efficient training.
    • Data Splits: The dataset is divided into:
      • Training Set
      • Validation Set
      • Test Set
    • File Format: features and labels are stored as pickle files:
      • train_x.pkl, train_y.pkl
      • val_x.pkl, val_y.pkl
      • test_x.pkl, test_y.pkl

    Dataset Structure

    Split      | Features File | Labels File | Description
    -----------|---------------|-------------|-----------------------------------------------
    Training   | train_x.pkl   | train_y.pkl | Input features and labels for training
    Validation | val_x.pkl     | val_y.pkl   | Data used for model evaluation during training
    Testing    | test_x.pkl    | test_y.pkl  | Data for final performance testing

    Input Details

    • Image Shape: (224, 224, 3) (Height, Width, Channels)
    • Label Encoding: One-hot or integer-encoded labels for 8 classes.

    Applications

    This dataset is ideal for:

    • Building deep learning models for multi-class image classification.
    • Experimenting with transfer learning and ensemble methods.
    • Developing tools for skin cancer detection in clinical applications.

    Instructions

    1. Loading Data

    The dataset is saved as pickle files for efficient storage and loading. Use the following Python code to load the data:

    import pickle
    
    # Example: Loading training data
    with open('train_x.pkl', 'rb') as f:
      train_x = pickle.load(f)
    
    with open('train_y.pkl', 'rb') as f:
      train_y = pickle.load(f)
    
    print("Training data loaded successfully!")
    

    2. Training a Model

    The dataset is compatible with popular deep learning frameworks such as TensorFlow and PyTorch. Preprocess the data as your model requires; a minimal sketch follows below.
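
    Here is a minimal TensorFlow/Keras baseline matching the documented input shape and class count; the architecture, rescaling step, and training settings are illustrative assumptions, not part of the dataset:

    import tensorflow as tf

    # Minimal baseline: assumes one-hot labels of shape (N, 8) and raw 0-255
    # pixel values; for integer labels, use sparse_categorical_crossentropy.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 224, 3)),
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(8, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_x, train_y,
              validation_data=(val_x, val_y),
              epochs=5, batch_size=32)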

    Acknowledgements

    This dataset was prepared with the goal of aiding researchers and developers in advancing skin cancer detection technologies. Special thanks to all contributors and sources for the dataset's creation.
