100+ datasets found
  1. nuScenes fog augmented samples for bad weather pre

    • kaggle.com
    Updated Sep 17, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dipkumar Patel (2023). nuScenes fog augmented samples for bad weather pre [Dataset]. https://www.kaggle.com/datasets/dipkumar/nuscenes-fog-augmented-samples-for-bad-weather-pre
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 17, 2023
    Dataset provided by
    Kaggle
    Authors
    Dipkumar Patel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    nuScenes is a public large-scale dataset for autonomous driving. It enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car.

    I used the dataset and augmented fog in it for my research. This dataset can be used to train your road, lane or traffic light detection models in bad weather conditions. Level 5 autonomous vehicles should work well in all weather conditions and these datasets help you to test if your trained model is performing well in inclined weather conditions or not.

    Note: This dataset used images from nuScenes dataset which is open for non-Commercial Use only and it also applies to this dataset as well.

  2. linto-dataset-audio-ar-tn-augmented

    • huggingface.co
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    LINAGORA Labs (2025). linto-dataset-audio-ar-tn-augmented [Dataset]. https://huggingface.co/datasets/linagora/linto-dataset-audio-ar-tn-augmented
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Linagora
    Authors
    LINAGORA Labs
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LinTO DataSet Audio for Arabic Tunisian Augmented A collection of Tunisian dialect audio and its annotations for STT task

    This is the augmented datasets used to train the Linto Tunisian dialect with code-switching STT linagora/linto-asr-ar-tn.

    Dataset Summary Dataset composition Sources Content Types Languages and Dialects

    Example use (python) License Citations

      Dataset Summary
    

    The LinTO DataSet Audio for Arabic Tunisian Augmented is a dataset that builds on LinTO… See the full description on the dataset page: https://huggingface.co/datasets/linagora/linto-dataset-audio-ar-tn-augmented.

  3. f

    Datasets GO ID/attribute p-value q-value.

    • figshare.com
    xls
    Updated Jul 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu (2024). Datasets GO ID/attribute p-value q-value. [Dataset]. http://doi.org/10.1371/journal.pone.0305857.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traditional differential expression genes (DEGs) identification models have limitations in small sample size datasets because they require meeting distribution assumptions, otherwise resulting high false positive/negative rates due to sample variation. In contrast, tabular data model based on deep learning (DL) frameworks do not need to consider the data distribution types and sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA and DL framework-based tabular data model, propose a model TabDEG, to predict DEGs and their up-regulation/down-regulation directions from gene expression data obtained from the Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiment shows that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional small sample size datasets and validates that TabDEG-predicted DEGs are mapped to important gene ontology terms and pathways associated with cancer.

  4. S

    Research on Suicidal Ideation Data Augmentation and Recognition Technology...

    • scidb.cn
    Updated Mar 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhang Yanbo; Phoenix; Mo Liuling; Liu Xiaoqian; Zhu Tingshao (2025). Research on Suicidal Ideation Data Augmentation and Recognition Technology Based on Large Language Models (Dataset) [Dataset]. http://doi.org/10.57760/sciencedb.22432
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Zhang Yanbo; Phoenix; Mo Liuling; Liu Xiaoqian; Zhu Tingshao
    License

    https://mit-license.orghttps://mit-license.org

    Description

    Research 1 Training Dataset (OurAugSGD):This dataset is used for training the suicide ideation data augmentation model in Research 1. The construction method is as follows: To achieve sufficient data augmentation effects, this study employs a combination of zero-shot and few-shot approaches to build the training dataset. By equally incorporating zero-shot data and few-shot data (4000 entries in total), a high-quality dataset named OurAugSGD is formed.Research 1 Test Dataset and Test Results:This dataset is used for evaluating the suicide ideation recognition model in Research 2. The test dataset randomly selects 50 positive samples from the original dataset, which undergo the same prompt engineering processing as the training dataset OurAugSGD for model evaluation. These samples are guaranteed to be non-overlapping with the training dataset to ensure the validity of test results and the model's performance on unseen data.After inferencing the test dataset through various models (baseline models and experimental models) and conducting respective data augmentations, a total of 2028 generated text results are obtained. These results are subject to consistent manual annotation by 6 groups of raters, yielding the final test results.Research 2 Training Dataset (OurDetSGD):This dataset is used for training the suicide ideation recognition model in Research 2. The construction method is as follows: First, 2000 positive samples and 4000 negative samples are randomly extracted from the original dataset of Research 1. These samples are fused with 2000 samples generated by the self-developed model OurAugSTM, resulting in 8000 text entries with a 1:1 positive-negative ratio, which serves as the training dataset OurDetSGD.Research 2 Training Dataset (OriginDetSGD):This dataset is used for training the suicide ideation recognition model in Research 2. The construction method is as follows: First, 2000 positive samples and 4000 negative samples are randomly extracted from the original dataset of Research 1. These samples are fused to form 6000 text entries with a 1:2 positive-negative ratio, which serves as the training dataset OriginDetSGD.Research 2 Test Dataset:This dataset is used for evaluating the suicide ideation recognition model in Research 2. The test dataset is constructed following strict non-overlapping principles: after excluding samples used in the training dataset OurDetSGD, 1000 entries (500 positive and 500 negative samples, maintaining a 1:1 ratio) are randomly extracted from the original dataset (excluding data used in Research 1). Similar to the training dataset OurDetSGD, the test dataset undergoes prompt engineering processing to ensure format consistency with the training dataset, guaranteeing validity and consistency during model evaluation.

  5. R

    Table Extraction Pdf Dataset

    • universe.roboflow.com
    zip
    Updated Nov 4, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohamed Traore (2022). Table Extraction Pdf Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/table-extraction-pdf/model/6
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 4, 2022
    Dataset authored and provided by
    Mohamed Traore
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Data Table Bounding Boxes
    Description

    The dataset comes from Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure - creators of CascadeTabNet.

    Depending on the dataset version downloaded, the images will include annotations for 'borderless' tables, 'bordered' tables', and 'cells'. Borderless tables are those in which every cell in the table does not have a border. Bordered tables are those in which every cell in the table has a border, and the table is bordered. Cells are the individual data points within the table.

    A subset of the full dataset, the ICDAR Table Cells Dataset, was extracted and imported to Roboflow to create this hosted version of the Cascade TabNet project. All the additional dataset components used in the full project are available here: All Files.

    Versions:

    1. Version 1, raw-images : 342 raw images of tables. No augmentations, preprocessing step of auto-orient was all that was added.
    2. Version 2, tableBordersOnly-rawImages : 342 raw images of tables. This dataset version contains the same images as version 1, but with the caveat of Modify Classes being applied to omit the 'cell' class from all images (rendering these images to be apt for creating a model to detect 'borderless' tables and 'bordered' tables.

    For the versions below: Preprocessing step of Resize (416by416 Fit within-white edges) was added along with more augmentations to increase the size of the training set and to make our images more uniform. Preprocessing applies to all images whereas augmentations only apply to training set images. 3. Version 3, augmented-FAST-model : 818 raw images of tables. Trained from Scratch (no transfer learning) with the "Fast" model from Roboflow Train. 3X augmentation (generated images). 4. Version 4, augmented-ACCURATE-model : 818 raw images of tables. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation. 5. Version 5, tableBordersOnly-augmented-FAST-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Fast" model from Roboflow Train. 3X augmentation. 6. Version 6, tableBordersOnly-augmented-ACCURATE-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation.

    Example Image from the Datasethttps://i.imgur.com/ruizSQN.png" alt="Example Image from the Dataset">

    Cascade TabNet in Actionhttps://i.imgur.com/nyn98Ue.png" alt="Cascade TabNet in Action"> CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank dataset. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset.

    From the Original Authors:

    If you find this work useful for your research, please cite our paper: @misc{ cascadetabnet2020, title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents}, author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure}, year={2020}, eprint={2004.12629}, archivePrefix={arXiv}, primaryClass={cs.CV} }

  6. m

    Database of scalable training of neural network potentials for complex...

    • archive.materialscloud.org
    bz2, text/markdown +1
    Updated Apr 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith; In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith (2025). Database of scalable training of neural network potentials for complex interfaces through data augmentation [Dataset]. http://doi.org/10.24435/materialscloud:w6-9a
    Explore at:
    bz2, text/markdown, txtAvailable download formats
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Materials Cloud
    Authors
    In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith; In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database contains the reference data used for direct force training of Artificial Neural Network (ANN) interatomic potentials using the atomic energy network (ænet) and ænet-PyTorch packages (https://github.com/atomisticnet/aenet-PyTorch). It also includes the GPR-augmented data used for indirect force training via Gaussian Process Regression (GPR) surrogate models using the ænet-GPR package (https://github.com/atomisticnet/aenet-gpr). Each data file contains atomic structures, energies, and atomic forces in XCrySDen Structure Format (XSF). The dataset includes all reference training/test data and corresponding GPR-augmented data used in the four benchmark examples presented in the reference paper, "Scalable Training of Neural Network Potentials for Complex Interfaces Through Data Augmentation". A hierarchy of the dataset is described in the README.txt file, and an overview of the dataset is also summarized in supplementary Table S1 of the reference paper.

  7. R

    Hard Hat Workers Dataset

    • universe.roboflow.com
    zip
    Updated Sep 30, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joseph Nelson (2022). Hard Hat Workers Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/hard-hat-workers/model/13
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Workers Bounding Boxes
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hart.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png" alt="Example Image">

    Use Cases

    One could use this dataset to, for example, build a classifier of workers that are abiding safety code within a workplace versus those that may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes * v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations * v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images * v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied * v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class * v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes * v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes * v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images * v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model * v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model * v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model * v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

    Roboflow Workmark

  8. WaivOps HH-TRP: Open Audio Resources for Machine Learning in Music

    • zenodo.org
    application/gzip +1
    Updated Jun 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Patchbanks (2025). WaivOps HH-TRP: Open Audio Resources for Machine Learning in Music [Dataset]. http://doi.org/10.5281/zenodo.15734094
    Explore at:
    json, application/gzipAvailable download formats
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Patchbanks
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HH-TRP Dataset

    HH-TRP Dataset is an open audio collection of drum recordings in the style of modern hip hop (urban trap) music. It features 15,000 audio loops provided in uncompressed stereo WAV format, along with paired JSON files containing label data for supervised training of generative AI audio models.

    Overview

    The dataset was developed using an algorithmic framework to randomly generate audio loops from a customized database of MIDI patterns and one-shot drum samples. Data augmentation included random sample-swapping to generate unique drum kits and sound effects. It is intended for training or fine-tuning AI models with paired labels, adaptable for prompt-driven drum generation and other supervised learning objectives.

    Its primary purpose is to provide accessible content for machine learning applications in music. Potential use cases include text-to-audio, prompt engineering, feature extraction, tempo detection, audio classification, rhythm analysis, music information retrieval (MIR), sound design and signal processing.

    Specifications

    • 15,000 audio loops (approximately 55 hours)
    • 16-bit WAV format
    • Tempo range: 110-180 BPM
    • Paired label data (WAV + JSON)
    • Variational drum kits and patterns
    • Subgenre styles (drill, trapsoul, cloud rap, emo rap)

    A key map JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.

    License

    This dataset was compiled by WaivOps, a crowdsourced music project managed by Patchbanks. All recordings have been sourced from verified composers and providers for copyright clearance.

    The HH-TRP Dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    Additional Info

    Time signature data has been added to the standard JSON file format.

    For audio examples or more information about this dataset, please refer to the GitHub repository.

  9. h

    generated-usa-passeports-dataset

    • huggingface.co
    Updated Jul 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Training Data (2023). generated-usa-passeports-dataset [Dataset]. https://huggingface.co/datasets/TrainingDataPro/generated-usa-passeports-dataset
    Explore at:
    Dataset updated
    Jul 15, 2023
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques involve applying various transformations to existing data samples to create new ones. These transformations include: random rotations, translations, scaling, flips, and more. Augmentation helps in increasing the dataset size, introducing natural variations, and improving model performance by making it more invariant to specific transformations. The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.

  10. R

    Data from: Anotado Dataset

    • universe.roboflow.com
    zip
    Updated Oct 9, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    new-workspace-lzcyx (2021). Anotado Dataset [Dataset]. https://universe.roboflow.com/new-workspace-lzcyx/anotado/dataset/2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 9, 2021
    Dataset authored and provided by
    new-workspace-lzcyx
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Anotado Bounding Boxes
    Description

    https://www.youtube.com/watch?v=4MA_6oZQz7s&ab_channel=tektronix475

    Spotted caps, are the normal OK class (fully closed). Clean caps, are the bad or anomally target class (partially closed). One double prediction at 3:59. 100x100 classification accuracy, out of 200 samples. Inference over unseen test dataset. 150 epochs training. 700 samples training dataset, no data augmentation.

    PREPROCESSING Auto-Orient: Applied Resize: Stretch to 416x416 Grayscale: Applied AUGMENTATIONS No augmentations were applied.

    Anomaly detection with: Roboflow, tensorflow, google colab, Ultralytics, yolo v5, cvat,

  11. R

    Car Highway Dataset

    • universe.roboflow.com
    zip
    Updated Sep 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sallar (2023). Car Highway Dataset [Dataset]. https://universe.roboflow.com/sallar/car-highway/dataset/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Sallar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles Bounding Boxes
    Description

    Car-Highway Data Annotation Project

    Introduction

    In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.

    Project Goals

    • Collect a diverse dataset of car images from highway scenes.
    • Annotate the dataset to identify and label cars within each image.
    • Organize and format the annotated data for machine learning model training.

    Tools and Technologies

    For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.

    Annotation Process

    1. Upload the raw car images to the Roboflow platform.
    2. Use the annotation tools in Roboflow to draw bounding boxes around each car in the images.
    3. Label each bounding box with the corresponding class (e.g., car).
    4. Review and validate the annotations for accuracy.

    Data Augmentation

    Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.

    Data Export

    Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.

    Milestones

    1. Data Collection and Preprocessing
    2. Annotation of Car Images
    3. Data Augmentation
    4. Data Export
    5. Model Training

    Conclusion

    By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.

  12. f

    Examples of the original text after data augmentation using ChatGPT is as...

    • plos.figshare.com
    xls
    Updated Jun 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yapeng Gao; Lin Zhang; Yangshuyi Xu (2024). Examples of the original text after data augmentation using ChatGPT is as follows. [Dataset]. http://doi.org/10.1371/journal.pone.0301508.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Yapeng Gao; Lin Zhang; Yangshuyi Xu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We implement the calculation of cosine similarity using the sklearn package [45].

  13. P

    SpaceNet: A Comprehensive Astronomical Dataset Dataset

    • paperswithcode.com
    Updated May 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). SpaceNet: A Comprehensive Astronomical Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/spacenet-a-comprehensive-astronomical-dataset
    Explore at:
    Dataset updated
    May 19, 2025
    Description

    Description:

    👉 Download the dataset here

    SpaceNet: A Comprehensive Astronomical Dataset, obtained via a novel double-stage augmentation framework called FLARE is a hierarchically structured and high-quality astronomical image dataset. It is meticulously designed for both fine-grained and macro classification tasks. Comprising approximately 12,900 samples, SpaceNet incorporates lower (LR) to higher resolution (HR) conversion with standard augmentations and a diffusion approach for synthetic sample generation. This comprehensive dataset enables superior generalization on various recognition tasks, including classification.

    Download Dataset

    Key Features

    High-Resolution Images: The dataset includes high-quality images that facilitate accurate analysis and classification.

    Hierarchical Structure: The dataset is hierarchically organized to support both macro and fine-grained classification tasks.

    Advanced Augmentation Techniques: Utilizes FLARE framework for double-stage augmentation, enhancing the dataset’s diversity and robustness.

    Synthetic Sample Generation: Employs a diffusion approach to create synthetic samples, boosting the dataset’s size and variability.

    Usage

    SpaceNet is ideal for:

    Training and Evaluation: Developing and testing machine learning models for fine-grained and macro astronomical classification tasks.

    Research: Exploring hierarchical classification approaches within the astronomy domain.

    Model Development: Creating robust models capable of generalizing across both in-domain and out-of-domain datasets.

    Educational Purposes: Providing a rich dataset for educational projects in astronomy and machine learning.

    This dataset is sourced from Kaggle.

  14. m

    Dataset for pest classification in Mango farms from Indonesia

    • data.mendeley.com
    Updated Feb 27, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kusrini Kusrini (2020). Dataset for pest classification in Mango farms from Indonesia [Dataset]. http://doi.org/10.17632/94jf97jzc8.1
    Explore at:
    Dataset updated
    Feb 27, 2020
    Authors
    Kusrini Kusrini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Indonesia
    Description

    The infestation of pests affecting the Mango cultivation in Indonesia has an economic impact in the region. Following the recent development in the field of machine learning, the application of deep-learning models for multi-class pest-classification requires large collection of image samples upon which the algorithms can be trained. Addressing such a requirement the paper presents a detailed outline of the dataset collected from the Mango farms in Indonesia. The data consists of images captured from the Mango farms affected by 15-categories of pests which are identifiable through structural and visual deformity exhibited in the Mango leaves. The collection of the data involved the use of a low-cost sensing equipment that are commonly used by the farmers for collecting images from the farm. The collected data is subjected to two processes, namely the data augmentation process and training of the classification model. The dataset collection consists of 510 images that includes 15-caterogies of pests that affect Mango leaves along with the original appearance of the Mango leaves (resulting in 16-classes) collected over a period of 6 months. For the purposes of training the deep-learning neural network, the images are subjected to data augmentation to expand the dataset and to emulate closely the large-scale data collection process carried out by farmers. The outcome of the data augmentation process results in a total of 62,047 image samples, which are used to train the network. The multi-class classification framework. The training framework presented in the paper builds on the VGG-16 feature extractor and replaces the last 3-year network with a fully connected neural network layers resulting in 16-output classes. The dataset includes the annotation of the image samples for both original images captured from the field and the augmented image samples. Both the original and augmented data has been classified as training, validation and testing. The overall dataset is divided into 3-parts, namely version 0, version 1 and version 2. The version 0 consists of the original data set, with 310 images to be used for training, 103 images to be used for the validation and finally 97 images for testing. The version 1 of the dataset of includes 46,500 image samples for training, following the application of the data augmentation process followed by the 103 original images used for validation and 97 images for testing. Finally, the version 2 of the dataset uses 47, 500 images for training and 15, 450 images for validation and 97 images for the testing. The three versions of the dataset include images available in JPEG format. The visual appearance of the pests captured in the dataset provides an ideal testbed for benchmarking the performance of various deep-learning networks trained to detect specific categories of pests. In addition, the dataset also provides an opportunity to evaluate the impact of data augmentation techniques on the original dataset.

  15. m

    Aruzz22.5K: An Image Dataset of Rice Varieties

    • data.mendeley.com
    Updated Mar 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Md Masudul Islam (2024). Aruzz22.5K: An Image Dataset of Rice Varieties [Dataset]. http://doi.org/10.17632/3mn9843tz2.4
    Explore at:
    Dataset updated
    Mar 12, 2024
    Authors
    Md Masudul Islam
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This extensive dataset presents a meticulously curated collection of low-resolution images showcasing 20 well-established rice varieties native to diverse regions of Bangladesh. The rice samples were carefully gathered from both rural areas and local marketplaces, ensuring a comprehensive and varied representation. Serving as a visual compendium, the dataset provides a thorough exploration of the distinct characteristics of these rice varieties, facilitating precise classification.

    Dataset Composition

    The dataset encompasses 20 distinct classes, encompassing Subol Lota, Bashmoti (Deshi), Ganjiya, Shampakatari, Sugandhi Katarivog, BR-28, BR-29, Paijam, Bashful, Lal Aush, BR-Jirashail, Gutisharna, Birui, Najirshail, Pahari Birui, Polao (Katari), Polao (Chinigura), Amon, Shorna-5, and Lal Binni. In total, the dataset comprises 4,730 original JPG images and 23,650 augmented images.

    Image Capture and Dataset Organization

    These images were captured using an iPhone 11 camera with a 5x zoom feature. Each image capturing these rice varieties was diligently taken between October 18 and November 29, 2023. To facilitate efficient data management and organization, the dataset is structured into two variants: Original images and Augmented images. Each variant is systematically categorized into 20 distinct sub-directories, each corresponding to a specific rice variety.

    Original Image Dataset

    The primary image set comprises 4,730 JPG images, uniformly sized at 853 × 853 pixels. Due to the initial low resolution, the file size was notably 268 MB. Employing compression through a zip program significantly optimized the dataset, resulting in a final size of 254 MB.

    Augmented Image Dataset

    To address the substantial image volume requirements of deep learning models for machine vision, data augmentation techniques were implemented. Total 23,650 images was obtained from augmentation. These augmented images, also in JPG format and uniformly sized at 512 × 512 pixels, initially amounted to 781 MB. However, post-compression, the dataset was further streamlined to 699 MB.

    Dataset Storage and Access

    The raw and augmented datasets are stored in two distinct zip files, namely 'Original.zip' and 'Augmented.zip'. Both zip files contain 20 sub-folders representing a unique rice variety, namely 1_Subol_Lota, 2_Bashmoti, 3_Ganjiya, 4_Shampakatari, 5_Katarivog, 6_BR28, 7_BR29, 8_Paijam, 9_Bashful, 10_Lal_Aush, 11_Jirashail, 12_Gutisharna, 13_Red_Cargo,14_Najirshail, 15_Katari_Polao, 16_Lal_Biroi, 17_Chinigura_Polao, 18_Amon, 19_Shorna5, 20_Lal_Binni.

    Train and Test Data Organization

    To ease the experimenting process for the researchers we have balanced the data and split it in an 80:20 train-test ratio. The ‘Train_n_Test.zip’ folder contains two sub-directories: ‘1_TEST’ which contains 1125 images per class and ‘2_VALID’ which contains 225 images per class.

  16. Aptos and Messidor eye images

    • kaggle.com
    Updated Jun 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anik Bhowmick ae20b102 (2024). Aptos and Messidor eye images [Dataset]. https://www.kaggle.com/datasets/anikbhowmickae20b102/binary-classification-data-aptos-and-messidor
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2024
    Dataset provided by
    Kaggle
    Authors
    Anik Bhowmick ae20b102
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Early detection of Diabetic Retinopathy is a key challenge to prevent a patient from potential vision loss. The task of DR detection often requires special expertise from ophthalmologists. In remote places of the world such facilities may not be available, so In an attempt to automate the detection of DR, machine learning and deep learning techniques can be adopted. Some of the recent papers have proven such success on various publicly available dataset.

    Another challenge of deep learning techniques is the availability of rightly processed standardized data. Cleaning and preprocessing the data often takes much longer time than the model training. As a part of my research work, I had to preprocess the images taken from APTOS and Messidor before training the model. I applied circle-crop and Graham Ben's preprocessing technique and scaled all the images to 512X512 format. Also, I applied the data augmentation technique and increased the number of samples from 3662 data of APTOS to 18310, and 400 messidor samples to 3600 samples. I divided the images into two classes class 0 (NO DR) and class 1 (DR). The large number of data is essential for transfer learning. This process is very cumbersome and time-consuming. So I thought to upload the newly generated dataset in Kaggle so that some people might find it useful for their work. I hope this will help many people. Feel free to use the data.

  17. Open Poetry Vision Dataset

    • universe.roboflow.com
    zip
    Updated Apr 7, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Roboflow (2022). Open Poetry Vision Dataset [Dataset]. https://universe.roboflow.com/roboflow-gw7yv/open-poetry-vision/model/2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 7, 2022
    Dataset authored and provided by
    Roboflowhttps://roboflow.com/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Overview

    The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.

    It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.

    Example Image: https://i.imgur.com/sZT516a.png" alt="Example Image">

    Use Cases

    A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.

    Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Version 5 of this dataset (classes_all_text-raw-images) has all classes remapped to be labeled as "text." This was accomplished by using Modify Classes as a preprocessing step.

    Version 6 of this dataset (classes_all_text-augmented-FAST) has all classes remapped to be labeled as "text." and was trained with Roboflow's Fast Model.

    Version 7 of this dataset (classes_all_text-augmented-ACCURATE) has all classes remapped to be labeled as "text." and was trained with Roboflow's Accurate Model.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

    Roboflow Workmark

  18. f

    Examples of EA selection rules (positive results).

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Examples of EA selection rules (positive results). [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t013
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Examples of EA selection rules (positive results).

  19. h

    High_quality_datasets

    • huggingface.co
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Li yu (2025). High_quality_datasets [Dataset]. https://huggingface.co/datasets/LIxy839/High_quality_datasets
    Explore at:
    Dataset updated
    Apr 13, 2025
    Authors
    Li yu
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This folder contains two datasets: 1. generated_instruction_dataset

    This dataset is generated from Task 2: Self-Augmentation. I randomly sampled 150 single-turn examples from the LIMA dataset. Then, using the backward model fine-tuned in Task 1, we generated instructions based on the original responses. Each data point is a pair of (generated_instruction, original_response), where the instruction is generated by the backward model, and the response is directly taken from LIMA. 2.… See the full description on the dataset page: https://huggingface.co/datasets/LIxy839/High_quality_datasets.

  20. h

    rag-dataset-12000

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neural Bridge AI, rag-dataset-12000 [Dataset]. https://huggingface.co/datasets/neural-bridge/rag-dataset-12000
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Neural Bridge AI
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Retrieval-Augmented Generation (RAG) Dataset 12000

    Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI, and released under Apache license 2.0.

      Dataset Description
    
    
    
    
    
      Dataset Summary
    

    Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach significantly… See the full description on the dataset page: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dipkumar Patel (2023). nuScenes fog augmented samples for bad weather pre [Dataset]. https://www.kaggle.com/datasets/dipkumar/nuscenes-fog-augmented-samples-for-bad-weather-pre
Organization logo

nuScenes fog augmented samples for bad weather pre

Train deep learning models on bad weather conditions dataset for autonomous cars

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 17, 2023
Dataset provided by
Kaggle
Authors
Dipkumar Patel
License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

nuScenes is a public large-scale dataset for autonomous driving. It enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car.

I used the dataset and augmented fog in it for my research. This dataset can be used to train your road, lane or traffic light detection models in bad weather conditions. Level 5 autonomous vehicles should work well in all weather conditions and these datasets help you to test if your trained model is performing well in inclined weather conditions or not.

Note: This dataset used images from nuScenes dataset which is open for non-Commercial Use only and it also applies to this dataset as well.

Search
Clear search
Close search
Google apps
Main menu