License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset Card for Synthetic Text Dataset - Augmentation Example
Dataset Summary
This dataset demonstrates text data augmentation. Starting from 100 original short text samples, multiple augmentation techniques were applied to expand the dataset to 1,000 samples.
Purpose
The dataset was created as part of a course exercise to explore text augmentation and its effect on classification tasks.
Composition
Instances: 100 original + 1200 augmented = 1,300… See the full description on the dataset page: https://huggingface.co/datasets/madhavkarthi/2025-24679-hw1-text-dataset-mkarthik.
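A minimal sketch of the kind of text augmentation the card describes, using synonym replacement and random word swaps; the toy synonym table and helper names are illustrative, not the dataset's actual pipeline:

import random

SYNONYMS = {'quick': ['fast', 'speedy'], 'happy': ['glad', 'joyful']}  # toy synonym table

def synonym_replace(text, p=0.3):
    # Replace words that have a synonym entry with probability p
    return ' '.join(
        random.choice(SYNONYMS[w.lower()]) if w.lower() in SYNONYMS and random.random() < p else w
        for w in text.split()
    )

def random_swap(text):
    # Swap two random word positions to create a perturbed variant
    words = text.split()
    if len(words) > 1:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return ' '.join(words)

originals = ['the quick brown fox', 'a happy little robot']
augmented = [f(s) for s in originals for f in (synonym_replace, random_swap)]

Applying several such transforms per original sentence is how a set of 100 samples can be expanded roughly tenfold.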
This dataset was created by Thomas S Visser.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Examples is a dataset for object detection tasks - it contains Ex annotations for 902 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains synthetic video samples generated from a 10-class subset of Tiny ImageNet using Stable Video Diffusion (SVD). It is designed to evaluate the impact of generative temporal augmentation on image classification performance.
Each training and validation video corresponds to a single image augmented into a sequence of frames.
Videos are stored in .mp4 format and labeled via train.csv and val.csv.
Sources:
* Tiny ImageNet: Stanford CS231n
* SVD model: Stable Video Diffusion
* License: Creative Commons Attribution 4.0 International (CC BY 4.0)
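A minimal sketch of how the videos might be consumed, assuming train.csv contains video_path and label columns (the column names are an assumption, not confirmed by the card):

import csv
import cv2  # OpenCV, for decoding the .mp4 files

def load_videos(csv_path):
    # Yield (frames, label) pairs, one per augmented video
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            cap = cv2.VideoCapture(row['video_path'])  # assumed column name
            frames = []
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                frames.append(frame)
            cap.release()
            yield frames, row['label']  # assumed column name

for frames, label in load_videos('train.csv'):
    print(label, len(frames))
    break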
As per our latest research, the global Data Augmentation Tools market size reached USD 1.47 billion in 2024, reflecting the rapidly increasing adoption of artificial intelligence and machine learning across diverse sectors. The market is experiencing robust momentum, registering a CAGR of 25.3% from 2025 to 2033. By the end of 2033, the Data Augmentation Tools market is forecasted to reach a substantial value of USD 11.6 billion. This impressive growth is primarily driven by the escalating need for high-quality, diverse datasets to train advanced AI models, coupled with the proliferation of digital transformation initiatives across industries.
The primary growth factor fueling the Data Augmentation Tools market is the exponential rise in AI and machine learning applications, which require vast amounts of labeled data for effective training. As organizations strive to develop more accurate and robust models, the demand for data augmentation solutions that can synthetically expand and diversify datasets has surged. This trend is particularly pronounced in sectors such as healthcare, automotive, and retail, where the quality and quantity of data directly impact the performance and reliability of AI systems. The market is further propelled by the increasing complexity of data types, including images, text, audio, and video, necessitating sophisticated augmentation tools capable of handling multimodal data.
Another significant driver is the growing focus on reducing model bias and improving generalization capabilities. Data augmentation tools enable organizations to generate synthetic samples that account for various real-world scenarios, thereby minimizing overfitting and enhancing the robustness of AI models. This capability is critical in regulated industries like BFSI and healthcare, where the consequences of biased or inaccurate models can be severe. Furthermore, the rise of edge computing and IoT devices has expanded the scope of data augmentation, as organizations seek to deploy AI solutions in resource-constrained environments that require optimized and diverse training datasets.
The proliferation of cloud-based solutions has also played a pivotal role in shaping the trajectory of the Data Augmentation Tools market. Cloud deployment offers scalability, flexibility, and cost-effectiveness, allowing organizations of all sizes to access advanced augmentation capabilities without significant infrastructure investments. Additionally, the integration of data augmentation tools with popular machine learning frameworks and platforms has streamlined adoption, enabling seamless workflow integration and accelerating time-to-market for AI-driven products and services. These factors collectively contribute to the sustained growth and dynamism of the global Data Augmentation Tools market.
From a regional perspective, North America currently dominates the Data Augmentation Tools market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading technology companies, robust investment in AI research, and early adoption of digital transformation initiatives have established North America as a key hub for data augmentation innovation. Meanwhile, Asia Pacific is poised for the fastest growth over the forecast period, driven by the rapid expansion of the IT and telecommunications sector, burgeoning e-commerce industry, and increasing government initiatives to promote AI adoption. Europe also maintains a significant market presence, supported by stringent data privacy regulations and a strong focus on ethical AI development.
The Component segment of the Data Augmentation Tools market is bifurcated into Software and Services, each playing a critical role in enabling organizations to leverage data augmentation for AI and machine learning initiatives. The software sub-segment comprises…
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
![Example Image](https://i.imgur.com/7spoIJT.png)
One could use this dataset to, for example, build a classifier that distinguishes workers abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.
Use the Fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers using Roboflow's workflow reduce their code by 50%, automate annotation quality assurance, save training time, and increase model reproducibility.

Dataset Card for "kaggle-mbti-cleaned-augmented"
This dataset is built upon Shunian/kaggle-mbti-cleaned to address the sample imbalance problem. Thanks to the Parrot Paraphraser and NLPAug, some of the class skew in the training data is addressed, growing it from 328,660 samples to 478,389 samples in total. View GitHub for more information.
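A minimal sketch of paraphrase-based rebalancing with the Parrot Paraphraser the card credits; the model tag follows the Parrot project's published example, and the minority-sample loop is an illustrative assumption:

from parrot import Parrot

# T5-based paraphraser released by the Parrot project
parrot = Parrot(model_tag='prithivida/parrot_paraphraser_on_T5', use_gpu=False)

minority_samples = ['I prefer quiet evenings with a good book.']  # placeholder minority-class texts
augmented = []
for text in minority_samples:
    # augment() returns (paraphrase, score) tuples, or None if nothing qualifies
    for paraphrase, _score in parrot.augment(input_phrase=text) or []:
        augmented.append(paraphrase)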
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
An example and methodological note describing the process of generating paraphrases on the basis of a single tweet.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.
For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.
Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.
Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.
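Comparable rotation, flipping, and resizing augmentations can also be reproduced outside Roboflow; a minimal sketch with torchvision (the parameters and file path are illustrative):

from PIL import Image
from torchvision import transforms

# Rotation, flipping, and resizing similar to Roboflow's augmentation options
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Resize((640, 640)),
])

image = Image.open('highway_car.jpg')  # placeholder path
augmented_image = augment(image)

For object detection, the bounding boxes must be transformed along with the pixels; annotation platforms handle this bookkeeping automatically.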
The samples of a single category in the training set before and after augmentation.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Examples of SA selection rules (negative results).
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Shoe Size Measurements Tabular Dataset
Dataset Summary
Purpose: This dataset was created for tabular data analysis and prediction tasks involving shoe measurements, developed as part of CMU 24-679 coursework to explore tabular data augmentation techniques. Quick Stats:
* 338 total samples (30 original + 308 augmented)
* 3 numerical features + 3 categorical features
* High correlation between size measurements (>0.97)
* ~10x augmentation factor
Contact: maryzhang@cmu.edu… See the full description on the dataset page: https://huggingface.co/datasets/maryzhang/hw1-24679-tabular-dataset.
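A minimal sketch of the roughly 10x tabular augmentation the card reports, jittering numeric columns with Gaussian noise; the column names are hypothetical, not the dataset's actual schema:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'foot_length_cm': [25.1, 27.3, 24.0],  # hypothetical numeric features
    'foot_width_cm': [9.8, 10.4, 9.5],
    'shoe_size_eu': [40.0, 43.0, 38.0],
    'gender': ['F', 'M', 'F'],             # hypothetical categorical feature
})

copies = []
for _ in range(9):  # originals plus nine jittered copies gives ~10x
    copy = df.copy()
    for col in ['foot_length_cm', 'foot_width_cm', 'shoe_size_eu']:
        # Small Gaussian noise scaled to each column's spread
        copy[col] += rng.normal(0, 0.02 * df[col].std(), len(df))
    copies.append(copy)

augmented = pd.concat([df, *copies], ignore_index=True)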
Collection of handwritten digits from various languages: English, Hindi, Telugu, and Arabic. Each of these 4 datasets contains 4 files: x_train.npy, x_test.npy, y_train.npy and y_test.npy.
Each dataset's training split has 80,000 samples and its testing split has 20,000 samples, for a total of 100,000 samples (10,000 for each digit).
The English dataset was obtained from MNIST (70k samples), Hindi and Telugu from CMATERdb (3k samples each), and Arabic from MADBase (70k samples).
Data augmentation was then applied to each dataset to bring each one up to 100,000 samples.
Finally, these 4 were combined to form a new dataset, THEA (Telugu, Hindi, English, Arabic), with 400,000 samples: 320,000 in the training set and the remaining 80,000 in the testing set.
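A minimal sketch of how digit images stored as .npy arrays might be expanded, using small random translations; the shift-based scheme is an assumption, not the authors' documented pipeline:

import numpy as np
from scipy.ndimage import shift

x_train = np.load('x_train.npy')  # e.g. (N, 28, 28) grayscale digits
y_train = np.load('y_train.npy')

rng = np.random.default_rng(0)

def jitter(img):
    # Translate the digit by up to 2 pixels in each direction
    dy, dx = rng.integers(-2, 3, size=2)
    return shift(img, (dy, dx), mode='constant', cval=0)

x_aug = np.concatenate([x_train, np.stack([jitter(img) for img in x_train])])
y_aug = np.concatenate([y_train, y_train])  # labels are unchanged by translation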
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.
Perfect for:
* Object detection model training
* Computer vision research
* Retro gaming AI projects
* YOLO algorithm benchmarking
* Educational purposes
| Metric | Value |
|---|---|
| Total Images | 1,004 |
| Dataset Size | 12 MB |
| Image Format | PNG |
| Annotation Format | YOLO (.txt) |
| Classes | 4 |
| Train/Val Split | 711/260 (73%/27%) |
| Class ID | Class Name | Count | Description |
|---|---|---|---|
| 0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.) |
| 1 | duck_dead | 256 | Dead ducks (both black and red variants) |
| 2 | duck_shot | 248 | Ducks in the moment of being shot |
| 3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal) |
yolo_dataset_augmented/
├── images/
│ ├── train/ # 711 training images
│ └── val/ # 260 validation images
├── labels/
│ ├── train/ # 711 YOLO annotation files
│ └── val/ # 260 YOLO annotation files
├── classes.txt # Class names mapping
├── dataset.yaml # YOLO configuration file
└── augmented_dataset_stats.json # Detailed statistics
The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:
augmentation_config = {
    'rotation_range': (-15, 15),      # Small rotations for game sprites
    'brightness_range': (0.7, 1.3),   # Brightness variations
    'contrast_range': (0.8, 1.2),     # Contrast adjustments
    'saturation_range': (0.8, 1.2),   # Color saturation
    'noise_intensity': 0.02,          # Gaussian noise
    'horizontal_flip_prob': 0.5,      # 50% chance of horizontal flip
    'scaling_range': (0.8, 1.2),      # Scale variations
}
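A minimal sketch of how such a config might be applied per image with PIL (noise and scaling omitted for brevity); the apply_augmentation helper, like the augmentation_config name above, is illustrative rather than the exact script that produced the dataset:

import random
from PIL import Image, ImageEnhance

def apply_augmentation(image, cfg):
    # Sample one random variant according to the config above
    image = image.rotate(random.uniform(*cfg['rotation_range']), expand=False)
    image = ImageEnhance.Brightness(image).enhance(random.uniform(*cfg['brightness_range']))
    image = ImageEnhance.Contrast(image).enhance(random.uniform(*cfg['contrast_range']))
    image = ImageEnhance.Color(image).enhance(random.uniform(*cfg['saturation_range']))
    if random.random() < cfg['horizontal_flip_prob']:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
    return image

variant = apply_augmentation(Image.open('images/train/dog_000.png'), augmentation_config)  # hypothetical file

Note that rotations, flips, and scaling also require the corresponding YOLO boxes to be transformed, which is omitted here.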
from ultralytics import YOLO
# Load and train
model = YOLO('yolov8n.pt') # Load pretrained model
results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
# Validate
metrics = model.val()
# Predict
results = model('path/to/test/image.png')
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
class DuckHuntDataset(Dataset):
    def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir,
                                  self.images[idx].replace('.png', '.txt'))
        image = Image.open(img_path)
        # Load YOLO annotations (one "class_id cx cy w h" line per object)
        with open(label_path, 'r') as f:
            labels = f.readlines()
        if self.transform:
            image = self.transform(image)
        return image, labels
# Usage
dataset = DuckHuntDataset('images/train', 'labels/train')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Each .txt file contains one line per object:
class_id center_x center_y width height
Example annotation:
0 0.492 0.403 0.212 0.315
Where values are normalized (0-1) relative to image dimensions.
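A minimal sketch of converting one such line back to pixel coordinates, assuming a 256x240 NES frame:

def yolo_to_pixels(line, img_w, img_h):
    # 'class_id center_x center_y width height', values normalized to [0, 1]
    class_id, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
    x_min = (cx - w / 2) * img_w  # left edge in pixels
    y_min = (cy - h / 2) * img_h  # top edge in pixels
    return int(class_id), x_min, y_min, w * img_w, h * img_h

print(yolo_to_pixels('0 0.492 0.403 0.212 0.315', 256, 240))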
This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The meta-learning method proposed in this paper addresses the issue of small-sample regression in the application of engineering data analysis, which is a highly promising direction for research. By integrating traditional regression models with optimization-based data augmentation from meta-learning, the proposed deep neural network demonstrates excellent performance in optimizing glass fiber reinforced plastic (GFRP) for wrapping concrete short columns. When compared with traditional regression models, such as Support Vector Regression (SVR), Gaussian Process Regression (GPR), and Radial Basis Function Neural Networks (RBFNN), the meta-learning method proposed here performs better in modeling small data samples. The success of this approach illustrates the potential of deep learning in dealing with limited amounts of data, offering new opportunities in the field of material data analysis.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
https://www.youtube.com/watch?v=4MA_6oZQz7s&ab_channel=tektronix475
Spotted caps are the normal, OK class (fully closed). Clean caps are the bad or anomaly target class (partially closed). There is one double prediction at 3:59. 100x100 classification accuracy, out of 200 samples. Inference was run over an unseen test dataset after 150 epochs of training on a 700-sample training dataset, with no data augmentation.
PREPROCESSING
* Auto-Orient: Applied
* Resize: Stretch to 416x416
* Grayscale: Applied
AUGMENTATIONS
* No augmentations were applied.
Anomaly detection with: Roboflow, TensorFlow, Google Colab, Ultralytics, YOLOv5, CVAT.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Retrieval-Augmented Generation (RAG) Dataset 12000
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI, and released under Apache license 2.0.
Dataset Description
Dataset Summary
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach significantly… See the full description on the dataset page: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000.
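A minimal sketch of the retrieve-then-generate pattern the card describes, using TF-IDF retrieval; the corpus, question, and final generation step are placeholders, not the Neural Bridge pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'RAG pairs a retriever with a text generator.',
    'Tiny ImageNet is an image classification dataset.',
]  # placeholder knowledge base

def retrieve(question, k=1):
    # Rank passages by TF-IDF cosine similarity to the question
    vec = TfidfVectorizer().fit(corpus + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

question = 'What does RAG combine?'
context = '\n'.join(retrieve(question))
prompt = f'Answer using the context.\nContext:\n{context}\nQuestion: {question}'
# prompt would then be passed to the LLM of choice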
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Examples of EA selection rules (positive results).
License: GNU LGPL v3 (http://www.gnu.org/licenses/lgpl-3.0.html)
The data consists of MRI images and has four classes of images in both the training and the testing set.
The data contains two folders: one holds the augmented images and the other the originals. The originals could be used as a validation or test dataset.
The data is augmented from an existing dataset; the original images can be seen in the Data Explorer: https://www.kaggle.com/datasets/tourist55/alzheimers-dataset-4-class-of-images
My purpose in publishing this dataset is to encourage the use of augmented images alongside the originals. The importance of augmentation can be a little underrated.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Traditional differential expression gene (DEG) identification models have limitations on small-sample datasets because they require distribution assumptions to be met; otherwise they produce high false positive/negative rates due to sample variation. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution type or sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size relative to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose TabDEG, a model that predicts DEGs and their up-regulation/down-regulation directions from gene expression data obtained from The Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional, small-sample-size datasets, and validate that TabDEG-predicted DEGs map to important Gene Ontology terms and pathways associated with cancer.