Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
nuScenes is a public large-scale dataset for autonomous driving. It enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car.
For my research, I augmented the dataset with synthetic fog. The resulting images can be used to train road, lane, or traffic light detection models for bad weather conditions. Level 5 autonomous vehicles should work well in all weather, and datasets like this help you test whether your trained model performs well in inclement weather conditions.
Note: This dataset uses images from the nuScenes dataset, which is open for non-commercial use only; the same restriction applies to this dataset.
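For illustration, here is a minimal sketch of one common way to synthesize fog on an image, assuming a uniform atmospheric-scattering model; this is not necessarily the method used to build this dataset:

```python
# Sketch of fog synthesis via I_fog = I * t + A * (1 - t),
# where t is a uniform transmission value and A is the airlight.
import numpy as np
import cv2

def add_fog(image_bgr: np.ndarray, transmission: float = 0.6,
            airlight: float = 255.0) -> np.ndarray:
    """Blend the image toward a bright 'airlight' to imitate fog."""
    img = image_bgr.astype(np.float32)
    fogged = img * transmission + airlight * (1.0 - transmission)
    return np.clip(fogged, 0, 255).astype(np.uint8)

img = cv2.imread("nuscenes_sample.jpg")   # hypothetical file name
fogged = add_fog(img, transmission=0.5)   # lower t = denser fog
cv2.imwrite("nuscenes_sample_fog.jpg", fogged)
```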
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LinTO DataSet Audio for Arabic Tunisian Augmented
A collection of Tunisian dialect audio and its annotations for the STT task.
This is the augmented dataset used to train the LinTO Tunisian dialect with code-switching STT model linagora/linto-asr-ar-tn.
Dataset Summary
The LinTO DataSet Audio for Arabic Tunisian Augmented is a dataset that builds on LinTO… See the full description on the dataset page: https://huggingface.co/datasets/linagora/linto-dataset-audio-ar-tn-augmented.
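A minimal loading sketch with the Hugging Face `datasets` library; the split name and feature keys below are assumptions, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

ds = load_dataset("linagora/linto-dataset-audio-ar-tn-augmented",
                  split="train")   # split name assumed
sample = ds[0]
print(sample.keys())               # audio + annotation fields
```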
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traditional differential expression gene (DEG) identification models have limitations on small-sample-size datasets because they require distributional assumptions to be met; otherwise they yield high false-positive/negative rates due to sample variation. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution type or sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose TabDEG, a model that predicts DEGs and their up-regulation/down-regulation directions from gene expression data obtained from The Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional, small-sample-size datasets, and validate that TabDEG-predicted DEGs map to important Gene Ontology terms and pathways associated with cancer.
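As an illustration of the DA idea for tabular expression data, here is a generic noise-based augmentation sketch; TabDEG's actual DA procedure is more involved, and the function below is only a schematic stand-in:

```python
import numpy as np

def augment_tabular(X: np.ndarray, y: np.ndarray, n_copies: int = 5,
                    noise_scale: float = 0.05, seed: int = 0):
    """Create pseudo-samples by jittering each gene proportionally
    to its per-gene standard deviation."""
    rng = np.random.default_rng(seed)
    stds = X.std(axis=0, keepdims=True)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_parts.append(X + rng.normal(0.0, noise_scale, X.shape) * stds)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)
```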
MIT License: https://mit-license.org
Research 1 Training Dataset (OurAugSGD): This dataset is used for training the suicide ideation data augmentation model in Research 1. Construction: to achieve sufficient data augmentation effects, this study employs a combination of zero-shot and few-shot approaches. By incorporating zero-shot and few-shot data in equal parts (4,000 entries in total), a high-quality dataset named OurAugSGD is formed.

Research 1 Test Dataset and Test Results: This dataset is used for evaluating the suicide ideation data augmentation model in Research 1. The test dataset randomly selects 50 positive samples from the original dataset, which undergo the same prompt engineering processing as the training dataset OurAugSGD. These samples are guaranteed not to overlap with the training dataset, ensuring the validity of the test results and the model's performance on unseen data. After running inference on the test dataset with the various models (baseline and experimental) and performing the respective data augmentations, a total of 2,028 generated text results are obtained. These results are manually annotated by 6 groups of raters under a consistent protocol, yielding the final test results.

Research 2 Training Dataset (OurDetSGD): This dataset is used for training the suicide ideation recognition model in Research 2. Construction: first, 2,000 positive samples and 4,000 negative samples are randomly extracted from the original dataset of Research 1. These samples are fused with 2,000 samples generated by the self-developed model OurAugSTM, resulting in 8,000 text entries with a 1:1 positive-negative ratio.

Research 2 Training Dataset (OriginDetSGD): This dataset is used for training the suicide ideation recognition model in Research 2. Construction: 2,000 positive samples and 4,000 negative samples are randomly extracted from the original dataset of Research 1 and fused to form 6,000 text entries with a 1:2 positive-negative ratio.

Research 2 Test Dataset: This dataset is used for evaluating the suicide ideation recognition model in Research 2. The test dataset follows strict non-overlap principles: after excluding samples used in the training dataset OurDetSGD, 1,000 entries (500 positive and 500 negative samples, a 1:1 ratio) are randomly extracted from the original dataset (excluding data used in Research 1). Like OurDetSGD, the test dataset undergoes prompt engineering processing to ensure format consistency with the training dataset, guaranteeing validity and consistency during model evaluation.
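A sketch of the OurDetSGD-style construction described above (random sampling plus fusion with generated samples to reach a 1:1 ratio); file paths and column names are assumptions:

```python
import pandas as pd

# Hypothetical inputs: the Research 1 original pool and the texts
# generated by the self-developed OurAugSTM model.
orig = pd.read_csv("original_dataset.csv")     # assumed path/columns
gen = pd.read_csv("ouraugstm_generated.csv")   # assumed path/columns

pos = orig[orig["label"] == 1].sample(2000, random_state=42)
neg = orig[orig["label"] == 0].sample(4000, random_state=42)

# 2000 pos + 2000 generated (positive) + 4000 neg = 8000 entries, 1:1 ratio.
train = pd.concat([pos, neg, gen.sample(2000, random_state=42)])
train = train.sample(frac=1.0, random_state=42)  # shuffle
train.to_csv("OurDetSGD.csv", index=False)
```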
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comes from Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure - creators of CascadeTabNet.
Depending on the dataset version downloaded, the images include annotations for 'borderless' tables, 'bordered' tables, and 'cells'. Borderless tables are those in which no cell has a border. Bordered tables are those in which every cell has a border and the table itself is bordered. Cells are the individual data points within the table.
A subset of the full dataset, the ICDAR Table Cells Dataset, was extracted and imported to Roboflow to create this hosted version of the Cascade TabNet project. All the additional dataset components used in the full project are available here: All Files.
For the versions below, a preprocessing step of Resize (416x416, Fit within, white edges) was added along with more augmentations to increase the size of the training set and to make the images more uniform. Preprocessing applies to all images, whereas augmentations apply only to training set images.
* Version 3, augmented-FAST-model: 818 raw images of tables. Trained from scratch (no transfer learning) with the "Fast" model from Roboflow Train. 3x augmentation (generated images).
* Version 4, augmented-ACCURATE-model: 818 raw images of tables. Trained from scratch with the "Accurate" model from Roboflow Train. 3x augmentation.
* Version 5, tableBordersOnly-augmented-FAST-model: 818 raw images of tables. 'Cell' class omitted with Modify Classes. Trained from scratch with the "Fast" model from Roboflow Train. 3x augmentation.
* Version 6, tableBordersOnly-augmented-ACCURATE-model: 818 raw images of tables. 'Cell' class omitted with Modify Classes. Trained from scratch with the "Accurate" model from Roboflow Train. 3x augmentation.
Example Image from the Dataset: https://i.imgur.com/ruizSQN.png
Cascade TabNet in Action: https://i.imgur.com/nyn98Ue.png
CascadeTabNet is an automatic table recognition method for interpreting tabular data in document images. We present an improved deep-learning-based end-to-end approach that solves both table detection and structure recognition using a single Convolutional Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) model that detects table regions and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on the ICDAR 2013, ICDAR 2019, and TableBank public datasets. We achieved 3rd rank in the ICDAR 2019 post-competition results for table detection while attaining the best accuracy results on the ICDAR 2013 and TableBank datasets. We also attain the highest accuracy on the ICDAR 2019 table structure recognition dataset.
If you find this work useful for your research, please cite our paper:

@misc{cascadetabnet2020,
  title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents},
  author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure},
  year={2020},
  eprint={2004.12629},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains the reference data used for direct force training of Artificial Neural Network (ANN) interatomic potentials using the atomic energy network (ænet) and ænet-PyTorch packages (https://github.com/atomisticnet/aenet-PyTorch). It also includes the GPR-augmented data used for indirect force training via Gaussian Process Regression (GPR) surrogate models using the ænet-GPR package (https://github.com/atomisticnet/aenet-gpr). Each data file contains atomic structures, energies, and atomic forces in XCrySDen Structure Format (XSF). The dataset includes all reference training/test data and corresponding GPR-augmented data used in the four benchmark examples presented in the reference paper, "Scalable Training of Neural Network Potentials for Complex Interfaces Through Data Augmentation". A hierarchy of the dataset is described in the README.txt file, and an overview of the dataset is also summarized in supplementary Table S1 of the reference paper.
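For orientation, here is a hedged sketch of reading one aenet-style XSF file, assuming a comment line carrying the total energy followed by an ATOMS block of "symbol x y z fx fy fz" rows; the exact layout may differ, so consult the dataset's README.txt:

```python
def read_xsf(path):
    """Return (energy, atoms) with atoms as (symbol, position, force)."""
    energy, atoms, in_atoms = None, [], False
    with open(path) as fh:
        for line in fh:
            if "total energy" in line and "=" in line:
                energy = float(line.split("=")[1].split()[0])
            elif line.strip() == "ATOMS":
                in_atoms = True
            elif in_atoms:
                parts = line.split()
                if len(parts) == 7:  # symbol + xyz + force components
                    atoms.append((parts[0],
                                  tuple(map(float, parts[1:4])),
                                  tuple(map(float, parts[4:7]))))
                else:
                    in_atoms = False
    return energy, atoms

energy, atoms = read_xsf("structure0001.xsf")  # hypothetical file name
```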
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
https://i.imgur.com/7spoIJT.png
One could use this dataset, for example, to build a classifier that distinguishes workers abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.
Use the Fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce their code by 50% when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HH-TRP Dataset
HH-TRP Dataset is an open audio collection of drum recordings in the style of modern hip hop (urban trap) music. It features 15,000 audio loops provided in uncompressed stereo WAV format, along with paired JSON files containing label data for supervised training of generative AI audio models.
Overview
The dataset was developed using an algorithmic framework to randomly generate audio loops from a customized database of MIDI patterns and one-shot drum samples. Data augmentation included random sample-swapping to generate unique drum kits and sound effects. It is intended for training or fine-tuning AI models with paired labels, adaptable for prompt-driven drum generation and other supervised learning objectives.
Its primary purpose is to provide accessible content for machine learning applications in music. Potential use cases include text-to-audio, prompt engineering, feature extraction, tempo detection, audio classification, rhythm analysis, music information retrieval (MIR), sound design and signal processing.
Specifications
A key map JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.
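A small sketch of how the key map might be used to convert MIDI note numbers to text labels; the file name and the "note number to label" structure are assumptions:

```python
import json

with open("key_map.json") as fh:   # assumed file name
    key_map = json.load(fh)        # e.g. {"36": "kick", "38": "snare"}

def label_for(note_number: int) -> str:
    """Look up the text label for a MIDI note number."""
    return key_map.get(str(note_number), "unknown")

print(label_for(36))
```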
License
This dataset was compiled by WaivOps, a crowdsourced music project managed by Patchbanks. All recordings have been sourced from verified composers and providers for copyright clearance.
The HH-TRP Dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Additional Info
Time signature data has been added to the standard JSON file format.
For audio examples or more information about this dataset, please refer to the GitHub repository.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques apply various transformations to existing data samples to create new ones. These transformations include random rotations, translations, scaling, flips, and more. Augmentation helps increase the dataset size, introduces natural variations, and improves model performance by making the model more invariant to specific transformations. The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name and date of birth. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data, which is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
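For reference, the transformations listed above can be reproduced generically, for example with torchvision; the parameter values here are illustrative, not the settings used to build this dataset:

```python
from torchvision import transforms

# Random rotation, translation, scaling, and horizontal flips.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05),
                            scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
])
# Usage: augmented = augment(pil_image)
```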
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://www.youtube.com/watch?v=4MA_6oZQz7s&ab_channel=tektronix475
Spotted caps are the normal OK class (fully closed). Clean caps are the bad or anomaly target class (partially closed). One double prediction at 3:59. 100x100 classification accuracy out of 200 samples. Inference over an unseen test dataset. 150 epochs of training. 700-sample training dataset, no data augmentation.
Preprocessing: Auto-Orient: Applied; Resize: Stretch to 416x416; Grayscale: Applied.
Augmentations: None applied.
Anomaly detection with Roboflow, TensorFlow, Google Colab, Ultralytics YOLOv5, and CVAT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.
For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.
Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.
Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We implement the calculation of cosine similarity using the sklearn package [45].
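A minimal sketch of that computation with scikit-learn's cosine_similarity; the matrices below are placeholder feature vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0]])
Y = np.array([[1.0, 0.0, 2.0]])

sim = cosine_similarity(X, Y)  # shape (2, 1): one score per row of X
print(sim)
```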
Description:
SpaceNet is a comprehensive astronomical dataset obtained via a novel double-stage augmentation framework called FLARE. It is a hierarchically structured, high-quality astronomical image dataset, meticulously designed for both fine-grained and macro classification tasks. Comprising approximately 12,900 samples, SpaceNet incorporates lower-resolution (LR) to higher-resolution (HR) conversion with standard augmentations and a diffusion approach for synthetic sample generation. This comprehensive dataset enables superior generalization on various recognition tasks, including classification.
Key Features
High-Resolution Images: The dataset includes high-quality images that facilitate accurate analysis and classification.
Hierarchical Structure: The dataset is hierarchically organized to support both macro and fine-grained classification tasks.
Advanced Augmentation Techniques: Utilizes FLARE framework for double-stage augmentation, enhancing the dataset’s diversity and robustness.
Synthetic Sample Generation: Employs a diffusion approach to create synthetic samples, boosting the dataset’s size and variability.
Usage
SpaceNet is ideal for:
Training and Evaluation: Developing and testing machine learning models for fine-grained and macro astronomical classification tasks.
Research: Exploring hierarchical classification approaches within the astronomy domain.
Model Development: Creating robust models capable of generalizing across both in-domain and out-of-domain datasets.
Educational Purposes: Providing a rich dataset for educational projects in astronomy and machine learning.
This dataset is sourced from Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The infestation of pests affecting Mango cultivation in Indonesia has an economic impact on the region. Following recent developments in the field of machine learning, the application of deep-learning models for multi-class pest classification requires a large collection of image samples on which the algorithms can be trained. Addressing this requirement, the paper presents a detailed outline of the dataset collected from Mango farms in Indonesia. The data consists of images captured from Mango farms affected by 15 categories of pests, identifiable through the structural and visual deformities exhibited in the Mango leaves. The data was collected using low-cost sensing equipment of the kind commonly used by farmers to capture images from the farm.

The collected data is subjected to two processes: data augmentation and training of the classification model. The dataset consists of 510 images covering the 15 categories of pests that affect Mango leaves along with the original appearance of the Mango leaves (resulting in 16 classes), collected over a period of 6 months. For the purpose of training the deep-learning neural network, the images are subjected to data augmentation to expand the dataset and to closely emulate the large-scale data collection process carried out by farmers. The augmentation process results in a total of 62,047 image samples, which are used to train the network. The training framework presented in the paper builds on the VGG-16 feature extractor and replaces the last three layers of the network with fully connected neural network layers, resulting in 16 output classes.

The dataset includes annotations for both the original images captured in the field and the augmented image samples. Both the original and augmented data are split into training, validation, and testing sets. The overall dataset is divided into three parts: version 0, version 1, and version 2. Version 0 consists of the original dataset, with 310 images for training, 103 images for validation, and 97 images for testing. Version 1 includes 46,500 image samples for training (after applying the data augmentation process), with the 103 original images used for validation and 97 images for testing. Finally, version 2 uses 47,500 images for training, 15,450 images for validation, and 97 images for testing. All three versions provide images in JPEG format.

The visual appearance of the pests captured in the dataset provides an ideal testbed for benchmarking the performance of various deep-learning networks trained to detect specific categories of pests. In addition, the dataset provides an opportunity to evaluate the impact of data augmentation techniques on the original dataset.
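A hedged sketch of the training setup described above: a frozen VGG-16 feature extractor whose classifier head is replaced by fully connected layers ending in 16 output classes (the layer widths here are assumptions, not the paper's exact architecture):

```python
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")
for p in model.features.parameters():
    p.requires_grad = False                  # freeze the feature extractor

model.classifier = nn.Sequential(            # replaced head (widths assumed)
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 16),                     # 15 pest classes + healthy leaves
)
```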
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This extensive dataset presents a meticulously curated collection of low-resolution images showcasing 20 well-established rice varieties native to diverse regions of Bangladesh. The rice samples were carefully gathered from both rural areas and local marketplaces, ensuring a comprehensive and varied representation. Serving as a visual compendium, the dataset provides a thorough exploration of the distinct characteristics of these rice varieties, facilitating precise classification.
The dataset encompasses 20 distinct classes, encompassing Subol Lota, Bashmoti (Deshi), Ganjiya, Shampakatari, Sugandhi Katarivog, BR-28, BR-29, Paijam, Bashful, Lal Aush, BR-Jirashail, Gutisharna, Birui, Najirshail, Pahari Birui, Polao (Katari), Polao (Chinigura), Amon, Shorna-5, and Lal Binni. In total, the dataset comprises 4,730 original JPG images and 23,650 augmented images.
These images were captured using an iPhone 11 camera with a 5x zoom feature. Each image capturing these rice varieties was diligently taken between October 18 and November 29, 2023. To facilitate efficient data management and organization, the dataset is structured into two variants: Original images and Augmented images. Each variant is systematically categorized into 20 distinct sub-directories, each corresponding to a specific rice variety.
The primary image set comprises 4,730 JPG images, uniformly sized at 853 × 853 pixels. Due to the initial low resolution, the file size was notably 268 MB. Employing compression through a zip program significantly optimized the dataset, resulting in a final size of 254 MB.
To address the substantial image volume required by deep learning models for machine vision, data augmentation techniques were implemented, yielding a total of 23,650 images. These augmented images, also in JPG format and uniformly sized at 512 × 512 pixels, initially amounted to 781 MB; post-compression, the dataset was streamlined to 699 MB.
The raw and augmented datasets are stored in two distinct zip files, namely 'Original.zip' and 'Augmented.zip'. Both zip files contain 20 sub-folders representing a unique rice variety, namely 1_Subol_Lota, 2_Bashmoti, 3_Ganjiya, 4_Shampakatari, 5_Katarivog, 6_BR28, 7_BR29, 8_Paijam, 9_Bashful, 10_Lal_Aush, 11_Jirashail, 12_Gutisharna, 13_Red_Cargo,14_Najirshail, 15_Katari_Polao, 16_Lal_Biroi, 17_Chinigura_Polao, 18_Amon, 19_Shorna5, 20_Lal_Binni.
To ease the experimentation process for researchers, we have balanced the data and split it in an 80:20 train-test ratio. The 'Train_n_Test.zip' folder contains two sub-directories: '1_TEST', which contains 1,125 images per class, and '2_VALID', which contains 225 images per class.
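For researchers who want to re-derive a per-class 80:20 split from the raw folders, a sketch is below; the directory layout is assumed from the description above:

```python
from pathlib import Path
import shutil
from sklearn.model_selection import train_test_split

src = Path("Augmented")                      # assumed: one folder per class
for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    files = sorted(class_dir.glob("*.jpg"))
    train, test = train_test_split(files, test_size=0.2, random_state=0)
    for split_name, items in (("train", train), ("test", test)):
        out = Path(split_name) / class_dir.name
        out.mkdir(parents=True, exist_ok=True)
        for f in items:
            shutil.copy(f, out / f.name)
```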
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Early detection of Diabetic Retinopathy (DR) is a key challenge in preventing potential vision loss. DR detection often requires special expertise from ophthalmologists, which may not be available in remote parts of the world; in an attempt to automate DR detection, machine learning and deep learning techniques can be adopted. Several recent papers have demonstrated such success on various publicly available datasets.
Another challenge for deep learning techniques is the availability of properly processed, standardized data. Cleaning and preprocessing the data often takes much longer than model training. As part of my research work, I had to preprocess the images from APTOS and Messidor before training the model. I applied circle-crop and Ben Graham's preprocessing technique and scaled all images to 512x512. I also applied data augmentation, increasing the number of samples from 3,662 (APTOS) to 18,310 and from 400 (Messidor) to 3,600. I divided the images into two classes: class 0 (No DR) and class 1 (DR). A large amount of data is essential for transfer learning. This process is very cumbersome and time-consuming, so I decided to upload the newly generated dataset to Kaggle in the hope that others might find it useful for their work. Feel free to use the data.
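For reference, here is a sketch of the described preprocessing (circle-crop followed by Ben Graham's color normalization, resized to 512x512), using the commonly cited parameter recipe; it may differ from the exact settings used here:

```python
import cv2
import numpy as np

def circle_crop(img):
    """Mask everything outside the central circular fundus region."""
    h, w = img.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    cv2.circle(mask, (w // 2, h // 2), min(h, w) // 2, 255, -1)
    return cv2.bitwise_and(img, img, mask=mask)

def ben_graham(img, sigma=10):
    """Subtract the local average colour: 4*I - 4*blur(I) + 128."""
    blur = cv2.GaussianBlur(img, (0, 0), sigma)
    return cv2.addWeighted(img, 4, blur, -4, 128)

img = cv2.imread("fundus.jpg")               # hypothetical file name
out = cv2.resize(ben_graham(circle_crop(img)), (512, 512))
cv2.imwrite("fundus_processed.jpg", out)
```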
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
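A hedged sketch of how such a synthetic sample could be composed with Pillow; the font, background image, and text below are placeholders, not the actual generator:

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Paste randomly placed, randomly colored text onto a 512x512 background.
canvas = Image.open("background.jpg").convert("RGB").resize((512, 512))
draw = ImageDraw.Draw(canvas)
font = ImageFont.truetype("DejaVuSans.ttf", size=28)  # placeholder font
text = "And the raven, never flitting"                # placeholder string
xy = (random.randint(0, 300), random.randint(0, 480))
color = tuple(random.randint(0, 255) for _ in range(3))
draw.text(xy, text, font=font, fill=color)
canvas.save("sample.png")
```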
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.
Use the Fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce their code by 50% when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Examples of EA selection rules (positive results).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This folder contains two datasets:
1. generated_instruction_dataset
This dataset is generated from Task 2: Self-Augmentation. We randomly sampled 150 single-turn examples from the LIMA dataset. Then, using the backward model fine-tuned in Task 1, we generated instructions based on the original responses. Each data point is a pair of (generated_instruction, original_response), where the instruction is generated by the backward model and the response is taken directly from LIMA. 2.… See the full description on the dataset page: https://huggingface.co/datasets/LIxy839/High_quality_datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Retrieval-Augmented Generation (RAG) Dataset 12000
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI, and released under the Apache License 2.0.
Dataset Description
Dataset Summary
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach significantly… See the full description on the dataset page: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000.
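A minimal loading sketch with the Hugging Face `datasets` library; the field names are assumptions based on typical RAG datasets, so check the dataset card:

```python
from datasets import load_dataset

rag = load_dataset("neural-bridge/rag-dataset-12000", split="train")
print(rag[0])  # inspect the actual field names (e.g. context/question/answer)
```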