Aegis AI Content Safety Dataset is an open-source content safety dataset (CC-BY-4.0), which adheres to Nvidia's content safety taxonomy, covering 13 critical risk categories (see Dataset Description).
Dataset Details Dataset Description The Aegis AI Content Safety Dataset is comprised of approximately 11,000 manually annotated interactions between humans and LLMs, split into 10,798 training samples and 1,199 test samples.
To curate the dataset, we use the Hugging Face version of human preference data about harmlessness from Anthropic HH-RLHF. We extract only the prompts, and elicit responses from Mistral-7B-v0.1. Mistral excels at instruction following and generates high quality responses for the content moderation categories. We use examples in the system prompt to ensure diversity by instructing Mistral to not generate similar responses. Our data comprises four different formats: user prompt only, system prompt with user prompt, single turn user prompt with Mistral response, and multi-turn user prompt with Mistral responses.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SEMFIRE Datasets (Forest environment dataset)
These datasets are used for semantic segmentation and data augmentation and contain various forestry scenes. They were collected as part of the research work conducted by the Institute of Systems and Robotics, University of Coimbra team within the scope of the Safety, Exploration and Maintenance of Forests with Ecological Robotics (SEMFIRE, ref. CENTRO-01-0247-FEDER-032691) research project coordinated by Ingeniarius Ltd.
The semantic segmentation algorithms attempt to identify various semantic classes (e.g. background, live flammable materials, trunks, canopies etc.) in the images of the datasets.
The datasets include diverse image types, e.g. original camera images and their labeled images. In total the SEMFIRE datasets include about 1700 image pairs. Each dataset includes corresponding .bag files.
To launch those .bag files on your ROS environment, use the instructions on the following Github repository
Description of each dataset:
Each dataset consists of following directories:
Each images directory consists of following directories:
Each rosbags directory contains .bag files with the following topics:
All datasets include a detailed description as a text file. In addition, they include a rosbag_info.txt file with a description for each ROS inside the .bag files as well as a description for each ROS topic.
The following table shows the statistical description of typical portuguese woodland configurations with structured plantations of Pinus pinaster (Pp, pine trees) and Eucalyptus globulus (Eg, eucalyptus).
"Low density" structured plantation | "High density" structured plantation | |
Tree density (assuming plantation in rows spaced 3m apart in all cases) |
Eg: 900 trees/ha Pp: 450 trees/ha |
Eg: 1400 trees/ha Pp: 1250 trees/ha |
Average heights and corresponding ages of plantation trees |
Eg: 12m (6 years old) Pp: 10m (15 years old) |
Eg: 12m (6 years old) Pp: 10m (15 years old) |
Maximum heights and corresponding fully-matured ages of plantation trees |
Eg: 20m (11 years old) Pp: 30m (40 years old) |
Eg: 20m (11 years old) Pp: 30m (40 years old) |
Diameter at chest level (DCL – 1,3m) of plantation trees (average/maximum) |
Eg: 15cm/25cm Pp: 20cm/50cm |
Eg: 15cm/25cm Pp: 20cm/50cm |
Natural density of herbaceous plants |
30% of woodland area |
30% of woodland area |
Natural density of bush and shrubbery |
30% of woodland area |
30% of woodland area |
Natural density of arboreal plants (not part of plantation) |
5% of woodland area |
5% of woodland area |
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "alpaca-gpt4"
This dataset contains English Instruction-Following generated by GPT-4 using Alpaca prompts for fine-tuning LLMs. The dataset was originaly shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wraper for compatibility with huggingface's datasets library.
Dataset structure
It contains 52K instruction-following data generated by GPT-4 using the same prompts as in Alpaca. The dataset has… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Aegis AI Content Safety Dataset is an open-source content safety dataset (CC-BY-4.0), which adheres to Nvidia's content safety taxonomy, covering 13 critical risk categories (see Dataset Description).
Dataset Details Dataset Description The Aegis AI Content Safety Dataset is comprised of approximately 11,000 manually annotated interactions between humans and LLMs, split into 10,798 training samples and 1,199 test samples.
To curate the dataset, we use the Hugging Face version of human preference data about harmlessness from Anthropic HH-RLHF. We extract only the prompts, and elicit responses from Mistral-7B-v0.1. Mistral excels at instruction following and generates high quality responses for the content moderation categories. We use examples in the system prompt to ensure diversity by instructing Mistral to not generate similar responses. Our data comprises four different formats: user prompt only, system prompt with user prompt, single turn user prompt with Mistral response, and multi-turn user prompt with Mistral responses.