3 datasets found
  1. Nvidia's Aegis-AI-Content-Safety-Dataset-1.0 Dataset

    • paperswithcode.com
    Updated Apr 8, 2024
    Cite
    (2024). Nvidia's Aegis-AI-Content-Safety-Dataset-1.0 Dataset [Dataset]. https://paperswithcode.com/dataset/https-huggingface-co-datasets-nvidia-aegis-ai
    Description

    The Aegis AI Content Safety Dataset is an open-source content safety dataset (CC-BY-4.0) that adheres to Nvidia's content safety taxonomy, covering 13 critical risk categories (see Dataset Description).

    Dataset Details

    Dataset Description

    The Aegis AI Content Safety Dataset comprises approximately 11,000 manually annotated interactions between humans and LLMs, split into 10,798 training samples and 1,199 test samples.

    To curate the dataset, we use the Hugging Face version of the human preference data about harmlessness from Anthropic HH-RLHF. We extract only the prompts and elicit responses from Mistral-7B-v0.1. Mistral excels at instruction following and generates high-quality responses for the content moderation categories. We use examples in the system prompt to ensure diversity, instructing Mistral not to generate responses similar to one another. Our data comprises four different formats: user prompt only, system prompt with user prompt, single-turn user prompt with Mistral response, and multi-turn user prompt with Mistral responses.
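The prompt-extraction step described above can be sketched as follows. This is a minimal illustration assuming the usual "\n\nHuman: …\n\nAssistant: …" transcript format of HH-RLHF; `extract_first_prompt` is a hypothetical helper, not part of the dataset's actual tooling:

```python
def extract_first_prompt(transcript: str) -> str:
    """Pull the first human turn out of an HH-RLHF-style transcript.

    HH-RLHF entries are dialogue strings of the form
    "\\n\\nHuman: ...\\n\\nAssistant: ..."; a curation step that keeps
    only the prompts would discard everything after the first turn.
    """
    # Take the text between the first "Human:" marker and the
    # following "Assistant:" marker.
    _, _, rest = transcript.partition("Human:")
    prompt, _, _ = rest.partition("Assistant:")
    return prompt.strip()

sample = "\n\nHuman: How do I store fresh basil?\n\nAssistant: Keep the stems in water."
print(extract_first_prompt(sample))  # How do I store fresh basil?
```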

  2. Data from: SEMFIRE forest dataset for semantic segmentation and data augmentation

    • zenodo.org
    Updated Jan 20, 2022
    Cite
    Dominik Bittner; Maria Eduarda Andrada; David Portugal; João Filipe Ferreira (2022). SEMFIRE forest dataset for semantic segmentation and data augmentation [Dataset]. http://doi.org/10.5281/zenodo.5819064
    Available download formats: zip, application/gzip, bin, txt
    Dataset provided by Zenodo (http://zenodo.org/)
    Authors
    Dominik Bittner; Maria Eduarda Andrada; David Portugal; João Filipe Ferreira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SEMFIRE Datasets (Forest environment dataset)

    These datasets are used for semantic segmentation and data augmentation and contain various forestry scenes. They were collected as part of the research work conducted by the Institute of Systems and Robotics, University of Coimbra team within the scope of the Safety, Exploration and Maintenance of Forests with Ecological Robotics (SEMFIRE, ref. CENTRO-01-0247-FEDER-032691) research project coordinated by Ingeniarius Ltd.

    The semantic segmentation algorithms attempt to identify various semantic classes (e.g., background, live flammable materials, trunks, canopies) in the dataset images.

    The datasets include several image types, e.g. original camera images and their labelled counterparts. In total, the SEMFIRE datasets include about 1,700 image pairs. Each dataset also includes corresponding .bag files.

    To play back these .bag files in a ROS environment, follow the instructions in the accompanying GitHub repository.
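A typical playback session might look like the following (a sketch assuming a ROS 1 installation with the standard `rosbag` command-line tool; the bag file name is a placeholder):

```shell
# Inspect the topics, message counts, and duration of a bag file.
rosbag info 2020_sete_fontes_forest.bag

# Start a ROS master in one terminal...
roscore

# ...then replay the recorded topics in another.
rosbag play 2020_sete_fontes_forest.bag
```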

    Description of each dataset:

    1. 2019_2020_quinta_do_bolao_coimbra: Robot moving on a path through a forest environment
    2. 2020_ctcv_parking_lot_coimbra: Robot moving in a circle in a parking lot for testing
    3. 2020_sete_fontes_forest: A set of forest images acquired with a hand-held apparatus

    Each dataset consists of the following directories:

    1. images directory: diverse image types, e.g. original camera images and their labeled images
    2. rosbags directory: .bag files, which correspond to the image directory

    Each images directory consists of the following directories:

    • img: original camera images
    • lbl: single channel images (ground truth) with corresponding labels for each image in img
    • lbl_colored: camera images in lbl colorized according to different semantic classes (for more details see the datasets descriptions)
    • lbl_overlaid: camera images in img overlaid with corresponding labels (colored)
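Given this layout, matching camera images with their ground-truth labels reduces to pairing files by name across the img and lbl directories. A minimal sketch (the function name and the synthetic directory tree are illustrative, not part of the dataset tooling):

```python
import tempfile
from pathlib import Path

def pair_images_and_labels(images_dir: Path):
    """Pair each camera image in img/ with its ground-truth label in lbl/.

    The SEMFIRE images directories keep matching filenames in img/ and
    lbl/, so pairing by filename is sufficient.
    """
    pairs = []
    for img in sorted((images_dir / "img").iterdir()):
        lbl = images_dir / "lbl" / img.name
        if lbl.exists():  # skip frames without a ground-truth label
            pairs.append((img, lbl))
    return pairs

# Tiny demo on a synthetic directory tree mimicking the img/ and lbl/ layout.
root = Path(tempfile.mkdtemp())
for sub in ("img", "lbl"):
    (root / sub).mkdir()
for name in ("0001.png", "0002.png"):
    (root / "img" / name).touch()
(root / "lbl" / "0001.png").touch()  # only the first frame is labelled

pairs = pair_images_and_labels(root)
print([img.name for img, lbl in pairs])  # ['0001.png']
```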

    Each rosbags directory contains .bag files with the following topics:

    • 2019_2020_quinta_do_bolao_coimbra_rosbags:
      • /back_lslidar_packet
      • /dalsa_camera_720p/compressed
      • /flir_ax8/compressed
      • /front_lslidar_packet
      • /gps_fix
      • /gps_time
      • /gps_vel
      • /imu/data
      • /realsense/aligned_depth_to_color/image_raw
      • /realsense/color/camera_info
      • /realsense/color/image_raw/compressed
      • /realsense/depth/camera_info
      • /realsense/depth/image_rect_raw/compressed
      • /realsense/extrinsics/depth_to_color
    • 2020_ctcv_parking_lot_coimbra_rosbags:
      • /dalsa_camera_720p/compressed
      • /gps_fix
      • /gps_time
      • /fused_point_cloud
      • /imu/data
      • /imu/mag
      • /imu/rpy
    • 2020_sete_fontes_forest_rosbags:
      • /realsense/camera_info
      • /realsense/depth_compressed/compressedDepth
      • /realsense/nir/left/compressed
      • /realsense/nir/right/compressed
      • /realsense/rgb/compressed

    All datasets include a detailed description as a text file. In addition, they include a rosbag_info.txt file describing each .bag file and each ROS topic it contains.

    The following table gives a statistical description of typical Portuguese woodland configurations with structured plantations of Pinus pinaster (Pp, pine trees) and Eucalyptus globulus (Eg, eucalyptus).

    | Characteristic | "Low density" structured plantation | "High density" structured plantation |
    |---|---|---|
    | Tree density (assuming plantation in rows spaced 3 m apart in all cases) | Eg: 900 trees/ha; Pp: 450 trees/ha | Eg: 1400 trees/ha; Pp: 1250 trees/ha |
    | Average heights and corresponding ages of plantation trees | Eg: 12 m (6 years old); Pp: 10 m (15 years old) | Eg: 12 m (6 years old); Pp: 10 m (15 years old) |
    | Maximum heights and corresponding fully-matured ages of plantation trees | Eg: 20 m (11 years old); Pp: 30 m (40 years old) | Eg: 20 m (11 years old); Pp: 30 m (40 years old) |
    | Diameter at chest level (DCL, 1.3 m) of plantation trees (average/maximum) | Eg: 15 cm / 25 cm; Pp: 20 cm / 50 cm | Eg: 15 cm / 25 cm; Pp: 20 cm / 50 cm |
    | Natural density of herbaceous plants | 30% of woodland area | 30% of woodland area |
    | Natural density of bush and shrubbery | 30% of woodland area | 30% of woodland area |
    | Natural density of arboreal plants (not part of plantation) | 5% of woodland area | 5% of woodland area |

  3. alpaca-gpt4

    • huggingface.co
    • opendatalab.com
    Updated Apr 14, 2023
    Cite
    alpaca-gpt4 [Dataset]. https://huggingface.co/datasets/vicgalle/alpaca-gpt4
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Victor Gallego
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4"

    This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts, intended for fine-tuning LLMs. The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wrapper for compatibility with Hugging Face's datasets library.

    Dataset structure

    It contains 52K instruction-following examples generated by GPT-4 using the same prompts as in Alpaca. The dataset has… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.
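Records in the original Alpaca release use instruction/input/output fields; assuming this dataset follows the same schema, a common way to render a record into a fine-tuning prompt is the standard Alpaca template. A sketch (`format_alpaca_prompt` is an illustrative helper, not part of the dataset):

```python
def format_alpaca_prompt(example: dict) -> str:
    """Render one record into the standard Alpaca prompt template.

    Field names (instruction, input, output) follow the original
    Alpaca release; records with an empty "input" use the shorter
    template variant.
    """
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:\n"
    )

record = {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, yellow."}
prompt = format_alpaca_prompt(record)
print(prompt)
```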
