We collect a new dataset of human-posed free-form natural language questions about CLEVR images. Many of these questions have out-of-vocabulary words and require reasoning skills that are absent from our model’s repertoire.
Relative Human (RH) contains multi-person in-the-wild RGB images with rich human annotations, including:
- Depth layers: relative depth relationship/ordering between all people in the image.
- Age group classification: adults, teenagers, kids, babies.
- Others: gender, bounding box, 2D pose.
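For illustration only, here is a minimal sketch of how a per-person record with these annotation types might be represented in code; the field names and encodings below are assumptions, not the actual Relative Human annotation schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PersonAnnotation:
    """Hypothetical per-person record mirroring the annotation types listed above."""
    bbox: Tuple[float, float, float, float]    # bounding box (x, y, width, height) in pixels
    pose_2d: List[Tuple[float, float, float]]  # 2D joints as (x, y, visibility)
    depth_layer: int                           # relative depth ordering among people in the image
    age_group: str                             # "adult", "teenager", "kid", or "baby"
    gender: str                                # annotated gender label

def order_by_depth(people: List[PersonAnnotation]) -> List[PersonAnnotation]:
    """Sort people from nearest to farthest using the relative depth layer."""
    return sorted(people, key=lambda p: p.depth_layer)
```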
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The data was collected and shared by Toni Aaltonen. The original data can be viewed on GitHub (https://github.com/tonaalt/sonar_human_dataset) and is shared under the MIT License.
MIT License
Copyright (c) 2023 Toni Aaltonen
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We propose the Safe Human dataset, consisting of 17 different objects, referred to as the SH17 dataset. We scraped images from the Pexels website, which offers clear usage rights for all its images (https://www.pexels.com/license/), showcasing a range of human activities across diverse industrial operations.
To extract relevant images, we used multiple queries such as manufacturing worker, industrial worker, human worker, labor, etc. The tags associated with Pexels images proved reasonably accurate. After removing duplicate samples, we obtained a dataset of 8,099 images. The dataset exhibits significant diversity, representing manufacturing environments globally, thus minimizing potential regional or racial biases. Samples of the dataset are shown below.
The data consists of three folders and two split files:
- images contains all images
- labels contains labels in YOLO format for all images
- voc_labels contains labels in VOC format for all images
- train_files.txt contains the list of all images we used for training
- val_files.txt contains the list of all images we used for validation
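As a rough sketch (not an official loader), the split files and YOLO-format labels could be read as follows; the pairing of image names to label files by shared filename stem is an assumption:

```python
import os

def load_yolo_labels(label_path):
    """Parse a YOLO-format label file: one 'class_id cx cy w h' line per box, coordinates normalized to [0, 1]."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 5:
                cls = int(parts[0])
                cx, cy, w, h = map(float, parts[1:])
                boxes.append((cls, cx, cy, w, h))
    return boxes

def load_split(root, split_file="train_files.txt"):
    """Pair each image listed in a split file with its labels from the labels folder."""
    samples = []
    with open(os.path.join(root, split_file)) as f:
        for name in (line.strip() for line in f):
            if not name:
                continue
            stem = os.path.splitext(os.path.basename(name))[0]
            image_path = os.path.join(root, "images", os.path.basename(name))
            label_path = os.path.join(root, "labels", stem + ".txt")
            labels = load_yolo_labels(label_path) if os.path.exists(label_path) else []
            samples.append((image_path, labels))
    return samples
```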
This dataset, scraped from the Pexels website, is intended for educational, research, and analysis purposes only. The data may be used for training machine learning models only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Legal Simplicity: All photos and videos on Pexels can be downloaded and used for free.
The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.
Users are encouraged to consider the ethical implications of their analyses and the potential impact on the broader community.
https://github.com/ahmadmughees/SH17dataset
@misc{ahmad2024sh17datasethumansafety,
title={SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry},
author={Hafiz Mughees Ahmad and Afshin Rahimi},
year={2024},
eprint={2407.04590},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.04590},
}
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Evaluation dataset collected for, and utilised in, Trotter et al., Human Body Shape Classification Based on a Single Image (2023).
The dataset contains the following:
This data is licensed under CC BY-NC-SA 4.0 and thus may not be used for commercial purposes. If you wish to discuss utilising this dataset for commercial purposes, please contact the authors.
The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.
To ensure diversity across different sources, the real images of the dataset are randomly sampled from source datasets containing numerous categories, whereas synthetic images are generated within the same categories as the real images. Captions and image masks from the COCO dataset are utilized to generate images for text2image and inpainting generators, while normally distributed noise with different random seeds is used for noise2image generators. The dataset is further processed to reflect real-world scenarios by applying random cropping, downscaling, and JPEG compression, in accordance with the IEEE VIP Cup 2022 standards.
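To make the effect of that post-processing concrete, here is a hedged sketch of random cropping, downscaling, and JPEG re-encoding with Pillow; the crop range, target size, and quality range are placeholder values, not the exact parameters used to build ArtiFact.

```python
import io
import random
from PIL import Image

def degrade(img: Image.Image) -> Image.Image:
    """Apply a random crop, downscale, and JPEG compression (illustrative parameters only)."""
    w, h = img.size

    # Random crop to a random fraction of the original frame
    scale = random.uniform(0.7, 1.0)
    cw, ch = int(w * scale), int(h * scale)
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))

    # Downscale to the dataset's stated 200x200 resolution
    img = img.resize((200, 200), Image.BILINEAR)

    # Re-encode as JPEG with a random quality to simulate compression artifacts
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(65, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```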
The ArtiFact dataset is intended to serve as a benchmark for evaluating the performance of synthetic image detectors under real-world conditions. It includes a broad spectrum of diversity in terms of generators used and syntheticity, providing a challenging dataset for image detection tasks.
Total number of images: 2,496,738
Number of real images: 964,989
Number of fake images: 1,531,749
Number of generators used for fake images: 25 (including 13 GANs, 7 Diffusion, and 5 miscellaneous generators)
Number of sources used for real images: 8
Categories included in the dataset: Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and other real-life objects
Image resolution: 200 x 200
This data was used for a review and meta-analysis on personal space in human-robot interaction by the contributors of this data. Some of the information had to be obtained through email contact or estimated from the figures and text of the source article.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Surveillance and Security: The "Human" model can be used in CCTV and surveillance cameras to detect human activity in specified areas. It will be particularly useful in detecting any suspicious activities, identifying individual people during an incident, and even in crowd control during public events.
Social Distance Monitoring: Amidst the COVID-19 pandemic, this model could help authorities monitor public spaces for adherence to social distancing guidelines. By identifying people and their relative positions, it could provide real-time updates about social distancing compliance.
Smart Homes: In smart homes and IoT environments, the model can be used for advanced automation by recognising the residents. It can allow for personalised experiences by identifying individual members, their facial expressions, and their hand gestures.
Gesture Recognition: The model can be used in applications like gaming or virtual reality where it can interpret human hand and body gestures. This would allow for a more interactive and immersive experience.
Access Control System: The model can be employed in biometric systems for face recognition, enhancing security in office buildings, residential areas or on personal devices. The model can also be adjusted to raise an alarm if an unfamiliar face is detected.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32
This repository contains the WiFi CSI human presence detection and activity recognition datasets proposed in [1].
Datasets
DP_LOS - Line-of-sight (LOS) presence detection dataset, comprised of 392 CSI amplitude spectrograms.
DP_NLOS - Non-line-of-sight (NLOS) presence detection dataset, comprised of 384 CSI amplitude spectrograms.
DA_LOS - LOS activity recognition dataset, comprised of 392 CSI amplitude spectrograms.
DA_NLOS - NLOS activity recognition dataset, comprised of 384 CSI amplitude spectrograms.
Table 1: Characteristics of presence detection and activity recognition datasets.
| Dataset | Scenario | Rooms | Persons | Classes | Packet Sending Rate | Interval | Spectrograms |
|---------|----------|-------|---------|---------|---------------------|----------|--------------|
| DP_LOS  | LOS      | 1     | 1       | 6       | 100 Hz              | 4 s (400 packets) | 392 |
| DP_NLOS | NLOS     | 5     | 1       | 6       | 100 Hz              | 4 s (400 packets) | 384 |
| DA_LOS  | LOS      | 1     | 1       | 3       | 100 Hz              | 4 s (400 packets) | 392 |
| DA_NLOS | NLOS     | 5     | 1       | 3       | 100 Hz              | 4 s (400 packets) | 384 |
Data Format
Each dataset employs an 8:1:1 training-validation-test split, defined in the provided label files trainLabels.csv, validationLabels.csv, and testLabels.csv. Label files use the sample format [i c], with i corresponding to the spectrogram index (i.png) and c corresponding to the class. For the presence detection datasets (DP_LOS, DP_NLOS), c in {0 = "no presence", 1 = "presence in room 1", ..., 5 = "presence in room 5"}. For the activity recognition datasets (DA_LOS, DA_NLOS), c in {0 = "no activity", 1 = "walking", 2 = "walking + arm-waving"}. Furthermore, the mean and standard deviation of a given dataset are provided in meanStd.csv.
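For reference, a minimal loading sketch for one of these datasets, assuming the spectrograms are stored as i.png next to the label files and that meanStd.csv holds a single mean/std pair without a header; these layout details are assumptions and should be checked against the actual archives.

```python
import csv
import os
import numpy as np
from PIL import Image

def load_split(dataset_dir, label_file="trainLabels.csv"):
    """Load CSI amplitude spectrograms and class labels for one split (train/validation/test)."""
    # Per-dataset normalization constants (assumed: one row, mean followed by std)
    with open(os.path.join(dataset_dir, "meanStd.csv")) as f:
        mean, std = (float(v) for v in next(csv.reader(f))[:2])

    images, labels = [], []
    with open(os.path.join(dataset_dir, label_file)) as f:
        for row in csv.reader(f):
            if not row:
                continue
            i, c = int(row[0]), int(row[1])
            spec = np.asarray(Image.open(os.path.join(dataset_dir, f"{i}.png")), dtype=np.float32)
            images.append((spec - mean) / std)  # normalize the amplitude spectrogram
            labels.append(c)
    return np.stack(images), np.array(labels)
```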
Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].
[1] Strohmayer, Julian, and Martin Kampel. "WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32" International Conference on Computer Vision Systems. Cham: Springer Nature Switzerland, 2023.
BibTeX citation:
@inproceedings{strohmayer2023wifi,
  title={WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32},
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={International Conference on Computer Vision Systems},
  pages={41--50},
  year={2023},
  organization={Springer}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Humans need food, shelter, and water to survive. Our planet provides the resources to help fulfill these needs and many more. But exactly how much of an impact are we making on our planet? And will we reach a point at which the Earth can no longer support our growing population?

Just like a bank account tracks money spent and earned, the relationship between human consumption of resources and the number of resources the Earth can supply—our human footprint—can be measured. Our human footprint can be calculated for an individual, town, or country, and quantifies the intensity of human pressures on the environment. The Human Footprint map layer is designed to do this by deriving a value representing the magnitude of the human footprint per one square kilometer (0.39 square miles) for every biome.

This map layer was created by scientists with data from NASA's Socioeconomic Data and Applications Center to highlight where human pressures are most extreme, in hopes of reducing environmental damage. The Human Footprint map asks the question: where are the least influenced, most “wild” parts of the world?

The Human Footprint map was produced by combining thirteen global data layers that spatially visualize what is presumed to be the most prominent ways humans influence the environment. These layers include human population pressure (population density), human land use and infrastructure (built-up areas, nighttime lights, land use/land cover), and human access (coastlines, roads, railroads, navigable rivers). Based on the amount of overlap between layers, each square kilometer value is scaled between zero and one for each biome, meaning that if an area in a Moist Tropical Forest biome scored a value of one, that square kilometer of land is part of the one percent least influenced/most wild area in its biome. Knowing this, we can help preserve the more wild areas in every biome, while also highlighting where to start mitigating human pressures in areas with high human footprints.

So how can you reduce your individual human footprint? Here are just a few ways:
- Recycle: Recycling helps conserve resources, reduces water and air pollution, and helps save space in overcrowded landfills.
- Use less water: The average American uses 310 liters (82 gallons) of water a day. Reduce water consumption by taking shorter showers, turning off the water when brushing your teeth, avoiding pouring excess drinking water down the sink, and washing fruits and vegetables in a bowl of water rather than under the tap.
- Reduce driving: When you can, walk, bike, or take a bus instead of driving. Even 3 kilometers (2 miles) in a car puts about two pounds of carbon dioxide (CO2) into the atmosphere. If you must drive, try to carpool to reduce pollution. Lastly, skip the drive-through; you pollute more when you sit in a line while your car is emitting pollutant gases.
- Know how much you’re consuming: Most people are unaware of how much they are consuming every day. Calculate your individual ecological footprint to see how you can reduce your consumption here.
- Systemic implications: Individually, we are a rounding error. Take some time to understand how our individual actions can inform more systemic changes that may ultimately have a bigger impact on reducing humanity's overarching footprint.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Retail Analytics: Store owners can use the model to track the number of customers visiting their stores during different times of the day or seasons, which can help in workforce and resource allocation.
Crowd Management: Event organizers or public authorities can utilize the model to monitor crowd sizes at concerts, festivals, public gatherings or protests, aiding in security and emergency planning.
Smart Transportation: The model can be integrated into public transit systems to count the number of passengers in buses or trains, providing real-time occupancy information and assisting in transportation planning.
Health and Safety Compliance: During times of pandemics or emergencies, the model can be used to count the number of people in a location, ensuring compliance with restrictions on gathering sizes.
Building Security: The model can be adopted in security systems to track how many people enter and leave a building or a particular area, providing useful data for access control.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The original Udacity Self Driving Car Dataset is missing labels for thousands of pedestrians, bikers, cars, and traffic lights. This will result in poor model performance. When used in the context of self driving cars, this could even lead to human fatalities.
We re-labeled the dataset to correct errors and omissions. We have provided convenient downloads in many formats including VOC XML, COCO JSON, Tensorflow Object Detection TFRecords, and more.
Some examples of labels missing from the original dataset:
(Image: examples of labels missing from the original dataset)
The dataset contains 97,942 labels across 11 classes and 15,000 images. There are 1,720 null examples (images with no labels).
All images are 1920x1200 (download size ~3.1 GB). We have also provided a version downsampled to 512x512 (download size ~580 MB) that is suitable for most common machine learning models (including YOLO v3, Mask R-CNN, SSD, and mobilenet).
Annotations have been hand-checked for accuracy by Roboflow.
(Image: class balance chart)
Annotation Distribution:
(Image: annotation heatmap)
Udacity is building an open source self driving car! You might also try using this dataset to do person-detection and tracking.
Our updates to the dataset are released under the MIT License (the same license as the original annotations and images).
Note: the dataset contains many duplicated bounding boxes for the same subject, which we have not corrected. You will probably want to filter them out, for example by computing the IoU between same-class boxes and dropping those that overlap almost completely; otherwise they could affect your model's performance (especially in stoplight detection, which seems to suffer from a particularly severe case of duplicated bounding boxes).
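One simple way to do that filtering is sketched below, under the assumption that each annotation is a (class, box) pair with pixel-coordinate corners; the 0.95 IoU threshold is an arbitrary choice, not a value recommended by the dataset authors.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def drop_duplicate_boxes(annotations, threshold=0.95):
    """Keep only one box from each group of same-class boxes that almost completely overlap."""
    kept = []
    for cls, box in annotations:  # annotations: list of (class_name, (x1, y1, x2, y2))
        if any(c == cls and iou(box, b) >= threshold for c, b in kept):
            continue              # near-duplicate of an already kept box
        kept.append((cls, box))
    return kept
```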
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
https://choosealicense.com/licenses/bsd-2-clause/
Dataset Card for MPII Human Pose
The MPII Human Pose dataset is a state-of-the-art benchmark for the evaluation of articulated human pose estimation. The dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall the dataset covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/MPII_Human_Pose_Dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project consists of photos of individual human emotion expressions. The photos were taken with both a digital camera and a mobile phone camera from different angles, postures, backgrounds, light exposures, and distances. This task might look and sound very easy, but there were some challenges encountered along the process, which are reviewed below:

1) People constraint. One of the major challenges faced during this project was getting people to participate in the image capturing process, as school was on vacation and other individuals found around the environment were not willing to let their images be captured for personal and security reasons, even after explaining the notion behind the project, which is mainly for academic research purposes. Due to this challenge, we resorted to capturing the images of the researcher and just a few other willing individuals.

2) Time constraint. As with all deep learning projects, the more data available, the more accurate and less error-prone the results will be. At the initial stage of the project, it was agreed to have 10 emotional expression photos each of at least 50 persons, with the option to increase the number of photos for more accurate results, but due to the time constraint of this project an agreement was later made to capture only the researcher and a few other people who were willing and available. These photos were taken for just two types of human emotion expression, that is, “happy” and “sad” faces, also due to the time constraint. To expand this work further (as future works and recommendations), photos of other facial expressions such as anger, contempt, disgust, fright, and surprise can be included if time permits.

3) The approved facial emotion captures. It was agreed to capture as many angles and postures of just two facial emotions for this project, with at least 10 emotional expression images per individual, but due to time and people constraints only a few persons were captured with as many postures as possible, as stated below:
- Happy faces: 65 images
- Sad faces: 62 images
There are many other types of facial emotions, and again, to expand the project in the future, all the other types of facial emotions can be included if time permits and people are readily available.

4) Expand further. This project can be improved in many more ways; again, due to the limited time given to this project, these improvements can be implemented later as future works. In simple words, this project is to detect/predict real-time human emotion, which involves creating a model that can report a percentage confidence for any happy or sad facial image. The higher the percentage confidence, the more certain the model is about the facial image fed into it.

5) Other questions. Can the model be reproduced? The supposed response to this question should be YES, if and only if the model is fed with the proper data (images), such as images of other types of emotional expression.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The balding dataset consists of images of bald people and corresponding segmentation masks.
Segmentation masks highlight the regions of the images that delineate the bald scalp. By using these segmentation masks, researchers and practitioners can focus only on the areas of interest; the dataset can also be used to study androgenetic alopecia.
The alopecia dataset is designed to be accessible and easy to use, providing high-resolution images and corresponding segmentation masks in PNG format.
keywords: bald segmentation, image dataset, bald dataset, hair segmentation, facial images, human segmentation, bald computer vision, bald classification, bald detection, balding men, balding women, baldness, bald woman, bald scalp, bald head, biometric dataset, biometric data dataset, deep learning dataset, facial analysis, human images dataset, androgenetic alopecia, hair loss dataset, balding and non-balding
The INRIA Person dataset is a dataset of images of persons used for pedestrian detection. It consists of 614 person detections for training and 288 for testing.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Action recognition has received increasing attention from the computer vision and machine learning communities in the last decades. Ever since, the recognition task has evolved from single-view recordings under controlled laboratory environments to unconstrained environments (i.e., surveillance environments or user-generated videos). Furthermore, recent work has focused on other aspects of the action recognition problem, such as cross-view classification, cross-domain learning, multi-modality learning, and action localization. Despite the large variety of studies, we observed limited work exploring the open-set and open-view classification problems, which are genuine inherent properties of the action recognition problem. In other words, a well-designed algorithm should robustly identify an unfamiliar action as "unknown" and achieve similar performance across sensors with similar fields of view. The Multi-Camera Action Dataset (MCAD) is designed to evaluate the open-view classification problem under a surveillance environment.
In our multi-camera action dataset, unlike common action datasets, we use a total of five cameras, which can be divided into two types (Static and PTZ), to record actions. In particular, there are three Static cameras (Cam04, Cam05, and Cam06) with a fisheye effect and two Pan-Tilt-Zoom (PTZ) cameras (PTZ04 and PTZ06). The Static cameras have a resolution of 1280×960 pixels, while the PTZ cameras have a resolution of 704×576 pixels and a smaller field of view than the Static cameras. Moreover, we do not control the illumination environment; we even set two contrasting conditions (daytime and nighttime), which makes our dataset more challenging than many controlled datasets with strongly controlled illumination. The distribution of the cameras is shown in the picture on the right.
We identified 18 single-person daily actions, with or without objects, inherited from the KTH, IXMAS, and TRECVID datasets, among others. The list and definitions of the actions are shown in the table. These actions can also be divided into four types: micro actions without object (action IDs 01, 02, 05) and with object (action IDs 10, 11, 12, 13), and intense actions without object (action IDs 03, 04, 06, 07, 08, 09) and with object (action IDs 14, 15, 16, 17, 18). We recruited a total of 20 human subjects. Each candidate repeats each action 8 times (4 times during the day and 4 times in the evening) under one camera. In the recording process, we use the five cameras to record each action sample separately. During the recording stage we only tell candidates the action name; they can then perform the action freely with their own habits, as long as they perform it within the field of view of the current camera. This makes our dataset much closer to reality. As a result, there is high intra-class variation among different action samples, as shown in the picture of action samples.
URL: http://mmas.comp.nus.edu.sg/MCAD/MCAD.html
Resources:
IDXXXX.mp4.tar.gz contains video data for each individual
boundingbox.tar.gz contains person bounding box for all videos
protocol.json contains the evaluation protocol
img_list.txt contains the download URLs for the images version of the video data
idt_list.txt contains the download URLs for the improved Dense Trajectory feature
stip_list.txt contains the download URLs for the STIP feature
Manual annotated 2D joints for selected camera view and action class (available via http://zju-capg.org/heightmap/)
How to Cite:
Please cite the following paper if you use the MCAD dataset in your work (papers, articles, reports, books, software, etc):
Wenhui Liu, Yongkang Wong, An-An Liu, Yang Li, Yu-Ting Su, and Mohan Kankanhalli. "Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking." IEEE Winter Conference on Applications of Computer Vision (WACV), 2017. http://doi.org/10.1109/WACV.2017.28
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Security and Surveillance: Utilize the "human gesture for surveillance" model to analyze live video feeds from security cameras and detect suspicious gestures (e.g. walking_with_weapon or crawling_with_weapon) in critical or high-security areas, immediately alerting law enforcement for rapid response.
Military and Defense: Integrate the model into military surveillance systems to monitor border areas or conflict zones for potential threats and effectively distinguish between hostile and non-hostile actions, enabling prompt action to protect personnel and assets.
Public Event Monitoring: Use the model to evaluate crowd behavior during large public events, such as concerts or sports matches, identifying any potentially dangerous gestures or individuals and notifying security personnel to prevent possible incidents.
Airport Security: Implement the model in airport security systems to identify suspicious behaviors (e.g. creeping, crouching_with_weapon) in passengers within terminals, ensuring swift intervention from security forces and minimizing risk to other travelers.
Smart Traffic Management: Apply the model to analyze the behavior of cyclists, motorbike riders, and pedestrians in urban settings, allowing for adjustments to traffic signals or road infrastructure to enhance safety and efficiency for all road users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a new dataset, including behavioral, biometric, and environmental data, obtained from 23 subjects each spending 1 week to 2 months in smart rooms in Tokyo, Japan. The approximate duration of the experiment is 2 years. This dataset includes personal data, such as the use of home appliances, heartbeat rate, sleep status, temperature, and illumination. Although there are many datasets that publish these data individually, datasets that publish them all at once, tied to individual IDs, are valuable. The number of days for which data were obtained was 488, the number of records was 18,418,359, and the total size of the obtained data was 2.76 GB. This dataset can be used for machine learning and analysis for tips on getting a good night's sleep, for example.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was taken from the GitHub repository. It is made public by Databricks for research and commercial use cases. The repository originally provides a JSONL file, which was used to create the CSV file included in this dataset.
Blog post: Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM
databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation
Languages: English
Version: 1.0
Owner: Databricks, Inc.
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt/response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.
Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.
For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]), which we recommend users remove for downstream applications.
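For example, a simple regex-based cleanup of those citation markers (a sketch, not part of the official tooling) could look like this:

```python
import re

# Remove bracketed Wikipedia-style citation markers such as "[42]" from the context field.
CITATION_PATTERN = re.compile(r"\[\d+\]")

def strip_citations(context: str) -> str:
    return CITATION_PATTERN.sub("", context)

print(strip_citations("The reference text may end with a citation marker.[42]"))
# -> "The reference text may end with a citation marker."
```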
While immediately valuable for instruction fine-tuning of large language models, as a corpus of human-generated instruction prompts this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.
Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.
As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous co...