Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Associative Tag Recommendation Exploiting Multiple Textual Features
Fabiano Belem, Eder Martins, Jussara M. Almeida, Marcos Goncalves
In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'11), July 2011

Abstract
This work addresses the task of recommending relevant tags to a target object by jointly exploiting three dimensions of the problem: (i) term co-occurrence with tags preassigned to the target object, (ii) terms extracted from multiple textual features, and (iii) several metrics of tag relevance. In particular, we propose several new heuristic methods, which extend previous, highly effective and efficient, state-of-the-art strategies by including new metrics that try to capture how accurately a candidate term describes the object's content. We also exploit two learning-to-rank techniques, namely RankSVM and Genetic Programming, for the task of generating ranking functions that combine multiple metrics to accurately estimate the relevance of a tag to a given object. We evaluate all proposed methods in various scenarios for three popular Web 2.0 applications, namely LastFM, YouTube and YahooVideo. We found that our new heuristics greatly outperform the methods on which they are based, producing gains in precision of up to 181%, as well as another state-of-the-art technique, with improvements in precision of up to 40% over the best baseline in any scenario. Some further improvements can also be achieved, in some scenarios, with the new learning-to-rank based strategies, which have the additional advantage of being quite flexible and easily extensible to exploit other aspects of the tag recommendation problem.

BibTeX Citation
@inproceedings{belem@sigir11,
  author    = {Fabiano Bel\'em and Eder Martins and Jussara Almeida and Marcos Gon\c{c}alves},
  title     = {Associative Tag Recommendation Exploiting Multiple Textual Features},
  booktitle = {{Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'11)}},
  month     = {{July}},
  year      = {2011}
}
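As a rough illustration of the kind of heuristic described in the abstract (candidate terms drawn from tag co-occurrence and from multiple textual features, then ranked by a weighted combination of relevance metrics), a minimal sketch follows. The metrics and weights here are placeholders, not the paper's actual functions.

```python
# Illustrative sketch only: scores candidate tags by a weighted combination of
# relevance metrics, in the spirit of the heuristics described in the abstract.
from collections import Counter

def recommend_tags(preassigned_tags, textual_features, cooccurrence, weights, k=5):
    """Rank candidate terms for a target object.

    preassigned_tags: tags already attached to the object
    textual_features: dict feature_name -> list of terms (e.g. title, description)
    cooccurrence: dict tag -> Counter of co-occurring terms (from training data)
    weights: dict metric_name -> float (placeholder metric weights)
    """
    candidates = Counter()
    # (i) candidates from co-occurrence with preassigned tags
    for tag in preassigned_tags:
        candidates.update(cooccurrence.get(tag, Counter()))
    # (ii) candidates extracted from multiple textual features
    for terms in textual_features.values():
        candidates.update(terms)

    def score(term):
        cooc = sum(cooccurrence.get(t, Counter())[term] for t in preassigned_tags)
        # toy "descriptive power" metric: in how many textual features the term appears
        spread = sum(term in terms for terms in textual_features.values())
        return weights["cooc"] * cooc + weights["spread"] * spread

    ranked = sorted((t for t in candidates if t not in preassigned_tags),
                    key=score, reverse=True)
    return ranked[:k]
```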
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube with over 42 hours of video footage. The dataset consists of two components: segmented videos for activity recognition and continuous videos for activity classification. It is quite challenging as it is created from TV broadcast baseball games where multiple different activities share the camera angle. Further, the motion/appearance difference between the various activities is quite small.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for youtube-commons-asr-eval
Dataset Summary
This evaluation dataset is created from a subset of YouTube-Commons [PleIAs/YouTube-Commons] by selecting English YouTube videos and their corresponding English subtitles.
Supported Tasks and Leaderboards
This dataset will be primarily useful for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.
Languages
This subset is for English-language evaluations. See the full description on the dataset page: https://huggingface.co/datasets/mobiuslabsgmbh/youtube-commons-asr-eval.
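A minimal sketch of how this dataset might be used for ASR evaluation follows. The split name and the column names ("audio", "text") are assumptions, and Whisper-tiny is only a stand-in for whatever model is being evaluated; check the dataset card for the actual schema.

```python
# Hedged sketch of using this dataset for ASR evaluation. Split and column names
# ("audio", "text") are assumptions; adjust to the actual dataset schema.
from datasets import load_dataset
from jiwer import wer
from transformers import pipeline

ds = load_dataset("mobiuslabsgmbh/youtube-commons-asr-eval", split="test")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

refs, hyps = [], []
for sample in ds.select(range(10)):                      # small subset for illustration
    refs.append(sample["text"])                          # assumed reference transcript column
    hyps.append(asr(sample["audio"], chunk_length_s=30)["text"])

print("WER:", wer(refs, hyps))
```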
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 1.0, March 2024
Lloyd May (1), Keita Ohshiro (2,3), Khang Dang (2,3), Sripathi Sridhar (2,3), Jhanvi Pai (2,3), Magdalena Fuentes (4), Sooyeon Lee (3), Mark Cartwright (2,3,4)
If using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:
May, L., Ohshiro, K., Dang, K., Sridhar, S., Pai, J., Fuentes, M., Lee, S., Cartwright, M. Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2024.
The YouTube NSI Captioning Dataset was developed to analyze the contemporary and historical state of non-speech information (NSI) captioning on YouTube. NSI includes information about non-speech sounds such as environmental sounds, sound effects, incidental sounds, and music, as well as additional narrative information and extra-speech information (ESI), which gives context to spoken or signed language such as manner of speech (e.g. "[Whispering] Oh no") or speaker label (e.g., "[Juan] Oh no"). The dataset contains measures of estimated and annotated NSI in the captions of two different samples of videos: a popular video sample and a studio video sample. The aim of the popular sample is to understand the captioning practices in a broad spectrum of popular, impactful videos on YouTube. In contrast, the aim of the studio sample is to examine captioning practices among the top-tier production houses, often viewed as industry benchmarks due to their influence and vast resources available for accessibility. Using the YouTube API, we queried for videos in these two samples for each month from 2013 to 2022. We then estimated which captions contain NSI by searching for non-alphanumeric symbols that are indicative of NSI, e.g., "[" and "]" (see Section 3.2 of the paper for a full list). In addition, the research team manually annotated which captions have NSI from a subset of approximately 1800 videos from years 2013, 2018, and 2022. Please see the Section 3.3 of the paper for details of the annotation process.
The resulting YouTube NSI Captioning Dataset consists of NSI information from ~715k videos containing ~273M lines of captions, ~6M of which are estimated instances of NSI. These videos span 10 years and 21 topics. The annotated subset consists of 1799 videos with a total of ~36k annotated caption lines, ~114k of which are instances of NSI annotated with 7 different categories. These videos span 3 years (2013, 2018, and 2022) and 20 YouTube-assigned topics. Each video was annotated by two annotators, along with a consensus annotation. The dataset contains the links to the YouTube videos, video metadata from the YouTube API, and measures of both estimated and annotated NSI. Due to copyright concerns, we are only publicly releasing data consisting of summary NSI measures for each video. If you need access to the raw data used to create these summary NSI measures, contact Mark Cartwright at mark.cartwright@njit.edu.
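A minimal sketch of the symbol-based NSI estimation described above follows; the full symbol list is in Section 3.2 of the paper, so the pattern here only uses the bracket examples quoted in this description plus a couple of illustrative extras.

```python
# Minimal sketch of symbol-based NSI estimation. The full list of indicative
# symbols is in Section 3.2 of the paper; this pattern is only an example subset.
import re

NSI_PATTERN = re.compile(r"[\[\]\(\)♪♫]")  # illustrative subset of indicative symbols

def estimate_nsi_lines(caption_lines):
    """Return the caption lines estimated to contain NSI."""
    return [line for line in caption_lines if NSI_PATTERN.search(line)]

captions = ["[Whispering] Oh no", "Hello everyone", "♪ upbeat music ♪"]
print(estimate_nsi_lines(captions))  # -> ['[Whispering] Oh no', '♪ upbeat music ♪']
```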
estimated_full_set_aggregate.csv
: Data file containing the full set of video data with measures of estimated NSI.
annotated_subset_aggregate.csv
: Data file containing the smaller annotated subset of video data with measures of both annotated and estimated NSI.
The following columns are present in both data files.
video_id : The YouTube video ID
year : The year associated with the time period from which the video was sampled.
sample : The sample which the video is from (i.e., popular or studio)
sampling_period_start_date : The start date of the time period from which the video was sampled.
sampling_period_end_date : The end date of the time period from which the video was sampled.
caption_type : This can take one of three values: auto, which indicates the captions were provided by YouTube's automated caption system; manual, which indicates the captions were provided by the uploader; or none, which indicates that no captions are present for the video.
duration_minutes : The duration of the video in minutes.
channel_id : The ID that YouTube uses to uniquely identify the channel.
published_datetime : The date and time at which the video was published on YouTube.
youtube_topics : The YouTube-provided list of Wikipedia URLs that provide a description of the video's content.
category_id : The YouTube video category associated with the video.
view_count : The count of views on YouTube at the time of sampling (Spring 2023).
like_count : The count of likes on YouTube at the time of sampling (Spring 2023).
comment_count : The count of comments on YouTube at the time of sampling (Spring 2023).
high_level_topics : List of topics at a higher semantic level than youtube_topics that provide a description of the video's content. See paper for details on the mapping between youtube_topics and high_level_topics.
The remaining columns pair an NSI type with a measure and take the values listed below.
Values for the NSI type:
estimated_nsi : This NSI type is an estimation of NSI based on the presence of particular non-alphanumeric characters that are indicative of NSI as described in Section 3.2 of the paper.
general_nsi (only in annotated_subset_aggregate.csv) : The most general of the NSI types, inclusive of music_nsi, environmental_nsi, additionalnarrative_nsi, and quotedspeech_nsi. All of these NSI types are included in the calculation of measures associated with general_nsi. Note that misc_nsi and nonenglish_captions are not included, as those may or may not contain NSI, and thus we opt for precision over recall. Not present for the unlabeled full set.
music_nsi (only in annotated_subset_aggregate.csv) : Any genre of music, whether diegetic or not.
environmental_nsi (only in annotated_subset_aggregate.csv) : Environmental sounds, sound effects, and incidental sounds, i.e., non-music and non-speech sounds. This includes non-verbal vocalizations like laughter, grunts, and crying, provided they aren't used to modify speech.
extraspeech_nsi (only in annotated_subset_aggregate.csv) : Extra-speech information (ESI), i.e., text that gives added context to spoken or signed language.
additionalnarrative_nsi (only in annotated_subset_aggregate.csv) : Additional narrative information in the form of descriptive text that doesn't pertain directly to sounds.
quotedspeech_nsi (only in annotated_subset_aggregate.csv) : Quoted speech, i.e., captions containing internal quotation marks.
misc_nsi (only in annotated_subset_aggregate.csv) : Unsure, miscellaneous, or ambiguous, i.e., instances where the appropriate label is unclear or the caption doesn't fit current categories.
nonenglish_captions (only in annotated_subset_aggregate.csv) : Captions not written in English, which therefore have uncertain NSI status.
Values for the measure:
count : The number of captions identified as containing NSI of the specified type in the video.
presence : Indication of whether there is NSI of the specified type present in the video: 1 if present (i.e., count > 0), 0 if not present (i.e., count == 0).
count_per_minute : A measure of the density of NSI captions: count_per_minute = count / duration_minutes.
count_per_minute_if_present : If presence == 1, then count_per_minute; otherwise NaN. This is used for computing the aggregate CPMIP measure, which, as discussed in the paper, is intended to be a measure of the quality of NSI captions based on the assumption that more frequently captioned NSI within a video is an indicator of better NSI captioning. See Section 5 of the paper for details.
Dataset created by Lloyd May, Keita Ohshiro, Khang Dang, Sripathi Sridhar, Jhanvi Pai, Magdalena Fuentes, Sooyeon Lee, and Mark Cartwright
The YouTube NSI Captioning Dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/
Please help us improve YouTube NSI Captioning Dataset by sending your feedback to:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The EMOPIA (pronounced ‘yee-mò-pi-uh’) dataset is a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, intended to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs, with clip-level emotion labels annotated by four dedicated annotators.
For more detailed information about the dataset, please refer to our paper: EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation.
File Description
midis/: MIDI clips transcribed using GiantMIDI.
Filename Q1_xxxxxxx_2.mp3: Q1 means this clip belongs to Q1 on the V-A (valence-arousal) space; xxxxxxx is the song ID on YouTube; and the 2 means this clip is the 2nd clip taken from the full song.
metadata/: metadata from YouTube, obtained during crawling.
songs_lists/: YouTube URLs of the songs.
tagging_lists/: raw tagging results for each sample.
label.csv: metadata that records filename, 4Q label, and annotator.
metadata_by_song.csv: lists all the clips by song. Can be used to create the train/val/test splits so that the same song does not appear in both train and test (see the sketch below).
scripts/prepare_split.ipynb: the script to create train/val/test splits and save them to CSV files.
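A minimal sketch of the song-level split idea follows (the repository already ships scripts/prepare_split.ipynb for the actual splits). It only relies on the filename convention described above (Q1_xxxxxxx_2); the split ratios and everything else are illustrative.

```python
# Illustrative song-level split so that clips of the same song never end up in
# both train and test. Ratios and seed are placeholders.
import random

def song_id(clip_name):
    stem = clip_name.rsplit(".", 1)[0]              # drop .mp3/.mid extension if present
    return stem.split("_", 1)[1].rsplit("_", 1)[0]  # middle part is the YouTube song ID

def split_by_song(clip_names, val_ratio=0.1, test_ratio=0.1, seed=0):
    song_of = {c: song_id(c) for c in clip_names}
    songs = sorted(set(song_of.values()))
    random.Random(seed).shuffle(songs)

    n_val = int(len(songs) * val_ratio)
    n_test = int(len(songs) * test_ratio)
    val_songs = set(songs[:n_val])
    test_songs = set(songs[n_val:n_val + n_test])

    return {c: ("val" if s in val_songs else "test" if s in test_songs else "train")
            for c, s in song_of.items()}

print(split_by_song(["Q1_abcdefghijk_1.mp3", "Q1_abcdefghijk_2.mp3", "Q3_zyxwvutsrqp_1.mp3"]))
```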
2.2 Update
Added tagging files in tagging_lists/ that were missing in the previous version.
Added timestamps.json for easier usage. It records all the timestamps in dict format. See scripts/load_timestamp.ipynb for a format example.
Added scripts/timestamp2clip.py: after the raw audio has been crawled and placed in audios/raw, you can use this script to get the audio clips. The script reads timestamps.json and uses the timestamps to extract the clips, which are saved to the audios/seg folder.
Removed 7 MIDI files that were added by mistake, and corrected the corresponding number in metadata_by_song.csv.
2.1 Update
Add one file and one folder:
key_mode_tempo.csv: key, mode, and tempo information extracted from the files.
CP_events/: CP events used in our paper, extracted using this script, with the emotion event added to the front.
Modify one folder:
The REMI_events/ files in version 2.0 contained some information not related to the paper, so it has been removed.
2.0 Update
Add two new folders:
corpus/: processed data following the preprocessing flow. (Please note that although we have 1,087 clips in our dataset, we lost some clips during steps 1~4 of the flow, so the final number of clips in this corpus is 1052, and that is the number we used for training the generative model.)
REMI_events/: REMI events for each MIDI file, generated using this script.
Cite this dataset
@inproceedings{EMOPIA,
  author    = {Hung, Hsiao-Tzu and Ching, Joann and Doh, Seungheon and Kim, Nabin and Nam, Juhan and Yang, Yi-Hsuan},
  title     = {{EMOPIA}: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation},
  booktitle = {Proc. Int. Society for Music Information Retrieval Conf.},
  year      = {2021}
}
Kinetics-600 is a large-scale action recognition dataset which consists of around 480K videos from 600 action categories. The 480K videos are divided into 390K, 30K, and 60K for the training, validation, and test sets, respectively. Each video in the dataset is a 10-second clip of an action moment annotated from a raw YouTube video. It is an extension of the Kinetics-400 dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Because this dataset has been used in a competition, we had to hide some of the data to prepare the test dataset for the competition. Thus, in the previous version of the dataset, only the train.csv file existed.
This dataset represents 10 different physical poses that can be used to distinguish 5 exercises. The exercises are Push-up, Pull-up, Sit-up, Jumping Jack and Squat. For every exercise, 2 different classes have been used to represent the terminal positions of that exercise (e.g., “up” and “down” positions for push-ups).
About 500 videos of people doing the exercises were used to collect this data. The videos are from the Countix dataset, which contains the YouTube links of several human activity videos. Using a simple Python script, the videos of the 5 different physical exercises were downloaded. From every video, at least 2 frames were manually extracted. The extracted frames represent the terminal positions of the exercise.
For every frame, the MediaPipe framework is used for pose estimation, which detects the skeleton of the person in the frame. The landmark model in MediaPipe Pose predicts the location of 33 pose landmarks (see the image linked below). Visit the MediaPipe Pose Classification page for more details.
33 pose landmarks: https://mediapipe.dev/images/mobile/pose_tracking_full_body_landmarks.png
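A minimal sketch of the per-frame pose-estimation step described above, using the MediaPipe Pose solution to obtain the 33 landmarks; the frame path is a placeholder.

```python
# Sketch of extracting the 33 MediaPipe Pose landmarks from one extracted frame.
# "frame.jpg" is a placeholder for a terminal-position frame from the dataset.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

image = cv2.imread("frame.jpg")
with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # 33 landmarks, each with normalized x, y, z and a visibility score
    row = [(lm.x, lm.y, lm.z, lm.visibility) for lm in results.pose_landmarks.landmark]
    print(len(row))  # 33
```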
The number of YouTube users in Africa was forecast to increase continuously between 2024 and 2029 by a total of 0.03 million users (+3.95 percent). The YouTube user base is estimated to amount to 0.79 million users in 2029. User figures, shown here for the platform YouTube, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of YouTube users in regions like Worldwide and the Americas.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
=====================================================================
NII Face Mask Dataset v1.0
=====================================================================
Authors: Trung-Nghia Le (1), Khanh-Duy Nguyen (2), Huy H. Nguyen (1), Junichi Yamagishi (1), Isao Echizen (1)
Affiliations: (1)National Institute of Informatics, Japan (2)University of Information Technology-VNUHCM, Vietnam
National Institute of Informatics Copyright (c) 2021
Emails: {ltnghia, nhhuy, jyamagis, iechizen}@nii.ac.jp, {khanhd}@uit.edu.vn
Arxiv: https://arxiv.org/abs/2111.12888 NII Face Mask Dataset v1.0: https://zenodo.org/record/5761725
=============================== INTRODUCTION ===============================
The NII Face Mask Dataset is the first large-scale dataset targeting mask-wearing ratio estimation in street cameras. This dataset contains 581,108 face annotations extracted from 18,088 video frames (1920x1080 pixels) in 17 street-view videos obtained from the Rambalac's YouTube channel.
The videos were taken in multiple places, at various times, before and during the COVID-19 pandemic. The total length of the videos is approximately 56 hours.
=============================== REFERENCES ===============================
If you publish using any of the data in this dataset, please cite the following papers:
@article{Nguyen202112888, title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio}, author={Nguyen, Khanh-Duy and Nguyen, Huy H and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, archivePrefix={arXiv}, arxivId={2111.12888}, url={https://arxiv.org/abs/2111.12888}, year={2021} }
@INPROCEEDINGS{Nguyen2021EstMaskWearing, author={Nguyen, Khanh-Duy and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, booktitle={2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)}, title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio}, year={2021}, pages={1-8}, url={https://ieeexplore.ieee.org/document/9667046}, doi={10.1109/FG52635.2021.9667046}}
======================== DATA STRUCTURE ==================================
./NFM
├── dataset
│   ├── train.csv: annotations for the train set.
│   ├── test.csv: annotations for the test set.
└── README_v1.0.md
We use the same structure for the two CSV files (train.csv and test.csv). Both CSV files have the same columns:
<1st column>: video_id (the source video can be found at https://www.youtube.com/watch?v=<video_id>)
<2nd column>: frame_id (the index of a frame extracted from the source video)
<3rd column>: timestamp in milliseconds (the timestamp of the frame in the source video)
<4th column>: label (for each annotated face, one of three labels was attached to a bounding box: 'Mask'/'No-Mask'/'Unknown')
<5th column>: left
<6th column>: top
<7th column>: right
<8th column>: bottom
The four coordinates (left, top, right, bottom) denote a face's bounding box.
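A short sketch of reading train.csv with these columns and computing a per-frame mask-wearing ratio follows. Whether the CSV files include a header row is not stated here, so header=None with explicit column names is an assumption; adjust if needed.

```python
# Sketch: per-frame mask-wearing ratio from train.csv. header=None plus explicit
# column names is an assumption about the file layout.
import pandas as pd

cols = ["video_id", "frame_id", "timestamp_ms", "label", "left", "top", "right", "bottom"]
df = pd.read_csv("NFM/dataset/train.csv", header=None, names=cols)

known = df[df["label"] != "Unknown"]
ratio = (known.groupby(["video_id", "frame_id"])["label"]
              .apply(lambda s: (s == "Mask").mean())
              .rename("mask_wearing_ratio"))
print(ratio.head())
```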
============================== COPYING ================================
This repository is made available under Creative Commons Attribution License (CC-BY).
Regarding Creative Commons License: Attribution 4.0 International (CC BY 4.0), please see https://creativecommons.org/licenses/by/4.0/
THIS DATABASE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DATABASE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE
====================== ACKNOWLEDGEMENTS ================================
This research was partly supported by JSPS KAKENHI Grants (JP16H06302, JP18H04120, JP21H04907, JP20K23355, JP21K18023), and JST CREST Grants (JPMJCR20D3, JPMJCR18A6), Japan.
This dataset is based on the Rambalac's YouTube channel: https://www.youtube.com/c/Rambalac
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The YouTube-ASMR dataset contains URLs for over 900 hours of ASMR video clips with stereo/binaural audio produced by various YouTube artists. The following paper contains a detailed description of the dataset and how it was compiled:
K. Yang, B. Russell and J. Salamon, "Telling Left from Right: Learning Spatial Correspondence of Sight and Sound", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, June 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EOAD is a collection of videos captured by wearable cameras, mostly during sports activities. It contains both visual and audio modalities.
It was initiated from the HUJI and FPVSum egocentric activity datasets. However, the number of samples and the diversity of activities in HUJI and FPVSum were insufficient. Therefore, we combined these datasets and populated them with new YouTube videos.
The selection of videos was based on the following criteria:
Video samples were trimmed at scene changes for long videos (such as driving, scuba diving, and cycling). As a result, a video may have several clips depicting egocentric actions; hence, video clips were extracted from carefully defined time intervals within the videos. The final dataset includes video clips with a single action and natural audio information.
Statistics for EOAD:
The detailed statistics for the selected datasets and the video clips crawled from YouTube are given below:
The video clips used for the training, validation, and test sets of each activity are listed in Table 1. Multiple video clips may belong to a single video because long videos were trimmed for several reasons (i.e., scene cuts, temporarily overlaid text on the video, or video parts unrelated to the activity).
While splitting the dataset, the minimum number of videos for each activity was set to 8. The video samples were divided into 50%, 25%, and 25% for training (minimum four videos), validation (minimum two videos), and testing (minimum two videos), respectively. Videos were split according to the raw video footage to prevent similar video clips (having the same actors and scenes) from mixing across the training, validation, and test sets. We therefore ensured that video clips trimmed from the same video were placed together in the training, validation, or test set to allow a fair comparison.
Some activities, such as scuba diving, longboarding, or horseback riding, have continuity throughout the video and thus have as many video segments as videos. Other activities, such as skating, occur in a short time, making the number of video segments higher. As a result, the number of video clips in the training, validation, and test sets is highly imbalanced across the selected activities (e.g., jet ski and rafting have 4 training clips, whereas soccer has 99).
Table 1 - Dataset splitting for EOAD
Action Label | Train #Clips | Train Total Duration | Validation #Clips | Validation Total Duration | Test #Clips | Test Total Duration |
---|---|---|---|---|---|---|
AmericanFootball | 34 | 00:06:09 | 36 | 00:05:03 | 9 | 00:01:20 |
Basketball | 43 | 01:13:22 | 19 | 00:08:13 | 10 | 00:28:46 |
Biking | 9 | 01:58:01 | 6 | 00:32:22 | 11 | 00:36:16 |
Boxing | 7 | 00:24:54 | 11 | 00:14:14 | 5 | 00:17:30 |
BungeeJumping | 7 | 00:02:22 | 4 | 00:01:36 | 4 | 00:01:31 |
Driving | 19 | 00:37:23 | 9 | 00:24:46 | 9 | 00:29:23 |
GoKart | 5 | 00:40:00 | 3 | 00:11:46 | 3 | 00:19:46 |
Horseback | 5 | 01:15:14 | 5 | 01:02:26 | 2 | 00:20:38 |
IceHockey | 52 | 00:19:22 | 46 | 00:20:34 | 10 | 00:36:59 |
Jetski | 4 | 00:23:35 | 5 | 00:18:42 | 6 | 00:02:43 |
Kayaking | 28 | 00:43:11 | 22 | 00:14:23 | 4 | 00:11:05 |
Kitesurfing | 30 | 00:21:51 | 17 | 00:05:38 | 6 | 00:01:32 |
Longboarding | 5 | 00:15:40 | 4 | 00:18:03 | 4 | 00:09:11 |
Motorcycle | 20 | 00:49:38 | 21 | 00:13:53 | 8 | 00:20:30 |
Paintball | 7 | 00:33:52 | 4 | 00:12:08 | 4 | 00:08:52 |
Paragliding | 11 | 00:28:42 | 4 | 00:10:16 | 4 | 00:19:50 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Accident Detection Model is built using YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a provided video. The model is trained on a dataset of 3200+ images, which were annotated on Roboflow.
Survey: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
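A minimal inference sketch with the Ultralytics YOLOv8 API follows; the weights and video file names are placeholders, not files shipped with this dataset.

```python
# Hedged sketch of running a trained YOLOv8 detector on a video. The file names
# "accident_best.pt" and "dashcam_clip.mp4" are placeholders.
from ultralytics import YOLO

model = YOLO("accident_best.pt")
results = model("dashcam_clip.mp4", stream=True)  # also accepts an image path or webcam index

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        print(cls_name, float(box.conf), box.xyxy.tolist())
```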
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern content sharing environments such as Flickr or YouTube contain a large number of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users' private sphere. In order to support users in making privacy decisions in the context of image sharing and to provide them with a better overview of privacy-related visual content available on the Web, we propose techniques to automatically detect private images and to enable privacy-oriented image search. In order to classify images, we use metadata such as the title and tags, and plan to use visual features, which are described in our scientific paper. The data set used in the paper is now available.
Picalet! cleaned dataset (recommended for experiments)
userstudy (images annotated with queries, anonymized user ID, and privacy value)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Even though culture has been found to play some role in negative emotion expression, affective computing research primarily takes a basic-emotion approach when analyzing social signals for automatic emotion recognition technologies. Furthermore, automatic negative emotion recognition systems are still trained on data that originates primarily from North America and contains a majority of Caucasian training samples. The current study aims to address this problem by analyzing the differences in the underlying social signals, leveraging machine learning models to classify 3 negative emotions, contempt, anger, and disgust (CAD), across 3 different cultures: North American, Persian, and Filipino. Using a curated data set compiled from YouTube videos, a support vector machine (SVM) was used to predict negative emotions across the different cultures. In addition, a one-way ANOVA was used to analyse the differences between the culture groups in terms of the level of activation of the underlying social signals. Our results not only highlighted the significant differences in the associated social signals that were activated for each culture, but also indicated the specific underlying social signals that differ across our cross-cultural data sets. Furthermore, the automatic classification methods showed North American expressions of CAD to be well recognized, while Filipino and Persian expressions were recognized at near-chance levels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Speech Emotion Recognition (SER) is a rapidly evolving field of research aimed at identifying and categorizing emotional states through the analysis of speech signals. As SER holds significant socio-cultural and commercial importance, researchers are increasingly leveraging machine learning and deep learning techniques to drive advancements in this domain. A high-quality dataset is an essential resource for SER studies in any language. Despite Urdu being the 10th most spoken language globally, there is a significant lack of robust SER datasets, creating a research gap. Existing Urdu SER datasets are often limited by their small size, narrow emotional range, and repetitive content, reducing their applicability in real-world scenarios. To address this gap, the Urdu Speech Emotion Recognition (UrduSER) dataset was developed. This comprehensive dataset includes 3500 Urdu speech signals sourced from 10 professional actors, with an equal representation of male and female speakers from diverse age groups. The dataset encompasses seven emotional states: Angry, Fear, Boredom, Disgust, Happy, Neutral, and Sad. The speech samples were curated from a wide collection of Pakistani Urdu drama serials and telefilms available on YouTube, ensuring diversity and natural delivery. Unlike conventional datasets, which rely on predefined dialogs recorded in controlled environments, UrduSER features unique and contextually varied utterances, making it more realistic and applicable for practical applications. To ensure balance and consistency, the dataset contains 500 samples per emotional class, with 50 samples contributed by each actor for each emotion. Additionally, an accompanying Excel file provides detailed metadata for each recording, including the file name, duration, format, sample rate, actor details, emotional state, and corresponding Urdu dialog. This metadata enables researchers to efficiently organize and utilize the dataset for their specific needs. The UrduSER dataset underwent rigorous validation, integrating expert evaluation and model-based validation to ensure its reliability, accuracy, and overall suitability for advancing research and development in Urdu Speech Emotion Recognition.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UnusualAction Dataset for Action Recognition
Nitika Nigam, Tanima Dutta and Hari Prabhat Gupta, Indian Institute of Technology (BHU), India.
Overview: UnusualAction is an uncertain action recognition dataset of actions that rarely happen, collected from YouTube. The dataset comprises 14 unusual action categories, and each category contains 50-100 videos. UnusualAction provides diversity in terms of the different actions and the presence of noise, such as variations in camera motion, person appearance, viewpoint, cluttered background, illumination conditions, etc. It is a challenging dataset for uncertain action recognition. Most action recognition datasets are based on certain actions; on the contrary, UnusualAction aims to encourage further research into uncertain action recognition by learning and exploring new realistic action categories.
Structure of the UnusualAction Dataset
● Data associated with each UnusualAction category is stored in a separate directory.
● Each directory comprises *.mp4 or *.avi video files.
● The directories are arranged in the following structure:
FallAction_datasets
├── Blending_phone
├── Crushing_laptop
├── Cutting_keyboard
├── Drilling_Laptop
├── Drilling_Phone
├── Frying_Phone
├── Hammering_Laptop
├── Hammering_phone
├── Hammering_pumpkin
├── Hammering_watermelon
├── Microwave_shoes
├── Microwave_phone
├── Washing_laptop
└── Washing_Paptop
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TweetNERD - End to End Entity Linking Benchmark for Tweets
Paper - Video - Neurips Page
This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets (accepted to Thirty-sixth Conference on Neural Information Processing Systems (Neurips) Datasets and Benchmarks Track).
Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets across 2010-2021, for benchmarking NERD systems on Tweets. This is the largest and most temporally diverse open sourced dataset benchmark for NERD on Tweets and can be used to facilitate research in this area.
TweetNERD dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0) LICENSE.
The license only applies to the data files present in this dataset. See Data usage policy below.
Check out more details at https://github.com/twitter-research/TweetNERD
Usage
We provide the dataset split across the following tab-separated files:
part_*.public.tsv : Remaining data split into parts in no particular order.
Each file is tab-separated and has the following format:
tweet_id | phrase | start | end | entityId | score |
---|---|---|---|---|---|
22 | twttr | 20 | 25 | Q918 | 3 |
21 | twttr | 20 | 25 | Q918 | 3 |
1457198399032287235 | Diwali | 30 | 38 | Q10244 | 3 |
1232456079247736833 | NO_PHRASE | -1 | -1 | NO_ENTITY | -1 |
For Tweets which don't have any entity, their column values for phrase, start, end, entityId, and score are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1, respectively.
Description of file columns is as follows:
Column | Type | Missing Value | Description |
---|---|---|---|
tweet_id | string | ID of the Tweet | |
phrase | string | NO_PHRASE | entity phrase |
start | int | -1 | start offset of the phrase in text using UTF-16BE encoding |
end | int | -1 | end offset of the phrase in the text using UTF-16BE encoding |
entityId | string | NO_ENTITY | Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918 |
score | int | -1 | Number of annotators who agreed on the phrase, start, end, entityId information |
In order to use the dataset you need to take the tweet_id column and get the Tweet text using the Twitter API (see the Data usage policy section below).
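A hedged sketch of this workflow follows: hydrating Tweet text with tweepy and recovering the annotated phrase using the UTF-16 offsets described above. It assumes the TSV files include a header row, and it requires your own bearer token obtained under the Data usage policy; the part file name is just one of the files listed above.

```python
# Hedged sketch: hydrate Tweet text with tweepy, then slice out the annotated
# phrase using the UTF-16 code-unit offsets (start, end) described above.
import pandas as pd
import tweepy

df = pd.read_csv("part_0.public.tsv", sep="\t")          # assumes a header row
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # your own credentials

batch = df["tweet_id"].astype(str).unique()[:100].tolist()  # up to 100 ids per lookup call
resp = client.get_tweets(ids=batch)
text_by_id = {str(t.id): t.text for t in (resp.data or [])}

for _, row in df[df["entityId"] != "NO_ENTITY"].iterrows():
    text = text_by_id.get(str(row["tweet_id"]))
    if text is None:
        continue
    # start/end are offsets in UTF-16 code units, so slice in that encoding (2 bytes each)
    u16 = text.encode("utf-16-be")
    phrase = u16[2 * int(row["start"]):2 * int(row["end"])].decode("utf-16-be")
    print(row["tweet_id"], phrase, row["entityId"])
```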
Data stats
Split | Number of Rows | Number of unique Tweets |
---|---|---|
OOD | 34102 | 25000 |
Academic | 51685 | 30119 |
part_0 | 11830 | 10000 |
part_1 | 35681 | 25799 |
part_2 | 34256 | 25000 |
part_3 | 36478 | 25000 |
part_4 | 37518 | 24999 |
part_5 | 36626 | 25000 |
part_6 | 34001 | 24984 |
part_7 | 34125 | 24981 |
part_8 | 32556 | 25000 |
part_9 | 32657 | 25000 |
part_10 | 32442 | 25000 |
part_11 | 32033 | 24972 |
Data usage policy
Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.
Please cite the following if you use TweetNERD in your paper:
@dataset{TweetNERD_Zenodo_2022_6617192,
  author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  title     = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
  month     = jun,
  year      = 2022,
  note      = {{Data usage policy: Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
  publisher = {Zenodo},
  version   = {0.0.0},
  doi       = {10.5281/zenodo.6617192},
  url       = {https://doi.org/10.5281/zenodo.6617192}
}

@inproceedings{TweetNERDNeurips2022,
  author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  pages     = {},
  title     = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
  volume    = {2},
  year      = {2022},
  eprint    = {arXiv:2210.08129},
  doi       = {10.48550/arXiv.2210.08129}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce the Group Affect from ViDeos (GAViD) dataset, which comprises 5091 video clips with multimodal data (video, audio, and context), annotated with ternary valence and discrete emotion labels, and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present CAGNet, a baseline model for multimodal context-aware group affect recognition. CAGNet achieves 61.20% test accuracy on GAViD, comparable to state-of-the-art performance in the field.
NOTE: For now, we are providing only the Train video clips. The corresponding paper is under review in the ACM Multimedia 2025 Dataset Track. After its publication, access to the Validation and Test sets will be granted upon request and approval, in accordance with the Responsible Use Policy.
GAViD is a large-scale, in-the-wild multimodal dataset of 5091 samples, each annotated with the elements listed below. The following sections describe its key details and compilation procedure.
Dataset details
Positive | Positive | Negative | Negative | Neutral | Neutral |
---|---|---|---|---|---|
Team Celebration | Happy | Protest | Angry Sport | Group Meeting | Panel Discussion |
Group Meeting | Video Conference | Heated Argument | Violent Protest | Parliament Speech | People on Street |
Get Together | Meeting | Emotional Breakdown in Public | Aggressive Argument | People Walking on Street | Team Brainstorming Session |
Celebration | Press Conference | Spiritual Gathering | Aggressive Group | Team Building Activities | Group Discussion |
Religious Gathering | Talk Show | Street Race | Condolence | Group Work Session | Team Planning Session |
Farewell | Group Performance | Group Fight | Wrestling | Students in Discussion | Wedding Group Dance |
People Dancing on Street | Street Comedy | MMA Fight | Violence | Roundtable Discussion | Oath |
Wedding Performance | Dhol Masti | Boxing | Silent Protest | Mental Health Address | General Talk |
Couple Group Dance | Comedy Show | People in a Fight | Group Fight | Wedding Celebration | Festival Celebration |
Model | Val Acc. | Val F1 | Test Acc. | Test F1 |
---|---|---|---|---|
CAGNet | 62.55% | 0.454 | 60.33% | 0.448 |
The dataset comprises two main components:
The dataset is structured as a GAViD.csv file along with the corresponding videos in related folders. This CSV file includes the following fields:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data set of materials in vessels
The handling of materials in glassware vessels is the main task in chemistry laboratory research, as well as in a large number of other activities. Visual recognition of the physical phase of the materials is essential for many methods, ranging from simple tasks such as fill-level evaluation to the identification of more complex properties such as solvation, precipitation, crystallization, and phase separation. To help train neural nets for this task, a new data set was created. The data set contains a thousand images of materials, in different phases and involved in different chemical processes, in a laboratory setting. Each pixel in each image is labeled according to several layers of classification, as given below:

a. Vessel/Background: For each pixel, assign a value of one if it is part of the vessel and zero otherwise. This annotation was used as the ROI map for the valve filter method.

b. Filled/Empty: This is similar to the above, but also distinguishes between the filled and empty regions of the vessel. For each pixel, one of the following three values is assigned: 0 (background); 1 (empty vessel); or 2 (filled vessel).

c. Phase type: This is similar to the above but distinguishes between liquid and solid regions of the filled vessel. For each pixel, one of the following four values is assigned: 0 (background); 1 (empty vessel); 2 (liquid); or 3 (solid).

d. Fine-grained physical phase type: This is similar to the above but distinguishes between specific classes of physical phase. For each pixel, one of 15 values is assigned: 1 (background); 2 (empty vessel); 3 (liquid); 4 (liquid phase two, in the case where more than one phase of liquid appears in the vessel); 5 (suspension); 6 (emulsion); 7 (foam); 8 (solid); 9 (gel); 10 (powder); 11 (granular); 12 (bulk); 13 (solid-liquid mixture); 14 (solid phase two, in the case where more than one phase of solid exists in the vessel); and 15 (vapor).

The annotations are given as images of the size of the original image, where the pixel value is the class number. The annotation of the vessel region (a) is used as the ROI input for the valve filter net.
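As a small illustration of how these annotation layers nest, the sketch below collapses a fine-grained map (d, values 1-15) down to the Filled/Empty map (b, values 0-2) using the class values listed above; the file name is a placeholder, and vapor is treated as "filled" purely for illustration.

```python
# Sketch: collapse the fine-grained annotation (d) to the Filled/Empty map (b).
# "example_fine_grained_annotation.png" is a placeholder file name.
import numpy as np
from PIL import Image

fine = np.array(Image.open("example_fine_grained_annotation.png"))

filled_empty = np.zeros_like(fine)   # fine == 1 (background) stays 0
filled_empty[fine == 2] = 1          # empty vessel
filled_empty[fine >= 3] = 2          # any material class counts as "filled" here

print(np.unique(filled_empty))
```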
4.1. Validation/testing set
The data set is divided into training and testing sets. The testing set is itself divided into two subsets; one contains images extracted from the same YouTube channels as the training set, and therefore was taken under similar conditions as the training images. The second subset contains images extracted from YouTube channels not included in the training set, and hence contains images taken under different conditions from those used to train the net.
4.2. Creating the data set
The creation of a large number of images with a variety of chemical processes and settings could have been a daunting task. Luckily, several YouTube channels dedicated to chemical experiments exist which offer high-quality footage of chemistry experiments. Thanks to these channels, including NurdRage, NileRed, and ChemPlayer, it was possible to collect a large number of high-quality images in a short time. Pixel-wise annotation of these images was another challenging task, and was performed by Alexandra Emanuel and Mor Bismuth.
For more details see: Setting attention region for convolutional neural networks using region selective features, for recognition of materials within glass vessels
This dataset was first published in August 2017.
For newer and bigger datasets, see:
https://zenodo.org/record/4736111#.YbG-RrtyZH4
https://zenodo.org/record/3697452#.YbG-TLtyZH4