Overview
This data set consists of links to social network items for 34 different forensic events that took place between August 14th, 2018 and January 6th, 2021. The majority of the text and images are from Twitter (a minor part is from Flickr, Facebook and Google+), and every video is from YouTube.
Data Collection
We used Social Tracker, along with the social media APIs, to gather most of the collections. For a minor part, we used Twint. In both cases, we provided keywords related to the event in order to receive the data. It is worth mentioning that, in procedures like this one, usually only a small fraction of the collected data is actually related to the event and useful for further forensic analysis.
Content
We have data from 34 events, and for each of them we provide the following files:
items_full.csv: Links to every social media post that was collected.
images.csv: Lists the images collected. Some files contain a field called "ItemUrl", which refers to the social network post (e.g., a tweet) that mentions that media item.
video.csv: URLs of the YouTube videos gathered about the event.
video_tweet.csv: IDs of tweets and IDs of YouTube videos. A tweet whose ID is in this file has a video in its content; in turn, a YouTube video whose ID is in this file was mentioned by at least one collected tweet. Only two collections have this file.
description.txt: Standard information about the event, and possibly comments about any issue specific to it.
Most of the collections do not include all of the files above, owing to changes in our collection procedure over the course of this work.
Events
We divided the events into six groups:
Fire: A devastating fire is the main subject of the event, so most of the informative pictures show flames or burned buildings. 14 events.
Collapse: Most of the relevant images depict collapsed buildings, bridges, etc. (not caused by fire). 5 events.
Shooting: Typically images of guns and police officers, with little or no destruction of the environment. 5 events.
Demonstration: Large numbers of people on the streets. Problems may have occurred, but in most cases the demonstration itself is the event. 7 events.
Collision: Traffic collision; pictures of damaged vehicles in an urban landscape, possibly including victims on the street. 1 event.
Flood: Events ranging from fierce rain to a tsunami; many pictures depict water. 2 events.
Media Content
Due to the social networks' terms of use, we do not make the collected texts, images and videos publicly available. However, extra media content related to one or more events can be provided by contacting the authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, Deep Learning, OpenCV, Machine Learning, and Artificial Intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3,200+ images, which were annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
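A minimal inference sketch with the Ultralytics YOLOv8 Python API is shown below; the weights file name (accident_best.pt) and the input image path are illustrative assumptions, since the trained weights' file name is not published in this record.

```python
# Minimal inference sketch with the Ultralytics YOLOv8 API.
# "accident_best.pt" and "crash_frame.jpg" are hypothetical paths, not files
# published with this dataset.
from ultralytics import YOLO

model = YOLO("accident_best.pt")  # load the trained accident-detection weights
results = model.predict(source="crash_frame.jpg", conf=0.5)  # also accepts video paths or stream URLs

for r in results:
    for box in r.boxes:
        label = model.names[int(box.cls)]                 # predicted class name
        print(label, float(box.conf), box.xyxy.tolist())  # confidence and bounding box
```

The same predict() call accepts a video file path or a camera/stream source, which is how live detection would be wired up.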
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classify video clips with natural scenes of actions performed by people visible in the videos.
See the UCF101 Dataset web page: https://www.crcv.ucf.edu/data/UCF101.php#Results_on_UCF101
This example dataset consists of the 5 most numerous video classes from the UCF101 dataset. For the top-10 version, see: https://doi.org/10.5281/zenodo.7882861 .
Based on this code: https://keras.io/examples/vision/video_classification/ (which needs to be updated, if it has not been already; see the issue: https://github.com/keras-team/keras-io/issues/1342).
Testing if data can be downloaded from figshare with `wget`, see: https://github.com/mojaveazure/angsd-wrapper/issues/10
For generating the subset, see this notebook: https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb -- however, it also needs to be adjusted (if it has not been already; in that case, I will post a link to the updated notebook here or elsewhere, e.g., in the corrected notebook with the Keras example).
I would like to thank Sayak Paul for contacting me about his example in the Keras documentation being out of date.
Cite this dataset as:
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402
To download the dataset via the command line, please use:
wget -q https://zenodo.org/record/7924745/files/ucf101_top5.tar.gz -O ucf101_top5.tar.gz
tar xf ucf101_top5.tar.gz
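After extraction, a quick way to sanity-check the clips is to read their frames with OpenCV. This is only a sketch: the folder name and the assumption that the archive contains .avi clips in per-class subfolders are guesses about the layout, not verified facts.

```python
# Sanity-check sketch: read frames from one extracted clip with OpenCV.
# The folder name "ucf101_top5" and the assumption of .avi clips inside it
# are guesses about the archive layout.
import cv2
from pathlib import Path

clips = sorted(Path("ucf101_top5").rglob("*.avi"))  # hypothetical extraction folder
cap = cv2.VideoCapture(str(clips[0]))

frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.resize(frame, (224, 224)))  # resize to a typical CNN input size
cap.release()

print(f"{clips[0].name}: {len(frames)} frames read")
```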
The data was collected using a Google Form asking which video streaming platforms people use. The purpose of the collection was to apply the Apriori algorithm to it, but I am posting it here to see what else can be done with it.
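For illustration, a typical Apriori workflow on this kind of one-hot survey data might look like the following sketch using mlxtend; the platform column names and values are invented, not taken from the actual form responses.

```python
# Illustrative Apriori run with mlxtend on made-up one-hot survey responses;
# the platform columns are invented and do not reflect the actual form.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per respondent, one boolean column per streaming platform (toy data).
df = pd.DataFrame({
    "Netflix": [True, True, False, True],
    "YouTube": [True, True, True, False],
    "Disney+": [False, True, False, True],
})

frequent = apriori(df, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```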
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The IdiapVideoAge dataset is a set of YouTube video IDs with age labels to facilitate research in audio-visual age verification, with a focus on detecting the ages of people below 18 years old. The dataset contains 4260 IDs of YouTube videos that come from two existing video databases: VoxCeleb2 and the child speech dataset from Google. Our main contribution is the age labels of the people in the videos. Three different human annotators were used for labeling. They were instructed to give a valid age label only if a person's face in a video is visible in more than 80% of the frames and it is clear that the audible speech matches the person in the video. As the age label, we used the average of the three annotations. Of the 4260 videos, 1973 are of minors below 18 years old.
Reference
If you use this dataset, please cite the following publication:
Pavel Korshunov and Sebastien Marcel, "Face Anthropometry Aware Audio-visual Age Verification", ACM Multimedia international conference (MM'22), October 2022.
https://publications.idiap.ch/index.php/publications/show/4862
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
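As a hedged illustration, a query against this public dataset can be run from Python with the google-cloud-bigquery client; this assumes a configured Google Cloud project with the BigQuery API enabled and application-default credentials, which is setup not covered by the dataset description itself.

```python
# Hedged sketch: count repositories per language in the public GitHub dataset.
# Assumes a configured Google Cloud project and application-default credentials.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT lang.name AS language, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.languages`,
         UNNEST(language) AS lang
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.language, row.repo_count)
```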
IoTeX is a decentralized crypto system, a new generation of blockchain platform for the development of the Internet of Things (IoT). The project team believes that users currently lack an application compelling enough to motivate adoption of IoT technology in everyday life, and until such an application exists, people will not want to spend money and time on IoT. The developers of IoTeX therefore decided to build not the application itself, but a platform for creating such applications. It is through this platform that innovative steps in the IoT space will be encouraged. Learn more... This dataset is one of many crypto datasets that are available within the Google Cloud Public Datasets. As with other Google Cloud public datasets, you can query this dataset for free, up to 1TB of processing per month. Watch this short video to learn how to get started with the public datasets. Want to know how the data from these blockchains was brought into BigQuery, and learn how to analyze the data? Learn more...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
I3D Video Features, Labels and Splits for Multicamera Overlapping Datasets Pets-2009, HQFS and Up-Fall
The Inflated 3D (I3D) video features, ground truths, and train/test splits for the multicamera datasets PETS-2009, HQFS, and Up-Fall are available here. We relabeled two of the datasets (HQFS and PETS-2009) for the task of video anomaly detection with multiple-instance learning (VAD-MIL) under multiple cameras. Three I3D feature variants are available: I3D-RGB, I3D-OF, and the linear concatenation of the two. These datasets can be used as benchmarks for the video anomaly detection task under multiple-instance learning with multiple overlapping cameras.
Preprocessed Datasets
PETS-2009 is a benchmark dataset (https://cs.binghamton.edu/~mrldata/pets2009) aggregating different scene sets with multiple overlapping camera views and distinct events involving crowds. We labeled the scenes at frame level as anomalous or normal events. Scenes with background, people walking individually or in a crowd, and regular passing of cars are considered normal patterns. Frames with occurrences of people running (individually or in a crowd), crowding of people in the middle of the traffic intersection, and people moving against the flow were considered anomalous patterns. Videos of scenes containing anomalous frames are labeled as anomalous, while videos without anomalies are marked as normal. The High-Quality Fall Simulation Data (HQFS) dataset (https://iiw.kuleuven.be/onderzoek/advise/datasets/fall-and-adl-meta-data) is an indoor scenario with five overlapping cameras and occurrences of fall incidents. We consider a person falling on the floor an uncommon event. We also relabeled the frame annotations to include the intervals where the person remains lying on the ground after the fall. The multi-class Up-Fall detection dataset (https://sites.google.com/up.edu.mx/har-up/) contains two overlapping camera views and infrared sensors in a laboratory scenario.
Video Feature Extraction
We use Inflated 3D (I3D) features to represent video clips of 16 frames. We use the Video Features library (https://github.com/v-iashin/video_features), which uses a model pre-trained on the Kinetics 400 dataset. For this procedure, the frame sequence length from which to get the video clip feature representation (or window size) and the number of frames to step before extracting the next features were both set to 16 frames. After the feature extraction process, each video from each camera corresponds to a matrix with dimension n x 1024, where n is the variable number of segments and 1024 is the number of attributes (I3D attributes referring to RGB appearance information or I3D attributes referring to optical flow information). It is important to note that the videos (bags) are divided into clips with a fixed number of frames. Consequently, each video bag contains a variable number of clips. A clip can be completely normal, completely anomalous, or a mix of normal and anomalous frames. There are three possible deep feature dispositions: I3D features generated from RGB only (1024 I3D features from RGB data), from optical flow (1024 I3D features from optical flow data), and the combination of both (by simple linear concatenation). We also make 10-crop features available (https://pytorch.org/vision/main/generated/torchvision.transforms.TenCrop.html), yielding 10 crops for a given video clip.
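As a small illustration of the feature layout described above, the sketch below loads per-video I3D matrices and forms the concatenated RGB + optical-flow variant; the .npy file names are assumptions about how the features are stored on disk.

```python
# Illustration of the feature layout described above: load per-video I3D
# matrices and build the concatenated RGB + optical-flow variant.
# The .npy file names are assumptions about how the features are stored.
import numpy as np

rgb = np.load("camera1_video01_rgb.npy")    # shape (n_clips, 1024), hypothetical file
flow = np.load("camera1_video01_flow.npy")  # shape (n_clips, 1024), hypothetical file

combined = np.concatenate([rgb, flow], axis=1)  # simple linear concatenation -> (n_clips, 2048)
print(rgb.shape, flow.shape, combined.shape)
```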
File Description
center-crop.zip: Folder with I3D features of Pets-2009, HQFS and Up-Fall datasets;
10-crop.zip: Folder with I3D features (10-crop) of Pets-2009, HQFS and Up-Fall datasets;
gts.zip: Folder with ground truths at frame-level and video-level of Pets-2009, HQFS and Up-Fall datasets;
splits.zip: Folder with Lists of training and test splits of Pets-2009, HQFS and Up-Fall datasets;
A portion of the preprocessed I3D feature sets was leveraged in the studies outlined in these publications:
Pereira, S. S., & Maia, J. E. B. (2024). MC-MIL: video surveillance anomaly detection with multi-instance learning and multiple overlapped cameras. Neural Computing and Applications, 36(18), 10527-10543. Available at https://link.springer.com/article/10.1007/s00521-024-09611-3.
Pereira, S. S. L., Maia, J. E. B., & Proença, H. (2024, September). Video Anomaly Detection in Overlapping Data: The More Cameras, the Better?. In 2024 IEEE International Joint Conference on Biometrics (IJCB) (pp. 1-10). IEEE. Available at https://ieeexplore.ieee.org/document/10744502.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Because this dataset was used in a competition, we had to hide some of the data to prepare the competition's test set. Thus, in the previous version of the dataset, only the train.csv file existed.
This dataset represents 10 different physical poses that can be used to distinguish 5 exercises. The exercises are Push-up, Pull-up, Sit-up, Jumping Jack and Squat. For every exercise, 2 different classes have been used to represent the terminal positions of that exercise (e.g., “up” and “down” positions for push-ups).
About 500 videos of people doing the exercises were used to collect this data. The videos are from the Countix Dataset, which contains YouTube links to several human activity videos. Using a simple Python script, the videos of the 5 different physical exercises were downloaded. From every video, at least 2 frames were manually extracted. The extracted frames represent the terminal positions of the exercise.
For every frame, the MediaPipe framework is used to apply pose estimation, which detects the skeleton of the person in the frame. The landmark model in MediaPipe Pose predicts the location of 33 pose landmarks (see the figure below). Visit the MediaPipe Pose Classification page for more details.
Figure: 33 pose landmarks (https://mediapipe.dev/images/mobile/pose_tracking_full_body_landmarks.png)
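A minimal sketch of this pose-estimation step using the MediaPipe Pose solution is given below; the input frame path is illustrative.

```python
# Minimal pose-estimation sketch with the MediaPipe Pose solution.
# "pushup_up_frame.jpg" is a hypothetical extracted frame.
import cv2
import mediapipe as mp

image = cv2.imread("pushup_up_frame.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input

with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(image_rgb)

if results.pose_landmarks:
    # 33 landmarks, each with normalized x, y, z and a visibility score
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, round(lm.x, 3), round(lm.y, 3), round(lm.visibility, 3))
```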
This work develops a spatially-led practice to negotiate and share individuals' perspectives of their own life course. The technique is designed particularly for researching the culture(s) and feeling(s), the everyday life (Highmore, 2011), attached to a given epoch. The focus of my ESRC Postdoctoral Fellowship project is to understand the increasingly suburban and car-oriented places built in the 1960s and 1970s. The technique relies upon online mapping systems and technologies that allow video conversations to be recorded. The broad methodology takes essential elements of one-to-one biographical walking interviews. Sometimes referred to as go-alongs (Carpiano, 2009), the participant leads the way to show spaces and places significant to their life, with the interviewer guiding the conversation. Covid-19 restrictions limited face-to-face interviews (Hall, Gaved, & Sargent, 2021) but also opened the possibility for many conversations to move onto digital platforms. Spatially-led interviews are hosted on digital platforms such as Zoom, where participants and researchers share walks through media such as Google Maps. The conversation is digitally recorded, providing a complete visual record of the spaces visited during the conversation alongside the faces of the participants and their commentary.
There are three specific films in this record. They concern an interview with Pat Wright, who was happy for her likeness to be used.
• Moving to Newport in 1963. Gives context about the advantages of modern housing in the 1960s compared to older terraced houses with no central heating.
• Demolitions in Newport, mid-1970s. An account of the plan to build a bypass road through Newport, with interesting background on the renewal of the urban fabric of towns and cities in the UK and the rise of Civic Trusts to protect the built environment.
• Video of the opening of Newport Library, 1968. Context about the opening of Newport Library; reveals the power of geography to connect people with memories.
Two other individuals were interviewed using this technique, and their data may be made available at a later date.
Theoretical considerations
Walking approaches allow us to explore the affective connections that people have to spaces such as streets and neighbourhoods. Though less atmospheric and embodied than being on an outdoor walk, the walk through digitally mapped space prompts the interviewee to recall memories and feelings. The non-verbal elements of "vitality, performativity, corporeality, sensuality, and mobility" (Vannini, 2015, p. 318) are partly captured through the visual records. These interviews complement other biographical or life story techniques and are particularly useful for meeting people some distance away. In my case I seek to explore the attitudes and values of people who are now considered to be older. The main application for my project is to develop participatory walking tours (Evans & Jones, 2011). The stories that people share through these interviews are interpreted by performance artists, whose playful approach helps to communicate with the public (people of all ages).
This is an edited 2-minute film captured using the spatially-led digital walking interview technique developed through my project. The participant reveals her memories of Newport Library being opened on April 5th, 1968. This ESRC Fellowship project will explore the sensibilities that attach to post-war aesthetics and how those born in the late 1940s, 1950s and early 1960s are navigating the present.
Through a focus on the environment in which the UK's ageing population grew up, and spaces including semi-detached houses, cul-de-sacs, red-brick university campuses, primary schools, and shopping centres, the research will examine how these spaces still influence contemporary life and maintain an affective appeal. Spaces built between the late 1950s and early 1970s form a large bulk of the UK's built environment. But beyond architecture and planning, they also attract deeper affective, sub-emotional or unconscious connections (Pile, 2010). This study will generate insights on how these generations are adapting to and navigating social and cultural change. Included within this record are three short films edited down from a longer filmed recording of an interview. They were captured using the spatially-led digital walking interview technique, where conversations take place on an online video chat facility, such as Zoom. We use an online mapping system, such as Google Maps, to navigate a given place. See the record for the participant information sheet and topic guide.
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as 'Hot Leads'. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than making calls to everyone.
There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.
X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%. (A minimal modelling sketch follows the variable descriptions below.)
Variables Description
* Prospect ID - A unique ID with which the customer is identified.
* Lead Number - A lead number assigned to each lead procured.
* Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
* Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
* Do Not Email - An indicator variable selected by the customer indicating whether or not they want to be emailed about the course.
* Do Not Call - An indicator variable selected by the customer indicating whether or not they want to be called about the course.
* Converted - The target variable. Indicates whether a lead has been successfully converted or not.
* TotalVisits - The total number of visits made by the customer on the website.
* Total Time Spent on Website - The total time spent by the customer on the website.
* Page Views Per Visit - Average number of pages on the website viewed during the visits.
* Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
* Country - The country of the customer.
* Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.
* How did you hear about X Education - The source from which the customer heard about X Education.
* What is your current occupation - Indicates whether the customer is a student, unemployed or employed.
* What matters most to you in choosing this course - An option selected by the customer indicating their main motive for taking the course.
* Search - Indicating whether the customer had seen the ad in any of the listed items.
* Magazine
* Newspaper Article
* X Education Forums
* Newspaper
* Digital Advertisement
* Through Recommendations - Indicates whether the customer came in through recommendations.
* Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
* Tags - Tags assigned to customers indicating the current status of the lead.
* Lead Quality - Indicates the quality of the lead based on the data and on the intuition of the employee who has been assigned to the lead.
* Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
* Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
* Lead Profile - A lead level assigned to each customer based on their profile.
* City - The city of the customer.
* Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile
* Asymmetric Profile Index
* Asymmetric Activity Score
* Asymmetric Profile Score
* I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not.
* a free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
* Last Notable Activity - The last notable activity performed by the student.
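The sketch below is one minimal way to produce such a lead score, not the case study's prescribed solution: a logistic-regression pipeline over a handful of the variables listed above, assuming the data is available as a Leads.csv file with those column names (the file name is an assumption).

```python
# Minimal lead-scoring sketch (not the case study's prescribed solution).
# Assumes the data is in "Leads.csv" with column names matching the list above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("Leads.csv")  # hypothetical file name
numeric = ["TotalVisits", "Total Time Spent on Website", "Page Views Per Visit"]
categorical = ["Lead Origin", "Lead Source", "Last Activity"]

df = df.dropna(subset=numeric)                       # keep the sketch simple
df[categorical] = df[categorical].fillna("Unknown")

X, y = df[numeric + categorical], df["Converted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Pipeline([
    ("pre", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Lead score: predicted conversion probability scaled to 0-100.
scores = (model.predict_proba(X_test)[:, 1] * 100).round().astype(int)
print(pd.Series(scores, index=X_test.index, name="lead_score").sort_values(ascending=False).head())
```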
UpGrad Case Study
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google's reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or "Completely Automated Public Turing test to tell Computers and Humans Apart", is a simple task or challenge intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial-of-service attacks. reCAPTCHA challenges are increasingly hidden from the direct view of the user, instead assessing our mouse movements, browsing patterns, and other data to evaluate the likelihood that we are "authentic" users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works, that is, how we are called upon and situated as Users, we can make more informed decisions about how we engage with the contemporary internet.
This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web over one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page. If I had not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on the page. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access. Researchers can request access to these additional files by contacting Ben Pettis (bpettis@wisc.edu). A browsable and searchable version of the data is also available at https://capturingcaptcha.com
Methods
I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on this website to collate and present the screen captures that I save throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, along with sketching out the frequency of their use and the kinds of websites on which they appear. Given that I will only be collecting records of my own interactions with reCAPTCHAs, this will not be a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of reCAPTCHA will differ from those of any other person, this collection will nevertheless be useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger's descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping.
Therefore my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website. The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and cloud storage bucket are private and are restricted from direct access. The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. After this date, it is possible that data collection will be resumed, causing more information to be displayed in the online website. The data was exported from the database to a single JSON file (lines format) using the mongoexport command line tool:
mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
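For downstream analysis, the JSON-lines export produced by mongoexport can be loaded directly with pandas; this is a generic sketch and assumes only the captcha-out.json file name shown above.

```python
# Generic sketch: load the mongoexport JSON-lines dump for analysis.
# Only the "captcha-out.json" file name comes from the text above.
import pandas as pd

records = pd.read_json("captcha-out.json", lines=True)  # one MongoDB document per line
print(len(records), "reCAPTCHA records")
print(records.head())
```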
The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people by contacting over 3.5 million households across the country. The resulting data provides incredibly detailed demographic information across the US aggregated at various geographic levels, which helps determine how more than $675 billion in federal and state funding are distributed each year. Businesses use ACS data to inform strategic decision-making. ACS data can be used as a component of market research, provide information about concentrations of potential employees with a specific education or occupation, and indicate which communities could be good places to build offices or facilities. For example, someone scouting a new location for an assisted-living center might look for an area with a large proportion of seniors and a large proportion of people employed in nursing occupations. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. For more information, see the Census Bureau's ACS Information Guide. This public dataset is hosted in Google BigQuery as part of the Google Cloud Public Datasets Program, with Carto providing cleaning and onboarding support. It is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
As of April 2024, men between the ages of 25 and 34 years made up Facebook's largest audience, accounting for 18.4 percent of global users. Facebook's second largest audience base was men aged 18 to 24 years.
Facebook connects the world
Founded in 2004 and going public in 2012, Facebook is one of the biggest internet companies in the world, with influence that goes beyond social media. It is widely considered one of the Big Four tech companies, along with Google, Apple, and Amazon (together known under the acronym GAFA). Facebook is the most popular social network worldwide, and the company also owns three other billion-user properties: the mobile messaging apps WhatsApp and Facebook Messenger, as well as the photo-sharing app Instagram.
Facebook users
The vast majority of Facebook users connect to the social network via mobile devices. This is unsurprising, as Facebook has many users in mobile-first online markets. Currently, India ranks first in terms of Facebook audience size with 378 million users. The United States, Brazil, and Indonesia also all have more than 100 million Facebook users each.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Researchers from the Czech Republic are publishing a dataset for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, IP addresses and ports were omitted. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During the research, they divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. They also used several popular websites that primarily target a Czech audience. The identified traffic classes and their representatives are provided below, followed by a small illustrative classification sketch:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: Websites from the Alexa Top 1M list
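To illustrate how per-flow packet-length sequences like those exported by ipfixprobe can feed a classifier, the sketch below pads toy sequences to a fixed width and trains a random forest; the flows, labels, and padding length are invented, and the real dataset's file format is not assumed.

```python
# Toy illustration: fixed-width padding of per-flow packet-length sequences
# feeding a random forest. The flows, labels, and padding length are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

flows = [
    [1500, 1500, 1500, 1500, 64],   # download-like flow (toy)
    [120, 80, 150, 90],             # website browsing (toy)
    [1400, 1400, 1400, 1400, 1400],
    [100, 60, 130],
]
labels = ["D", "W", "D", "W"]

def pad(seq, length=30):
    """Truncate or zero-pad a packet-length sequence to a fixed width."""
    return (seq + [0] * length)[:length]

X = np.array([pad(f) for f in flows])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(clf.predict([pad([1450, 1450, 1450])]))
```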
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The expansion of Internet connectivity has revolutionized our daily lives, with people increasingly relying on smartphones and laptops for various tasks. This technological evolution has prompted the development of innovative solutions to enhance the quality of life for diverse populations, including the elderly and individuals with disabilities. Among the most impactful advancements are voice-command-enabled technologies such as SIRI and Google voice commands, which are built upon the foundation of Speech Recognition modules, a critical component in facilitating human-machine communication.
Automatic Speech Recognition (ASR) has witnessed significant progress in achieving human-like performance through data-driven methods. In the context of our research, we have meticulously crafted an Arabic voice command dataset to facilitate advancements in ASR and other speech processing tasks. This dataset comprises 10 distinct commands spoken by 10 unique speakers, each repeated 10 times. Despite its modest size, the dataset has demonstrated remarkable performance across a range of speech processing tasks.
The dataset was rigorously evaluated, yielding exceptional results. In ASR, it achieved an accuracy of 95.9%, showcasing its potential for effectively transcribing spoken Arabic commands. Furthermore, the dataset excelled in speaker identification, gender recognition, accent recognition, and spoken language understanding, with macro F1 scores of 99.67%, 100%, 100%, and 97.98%, respectively.
This Arabic Voice Command Dataset represents a valuable resource for researchers and developers in the field of speech processing and human-machine interaction. Its quality and diversity make it a robust foundation for developing and testing ASR and other related systems, ultimately contributing to the advancement of voice-command technologies and their widespread accessibility.
This dataset contains 2.2 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of HathiTrust. These collections have been processed using the GDELT Global Knowledge Graph and are available in Google BigQuery. More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names, fulltext geocoded to render them fully mappable, and more than 4,500 emotions and themes compiled. All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries. HathiTrust data includes all English language public domain books 1800-2015. They were provided as part of a special research extract and only public domain volumes are included. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Gum bleeding is a common dental problem, and numerous patients seek health-related information on this topic online. The YouTube website is a popular resource for people searching for medical information. To our knowledge, no recent study has evaluated content related to bleeding gums on YouTube™. Therefore, this study aimed to conduct a quantitative and qualitative analysis of YouTube videos related to bleeding gums. A search was performed on YouTube using the keyword "bleeding gums" from Google Trends. Of the first 200 results, 107 videos met the inclusion criteria. The descriptive statistics for the videos included the time since upload, the video length, and the number of likes, views, comments, subscribers, and viewing rates. The global quality score (GQS), usefulness score, and DISCERN were used to evaluate the video quality. Statistical analysis was performed using the Kruskal–Wallis test, Mann–Whitney test, and Spearman correlation analysis. The majority (n = 69, 64.48%) of the videos observed were uploaded by hospitals/clinics and dentists/specialists. The highest coverage was for symptoms (95.33%). Only 14.02% of the videos were classified as "good". The average video length of the videos rated as "good" was significantly longer than the other groups (p <0.05), and the average viewing rate of the videos rated as "poor" (63,943.68%) was substantially higher than the other groups (p <0.05). YouTube videos on bleeding gums were of moderate quality, but their content was incomplete and unreliable. Incorrect and inadequate content can significantly influence patients’ attitudes and medical decisions. Effort needs to be expended by dental professionals, organizations, and the YouTube platform to ensure that YouTube can serve as a reliable source of information on bleeding gums.
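For readers who want to reproduce this style of analysis, the named tests are available in scipy.stats; the values below are toy numbers, not the study's data.

```python
# The tests named in the abstract, applied to toy viewing-rate values;
# these numbers are illustrative, not the study's data.
from scipy import stats

good = [1200, 950, 1800, 1400]        # videos rated "good" (toy viewing rates)
moderate = [5300, 4100, 6100, 4800]
poor = [66000, 59000, 71000, 64000]

print(stats.kruskal(good, moderate, poor))              # do the three groups differ?
print(stats.mannwhitneyu(good, poor))                   # pairwise comparison
print(stats.spearmanr([3, 8, 12, 20], [2, 5, 9, 15]))   # e.g. video length vs. likes (toy)
```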
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset compiles georeferenced media - including videos (480), articles (20), and datasets (6) - specifically curated to facilitate the understanding of reef habitats across northern Australia. It was designed as a research tool for virtual fieldwork with a particular focus on identifying sources of information that allow an understanding of both inshore and offshore reef environments. This dataset provides a record of the literature and media that was reviewed as part of mapping the reef boundaries from remote sensing as part of project NESP MaC 3.17.
This dataset only focuses on media that is useful for understanding shallow reef habitats. It includes videos of snorkelling, diving, spearfishing, and aerial drone imagery. It includes websites, books and journal papers that talk about the structure of reefs and datasets that provide fine scale benthic mapping.
This dataset is likely not comprehensive. While considerable time was put into collecting relevant media, finding all available information sources is very difficult and time consuming.
A relatively comprehensive search was conducted on:
- AIMS Metadata catalogue for benthic habitat mapping with towed video and BRUVS
- A review of the eAtlas for benthic habitat mapping
- YouTube searches for video media of fishing, cruises, snorkelling of many named locations.
The dataset is far less comprehensive regarding existing literature from journals, reports and datasets.
As the NESP MaC 3.17 project progresses we will continue to expand the dataset.
Changelog:
Changes made to the dataset will be noted in the change log and indicated in the dataset via the 'Revision' date.
1st Ed. - 2024-04-10 - Initial release of the dataset
Methods:
Identifying media - YouTube videos
The initial discovery of videos for a given area was achieved by searching for place names in YouTube search using terms such as diving, snorkeling or spearfishing combined with the location name.
Each potential video was reviewed to:
1. Determine if the video had any visual content that would be useful for understanding the marine environment.
2. Determine if the footage could be georeferenced to a specific location, the more specific the better.
In cases where the YouTube channel was making travel videos that were of a high quality, then all the relevant videos in that channel were reviewed. A high proportion of the most useful videos were found using this technique.
The most useful videos were those that had named specific locations (typically in their title or description) and contained drone footage and underwater footage. The drone footage would often show enough of the landscape for features to be matched with satellite imagery allowing precise geolocation of the imagery.
To minimise the time required to find relevant videos, the scrubbing feature on YouTube was used to quickly review the timeline of each video for relevant scenes. The scrubbing feature shows a very quick, low-resolution preview of the video as the cursor is moved along the video timeline. This scrubbing was used to quickly look through the videos for any scenes that contained drone footage or underwater footage. This was particularly useful for travel videos that contained significant footage of overland travel mixed in with boating or shoreline activities. It was also useful for fishing videos, where all the fishing activities could be quickly skipped over to focus on any available drone footage or underwater footage from snorkeling or spearfishing.
Where a video lacked direct clues to its location (such as in the title), but contained particularly relevant and useful footage, additional effort was made to listen to the conversations and examine other footage in the video for clues. This includes people in the video mentioning the names of locations, any marine charts visible in the footage, or preceding and following scenes whose locations could be determined, adding constraints to the location of the relevant scene. Where the location could not be precisely determined, but the footage was still useful, it was added to a video playlist for the region.
In many remote locations there were so few videos that the bar for including the videos was quite low as these videos would at least provide some general indication of the landscape.
When working on a PC, Google Maps was used to look up locations and act as reference satellite imagery for locating places, QGIS was used to record the polygons of locations, and YouTube in a browser was used for video review.
YouTube Playlists:
The initial collection of videos was compiled into YouTube playlists corresponding to relatively large regions. Using playlists was the most convenient way to record useful videos when viewing YouTube from an iPad. This compilation was done prior to the setup of this dataset.
Localising Playlists:
For YouTube playlists, the region digitised was based on the region represented by the playlist name and the collection of videos. Google Maps was used to help determine the locations of each region. Where a particularly useful video was found in one of the playlists and its location could be determined accurately, that video was entered into this database as an individual video with its own finer-scale mapping. However, this process of migrating videos from the playlists to more precisely georeferenced individual videos in the dataset is incomplete.
The playlists are really a catch-all for potentially useful videos.
Localising individual videos:
Candidate videos were quickly assessed for likely usefulness by reviewing the title and quickly scrubbing through the video looking for any marine footage, in water or as drone footage. If a video had a useful section, the focus was to determine the location of that part of the footage as accurately as possible. This was done by searching for locations listed in the title, chapter markers, or video description, or mentioned in the video, which were then looked up in Google Maps. In general we would start with any drone footage that shows a large area with distinct features that could be matched with satellite imagery. The region around named locations was scanned for matching coastline and marine features. Once a match was found, the footage was reviewed to track the likely area that the video covers across multiple scenes.
The video region was then digitised approximately in QGIS into the AU_AIMS_NESP-3-17_Reef-map-geo-media.shp shapefile. Notes were then added about the important features seen in the footage, along with a link to the video including the time code so that it starts at the relevant portion of the video. Long videos showing multiple locations were added as multiple entries, each with a separate polygon location and a different URL link with a different start time.
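A minimal GeoPandas sketch for reading the digitised polygons is shown below; only the shapefile name comes from the text above, and the attribute names used for filtering come from the data dictionary later in this record.

```python
# Minimal GeoPandas sketch for reading the digitised polygons.
# The shapefile name is taken from the text above; the attribute names used
# for filtering come from the data dictionary later in this record.
import geopandas as gpd

gdf = gpd.read_file("AU_AIMS_NESP-3-17_Reef-map-geo-media.shp")
videos = gdf[gdf["MediaType"] == "Video"]
print(videos[["RegionName", "State"]].head())
```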
Articles and Datasets
While this dataset primarily focuses on videos, we started adding relevant datasets, websites, articles and reports. These categories of media are not complete in this version of the dataset.
Data dictionary:
RegionName: (String, 255 characters): Name of the location, Examples: 'Oyster Stacks Snorkelling Area', 'Kurrajong Campground', 'South Lefroy Bay'
State: (String, 30 characters): Abbreviation of the state that the region corresponds to. For example: 'WA', 'QLD', 'NT'. For locations far offshore link the location to the closest state or to an existing well known region name. For example: Herald Cay -> Coral Sea, Rowley shoals -> WA.
MediaType: (String, 20 characters): One of the following:
- Video
- Video Playlist
- Website
- Report
- EIS
- Book
- Journal Paper
HabitatRef: (Int): An indication that this resource shows high-accuracy spatial habitat information that can be used to improve the UQ habitat reference datasets. This attribute indicates which resources should be reviewed and converted to habitat reference patches. It should be reserved for cases where a habitat can be located on satellite imagery with sufficient precision that it has high confidence. Media that corresponds to information deeper than 15 m is excluded (assigned a HabitatRef of 0) as this is too deep to be used by the UQ habitat mapping.
- 1 - Use for habitat reference data.
- 0 - Only provides general information about the patch; the imagery cannot be spatially located accurately or the detail is insufficient.
Highlight: (String, 255 characters): This records the classification of reef mapping, or research question that this video is most useful for. Not all videos need this classification. In general this attribute should be reserved for those videos that have the highest level of useful information. Think of it as a shortlist of videos that someone trying to understand a particular aspect of categorising reefs from satellite imagery should review. The following are some of the questions associated with each category that the videos provide some answers.
- High tidal range fringing reef: Here we want to understand the structure of fringing reefs in the Kimberleys and Northern Territory where the tides are large and the water is turbid. Is there coral on the tops of the reef flats? Won't the coral dry out if it grows on the reef flat? How will it get enough light if it grows on the reef slope?
- Ancient coastline: Along many parts of WA there are shallow rocky reefs off the coast that appear to be ancient coastline. What is the nature of these reefs? Does coral or macroalgae grow on them?
- Seagrass: What does seagrass look like from satellite imagery
- Ningaloo backreef coral: Ningaloo is a very large reef system with a large sandy back. Should the whole back reef
San Francisco Ford GoBike, managed by Motivate, provides the Bay Area's bike share system. Bike share is a convenient, healthy, affordable, and fun form of transportation. It involves a fleet of specially designed bikes that are locked into a network of docking stations. Bikes can be unlocked from one station and returned to any other station in the system. People use bike share to commute to work or school, run errands, get to appointments, and more. The dataset contains trip data from 2013-2018, including start time, end time, start station, end station, and latitude/longitude for each station. See the detailed metadata for historical and real-time data. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.