https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
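As a quick illustration of how these fields fit together, here is a minimal sketch, assuming the dataset ships as a CSV whose column names match the field names above (the filename is hypothetical):

```python
# A minimal sketch: load the simulated dataset and summarize engagement per
# platform. Column names follow the field list above; the CSV filename is an
# assumption.
import pandas as pd

df = pd.read_csv("daily_social_media_active_users.csv")

# Average daily minutes per platform, most engaging first.
print(df.groupby("Platform")["Daily Time Spent (min)"].mean().sort_values(ascending=False))

# Share of verified vs. unverified accounts per platform.
print(df.groupby("Platform")["Verified Account"].value_counts(normalize=True))
```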
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
All-City event calendar - ARCHIVED. For the new LA City Events dataset (refreshed daily), see https://data.lacity.org/A-Prosperous-City/LA-City-Events/rx9t-fp7k
By Jeffrey Mvutu Mabilama [source]
This dataset provides a comprehensive look into 2020’s top trends worldwide, with information on the hottest topics and conversations happening all around the globe. With details such as trending type, country origin, dates of interest, URLs to find further information, keywords related to the trend and more - it's an invaluable insight into what's driving society today.
You can use this data in conjunction with other sources to get ideas for businesses or products tailored to popular desires or opinions. If you are interested in international business perspectives then this is also your go-to source; you can adjust how best to interact with people from certain countries upon learning what they hold important in terms of search engine activity.
It also gives key insights into how buzz forms: by monitoring trends across many countries over different periods of time, you can analyse whether events tend to last or whether their effect is short-lived, and how much impact they made in terms of the 'traffic' column (the number of searches for an individual topic over the duration of its trending period). In addition, marketing and advertising professionals can anticipate which content is likely to be best received by audiences, based on the images/snippets provided with each trend/topic and the URL links tracking users who have shown interest. This leaves them better prepared when rolling out campaigns targeted at specific regions or areas, taking cultural perspective into consideration rather than just raw numbers.
Last but not least, it serves as great starting material for getting acquainted with people from other cultures online (at least you will know which conversation starters won't be awkward!) before deepening your empathetic understanding of terms used largely within a single culture, such as TV programme titles. So, the question is: what will be the next big thing? See for yourself.
How to use this dataset for Insights on Popularity?
This Daily Global Trends 2020 dataset provides valuable information about trends around the world, including insights on their popularity. It can be used to identify popular topics and find ways to capitalize on them through marketing, business ideas and more. Below are some tips for how to use this data in order to gain insight into global trends and the level of popularity they have.
For Business Ideas: Use the URL information provided in order to research each individual trend, analyzing both when it gained traction as well as when its popularity faded away (if at all). This will give insight into transforming a brief trend into a long-lived one or making use of an existing but brief surge in interest – think new apps related to a trending topic! Combining the geographic region listed with these timeframes gives even more granular insight that could be used for product localization or regional target marketing.
To study Crowd Behaviour & Dynamics: Explore both country-wise and globally trending topics by looking at which countries exhibit similar interest levels for a given topic. Go further by understanding what drives people's interest in particular subjects in different countries; here web scraping techniques can be employed on the URLs provided, accompanied by basic text analysis techniques such as word clouds! This allows researchers and marketers to get better feedback from customers across multiple regions, enabling smarter decisions based on real behaviour rather than assumptions (see the sketch after this list).
For Building Better Products & Selling Techniques: Combine the Category (Business, Social, etc.), Country, and Related keywords fields with the traffic figures to obtain granular information about what excites people across cultures. For example, 'Food' is popular everywhere, but certain variations may not sell in a given geo-location without catering to local tastes. Further combining the date information also helps make predictions based on buyer behaviour over seasons; selling seedless watermelons during the winter season, say, would be futile.
For Social & Small Talk opportunities - Incorporating recently descr...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset: Responses to two open questionnaires. Questionnaire 1: "What do you know about climate change?"; Questionnaire 2: "What do you want to know about climate change?" Applied to middle school students, from age 10 (5th grade) to age 13 (8th grade), at the schools "Escola Básica e Secundária Dr. Pascoal José de Mello" and "Escola Nº 2 de Avelar" in the municipality of Ansião, in the central region of Portugal. The survey was applied to students both individually (n=106) and in groups (n=60), encompassing a total of 417 students. Due to logistical reasons, it was not possible to gather individual questionnaires for the 5th grade.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of What Cheer by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for What Cheer. The dataset can be utilized to understand the population distribution of What Cheer by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in What Cheer. Additionally, it can be used to see how the gender ratio changes from birth to the oldest age group, and how the male-to-female ratio varies across each age group, for What Cheer.
Key observations
Largest age group (population): Male # 5-9 years (56) | Female # 20-24 years (38). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
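As a minimal sketch of the analyses described above: the filename and column names ("age_group", "male_population", "female_population") are assumptions, since the full column list is not reproduced in this description.

```python
# A minimal sketch: largest age group per gender and the gender ratio by age.
# Filename and column names are assumptions.
import pandas as pd

df = pd.read_csv("what-cheer-population-by-gender.csv")
df["gender_ratio"] = df["male_population"] / df["female_population"]

# Largest age group for each gender.
print(df.loc[df["male_population"].idxmax(), "age_group"])
print(df.loc[df["female_population"].idxmax(), "age_group"])

# How the gender ratio changes from the youngest to the oldest age group.
print(df[["age_group", "gender_ratio"]])
```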
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender:
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for What Cheer Population by Gender. You can refer to it here.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MIT Restaurant Corpus - CRFs (Conditional Random Fields) Dataset
A Funny Dive into Restaurant Reviews 🥳🍽️
Welcome to the MIT Restaurant Corpus - CRF Dataset! If you are someone who loves food, restaurants, and all the jargon that comes with them, then you are in for a treat! (Pun intended! 😉) Let's break it down in the most delicious way!
This dataset, obtained from the MIT Restaurant Corpus (https://sls.csail.mit.edu/downloads/restaurant/), provides valuable restaurant review data for NER (Named Entity Recognition) tasks. With entities such as ratings, locations, and cuisines, it is perfect for building CRF models. 🏷️🍴 Let's dive into this rich resource and find out what it can do! 📊📍
The MIT Restaurant Corpus is designed to help you understand the intricacies of restaurant reviews and how data about restaurants can be parsed and classified. It has a set of files that are structured to give you all the ingredients required to build CRF (Conditional Random Field) models for NER (Named Entity Recognition). Here is what is served:
1. **‘sent_train’** 📝: This file contains a collection of sentences. But not just any sentences. These are sentences taken from real-world restaurant reviews! Each sentence is separated by a new line. It is like a dish of text, one sentence at a time.
2. **‘sent_test’** 🍽️: Just like the ‘sent_train’ file, this one contains sentences, but they’re for testing purposes. Think of it as the "taste test" phase of your restaurant review trip. The sentences here help you assess how well your model has learned the art of NER.
3. **‘label_train’** 🏷️: Now here’s where the magic happens. This file holds the NER labels or tags corresponding to each token in the ‘sent_train’ file. So, for every word in a sentence, there is a related label. It tells the model what each token is, whether it’s a restaurant name, location, or dish. It is like a guide to identifying the stars of the show!
4. **‘label_test’** 📋: This file is just like ‘label_train’, but for testing. It allows you to verify whether your model's predictions line up with the reality of the restaurant world. Will your model guess that "Burrito Palace" is the name of a restaurant? You will find out here!
So, in short, there is a beautiful one-to-one mapping between the ‘sent_train’/‘sent_test’ files and the ‘label_train’/‘label_test’ files. Each sentence is paired with its NER tags, making this an ideal recipe for training and testing your model.
The real star of this dataset is the NER tags. If you’re thinking, "Okay, but what exactly are we trying to identify in these restaurant reviews?", well, here is the menu of NER labels you are working with:
These NER tags help you make sense of all the data you encounter in a restaurant review. You will be able to easily pull out names, prices, ratings, dishes, and more. Talk about a full-course data feast!
Now, once you get your hands on this delicious dataset, what do you do with it? It's **CRF model** cooking time! 🍳
A CRF (conditional random field) is a great way to label sequences of data, such as sentences. Since NER is about tagging each token (word) in a sentence, CRF models are ideal. They use the context around each word to make predictions. So, when a sentence like "Wonderful sushi at Sushi Central!" passes in, the model can figure out that "Sushi Central" is a Restaurant_Name and "sushi" is a Dish.
Next, we dive into defining features for the CRF model. Features are like the secret ingredients that power your model. You will learn how to define them in Python, so your model can recognize patterns and make accurate predictions; a minimal sketch follows.
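Here is a minimal sketch of what such feature definitions can look like, assuming one whitespace-tokenized sentence per line in the sent_ files and one aligned tag sequence per line in the label_ files, and using the third-party sklearn-crfsuite package (not part of this dataset):

```python
# A minimal sketch of CRF feature extraction and training for this corpus.
# Assumes sent_train / label_train hold whitespace-separated tokens and tags,
# one sentence per line, aligned one-to-one as described above.
import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:  # context: previous word
        features["-1:word.lower()"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:  # context: next word
        features["+1:word.lower()"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

train_sents = [line.split() for line in open("sent_train", encoding="utf-8")]
train_labels = [line.split() for line in open("label_train", encoding="utf-8")]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([sent2features(s) for s in train_sents], train_labels)
```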
...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
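As an illustration of that anonymization step, hashing an identifier with SHA-256 in Python looks like the following; the exact input formatting and any salting used by the dataset authors are not specified, so this shows only the general technique:

```python
# General SHA-256 hashing technique; the dataset's exact input format and any
# salting are not specified in the description, so this is illustrative only.
import hashlib

reddit_id = "t3_abc123"  # hypothetical post ID
print(hashlib.sha256(reddit_id.encode("utf-8")).hexdigest())
```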
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |

comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |

postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
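As an example of the query mechanism, here is a minimal sketch that ranks subreddits by their number of valid Wikipedia links, joining the posts and postlinks tables documented above. It assumes the SQL database ships as a SQLite file; the filename is hypothetical:

```python
# A minimal sketch: count valid Wikipedia links per subreddit by joining the
# documented posts and postlinks tables. Filename is an assumption.
import sqlite3

conn = sqlite3.connect("wikireddit.db")  # hypothetical filename
query = """
    SELECT p.subreddit_id, COUNT(*) AS n_links
    FROM postlinks AS pl
    JOIN posts AS p ON p.post_id = pl.post_id
    WHERE pl.final_valid = 1
    GROUP BY p.subreddit_id
    ORDER BY n_links DESC
    LIMIT 10;
"""
for subreddit_id, n_links in conn.execute(query):
    print(subreddit_id, n_links)
conn.close()
```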
This webinar provides information on the federal government’s role in the process of building a successful child welfare information system and includes resources and guidance available to support states, territories, and tribes.
Metadata-only record linking to the original dataset. Open original dataset below.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development of Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
Overview
The data set contains the following features related to each login attempt on the SSO:
| Feature | Data Type | Description | Range or Example |
|---|---|---|---|
| IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
| Country | String | Country derived from the IP address | US |
| Region | String | Region derived from the IP address | New York |
| City | String | City derived from the IP address | Rochester |
| ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
| User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
| OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
| Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
| Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
| User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
| Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
| Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
| Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
| Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
| Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |
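To give a flavor of how such features can feed an RBA model, here is a minimal sketch of a simplified, single-feature likelihood-ratio risk signal in the spirit of the Freeman et al. (2016) model. It is not the paper's full model; the CSV filename is an assumption, and the column names follow the feature table above:

```python
# A simplified single-feature risk signal: how unusual is this feature value
# for this user, relative to how common it is globally? Freeman et al. (2016)
# combine several such likelihoods; this sketch shows the core idea only.
import pandas as pd

df = pd.read_csv("rba_login_dataset.csv")  # hypothetical filename

def risk_score(user_id, feature, value, smoothing=1.0):
    user_hist = df[df["User ID"] == user_id]
    # p(value | user): frequency of this value in the user's login history.
    p_user = ((user_hist[feature] == value).sum() + smoothing) / (len(user_hist) + smoothing)
    # p(value): frequency of this value across all logins.
    p_global = ((df[feature] == value).sum() + smoothing) / (len(df) + smoothing)
    # Large ratio = globally common but atypical for this user -> riskier.
    return p_global / p_user

print(risk_score(user_id=42, feature="Country", value="US"))
```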
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in different positions than in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.
The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022,
author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
journal = {{ACM} {Transactions} on {Privacy} and {Security}},
doi = {10.1145/3546069},
publisher = {ACM},
year = {2022}
}
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Gmicalzoma : it means what it says_ when you know what it means : an Enochian dictionary. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Scholarly communication : what everyone needs to know®. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains two supporting documents for the paper titled "We Do Not Understand What It Says -- Studying Student Perceptions of Software Modelling". The first is an Excel sheet containing interview transcripts from 13 of the participants of this study (those who agreed to publish their statements), and the second is an appendix file containing the interview guide (the questionnaire used for interviews with students and instructors) used in the case study.
The interview transcripts are supported by "in-vivo coding" used by both authors separately during analysis.
This is a common Zenodo repository for both the lastfm-360K and lastfm-1K datasets. See below for the details of both datasets, including license, acknowledgements, contact, and instructions to cite.
LASTFM-360K (version 1.2, March 2010).
Plays file format:
user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
User profile file format:
user-mboxsha1 \t gender (m|f|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Example (plays):
000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432 \t u2 \t 31 ...
Example (profile):
000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t m \t 19 \t Mexico \t Apr 28, 2008 ...
LASTFM-1K (version 1.0, March 2010).
Listening history file format:
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
User profile file format:
userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Example (listening history):
user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead ...
Example (profile):
user_000639 \t m \t Mexico \t Apr 27, 2005 ...
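As a minimal sketch of parsing the LASTFM-360K plays file described above (tab-separated: user-mboxsha1, musicbrainz-artist-id, artist-name, plays); the filename is an assumption, so point it at the plays file shipped in the archive:

```python
# A minimal sketch: aggregate total play counts per artist from the 360K
# plays file. Filename is an assumption.
from collections import Counter

total_plays = Counter()
with open("lastfm-360k-plays.tsv", encoding="utf-8") as f:
    for line in f:
        user_sha1, artist_mbid, artist_name, plays = line.rstrip("\n").split("\t")
        total_plays[artist_name] += int(plays)

print(total_plays.most_common(10))  # most-played artists overall
```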
LICENSE OF BOTH DATASETS. The data contained in both datasets is distributed with permission of Last.fm. The data is made available for non-commercial use. Those interested in using the data or web services in a commercial context should contact:
partners [at] last [dot] fm
For more information see Last.fm terms of service
ACKNOWLEDGEMENTS. Thanks to Last.fm for providing the access to this data via their web services. Special thanks to Norman Casagrande.
REFERENCES. When using this dataset you must reference the Last.fm webpage. Optionally (not mandatory at all!), you can cite Chapter 3 of this book:
@book{Celma:Springer2010,
author = {Celma, O.},
title = {{Music Recommendation and Discovery in the Long Tail}},
publisher = {Springer},
year = {2010}
}
CONTACT: This data was collected by Òscar Celma @ MTG/UPF
This dataset consists of post details from the Machine Learning subreddit (https://www.reddit.com/r/MachineLearning/). It consists of one file with 470 rows and 7 columns.
| Variable | Definition |
|---|---|
| id | Unique ID for each post |
| title | Title of the post |
| Score | Number of upvotes on that post |
| URL | URL of the post |
| num_comments | Number of comments on the post |
| body | Content of the post |
| created | Time of creation of the post, in UTC |
You can use NLP techniques to analyse the data and do exploratory data analysis. You can also try to predict the score of posts; a minimal baseline is sketched below.
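A minimal baseline, assuming the file is a CSV whose columns match the table above (the filename is an assumption): predict post score from the title with a bag-of-words linear model.

```python
# A minimal score-prediction baseline. Filename is an assumption; the column
# names "title" and "Score" follow the variable table above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("machinelearning_subreddit_posts.csv")
titles = df["title"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    titles, df["Score"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(min_df=2)
model = Ridge().fit(vec.fit_transform(X_train), y_train)
print("Held-out R^2:", model.score(vec.transform(X_test), y_test))
```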
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker; DAIMS stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is at the basis of a students' project on web site design; the code is available (no license).
g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.
g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 3 rows and is filtered where the book is What everyone in Britain should know about crime and punishment. It features 7 columns including author, publication date, language, and book publisher.
The PWSD is a dataset that can be used to answer questions about various public workforce system programs and how these programs fit in with the overall public workforce system and the economy. It was designed primarily to be used as a tool to understand what has been occurring in the Wagner-Peyser program and contains data from quarter 1 of 1995 through quarter 4 of 2008. Also, it was designed to understand the relationship and flow of participants as they go through the public workforce system. The PWSD can be used to analyze these programs both individually and in combination. The PWSD contains economic variables, Unemployment Insurance System data, and data on programs funded by the Workforce Investment Act and Employment Service. Economic variables included are labor force, employment, unemployment, unemployment rate, and gross domestic product data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is Know me, like me, follow me : what online social networking means for you and your business. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the What Cheer population distribution across 18 age groups. It lists the population in each age group along with the percentage of the total population for What Cheer. The dataset can be utilized to understand the population distribution of What Cheer by age. For example, using this dataset, we can identify the largest age group in What Cheer.
Key observations
The largest age group in What Cheer, IA was the 5 to 9 years age group, with a population of 92 (14.51%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in What Cheer, IA was the 25 to 29 years age group, with a population of 5 (0.79%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for What Cheer Population by Age. You can refer to it here.
Liking and pleasantness are common concepts in psychological emotion theories and in everyday language related to emotions. Despite obvious similarities between the terms, several empirical and theoretical notions support the idea that pleasantness and liking are cognitively different phenomena, becoming most evident in the context of emotion regulation and art enjoyment. This study investigated whether liking and pleasantness indicate behaviourally measurable differences, not only in the long timespan of emotion regulation, but already within the initial affective responses to visual and auditory stimuli. A cross-modal affective priming protocol was used to assess whether there is a behavioural difference in response time when providing an affective rating in a liking or pleasantness task. It was hypothesized that the pleasantness task would be faster, as it is known to rely on rapid feature detection. Furthermore, an affective priming effect was expected to take place across the sensory modalities and the presentative and non-presentative stimuli. A linear mixed effects analysis indicated a significant priming effect, as well as an interaction effect between the auditory and visual sensory modalities and the affective rating tasks of liking and pleasantness: while liking was rated fastest across modalities, it was significantly faster in vision compared to audition. No significant modality-dependent differences between the pleasantness ratings were detected. The results demonstrate that liking and pleasantness rating scales refer to separate processes already within the short time scale of one to two seconds. Furthermore, the affective priming effect indicates that an affective information transfer takes place across the modalities and the types of stimuli applied. Contrary to the hypothesis, liking was rated faster across the modalities. This is interpreted to support emotion-theoretical notions where liking and disliking are crucial properties of emotion perception and homeostatic self-referential information, possibly overriding pleasantness-related feature analysis. Conclusively, the findings provide empirical evidence for a conceptual delineation of common affective processes.