https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
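As a quick illustration of how these fields fit together, here is a minimal sketch, assuming the dataset ships as a CSV whose column names match the field names above (the filename is hypothetical):

```python
# A minimal sketch: load the simulated dataset and summarize engagement per
# platform. Column names follow the field list above; the CSV filename is an
# assumption.
import pandas as pd

df = pd.read_csv("daily_social_media_active_users.csv")

# Average daily minutes per platform, most engaging first.
print(df.groupby("Platform")["Daily Time Spent (min)"].mean().sort_values(ascending=False))

# Share of verified vs. unverified accounts per platform.
print(df.groupby("Platform")["Verified Account"].value_counts(normalize=True))
```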
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
All-City event calendar - ARCHIVED. For the new LA City Events dataset (refreshed daily), see https://data.lacity.org/A-Prosperous-City/LA-City-Events/rx9t-fp7k
By Jeffrey Mvutu Mabilama [source]
This dataset provides a comprehensive look into 2020’s top trends worldwide, with information on the hottest topics and conversations happening all around the globe. With details such as trending type, country origin, dates of interest, URLs to find further information, keywords related to the trend and more - it's an invaluable insight into what's driving society today.
You can use this data in conjunction with other sources to get ideas for businesses or products tailored to popular desires or opinions. If you are interested in international business perspectives then this is also your go-to source; you can adjust how best to interact with people from certain countries upon learning what they hold important in terms of search engine activity.
It also gives key insights into how buzz forms: by monitoring trends across many countries over different periods of time, you can analyse whether events tend to last or whether their effect is short-lived, and how much impact they made in terms of the 'traffic' column (the number of searches for an individual topic over the duration of its trending period). In addition, marketing and advertising professionals can anticipate which content is likely to be best received by audiences, based on the images/snippets provided with each trend/topic and the URL links tracking users who have shown interest. This leaves them better prepared when rolling out campaigns targeted at specific regions or areas, taking cultural perspective into consideration rather than just raw numbers.
Last but not least, it serves as great starting material for getting acquainted with people from other cultures online (at least you will know which conversation starters won't be awkward!) before deepening your empathetic understanding of terms used largely within a single culture, such as TV programme titles. So, the question is: what will be the next big thing? See for yourself.
How to use this dataset for Insights on Popularity?
This Daily Global Trends 2020 dataset provides valuable information about trends around the world, including insights on their popularity. It can be used to identify popular topics and find ways to capitalize on them through marketing, business ideas and more. Below are some tips for how to use this data in order to gain insight into global trends and the level of popularity they have.
For Business Ideas: Use the URL information provided in order to research each individual trend, analyzing both when it gained traction as well as when its popularity faded away (if at all). This will give insight into transforming a brief trend into a long-lived one or making use of an existing but brief surge in interest – think new apps related to a trending topic! Combining the geographic region listed with these timeframes gives even more granular insight that could be used for product localization or regional target marketing.
To study Crowd Behaviour & Dynamics: Explore both country-wise and globally trending topics by looking at which countries exhibit similar interest levels for a given topic. Go further by understanding what drives people's interest in particular subjects in different countries; here web scraping techniques can be employed on the URLs provided, accompanied by basic text analysis techniques such as word clouds! This allows researchers and marketers to get better feedback from customers across multiple regions, enabling smarter decisions based on real behaviour rather than assumptions (see the sketch after this list).
For Building Better Products & Selling Techniques: Combine the Category (Business, Social, etc.), Country, and Related keywords fields with the traffic figures to obtain granular information about what excites people across cultures. For example, 'Food' is popular everywhere, but certain variations may not sell in a given geo-location without catering to local tastes. Further combining the date information also helps make predictions based on buyer behaviour over seasons; selling seedless watermelons during the winter season, say, would be futile.
For Social & Small Talk opportunities - Incorporating recently descr...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset: Responses to two open questionnaires. Questionnaire 1: "What do you know about climate change?"; Questionnaire 2: "What do you want to know about climate change?" Applied to middle school students, from age 10 (5th grade) to age 13 (8th grade), at the schools "Escola Básica e Secundária Dr. Pascoal José de Mello" and "Escola Nº 2 de Avelar" in the municipality of Ansião, in the central region of Portugal. The survey was applied to students both individually (n=106) and in groups (n=60), encompassing a total of 417 students. Due to logistical reasons, it was not possible to gather individual questionnaires for the 5th grade.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of What Cheer by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for What Cheer. The dataset can be utilized to understand the population distribution of What Cheer by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in What Cheer. Additionally, it can be used to see how the gender ratio changes from birth to the oldest age group, and how the male-to-female ratio varies across each age group, for What Cheer.
Key observations
Largest age group (population): Male # 5-9 years (56) | Female # 20-24 years (38). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
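As a minimal sketch of the analyses described above: the filename and column names ("age_group", "male_population", "female_population") are assumptions, since the full column list is not reproduced in this description.

```python
# A minimal sketch: largest age group per gender and the gender ratio by age.
# Filename and column names are assumptions.
import pandas as pd

df = pd.read_csv("what-cheer-population-by-gender.csv")
df["gender_ratio"] = df["male_population"] / df["female_population"]

# Largest age group for each gender.
print(df.loc[df["male_population"].idxmax(), "age_group"])
print(df.loc[df["female_population"].idxmax(), "age_group"])

# How the gender ratio changes from the youngest to the oldest age group.
print(df[["age_group", "gender_ratio"]])
```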
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender:
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for What Cheer Population by Gender. You can refer to it here.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MIT Restaurant Corpus - CRFs (Conditional Random Fields) Dataset
A Funny Dive into Restaurant Reviews 🥳🍽️
Welcome to the MIT Restaurant Corpus - CRF Dataset! If you are someone who loves food, restaurants, and all the jargon that comes with them, then you are in for a treat! (Pun intended! 😉) Let's break it down in the most delicious way!
This dataset, obtained from the MIT Restaurant Corpus (https://sls.csail.mit.edu/downloads/restaurant/), provides valuable restaurant review data for NER (Named Entity Recognition) tasks. With entities such as ratings, locations, and cuisines, it is perfect for building CRF models. 🏷️🍴 Let's dive into this rich resource and find out what it can do! 📊📍
The MIT Restaurant Corpus is designed to help you understand the intricacies of restaurant reviews and how data about restaurants can be parsed and classified. It has a set of files that are structured to give you all the ingredients required to build CRF (Conditional Random Field) models for NER (Named Entity Recognition). Here is what is served:
1. **‘sent_train’** 📝: This file contains a collection of sentences. But not just any sentences. These are sentences taken from real-world restaurant reviews! Each sentence is separated by a new line. It is like a dish of text, one sentence at a time.
2. **‘sent_test’** 🍽️: Just like the ‘sent_train’ file, this one contains sentences, but they’re for testing purposes. Think of it as the "taste test" phase of your restaurant review trip. The sentences here help you assess how well your model has learned the art of NER.
3. **‘label_train’** 🏷️: Now here’s where the magic happens. This file holds the NER labels or tags corresponding to each token in the ‘sent_train’ file. So, for every word in a sentence, there is a related label. It tells the model what each token is, whether it’s a restaurant name, location, or dish. It is like a guide to identifying the stars of the show!
4. **‘label_test’** 📋: This file is just like ‘label_train’, but for testing. It allows you to verify whether your model's predictions line up with the reality of the restaurant world. Will your model guess that "Burrito Palace" is the name of a restaurant? You will find out here!
So, in short, there is a beautiful one-to-one mapping between the ‘sent_train’/‘sent_test’ files and the ‘label_train’/‘label_test’ files. Each sentence is paired with its NER tags, making this an ideal recipe for training and testing your model.
The real star of this dataset is the NER tags. If you’re thinking, "Okay, but what exactly are we trying to identify in these restaurant reviews?", well, here is the menu of NER labels you are working with:
These NER tags help you make sense of all the data you encounter in a restaurant review. You will be able to easily pull out names, prices, ratings, dishes, and more. Talk about a full-course data feast!
Now, once you get your hands on this delicious dataset, what do you do with it? It's **CRF model** cooking time! 🍳
A CRF (conditional random field) is a great way to label sequences of data, such as sentences. Since NER is about tagging each token (word) in a sentence, CRF models are ideal. They use the context around each word to make predictions. So, when a sentence like "Wonderful sushi at Sushi Central!" passes in, the model can figure out that "Sushi Central" is a Restaurant_Name and "sushi" is a Dish.
Next, we dive into defining features for the CRF model. Features are like the secret ingredients that power your model. You will learn how to define them in Python, so your model can recognize patterns and make accurate predictions; a minimal sketch follows.
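Here is a minimal sketch of what such feature definitions can look like, assuming one whitespace-tokenized sentence per line in the sent_ files and one aligned tag sequence per line in the label_ files, and using the third-party sklearn-crfsuite package (not part of this dataset):

```python
# A minimal sketch of CRF feature extraction and training for this corpus.
# Assumes sent_train / label_train hold whitespace-separated tokens and tags,
# one sentence per line, aligned one-to-one as described above.
import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:  # context: previous word
        features["-1:word.lower()"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:  # context: next word
        features["+1:word.lower()"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

train_sents = [line.split() for line in open("sent_train", encoding="utf-8")]
train_labels = [line.split() for line in open("label_train", encoding="utf-8")]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([sent2features(s) for s in train_sents], train_labels)
```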
...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
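As an illustration of that anonymization step, hashing an identifier with SHA-256 in Python looks like the following; the exact input formatting and any salting used by the dataset authors are not specified, so this shows only the general technique:

```python
# General SHA-256 hashing technique; the dataset's exact input format and any
# salting are not specified in the description, so this is illustrative only.
import hashlib

reddit_id = "t3_abc123"  # hypothetical post ID
print(hashlib.sha256(reddit_id.encode("utf-8")).hexdigest())
```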
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |

comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |

postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
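As an example of the query mechanism, here is a minimal sketch that ranks subreddits by their number of valid Wikipedia links, joining the posts and postlinks tables documented above. It assumes the SQL database ships as a SQLite file; the filename is hypothetical:

```python
# A minimal sketch: count valid Wikipedia links per subreddit by joining the
# documented posts and postlinks tables. Filename is an assumption.
import sqlite3

conn = sqlite3.connect("wikireddit.db")  # hypothetical filename
query = """
    SELECT p.subreddit_id, COUNT(*) AS n_links
    FROM postlinks AS pl
    JOIN posts AS p ON p.post_id = pl.post_id
    WHERE pl.final_valid = 1
    GROUP BY p.subreddit_id
    ORDER BY n_links DESC
    LIMIT 10;
"""
for subreddit_id, n_links in conn.execute(query):
    print(subreddit_id, n_links)
conn.close()
```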
This webinar provides information on the federal government’s role in the process of building a successful child welfare information system and includes resources and guidance available to support states, territories, and tribes.
Metadata-only record linking to the original dataset. Open original dataset below.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development of Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
Overview
The data set contains the following features related to each login attempt on the SSO:
| Feature | Data Type | Description | Range or Example |
|---|---|---|---|
| IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
| Country | String | Country derived from the IP address | US |
| Region | String | Region derived from the IP address | New York |
| City | String | City derived from the IP address | Rochester |
| ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
| User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
| OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
| Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
| Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
| User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
| Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
| Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
| Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
| Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
| Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |
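To give a flavor of how such features can feed an RBA model, here is a minimal sketch of a simplified, single-feature likelihood-ratio risk signal in the spirit of the Freeman et al. (2016) model. It is not the paper's full model; the CSV filename is an assumption, and the column names follow the feature table above:

```python
# A simplified single-feature risk signal: how unusual is this feature value
# for this user, relative to how common it is globally? Freeman et al. (2016)
# combine several such likelihoods; this sketch shows the core idea only.
import pandas as pd

df = pd.read_csv("rba_login_dataset.csv")  # hypothetical filename

def risk_score(user_id, feature, value, smoothing=1.0):
    user_hist = df[df["User ID"] == user_id]
    # p(value | user): frequency of this value in the user's login history.
    p_user = ((user_hist[feature] == value).sum() + smoothing) / (len(user_hist) + smoothing)
    # p(value): frequency of this value across all logins.
    p_global = ((df[feature] == value).sum() + smoothing) / (len(df) + smoothing)
    # Large ratio = globally common but atypical for this user -> riskier.
    return p_global / p_user

print(risk_score(user_id=42, feature="Country", value="US"))
```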
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in different positions than in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.
The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022,
author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
journal = {{ACM} {Transactions} on {Privacy} and {Security}},
doi = {10.1145/3546069},
publisher = {ACM},
year = {2022}
}
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Gmicalzoma : it means what it says_ when you know what it means : an Enochian dictionary. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Scholarly communication : what everyone needs to know®. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains two supporting documents for the paper titled "We Do Not Understand What It Says -- Studying Student Perceptions of Software Modelling". The first is an Excel sheet containing interview transcripts from 13 of the participants of this study (those who agreed to publish their statements), and the second is an appendix file containing the interview guide (the questionnaire used for interviews with students and instructors) used in the case study.
The interview transcripts are supported by "in-vivo coding" used by both authors separately during analysis.
This is a common Zenodo repository for both the lastfm-360K and lastfm-1K datasets. See below for the details of both datasets, including license, acknowledgements, contact, and instructions to cite.
LASTFM-360K (version 1.2, March 2010).
Plays file format:
user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
User profile file format:
user-mboxsha1 \t gender (m|f|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Example (plays):
000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432 \t u2 \t 31 ...
Example (profile):
000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t m \t 19 \t Mexico \t Apr 28, 2008 ...
LASTFM-1K (version 1.0, March 2010).
Listening history file format:
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
User profile file format:
userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Example (listening history):
user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead ...
Example (profile):
user_000639 \t m \t Mexico \t Apr 27, 2005 ...
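As a minimal sketch of parsing the LASTFM-360K plays file described above (tab-separated: user-mboxsha1, musicbrainz-artist-id, artist-name, plays); the filename is an assumption, so point it at the plays file shipped in the archive:

```python
# A minimal sketch: aggregate total play counts per artist from the 360K
# plays file. Filename is an assumption.
from collections import Counter

total_plays = Counter()
with open("lastfm-360k-plays.tsv", encoding="utf-8") as f:
    for line in f:
        user_sha1, artist_mbid, artist_name, plays = line.rstrip("\n").split("\t")
        total_plays[artist_name] += int(plays)

print(total_plays.most_common(10))  # most-played artists overall
```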
LICENSE OF BOTH DATASETS. The data contained in both datasets is distributed with permission of Last.fm. The data is made available for non-commercial use. Those interested in using the data or web services in a commercial context should contact:
partners [at] last [dot] fm
For more information see Last.fm terms of service
ACKNOWLEDGEMENTS. Thanks to Last.fm for providing the access to this data via their web services. Special thanks to Norman Casagrande.
REFERENCES. When using this dataset you must reference the Last.fm webpage. Optionally (not mandatory at all!), you can cite Chapter 3 of this book:
@book{Celma:Springer2010,
author = {Celma, O.},
title = {{Music Recommendation and Discovery in the Long Tail}},
publisher = {Springer},
year = {2010}
}
CONTACT: This data was collected by Òscar Celma @ MTG/UPF
This dataset consists of post details from the Machine Learning subreddit (https://www.reddit.com/r/MachineLearning/). It consists of one file with 470 rows and 7 columns.
| Variable | Definition |
|---|---|
| id | Unique ID for each post |
| title | Title of the post |
| Score | Number of upvotes on that post |
| URL | URL of the post |
| num_comments | Number of comments on the post |
| body | Content of the post |
| created | Time of creation of the post, in UTC |
You can use NLP techniques to analyse the data and do exploratory data analysis. You can also try to predict the score of posts; a minimal baseline is sketched below.
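A minimal baseline, assuming the file is a CSV whose columns match the table above (the filename is an assumption): predict post score from the title with a bag-of-words linear model.

```python
# A minimal score-prediction baseline. Filename is an assumption; the column
# names "title" and "Score" follow the variable table above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("machinelearning_subreddit_posts.csv")
titles = df["title"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    titles, df["Score"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(min_df=2)
model = Ridge().fit(vec.fit_transform(X_train), y_train)
print("Held-out R^2:", model.score(vec.transform(X_test), y_test))
```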
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker; DAIMS stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is at the basis of a students' project on web site design; the code is available (no license).
g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.
g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 3 rows and is filtered where the book is What everyone in Britain should know about crime and punishment. It features 7 columns including author, publication date, language, and book publisher.
The PWSD is a dataset that can be used to answer questions about various public workforce system programs and how these programs fit in with the overall public workforce system and the economy. It was designed primarily to be used as a tool to understand what has been occurring in the Wagner-Peyser program and contains data from quarter 1 of 1995 through quarter 4 of 2008. Also, it was designed to understand the relationship and flow of participants as they go through the public workforce system. The PWSD can be used to analyze these programs both individually and in combination. The PWSD contains economic variables, Unemployment Insurance System data, and data on programs funded by the Workforce Investment Act and Employment Service. Economic variables included are labor force, employment, unemployment, unemployment rate, and gross domestic product data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is Know me, like me, follow me : what online social networking means for you and your business. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the What Cheer population distribution across 18 age groups. It lists the population in each age group along with the percentage of the total population for What Cheer. The dataset can be utilized to understand the population distribution of What Cheer by age. For example, using this dataset, we can identify the largest age group in What Cheer.
Key observations
The largest age group in What Cheer, IA was the 5 to 9 years age group, with a population of 92 (14.51%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in What Cheer, IA was the 25 to 29 years age group, with a population of 5 (0.79%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for What Cheer Population by Age. You can refer to it here.
Liking and pleasantness are common concepts in psychological emotion theories and in everyday language related to emotions. Despite obvious similarities between the terms, several empirical and theoretical notions support the idea that pleasantness and liking are cognitively different phenomena, becoming most evident in the context of emotion regulation and art enjoyment. This study investigated whether liking and pleasantness indicate behaviourally measurable differences, not only in the long timespan of emotion regulation, but already within the initial affective responses to visual and auditory stimuli. A cross-modal affective priming protocol was used to assess whether there is a behavioural difference in response time when providing an affective rating in a liking or pleasantness task. It was hypothesized that the pleasantness task would be faster, as it is known to rely on rapid feature detection. Furthermore, an affective priming effect was expected to take place across the sensory modalities and the presentative and non-presentative stimuli. A linear mixed effects analysis indicated a significant priming effect, as well as an interaction effect between the auditory and visual sensory modalities and the affective rating tasks of liking and pleasantness: while liking was rated fastest across modalities, it was significantly faster in vision compared to audition. No significant modality-dependent differences between the pleasantness ratings were detected. The results demonstrate that liking and pleasantness rating scales refer to separate processes already within the short time scale of one to two seconds. Furthermore, the affective priming effect indicates that an affective information transfer takes place across the modalities and the types of stimuli applied. Contrary to the hypothesis, liking was rated faster across the modalities. This is interpreted to support emotion-theoretical notions where liking and disliking are crucial properties of emotion perception and homeostatic self-referential information, possibly overriding pleasantness-related feature analysis. Conclusively, the findings provide empirical evidence for a conceptual delineation of common affective processes.