100+ datasets found
  1. Daily Social Media Active Users

    • kaggle.com
    zip
    Updated May 5, 2025
    Cite
    Shaik Barood Mohammed Umar Adnaan Faiz (2025). Daily Social Media Active Users [Dataset]. https://www.kaggle.com/datasets/umeradnaan/daily-social-media-active-users
    Explore at:
zip (126814 bytes). Available download formats
    Dataset updated
    May 5, 2025
    Authors
    Shaik Barood Mohammed Umar Adnaan Faiz
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.

    Dataset Breakdown:

    • Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.

    • Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.

    • Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.

    • Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.

    • Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.

    • Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.

    • Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
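As a quick illustration of how these fields might be analyzed, here is a minimal pure-Python sketch that aggregates average daily time spent per platform. The column names and sample rows below are assumptions based on the field descriptions above, not the dataset's exact headers or values.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mimicking the described fields (names and values are assumed).
sample = """Platform,Owner,Country,Daily Time Spent (min),Verified Account
Facebook,Meta,India,45,False
Facebook,Meta,USA,60,True
TikTok,ByteDance,Brazil,90,False
"""

totals, counts = defaultdict(float), defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    totals[row["Platform"]] += float(row["Daily Time Spent (min)"])
    counts[row["Platform"]] += 1

# Mean daily minutes per platform, a basic engagement metric.
avg_time = {platform: totals[platform] / counts[platform] for platform in totals}
print(avg_time)  # {'Facebook': 52.5, 'TikTok': 90.0}
```

The same grouping pattern extends naturally to the other fields, for example averaging time spent by Country or by Verified Account status.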

    Context and Use Cases:

    • This synthetic dataset is designed to offer a privacy-friendly alternative for analytics, research, and machine learning purposes. Given the complexities and privacy concerns around using real user data, especially in the context of social media, this dataset offers a clean and secure way to develop, test, and fine-tune applications, models, and algorithms without the risks of handling sensitive or personal information.

    Researchers, data scientists, and developers can use this dataset to:

    • Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.

    • Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.

    • Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.

    • Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.

    • Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.

    • Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.

    The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.

    Future Considerations:

    As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.

    By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...

2. What's Happening LA Calendar Dataset - ARCHIVED

    • catalog.data.gov
    • data.lacity.org
    • +2more
    Updated Jun 21, 2025
    + more versions
    Cite
    data.lacity.org (2025). What's Happening LA Calendar Dataset - ARCHIVED [Dataset]. https://catalog.data.gov/dataset/whats-happening-la-calendar-dataset-archived
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    data.lacity.org
    Area covered
    Los Angeles
    Description

    All-City event calendar - ARCHIVED For the new LA City Events dataset (refreshed daily), see https://data.lacity.org/A-Prosperous-City/LA-City-Events/rx9t-fp7k

  3. Daily Global Trends - Insights on Popularity

    • kaggle.com
    zip
    Updated Jan 16, 2023
    Cite
    The Devastator (2023). Daily Global Trends - Insights on Popularity [Dataset]. https://www.kaggle.com/datasets/thedevastator/daily-global-trends-2020-insights-on-popularity
    Explore at:
zip (28034217 bytes). Available download formats
    Dataset updated
    Jan 16, 2023
    Authors
    The Devastator
    Description

    Daily Global Trends - Insights on Popularity

    Analyzing Crowd Behaviour and Buzz Worldwide

    By Jeffrey Mvutu Mabilama [source]

    About this dataset

    This dataset provides a comprehensive look into 2020’s top trends worldwide, with information on the hottest topics and conversations happening all around the globe. With details such as trending type, country origin, dates of interest, URLs to find further information, keywords related to the trend and more - it's an invaluable insight into what's driving society today.

    You can use this data in conjunction with other sources to get ideas for businesses or products tailored to popular desires or opinions. If you are interested in international business perspectives then this is also your go-to source; you can adjust how best to interact with people from certain countries upon learning what they hold important in terms of search engine activity.

It also gives key insights into buzz formation. By monitoring trends across many countries over different periods of time, you can analyse whether events tend to last or whether their effect is short-lived, and how much impact each made in terms of the 'traffic' column (the number of searches for an individual topic) over the period it held a higher position. In addition, marketing and advertising professionals can anticipate what content is likely to be best received by audiences, based on previous trends and the related images/snippets and URL links provided with each topic. This leaves them better prepared when rolling out campaigns targeted at specific regions, taking cultural perspective into consideration rather than just raw numbers.

Last but not least, it serves as great starting material when getting acquainted with people from other countries online (at least we know which conversation starters won't be awkward!), before deepening our understanding of terms used largely within particular cultures, such as TV programme titles. So, the question is: what will be the next big thing? See for yourself.


    How to use the dataset

    How to use this dataset for Insights on Popularity?

    This Daily Global Trends 2020 dataset provides valuable information about trends around the world, including insights on their popularity. It can be used to identify popular topics and find ways to capitalize on them through marketing, business ideas and more. Below are some tips for how to use this data in order to gain insight into global trends and the level of popularity they have.

    • For Business Ideas: Use the URL information provided in order to research each individual trend, analyzing both when it gained traction as well as when its popularity faded away (if at all). This will give insight into transforming a brief trend into a long-lived one or making use of an existing but brief surge in interest – think new apps related to a trending topic! Combining the geographic region listed with these timeframes gives even more granular insight that could be used for product localization or regional target marketing.

• To study Crowd Behaviour & Dynamics: Explore both country-wise and globally trending topics by looking at which countries exhibit similar interest levels for those topics. Go further by understanding what drives people's interest in particular subjects in different countries; here, web scraping techniques can be employed on the URLs provided, accompanied by basic text analysis techniques such as word clouds. This allows researchers and marketers to get better feedback from customers across multiple regions, enabling smarter decisions based on real behaviour rather than assumptions.

• For Building Better Products & Selling Techniques: Combine the Category (Business, Social, etc.), Country, and Related keywords fields with the traffic figures to obtain granular information about what excites people across cultures. For example, 'Food' is popular everywhere, but certain variations may not sell in a given location unless they cater to local tastes. Combining the date information also helps make predictions based on buyer behaviour across seasons; trying to sell seedless watermelons during winter would be futile.

    • For Social & Small Talk opportunities - Incorporating recently descr...
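The combine-and-rank analyses described in these tips boil down to grouping trend rows by a field and summing traffic. Here is a minimal pure-Python sketch; the column names and rows are invented stand-ins, not the dataset's actual schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows with assumed column names (country, topic, traffic).
sample = """country,topic,traffic
US,world cup,500000
BR,world cup,900000
US,elections,700000
"""

traffic_by_topic = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    traffic_by_topic[row["topic"]] += int(row["traffic"])

# Rank topics by total worldwide traffic, highest first.
top = sorted(traffic_by_topic.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('world cup', 1400000), ('elections', 700000)]
```

Swapping the grouping key from topic to country (or to a date bucket) yields the regional and seasonal views mentioned above.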

4. Dataset "What children know and want to know about climate change: a...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 24, 2024
    Cite
    Teixeira, Zara; Morgado, Rita; Marques, Cátia; Gonçalves, Carlos; Carvalho, Paula; Cunha, Ana; Moreira, Cláudia (2024). Dataset "What children know and want to know about climate change: a prior-knowledge self-assessment" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8003334
    Explore at:
    Dataset updated
    Jan 24, 2024
    Dataset provided by
    Marine and Environmental Sciences Centre (MARE)
    Authors
    Teixeira, Zara; Morgado, Rita; Marques, Cátia; Gonçalves, Carlos; Carvalho, Paula; Cunha, Ana; Moreira, Cláudia
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Dataset: Responses to two open questionnaires. Questionnaire 1: "What do you know about climate change?"; Questionnaire 2: "What do you want to know about climate change?" Applied to middle school students, from the age of 10 (5th grade) to the age of 13 (8th grade), at the schools "Escola Básica e Secundária Dr. Pascoal José de Mello" and "Escola Nº 2 de Avelar" in the municipality of Ansião, in the central region of Portugal. The survey was applied to students both individually (n = 106) and in groups (n = 60), encompassing a total of 417 students. Due to logistical reasons, it was not possible to gather individual questionnaires for the 5th grade.

5. What Cheer, IA Population Breakdown by Gender and Age Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    + more versions
    Cite
    Neilsberg Research (2025). What Cheer, IA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e20a1def-f25d-11ef-8c1b-3860777c1fe6/
    Explore at:
json, csv. Available download formats
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    What Cheer, Iowa
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

The dataset tabulates the population of What Cheer by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for What Cheer. The dataset can be utilized to understand the population distribution of What Cheer by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in What Cheer. Additionally, it can be used to see how the gender ratio changes from birth to the oldest age group, and to compare the male-to-female ratio across each age group for What Cheer.

    Key observations

    Largest age group (population): Male # 5-9 years (56) | Female # 20-24 years (38). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

Scope of gender:

Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data on biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

• Age Group: This column displays the age group for the What Cheer population analysis. There are 18 expected values, defined above in the age groups section.
• Population (Male): The male population of What Cheer for the age group.
• Population (Female): The female population of What Cheer for the age group.
• Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in What Cheer for each age group.
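The gender ratio column follows the standard definition of males per 100 females. A minimal sketch of that calculation, using hypothetical counts rather than values from the dataset:

```python
# Hypothetical male/female counts per age group (not actual dataset values).
age_groups = {
    "5 to 9 years": {"male": 56, "female": 20},
    "20 to 24 years": {"male": 10, "female": 38},
}

# Gender ratio = males per 100 females, rounded to one decimal place.
gender_ratio = {
    group: round(counts["male"] / counts["female"] * 100, 1)
    for group, counts in age_groups.items()
    if counts["female"] > 0  # the ratio is undefined when no females are counted
}
print(gender_ratio)  # {'5 to 9 years': 280.0, '20 to 24 years': 26.3}
```

A ratio above 100 means more males than females in that age group; below 100, the reverse.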

    Good to know

    Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

This dataset is part of the main dataset for What Cheer Population by Gender. You can refer to the same here.

  6. MIT Restaurant Corpus 🍔 CRF Dataset

    • kaggle.com
    zip
    Updated Feb 23, 2025
    Cite
    Sagar Maru (2025). MIT Restaurant Corpus 🍔 CRF Dataset [Dataset]. https://www.kaggle.com/datasets/marusagar/mit-restaurant-corpus-crf-dataset
    Explore at:
zip (154704 bytes). Available download formats
    Dataset updated
    Feb 23, 2025
    Authors
    Sagar Maru
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MIT Restaurant Corpus - CRFs (Conditional Random Fields) Dataset

    A Funny Dive into Restaurant Reviews 🥳🍽️

Welcome to the MIT Restaurant Corpus - CRF Dataset! If you are someone who loves food, restaurants, and all the jargon that comes with them, then you are in for a treat! (Pun intended! 😉) Let's break it down in the most delicious way!

This dataset, obtained from the MIT Restaurant Corpus (https://sls.csail.mit.edu/downloads/restaurant/), provides valuable restaurant review data for NER (Named Entity Recognition) tasks. With entities such as ratings, locations, and cuisines, it is perfect for building CRF models. 🏷️🍴 Let's dive into this rich resource and find out what it can do! 📊📍

    🍔 What's Inside This Feast?

The MIT Restaurant Corpus is designed to help you understand the intricacies of restaurant reviews and how data about restaurants can be parsed and classified. It has a set of files structured to give you all the ingredients required to build CRF (Conditional Random Field) models for NER (Named Entity Recognition). Here is what's served:

1. **‘sent_train’** 📝: This file contains a collection of sentences. But not just any sentences: these are sentences taken from real-world restaurant reviews! Each sentence is separated by a new line. It is like a dish of text, one sentence at a time.

2. **‘sent_test’** 🍽️: Just like the ‘sent_train’ file, this one contains sentences, but they’re for testing purposes. Think of it as the "taste test" phase of your restaurant review trip. The sentences here help you assess how well your model has learned the art of NER.

3. **‘label_train’** 🏷️: Now here’s where the magic happens. This file holds the NER labels, or tags, corresponding to each token in the ‘sent_train’ file. So, for every word in a sentence, there is a related label that tells the model what that word is: a restaurant name, a location, or a dish. It is like a guide to identifying the stars of the show!

4. **‘label_test’** 📋: This file is just like ‘label_train’, but for testing. It allows you to verify whether your model's predictions match the reality of the restaurant world. Will your model guess that "Burrito Palace" is the name of a restaurant? You will find out here!

In short, there is a one-to-one mapping between the ‘sent_train’/‘sent_test’ files and the ‘label_train’/‘label_test’ files. Each sentence is paired with its NER tags, making this an ideal recipe for training and testing your model.
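The one-to-one file mapping can be sketched in a few lines of pure Python. The sentence and tag strings below are invented stand-ins for one line of the real files, and the BIO-style tag names are an assumption based on the label set described in this overview:

```python
# Hypothetical one-sentence excerpts standing in for sent_train / label_train.
sent_train = ["great sushi at Sushi Central on Fifth Avenue"]
label_train = ["O B-Dish O B-Restaurant_Name I-Restaurant_Name O B-Location I-Location"]

# Pair each token with its tag, mirroring the one-to-one file mapping.
pairs = [
    list(zip(sent.split(), tags.split()))
    for sent, tags in zip(sent_train, label_train)
]
print(pairs[0][3])  # ('Sushi', 'B-Restaurant_Name')
```

Each (token, tag) pair is exactly the supervision a CRF sequence labeller trains on.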

    🍕 The NER Labels – What are we Tagging?

The real stars of this dataset are the NER tags. If you’re thinking, "Okay, but what exactly are we trying to identify in these restaurant reviews?", well, here is the menu of NER labels you’ll be working with:

• Rating ⭐: The stars or ratings (literally) that a reviewer gives to the restaurant. We all know how important those stars are when it comes to choosing where to eat!
• Amenity 🛋️: Think of the comfortable extras that come with a restaurant, such as free Wi-Fi or a pet-friendly courtyard.
• Location 📍: This tag marks the location of the restaurant. So when you see "on Fifth Avenue", you know it points to the place.
• Restaurant_Name 🍴: Ah, the place's name! Is it "Burger Bonanza" or "Sushi Central"? That's what this tag recognizes.
• Price 💰: How much are we talking here? The price tag can hold values like "$$$" or "affordable". We all want to know what we're in for, right?
• Hours ⏰: Because who wants to show up at a restaurant when it's closed? This tag marks opening and closing times.
• Dish 🍲: Which food is being talked about? "Pad Thai" or "Cheeseburger" are your examples.
• Cuisine 🍣: A tag for the type of food, whether it's Italian, Japanese, or good ol' American comfort food.

These NER tags help you make sense of all the data you encounter in a restaurant review. You will be able to easily pull out names, prices, ratings, dishes, and more. Talk about a full-course data feast!

    🍤 CRF Model – The Chef’s Special!

Now, once you get your hands on this delicious dataset, what do you do with it? Time to cook up a **CRF model**! 🍳

A CRF (Conditional Random Field) is a great way to label sequences of data, such as sentences. Since NER is about tagging each token (word) in a sentence, CRF models are ideal: they use the context around each word to make predictions. So when a sentence like "Wonderful sushi at Sushi Central!" passes in, the model can figure out that "Sushi Central" is a Restaurant_Name and "sushi" is a Dish.

    🍜 Features – Spice It Up!

Next, we dive into defining features for the CRF model. Features are like the secret ingredients that make your model work. You will learn how to define them in Python, so your model can recognize patterns and make accurate predictions.
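As a taste of what such features might look like, here is a minimal pure-Python sketch of a token feature function. The exact feature set is an assumption: word shape, affixes, and neighboring words are commonly used for CRF-based NER, but the corpus itself does not prescribe them.

```python
def token_features(sentence, i):
    """Features for the i-th token: the word itself plus its neighbors."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalized words often start names
        "word.isdigit": word.isdigit(),  # digits often appear in Hours/Price
        "suffix3": word[-3:],
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

tokens = "cheap sushi near Fifth Avenue".split()
feats = token_features(tokens, 3)  # features for "Fifth"
print(feats["word.istitle"], feats["prev.lower"])  # True near
```

A CRF library would then consume one such feature dict per token, aligned with the tag sequence from the label files.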

    ...

  7. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
bin. Available download formats
    Dataset updated
    May 4, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
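The SHA-256 anonymization step described above can be sketched in a couple of lines; the raw ID below is an invented example, not a value from the dataset:

```python
import hashlib

# Hypothetical raw Reddit post ID; the dataset stores only the digest.
raw_id = "t3_example123"
hashed = hashlib.sha256(raw_id.encode("utf-8")).hexdigest()

print(len(hashed))  # 64 hex characters; not reversible to the original ID
```

Because the hash is deterministic, the same post, comment, user, or subreddit always maps to the same opaque identifier, preserving join-ability across tables without exposing the original IDs.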

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

Column Name | Type | Description
subreddit_id | TEXT | The unique identifier for the subreddit.
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost.
post_id | TEXT | Unique identifier for the Reddit post.
created_at | TIMESTAMP | The timestamp when the post was created.
updated_at | TIMESTAMP | The timestamp when the post was last updated.
language_code | TEXT | The language code of the post.
score | INTEGER | The score (upvotes minus downvotes) of the post.
upvote_ratio | REAL | The ratio of upvotes to total votes.
gildings | INTEGER | Number of awards (gildings) received by the post.
num_comments | INTEGER | Number of comments on the post.

    Table: comments

Column Name | Type | Description
subreddit_id | TEXT | The unique identifier for the subreddit.
post_id | TEXT | The ID of the Reddit post the comment belongs to.
parent_id | TEXT | The ID of the parent comment (if a reply).
comment_id | TEXT | Unique identifier for the comment.
created_at | TIMESTAMP | The timestamp when the comment was created.
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified.
score | INTEGER | The score (upvotes minus downvotes) of the comment.
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment.
gilded | INTEGER | Number of awards (gildings) received by the comment.

    Table: postlinks

    post_id (TEXT): Unique identifier for the Reddit post.
    end_processed_valid (INTEGER): Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the Reddit post.
    final_valid (INTEGER): Whether the final URL from the post resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
    final_url (TEXT): The final URL after redirections.
    redirected (INTEGER): Indicator of whether the posted URL was redirected (1) or not (0).
    in_title (INTEGER): Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    comment_id (TEXT): Unique identifier for the Reddit comment.
    end_processed_valid (INTEGER): Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the comment.
    final_valid (INTEGER): Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
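    As an illustration of how these tables relate, the sketch below loads a minimal version of the posts and postlinks schema into SQLite and counts valid outbound links per subreddit. The sample rows and subreddit IDs are invented for the example; only a subset of the columns is populated.

```python
import sqlite3

# Minimal sketch (not part of the dataset release): load a reduced version of
# the schema above into an in-memory SQLite database and join posts with
# postlinks to count valid links per subreddit. Sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    subreddit_id TEXT, crosspost_parent_id TEXT, post_id TEXT PRIMARY KEY,
    created_at TIMESTAMP, updated_at TIMESTAMP, language_code TEXT,
    score INTEGER, upvote_ratio REAL, gildings INTEGER, num_comments INTEGER
);
CREATE TABLE postlinks (
    post_id TEXT, end_processed_valid INTEGER, end_processed_url TEXT,
    final_valid INTEGER, final_status INTEGER, final_url TEXT,
    redirected INTEGER, in_title INTEGER
);
""")
conn.executemany(
    "INSERT INTO posts (subreddit_id, post_id, score, num_comments) VALUES (?, ?, ?, ?)",
    [("t5_askhist", "p1", 120, 14), ("t5_askhist", "p2", 5, 1), ("t5_todayil", "p3", 42, 3)],
)
conn.executemany(
    "INSERT INTO postlinks (post_id, final_valid, final_url, redirected, in_title) VALUES (?, ?, ?, ?, ?)",
    [("p1", 1, "https://en.wikipedia.org/wiki/Rosetta_Stone", 0, 0),
     ("p2", 1, "https://en.wikipedia.org/wiki/Obelisk", 1, 0),
     ("p3", 0, "https://example.com/broken", 0, 1)],
)
rows = conn.execute("""
    SELECT p.subreddit_id, COUNT(*) AS n_valid_links
    FROM posts p JOIN postlinks l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    GROUP BY p.subreddit_id
    ORDER BY n_valid_links DESC
""").fetchall()
print(rows)  # [('t5_askhist', 2)]
```

    The same join against the comments and commentlinks tables would count comment-level links.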

  8. What You Need to Know About Managing a Child Welfare Information System...

    • data.virginia.gov
    • catalog.data.gov
    html
    Updated Sep 6, 2025
    + more versions
    Administration for Children and Families (2025). What You Need to Know About Managing a Child Welfare Information System Project [Dataset]. https://data.virginia.gov/dataset/what-you-need-to-know-about-managing-a-child-welfare-information-system-project
    Available download formats: html
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    Administration for Children and Families
    Description

    This webinar provides information on the federal government’s role in the process of building a successful child welfare information system and includes resources and guidance available to support states, territories, and tribes.

    Metadata-only record linking to the original dataset.

  9. Data from: Login Data Set for Risk-Based Authentication

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jun 30, 2022
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156
    Available download formats: zip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    IP Address (String): IP address belonging to the login attempt. Range: 0.0.0.0 - 255.255.255.255
    Country (String): Country derived from the IP address. Example: US
    Region (String): Region derived from the IP address. Example: New York
    City (String): City derived from the IP address. Example: Rochester
    ASN (Integer): Autonomous system number derived from the IP address. Range: 0 - 600000
    User Agent String (String): User agent string submitted by the client. Example: Mozilla/5.0 (Windows NT 10.0; Win64; ...
    OS Name and Version (String): Operating system name and version derived from the user agent string. Example: Windows 10
    Browser Name and Version (String): Browser name and version derived from the user agent string. Example: Chrome 70.0.3538
    Device Type (String): Device type derived from the user agent string. Values: mobile, desktop, tablet, bot, unknown (see footnote 1)
    User ID (Integer): Identification number related to the affected user account. [Random pseudonym]
    Login Timestamp (Integer): Timestamp related to the login attempt. [64-bit timestamp]
    Round-Trip Time (RTT) [ms] (Integer): Server-side measured latency between client and server. Range: 1 - 8600000
    Login Successful (Boolean): True if the login was successful, False if it failed.
    Is Attack IP (Boolean): Whether the IP address was found in a known attacker data set.
    Is Account Takeover (Boolean): Whether the login attempt was identified as an account takeover by the incident response team of the online service.
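    To make the feature set concrete, here is a hypothetical, much-simplified likelihood-ratio scoring sketch in the spirit of the Freeman et al. (2016) model referenced above. It is NOT the model evaluated in the study; the feature names and login history below are invented.

```python
from collections import Counter, defaultdict

# Hypothetical, much-simplified likelihood-ratio scoring in the spirit of
# Freeman et al. (2016) -- not the exact model evaluated in the study.
# Each login is a dict of categorical features; the history is invented.
history = [
    {"user": "u1", "country": "NO", "device": "desktop"},
    {"user": "u1", "country": "NO", "device": "desktop"},
    {"user": "u1", "country": "NO", "device": "mobile"},
    {"user": "u2", "country": "US", "device": "desktop"},
]
FEATURES = ("country", "device")

global_counts = {f: Counter(login[f] for login in history) for f in FEATURES}
user_counts = {f: defaultdict(Counter) for f in FEATURES}
for login in history:
    for f in FEATURES:
        user_counts[f][login["user"]][login[f]] += 1

def risk(user, attempt, smoothing=0.01):
    """Higher score = attempt is more unusual for this user than globally."""
    n_global = len(history)
    n_user = sum(user_counts[FEATURES[0]][user].values())
    score = 1.0
    for f in FEATURES:
        p_global = (global_counts[f][attempt[f]] + smoothing) / (n_global + smoothing)
        p_user = (user_counts[f][user][attempt[f]] + smoothing) / (n_user + smoothing)
        score *= p_global / p_user  # values unseen for this user raise the score
    return score

familiar = risk("u1", {"country": "NO", "device": "desktop"})
unusual = risk("u1", {"country": "US", "device": "desktop"})
print(familiar < unusual)  # True: a country the user never logged in from raises the score
```

    A real RBA deployment would combine many more features (ASN, browser, RTT) and calibrated attacker priors.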

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining the logical relations and temporal order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.

    • The RTT was randomly drawn based on the login success status and synthesized geolocation data, to ensure that the RTTs are realistic.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    • ASNs with values >= 500,000

    • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
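    The two markers above can be checked programmatically; the sketch below uses the standard ipaddress module, and the function name and parameters are illustrative, not part of the dataset's tooling.

```python
import ipaddress

# Flag the artificial entries described above:
# ASN >= 500,000 or IP address inside 10.0.0.0/8.
SYNTHETIC_NET = ipaddress.ip_network("10.0.0.0/8")

def is_synthetic_marker(ip: str, asn: int) -> bool:
    """True if this entry carries one of the documented synthetic markers."""
    return asn >= 500_000 or ipaddress.ip_address(ip) in SYNTHETIC_NET

print(is_synthetic_marker("10.13.37.1", 1234))   # True (reserved IP range)
print(is_synthetic_marker("8.8.8.8", 600_000))   # True (ASN marker)
print(is_synthetic_marker("8.8.8.8", 15169))     # False
```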

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the collection and evaluation of their data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
    ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
     author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
     title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
     journal = {{ACM} {Transactions} on {Privacy} and {Security}},
     doi = {10.1145/3546069},
     publisher = {ACM},
     year  = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    1. A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. This parse error may be useful information for your studies, so we kept these 1526 entries.

  10. Dataset of books called Gmicalzoma : it means what it says_ when you know...

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Gmicalzoma : it means what it says_ when you know what it means : an Enochian dictionary [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Gmicalzoma+%3A+it+means+what+it+says_+when+you+know+what+it+means+%3A+an+Enochian+dictionary
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Gmicalzoma : it means what it says_ when you know what it means : an Enochian dictionary. It features 7 columns including author, publication date, language, and book publisher.

  11. Dataset of books called Scholarly communication : what everyone needs to...

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Scholarly communication : what everyone needs to know® [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Scholarly+communication+%3A+what+everyone+needs+to+know%C2%AE
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Scholarly communication : what everyone needs to know®. It features 7 columns including author, publication date, language, and book publisher.

  12. Dataset: We Do Not Understand What It Says -- Studying Student Perceptions...

    • data.niaid.nih.gov
    Updated Jul 16, 2024
    Chakraborty, Shalini; Liebel, Grischa (2024). Dataset: We Do Not Understand What It Says -- Studying Student Perceptions of Software Modelling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6913779
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Reykjavik University
    Authors
    Chakraborty, Shalini; Liebel, Grischa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains two supporting documents for the paper titled "We Do Not Understand What It Says -- Studying Student Perceptions of Software Modelling". The first is an Excel sheet containing interview transcripts of 13 of the participants of this study (those who agreed to publish their statements), and the second is an appendix file containing the interview guide (the questionnaire used for interviews with students and instructors) used in the case study.

    The interview transcripts are supported by "in-vivo coding" used by both authors separately during analysis.

  13. lastfm Music Recommendation Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Feb 15, 2022
    Òscar Celma (2022). lastfm Music Recommendation Dataset [Dataset]. http://doi.org/10.5281/zenodo.6090214
    Available download formats: application/gzip
    Dataset updated
    Feb 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Òscar Celma
    Description

    This is a common Zenodo repository for both lastfm-360K and lastfm-1K datasets. See below the details of both datasets, including license, acknowledgements, contact, and instructions to cite.

    LASTFM-360K (version 1.2, March 2010).

    • What is this? This dataset contains
    • Files:
      • usersha1-artmbid-artname-plays.tsv (MD5: be672526eb7c69495c27ad27803148f1)
      • usersha1-profile.tsv (MD5: 51159d4edf6a92cb96f87768aa2be678)
      • mbox_sha1sum.py (MD5: feb3485eace85f3ba62e324839e6ab39)
    • Data Statistics:
      • File usersha1-artmbid-artname-plays.tsv:
        • Total Lines: 17,559,530
        • Unique Users: 359,347
        • Artists with MBID: 186,642
        • Artists without MBID: 107,373
    • Data Format: The data is formatted one entry per line as follows (tab separated "\t"):
      • File usersha1-artmbid-artname-plays.tsv:
        user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
      • File usersha1-profile.tsv:
        user-mboxsha1 \t gender (m|f|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
    • Example:
      • File usersha1-artmbid-artname-plays.tsv:
        000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432 \t u2 \t 31 ...
      • File usersha1-profile.tsv:
        000063d3fe1cf2ba248b9e3c3f0334845a27a6be \t m \t 19 \t Mexico \t Apr 28, 2008 ...
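    The tab-separated layout above can be read with the standard csv module. This is an illustrative sketch with a single inlined sample line adapted from the documented example; the field labels are descriptive names I chose, not official column headers (the files have none).

```python
import csv
import io

# Parse one line of the lastfm-360K plays file using the documented layout:
# user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
sample = (
    "000063d3fe1cf2ba248b9e3c3f0334845a27a6be\t"
    "a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432\tu2\t31\n"
)
fields = ["user_sha1", "artist_mbid", "artist_name", "plays"]
reader = csv.DictReader(io.StringIO(sample), fieldnames=fields, delimiter="\t")
rows = [{**row, "plays": int(row["plays"])} for row in reader]
print(rows[0]["artist_name"], rows[0]["plays"])  # u2 31
```

    For the real file, replace the StringIO with `open(path, encoding="utf-8")`; note that some artist MBID fields may be empty.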

    LASTFM-1K (version 1.0, March 2010).

    • What is this? This dataset contains
    • Files:
      • userid-timestamp-artid-artname-traid-traname.tsv (MD5: 64747b21563e3d2aa95751e0ddc46b68)
      • userid-profile.tsv (MD5: c53608b6b445db201098c1489ea497df)
    • Data Statistics:
      • File userid-timestamp-artid-artname-traid-traname.tsv:
        • Total Lines: 19,150,868
        • Unique Users: 992
        • Artists with MBID: 107,528
        • Artists without MBID: 69,420
    • Data Format: The data is formatted one entry per line as follows (tab separated, "\t"):
      • File userid-timestamp-artid-artname-traid-traname.tsv:
        userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
      • File userid-profile.tsv:
        userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
    • Example:
      • File userid-timestamp-artid-artname-traid-traname.tsv:
        user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
        user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead ...
      • File userid-profile.tsv:
        user_000639 \t m \t Mexico \t Apr 27, 2005 ...

    LICENSE OF BOTH DATASETS. The data contained in both datasets is distributed with permission of Last.fm. The data is made available for non-commercial use. Those interested in using the data or web services in a commercial context should contact:

    partners [at] last [dot] fm

    For more information see Last.fm terms of service

    ACKNOWLEDGEMENTS. Thanks to Last.fm for providing the access to this data via their web services. Special thanks to Norman Casagrande.

    REFERENCES. When using this dataset you must reference the Last.fm webpage. Optionally (not mandatory at all!), you can cite Chapter 3 of this book:

    @book{Celma:Springer2010,
      author = {Celma, O.},
      title = {{Music Recommendation and Discovery in the Long Tail}},
      publisher = {Springer},
      year = {2010}
    }
    

    CONTACT: This data was collected by Òscar Celma @ MTG/UPF

  14. RedditMachineLearningPosts2020

    • kaggle.com
    zip
    Updated Dec 6, 2020
    Anubha Singh (2020). RedditMachineLearningPosts2020 [Dataset]. https://www.kaggle.com/datasets/cerolacia/redditmachinelearningposts2020/discussion
    Available download formats: zip (206293 bytes)
    Dataset updated
    Dec 6, 2020
    Authors
    Anubha Singh
    Description

    Content

    This dataset consists of post details from the machine learning subreddit (https://www.reddit.com/r/MachineLearning/). It consists of one file with 470 rows and 7 columns.

    id: Unique ID for each post
    title: Title of the post
    score: Number of upvotes on that post
    url: URL of the post
    num_comments: Number of comments on the post
    body: Content of the post
    created: Time of creation of the post, in UTC

    Inspiration

    You can apply NLP techniques to the text fields, perform exploratory data analysis, or try to predict the score of posts.
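    As a starting point for such analysis, here is a minimal sketch using only the standard library. The CSV rows are invented, and the assumption that `created` holds a Unix epoch in UTC should be checked against the actual file.

```python
import csv
import io
import statistics
from datetime import datetime, timezone

# Quick exploratory sketch under the column layout above. The rows are
# invented, and `created` is ASSUMED to be a Unix epoch in UTC.
sample_csv = io.StringIO(
    "id,title,score,url,num_comments,body,created\n"
    "a1,[D] Attention question,310,https://example.org/a1,45,discussion,1607212800\n"
    "a2,[P] Tiny GAN demo,12,https://example.org/a2,3,project post,1607126400\n"
)
posts = list(csv.DictReader(sample_csv))
median_score = statistics.median(int(p["score"]) for p in posts)
first_day = datetime.fromtimestamp(int(posts[0]["created"]), tz=timezone.utc).date()
print(median_score, first_day)  # 161.0 2020-12-06
```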

  15. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Data
    Authors
    Fabiano Dalpiaz; Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took the utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, where DAIMS stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and it is at the basis of a students' project on web site design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should part of a website that manages data management plans. Each user story is stored as an issue on the GitHub's page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and
    born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

  16. Dataset of books called What everyone in Britain should know about crime and...

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called What everyone in Britain should know about crime and punishment [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=What+everyone+in+Britain+should+know+about+crime+and+punishment
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    This dataset is about books. It has 3 rows and is filtered where the book is What everyone in Britain should know about crime and punishment. It features 7 columns including author, publication date, language, and book publisher.

  17. Public Workforce System Dataset (PWSD)

    • catalog.data.gov
    • data.amerigeoss.org
    • +1 more
    Updated Sep 26, 2023
    + more versions
    Employment and Training Administration (2023). Public Workforce System Dataset (PWSD) [Dataset]. https://catalog.data.gov/dataset/public-workforce-system-dataset-pwsd-7712a
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Employment and Training Administration (https://www.dol.gov/agencies/eta)
    Description

    The PWSD is a dataset that can be used to answer questions about various public workforce system programs and how these programs fit in with the overall public workforce system and the economy. It was designed primarily to be used as a tool to understand what has been occurring in the Wagner-Peyser program and contains data from quarter 1 of 1995 through quarter 4 of 2008. Also, it was designed to understand the relationship and flow of participants as they go through the public workforce system. The PWSD can be used to analyze these programs both individually and in combination. The PWSD contains economic variables, Unemployment Insurance System data, and data on programs funded by the Workforce Investment Act and Employment Service. Economic variables included are labor force, employment, unemployment, unemployment rate, and gross domestic product data.

  18. Dataset of book subjects that contain Know me, like me, follow me : what...

    • workwithdata.com
    Updated Nov 7, 2024
    Work With Data (2024). Dataset of book subjects that contain Know me, like me, follow me : what online social networking means for you and your business [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Know+me%2C+like+me%2C+follow+me+%3A+what+online+social+networking+means+for+you+and+your+business&j=1&j0=books
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 3 rows and is filtered where the books is Know me, like me, follow me : what online social networking means for you and your business. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  19. What Cheer, IA Age Group Population Dataset: A Complete Breakdown of What...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Neilsberg Research (2025). What Cheer, IA Age Group Population Dataset: A Complete Breakdown of What Cheer Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/45510f6c-f122-11ef-8c1b-3860777c1fe6/
    Available download formats: csv, json
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    What Cheer, Iowa
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. Ages between 0 and 85 were divided into roughly 5-year buckets; ages over 85 were aggregated into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the What Cheer population distribution across 18 age groups. It lists the population in each age group along with the percentage that group represents of the total population of What Cheer. The dataset can be used to understand the population distribution of What Cheer by age; for example, it can identify the largest age group in What Cheer.

    Key observations

    The largest age group in What Cheer, IA was the 5 to 9 years group, with a population of 92 (14.51%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in What Cheer, IA was the 25 to 29 years group, with a population of 5 (0.79%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over
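
    The 5-year bucketing above can be sketched in code. This is a minimal illustration of the grouping scheme as described, not Neilsberg's actual processing pipeline:

    ```python
    # Map an age in years onto the 18 ACS-style groups listed above.
    # A sketch of the described bucketing, not the provider's real code.
    def age_group(age: int) -> str:
        if age < 5:
            return "Under 5 years"
        if age >= 85:
            return "85 years and over"
        lo = (age // 5) * 5  # lower bound of the 5-year bucket
        return f"{lo} to {lo + 4} years"
    ```

    For instance, `age_group(7)` yields "5 to 9 years", and every age from 0 upward lands in exactly one of the 18 groups.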

    Variables / Data Columns

    • Age Group: The age group in consideration.
    • Population: The population of What Cheer in the specific age group.
    • % of Total Population: The population of each age group as a proportion of What Cheer's total population. Please note that the percentages may not sum to exactly 100% due to rounding.
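
    The rounding caveat can be illustrated with the figures quoted above. The total of 634 is an assumed value inferred from the stated shares (92 persons corresponding to 14.51%); only the 92 and 5 counts appear in the text:

    ```python
    # Hypothetical illustration: 634 is the total population implied by the
    # stated shares, not a figure given in the dataset description.
    populations = {"5 to 9 years": 92, "25 to 29 years": 5}
    total = 634
    percentages = {g: round(100 * p / total, 2) for g, p in populations.items()}
    # Because each share is rounded independently, the full column of
    # percentages may not sum to exactly 100%.
    ```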

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for What Cheer Population by Age. You can refer to it here.

  20.

    Dataset for "I Know What I Like When I See It: Likability is Distinct from...

    • demo-b2find.dkrz.de
    Updated Jul 20, 2021
    Cite
    (2021). Dataset for "I Know What I Like When I See It: Likability is Distinct from Pleasantness since Early Stages of Multimodal Emotion Evaluation " [Dataset]. http://demo-b2find.dkrz.de/dataset/bf3dea6c-af15-55c7-aa9d-96b1615c1f2c
    Explore at:
    Dataset updated
    Jul 20, 2021
    Description

    Liking and pleasantness are common concepts in psychological emotion theories and in everyday language related to emotions. Despite obvious similarities between the terms, several empirical and theoretical notions support the idea that pleasantness and liking are cognitively different phenomena, which becomes most evident in the context of emotion regulation and art enjoyment. This study investigated whether liking and pleasantness produce behaviourally measurable differences, not only over the long timespan of emotion regulation, but already within the initial affective responses to visual and auditory stimuli. A cross-modal affective priming protocol was used to assess whether there is a behavioural difference in response time when providing an affective rating in a liking or pleasantness task. It was hypothesized that the pleasantness task would be faster, as it is known to rely on rapid feature detection. Furthermore, an affective priming effect was expected to take place across the sensory modalities and the presentative and non-presentative stimuli. A linear mixed effects analysis indicated a significant priming effect, as well as an interaction between the auditory and visual sensory modalities and the affective rating tasks of liking and pleasantness: while liking was rated fastest across modalities, it was significantly faster in vision compared to audition. No significant modality-dependent differences between the pleasantness ratings were detected. The results demonstrate that the liking and pleasantness rating scales refer to separate processes already within the short time scale of one to two seconds. Furthermore, the affective priming effect indicates that affective information transfer takes place across the modalities and stimulus types applied. Contrary to the hypothesis, liking ratings were faster across the modalities. This is interpreted as supporting emotion-theoretical notions in which liking and disliking are crucial properties of emotion perception and homeostatic self-referential information, possibly overriding pleasantness-related feature analysis. Conclusively, the findings provide empirical evidence for a conceptual delineation of common affective processes.

Cite
Shaik Barood Mohammed Umar Adnaan Faiz (2025). Daily Social Media Active Users [Dataset]. https://www.kaggle.com/datasets/umeradnaan/daily-social-media-active-users

Daily Social Media Active Users

A thorough dataset that displays user activity on major social media platforms.

Explore at:
zip (126814 bytes). Available download formats
Dataset updated
May 5, 2025
Authors
Shaik Barood Mohammed Umar Adnaan Faiz
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.

Dataset Breakdown:

  • Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.

  • Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.

  • Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's social networking, messaging, multimedia sharing, professional networking, or something else.

  • Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.

  • Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.

  • Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.

  • Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
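
The fields above suggest a simple tabular schema. The sketch below generates synthetic rows of that shape; the column names are paraphrased from this description and may differ from the actual CSV headers, and the platform subset, country list, and 10% verified rate are illustrative assumptions:

```python
import random
from datetime import date, timedelta

# Illustrative subset of the platform/owner pairs named in the description.
PLATFORMS = {"Facebook": "Meta", "YouTube": "Google", "TikTok": "ByteDance"}

def make_row(rng: random.Random) -> dict:
    """Generate one synthetic row shaped like the described schema."""
    platform = rng.choice(sorted(PLATFORMS))
    return {
        "Platform": platform,
        "Owner": PLATFORMS[platform],
        "Country": rng.choice(["India", "United States", "Brazil"]),
        "Daily Time Spent (min)": rng.randint(5, 300),
        "Verified Account": rng.random() < 0.1,  # assumed ~10% verified
        "Date Joined": date(2015, 1, 1) + timedelta(days=rng.randrange(3650)),
    }

rng = random.Random(42)  # seeded for reproducibility
rows = [make_row(rng) for _ in range(100)]
```

Rows of this shape are enough to prototype the aggregations described below, such as average daily time spent per platform or per country.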

Context and Use Cases:

  • This synthetic dataset is designed to offer a privacy-friendly alternative for analytics, research, and machine learning purposes. Given the complexities and privacy concerns around using real user data, especially in the context of social media, this dataset offers a clean and secure way to develop, test, and fine-tune applications, models, and algorithms without the risks of handling sensitive or personal information.

Researchers, data scientists, and developers can use this dataset to:

  • Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.

  • Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.

  • Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.

  • Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.

  • Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.

  • Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.

The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.

Future Considerations:

As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.

By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
