19 datasets found
  1. threads-math-sx

    • zenodo.org
    • explore.openaire.eu
    json
    Updated Dec 16, 2023
    Cite
    Nicholas Landry (2023). threads-math-sx [Dataset]. http://doi.org/10.5281/zenodo.10373324
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Landry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This is a temporal higher-order network dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In this dataset, nodes are users on https://math.stackexchange.com, and a hyperedge comes from users participating in a thread that lasts for at most 24 hours. The timestamps are the time of the post, but normalized so that the earliest post starts at 0.
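
    The timestamp normalization described above can be sketched as follows. This is a minimal illustration only: the in-memory layout (a list of `(timestamp, node set)` pairs), the function name `normalize_timestamps`, and the toy user names are assumptions, not the dataset's actual JSON schema.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: one (timestamp, set of user ids) pair per hyperedge.
Hyperedge = Tuple[float, FrozenSet[str]]


def normalize_timestamps(events: List[Hyperedge]) -> List[Hyperedge]:
    """Shift all timestamps so the earliest hyperedge starts at 0, sorted by time."""
    if not events:
        return []
    t0 = min(t for t, _ in events)
    shifted = [(t - t0, nodes) for t, nodes in events]
    return sorted(shifted, key=lambda event: event[0])


# Hypothetical toy data: two threads, each a set of participating users.
raw_events = [
    (1_600_000_300.0, frozenset({"alice", "bob"})),
    (1_600_000_000.0, frozenset({"bob", "carol", "dave"})),
]
temporal_hypergraph = normalize_timestamps(raw_events)
```

    Each hyperedge keeps its full participant set, so a thread with three users stays one three-node hyperedge rather than being split into pairwise links.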

    Source of original data

    Source: threads-math-sx dataset

    References

    If you use this data, please cite the following paper:

  2. reddit_threads

    • huggingface.co
    Updated Apr 13, 2023
    Cite
    Graph Datasets (2023). reddit_threads [Dataset]. https://huggingface.co/datasets/graphs-datasets/reddit_threads
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Graph Datasets
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    Dataset Card for Reddit threads

    Dataset Summary

    The Reddit threads dataset contains 'discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them' (doc).

    Supported Tasks and Leaderboards

    The related task is binary classification: predicting whether a thread is discussion based or not.

    External Use

    PyGeometric

    To load in… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/reddit_threads.

  3. Cheltenham's Facebook Groups

    • kaggle.com
    zip
    Updated Apr 2, 2018
    Cite
    Mike Chirico (2018). Cheltenham's Facebook Groups [Dataset]. https://www.kaggle.com/datasets/mchirico/cheltenham-s-facebook-group
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 2, 2018
    Authors
    Mike Chirico
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Facebook is becoming an essential tool for more than just family and friends. Discover how Cheltenham Township (USA), a diverse community just outside of Philadelphia, deals with major issues such as the Bill Cosby trial, everyday traffic issues, sewer I/I problems and lost cats and dogs. And yes, theft.

    Communities work when they're connected and exchanging information. What and who are the essential forces making a positive impact, and when and how do conversational threads get directed or misdirected?

    Use Any Facebook Public Group

    You can leverage the examples here for any public Facebook group. For an example of the source code used to collect this data, and a quick start docker image, take a look at the following project: facebook-group-scrape.

    Data Sources

    There are 4 csv files in the dataset, with data from the following 5 public Facebook groups:

    post.csv

    These are the main posts you will see on the page. It might help to take a quick look at the page. Commas in the msg field have been replaced with {COMMA}, and apostrophes have been replaced with {APOST}.

    • gid Group id (5 different Facebook groups)
    • pid Main Post id
    • id Id of the user posting
    • name User's name
    • timeStamp
    • shares
    • url
    • msg Text of the message posted.
    • likes Number of likes
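
    The placeholder substitution in the msg field can be reversed with a small helper. This is a sketch under assumptions: the function name `restore_msg` and the sample string are hypothetical, and only the two placeholders named above are handled.

```python
def restore_msg(msg: str) -> str:
    """Undo the {COMMA} and {APOST} placeholders described for the msg field."""
    return msg.replace("{COMMA}", ",").replace("{APOST}", "'")


# Hypothetical sample value, not taken from the dataset.
example = "Lost cat{COMMA} gray{COMMA} answers to Max. It{APOST}s friendly."
restored = restore_msg(example)
```

    Applying the helper after parsing the CSV, rather than before, keeps the restored commas from being mistaken for field separators.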

    comment.csv

    These are comments to the main post. Note, Facebook postings have comments, and comments on comments.

    • gid Group id
    • pid Matches Main Post identifier in post.csv
    • cid Comment Id.
    • timeStamp
    • id Id of user commenting
    • name Name of user commenting
    • rid Id of user responding to first comment
    • msg Message

    like.csv

    These are likes and responses. The two keys in this file (pid,cid) will join to post and comment respectively.

    • gid Group id
    • pid Matches Main Post identifier in post.csv
    • cid Matches Comments id.
    • response Response such as LIKE, ANGRY etc.
    • id The id of user responding
    • name Name of the user responding
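
    The pid join between like.csv and post.csv can be sketched with the standard library. The inline rows below are hypothetical toy data, and the sketch assumes each file has a header row using the column names listed above, which the description does not confirm.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for post.csv and like.csv (assumed header rows).
post_csv = io.StringIO(
    "gid,pid,id,name,timeStamp,shares,url,msg,likes\n"
    "1,p1,u1,Ann,2018-01-01,0,http://example.com/p1,Hello,2\n"
)
like_csv = io.StringIO(
    "gid,pid,cid,response,id,name\n"
    "1,p1,c1,LIKE,u2,Bob\n"
    "1,p1,c2,ANGRY,u3,Cy\n"
    "1,p9,c3,LIKE,u4,Dee\n"
)

# Index posts by pid, then count responses that join to a known post.
posts = {row["pid"]: row for row in csv.DictReader(post_csv)}
responses_per_post = Counter()
for like in csv.DictReader(like_csv):
    if like["pid"] in posts:
        responses_per_post[(like["pid"], like["response"])] += 1
```

    A join on cid against comment.csv would follow the same pattern, with comments indexed by their cid instead.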

    member.csv

    These are all the members in the group. Some members never, or rarely, post or comment. You may find multiple entries in this table for the same person. The name of the individual never changes, but they change their profile picture. Each profile picture change is captured in this table. Facebook gives users a new id in this table when they change their profile picture.

    • gid Group id
    • id Id of the member
    • name Name of the member
    • url URL of the member
  4. Bluesky Social Dataset

    • zenodo.org
    Updated Dec 2, 2024
    Cite
    Andrea Failla; Giulio Rossetti (2024). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.11082879
    Explore at:
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bluesky Social Dataset

    1st Dec 2024. This version of the dataset has been superseded and is now restricted. Please refer to the most recent release.

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

    Dataset

    Here is a description of the dataset files.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
    • posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers, and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
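
    As an illustration of the followers.csv.gz layout described above, the following sketch reads a gzip-compressed edge list into a "who follows whom" mapping. The helper name `read_follow_edges` and the in-memory sample are hypothetical; the sketch assumes there is no header row, as the description mentions only rows of two integers.

```python
import csv
import gzip
import io
from collections import defaultdict


def read_follow_edges(fileobj):
    """Yield (u, v) pairs meaning 'user u follows user v' from a gzipped CSV."""
    with gzip.open(fileobj, mode="rt", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            yield int(row[0]), int(row[1])


# Hypothetical in-memory stand-in for followers.csv.gz.
sample = io.BytesIO(gzip.compress(b"1,2\n1,3\n2,3\n"))

following = defaultdict(set)
for u, v in read_follow_edges(sample):
    following[u].add(v)
```

    With the real file, a path string can be passed in place of `sample`, since `gzip.open` accepts either a filename or a file object. Streaming row by row avoids decompressing the whole file into memory first.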

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year Worth of Social Data". PLOS ONE (2024). https://doi.org/10.1371/journal.pone.0310330

    Right to Erasure (Right to be forgotten)

    Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

    Users included in the Bluesky dataset have the right to opt out and request the removal of their data, in accordance with GDPR provisions (Article 17). It should be noted, however, that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides derogations (Article 17(3)(d) and Article 89).

    We emphasize that, in compliance with GDPR (Article 4(5)), the released data has been thoroughly pseudonymized. Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to further protect individual privacy.

    If you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with subject "Removal request: [username]").
    We will process your request within a reasonable timeframe.

    Acknowledgments:

    This work is supported by:

    • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
      Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
    • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
    • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
  5. Email Thread Summary Dataset

    • kaggle.com
    Updated Sep 28, 2023
    Cite
    Marawan Mamdouh (2023). Email Thread Summary Dataset [Dataset]. https://www.kaggle.com/datasets/marawanxmamdouh/email-thread-summary-dataset
    Explore at:
    Croissant
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marawan Mamdouh
    Description

    Email Thread Summary Dataset

    Overview:

    The Email Thread Dataset consists of two main files: email_thread_details and email_thread_summaries. These files collectively offer a comprehensive compilation of email thread information alongside human-generated summaries.

    Email Thread Details:

    Description:

    The email_thread_details file provides a detailed perspective on individual email threads, encompassing crucial information such as subject, timestamp, sender, recipients, and the content of the email.

    Columns:

    • thread_id: A unique identifier for each email thread.
    • subject: Subject of the email thread.
    • timestamp: Timestamp indicating when the message was sent.
    • from: Sender of the email.
    • to: List of recipients of the email.
    • body: Content of the email message.

    Additional Information:

    The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.
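
    When working from the CSV rather than the Pickle file, the "to" column has to be turned back into a list. The sketch below assumes the CSV cell holds a Python-style list literal, which the description does not confirm; the helper name `parse_recipients` and the sample addresses are hypothetical.

```python
import ast
from typing import List


def parse_recipients(cell: str) -> List[str]:
    """Parse a 'to' cell assumed to be serialized as a Python list literal."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]


# Hypothetical cell content, not taken from the dataset.
cell = "['a@example.com', 'b@example.com']"
recipients = parse_recipients(cell)
```

    `ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted cell contents. If the cells turn out to use another serialization, the helper would need to change accordingly.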

    Email Thread Summaries:

    Description:

    The email_thread_summaries file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.

    Columns:

    • thread_id: A unique identifier for each email thread.
    • summary: A concise summary of the email thread.

    Dataset Structure:

    The dataset is organized into threads and emails. There are a total of 4,167 threads and 21,684 emails, providing a rich source of information for analysis and research purposes.

    • Threads: 4,167 threads
    • Emails: 21,684 emails

    Language:

    • Languages: English (en)

    Use Cases:

    1. Natural Language Processing (NLP) Research:
      • Analyze email thread contents and human-generated summaries for advancements in NLP tasks.
    2. Text Summarization Models:
      • Train and evaluate text summarization models using the provided email threads and summaries.
    3. Email Analytics:
      • Gain insights into communication patterns, sender-receiver relationships, and content analysis.

    File Formats:

    • CSV Files:
      • Easily importable into various data analysis tools.
    • Pickle (pkl) Files:
      • Facilitates direct reading of the "to" column as a column of lists of strings.
    • JSON Files:

      • Offers compatibility with JSON data structures, providing an additional option for users who prefer or require this widely-used format in their analytical workflows.
      • JSON File Features Description

        [
          {
            "thread_id": [unique identifier],
            "subject": "[email thread subject]",
            "timestamp": [timestamp in milliseconds],
            "from": "[sender's name and identifier]",
            "to": [
              "[recipient 1]",
              "[recipient 2]",
              "[recipient 3]",
              ...
            ],
            "body": "[email content]"
          },
          ...
        ]
        
        [
          {
            "thread_id": [unique identifier],
            "summary": "[summary content]"
          },
          ...
        ]
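
    The two JSON structures above share thread_id, so they can be combined per thread. The following is a minimal sketch with hypothetical inline records; it also converts the millisecond timestamp to a datetime, assuming (without confirmation from the description) that timestamps are Unix epoch values in UTC.

```python
import json
from datetime import datetime, timezone

# Hypothetical records following the two templates above.
details_json = (
    '[{"thread_id": 1, "subject": "Budget", "timestamp": 1000000000000,'
    ' "from": "A", "to": ["B"], "body": "Hi"}]'
)
summaries_json = '[{"thread_id": 1, "summary": "Greeting."}]'

# Index summaries by thread_id, then group emails under their thread.
summaries = {item["thread_id"]: item["summary"] for item in json.loads(summaries_json)}
threads = {}
for email in json.loads(details_json):
    # Milliseconds to seconds before building the datetime (UTC is an assumption).
    sent_at = datetime.fromtimestamp(email["timestamp"] / 1000, tz=timezone.utc)
    thread = threads.setdefault(
        email["thread_id"],
        {"summary": summaries.get(email["thread_id"]), "emails": []},
    )
    thread["emails"].append((sent_at, email["from"], email["subject"]))
```

    Using `summaries.get` leaves the summary as `None` for any thread without a matching entry instead of raising an error.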
        

    Files Structure:

    - Dataset
     ├── CSV
     │  ├── email_thread_details.csv
     │  └── email_thread_summaries.csv
     ├── Pickle
     │  ├── email_thread_details.pkl
     │  └── email_thread_summaries.pkl
     └── JSON
       ├── email_thread_details.json
       └── email_thread_summaries.json
    

    License:

    This dataset is provided under the MIT License.

    Disclaimer:

    The dataset has been anonymized and sanitized to ensure privacy and confidentiality.

  6. BluePrint

    • huggingface.co
    Updated May 28, 2025
    Cite
    Complex Data Lab (2025). BluePrint [Dataset]. http://doi.org/10.57967/hf/5425
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    Complex Data Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📘 BluePrint

    BluePrint is a large-scale dataset of social media conversation threads designed for evaluating and training LLM-based social media agents. It provides realistic, thread-structured data clustered into representative user personas at various levels of granularity.

    ✅ Key Features

    • Thread-Based Structure: Each example is a list of messages representing a user thread.
    • Persona Clustering: Users are clustered into 2, 25, 100, and 1000 representative personas to…

    See the full description on the dataset page: https://huggingface.co/datasets/ComplexDataLab/BluePrint.

  7. The Online conversation threads repository

    • dataverse.csuc.cat
    txt
    Updated Oct 13, 2023
    + more versions
    Cite
    Vicenç Gómez; Andreas Kaltenbrunner; David Laniado (2023). The Online conversation threads repository [Dataset]. http://doi.org/10.34810/data497
    Explore at:
    Available download formats: txt(6626), txt(1763476626), txt(110980658), txt(673642981)
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Vicenç Gómez; Andreas Kaltenbrunner; David Laniado
    License

    https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data497

    Description

    This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from different news aggregators (Slashdot, Barrapunto) and the English Wikipedia talk pages.

    Slashdot conversations (Aug 2005 - Aug 2006)

    Online conversations generated at Slashdot during a year. Posts and comments published between August 26th, 2005 and August 31st, 2006. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content. This dataset is different from the Slashdot Zoo social network (it is not a signed network of users) contained in the SNAP repository. It represents the full version of the dataset used in the CAW 2.0 - Content Analysis for the WEB 2.0 workshop for the WWW 2009 conference, which can be found in several repositories such as Konect.

    Barrapunto conversations (Jan 2005 - Dec 2008)

    Online conversations generated at Barrapunto (Spanish clone of Slashdot) during three years. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content.

    Wikipedia (2001 - Mar 2010)

    Data from article discussion (talk) pages of the English Wikipedia as of March 2010. It contains comments on about 870,000 articles (i.e. all articles which had a corresponding talk page with at least one comment), in total about 9.4 million comments. The oldest comments date back to as early as 2001.

  8. Classification Graphs

    • kaggle.com
    Updated Nov 12, 2021
    Cite
    Subhajit Sahu (2021). Classification Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-classification/activity
    Explore at:
    Croissant
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deezer Ego Nets

    The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.

    Github Stargazers

    The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until 2019 August. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component (at least with 10 users) of graphs.

    Reddit Threads

    Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).

    Twitch Ego Nets

    The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict, using the ego-net, whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.

    Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#disjointgraphs

  9. 🇨🇦 Reddit r/Canada Subreddit Dataset

    • kaggle.com
    Updated Jul 9, 2025
    Cite
    BwandoWando (2025). 🇨🇦 Reddit r/Canada Subreddit Dataset [Dataset]. https://www.kaggle.com/datasets/bwandowando/reddit-rcanada-subreddit-dataset
    Explore at:
    Croissant
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    https://www.reddit.com/wiki/api

    Area covered
    Canada
    Description

    Context

    I looked for an r/Canada dataset here on Kaggle and haven't found one, so I made one for the Canadian Kaggle members.

    About


    Created on Jan 25, 2008, r/Canada describes itself as follows:

    Welcome to Canada’s official subreddit! This is the place to engage on all things Canada. Nous parlons en anglais et en français. Please be respectful of each other when posting, and note that users new to the subreddit might experience posting limitations until they become more active and longer members of the community. Do not hesitate to message the mods if you experience any issues!

    This dataset can be used to extract insights from the trending topics and discussions in the subreddit.

    Banner Image

    Created with Bing Image Creator

  10. Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Updated Aug 12, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Macao, Christmas Island, Botswana, Holy See, Chile, Martinique, Gambia, Jersey, Mexico, Côte d'Ivoire
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights:

    The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

  11. TED dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 6, 2020
    Cite
    Popescu-Belis, Andrei (2020). TED dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4061423
    Explore at:
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Popescu-Belis, Andrei
    Pappas, Nikolaos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset for recommendations collected from ted.com which contains metadata fields for TED talks and user profiles with rating and commenting transactions.

    The TED dataset contains all the audio-video recordings of the TED talks downloaded from the official TED website, http://www.ted.com, on April 27th 2012 (first version) and on September 10th 2012 (second version). No processing has been done on any of the metadata fields. The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.

    Nikolaos Pappas, Andrei Popescu-Belis, "Combining Content with User Preferences for TED Lecture Recommendation", 11th International Workshop on Content Based Multimedia Indexing, Veszprém, Hungary, IEEE, 2013.

    Nikolaos Pappas, Andrei Popescu-Belis, "Sentiment Analysis of User Comments for One-Class Collaborative Filtering over TED Talks", 36th ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, ACM, 2013.

    If you use the TED dataset for your research please cite one of the above papers (specifically the 1st paper for the April 2012 version and the 2nd paper for the September 2012 version of the dataset).

    TED website

    The TED website is a popular online repository of audiovisual recordings of public lectures given by prominent speakers, under a Creative Commons non-commercial license (see www.ted.com). The site provides extended metadata and user-contributed material. The speakers are scientists, writers, journalists, artists, and businesspeople from all over the world who are generally given a maximum of 18 minutes to present their ideas. The talks are given in English and are usually transcribed and then translated into several other languages by volunteer users. The quality of the talks has made TED one of the most popular online lecture repositories, as each talk was viewed on average almost 500,000 times.

    Metadata

    The dataset contains two main entry types: talks and users. The talks have the following data fields: identifier, title, description, speaker name, TED event at which they were given, transcript, publication date, filming date, number of views. Each talk has a variable number of user comments, organized in threads. In addition, three fields were assigned by TED editorial staff: related tags, related themes, and related talks. Each talk generally has three related talks and 95% of them have a high-quality transcript available. The dataset includes 1,149 talks from 960 speakers and 69,023 registered users that have made about 100,000 favorites and 200,000 comments.

  12. Z

    Dataset for: The Evolution of the Manosphere Across the Web

    • data.niaid.nih.gov
    Updated Aug 30, 2020
    Cite
    Savvas Zannettou (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
    Explore at:
    Dataset updated
    Aug 30, 2020
    Dataset provided by
    Gianluca Stringhini
    Jeremy Blackburn
    Stephanie Greenberg
    Savvas Zannettou
    Summer Long
    Emiliano De Cristofaro
    Barry Bradlyn
    Manoel Horta Ribeiro
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Evolution of the Manosphere Across the Web

    We make available data related to subreddit and standalone forums from the manosphere.

    We also make available Perspective API annotations for all posts.

    You can find the code in GitHub.

    Please cite this paper if you use this data:

    @article{ribeiroevolution2021,
      title={The Evolution of the Manosphere Across the Web},
      author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
      booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
      year={2021}
    }

    1. Reddit data

    We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data are available, one line per post, in /ndjson/reddit.ndjson. A sample line is:

    { "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

    Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

    Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

    No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

    I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

    Tallcels are fakecels and they all can (and should) suck my cock.

    If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

    Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
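
    The one-object-per-line layout described above can be read lazily, without loading the whole file. The following is a minimal sketch, not the authors' code: the helper names are hypothetical, the field names are taken from the sample above, and reading date_post as Unix seconds in UTC is an assumption.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def iter_posts(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def posts_per_subreddit(lines):
    """Count posts per subreddit from an iterable of NDJSON lines."""
    return Counter(post["subreddit"] for post in iter_posts(lines))


def post_time(post):
    """Interpret date_post as Unix seconds in UTC (an assumption)."""
    return datetime.fromtimestamp(post["date_post"], tz=timezone.utc)


# Usage with the real file (the path relative to the dataset root is assumed):
# with open("ndjson/reddit.ndjson", encoding="utf-8") as fh:
#     print(posts_per_subreddit(fh).most_common(5))
```

    Passing the open file handle directly keeps memory use flat, since each line is parsed and discarded in turn.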

    2. Forums

    We here describe the .sqlite and .ndjson files that contain the data from the following forums.

    (avfm) --- https://d2ec906f9aea-003845.vbulletin.net
    (incels) --- https://incels.co/
    (love_shy) --- http://love-shy.com/lsbb/
    (redpilltalk) --- https://redpilltalk.com/
    (mgtow) --- https://www.mgtow.com/forums/
    (rooshv) --- https://www.rooshvforum.com/
    (pua_forum) --- https://www.pick-up-artist-forum.com/
    (the_attraction) --- http://www.theattractionforums.com/

    The files are in folders /sqlite/ and /ndjson.

    2.1 .sqlite

    All the tables in the .sqlite datasets follow a very simple {key:value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. This file contains three tables:

    idx: each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

    "type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticky/pool/locked posts;
    "title": (str) title of the thread;
    "link": (str) link to the thread;
    "author_topic": (str) username that created the thread;
    "replies": (int) number of replies, may differ from number of posts due to difference in crawling date;
    "views": (int) number of views;
    "subforum": (str) name of the subforum;
    "collected": (bool) indicates if raw posts have been collected;
    "crawled_idx_at": (str) datetime of the collection.

    processed_posts: each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

    "author": (str) author's username;
    "resume_author": (str) author's little description;
    "joined_author": (str) date author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the main post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (depends), for sure unique within thread;
    "id_post_interaction": (list) list with other posts ids this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as key;
    "crawled_at": (str) datetime of the collection.

    raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

    "post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

    2.2 .ndjson

    Each line consists of a JSON object representing a different comment with the following fields:

    "author": (str) author's username;
    "resume_author": (str) author's little description;
    "joined_author": (str) date author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the main post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (depends), for sure unique within thread;
    "id_post_interaction": (list) list with other posts ids this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as key;
    "crawled_at": (str) datetime of the collection.

    3. Perspective

    We also ran each forum post and Reddit post through the Perspective API; the files are located in the /perspective/ folder and are compressed with gzip. One example output:

    { "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
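
    Since the annotation files are gzip-compressed, they can be streamed without unpacking them first. The sketch below is illustrative only: it assumes each file holds one JSON object per line, shaped like the example above (the description states only that the files are gzipped), and the function names and the 0.8 threshold are hypothetical.

```python
import gzip
import json


def iter_scores(path):
    """Yield (id_post, hate_output) pairs from a gzip-compressed file.

    Assumes one JSON object per line; the dataset description only
    states that the files are compressed with gzip.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                record = json.loads(line)
                yield record["id_post"], record["hate_output"]


def toxic_post_ids(path, threshold=0.8):
    """Return ids of posts whose TOXICITY score is at least `threshold`."""
    return [
        post_id
        for post_id, scores in iter_scores(path)
        if scores.get("TOXICITY", 0.0) >= threshold
    ]
```

    The same loop works for any of the other score keys shown in the example, such as SEVERE_TOXICITY or INSULT.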

    4. Working with sqlite

    A nice way to read some of the files of the dataset is using SqliteDict, for example:

    from sqlitedict import SqliteDict

    processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

    for key, posts in processed_posts.items():
        for post in posts:
            # here you could do something with each post in the dataset
            pass

    5. Helpers

    Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

    channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries of the users who posted on it, along with timestamps.

    author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

    These are used in the paper for the migration analyses.
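
    To make the shape of these helper mappings concrete, here is a small sketch that builds an author-keyed index from Reddit post records. It is not the authors' code: the dictionary keys "subreddit" and "timestamp" and both function names are assumptions, since the exact layout of the stored dictionaries is not documented here.

```python
from collections import defaultdict


def build_author_index(posts):
    """Map each author to a time-ordered list of {"subreddit", "timestamp"} dicts."""
    index = defaultdict(list)
    for post in posts:
        index[post["author"]].append(
            {"subreddit": post["subreddit"], "timestamp": post["date_post"]}
        )
    for entries in index.values():
        entries.sort(key=lambda entry: entry["timestamp"])
    return dict(index)


def first_seen(entries):
    """Return the earliest timestamp at which the author posted in each subreddit."""
    seen = {}
    for entry in entries:  # entries are already sorted by timestamp
        seen.setdefault(entry["subreddit"], entry["timestamp"])
    return seen
```

    Comparing first-seen timestamps across subreddits is one simple way to see in which order an author appeared in different communities.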

    6. Examples and particularities for forums

    Although we did our best to clean the data and be consistent across forums, this was not always possible. In the following subsections we discuss the particularities of each forum, note directions for improving the parsing that were not pursued, and give some examples of how things work in each forum.

    6.1 incels

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: for the incels forum, the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.

    quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

    6.2 LoveShy

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: no types were parsed. There are some rules in the forum, but not significant.

    quotes: quotes were obtained from exact text+author match, or author match + a jaccard

  13. c

    ckanext-comments - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). ckanext-comments - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-comments
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The ckanext-comments extension enhances CKAN by enabling threaded discussions on core entities within the platform. This allows for direct feedback, collaboration, and annotation of datasets, resources, groups, organizations, and user profiles. By providing an API-first approach, the extension facilitates the integration of commenting functionality into custom user interfaces or automated workflows.

    Key Features:

    • Threaded Comments: Implements a threaded commenting system, allowing users to reply to existing comments and create structured discussions around datasets and other entities.
    • API-First Design: Offers a comprehensive API for all commenting features, enabling programmatic access to comment creation, retrieval, modification, and deletion.
    • Entity Linking: Links comment threads to specific CKAN entities, including datasets, resources, groups, organizations, and users, providing context for discussions.
    • Comment Management: Provides API endpoints for approving, deleting, and updating comments, allowing for moderation and content management.
    • Thread Management: Allows creation, showing, and deletion of comment threads.
    • Filtering and Retrieval: Supports filtering comments by date and including comment authors in API responses.
    • Configuration options: Offers the possibility to automatically enable comments for datasets.

    Technical Integration: ckanext-comments integrates with CKAN through a plugin architecture. It requires installation as a Python package, activation in the CKAN configuration file (ckan.plugins), and database migrations to set up the necessary tables. The extension also provides a Jinja2 snippet (cooments/snippets/thread.html) for embedding comment threads into CKAN templates, allowing customization of the user interface. No WebUI changes are done by default - you have to include the provided snippet into the Jinja2 template.

    Benefits & Impact: Adding ckanext-comments to a CKAN instance permits increased user engagement through collaborative annotation and discussion. The ability to create threaded conversations on datasets, in particular, encourages dialogue about data quality, interpretation, and potential applications. This is most useful for research-focused organizations with a large community surrounding their data.

  14. D

    Digital Phenotyping via Social Media Content 2

    • ssh.datastations.nl
    ods, zip
    Updated Sep 21, 2023
    Cite
    SK Kavvadias; SK Kavvadias (2023). Digital Phenotyping via Social Media Content 2 [Dataset]. http://doi.org/10.17026/DANS-Z7G-9WEK
    Explore at:
    Available download formats: ods (6494136), zip (16922)
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    SK Kavvadias; SK Kavvadias
    License

    https://doi.org/10.17026/fp39-0x58

    Description

    The research project associated with this dataset focuses on the analysis of the top threads within the ddo subreddit. The dataset contains essential information about each of these threads, including the author's username, the post's title, the post text, its score, and the number of comments it has received. Additionally, it includes a detailed record of all comments within each thread, encompassing the commenter's username, the date and time of their comment, and the score received by each comment. The purpose of this project is to recognize addicted users within the ddo subreddit community by considering their activity patterns, emotional expressions, and content preferences, ultimately contributing to a deeper understanding of addiction-related behaviors in online communities and informing strategies for tailored support and interventions. Date Submitted: 2023-09-19

  15. e

    Offensive language dataset of French comments FRENK-fr 1.0 - Dataset -...

    • b2find.eudat.eu
    Updated May 28, 2024
    Cite
    (2024). Offensive language dataset of French comments FRENK-fr 1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/ac695c1a-9fed-5a8a-bfe5-9f877ba9fd66
    Explore at:
    Dataset updated
    May 28, 2024
    Area covered
    French
    Description

    The FRENK-fr dataset contains French socially unacceptable and acceptable comments posted in response to news articles that cover the topics of LGBT and migrants, and which were posted on Facebook by prominent French media outlets (20 minutes, Le Figaro and Le Monde). The original thread order of comments based on the time of publishing is preserved in the dataset. These comments were manually annotated for the type and target of socially unacceptable comments. The creation process, including data collection, filtering, annotation schema and annotation procedure, was adopted from the FRENK 1.1 dataset (http://hdl.handle.net/11356/1462), which makes FRENK-fr fully comparable to the datasets of Croatian, English and Slovenian comments included in the FRENK 1.1. Apart from manual annotation of the type and target of socially unacceptable discourse, the comments are accompanied with metadata, namely the topic of the news item (LGBT or migrants) that triggered the comment, the news item itself and the media outlet authoring it, an anonymised user ID, and information about the reply level in the thread. The dataset consists of 10,239 Facebook comments posted under 66 news items. It includes 3,071 comments that were labelled as socially unacceptable, and 7,168 that were labelled as socially acceptable.

  16. e

    DANS Data Station Social Sciences and Humanities

    • b2find.eudat.eu
    Updated Sep 19, 2023
    Cite
    (2023). DANS Data Station Social Sciences and Humanities [Dataset]. https://b2find.eudat.eu/dataset/6a3402bb-5c01-56aa-90c2-9ab9311428f0
    Explore at:
    Dataset updated
    Sep 19, 2023
    Description

    The research project associated with this dataset focuses on the analysis of the top threads within the ddo subreddit. The dataset contains essential information about each of these threads, including the author's username, the post's title, the post text, its score, and the number of comments it has received. Additionally, it includes a detailed record of all comments within each thread, encompassing the commenter's username, the date and time of their comment, and the score received by each comment. The purpose of this project is to recognize addicted users within the ddo subreddit community by considering their activity patterns, emotional expressions, and content preferences, ultimately contributing to a deeper understanding of addiction-related behaviors in online communities and informing strategies for tailored support and interventions. Date Submitted: 2023-09-19

  17. d

    Data from: Examining the Structure, Organization, and Processes of the...

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). Examining the Structure, Organization, and Processes of the International Market for Stolen Data, 2007-2012 [Dataset]. https://catalog.data.gov/dataset/examining-the-structure-organization-and-processes-of-the-international-market-for-st-2007-08271
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Description

    These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study was designed to understand the economic and social structure of the online market for stolen data. The data provide information on the costs of various forms of personal information and cybercrime services, the payment systems used, the social organization and structure of the market, and interactions between buyers, sellers, and forum operators. The PIs used these data to assess the economy of stolen data markets, the social organization of participants, and the payment methods and services used. The study utilized a sample of approximately 1,900 threads generated from 13 web forums, 10 of which used Russian as their primary language and three of which used English. These forums were hosted around the world, and acted as online advertising spaces for individuals to sell and buy a range of products. The content of these forums was downloaded and translated from Russian to English to create a purposive, yet convenient sample of threads from each forum. The collection contains 1 SPSS data file (ICPSR Submission Economic File SPSS.sav) with 39 variables and 13,735 cases and 1 Access data file (Social Network Analysis File Revised 04-11-14.mdb) with a total of 16 data tables and 199 variables. Qualitative data used to examine the associations and working relationships present between participants at the micro and macro levels are not available at this time.

  18. f

    Datasets properties.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Anna Chmiel; Julian Sienkiewicz; Mike Thelwall; Georgios Paltoglou; Kevan Buckley; Arvid Kappas; Janusz A. Hołyst (2023). Datasets properties. [Dataset]. http://doi.org/10.1371/journal.pone.0022207.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Anna Chmiel; Julian Sienkiewicz; Mike Thelwall; Georgios Paltoglou; Kevan Buckley; Arvid Kappas; Janusz A. Hołyst
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Properties of the three datasets: number of comments, number of different users giving comments, number of discussions/threads, average valence in the dataset, probability of finding positive, negative or neutral emotion, and values of exponents for positive, negative and neutral clusters. In the case of the Blogs data it was impossible to quantify the number of different users; note also the low number of comments in this dataset. Each dataset has a different overall average valence: BBC is strongly negative, Digg is mildly negative, while Blogs are mildly positive.

  19. O

    Data from: MuMiN

    • opendatalab.com
    zip
    Updated Mar 22, 2023
    Cite
    University of Bristol (2023). MuMiN [Dataset]. https://opendatalab.com/OpenDataLab/MuMiN
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    University of Bristol
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.

    MuMiN fills a gap in the existing misinformation datasets in multiple ways:

    • By having a large amount of social media information which has been semantically linked to fact-checked claims on an individual basis.
    • By featuring 41 languages, enabling evaluation of multilingual misinformation detection models.
    • By featuring tweets, articles, images, social connections and hashtags, enabling multimodal approaches to misinformation detection.

    MuMiN features two node classification tasks, related to the veracity of a claim:

    • Claim classification: Determine the veracity of a claim, given its social network context.
    • Tweet classification: Determine the likelihood that a social media post to be fact-checked is discussing a misleading claim, given its social network context.

    To use the dataset, see the "Getting Started" guide and tutorial at the MuMiN website.


Cite
Nicholas Landry; Nicholas Landry (2023). threads-math-sx [Dataset]. http://doi.org/10.5281/zenodo.10373324

threads-math-sx

Explore at:
27 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: json
Dataset updated
Dec 16, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nicholas Landry; Nicholas Landry
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

This is a temporal higher-order network dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In this dataset, nodes are users on https://math.stackexchange.com, and a hyperedge comes from users participating in a thread that lasts for at most 24 hours. The timestamps are the time of the post, but normalized so that the earliest post starts at 0.
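
The structure described above, timestamped sets of nodes with times shifted so the earliest is 0, can be sketched in a few lines. This is an illustration under assumptions rather than the dataset's loader: the on-disk JSON layout is not described here, so the input is taken to be plain (timestamp, nodes) pairs, and the function name is hypothetical.

```python
def normalize_timestamps(hyperedges):
    """Shift timestamps so that the earliest one becomes 0.

    `hyperedges` is a list of (timestamp, iterable_of_nodes) pairs.
    Returns a time-sorted list of (shifted_timestamp, frozenset_of_nodes).
    """
    if not hyperedges:
        return []
    t0 = min(t for t, _ in hyperedges)
    return sorted(
        ((t - t0, frozenset(nodes)) for t, nodes in hyperedges),
        key=lambda pair: pair[0],
    )
```

Using a frozenset for each hyperedge reflects that a hyperedge is a set of nodes, so a user who posts several times in one thread appears only once.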

Source of original data

Source: threads-math-sx dataset

References

If you use this data, please cite the following paper:
