42 datasets found
  1. reddit_threads

    • huggingface.co
    Updated Apr 13, 2023
    Cite
    Graph Datasets (2023). reddit_threads [Dataset]. https://huggingface.co/datasets/graphs-datasets/reddit_threads
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Graph Datasets
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    Dataset Card for Reddit threads

      Dataset Summary
    

    The Reddit threads dataset contains 'discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them' (doc).

      Supported Tasks and Leaderboards
    

    The related task is the binary classification to predict whether a thread is discussion based or not.

      External Use

      PyGeometric

    To load in… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/reddit_threads.
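
    The card is truncated above. As a rough sketch, loading the graphs with the Hugging Face datasets library and converting one to a PyTorch Geometric Data object might look like this (the "full" split name and the edge_index/num_nodes/y field names are assumptions based on common graphs-datasets conventions, not confirmed by the truncated card):

    from datasets import load_dataset
    import torch
    from torch_geometric.data import Data

    # Split and field names below are assumptions; check the dataset card.
    graphs = load_dataset("graphs-datasets/reddit_threads", split="full")
    example = graphs[0]
    data = Data(
        edge_index=torch.tensor(example["edge_index"], dtype=torch.long),
        num_nodes=example["num_nodes"],
        y=torch.tensor([example["y"]]),  # discussion vs. non-discussion label
    )
    print(data)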

  2. threads-math-sx

    • explore.openaire.eu
    • zenodo.org
    Updated Dec 14, 2023
    Cite
    Nicholas Landry (2023). threads-math-sx [Dataset]. http://doi.org/10.5281/zenodo.10373323
    Explore at:
    Dataset updated
    Dec 14, 2023
    Authors
    Nicholas Landry
    Description

    Overview

    This is a temporal higher-order network dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In this dataset, nodes are users on https://math.stackexchange.com, and a hyperedge comes from users participating in a thread that lasts for at most 24 hours. The timestamps are the times of the posts, normalized so that the earliest post starts at 0.

    Source of original data

    Source: threads-math-sx dataset

    References

    If you use this data, please cite the following paper: Simplicial closure and higher-order link prediction. Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
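
    As an illustrative sketch only: if the archive follows the three-file simplex format used for the original threads-math-sx data by Benson et al. (an assumption; check the Zenodo record for the actual layout), the timestamped hyperedges could be reconstructed like this:

    # Rebuild timestamped hyperedges from the three-file simplex format:
    # -nverts: hyperedge sizes, -simplices: flattened node ids, -times: timestamps.
    def read_hyperedges(prefix):
        with open(prefix + "-nverts.txt") as f:
            sizes = [int(line) for line in f]
        with open(prefix + "-simplices.txt") as f:
            nodes = [int(line) for line in f]
        with open(prefix + "-times.txt") as f:
            times = [int(line) for line in f]
        hyperedges, pos = [], 0
        for size, t in zip(sizes, times):
            hyperedges.append((t, nodes[pos:pos + size]))
            pos += size
        return hyperedges

    hyperedges = read_hyperedges("threads-math-sx")  # assumed file prefix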

  3. Reddit Italy Coffee Dataset

    • kaggle.com
    Updated Mar 16, 2023
    Cite
    Luigi Cerone (2023). Reddit Italy Coffee Dataset [Dataset]. https://www.kaggle.com/datasets/gigggi/reddit-italy-coffee-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luigi Cerone
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Italy
    Description

    r/italy is a subreddit focused on discussions related to Italy, including news, culture, politics, and society. Users can post and comment on various topics related to Italy, including travel, language, cuisine, and more. Among the many threads that populate the subreddit, one of the most popular is the daily thread named "Caffè Italia." As the name suggests, this thread is a virtual coffeehouse where users can gather and exchange ideas on a variety of topics.

    Every day, a new "Caffè Italia" thread is created, and users are encouraged to participate by sharing their opinions, asking for advice, or simply chatting with others. The topics discussed in this thread can be very diverse, ranging from Italian cuisine and travel to politics, news, and social issues.

    The "Caffè Italia" thread provides an informal and friendly space where users can express themselves freely and connect with others who share their interests or concerns. It's a place where they can ask for recommendations on the best places to visit in Italy, share their thoughts on the latest news or events, or discuss cultural topics, such as literature, art, or music.

    What makes the "Caffè Italia" thread so unique is its sense of community. Users feel welcome and valued, and they often return to the thread to catch up with the latest discussions or to contribute to ongoing conversations. Many users have formed friendships and connections through the thread, which has become a hub for the r/italy community.

    In summary, the "Caffè Italia" thread is a daily gathering place for r/italy users to engage in conversations, share their experiences, and connect with others. Whether you're a first-time visitor to the subreddit or a seasoned member of the community, you're sure to find something interesting and engaging in the "Caffè Italia" thread.

    This dataset contains several months of data scraped from it. The code used to generate it is available on my GitHub profile.

  4. Data from: Dataset of discussion threads from Meneame

    • data.niaid.nih.gov
    • recerca.uoc.edu
    • +1more
    Updated Jan 24, 2020
    Cite
    Andreas Kaltenbrunner (2020). Dataset of discussion threads from Meneame [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2536217
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Andreas Kaltenbrunner
    Pablo Aragón
    Vicenç Gómez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset from our ICWSM 2017 paper. When using this resource, please use the following citation:

    Aragón P., Gómez V., Kaltenbrunner A. (2017) To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion. ICWSM-17, 11th International AAAI Conference on Web and Social Media, Montreal, Canada.

    @inproceedings{aragon2017ICWSM,
      author = {Arag\'on, Pablo and G\'omez, Vicen\c{c} and Kaltenbrunner, Andreas},
      title = {To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion},
      booktitle = {ICWSM-17 - 11th International AAAI Conference on Web and Social Media},
      publisher = {The AAAI Press},
      location = {Montreal, Canada},
      year = {2017}
    }

    More info about this dataset can also be found at:

    Aragón P., Gómez V., Kaltenbrunner A., (2017) Detecting Platform Effects in Online Discussions, Policy & Internet, 9, 2017.

    @article{aragon2017PI,
      author = {Arag\'on, Pablo and G\'omez, Vicen\c{c} and Kaltenbrunner, Andreas},
      title = {Detecting Platform Effects in Online Discussions},
      journal = {Policy \& Internet},
      volume = {9},
      number = {4},
      pages = {420-443},
      doi = {10.1002/poi3.158},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.158},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/poi3.158},
      year = {2017}
    }

    Crawling process

    We built a crawling process that collected all the stories on the front page of Meneame from 2011 to 2015 (both years included). We then performed a second crawling process to collect every comment from the discussion thread of each story. From both crawling processes, we obtained 72,005 stories and 5,385,324 comments.

    It is important to highlight two issues taken into account when the crawler was designed. First, the machine-readable robots.txt file on Meneame does not disallow this process. Second, the footer of Meneame indicates the licenses of the code, graphics and content of the website. The license for content is Attribution 3.0 Spain (CC BY 3.0 ES), which allows us to release this dataset.

    Fields

    Every discussion thread is stored in a JSON file named with the URL slug of the corresponding story in Meneame, located in a yyyy-mm-dd folder. The JSON file is an array of elements with the following fields (a loading sketch follows the list):

    id (string): ID of the story/comment

    sent (timestamp): Date of the story/comment as yyyy-MM-ddThh:mm:ssZ.

    message (string): Text of the story/comment

    user (string): Username of the story/comment author

    karma (number): Karma score of the comment when the crawling was performed

    comments_count (number): Number of comments in reply to the story/post

    votes (number): Number of votes to the story/comment

    thread (string): URL of the thread

    thread_id (string): Sequential arrival order within the thread (0 if story, >=1 if comment)

    depth (string): Depth within the thread (0 if story, >=1 if comment)

    url (string): URL of the specific story/comment

    title (string): Title, only available for stories.

    published (string): Date when published on the front page, only available for stories.

    tags (string): Tags, only available for stories.

    clics (string): Number of clicks, only available for stories.

    users (string): Number of user votes, only available for stories.

    anonymous (string): Number of anonymous votes, only available for stories.

    negatives (string): Number of negative votes, only available for stories.

    in_reply_to_id (string): ID of the parent story/comment, only available for comments.

    in_reply_to_user (string): Authoring user of the parent story/comment, only available for comments.

    in_reply_to_thread_id (string): Sequential arrival order within the thread of the parent story/comment, only available for comments.
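
    A minimal loading sketch, as referenced above (the root folder name is hypothetical; the layout follows the description: one JSON file per story inside yyyy-mm-dd folders):

    import glob
    import json
    import os

    def iter_threads(root):
        # One JSON array (story + comments) per file, in yyyy-mm-dd folders.
        for path in glob.glob(os.path.join(root, "*", "*.json")):
            with open(path, encoding="utf-8") as f:
                yield path, json.load(f)

    for path, items in iter_threads("./meneame"):  # hypothetical root folder
        # Per the field list, depth is stored as a string ("0" for the story).
        story = next(i for i in items if i["depth"] == "0")
        print(story["title"], story["comments_count"])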

    Acknowledgment

    This work is supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502).

  5. 4chan-pol-extensive

    • huggingface.co
    Updated Dec 30, 2024
    Cite
    mel (2024). 4chan-pol-extensive [Dataset]. http://doi.org/10.57967/hf/3931
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2024
    Authors
    mel
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    4chan /pol/ dataset

    This dataset contains data from 12,000+ threads from 4chan boards, collected and processed for research purposes. The data includes both active and archived threads, with extensive metadata and derived features for studying online discourse and community dynamics. I preserved thread structure, temporal information, and user interaction patterns while maintaining anonymity and excluding sensitive content.

      Dataset Details

      Dataset Sources and… See the full description on the dataset page: https://huggingface.co/datasets/vmfunc/4chan-pol-extensive.
  6. BluePrint

    • huggingface.co
    Updated May 28, 2025
    Cite
    Complex Data Lab (2025). BluePrint [Dataset]. http://doi.org/10.57967/hf/5425
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    Complex Data Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📘 BluePrint

    BluePrint is a large-scale dataset of social media conversation threads designed for evaluating and training LLM-based social media agents. It provides realistic, thread-structured data clustered into representative user personas at various levels of granularity.

      ✅ Key Features

    Thread-Based Structure: Each example is a list of messages representing a user thread.
    Persona Clustering: Users are clustered into 2, 25, 100, and 1000 representative personas to… See the full description on the dataset page: https://huggingface.co/datasets/ComplexDataLab/BluePrint.
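
    A minimal loading sketch with the Hugging Face datasets library (the configuration and split names are assumptions; check the dataset card for the persona-cluster granularities on offer):

    from datasets import load_dataset

    # If the dataset exposes several configurations (e.g. per cluster level),
    # load_dataset may require an explicit config name; this is an assumption.
    ds = load_dataset("ComplexDataLab/BluePrint")
    first_split = next(iter(ds))
    thread = ds[first_split][0]  # expected: a list of messages forming one thread
    print(thread)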

  7. Classification Graphs

    • kaggle.com
    Updated Nov 12, 2021
    Cite
    Subhajit Sahu (2021). Classification Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-classification/activity
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deezer Ego Nets

    The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.

    Github Stargazers

    The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component of each graph (with at least 10 users).

    Reddit Threads

    Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).

    Twitch Ego Nets

    The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict, using the ego-net, whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.

    Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general-purpose STL (Standard Template Library)-like library developed at the Jožef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#disjointgraphs

  8. Bluesky Social Dataset

    • zenodo.org
    application/gzip, csv
    Updated Jan 16, 2025
    + more versions
    Cite
    Andrea Failla; Giulio Rossetti (2025). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.14669616
    Explore at:
    Available download formats: application/gzip, csv
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    https://bsky.social/about/support/tos

    Description

    Bluesky Social Dataset

    Pollution of online social spaces caused by rampant dis/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

    Dataset

    Here is a description of the dataset files; a short reading sketch follows the list.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers representing a directed following relation (i.e., user u follows user v).
    • user_posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in a collection of files, each containing the post of an anonymized user. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as JSON-formatted lines. Fields correspond to those in user_posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, user id, and timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
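
    As mentioned above, a short reading sketch for the two edge lists (a sketch under the file formats stated in the list; the interaction column order follows the description):

    import csv
    import gzip

    # Stream the anonymized follower edge list: user u follows user v.
    with gzip.open("followers.csv.gz", mode="rt") as f:
        for u, v in csv.reader(f):
            edge = (int(u), int(v))

    # Stream the interactions edge list: six comma-separated integers per row.
    with gzip.open("interactions.csv.gz", mode="rt") as f:
        for row in csv.reader(f):
            (user_id, replied_author, thread_root_author,
             reposted_author, quoted_author, date) = map(int, row)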

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PLOS ONE (2024). https://doi.org/10.1371/journal.pone.0310330

    Right to Erasure (Right to be forgotten)

    Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before such date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on bluesky before March 21st, 2024.

    Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).

    We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).

    Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe - updates will occur monthly, if necessary, and access to previous versions will be restricted.

    Acknowledgments:

    This work is supported by :

    • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
      Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
    • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
    • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
  9. Data from: E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects...

    • zenodo.org
    bin, txt
    Updated May 20, 2025
    + more versions
    Cite
    Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck (2025). E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects [Dataset]. http://doi.org/10.5281/zenodo.14221860
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT

    End-to-End (E2E) testing is a comprehensive approach to validating the functionality of a software application by testing its entire workflow from the user's perspective, ensuring that all integrated components work together as expected. It is crucial for ensuring the quality and reliability of applications, especially in the web domain, which is often bound by Service Level Agreements (SLAs). This testing involves two key activities: Graphical User Interface (GUI) testing, which simulates user interactions through browsers, and performance testing, which evaluates system workload handling. Despite its importance, E2E testing is often neglected, and the lack of reliable datasets for web GUI and performance testing has slowed research progress. This paper addresses these limitations by constructing E2EGit, a comprehensive dataset cataloging non-trivial open-source web projects on GitHub that adopt GUI or performance testing.

    The dataset construction process involved analyzing over 5k non-trivial web repositories based on popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, PYTHON) to identify: 1) GUI tests based on popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), 2) performance tests written with the most popular open-source tools (JMETER, LOCUST). After analysis, we identified 472 repositories using web GUI testing, with over 43,000 tests, and 84 repositories using performance testing, with 410 tests.


    DATASET DESCRIPTION

    The dataset is provided as an SQLite database, whose structure is illustrated in Figure 3 (in the paper), and consists of five tables, each serving a specific purpose.

    The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with the corresponding fields (e.g., is_web_java) set to true, and the field web_dependencies listing the detected web frameworks.

    For web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, aggregating data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified.

    For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent, for instance if external files (e.g., CSVs defining workloads) were unavailable, or in the case of LOCUST tests, where parameters like duration and concurrent users are specified via the command line.
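
    A minimal sketch of opening the database and listing its tables (the database filename is hypothetical; the table names follow the description above):

    import sqlite3

    conn = sqlite3.connect("e2egit.sqlite")  # hypothetical filename
    tables = [name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    # Expected per the description: repository, non_trivial_repository,
    # gui_testing_test_details, gui_testing_repo_details,
    # performance_testing_test_details
    print(tables)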

    To cite this dataset, please use the following citation:

    @inproceedings{di2025e2egit,
    title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
    author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
    booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
    pages={10--15},
    year={2025},
    organization={IEEE/ACM}
    }

    This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.

  10. Dataset for: The Evolution of the Manosphere Across the Web

    • data.niaid.nih.gov
    Updated Aug 30, 2020
    Cite
    Emiliano De Cristofaro (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
    Explore at:
    Dataset updated
    Aug 30, 2020
    Dataset provided by
    Emiliano De Cristofaro
    Gianluca Stringhini
    Manoel Horta Ribeiro
    Jeremy Blackburn
    Stephanie Greenberg
    Summer Long
    Barry Bradlyn
    Savvas Zannettou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Evolution of the Manosphere Across the Web

    We make available data related to subreddit and standalone forums from the manosphere.

    We also make available Perspective API annotations for all posts.

    You can find the code on GitHub.

    Please cite this paper if you use this data:

    @article{ribeiroevolution2021,
      title = {The Evolution of the Manosphere Across the Web},
      author = {Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
      booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
      year = {2021}
    }

    1. Reddit data

    We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). These are available in /ndjson/reddit.ndjson, one line per post. A sample:

    { "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

    Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

    Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

    No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

    I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

    Tallcels are fakecels and they all can (and should) suck my cock.

    If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

    Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }

    2. Forums

    Here we describe the .sqlite and .ndjson files that contain the data from the following forums:

    (avfm) --- https://d2ec906f9aea-003845.vbulletin.net
    (incels) --- https://incels.co/
    (love_shy) --- http://love-shy.com/lsbb/
    (redpilltalk) --- https://redpilltalk.com/
    (mgtow) --- https://www.mgtow.com/forums/
    (rooshv) --- https://www.rooshvforum.com/
    (pua_forum) --- https://www.pick-up-artist-forum.com/
    (the_attraction) --- http://www.theattractionforums.com/

    The files are in folders /sqlite/ and /ndjson.

    2.1 .sqlite

    All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. This file contains three tables:

    idx: each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

    "type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticked/pool/locked posts;
    "title": (str) title of the thread;
    "link": (str) link to the thread;
    "author_topic": (str) username that created the thread;
    "replies": (int) number of replies, may differ from the number of posts due to difference in crawling date;
    "views": (int) number of views;
    "subforum": (str) name of the subforum;
    "collected": (bool) indicates if raw posts have been collected;
    "crawled_idx_at": (str) datetime of the collection.

    processed_posts: each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

    "author": (str) author's username;
    "resume_author": (str) author's short self-description;
    "joined_author": (str) date the author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (format depends on the forum), unique within a thread;
    "id_post_interaction": (list) ids of the other posts this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) the url parsed into components, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as the key;
    "crawled_at": (str) datetime of the collection.

    raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

    "post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

    2.2 .ndjson

    Each line consists of a json object representing a different comment with the following fields:

    "author": (str) author's username;
    "resume_author": (str) author's short self-description;
    "joined_author": (str) date the author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (format depends on the forum), unique within a thread;
    "id_post_interaction": (list) ids of the other posts this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) the url parsed into components, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as the key;
    "crawled_at": (str) datetime of the collection.

    3. Perspective

    We also ran each forum post and Reddit post through Perspective; the files are located in the /perspective/ folder. They are compressed with gzip. One example output:

    { "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }

    4. Working with sqlite

    A nice way to read some of the files of the dataset is using SqliteDict, for example:

    from sqlitedict import SqliteDict

    processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

    for key, posts in processed_posts.items():
        for post in posts:
            # here you could do something with each post in the dataset
            pass

    5. Helpers

    Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

    channel_dict.sqlite: a sqlite where each key corresponds to a subreddit and the values are lists of dictionaries of the users who posted on it, along with timestamps.

    author_dict.sqlite: a sqlite where each key corresponds to an author and the values are lists of dictionaries of the subreddits they posted on, along with timestamps.

    These are used in the paper for the migration analyses.

    6. Examples and particularities for forums

    Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we discuss the particularities of each forum and directions to improve the parsing which were not pursued, and we give some examples of how things work in each forum.

    6.1 incels

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: for the incels forum, the special types associated with each thread in the idx table are "Sticky", "Pool", "Closed", and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.

    quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

    6.2 LoveShy

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: no types were parsed. There are some rules in the forum, but not significant.

    quotes: quotes were obtained from exact text+author match, or author match + a jaccard

  11. 🇨🇦 Reddit r/Canada Subreddit Dataset

    • kaggle.com
    Updated Jul 9, 2025
    Cite
    BwandoWando (2025). 🇨🇦 Reddit r/Canada Subreddit Dataset [Dataset]. https://www.kaggle.com/datasets/bwandowando/reddit-rcanada-subreddit-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    https://www.reddit.com/wiki/api

    Area covered
    Canada
    Description

    Context

    I tried looking for an r/Canada dataset here on Kaggle and haven't found one, so I made one for the Canadian Kaggle members.

    About


    Created on Jan 25, 2008, r/Canada is described as:

    Welcome to Canada’s official subreddit! This is the place to engage on all things Canada. Nous parlons en anglais et en français. Please be respectful of each other when posting, and note that users new to the subreddit might experience posting limitations until they become more active and longer members of the community. Do not hesitate to message the mods if you experience any issues!

    This dataset can be used to extract insights from the trending topics and discussions in the subreddit.


  12. A distribution of the user replies in various classes for the two datasets

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Akram Osman; Naomie Salim; Faisal Saeed (2023). A distribution of the user replies in various classes for the two datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0215516.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Akram Osman; Naomie Salim; Faisal Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A distribution of the user replies in various classes for the two datasets.

  13. 2015 Notebook UX Survey

    • kaggle.com
    zip
    Updated May 1, 2017
    Cite
    Kaggle (2017). 2015 Notebook UX Survey [Dataset]. https://www.kaggle.com/datasets/kaggle/2015-notebook-ux-survey
    Explore at:
    Available download formats: zip (203077 bytes)
    Dataset updated
    May 1, 2017
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    At the end of 2015, the Jupyter Project conducted a UX Survey for Jupyter Notebook users. This dataset, Survey.csv, contains the raw responses.

    See the Google Group Thread for more context around this dataset.


  14. Offensive language dataset of French comments FRENK-fr 1.0

    • live.european-language-grid.eu
    binary format
    Updated May 26, 2024
    Cite
    (2024). Offensive language dataset of French comments FRENK-fr 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23627
    Explore at:
    Available download formats: binary format
    Dataset updated
    May 26, 2024
    License

    https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

    Area covered
    French
    Description

    The FRENK-fr dataset contains French socially unacceptable and acceptable comments posted in response to news articles that cover the topics of LGBT and migrants, and which were posted on Facebook by prominent French media outlets (20 minutes, Le Figaro and Le Monde). The original thread order of comments based on the time of publishing is preserved in the dataset.

    These comments were manually annotated for the type and target of socially unacceptable comments. The creation process, including data collection, filtering, annotation schema and annotation procedure, was adopted from the FRENK 1.1 dataset (http://hdl.handle.net/11356/1462), which makes FRENK-fr fully comparable to the datasets of Croatian, English and Slovenian comments included in the FRENK 1.1.

    Apart from manual annotation of the type and target of socially unacceptable discourse, the comments are accompanied with metadata, namely the topic of the news item (LGBT or migrants) that triggered the comment, the news item itself and the media outlet authoring it, an anonymised user ID, and information about the reply level in the thread.

    The dataset consists of 10,239 Facebook comments posted under 66 news items. It includes 3,071 comments that were labelled as socially unacceptable, and 7,168 that were labelled as socially acceptable.

  15. The Online conversation threads repository

    • dataverse.csuc.cat
    txt
    Updated Oct 13, 2023
    Cite
    Vicenç Gómez; Vicenç Gómez; Andreas Kaltenbrunner; Andreas Kaltenbrunner; David Laniado; David Laniado (2023). The Online conversation threads repository [Dataset]. http://doi.org/10.34810/data497
    Explore at:
    Available download formats: txt (6626 bytes), txt (1763476626 bytes), txt (110980658 bytes), txt (673642981 bytes)
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Vicenç Gómez; Vicenç Gómez; Andreas Kaltenbrunner; Andreas Kaltenbrunner; David Laniado; David Laniado
    License

    https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data497

    Description

    This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from different news aggregators (Slashdot, Barrapunto) and the English Wikipedia talk pages.

    Slashdot conversations (Aug 2005 - Aug 2006)

    Online conversations generated at Slashdot during a year. Posts and comments published between August 26th, 2005 and August 31st, 2006. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content. This dataset is different from the Slashdot Zoo social network (it is not a signed network of users) contained in the SNAP repository, and represents the full version of the dataset used in the CAW 2.0 - Content Analysis for the WEB 2.0 workshop for the WWW 2009 conference, which can be found in several repositories such as Konect.

    Barrapunto conversations (Jan 2005 - Dec 2008)

    Online conversations generated at Barrapunto (a Spanish clone of Slashdot) during three years. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content.

    Wikipedia (2001 - Mar 2010)

    Data from article discussion (talk) pages of the English Wikipedia as of March 2010. It contains comments on about 870,000 articles (i.e. all articles which had a corresponding talk page with at least one comment), in total about 9.4 million comments. The oldest comments date back to as early as 2001.

  16. Data from: Facebook metadata dataset LiLaH-HAG

    • live.european-language-grid.eu
    • repository.uantwerpen.be
    binary format
    Updated Aug 23, 2022
    Cite
    (2022). Facebook metadata dataset LiLaH-HAG [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20476
    Explore at:
    Available download formats: binary format
    Dataset updated
    Aug 23, 2022
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The LiLaH-HAG dataset (HAG is short for hate-age-gender) consists of metadata on Facebook comments to Facebook posts of mainstream media in Great Britain, Flanders, Slovenia and Croatia. The metadata available in the dataset are the hatefulness of the comment (0 is acceptable, 1 is hateful), age of the commenter (0-25, 26-30, 36-65, 65-), gender of the commenter (M or F), and the language in which the comment was written (EN, NL, SL, HR).

    The hatefulness of the comment was assigned by multiple well-trained annotators by reading comments in the order of appearance in a discussion thread, while the age and gender variables were estimated from the Facebook profile of a specific user by a single annotator.

  17. SynthPAI Dataset

    • paperswithcode.com
    Updated Jun 10, 2024
    + more versions
    Cite
    Hanna Yukhymenko; Robin Staab; Mark Vero; Martin Vechev (2024). SynthPAI Dataset [Dataset]. https://paperswithcode.com/dataset/synthpai
    Explore at:
    Dataset updated
    Jun 10, 2024
    Authors
    Hanna Yukhymenko; Robin Staab; Mark Vero; Martin Vechev
    Description

    SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to associated privacy concerns with real-world data, open datasets are rare (non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.

    Dataset Details

    Dataset Description

    SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.

    Curated by: The dataset was created by SRILab at ETH Zurich. It was not created on behalf of any outside entity.

    Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) (SERI-funded ERC Consolidator Grant). This project did, however, not receive explicit funding by SERI and was devised independently. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant.

    Shared by: SRILab at ETH Zurich

    Language(s) (NLP): English

    License: CC-BY-NC-SA-4.0

    Dataset Sources

    Repository: https://github.com/eth-sri/SynthPAI
    Paper: https://arxiv.org/abs/2406.07217

    Uses

    The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.

    Direct Use

    As in the associated paper, where we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.

    Out-of-Scope Use

    The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.

    Dataset Structure

    We provide the instance descriptions below. Each data point consists of a single comment (that can be a top-level post):

    Comment

    author str: unique identifier of the person writing

    username str: corresponding username

    parent_id str: unique identifier of the parent comment

    thread_id str: unique identifier of the thread

    children list[str]: unique identifiers of children comments

    profile Profile: profile making the comment - described below

    text str: text of the comment

    guesses list[dict]: Dict containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.

    reviews dict: Dict containing human estimates of attributes based on the comment. Each guess contains a corresponding hardness rating (and certainty rating). Contains all attributes.

    The associated profiles are structured as follows

    Profile

    username str: identifier

    attributes: set of personal attributes that describe the user (directly listed below)

    The corresponding attributes and values are

    Attributes

    Age continuous [18-99] The age of a user in years.

    Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)

    Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)

    Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.

    Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).

    Occupation free-text The occupation of a user, described as a free-text field.

    Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.

    Sex categorical [Male, Female] Biological Sex of a profile.
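
    A minimal sketch of iterating over the comments, assuming they ship as one JSON object per line (the filename is hypothetical; see the repository for the actual file layout and loader):

    import json

    with open("synthpai_comments.jsonl", encoding="utf-8") as f:  # hypothetical filename
        for line in f:
            comment = json.loads(line)
            # Field names follow the instance description above.
            print(comment["thread_id"], comment["author"], comment["text"][:60])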

    Dataset Creation

    Curation Rationale

    SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to associated privacy concerns with real-world data, open datasets are rare (non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.

    Source Data

    The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.

    Data Collection and Processing

    The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from sets of comments associated with each profile. Further, it was manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.

    Annotations

    Annotations are provided by authors of the paper.

    Personal and Sensitive Information

    All contained personal information is purely synthetic and does not relate to any real individual.

    Bias, Risks, and Limitations

    All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper. As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot provide respective guarantees.

    Citation

    BibTeX:

    @misc{2406.07217,
      Author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
      Title = {A Synthetic Dataset for Personal Attribute Inference},
      Year = {2024},
      Eprint = {arXiv:2406.07217},
    }

    APA:

    Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217.

    Dataset Card Authors

    Hanna Yukhymenko, Robin Staab, Mark Vero

  18. reddit-ArtistHate

    • huggingface.co
    Updated Jun 19, 2025
    Cite
    Trent Kelly (2025). reddit-ArtistHate [Dataset]. https://huggingface.co/datasets/trentmkelly/reddit-ArtistHate
    Explore at:
    Dataset updated
    Jun 19, 2025
    Authors
    Trent Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is this dataset?

    This dataset contains 831 thread/comment/comment-reply triplets from r/ArtistHate. You can use this dataset to create a fine-tuned LLM that hates AI as much as the r/ArtistHate users do. Each row has, in its system prompt, LLM-generated tone and instruction texts, allowing the resulting fine-tune to be steered. See the data explorer for examples of how to properly format the system prompt.

      Notice of Soul Trappin
    

    By permitting the inclusion of… See the full description on the dataset page: https://huggingface.co/datasets/trentmkelly/reddit-ArtistHate.

  19. Top 12 quality features for the NYC and Ubuntu datasets that were ranked...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Akram Osman; Naomie Salim; Faisal Saeed (2023). Top 12 quality features for the NYC and Ubuntu datasets that were ranked based on their IG, Chi2 and GR values. [Dataset]. http://doi.org/10.1371/journal.pone.0215516.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Akram Osman; Naomie Salim; Faisal Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Top 12 quality features for the NYC and Ubuntu datasets that were ranked based on their IG, Chi2 and GR values.

  20. ckanext-comments - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-comments - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-comments
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The ckanext-comments extension enhances CKAN by enabling threaded discussions on core entities within the platform. This allows for direct feedback, collaboration, and annotation of datasets, resources, groups, organizations, and user profiles. By providing an API-first approach, the extension facilitates the integration of commenting functionality into custom user interfaces or automated workflows.

    Key Features:

    Threaded Comments: Implements a threaded commenting system, allowing users to reply to existing comments and create structured discussions around datasets and other entities.

    API-First Design: Offers a comprehensive API for all commenting features, enabling programmatic access to comment creation, retrieval, modification, and deletion.

    Entity Linking: Links comment threads to specific CKAN entities, including datasets, resources, groups, organizations, and users, providing context for discussions.

    Comment Management: Provides API endpoints for approving, deleting, and updating comments, allowing for moderation and content management.

    Thread Management: Allows creation, showing, and deletion of comment threads.

    Filtering and Retrieval: Supports filtering comments by date and including comment authors in API responses.

    Configuration Options: Offers the possibility to automatically enable comments for datasets.

    Technical Integration: ckanext-comments integrates with CKAN through a plugin architecture. It requires installation as a Python package, activation in the CKAN configuration file (ckan.plugins), and database migrations to set up the necessary tables. The extension also provides a Jinja2 snippet (comments/snippets/thread.html) for embedding comment threads into CKAN templates, allowing customization of the user interface. No web UI changes are made by default; you have to include the provided snippet in your Jinja2 template (see the sketch below).

    Benefits & Impact: Adding ckanext-comments to a CKAN instance permits increased user engagement through collaborative annotation and discussion. The ability to create threaded conversations on datasets, in particular, encourages dialogue about data quality, interpretation, and potential applications. This is most useful for research-focused organizations with a large community surrounding their data.
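
    A hedged sketch of the wiring described above (the snippet arguments subject_id and subject_type are assumptions based on common CKAN snippet conventions; check the extension's README for the exact interface):

    # ckan.ini: activate the extension
    ckan.plugins = ... comments

    # In a Jinja2 dataset template, embed the comment thread snippet:
    {% snippet 'comments/snippets/thread.html', subject_id=pkg.id, subject_type='package' %}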
