https://choosealicense.com/licenses/gpl-3.0/
Dataset Card for Reddit threads
Dataset Summary
The Reddit threads dataset contains 'discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them' (doc).
Supported Tasks and Leaderboards
The related task is the binary classification to predict whether a thread is discussion based or not.
External Use
PyGeometric
To load in… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/reddit_threads.
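As a rough illustration of the PyGeometric usage mentioned above, the snippet below loads the Hugging Face release and wraps each graph in a torch_geometric Data object; the split name and the per-graph field names (edge_index, y, num_nodes) are assumptions to verify against the dataset card.

import torch
from datasets import load_dataset
from torch_geometric.data import Data

# Load the graphs from the Hugging Face Hub (split name is an assumption).
reddit = load_dataset("graphs-datasets/reddit_threads", split="full")

# Wrap each record as a PyG Data object for the discussion/non-discussion task.
graphs = [
    Data(
        edge_index=torch.tensor(g["edge_index"], dtype=torch.long),
        y=torch.tensor([g["y"]]),
        num_nodes=g["num_nodes"],
    )
    for g in reddit
]
print(len(graphs), graphs[0])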
Overview
This is a temporal higher-order network dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In this dataset, nodes are users on https://math.stackexchange.com, and a hyperedge comes from users participating in a thread that lasts for at most 24 hours. The timestamps are the times of the posts, normalized so that the earliest post starts at 0.
Source of original data
Source: threads-math-sx dataset
References
If you use this data, please cite the following paper: Simplicial closure and higher-order link prediction. Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
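A small parsing sketch for this hyperedge format, assuming the usual three-file layout of the threads-math-sx release (*-nverts.txt with the size of each hyperedge, *-simplices.txt with the concatenated node lists, *-times.txt with the timestamps); the file names should be checked against the download.

def read_hyperedges(prefix="threads-math-sx"):
    # Each hyperedge i spans nverts[i] consecutive entries of the simplices file.
    with open(f"{prefix}-nverts.txt") as f:
        nverts = [int(x) for x in f]
    with open(f"{prefix}-simplices.txt") as f:
        nodes = [int(x) for x in f]
    with open(f"{prefix}-times.txt") as f:
        times = [int(x) for x in f]

    hyperedges, pos = [], 0
    for n, t in zip(nverts, times):
        hyperedges.append((t, frozenset(nodes[pos:pos + n])))
        pos += n
    return hyperedges  # list of (timestamp, set of user ids), in file order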
https://creativecommons.org/publicdomain/zero/1.0/
r/italy
is a subreddit focused on discussions related to Italy, including news, culture, politics, and society.
Users can post and comment on various topics related to Italy, including travel, language, cuisine, and more.
Among the many threads that populate the subreddit, one of the most popular is the daily thread named "Caffè Italia." As the name suggests, this thread is a virtual coffeehouse where users can gather and exchange ideas on a variety of topics.
Every day, a new "Caffè Italia" thread is created, and users are encouraged to participate by sharing their opinions, asking for advice, or simply chatting with others. The topics discussed in this thread can be very diverse, ranging from Italian cuisine and travel to politics, news, and social issues.
The "Caffè Italia" thread provides an informal and friendly space where users can express themselves freely and connect with others who share their interests or concerns. It's a place where they can ask for recommendations on the best places to visit in Italy, share their thoughts on the latest news or events, or discuss cultural topics, such as literature, art, or music.
What makes the "Caffè Italia" thread so unique is its sense of community. Users feel welcome and valued, and they often return to the thread to catch up with the latest discussions or to contribute to ongoing conversations. Many users have formed friendships and connections through the thread, which has become a hub for the r/italy
community.
In summary, the "Caffè Italia" thread is a daily gathering place for r/italy
users to engage in conversations, share their experiences, and connect with others. Whether you're a first-time visitor to the subreddit or a seasoned member of the community, you're sure to find something interesting and engaging in the "Caffè Italia" thread.
This dataset contains several months of data scraped from it. The code used to generate it is available in my Github profile.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset from our ICWSM 2017 paper. When using this resource, please use the following citation:
Aragón P., Gómez V., Kaltenbrunner A. (2017) To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion, ICWSM-17 - 11th International AAAI Conference on Web and Social Media, Montreal, Canada.
@inproceedings{aragon2017ICWSM,
  author = {Arag\'on, Pablo and G\'omez, Vicen\c{c} and Kaltenbrunner, Andreas},
  title = {To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion},
  booktitle = {ICWSM-17 - 11th International AAAI Conference on Web and Social Media},
  publisher = {The AAAI Press},
  location = {Montreal, Canada},
  year = {2017}
}
More info about this dataset can also be found at:
Aragón P., Gómez V., Kaltenbrunner A. (2017) Detecting Platform Effects in Online Discussions, Policy & Internet, 9(4), 420-443.
@article{aragon2017PI,
  author = {Arag\'on, Pablo and G\'omez, Vicen\c{c} and Kaltenbrunner, Andreas},
  title = {Detecting Platform Effects in Online Discussions},
  journal = {Policy & Internet},
  volume = {9},
  number = {4},
  pages = {420-443},
  doi = {10.1002/poi3.158},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.158},
  eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/poi3.158},
  year = {2017}
}
Crawling process
We built a crawling process that collects all the stories on the front page of Meneame from 2011 to 2015 (both years included). We then performed a second crawling process to collect every comment from the discussion thread of each story. From both crawling processes, we obtained 72,005 stories and 5,385,324 comments.
It is important to highlight two issues taken into account when the crawler was designed. First, the machine-readable robots.txt file on Meneame does not disallow this process. Second, the footer of Meneame indicates the licenses of the code, graphics and content of the website. The license for the content is Attribution 3.0 Spain (CC BY 3.0 ES), which allows us to release this dataset.
Fields
Every discussion thread is stored in a JSON file named with the URL slug of the corresponding story in Meneame, located in a yyyy-mm-dd folder. The JSON file is an array of elements with the following fields:
id (string): ID of the story/comment
sent (timestamp): Date of the story/comment as yyyy-MM-ddThh:mm:ssZ.
message (string): Text of the story/comment
user (string): Username of the author of the story/comment
karma (number): Karma score of the comment when the crawling was performed
comments_count (number): Number of comments in reply to the story/post
votes (number): Number of votes to the story/comment
thread (string): URL of the thread
thread_id (string): Sequential arrival order within the thread (0 if story, >=1 if comment)
depth (string): Depth within the thread (0 if story, >=1 if comment)
url (string): URL of the specific story/comment
title (string): Title, only available for stories.
published (string): Date when published on the front page, only available for stories.
tags (string): Tags, only available for stories.
clics (string): Number of clicks, only available for stories.
users (string): Number of user votes, only available for stories.
anonymous (string): Number of anonymous votes, only available for stories.
negatives (string): Number of negative votes, only available for stories.
in_reply_to_id (string): ID of the parent story/comment, only available for comments.
in_reply_to_user (string): Authoring user of the parent story/comment, only available for comments.
in_reply_to_thread_id (string): Sequential arrival order within the thread of the parent story/comment, only available for comments.
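A minimal reading sketch for the layout described above; the root folder name and the *.json glob pattern are assumptions, since the description only states that each thread file is named after the story slug and stored in a yyyy-mm-dd folder.

import json
from collections import defaultdict
from pathlib import Path

def load_threads(root="meneame_data"):              # placeholder root folder
    for path in Path(root).glob("*/*.json"):        # <yyyy-mm-dd>/<story-slug>.json (pattern assumed)
        items = json.loads(path.read_text(encoding="utf-8"))
        children = defaultdict(list)
        for item in items:
            parent = item.get("in_reply_to_id")     # only present for comments
            if parent:
                children[parent].append(item["id"])
        # yields the slug, the raw items, and a parent-id -> child-ids map
        yield path.stem, items, children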
Acknowledgment
This work is supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502).
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
4chan /pol/ dataset
This dataset contains data from 12,000+ threads from 4chan boards, collected and processed for research purposes. The data includes both active and archived threads, with extensive metadata and derived features for studying online discourse and community dynamics. I preserved thread structure, temporal information, and user interaction patterns while maintaining anonymity and excluding sensitive content.
Dataset Details
Dataset Sources and… See the full description on the dataset page: https://huggingface.co/datasets/vmfunc/4chan-pol-extensive.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
📘 BluePrint
BluePrint is a large-scale dataset of social media conversation threads designed for evaluating and training LLM-based social media agents. It provides realistic, thread-structured data clustered into representative user personas at various levels of granularity.
✅ Key Features
Thread-Based Structure: Each example is a list of messages representing a user thread.
Persona Clustering: Users are clustered into 2, 25, 100, and 1000 representative personas to… See the full description on the dataset page: https://huggingface.co/datasets/ComplexDataLab/BluePrint.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.
The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component (with at least 10 users) of each graph.
Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).
The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict, using the ego-net, whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consists of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
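A minimal Snap.py sketch (the Python bindings distributed with SNAP), assuming a plain whitespace-separated edge list such as one of the thread reply networks above; the file name is a placeholder.

import snap

# Load an undirected graph from a two-column edge list (source, destination).
graph = snap.LoadEdgeList(snap.PUNGraph, "reddit_reply_edges.txt", 0, 1)
print("nodes:", graph.GetNodes(), "edges:", graph.GetEdges())
print("average clustering coefficient:", snap.GetClustCf(graph))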
https://bsky.social/about/support/tos
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.
Here is a description of the dataset files.
If used for research purposes, please cite the following paper describing the dataset details:
Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PlosOne (2024) https://doi.org/10.1371/journal.pone.0310330
Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.
Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).
We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).
Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe - updates will occur monthly, if necessary, and access to previous versions will be restricted.
This work is supported by:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
End-to-End (E2E) testing is a comprehensive approach to validating the functionality of a software application by testing its entire workflow from the user’s perspective, ensuring that all integrated components work together as expected. It is crucial for ensuring the quality and reliability of applications, especially in the web domain, which is often bound by Service Level Agreements (SLAs). This testing involves two key activities:
Graphical User Interface (GUI) testing, which simulates user interactions through browsers, and performance testing, which evaluates system workload handling. Despite its importance, E2E testing is often neglected, and the lack of reliable datasets for Web GUI and performance testing has slowed research progress. This paper addresses these limitations by constructing E2EGit, a comprehensive dataset, cataloging non-trivial open-source web projects on GITHUB that adopt GUI or performance testing.
The dataset construction process involved analyzing over 5k non-trivial web repositories based on popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, PYTHON) to identify: 1) GUI tests based on popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), 2) performance tests written with the most popular open-source tools (JMETER, LOCUST). After analysis, we identified 472 repositories using web GUI testing, with over 43,000 tests, and 84 repositories using performance testing, with 410 tests.
DATASET DESCRIPTION
The dataset is provided as an SQLite database, whose structure is illustrated in Figure 3 (in the paper), which consists of five tables, each serving a specific purpose.
The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with the corresponding fields (e.g., is_web_java) set to true and the web_dependencies field listing the detected web frameworks.
For Web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, which aggregates data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified.
For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent, for instance if external files (e.g., CSVs defining workloads) were unavailable, or in the case of LOCUST tests, where parameters like duration and concurrent users are specified via the command line.
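An exploration sketch against the released SQLite file; the database file name and the column name used in the aggregate query are assumptions, so inspect the schema (e.g. via sqlite_master) before relying on them.

import sqlite3

con = sqlite3.connect("e2egit.sqlite")                      # placeholder file name

# List the five tables described above.
print([row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])

# Example: count GUI-test files per browser-automation framework
# (the 'framework' column name is an assumption).
for framework, n in con.execute(
        "SELECT framework, COUNT(*) FROM gui_testing_test_details GROUP BY framework"):
    print(framework, n)
con.close()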
To cite this article refer to this citation:
@inproceedings{di2025e2egit,
title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
pages={10--15},
year={2025},
organization={IEEE/ACM}
}
This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Evolution of the Manosphere Across the Web
We make available data related to subreddit and standalone forums from the manosphere.
We also make available Perspective API annotations for all posts.
You can find the code in GitHub.
Please cite this paper if you use this data:
@article{ribeiroevolution2021,
  title = {The Evolution of the Manosphere Across the Web},
  author = {Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
  booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
  year = {2021}
}
We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data is available with one line per post in /ndjson/reddit.ndjson. A sample line, for example, is:
{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.
Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.
Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.
No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.
I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.
Tallcels are fakecels and they all can (and should) suck my cock.
If I were 17cm taller my life would be a heaven and I would be the happiest man alive.
Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
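A reading sketch for /ndjson/reddit.ndjson, assuming one JSON object per line with the fields shown in the sample above.

import json
from collections import Counter

posts_per_subreddit = Counter()
with open("ndjson/reddit.ndjson", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        posts_per_subreddit[post["subreddit"]] += 1
print(posts_per_subreddit.most_common(10))   # busiest of the 56 subreddits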
We here describe the .sqlite and .ndjson files that contain the data from the following forums.
(avfm) --- https://d2ec906f9aea-003845.vbulletin.net
(incels) --- https://incels.co/
(love_shy) --- http://love-shy.com/lsbb/
(redpilltalk) --- https://redpilltalk.com/
(mgtow) --- https://www.mgtow.com/forums/
(rooshv) --- https://www.rooshvforum.com/
(pua_forum) --- https://www.pick-up-artist-forum.com/
(the_attraction) --- http://www.theattractionforums.com/
The files are in folders /sqlite/ and /ndjson.
2.1 .sqlite
All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a python dictionary or a list. This file contains three tables:
idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:
"type": (list) in some forums you can add a descriptor such as
[RageFuel] to each topic, and you may also have special
types of posts, like sticked/pool/locked posts.
"title": (str) title of the thread;
"link": (str) link to the thread;
"author_topic": (str) username that created the thread;
"replies": (int) number of replies, may differ from number of
posts due to difference in crawling date;
"views": (int) number of views;
"subforum": (str) name of the subforum;
"collected": (bool) indicates if raw posts have been collected;
"crawled_idx_at": (str) datetime of the collection.
processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:
"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.
raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:
"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.
2.2 .ndjson
Each line consists of a json object representing a different comment with the following fields:
"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.
We also ran each forum post and Reddit post through the Perspective API; the files are located in the /perspective/ folder and are compressed with gzip. An example output:
{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
A nice way to read some of the files of the dataset is using SqliteDict, for example:
from sqlitedict import SqliteDict

processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items():
    for post in posts:
        # here you could do something with each post in the dataset
        pass
Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:
channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries of the users who posted on it, along with timestamps.
author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.
These are used in the paper for the migration analyses.
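A small sketch of how these helper files could be inspected with SqliteDict; the table name and the keys of the inner dictionaries ("subreddit", for example) are assumptions about the stored records.

from itertools import islice
from sqlitedict import SqliteDict

# Peek at a few authors and the subreddits they posted in (pass tablename=... if required).
with SqliteDict("./data/author_dict.sqlite") as author_dict:
    for author, activity in islice(author_dict.items(), 5):
        subreddits = {entry["subreddit"] for entry in activity}
        print(author, sorted(subreddits))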
Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we describe the particularities of each forum, point out directions to improve the parsing that were not pursued, and give some examples of how things work in each forum.
6.1 incels
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: for the incel forums the special types associated with each thread in the idx table are "Sticky", "Pool", "Closed", and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.
quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.
6.2 LoveShy
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: no types were parsed. There are some rules in the forum, but they are not significant.
quotes: quotes were obtained from exact text+author match, or author match + a jaccard
https://www.reddit.com/wiki/api
I've tried looking for an r/Canada dataset here on Kaggle but haven't found one, so I made one for the Canadian Kaggle members.
Created on Jan 25, 2008, r/Canada is described as follows:
Welcome to Canada’s official subreddit! This is the place to engage on all things Canada. Nous parlons en anglais et en français. Please be respectful of each other when posting, and note that users new to the subreddit might experience posting limitations until they become more active and longer members of the community. Do not hesitate to message the mods if you experience any issues!
This dataset can be used to extract insights from the trending topics and discussions in the subreddit.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A distribution of the user replies in various classes for the two datasets.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
At the end of 2015, the Jupyter Project conducted a UX Survey for Jupyter Notebook users. This dataset, Survey.csv, contains the raw responses.
See the Google Group Thread for more context around this dataset.
https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
The FRENK-fr dataset contains French socially unacceptable and acceptable comments posted in response to news articles that cover the topics of LGBT and migrants, and which were posted on Facebook by prominent French media outlets (20 minutes, Le Figaro and Le Monde). The original thread order of comments based on the time of publishing is preserved in the dataset.
These comments were manually annotated for the type and target of socially unacceptable comments. The creation process, including data collection, filtering, annotation schema and annotation procedure, was adopted from the FRENK 1.1 dataset (http://hdl.handle.net/11356/1462), which makes FRENK-fr fully comparable to the datasets of Croatian, English and Slovenian comments included in the FRENK 1.1.
Apart from manual annotation of the type and target of socially unacceptable discourse, the comments are accompanied with metadata, namely the topic of the news item (LGBT or migrants) that triggered the comment, the news item itself and the media outlet authoring it, an anonymised user ID, and information about the reply level in the thread.
The dataset consists of 10,239 Facebook comments posted under 66 news items. It includes 3,071 comments that were labelled as socially unacceptable, and 7,168 that were labelled as socially acceptable.
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data497
This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from different news aggregators (Slashdot, Barrapunto) and the English Wikipedia talk pages.
Slashdot conversations (Aug 2005 - Aug 2006): Online conversations generated at Slashdot during a year. Posts and comments published between August 26th, 2005 and August 31st, 2006. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content. This dataset is different from the Slashdot Zoo social network (it is not a signed network of users) contained in the SNAP repository and represents the full version of the dataset used in the CAW 2.0 - Content Analysis for the WEB 2.0 workshop for the WWW 2009 conference, which can be found in several repositories such as Konect.
Barrapunto conversations (Jan 2005 - Dec 2008): Online conversations generated at Barrapunto (Spanish clone of Slashdot) during three years. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content.
Wikipedia (2001 - Mar 2010): Data from article discussion (talk) pages of the English Wikipedia as of March 2010. It contains comments on about 870,000 articles (i.e. all articles which had a corresponding talk page with at least one comment), in total about 9.4 million comments. The oldest comments date back to as early as 2001.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The LiLaH-HAG dataset (HAG is short for hate-age-gender) consists of metadata on Facebook comments to Facebook posts of mainstream media in Great Britain, Flanders, Slovenia and Croatia. The metadata available in the dataset are the hatefulness of the comment (0 is acceptable, 1 is hateful), age of the commenter (0-25, 26-30, 36-65, 65-), gender of the commenter (M or F), and the language in which the comment was written (EN, NL, SL, HR).
The hatefulness of the comment was assigned by multiple well-trained annotators by reading comments in the order of appearance in a discussion thread, while the age and gender variables were estimated from the Facebook profile of a specific user by a single annotator.
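A quick aggregation sketch over the metadata; the file name and the column names (language, gender, hatefulness) are assumptions about how the table is laid out and should be checked against the release.

import pandas as pd

hag = pd.read_csv("lilah_hag.csv")                     # placeholder file name
# Share of hateful comments per language and commenter gender.
print(hag.groupby(["language", "gender"])["hatefulness"].mean())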
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to associated privacy concerns with real-world data, open datasets are rare (non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
Dataset Details
Dataset Description
SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Curated by: The dataset was created by SRILab at ETH Zurich. It was not created on behalf of any outside entity.
Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) through a SERI-funded ERC Consolidator Grant. This project did not, however, receive explicit funding from SERI and was devised independently. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant.
Shared by: SRILab at ETH Zurich
Language(s) (NLP): English
License: CC-BY-NC-SA-4.0
Dataset Sources
Repository: https://github.com/eth-sri/SynthPAI Paper: https://arxiv.org/abs/2406.07217
Uses
The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.
Direct Use
As in the associated paper, where we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.
Out-of-Scope Use
The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.
Dataset Structure
We provide the instance descriptions below. Each data point consists of a single comment (that can be a top-level post):
Comment
author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: Dict containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: Dict containing human estimates of attributes based on the comment. Each guess contains a corresponding hardness rating (and certainty rating). Contains all attributes
The associated profiles are structured as follows
Profile
username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
The corresponding attributes and values are
Attributes
Age continuous [18-99] The age of a user in years.
Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation free-text The occupation of a user, described as a free-text field.
Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.
Sex categorical [Male, Female] Biological Sex of a profile.
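A structural sketch that groups comments into threads and rebuilds the reply tree from parent_id, assuming the comments are stored as one JSON object per line with the fields listed above; the file name is a placeholder.

import json
from collections import defaultdict

threads = defaultdict(list)
with open("synthpai_comments.jsonl", encoding="utf-8") as f:   # placeholder file name
    for line in f:
        comment = json.loads(line)
        threads[comment["thread_id"]].append(comment)

def build_reply_tree(comments):
    # Map each parent_id to its direct replies; top-level posts have no parent comment.
    by_parent = defaultdict(list)
    for comment in comments:
        by_parent[comment.get("parent_id")].append(comment)
    return by_parent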
Dataset Creation
Curation Rationale
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to associated privacy concerns with real-world data, open datasets are rare (non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Source Data
The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.
Data Collection and Processing
The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from sets of comments associated with each profile. Further, it was manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
Annotations
Annotations are provided by authors of the paper.
Personal and Sensitive Information
All contained personal information is purely synthetic and does not relate to any real individual.
Bias, Risks, and Limitations
All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper. As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot provide respective guarantees.
Citation
BibTeX:
@misc{2406.07217,
  Author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
  Title = {A Synthetic Dataset for Personal Attribute Inference},
  Year = {2024},
  Eprint = {arXiv:2406.07217},
}
APA:
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217.
Dataset Card Authors
Hanna Yukhymenko Robin Staab Mark Vero
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
What is this dataset?
This dataset contains 831 thread/comment/comment-reply triplets from r/ArtistHate. You can use this dataset to create a fine-tuned LLM that hates AI as much as the r/ArtistHate users do. Each row in this dataset has, in its system prompt, LLM-generated tone and instruction texts, allowing the resulting fine-tune to be steered. See the data explorer for examples of how to properly format the system prompt.
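A formatting sketch for turning one triplet into a chat-style fine-tuning example; the column names (system_prompt, thread_title, comment, reply) are assumptions, so check the data explorer for the actual schema.

def to_chat_example(row):
    # One training example: steerable system prompt, thread + comment as input, reply as target.
    return {
        "messages": [
            {"role": "system", "content": row["system_prompt"]},
            {"role": "user", "content": f'{row["thread_title"]}\n\n{row["comment"]}'},
            {"role": "assistant", "content": row["reply"]},
        ]
    }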
Notice of Soul Trappin
By permitting the inclusion of… See the full description on the dataset page: https://huggingface.co/datasets/trentmkelly/reddit-ArtistHate.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 12 quality features for the NYC and Ubuntu datasets that were ranked based on their IG, Chi2 and GR values.
The ckanext-comments extension enhances CKAN by enabling threaded discussions on core entities within the platform. This allows for direct feedback, collaboration, and annotation of datasets, resources, groups, organizations, and user profiles. By providing an API-first approach, the extension facilitates the integration of commenting functionality into custom user interfaces or automated workflows.
Key Features:
Threaded Comments: Implements a threaded commenting system, allowing users to reply to existing comments and create structured discussions around datasets and other entities.
API-First Design: Offers a comprehensive API for all commenting features, enabling programmatic access to comment creation, retrieval, modification, and deletion.
Entity Linking: Links comment threads to specific CKAN entities, including datasets, resources, groups, organizations, and users, providing context for discussions.
Comment Management: Provides API endpoints for approving, deleting, and updating comments, allowing for moderation and content management.
Thread Management: Allows creation, showing, and deletion of comment threads.
Filtering and Retrieval: Supports filtering comments by date and including comment authors in API responses.
Configuration options: Offers the possibility to automatically enable comments for datasets.
Technical Integration: ckanext-comments integrates with CKAN through a plugin architecture. It requires installation as a Python package, activation in the CKAN configuration file (ckan.plugins), and database migrations to set up the necessary tables. The extension also provides a Jinja2 snippet (cooments/snippets/thread.html) for embedding comment threads into CKAN templates, allowing customization of the user interface. No WebUI changes are done by default - you have to include the provided snippet into the Jinja2 template.
Benefits & Impact: Adding ckanext-comments to a CKAN instance permits increased user engagement through collaborative annotation and discussion. The ability to create threaded conversations on datasets, in particular, encourages dialogue about data quality, interpretation, and potential applications. This is most useful for research-focused organizations with a large community surrounding their data.