18 datasets found

needle-threading
huggingface.co
Updated Nov 7, 2024
Cite
Jonathan Roberts (2024). needle-threading [Dataset]. https://huggingface.co/datasets/jonathan-roberts1/needle-threading
Dataset updated
Nov 7, 2024
Authors
Jonathan Roberts
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Dataset Summary

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. Although the development of longer context models has seen rapid gains recently, our understanding of how effectively they use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the… See the full description on the dataset page: https://huggingface.co/datasets/jonathan-roberts1/needle-threading.
Email Thread Summary Dataset
kaggle.com
Updated Sep 28, 2023
Cite
Marawan Mamdouh (2023). Email Thread Summary Dataset [Dataset]. https://www.kaggle.com/datasets/marawanxmamdouh/email-thread-summary-dataset
Dataset updated
Sep 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marawan Mamdouh
Description
Email Thread Summary Dataset

Overview:

The Email Thread Dataset consists of two main files: email_thread_details and email_thread_summaries. These files collectively offer a comprehensive compilation of email thread information alongside human-generated summaries.

Email Thread Details:

Description:

The email_thread_details file provides a detailed perspective on individual email threads, encompassing crucial information such as subject, timestamp, sender, recipients, and the content of the email.

Columns:

thread_id: A unique identifier for each email thread.

subject: Subject of the email thread.

timestamp: Timestamp indicating when the message was sent.

from: Sender of the email.

to: List of recipients of the email.

body: Content of the email message.

Additional Information:

The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.

Email Thread Summaries:

Description:

The email_thread_summaries file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.

Columns:

thread_id: A unique identifier for each email thread.

summary: A concise summary of the email thread.

Dataset Structure:

The dataset is organized into threads and emails. There are a total of 4,167 threads and 21,684 emails, providing a rich source of information for analysis and research purposes.

Threads: 4,167 threads

Emails: 21,684 emails

Language:

Languages: English (en)

Use Cases:

Natural Language Processing (NLP) Research:

Analyze email thread contents and human-generated summaries for advancements in NLP tasks.

Text Summarization Models:

Train and evaluate text summarization models using the provided email threads and summaries.

Email Analytics:

Gain insights into communication patterns, sender-receiver relationships, and content analysis.

File Formats:

CSV Files:

Easily importable into various data analysis tools.

Pickle (pkl) Files:

Facilitates direct reading of the "to" column as a column of lists of strings.

JSON Files:

Offers compatibility with JSON data structures, providing an additional option for users who prefer or require this widely-used format in their analytical workflows.
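As a rough illustration of why the Pickle files are convenient, the sketch below shows what reading the "to" column from CSV might involve. It assumes the CSV stores each recipient list as a Python-style list literal, which the description does not confirm; the sample rows and the `parse_recipients` helper are hypothetical.

```python
import ast
import csv
import io

# Hypothetical two-row excerpt; the real file would be CSV/email_thread_details.csv.
# Assumption: the "to" cell holds a stringified Python list.
SAMPLE_CSV = '''thread_id,subject,timestamp,from,to,body
1,Budget,2001-01-01 10:00:00,Alice,"['Bob', 'Carol']",Hello
1,Re: Budget,2001-01-01 11:00:00,Bob,['Alice'],Hi
'''


def parse_recipients(cell):
    """Turn a stringified list such as "['a', 'b']" into a list of strings."""
    value = ast.literal_eval(cell)  # safe evaluation of literals only
    return [str(item) for item in value]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["to"] = parse_recipients(row["to"])

print(rows[0]["to"])
```

If the real CSV encodes recipients differently, only `parse_recipients` would need to change.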

JSON File Features Description:

[
  {
    "thread_id": [unique identifier],
    "subject": "[email thread subject]",
    "timestamp": [timestamp in milliseconds],
    "from": "[sender's name and identifier]",
    "to": [ "[recipient 1]", "[recipient 2]", "[recipient 3]", ... ],
    "body": "[email content]"
  },
  ...
]

[
  {
    "thread_id": [unique identifier],
    "summary": "[summary content]"
  },
  ...
]
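The two JSON layouts above share the thread_id key, so emails can be grouped per thread and paired with their summary. The following is a minimal sketch under that assumption, using made-up records rather than real dataset content; the `threads` structure is purely illustrative.

```python
import json
from datetime import datetime, timezone

# Hypothetical records that follow the described field layout.
details_json = '''[
  {"thread_id": 1, "subject": "Budget", "timestamp": 978343200000,
   "from": "Alice", "to": ["Bob"], "body": "Hello"},
  {"thread_id": 1, "subject": "Re: Budget", "timestamp": 978346800000,
   "from": "Bob", "to": ["Alice"], "body": "Hi"}
]'''
summaries_json = '[{"thread_id": 1, "summary": "Alice and Bob discuss the budget."}]'

details = json.loads(details_json)
summaries = {item["thread_id"]: item["summary"] for item in json.loads(summaries_json)}

threads = {}
for email in details:
    # The timestamp is described as milliseconds, so divide by 1000 for seconds.
    sent = datetime.fromtimestamp(email["timestamp"] / 1000, tz=timezone.utc)
    thread = threads.setdefault(
        email["thread_id"],
        {"emails": [], "summary": summaries.get(email["thread_id"])},
    )
    thread["emails"].append((sent, email["from"], email["body"]))

print(len(threads[1]["emails"]), threads[1]["summary"])
```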

Files Structure:

- Dataset
  ├── CSV
  │   ├── email_thread_details.csv
  │   └── email_thread_summaries.csv
  ├── Pickle
  │   ├── email_thread_details.pkl
  │   └── email_thread_summaries.pkl
  └── JSON
      ├── email_thread_details.json
      └── email_thread_summaries.json

License:

This dataset is provided under the MIT License.

Disclaimer:

The dataset has been anonymized and sanitized to ensure privacy and confidentiality.
Cheltenham's Facebook Groups
kaggle.com
zip
Updated Apr 2, 2018
Cite
Mike Chirico (2018). Cheltenham's Facebook Groups [Dataset]. https://www.kaggle.com/datasets/mchirico/cheltenham-s-facebook-group
Available download formats: zip (0 bytes)
Dataset updated
Apr 2, 2018
Authors
Mike Chirico
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Facebook is becoming an essential tool for more than just family and friends. Discover how Cheltenham Township (USA), a diverse community just outside of Philadelphia, deals with major issues such as the Bill Cosby trial, everyday traffic issues, sewer I/I problems and lost cats and dogs. And yes, theft.

Communities work when they're connected and exchanging information. What and who are the essential forces making a positive impact, and when and how do conversational threads get directed or misdirected?

Use Any Facebook Public Group

You can leverage the examples here for any public Facebook group. For an example of the source code used to collect this data, and a quick start docker image, take a look at the following project: facebook-group-scrape.

Data Sources

There are 4 csv files in the dataset, with data from the following 5 public Facebook groups:

Unofficial Cheltenham Township

Elkins Park Happenings!

Free Speech Zone

Cheltenham Lateral Solutions

Cheltenham Township Residents

post.csv

These are the main posts you will see on the page. It might help to take a quick look at the page. Commas in the msg field have been replaced with {COMMA}, and apostrophes have been replaced with {APOST}.

gid Group id (5 different Facebook groups)

pid Main Post id

id Id of the user posting

name User's name

timeStamp

shares

url

msg Text of the message posted.

likes Number of likes
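Because commas and apostrophes in the msg field were replaced with placeholder tokens, the text has to be restored before analysis. A minimal sketch of that step follows; the function name `restore_msg` is hypothetical.

```python
def restore_msg(text):
    """Undo the {COMMA} and {APOST} placeholders described for the msg field."""
    return text.replace("{COMMA}", ",").replace("{APOST}", "'")


print(restore_msg("Lost cat{COMMA} it{APOST}s orange"))
```

The same helper should apply to the msg field in comment.csv if it uses the same placeholders, which the description does not state explicitly.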

comment.csv

These are comments to the main post. Note, Facebook postings have comments, and comments on comments.

gid Group id

pid Matches Main Post identifier in post.csv

cid Comment Id.

timeStamp

id Id of user commenting

name Name of user commenting

rid Id of user responding to first comment

msg Message

like.csv

These are likes and responses. The two keys in this file (pid,cid) will join to post and comment respectively.

gid Group id

pid Matches Main Post identifier in post.csv

cid Matches Comments id.

response Response such as LIKE, ANGRY etc.

id The id of user responding

name Name of the user responding
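The pid and cid keys can be used to attach each response to its post and, where present, its comment. The sketch below uses tiny made-up CSV excerpts with only some of the listed columns; treating an empty cid as "response to the post itself" is an assumption, not something the description confirms.

```python
import csv
import io

# Hypothetical excerpts using a subset of the documented columns.
POSTS = "gid,pid,id,name,msg\n1,p1,u1,Ann,Road closed\n"
COMMENTS = "gid,pid,cid,id,name,msg\n1,p1,c1,u2,Ben,Thanks\n"
LIKES = "gid,pid,cid,response,id,name\n1,p1,c1,LIKE,u3,Cy\n1,p1,,ANGRY,u2,Ben\n"


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


posts = {row["pid"]: row for row in read(POSTS)}
comments = {(row["pid"], row["cid"]): row for row in read(COMMENTS)}

joined = []
for like in read(LIKES):
    post = posts.get(like["pid"])
    comment = comments.get((like["pid"], like["cid"]))
    joined.append({
        "response": like["response"],
        "post_msg": post["msg"] if post else None,
        # Assumption: no matching cid means the response targets the post.
        "comment_msg": comment["msg"] if comment else None,
    })

for row in joined:
    print(row)
```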

member.csv

These are all the members in the group. Some members never, or rarely, post or comment. You may find multiple entries in this table for the same person. The name of the individual never changes, but they change their profile picture. Each profile picture change is captured in this table. Facebook gives users a new id in this table when they change their profile picture.

gid Group id

id Id of the member

name Name of the member

url URL of the member
reddit_threads
huggingface.co
Updated Apr 13, 2023
Cite
Graph Datasets (2023). reddit_threads [Dataset]. https://huggingface.co/datasets/graphs-datasets/reddit_threads
Dataset updated
Apr 13, 2023
Dataset authored and provided by
Graph Datasets
License
https://choosealicense.com/licenses/gpl-3.0/
Description
Dataset Card for Reddit threads

Dataset Summary

The Reddit threads dataset contains 'discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them' (doc).

Supported Tasks and Leaderboards

The related task is the binary classification to predict whether a thread is discussion based or not.

External Use PyGeometric

To load in… See the full description on the dataset page: https://huggingface.co/datasets/graphs-datasets/reddit_threads.
The Online conversation threads repository
dataverse.csuc.cat
txt
Updated Oct 13, 2023
Cite
Vicenç Gómez; Andreas Kaltenbrunner; David Laniado (2023). The Online conversation threads repository [Dataset]. http://doi.org/10.34810/data497
Available download formats: txt (6626), txt (1763476626), txt (110980658), txt (673642981)
Unique identifier
https://doi.org/10.34810/data497
Dataset updated
Oct 13, 2023
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Vicenç Gómez; Andreas Kaltenbrunner; David Laniado
License
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data497
Description
This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from different news aggregators (Slashdot, Barrapunto) and the English Wikipedia talk pages.

Slashdot conversations (Aug 2005 - Aug 2006)

Online conversations generated at Slashdot during a year. Posts and comments published between August 26th, 2005 and August 31st, 2006. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content. This dataset is different from the Slashdot Zoo social network (it is not a signed network of users) contained in the SNAP repository. It represents the full version of the dataset used in the CAW 2.0 - Content Analysis for the WEB 2.0 workshop for the WWW 2009 conference, which can be found in several repositories such as Konect.

Barrapunto conversations (Jan 2005 - Dec 2008)

Online conversations generated at Barrapunto (Spanish clone of Slashdot) during three years. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content.

Wikipedia (2001 - Mar 2010)

Data from article discussion (talk) pages of the English Wikipedia as of March 2010. It contains comments on about 870,000 articles (i.e. all articles which had a corresponding talk page with at least one comment), in total about 9.4 million comments. The oldest comments date back to as early as 2001.
Dataset for: The Evolution of the Manosphere Across the Web
data.niaid.nih.gov
Updated Aug 30, 2020
Cite
Savvas Zannettou (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
Dataset updated
Aug 30, 2020
Dataset provided by
Jeremy Blackburn
Stephanie Greenberg
Savvas Zannettou
Gianluca Stringhini
Summer Long
Barry Bradlyn
Manoel Horta Ribeiro
Emiliano De Cristofaro
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The Evolution of the Manosphere Across the Web

We make available data related to subreddit and standalone forums from the manosphere.

We also make available Perspective API annotations for all posts.

You can find the code in GitHub.

Please cite this paper if you use this data:

@article{ribeiroevolution2021,
  title={The Evolution of the Manosphere Across the Web},
  author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
  booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
  year={2021}
}

Reddit data

We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data is available, one line per post, in /ndjson/reddit.ndjson. A sample record is:

{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

Tallcels are fakecels and they all can (and should) suck my cock.

If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
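Records like the sample above can be read one line at a time. The sketch below is a minimal illustration that assumes date_post is a Unix timestamp in seconds (the sample shows an integer, but the description does not say so) and uses a neutral placeholder instead of real post text; `iter_posts` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# One hypothetical line in the shape of /ndjson/reddit.ndjson, with placeholder text.
SAMPLE = (
    '{"author": "user_a", "date_post": 1546300852, "id_post": "abcusl", '
    '"number_post": 9.0, "subreddit": "Braincels", '
    '"text_post": "placeholder text", "thread": "t3_abcusl"}\n'
)


def iter_posts(lines):
    """Yield one dict per non-empty NDJSON line, with light type clean-up."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # Assumption: date_post counts seconds since the Unix epoch.
        post["date_post"] = datetime.fromtimestamp(post["date_post"], tz=timezone.utc)
        post["number_post"] = int(post["number_post"])
        yield post


posts = list(iter_posts(SAMPLE.splitlines()))
print(posts[0]["subreddit"], posts[0]["date_post"].isoformat())
```

For the real file, passing an open file handle to `iter_posts` would avoid loading everything into memory at once.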

Forums

We here describe the .sqlite and .ndjson files that contain the data from the following forums.

(avfm) --- https://d2ec906f9aea-003845.vbulletin.net
(incels) --- https://incels.co/
(love_shy) --- http://love-shy.com/lsbb/
(redpilltalk) --- https://redpilltalk.com/
(mgtow) --- https://www.mgtow.com/forums/
(rooshv) --- https://www.rooshvforum.com/
(pua_forum) --- https://www.pick-up-artist-forum.com/
(the_attraction) --- http://www.theattractionforums.com/

The files are in folders /sqlite/ and /ndjson.

2.1 .sqlite

All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. Each file contains three tables:

idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

"type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticky/pool/locked posts;
"title": (str) title of the thread;
"link": (str) link to the thread;
"author_topic": (str) username that created the thread;
"replies": (int) number of replies, may differ from number of posts due to difference in crawling date;
"views": (int) number of views;
"subforum": (str) name of the subforum;
"collected": (bool) indicates if raw posts have been collected;
"crawled_idx_at": (str) datetime of the collection.

processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

"author": (str) author's username;
"resume_author": (str) author's little description;
"joined_author": (str) date author joined;
"messages_author": (int) number of messages the author has;
"text_post": (str) text of the main post;
"number_post": (int) number of the post in the thread;
"id_post": (str) unique post identifier (depends), for sure unique within thread;
"id_post_interaction": (list) list with other posts ids this post quoted;
"date_post": (str) datetime of the post;
"links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
"thread": (str) same as key;
"crawled_at": (str) datetime of the collection.

raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

2.2 .ndjson

Each line consists of a JSON object representing a different comment with the following fields:

"author": (str) author's username;
"resume_author": (str) author's little description;
"joined_author": (str) date author joined;
"messages_author": (int) number of messages the author has;
"text_post": (str) text of the main post;
"number_post": (int) number of the post in the thread;
"id_post": (str) unique post identifier (depends), for sure unique within thread;
"id_post_interaction": (list) list with other posts ids this post quoted;
"date_post": (str) datetime of the post;
"links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
"thread": (str) same as key;
"crawled_at": (str) datetime of the collection.

Perspective

We also ran each forum post and Reddit post through the Perspective API. The files are located in the /perspective/ folder and are compressed with gzip. One example output:

{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
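A gzip-compressed annotation file can be scanned without unpacking it to disk. The sketch below assumes the files hold one JSON object per line, which the description does not state; it builds a small in-memory file with placeholder text so that it runs on its own. The `toxic_ids` helper and the 0.5 threshold are illustrative choices.

```python
import gzip
import io
import json

# Hypothetical records in the shape of the example output, with placeholder text.
records = [
    {"id_post": 5200, "hate_output": {"text": "placeholder", "TOXICITY": 0.6079781, "SPAM": 0.17009115}},
    {"id_post": 5201, "hate_output": {"text": "placeholder", "TOXICITY": 0.1, "SPAM": 0.05}},
]

# Write a gzip-compressed, line-delimited file into memory.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
buffer.seek(0)


def toxic_ids(fileobj, threshold=0.5):
    """Return the id_post of every record whose TOXICITY score meets the threshold."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as fh:
        return [
            rec["id_post"]
            for rec in map(json.loads, fh)
            if rec["hate_output"]["TOXICITY"] >= threshold
        ]


flagged = toxic_ids(buffer)
print(flagged)
```

With a real file, a path such as one under /perspective/ could be passed in place of the in-memory buffer.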

Working with sqlite

A nice way to read some of the files of the dataset is using SqliteDict, for example:

from sqlitedict import SqliteDict

processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items():
    for post in posts:
        # here you could do something with each post in the dataset
        pass

Helpers

Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries of the users who posted on it, along with timestamps.

author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

These are used in the paper for the migration analyses.

Examples and particularities for forums

Although we did our best to clean the data and be consistent across forums, this was not always possible. The following subsections describe the particularities of each forum, note directions for improving the parsing that were not pursued, and give some examples of how things work in each forum.

6.1 incels

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: for the incels forum, the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.

quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

6.2 LoveShy

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: no types were parsed. There are some rules in the forum, but not significant.

quotes: quotes were obtained from exact text+author match, or author match + a jaccard
🇨🇦 Reddit r/Canada Subreddit Dataset
kaggle.com
Updated Jul 9, 2025
Cite
BwandoWando (2025). 🇨🇦 Reddit r/Canada Subreddit Dataset [Dataset]. https://www.kaggle.com/datasets/bwandowando/reddit-rcanada-subreddit-dataset
Dataset updated
Jul 9, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
BwandoWando
License
https://www.reddit.com/wiki/api
Area covered
Canada
Description
Context

I looked for an r/Canada/ dataset here on Kaggle and haven't found one, so I made one for the Canadian Kaggle members.

About


Created on Jan 25, 2008, r/Canada/ describes itself as follows:

Welcome to Canada’s official subreddit! This is the place to engage on all things Canada. Nous parlons en anglais et en français. Please be respectful of each other when posting, and note that users new to the subreddit might experience posting limitations until they become more active and longer members of the community. Do not hesitate to message the mods if you experience any issues!

This dataset can be used to extract insights from the trending topics and discussions in the subreddit.

Banner Image

Created with Bing Image Creator
Classification Graphs
kaggle.com
Updated Nov 12, 2021
Cite
Subhajit Sahu (2021). Classification Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-classification/activity
Dataset updated
Nov 12, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Deezer Ego Nets

The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.

Github Stargazers

The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component (with at least 10 users) of graphs.

Reddit Threads

Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).

Twitch Ego Nets

The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict using the ego-net whether the ego user plays a single or multiple games. Players who play a single game usually have a denser ego-net.
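The remark about density can be made concrete. The sketch below computes the standard density of a simple undirected graph, 2E / (N(N - 1)), from an edge list; the edge-list input format and the `density` helper are assumptions for illustration and do not reflect how the dataset files are actually laid out.

```python
def density(edges):
    """Density of a simple undirected graph given as (u, v) pairs.

    Self-loops and duplicate edges are ignored; nodes are taken from the edges.
    """
    nodes = set()
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue
        nodes.update((u, v))
        unique_edges.add(frozenset((u, v)))
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))


triangle = [("a", "b"), ("b", "c"), ("c", "a")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(density(triangle), density(path))
```

A density value like this could serve as one simple feature for the graph-level classification tasks described here.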

Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

http://snap.stanford.edu/data/index.html#disjointgraphs
Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...
datarade.ai
.json, .csv
Updated Aug 12, 2024
Cite
Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
Available download formats: .json, .csv
Dataset updated
Aug 12, 2024
Dataset authored and provided by
Dataplex
Area covered
Côte d'Ivoire, Martinique, Christmas Island, Chile, Holy See, Mexico, Jersey, Botswana, Macao, Gambia
Description
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

Dataset Overview:

This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

Sourced Directly from Reddit:

All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

Key Features:

Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.

User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.

Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.

AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

Use Cases:

Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.

Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.

Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.

Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

Data Quality and Reliability:

The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

Integration and Usability:

The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

User-Friendly Structure and Metadata:

The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

Ideal For:

Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.

Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.

Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Pull Request Review Comments Dataset
zenodo.org
application/gzip, bin
Updated Apr 24, 2025
Cite
Akshay Sinha (2025). Pull Request Review Comments Dataset [Dataset]. http://doi.org/10.5281/zenodo.4773068
Available download formats: application/gzip, bin
Unique identifier
https://doi.org/10.5281/zenodo.4773068
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Akshay Sinha
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Pull Request Review Comments (PRRC) Datasets

Two datasets have been created from the gharchive website. The Pull Request Review Comment Event was selected from the set of available GitHub events. This dataset has been created for CARA: Chatbot for Automating Repairnator Actions as part of a master's thesis at KTH, Stockholm.

First, a source dataset was downloaded from gharchive. It covers January 2015 to December 2019, consists of 37,358,242 PRRCs, and is over 12 Gigabytes in size. Downloading all the data files and extracting the PRRCs from them took over 100 hours. From this source dataset, two subsets were derived:

Pull Request Review Comments Dataset: This is the dataset of the comments from the first 100,000 threads in the source dataset from gharchive.

Pull Request Review Threads Dataset: This is the dataset of comments that were concatenated together if they were from the same thread.

Description

The dataset is stored in the JSONLines format, as was the source dataset from gharchive.

For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.

`comment_id` is the field which specifies the ID GitHub uses for that comment.

`commit_id` is the field which specifies the ID of the commit proposed in the pull request.

`url` is the field which specifies the url to the comment in a pull request thread.

`author` is the field which lists the username of the author of the pull request.

`created_at` is the field which specifies the time at which the pull request comment was created.

`body` is the field which describes the contents of the PRRC.

The threads dataset contains the fields `url` and `body` which contain similar information as described above. However, the body field differs: it is a concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`. They are the same fields from the initial dataset.
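
The relationship between the comments dataset and the threads dataset can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that comments belonging to one thread share the same `url` once any `#fragment` is stripped, and the `thread_key` parameter is a hypothetical hook for supplying a different grouping rule.

```python
import json
from collections import defaultdict


def default_thread_key(comment):
    # Assumption: comments in one thread share the URL up to the "#fragment".
    return comment["url"].split("#")[0]


def build_threads(jsonl_lines, thread_key=default_thread_key):
    """Concatenate the bodies of comments that share a thread key.

    Input order is preserved, both across threads and within a thread.
    """
    bodies = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONLines input
        comment = json.loads(line)
        bodies[thread_key(comment)].append(comment["body"])
    return [{"url": key, "body": "\n".join(parts)} for key, parts in bodies.items()]
```

Each returned record has only the `url` and `body` fields, mirroring the layout described for the threads dataset.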

Construction

We used the fasttext model published by Facebook to detect the language of the PRRC. Only those PRRCs in English were preserved. We also removed any PRRC or thread whose size exceeded 128 Kilobytes.
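
The size filter described above can be illustrated with a short sketch. It assumes that size is measured on the UTF-8 encoded `body` and that one Kilobyte is 1024 bytes; the description does not specify either point, and the function name is illustrative.

```python
MAX_BYTES = 128 * 1024  # assumption: 1 Kilobyte = 1024 bytes


def within_size_limit(body: str, max_bytes: int = MAX_BYTES) -> bool:
    """Return True if the UTF-8 encoded body does not exceed the limit."""
    return len(body.encode("utf-8")) <= max_bytes


def drop_oversized(records, max_bytes: int = MAX_BYTES):
    """Keep only records (dicts with a "body" field) within the size limit."""
    return [r for r in records if within_size_limit(r["body"], max_bytes)]
```

Measuring bytes rather than characters matters for non-ASCII text, where one character can occupy several bytes.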
Bluesky Social Dataset
zenodo.org
application/gzip, csv
Updated Jan 16, 2025
Cite
Andrea Failla; Giulio Rossetti (2025). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.14669616
Explore at:
Available download formats: application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.14669616
Dataset updated
Jan 16, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrea Failla; Giulio Rossetti
License
https://bsky.social/about/support/tos
Description
Bluesky Social Dataset

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

Dataset

Here is a description of the dataset files.

followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers representing a directed following relation (i.e., user u follows user v).

user_posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in a collection of files, each containing the post of an anonymized user. Each post is stored as a JSON-formatted line.

interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.

graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.

feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files, each containing the posts from one feed. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).

feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, user id, and timestamp.

feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the “liker”, the id of the post's author, the id of the liked post, and the like timestamp.

scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
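
As an illustration of the file layout described above, the follower edge list can be read without decompressing it to disk first. This is a minimal sketch using only the Python standard library; it assumes the first integer in each row is the follower (u) and the second the followed user (v), and it skips any row that is not two integers in case a header line is present. The function names are illustrative and unrelated to the scripts shipped in scripts.tar.gz.

```python
import csv
import gzip
from collections import Counter


def read_follow_edges(path):
    """Yield (follower, followed) integer pairs from a gzip-compressed CSV."""
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            try:
                yield int(row[0]), int(row[1])
            except ValueError:
                continue  # skip a header row, if one is present


def follower_counts(edges):
    """Count how many followers each user has (in-degree of the follow graph)."""
    counts = Counter()
    for _follower, followed in edges:
        counts[followed] += 1
    return counts
```

For example, `follower_counts(read_follow_edges("followers.csv.gz"))` streams the file once and keeps only the per-user counts in memory.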

Citation

If used for research purposes, please cite the following paper describing the dataset details:

Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PLOS ONE (2024) https://doi.org/10.1371/journal.pone.0310330

Right to Erasure (Right to be forgotten)

Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).

We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).

Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe; updates will occur monthly if necessary, and access to previous versions will be restricted.

Acknowledgments:

This work is supported by:

the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);

SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;

EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
Seawater carbonate chemistry and load at failure, thread extensibility, and...
b2find.eudat.eu
Updated Oct 18, 2018
+ more versions
Cite
(2018). Seawater carbonate chemistry and load at failure, thread extensibility, and total thread counts of the blue mussel (Mytilus edulis) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/03c27212-237b-53c7-a0b2-0edfe201f56c
Explore at:
Dataset updated
Oct 18, 2018
Description
Blue mussels (Mytilus edulis) produce byssal threads to anchor themselves to the substrate. These threads are always exposed to the surrounding environmental conditions. Understanding how environmental pH affects these threads is crucial to understanding how climate change can affect mussels. This work examines three factors (load at failure, thread extensibility, and total thread counts) that indicate the performance of byssal threads, as well as the condition index, to assess impacts on the physiological condition of mussels held in artificial seawater acidified by the addition of CO2. There was no significant variation between the control (786 μatm CO2 / 7.98 pH / 2805 μmol/kg total alkalinity) and acidified (2555 μatm CO2 / 7.47 pH / 2650 μmol/kg total alkalinity) treatment groups in any of these factors. The results of this study suggest that ocean acidification by CO2 addition has no significant effect on the quality and performance of threads produced by M. edulis. In order to allow full comparability with other ocean acidification data sets, the R package seacarb (Gattuso et al., 2019) was used to compute a complete and consistent set of carbonate system variables, as described by Nisumaa et al. (2010). In this dataset the original values were archived in addition to the recalculated parameters (see related PI). The date of carbonate chemistry calculation by seacarb is 2020-06-12.

Industrial screw driving dataset collection: Time series data for process...

zenodo.org

tar

Updated Jan 30, 2025

+ more versions

Cite

Nikolai West; Jochen Deuse (2025). Industrial screw driving dataset collection: Time series data for process monitoring and anomaly detection [Dataset]. http://doi.org/10.5281/zenodo.14769379

Explore at:

Available download formats: tar

Unique identifier

https://doi.org/10.5281/zenodo.14769379

Dataset updated

Jan 30, 2025

Dataset provided by

Nikolai West

Authors

Nikolai West; Jochen Deuse

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Industrial Screw Driving Datasets

Overview

This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.

Scenario name	Number of workpieces used in the experiments	Repetitions (screw cycles) per workpiece	Individual screws per workpiece	Total number of observations	Number of unique classes	Purpose
S01_thread-degradation	100	25	2	5,000	1	Investigation of thread degradation through repeated fastening
S02_surface-friction	250	25	2	12,500	8	Surface friction effects on screw driving operations
S03_error-collection-1		1	2		>20
S04_error-collection-2	2,500	1	2	5,000	25

Dataset Collection

The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:

Currently Available Datasets:

1. S01_thread-degradation

Focus: Investigation of thread degradation through repeated fastening
Samples: 5,000 screw operations (4,089 normal, 911 faulty)
Features: Natural degradation patterns, no artificial error induction
Equipment: Delta PT 40x12 screws, thermoplastic components
Process: 25 cycles per location, two locations per workpiece
First published in: HICSS 2024 (West & Deuse, 2024)

2. S02_surface-friction

Focus: Surface friction effects on screw driving operations
Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)
Features: Eight distinct surface conditions (baseline to mechanical damage)
Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials
Process: 25 cycles per location, two locations per workpiece
First published in: CIE51 2024 (West & Deuse, 2024) [DOI will be added after publication]
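
The observation totals in the scenario table and the normal/faulty splits listed for S01 and S02 can be cross-checked with simple arithmetic. The sketch below assumes each workpiece contributes repetitions × screws observations; the dictionary and function names are illustrative and are not part of the datasets or of PyScrew.

```python
# (workpieces, repetitions per workpiece, screws per workpiece), from the table.
SCENARIOS = {
    "S01_thread-degradation": (100, 25, 2),
    "S02_surface-friction": (250, 25, 2),
    "S04_error-collection-2": (2500, 1, 2),
}

# (normal, faulty) sample counts as listed for the two published scenarios.
SPLITS = {
    "S01_thread-degradation": (4089, 911),
    "S02_surface-friction": (9512, 2988),
}


def total_observations(workpieces, repetitions, screws_per_workpiece):
    """Total screw operations recorded for one scenario."""
    return workpieces * repetitions * screws_per_workpiece


totals = {name: total_observations(*spec) for name, spec in SCENARIOS.items()}

# True where normal + faulty adds up to the scenario total.
split_matches = {
    name: normal + faulty == totals[name]
    for name, (normal, faulty) in SPLITS.items()
}

for name, total in totals.items():
    print(f"{name}: {total} observations")
```

The S03 row is left out because its workpiece and observation counts are not given in the table.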

Upcoming Datasets:

3. S03_screw-error-collection-1 (recorded but unpublished)

Focus: Various manipulations of the screw driving process
Features: More than 20 different errors recorded
First published in: Publication planned
Status: In preparation

4. S04_screw-error-collection-2 (recorded but unpublished)

Focus: Various manipulations of the screw driving process
Features: 25 distinct errors recorded over the course of a week
First published in: Publication planned
Status: In preparation

5. S05_upper-workpiece-manipulations (recorded but unpublished)

Manipulations of the injection molding process with no changes during tightening

6. S06_lower-workpiece-manipulations (recorded but unpublished)

Manipulations of the injection molding process with no changes during tightening

Additional scenarios may be added to this collection as they become available.

Data Format

Each dataset follows a standardized structure:

JSON files containing individual screw operation data
CSV files with operation metadata and labels
Comprehensive documentation in README files
Example code for data loading and processing is available in the companion library PyScrew

Research Applications

These datasets are suitable for various research purposes:

Machine learning model development and validation
Process monitoring and control systems
Quality assurance methodology development
Manufacturing analytics research
Anomaly detection algorithm benchmarking

Usage Notes

All datasets include both normal operations and natural process anomalies
Complete time series data for torque, angle, and additional parameters available
Detailed documentation of experimental conditions and setup
Data collection procedures and equipment specifications available

Access and Citation

These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.

Related Tools

We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries. Common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.

Documentation

Each dataset includes:

Detailed README files
Data format specifications
Equipment and process parameters
Experimental setup documentation
Citation information

Contact and Support

For questions, issues, or collaboration interests regarding these datasets, please:

Open an issue in the respective GitHub repository
Contact the authors through the provided institutional channels

Acknowledgments

These datasets were collected and prepared by:

RIF Institute for Research and Transfer e.V.
Technical University Dortmund, Institute for Production Systems

The research was supported by:

German Ministry of Education and Research (BMBF)
European Union's "NextGenerationEU" program
The research is part of this funding program
More information regarding the research project is available here

BluePrint
huggingface.co
Updated May 28, 2025
Cite
Complex Data Lab (2025). BluePrint [Dataset]. http://doi.org/10.57967/hf/5425
Explore at:
Unique identifier
https://doi.org/10.57967/hf/5425
Dataset updated
May 28, 2025
Dataset authored and provided by
Complex Data Lab
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
📘 BluePrint

BluePrint is a large-scale dataset of social media conversation threads designed for evaluating and training LLM-based social media agents. It provides realistic, thread-structured data clustered into representative user personas at various levels of granularity.

✅ Key Features

Thread-Based Structure: Each example is a list of messages representing a user thread.
Persona Clustering: Users are clustered into 2, 25, 100, and 1000 representative personas to… See the full description on the dataset page: https://huggingface.co/datasets/ComplexDataLab/BluePrint.
Dataset: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from...
zenodo.org
bin, pdf
Updated Jul 22, 2024
Cite
Antonis Papasavva; Savvas Zannettou; Emiliano De Cristofaro; Gianluca Stringhini; Jeremy Blackburn (2024). Dataset: Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board [Dataset]. http://doi.org/10.5281/zenodo.3606810
Explore at:
Available download formats: bin, pdf
Unique identifier
https://doi.org/10.5281/zenodo.3606810
Dataset updated
Jul 22, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Antonis Papasavva; Savvas Zannettou; Emiliano De Cristofaro; Gianluca Stringhini; Jeremy Blackburn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset released with the paper titled: "Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board".

The dataset is a single newline-delimited JSON file. Each line in the file is a JSON object representing a full 4chan /pol/ thread. The JSON objects contain all the key/values returned by the 4chan API, along with three additional keys (entities, perspectives, and extracted_poster_id).

For each JSON object, we complement the data with the list of named entities detected in each post using the spaCy Python library. In addition, for each post we add seven scores returned by Google’s Perspective API, each in the [0, 1] interval.

For the detailed description of every key in the JSON structure, along with the type of the value, please read the readme.pdf file provided with this dataset.

If you find our dataset useful, please cite our paper:

@article{papasavva2020raiders,
  title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
  author={Antonis Papasavva and Savvas Zannettou and Emiliano De Cristofaro and Gianluca Stringhini and Jeremy Blackburn},
  journal={14th International AAAI Conference On Web And Social Media (ICWSM), 2020},
  year={2020}
}

How to extract the data:

Note that the data is compressed. See the instructions below on how to extract the data:

Linux and Mac

Step 1: Open a terminal window and navigate to the path where the file pol_0616-1119_labeled.tar.zst is located.

Step 2: Run the following command:

unzstd pol_0616-1119_labeled.tar.zst

The above command will produce a file named pol_0616-1119_labeled.tar in the same directory.

Step 3: Again, from your terminal window, run this command:

tar -xvf pol_0616-1119_labeled.tar

When the above command finishes, you will get (in the same directory) the extracted data - a file named pol_062016-112019_labeled.ndjson.
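
Once extracted, the file can be processed one thread at a time instead of being loaded into memory at once. The following is a minimal sketch with the Python standard library. It assumes each thread object keeps the 4chan API's `posts` array; readme.pdf is the authoritative description of the keys, and the function names here are illustrative.

```python
import json


def iter_threads(path):
    """Yield one parsed thread (a dict) per non-empty line of an NDJSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_posts(thread):
    # Assumption: posts are stored under the "posts" key, as in the 4chan API.
    return len(thread.get("posts", []))


def total_posts(path):
    """Sum the number of posts over all threads in the file."""
    return sum(count_posts(thread) for thread in iter_threads(path))
```

Because `iter_threads` is a generator, only one thread is held in memory at any time.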

Windows

There are many applications that can be used to extract this data on Windows available online. The authors cannot recommend specific applications. Note that the file is compressed twice so you will need to perform the data extraction twice - once on the downloaded file, and once on the file that was extracted from the downloaded file.

Please do not hesitate to contact the author of this study in case you face any problem at: antonis.papasavva@ucl.ac.uk
Cebulka (Polish dark web cryptomarket and image board) messages data
zenodo.org
data.niaid.nih.gov
csv, zip
Updated Mar 18, 2024
Cite
Piotr Siuda; Haitao Shi; Patrycja Cheba; Leszek Świeca (2024). Cebulka (Polish dark web cryptomarket and image board) messages data [Dataset]. http://doi.org/10.5281/zenodo.10810939
Explore at:
Available download formats: zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.10810939
Dataset updated
Mar 18, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Piotr Siuda; Haitao Shi; Patrycja Cheba; Leszek Świeca
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 2023
Description
General Information

1. Title of Dataset

Cebulka (Polish dark web cryptomarket and image board) messages data.

2. Data Collectors

Haitao Shi (The University of Edinburgh, UK); Patrycja Cheba (Jagiellonian University); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

3. Funding Information

The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

Data Collection Context

4. Data Source

Polish dark web cryptomarket and image board called Cebulka (http://cebulka7uxchnbpvmqapg5pfos4ngaxglsktzvha7a5rigndghvadeyd.onion/index.php).

5. Purpose

This dataset was developed within the abovementioned project. The project focuses on studying internet behavior concerning disruptive actions, particularly emphasizing the online narcotics market in Poland. The research seeks to (1) investigate how the open internet, including social media, is used in the drug trade; (2) outline the significance of darknet platforms in the distribution of drugs; and (3) explore the complex exchange of content related to the drug trade between the surface web and the darknet, along with understanding meanings constructed within the drug subculture.

Within this context, Cebulka is identified as a critical digital venue in Poland’s dark web illicit substances scene. Besides serving as a marketplace, it plays a crucial role in shaping the narratives and discussions prevalent in the drug subculture. The dataset has proved to be a valuable tool for performing the analyses needed to achieve the project’s objectives.

Data Content

6. Data Description

The data was collected in three periods, i.e., in January 2023, June 2023, and January 2024.

The dataset comprises a sample of messages posted on Cebulka from its inception until January 2024 (including all the messages with drug advertisements). These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories. The “cebulka_adverts” directory contains posts related to drug advertisements (both advertisements and comments). In contrast, the “cebulka_community” directory holds a sample of posts from other parts of the cryptomarket, i.e., those not related directly to trading drugs but rather focusing on discussing illicit substances. The dataset consists of 16,842 posts.

7. Data Cleaning, Processing, and Anonymization

The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
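
To make the idea of regex-based hashing concrete, here is a minimal sketch for e-mail addresses only. It is not the authors' pipeline: the pattern, the SHA-256 digest, the truncation to 16 hex characters, the optional salt, and the `<email:…>` placeholder format are all assumptions chosen for illustration.

```python
import hashlib
import re

# Deliberately simple pattern for illustration; real-world addresses vary more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str = "") -> str:
    """Return a short, irreversible token for an identifier (case-insensitive)."""
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return digest[:16]


def pseudonymize_emails(text: str, salt: str = "") -> str:
    """Replace every e-mail address in the text with a hashed placeholder."""
    return EMAIL_RE.sub(
        lambda match: f"<email:{hash_identifier(match.group(0), salt)}>", text
    )
```

The same address always maps to the same token, so repeated mentions stay linkable within the corpus without exposing the address itself.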

8. File Formats and Variables/Fields

The dataset consists of the following files:

Zipped .txt files (“cebulka_adverts.zip” and “cebulka_community.zip”) containing all messages. These files are organized into individual directories that mirror the folder structure found on Cebulka.

Two .csv files that list all the messages, including file names and the content of each post. The first .csv lists messages from “cebulka_adverts.zip,” and the second .csv lists messages from “cebulka_community.zip.”

Ethical Considerations

9. Ethics Statement

A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
Seawater carbonate chemistry and mussel attachment - Dataset - B2FIND
b2find.eudat.eu
Updated Nov 2, 2019
+ more versions
Cite
(2019). Seawater carbonate chemistry and mussel attachment - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/f18be50c-0188-55e1-9cc1-40d38742d182
Explore at:
Dataset updated
Nov 2, 2019
Description
Predicting how combinations of stressors will affect failure risk is a key challenge for the field of ecomechanics and, more generally, ecophysiology. Environmental conditions often influence the manufacture and durability of biomaterials, inducing structural failure that potentially compromises organismal reproduction, growth, and survival. Species known for tight linkages between structural integrity and survival include bivalve mussels, which produce numerous byssal threads to attach to hard substrate. Among the current environmental threats to marine organisms are ocean warming and acidification. Elevated pCO2 exposure is known to weaken byssal threads by compromising the strength of the adhesive plaque. This study uses structural analysis to evaluate how an additional stressor, elevated temperature, influences byssal thread quality and production. Mussels (Mytilus trossulus) were placed in controlled temperature and pCO2 treatments, and then, newly produced threads were counted and pulled to failure to determine byssus strength. The effects of elevated temperature on mussel attachment were dramatic; mussels produced 60% weaker and 65% fewer threads at 25°C in comparison to 10°C. These effects combine to weaken overall attachment by 64–88% at 25°C. The magnitude of the effect of pCO2 on thread strength was substantially lower than that of temperature and, contrary to our expectations, positive at high pCO2 exposure. Failure mode analysis localized the effect of temperature to the proximal region of the thread, whereas pCO2 affected only the adhesive plaques. The two stressors therefore act independently, and because their respective target regions are interconnected (resisting tension in series), their combined effects on thread strength are exactly equal to the effect of the strongest stressor. 
Altogether, these results show that mussels, and the coastal communities they support, may be more vulnerable to the negative effects of ocean warming than ocean acidification. In order to allow full comparability with other ocean acidification data sets, the R package seacarb (Gattuso et al., 2019) was used to compute a complete and consistent set of carbonate system variables, as described by Nisumaa et al. (2010). In this dataset the original values were archived in addition to the recalculated parameters (see related PI). The date of carbonate chemistry calculation by seacarb is 2020-07-07.
DANS Data Station Social Sciences and Humanities
b2find.eudat.eu
Updated Sep 19, 2023
Cite
(2023). DANS Data Station Social Sciences and Humanities [Dataset]. https://b2find.eudat.eu/dataset/6a3402bb-5c01-56aa-90c2-9ab9311428f0
Explore at:
Dataset updated
Sep 19, 2023
Description
The research project associated with this dataset focuses on the analysis of the top threads within the ddo subreddit. The dataset contains essential information about each of these threads, including the author's username, the post's title, the post text, its score, and the number of comments it has received. Additionally, it includes a detailed record of all comments within each thread, encompassing the commenter's username, the date and time of their comment, and the score received by each comment. The purpose of this project is to recognize addicted users within the ddo subreddit community by considering their activity patterns, emotional expressions, and content preferences, ultimately contributing to a deeper understanding of addiction-related behaviors in online communities and informing strategies for tailored support and interventions. Date Submitted: 2023-09-19