Bluesky Social Dataset Pollution of online social spaces caused by rampant dis- and misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
Dataset. Here is a description of the dataset files.
followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing it yields 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each listing all users participating in a thread.
feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing it yields 11 files, one per feed. Posts are stored as JSON-formatted lines. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
feed_bookmarks.csv. This file lists users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, the user id, and the timestamp.
feed_post_likes.tar.gz. This compressed folder contains data on likes given to posts appearing in the feeds, one file per feed. Each record contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
scripts.tar.gz. A collection of Python scripts, including those originally used to crawl the data and to perform experiments. The scripts are detailed in a document released within the folder.
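As a minimal illustration of the interactions.csv.gz row format described above, the following Python sketch parses decompressed rows into dicts keyed by the documented field names. The sample values are hypothetical; real data would come from `gzip.open("interactions.csv.gz", "rt")`.

```python
import csv
import io

# Field order documented for interactions.csv.gz: six comma-separated integers per row.
FIELDS = ["user_id", "replied_author", "thread_root_author",
          "reposted_author", "quoted_author", "date"]

def parse_interactions(decompressed_text):
    """Parse decompressed interaction rows into dicts keyed by field name."""
    reader = csv.reader(io.StringIO(decompressed_text))
    return [dict(zip(FIELDS, map(int, row))) for row in reader if row]

# Hypothetical sample row (anonymized ids and a Unix timestamp for the date field).
rows = parse_interactions("12,34,34,0,0,1672531200\n")
```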
Citation. If used for research purposes, please cite the following paper describing the dataset details:
Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984
Acknowledgments: This work is supported by:
the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
POLITISKY24 (Political Stance Analysis on Bluesky for 2024) is a first-of-its-kind dataset for stance detection, focused on the 2024 U.S. presidential election. It is designed for target-specific, user-level stance detection and contains 16,044 user-target stance pairs centered on two key political figures, Kamala Harris and Donald Trump. In addition, the dataset includes detailed metadata, such as complete user posting histories and engagement graphs (likes, reposts, and quotes).
Stance labels were generated using a robust, evaluated pipeline that integrates state-of-the-art Information Retrieval (IR) techniques with Large Language Models (LLMs), offering confidence scores, reasoning explanations, and text spans for each label. With an LLM-assisted labeling accuracy of 81%, POLITISKY24 provides a rich resource for the target-specific stance detection task. This dataset enables exploration of the Bluesky platform, paving the way for deeper insights into political opinions and social discourse, and addressing gaps left by traditional datasets constrained by platform policies.
In the uploaded files:
The file user_post_history_dataset.parquet
includes the posting history of 8,561 active Bluesky users who have shared content related to American politics.
The file user_post_list_for_stance_detection.parquet
contains a list of up to 1,000 recent English-language post IDs per user, intended for use in the stance detection task.
The file user_network_dataset.parquet
captures users’ interactions through likes, reposts, and quotes.
The file human_annotated_validation_user_stance_dataset.parquet
contains human-annotated stance labels for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs. The labels are divided into three stances: 1 (favor), 2 (against), and 3 (neither).
The file llm_annotated_validation_user_stance_dataset.parquet
contains stance labels annotated by an LLM for the same 445 validation users toward Trump and Harris, also totaling 890 user-target pairs. In addition to stance labels, each pair includes an explanation of the reasoning, the source tweets, spans from the source tweets used in the reasoning, and a confidence score.
The file llm_annotated_full_user_stance_dataset.parquet
is similar to the above LLM-annotated validation file but covers all dataset users excluding the validation set. It provides stance labels for 8,022 users toward Trump and Harris, totaling 16,044 user-target pairs.
The file human_annotated_validation_stance_relevancy_dataset (post-target entity pairs).parquet
contains human-annotated stance labels for 175 validation posts toward Trump and Harris, resulting in 350 post-target pairs. The labels are divided into three stances: 1 (favor), 2 (against), and 3 (neither).
The file human_annotated_validation_stance_relevancy_dataset (query-post stance relevancy pairs).parquet
contains 700 query-post stance relevancy pairs derived from the post-target entity pairs.
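To illustrate the stance encoding shared by the annotation files above (1 = favor, 2 = against, 3 = neither), here is a small Python sketch that decodes user-target pairs into readable labels. The triples shown are hypothetical; real records live in the .parquet files and would typically be loaded with pandas.read_parquet.

```python
# Stance codes used in the annotation files: 1 = favor, 2 = against, 3 = neither.
STANCE_LABELS = {1: "favor", 2: "against", 3: "neither"}

def decode_stances(pairs):
    """Map (user_id, target, stance_code) triples to readable stance labels."""
    return [(user, target, STANCE_LABELS[code]) for user, target, code in pairs]

# Hypothetical user-target pairs mimicking the dataset's structure.
decoded = decode_stances([("u1", "Harris", 1), ("u1", "Trump", 2), ("u2", "Trump", 3)])
```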
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Five Million bluesky posts
This dataset contains 5 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data. It was inspired by Alpindale's original 2 million posts dataset, and expands on it with much more data. Alpindale's dataset did not include author handles or the image URLs & metadata included in the posts. The images and their captions could potentially… See the full description on the dataset page: https://huggingface.co/datasets/Roronotalt/bluesky-five-million.
This dataset consists of all starter packs and all following network data available on Bluesky in January and February 2025. Starter packs can be created by any Bluesky user. They are lists of users and curated feeds, with a minimum of 6 and a maximum of 150 users, assembled by the starter pack creator, who typically names them and provides a description. Other users can follow all users in a starter pack with a single click, or scroll through it to decide whom to follow. In our dataset, all DIDs (persistent, unique identifiers) are anonymized with a non-reversible hash function; users in the network, as well as users who created or appear in starter packs, are identified by their hashed DIDs. Similarly, starter packs themselves are identified by their hashed identifiers.
First, we include the Bluesky following network as it appeared in late January/early February 2025. This shows all available directed following relationships on Bluesky. We also include a network dataset of starter packs with information on creators and starter pack members. This is intended for users who wish to undertake a computational analysis of the networks created by starter packs or starter packs’ influences on networks.
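As a sketch of how the directed following network described above can be consumed, the snippet below builds an in-memory adjacency map from (follower, followee) pairs of hashed DIDs. The edge values are hypothetical stand-ins for the anonymized identifiers in the dataset.

```python
from collections import defaultdict

def build_following_graph(edges):
    """Directed adjacency map: hashed follower DID -> set of followed DIDs."""
    following = defaultdict(set)
    for follower, followee in edges:
        following[follower].add(followee)
    return following

# Hypothetical hashed DIDs standing in for the real anonymized identifiers.
graph = build_following_graph([("h1", "h2"), ("h1", "h3"), ("h2", "h3")])
```

The same pattern extends to the starter-pack membership data by treating (creator, member) pairs as edges.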
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Ten Million bluesky posts
This dataset contains 10 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data. It was inspired by Alpindale's original 2 million posts dataset, and expands on it with much more data. Alpindale's dataset did not include author handles or the image URLs & metadata included in the posts. The images and their captions could potentially… See the full description on the dataset page: https://huggingface.co/datasets/Roronotalt/bluesky-ten-million.
This dataset contains MODIS-derived daily mean shortwave blue sky albedo for northern North America (i.e., Canada and Alaska) and a set of quality control flags for each albedo value to aid in user interpretation. The data cover the period of February 24, 2000 through April 22, 2017. The blue sky albedo data were derived from the MODIS 500-m version 6 Bidirectional Reflectance Distribution Function and Albedo (BRDF/Albedo) Model Parameters MCD43A1 dataset (MCD43A1.006, https://doi.org/10.5067/MODIS/MCD43A1.006) (Schaaf & Wang, 2015a, please refer to the MCD43 documentation and user guides for more information). Blue sky refers to albedo calculated under real-world conditions with a combination of both diffuse and direct lighting based on atmospheric and view-geometry conditions. Daily mean albedo was calculated by averaging hourly instantaneous blue sky albedo values weighted by the solar insolation for each time interval. Potter et al. (2019, https://doi.org/10.1111/gcb.14888) is the associated paper for this dataset. Note the actual extent of the dataset in Figure 1 of the User Guide. Users are encouraged to refer to the User Guide for further important information about the use of this dataset.
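The insolation-weighted daily mean described above can be sketched in Python as follows. This is a simplified illustration of the stated averaging rule (hourly instantaneous albedo weighted by per-interval solar insolation), not the actual production code used to derive the product.

```python
def daily_mean_albedo(hourly_albedo, hourly_insolation):
    """Insolation-weighted mean of hourly instantaneous blue sky albedo.

    Each hourly albedo value is weighted by the solar insolation for its
    time interval, as described for this MODIS-derived product.
    """
    total = sum(hourly_insolation)
    if total == 0:
        # No insolation (e.g., polar night): daily mean is undefined.
        return float("nan")
    weighted = sum(a * w for a, w in zip(hourly_albedo, hourly_insolation))
    return weighted / total

# Hypothetical two-interval day: weights 1.0 and 3.0 pull the mean toward 0.75.
mean = daily_mean_albedo([0.25, 0.75], [1.0, 3.0])  # → 0.625
```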
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets required to reproduce the results presented in the paper "The Rise of Bluesky."
Due to its large size, the dataset used to construct the follower network in Fig. 1c is not included here. However, it may be made available upon request under exceptional circumstances.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains the code and output logs to run the BlueSky simulator for "Availability and utilisation of airspace structure in a U-space traffic management system". The code provided here is a modified version of the main fork of BlueSky (https://github.com/TUDelft-CNS-ATM/bluesky).
The first step is to install the correct environment. Refer to `condaenv.txt` for the list of packages needed to run the simulator.
After setting up the environment, we then need to save all of the potential paths of drones in `bluesky/plugins/streets/path_plan_dills`. Note that this takes about 180GB of storage, so make sure to have enough available. The paths can be downloaded from https://surfdrive.surf.nl/files/index.php/s/makXrEfPtrtdzaO. There are some example paths saved in this dataset, but it will not be possible to run all of the experiments without downloading the paths.
The scenarios for sub-experiment 1 are saved in `bluesky/scenario/subexperiment1`.
The scenarios for sub-experiment 2 are saved in `bluesky/scenario/subexperiment2`.
To run the scenarios we first need to start a bluesky server by running the following code inside `bluesky`:
`python BlueSky.py --headless`
In another terminal we can start a bluesky client by running:
`python BlueSky.py --client`
In the bluesky console we can now run each batch scenario by typing and entering:
`batch batch_subexperiment_1.scn` or
`batch batch_subexperiment_2.scn`
The logs of the scenarios are saved in `bluesky/output`.
Without the paths, it will not be possible to run the simulations. However, this code currently includes some paths so that it is possible to run some example scenarios. The zeroth repetition for the low imposed traffic demand case can be run without all of the paths. For example, `bluesky/scenario/subexperiment1/Flight_intention_low_40_0_1to1.scn` and `bluesky/scenario/subexperiment2/Flight_intention_low_40_0_baseline.scn` can be run directly with this dataset.
First start bluesky by running:
`python BlueSky.py`
In the console, type:
`ic subexperiment2/Flight_intention_low_40_0_baseline.scn`
Please do not hesitate to contact me with any questions.
-Andres
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains the code and output logs to run the BlueSky simulator for "U-Space Utilisation of Airspace under Various Layer Function Assignments and Allocations". The code provided here is a modified version of the main fork of BlueSky (https://github.com/TUDelft-CNS-ATM/bluesky).
The first step is to install the correct environment. Refer to `condaenv.txt` for the list of packages needed to run the simulator.
After setting up the environment, we then need to save all of the potential paths of drones in `bluesky/plugins/streets/path_plan_dills`. Note that this takes about 180GB of storage, so make sure to have enough available. The paths can be downloaded from https://surfdrive.surf.nl/files/index.php/s/EcPGLvaBu7cZfTA. There are some example paths saved in this dataset, but it will not be possible to run all of the experiments without downloading the paths.
The scenarios for sub-experiment 1 are saved in `bluesky/scenario/subexperiment1`.
The scenarios for sub-experiment 2 are saved in `bluesky/scenario/subexperiment2`.
To run the scenarios we first need to start a bluesky server by running the following code inside `bluesky`:
`python BlueSky.py --headless`
In another terminal we can start a bluesky client by running:
`python BlueSky.py --client`
In the bluesky console we can now run each batch scenario by typing and entering:
`batch batch_subexperiment_1.scn` or
`batch batch_subexperiment_2.scn`
The logs of the scenarios are saved in `bluesky/output`.
Without the paths, it will not be possible to run the simulations. However, this code currently includes some paths so that it is possible to run some example scenarios. The zeroth repetition for the low imposed traffic demand case can be run without all of the paths. For example, `bluesky/scenario/subexperiment1/Flight_intention_low_40_0_1to1.scn` and `bluesky/scenario/subexperiment2/Flight_intention_low_40_0_baseline.scn` can be run directly with this dataset.
First start bluesky by running:
`python BlueSky.py`
In the console, type:
`ic subexperiment2/Flight_intention_low_40_0_baseline.scn`
Please do not hesitate to contact me with any questions.
-Andres