5 datasets found
  1. Bluesky Social Dataset

    • zenodo.org
    application/gzip, csv
    Updated Jan 16, 2025
    Andrea Failla; Giulio Rossetti (2025). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.14669616
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    https://bsky.social/about/support/tos

    Description

    Bluesky Social Dataset

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

    Dataset

    Here is a description of the dataset files.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers representing a directed following relation (i.e., user u follows user v).
    • user_posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in a collection of files, each containing the posts of an anonymized user. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, user id, and timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
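
The file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' own tooling: the function names are hypothetical, and it assumes the CSV files carry no header row (a non-numeric row is simply skipped) and that a file extracted from user_posts.tar.gz can be opened as UTF-8 text.

```python
import csv
import gzip
import json
from collections import Counter

# Field order for interactions.csv.gz, as listed in the dataset description.
INTERACTION_FIELDS = [
    "user_id", "replied_author", "thread_root_author",
    "reposted_author", "quoted_author", "date",
]


def follower_counts(path):
    """Count followers per user from gzipped `u,v` rows (u follows v)."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            try:
                followed = int(row[1])
                int(row[0])  # validate the follower id as well
            except ValueError:
                continue  # assumption: a non-numeric row is a header
            counts[followed] += 1  # v gains one follower
    return counts


def read_interactions(path):
    """Yield one dict per interaction row, keyed by INTERACTION_FIELDS."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) == len(INTERACTION_FIELDS):
                # Values are kept as strings; how absent authors are
                # encoded is not stated in the description.
                yield dict(zip(INTERACTION_FIELDS, row))


def read_posts(path):
    """Yield one dict per post from a decompressed JSON-lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Streaming row by row keeps memory use bounded by the number of distinct users rather than the number of edges, which matters for an edge list of this size.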

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PlosOne (2024) https://doi.org/10.1371/journal.pone.0310330

    Right to Erasure (Right to be forgotten)

    Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

    Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).

    We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).

    Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe - updates will occur monthly, if necessary, and access to previous versions will be restricted.

    Acknowledgments:

    This work is supported by:

    • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
      Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
    • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
    • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
  2. POLITISKY24: U.S. Political Bluesky Dataset with Stance Labels

    • zenodo.org
    bin, csv, json
    Updated Jan 18, 2025
    Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery (2025). POLITISKY24: U.S. Political Bluesky Dataset with Stance Labels [Dataset]. http://doi.org/10.5281/zenodo.14671773
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    POLITISKY24 (Political Stance Analysis on Bluesky for 2024) is a first-of-its-kind dataset for stance detection, focused on the 2024 U.S. presidential election. It is designed for target-specific user-level stance detection and contains 16,044 user-target stance pairs centered on two key political figures, Kamala Harris and Donald Trump. In addition, this dataset includes detailed metadata, such as complete user posting histories and engagement graphs (likes, reposts, and quotes).

    Stance labels were generated using a robust and evaluated pipeline that integrates state-of-the-art Information Retrieval (IR) techniques with Large Language Models (LLMs), offering confidence scores, reasoning explanations, and text spans for each label. With an LLM-assisted labeling accuracy of 81%, POLITISKY24 provides a rich resource for the target-specific stance detection task. This dataset enables the exploration of the Bluesky platform, paving the way for deeper insights into political opinions and social discourse, and addressing gaps left by traditional datasets constrained by platform policies.

    In the uploaded files:

    • The file 'Human_annotation_on_validation_users.csv' contains human-annotated stance labels for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs.
      The labels are divided into four stances: 1 (favor), 2 (against), 3 (neutral), and 4 (unrelated). To simplify the stance annotations provided by the large language model, the "neutral" and "unrelated" categories are combined and represented as "neither."
    • The file 'LLM_annotation_on_validation_users.json' contains stance labels annotated by a state-of-the-art LLM for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs. In addition to stance labels, each pair includes an explanation of the reasoning, the source tweets, spans from the source tweets used in the reasoning, and a confidence score.
    • The file 'LLM_annotation_on_dataset_users.json' is similar to 'LLM_annotation_on_validation_users.json' but is generated for all dataset users excluding the validation set. It provides stance labels for 8,022 users toward Trump and Harris, totaling 16,044 user-target pairs.
    • The file 'Main_dataset_for_stance_detection.parquet' contains up to 1,000 recent English-language posts (including both original posts and reposts) from each of the 8,022 + 445 = 8,467 users. This file was used for the stance detection task.
    • The file 'Bluesky_dataset_on_us_politics.parquet' is similar to 'Main_dataset_for_stance_detection.parquet', but it contains all posts (including both original posts and reposts) from each of the 8,022 + 445 = 8,467 users.
    • The file 'Like_network.parquet' captures users' interactions through likes. Specifically, it contains the number of likes each user has given to original posts made by other users. It includes likes from 8,022 + 445 = 8,467 users, but it is not limited to interactions from these users alone.
    • The files 'Repost_network.parquet' and 'Quote_network.parquet' are similar to 'Like_network.parquet', but they capture users' interactions through reposts and quotes, respectively.
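
The label scheme described above (four human codes, with "neutral" and "unrelated" merged into "neither" for comparison with the LLM output) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the strings "favor" and "against" are assumed spellings for the LLM labels, and the simple agreement ratio shown here is not necessarily how the reported 81% accuracy was computed.

```python
# Human annotation codes, as described for
# Human_annotation_on_validation_users.csv.
HUMAN_LABELS = {1: "favor", 2: "against", 3: "neutral", 4: "unrelated"}


def to_llm_scheme(code):
    """Map a human label code (1-4) to favor / against / neither."""
    label = HUMAN_LABELS[int(code)]
    # "neutral" and "unrelated" are merged into a single "neither" class.
    return "neither" if label in ("neutral", "unrelated") else label


def agreement(human_codes, llm_labels):
    """Fraction of pairs where the collapsed human label equals the LLM label."""
    if len(human_codes) != len(llm_labels):
        raise ValueError("label lists must have the same length")
    if not human_codes:
        raise ValueError("no labels to compare")
    matches = sum(
        to_llm_scheme(human) == llm
        for human, llm in zip(human_codes, llm_labels)
    )
    return matches / len(human_codes)
```

For example, `agreement([1, 2, 3, 4], ["favor", "against", "neither", "favor"])` compares four user-target pairs, of which the first three match after collapsing.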

  3. ABoVE: MODIS-Derived Daily Mean Blue Sky Albedo for Northern North America,...

    • catalog.data.gov
    • cmr.earthdata.nasa.gov
    • +1 more
    Updated Sep 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ORNL_DAAC (2025). ABoVE: MODIS-Derived Daily Mean Blue Sky Albedo for Northern North America, 2000-2017 [Dataset]. https://catalog.data.gov/dataset/above-modis-derived-daily-mean-blue-sky-albedo-for-northern-north-america-2000-2017-7abac
    Dataset provided by
    Oak Ridge National Laboratory Distributed Active Archive Center
    Description

    This dataset contains MODIS-derived daily mean shortwave blue sky albedo for northern North America (i.e., Canada and Alaska) and a set of quality control flags for each albedo value to aid in user interpretation. The data cover the period of February 24, 2000 through April 22, 2017. The blue sky albedo data were derived from the MODIS 500-m version 6 Bidirectional Reflectance Distribution Function and Albedo (BRDF/Albedo) Model Parameters MCD43A1 dataset (MCD43A1.006, https://doi.org/10.5067/MODIS/MCD43A1.006; Schaaf & Wang, 2015a); please refer to the MCD43 documentation and user guides for more information. Blue sky refers to albedo calculated under real-world conditions with a combination of both diffuse and direct lighting based on atmospheric and view-geometry conditions. Daily mean albedo was calculated by averaging hourly instantaneous blue sky albedo values weighted by the solar insolation for each time interval. Potter et al. (2019, https://doi.org/10.1111/gcb.14888) is the associated paper for this dataset. The actual extent of the dataset is shown in Figure 1 of the User Guide. Users are encouraged to refer to the User Guide for further important information about the use of this dataset.
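
The insolation-weighted daily mean mentioned in the description can be illustrated with a short sketch. This is not the dataset producers' code: the function name and its handling of days without sunlight are assumptions made for illustration, and the inputs stand for hourly albedo values paired with the solar insolation of the same intervals.

```python
def daily_mean_albedo(hourly_albedo, hourly_insolation):
    """Return the insolation-weighted mean of hourly albedo values.

    Each albedo value is weighted by the solar insolation of its time
    interval, so hours with more incoming sunlight count for more.
    """
    if len(hourly_albedo) != len(hourly_insolation):
        raise ValueError("albedo and insolation must have the same length")
    total_insolation = sum(hourly_insolation)
    if total_insolation <= 0:
        # Assumption: with no incoming sunlight the mean is undefined.
        return None
    weighted_sum = sum(
        albedo * weight
        for albedo, weight in zip(hourly_albedo, hourly_insolation)
    )
    return weighted_sum / total_insolation
```

With two intervals of albedo 0.2 and 0.4 and insolation weights 1 and 3, the result is (0.2 × 1 + 0.4 × 3) / 4, which is closer to the albedo of the brighter interval than an unweighted mean would be.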

  4. Data from: The Rise of Bluesky

    • zenodo.org
    zip
    Updated Mar 28, 2025
    Özgür Can Seçkin; Filipi Nascimento Silva; Bao Tran Truong; Sangyeon Kim; Fan Huang; Chang Liu; Alessandro Flammini; Filippo Menczer (2025). The Rise of Bluesky [Dataset]. http://doi.org/10.5281/zenodo.15066073
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Özgür Can Seçkin; Filipi Nascimento Silva; Bao Tran Truong; Sangyeon Kim; Fan Huang; Chang Liu; Alessandro Flammini; Filippo Menczer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 21, 2025
    Description

    This repository contains the datasets required to reproduce the results presented in the paper "The Rise of Bluesky."

    • profile_creations.parquet.zip: Profile creation dates for each user (Fig.1a).
    • user_group_date_intervals.pickle.zip: The dates corresponding to each user group (Fig.1a).
    • total_engagement_by_day_by_user.parquet.zip: Daily total activity per user group (Fig.1b, d).
    • active_user_count_per_day_0_8.parquet.zip: Daily number of active users (Fig.1e, f).
    • summary_stats.zip: Files containing daily network statistics such as average degree and node count (Fig.1g).
    • group_degree_distributions.zip: Daily user group out-degree distributions (Fig.1g).
    • clustering_coef.parquet.zip: Clustering coefficients for follower network for each day (Fig.1h).
    • gini_and_kappa.zip: Daily Gini and Kappa statistics (Fig.1i).

    Due to its large size, the dataset used to construct the follower network in Fig. 1c is not included here. However, it may be made available upon request under exceptional circumstances.

  5. Collected feed statistics.

    • plos.figshare.com
    xls
    Updated Nov 5, 2024
    + more versions
    Andrea Failla; Giulio Rossetti (2024). Collected feed statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0310330.t003
    Dataset provided by
    PLOS ONE
    Authors
    Andrea Failla; Giulio Rossetti
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.

