100+ datasets found
  1. Bluesky Social Dataset

    • zenodo.org
    Updated Dec 2, 2024
    Cite
    Andrea Failla; Giulio Rossetti (2024). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.11082879
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bluesky Social Dataset

    1st Dec 2024. This version of the dataset has been superseded and is now restricted. Please refer to the most recent release.

    Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

    Dataset

    Here is a description of the dataset files.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
    • posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files, each containing the posts from one feed. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
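
    The file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' released scripts: the helper names are hypothetical, and it assumes that the CSV files carry no header row, that every populated field is an integer as stated above, and that an empty interaction field (if any occur) means the value does not apply to that row.

```python
import csv
import gzip
import io
import json
from collections import Counter

# Field order for interactions.csv.gz, as listed in the file description.
INTERACTION_FIELDS = (
    "user_id", "replied_author", "thread_root_author",
    "reposted_author", "quoted_author", "date",
)


def read_follow_edges(stream):
    """Yield (u, v) integer pairs, meaning that user u follows user v."""
    for row in csv.reader(stream):
        if len(row) == 2:  # skip blank or malformed rows
            yield int(row[0]), int(row[1])


def follower_counts(edges):
    """Count followers per user: every edge (u, v) adds one follower to v."""
    return Counter(v for _, v in edges)


def read_interactions(stream):
    """Yield one dict per row, keyed by INTERACTION_FIELDS.

    Assumption: an empty field means the value does not apply to that row.
    """
    for row in csv.reader(stream):
        if len(row) == len(INTERACTION_FIELDS):
            values = [int(cell) if cell.strip() else None for cell in row]
            yield dict(zip(INTERACTION_FIELDS, values))


def read_json_lines(stream):
    """Yield one parsed object per non-empty JSON-formatted line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def open_gzip_text(path):
    """Open a .gz file as text, e.g. a local copy of followers.csv.gz."""
    return gzip.open(path, mode="rt", newline="")


# Tiny in-memory stand-ins for the real files.
edges = list(read_follow_edges(io.StringIO("1,2\n3,2\n2,1\n")))
print(follower_counts(edges).most_common(1))
```

    With a local copy of the data, the same readers can be fed from open_gzip_text("followers.csv.gz"), so the edge list is streamed row by row rather than loaded into memory at once.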

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year Worth of Social Data". PLOS ONE (2024). https://doi.org/10.1371/journal.pone.0310330

    Right to Erasure (Right to be forgotten)

    Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

    Users included in the Bluesky dataset have the right to opt out and request the removal of their data, in accordance with GDPR provisions (Article 17). It should be noted, however, that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides derogations (Article 17(3)(d) and Article 89).

    We emphasize that, in compliance with GDPR (Article 4(5)), the released data has been thoroughly pseudonymized. Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to further protect individual privacy.
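
    To make the two measures named above concrete, the sketch below shows one generic way to replace identifiers with integers and to coarsen timestamps. It is an illustration only and is not the authors' procedure: the class and function names are hypothetical, the example handle is made up, and the one-minute granularity is an assumption, since the description does not state how coarse the released timestamps are.

```python
from datetime import datetime


class Pseudonymizer:
    """Replace identifiers with integers assigned in order of first appearance."""

    def __init__(self):
        self._table = {}

    def pseudonym(self, identifier):
        # len() is evaluated before any insertion, so new IDs run 0, 1, 2, ...
        return self._table.setdefault(identifier, len(self._table))


def coarsen(timestamp):
    """Truncate a timestamp to the minute (the granularity is an assumption)."""
    return timestamp.replace(second=0, microsecond=0)


ids = Pseudonymizer()
record = {
    "author": ids.pseudonym("alice.example"),  # hypothetical handle
    "created": coarsen(datetime(2024, 3, 20, 14, 37, 52)),
}
print(record)
```

    Under a scheme like this, the lookup table is the only link between an integer and the original identifier, so it would have to be kept out of any release.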

    If you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with subject "Removal request: [username]").
    We will process your request within a reasonable timeframe.

    Acknowledgments:

    This work is supported by:

    • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
      Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
    • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
    • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
  2. two-million-bluesky-posts

    • huggingface.co
    Updated Nov 27, 2024
    Cite
    Alpin (2024). two-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/alpindale/two-million-bluesky-posts
    Dataset updated
    Nov 27, 2024
    Authors
    Alpin
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    2 Million Bluesky Posts

    This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data. The with-language-predictions config contains the same data as the default config, with language predictions added using the glotlid model.

    Dataset Description

    This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose… See the full description on the dataset page: https://huggingface.co/datasets/alpindale/two-million-bluesky-posts.

  3. bluesky-posts

    • huggingface.co
    Updated Dec 1, 2024
    Cite
    alim maasoglu (2024). bluesky-posts [Dataset]. https://huggingface.co/datasets/withalim/bluesky-posts
    Dataset updated
    Dec 1, 2024
    Authors
    alim maasoglu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    8 Million Bluesky Social Posts Collection

    I've collected and curated 8 million public posts from Bluesky Social between November 27 and December 1, 2024, with an additional 12 million posts coming in the upcoming weeks. This growing dataset aims to provide researchers and developers with a comprehensive sample of real-world social media data for analysis and experimentation. This collection represents one of the largest publicly available Bluesky datasets, offering unique insights… See the full description on the dataset page: https://huggingface.co/datasets/withalim/bluesky-posts.

  4. bluesky

    • huggingface.co
    Updated Nov 29, 2024
    Cite
    Roro (2024). bluesky [Dataset]. https://huggingface.co/datasets/Roronotalt/bluesky
    Dataset updated
    Nov 29, 2024
    Authors
    Roro
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Five Million bluesky posts

    This dataset contains 5 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data. It was inspired by Alpindale's original 2 million posts dataset and expands on it with much more data. Alpin's dataset did not include the author handles, image URLs, or metadata that accompanied the posts. The images and their captions could potentially… See the full description on the dataset page: https://huggingface.co/datasets/Roronotalt/bluesky.

  5. bluesky.social Traffic Analytics Data

    • analytics.explodingtopics.com
    Updated Jun 1, 2025
    Cite
    (2025). bluesky.social Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/bluesky.social
    Dataset updated
    Jun 1, 2025
    Variables measured
    Global Rank, Monthly Visits, Authority Score, US Country Rank
    Description

    Traffic analytics, rankings, and competitive metrics for bluesky.social as of June 2025

  6. Data from: BlueTempNet: A Temporal Multi-network Dataset of Social...

    • ieee-dataport.org
    Updated Oct 2, 2024
    Cite
    Ujun Jeong (2024). BlueTempNet: A Temporal Multi-network Dataset of Social Interactions in Bluesky Social [Dataset]. https://ieee-dataport.org/documents/bluetempnet-temporal-multi-network-dataset-social-interactions-bluesky-social
    Dataset updated
    Oct 2, 2024
    Authors
    Ujun Jeong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    including user-to-user interactions (following and blocking users) and user-to-community interactions (creating and joining communities).

  7. Data from: The Rise of Bluesky

    • zenodo.org
    Updated Mar 28, 2025
    Cite
    Özgür Can Seçkin; Filipi Nascimento Silva; Bao Tran Truong; Sangyeon Kim; Fan Huang; Chang Liu; Alessandro Flammini; Filippo Menczer (2025). The Rise of Bluesky [Dataset]. http://doi.org/10.5281/zenodo.15066073
    Available download formats: zip
    Dataset updated
    Mar 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Özgür Can Seçkin; Filipi Nascimento Silva; Bao Tran Truong; Sangyeon Kim; Fan Huang; Chang Liu; Alessandro Flammini; Filippo Menczer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 21, 2025
    Description

    This repository contains the datasets required to reproduce the results presented in the paper "The Rise of Bluesky."

    • profile_creations.parquet.zip: Profile creation dates for each user (Fig.1a).
    • user_group_date_intervals.pickle.zip: The dates corresponding to each user group (Fig.1a).
    • total_engagement_by_day_by_user.parquet.zip: Daily total activity per user group (Fig.1b, d).
    • active_user_count_per_day_0_8.parquet.zip: Daily number of active users (Fig.1e, f).
    • summary_stats.zip: Files containing daily network statistics such as average degree and node count (Fig.1g).
    • group_degree_distributions.zip: Daily user group out-degree distributions (Fig.1g).
    • clustering_coef.parquet.zip: Clustering coefficients for follower network for each day (Fig.1h).
    • gini_and_kappa.zip: Daily Gini and Kappa statistics (Fig.1i).

    Due to its large size, the dataset used to construct the follower network in Fig. 1c is not included here. However, it may be made available upon request under exceptional circumstances.
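
    For readers unfamiliar with the summary statistics named in the file list, the sketch below computes an average degree and a Gini coefficient from a list of per-user out-degrees. It is a generic illustration using the standard rank-weighted Gini formula; the function names and sample numbers are made up, and the exact definitions used in the paper (including the Kappa statistic, which is not covered here) are not given in this listing.

```python
def average_degree(degrees):
    """Mean of a non-empty sequence of node degrees."""
    return sum(degrees) / len(degrees)


def gini(values):
    """Gini coefficient of non-negative values; 0 means perfectly equal."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over ascending values x_1 <= ... <= x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


out_degrees = [0, 0, 0, 4]  # made-up out-degrees for four users
print(average_degree(out_degrees), gini(out_degrees))
```

    A value near 0 indicates that follows are spread evenly across users, while a value near 1 indicates that a few users account for most of them.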

  8. Global Bluesky users 2025

    • statista.com
    Updated Aug 14, 2025
    Cite
    Statista (2025). Global Bluesky users 2025 [Dataset]. https://www.statista.com/statistics/1536616/global-bluesky-users/
    Dataset updated
    Aug 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2024 - Dec 2024
    Area covered
    Worldwide
    Description

    Bluesky experienced rapid user growth in late 2024. The platform's user base expanded from 14.5 million in October 2024 to 38 million by August 2025, showcasing its increasing popularity among social media users seeking new options.

    Surge in downloads and user engagement

    The platform's growth was particularly notable following the U.S. presidential elections in November 2024, when monthly downloads surged to 7.35 million. This increase in user adoption coincided with rising demand for Twitter alternatives. Earlier in the year, Bluesky had already shown strong performance, with 38,000 downloads from Android devices and 30,000 from iOS devices in July 2024.

    Moderation challenges and user demographics

    As Bluesky's user base expanded, so did the need for content moderation. In 2024, the platform received 6.48 million reports to its moderation service, a significant increase from 358,000 reports in 2023. These reports included 1.75 million for anti-social behavior, 1.2 million for misleading content, and 1.4 million for spam.

  9. Bluesky: global moderation reports 2023-2024

    • statista.com
    Updated Feb 14, 2025
    Cite
    Statista (2025). Bluesky: global moderation reports 2023-2024 [Dataset]. https://www.statista.com/statistics/1552694/bluesky-moderation-reports-worldwide/
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    Bluesky saw a significant increase in user reports to its moderation service in 2024. The number of reports jumped from 358,000 in 2023 to 6.48 million in 2024, indicating a growing user base and increased platform activity. This surge in moderation reports coincided with a spike in monthly downloads, particularly after the U.S. presidential elections in November 2024, when Bluesky downloads reached 7.35 million.

    Breakdown of moderation reports

    The 6.48 million reports submitted to Bluesky's moderation service in 2024 covered various issues. Anti-social behavior accounted for 1.75 million reports, while misleading content and spam received 1.2 million and 1.4 million reports, respectively. These figures suggest that users actively engaged in flagging content that violated platform guidelines. Additionally, Bluesky received 238 requests from law enforcement, governments, and legal entities, responding to 182 of them. The most common legal requests were for user data, followed by takedown requests and inquiries.

    Comparison with other platforms

    While Bluesky experienced growth in user reports, other social media platforms like Facebook saw fluctuations in content moderation. In the third quarter of 2024, Facebook removed 6.4 million pieces of hate speech content, down from 7.2 million in the previous quarter. Similarly, Facebook took action on 7.6 million pieces of bullying and harassment related content in the same period, a slight decrease from 7.8 million in the previous quarter. These comparisons highlight the ongoing challenges social media platforms face in content moderation and user safety.

  10. Blue Sky Cross Street Data in Springville, IN

    • ownerly.com
    Updated Dec 10, 2021
    Cite
    Ownerly (2021). Blue Sky Cross Street Data in Springville, IN [Dataset]. https://www.ownerly.com/in/springville/blue-sky-home-details
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Springville
    Description

    This dataset provides information about the number of properties, residents, and average property values for Blue Sky cross streets in Springville, IN.

  11. Bluesky data: underlying the publication “Velocity Obstacle Based Conflict...

    • datasetcatalog.nlm.nih.gov
    Updated Feb 3, 2021
    Cite
    Ribeiro, Marta (2021). Bluesky data: underlying the publication “Velocity Obstacle Based Conflict Avoidance in Urban Environment with Variable Speed Limit” [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000761372
    Dataset updated
    Feb 3, 2021
    Authors
    Ribeiro, Marta
    Description

    Bluesky scenarios and result files used in the work "Velocity Obstacle Based Conflict Avoidance in Urban Environment with Variable Speed Limit". The scenario files can be used with the Bluesky simulator tool implementation found at https://github.com/TUDelft-CNS-ATM/bluesky. The result files exhibit the results obtained with the previous tool.

  12. Bluesky data: underlying the publication “Review of Conflict Resolution...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Apr 22, 2020
    Cite
    Ribeiro, Marta (2020). Bluesky data: underlying the publication “Review of Conflict Resolution Methods for Manned and Unmanned Aviation” [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000459603
    Dataset updated
    Apr 22, 2020
    Authors
    Ribeiro, Marta
    Description

    Bluesky scenarios and result files used in the work "Review of Conflict Resolution Methods for Manned and Unmanned Aviation". The scenario files can be used with the Bluesky simulator tool implementation found at https://github.com/TUDelft-CNS-ATM/bluesky. The result files exhibit the results obtained with the previous tool.

  13. Blue Sky Import Data India – Buyers & Importers List

    • seair.co.in
    Cite
    Seair Exim, Blue Sky Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Authors
    Seair Exim
    Area covered
    India
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  14. bluesky.com Website Traffic, Ranking, Analytics [July 2025]

    • semrush.com
    • stb2.digiseotools.com
    Updated Aug 12, 2025
    Cite
    Semrush (2025). bluesky.com Website Traffic, Ranking, Analytics [July 2025] [Dataset]. https://www.semrush.com/website/bluesky.com/overview/
    Dataset updated
    Aug 12, 2025
    Dataset authored and provided by
    Semrush (https://fr.semrush.com/)
    License

    https://www.semrush.com/company/legal/terms-of-service/

    Time period covered
    Aug 12, 2025
    Area covered
    Worldwide
    Variables measured
    visits, backlinks, bounceRate, pagesPerVisit, authorityScore, organicKeywords, avgVisitDuration, referringDomains, trafficByCountry, paidSearchTraffic, and 3 more
    Measurement technique
    Semrush Traffic Analytics; Click-stream data
    Description

    bluesky.com is ranked #67329 in US with 119.79K Traffic. Categories: Retail. Learn more about website traffic, market share, and more!

  15. Blue Sky Drive Cross Street Data in Jeffersonville, KY

    • ownerly.com
    Updated Dec 10, 2021
    Cite
    Ownerly (2021). Blue Sky Drive Cross Street Data in Jeffersonville, KY [Dataset]. https://www.ownerly.com/ky/jeffersonville/blue-sky-dr-home-details
    Dataset updated
    Dec 10, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Blue Sky Drive, Kentucky, Jeffersonville
    Description

    This dataset provides information about the number of properties, residents, and average property values for Blue Sky Drive cross streets in Jeffersonville, KY.

  16. USFS Airfire BlueSky Daily 4km PM2.5 Model Data and Imagery

    • data.ucar.edu
    • ckanprod.ucar.edu
    Updated Dec 26, 2024
    Cite
    Narasimhan K. Larkin; Robert Solomon; Susan M. O'Neill (2024). USFS Airfire BlueSky Daily 4km PM2.5 Model Data and Imagery [Dataset]. http://doi.org/10.5065/D6F18XK4
    Available download formats: archive
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Narasimhan K. Larkin; Robert Solomon; Susan M. O'Neill
    Time period covered
    Jul 7, 2018 - Sep 30, 2018
    Area covered
    Description

    This data set contains the output and products from the daily runs of the United States Forest Service (USFS) BlueSky modeling framework for the WE-CAN and BB-FLUX field projects. The runs for WE-CAN utilized the University of Washington 4-km resolution WRF model over the Pacific Northwest. The data set includes the model output in NetCDF format, KMZ files for display of the PM2.5 column average forecasts, KMZ files of the active fire locations, and PM2.5 forecast imagery all contained within daily gzipped tar files.

  17. Bluesky solutions llc USA Import & Buyer Data

    • seair.co.in
    Cite
    Seair Exim, Bluesky solutions llc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  18. New Bluesky Exporter/Supplier Data to USA, New Bluesky Export to USA Data

    • seair.co.in
    Updated Jul 25, 2025
    Cite
    Seair Exim (2025). New Bluesky Exporter/Supplier Data to USA, New Bluesky Export to USA Data [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  19. Blue Sky Loop Cross Street Data in Jeffersonville, IN

    • ownerly.com
    Updated Dec 29, 2021
    Cite
    Ownerly (2021). Blue Sky Loop Cross Street Data in Jeffersonville, IN [Dataset]. https://www.ownerly.com/in/jeffersonville/blue-sky-loop-home-details
    Dataset updated
    Dec 29, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Jeffersonville, Indiana, Blue Sky Loop
    Description

    This dataset provides information about the number of properties, residents, and average property values for Blue Sky Loop cross streets in Jeffersonville, IN.

  20. 40-million-bluesky-posts

    • huggingface.co
    Updated Dec 21, 2024
    Cite
    Aranym (2024). 40-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/40-million-bluesky-posts
    Dataset updated
    Dec 21, 2024
    Authors
    Aranym
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Nightsky 40M Dataset

    ~40 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

    Request data deletion

    A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/40-million-bluesky-posts.
