13 datasets found

Bluesky Social Dataset Dataset
paperswithcode.com
Updated Apr 28, 2024
Cite
(2024). Bluesky Social Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/bluesky-social-dataset
Dataset updated
Apr 28, 2024
Description
Bluesky Social Dataset

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.

This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.

Dataset

Here is a description of the dataset files.

followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).

posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.

interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers, and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.

graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.

feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).

feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.

feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the “liker”, the id of the post's author, the id of the liked post, and the like timestamp.

scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments. These scripts are detailed in a document released within the folder.
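
As an illustration of the follower edge list format described for followers.csv.gz, the following is a minimal Python sketch that reads "u,v" rows from a gzip-compressed CSV stream and counts followers per user. The function names are hypothetical, the sample ids are made up, and the sketch assumes the file has no header row; none of this is taken from the released scripts.

```python
import csv
import gzip
import io
from collections import Counter


def read_follow_edges(stream):
    """Yield (u, v) integer pairs from a gzip-compressed CSV byte stream.

    Assumes each row holds exactly two integers "u,v" (user u follows
    user v) and that there is no header row.
    """
    with gzip.open(stream, mode="rt", newline="") as text:
        for row in csv.reader(text):
            if len(row) != 2:
                continue  # assumption: silently skip blank or malformed rows
            yield int(row[0]), int(row[1])


def follower_counts(edges):
    """Count followers per user, i.e. the in-degree of v over edges (u, v)."""
    return Counter(v for _, v in edges)


# Tiny in-memory stand-in for followers.csv.gz (made-up ids, not real data).
sample = gzip.compress(b"1,2\n3,2\n2,1\n")
counts = follower_counts(read_follow_edges(io.BytesIO(sample)))
print(counts[2], counts[1])  # prints "2 1"
```

For the real file, an open binary file handle can be passed in place of the in-memory buffer. Reading row by row keeps memory use low, which is likely to matter for an edge list covering millions of users.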

Citation

If used for research purposes, please cite the following paper describing the dataset details:

Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984

Acknowledgments: This work is supported by:

the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
SoBigData.it, which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
POLITISKY24: U.S. Political Bluesky Dataset with Stance Labels
zenodo.org
bin, csv, json
Updated Jan 18, 2025
Cite
Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery (2025). POLITISKY24: U.S. Political Bluesky Dataset with Stance Labels [Dataset]. http://doi.org/10.5281/zenodo.14671773
Available download formats: json, bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.14671773
Dataset updated
Jan 18, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
POLITISKY24 (Political Stance Analysis on Bluesky for 2024) is a first-of-its-kind dataset for stance detection, focused on the 2024 U.S. presidential election. It is designed for target-specific user-level stance detection and contains 16,044 user-target stance pairs centered on two key political figures, Kamala Harris and Donald Trump. In addition, this dataset includes detailed metadata, such as complete user posting histories and engagement graphs (likes, reposts, and quotes).

Stance labels were generated using a robust and evaluated pipeline that integrates state-of-the-art Information Retrieval (IR) techniques with Large Language Models (LLMs), offering confidence scores, reasoning explanations, and text spans for each label. With an LLM-assisted labeling accuracy of 81%, POLITISKY24 provides a rich resource for the target-specific stance detection task. This dataset enables the exploration of the Bluesky platform, paving the way for deeper insights into political opinions and social discourse, and addressing gaps left by traditional datasets constrained by platform policies.

In the uploaded files:

The file 'Human_annotation_on_validation_users.csv' contains human-annotated stance labels for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs.
The labels are divided into four stances: 1 (favor), 2 (against), 3 (neutral), and 4 (unrelated). To simplify the stance annotations provided by the large language model, the "neutral" and "unrelated" categories are combined and represented as "neither."

The file 'LLM_annotation_on_validation_users.json' contains stance labels annotated by a state-of-the-art LLM for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs. In addition to stance labels, each pair includes an explanation of the reasoning, the source tweets, spans from the source tweets used in the reasoning, and a confidence score.

The file 'LLM_annotation_on_dataset_users.json' is similar to 'LLM_annotation_on_validation_users.json' but is generated for all dataset users excluding the validation set. It provides stance labels for 8,022 users toward Trump and Harris, totaling 16,044 user-target pairs.

The file 'Main_dataset_for_stance_detection.parquet' contains up to 1,000 recent English-language posts (including both original posts and reposts) from each of the 8,022 + 445 = 8,467 users. This file was used for the stance detection task.

The file 'Bluesky_dataset_on_us_politics.parquet' is similar to 'Main_dataset_for_stance_detection.parquet', but it contains all posts (including both original posts and reposts) from each of the 8,022 + 445 = 8,467 users.

The file 'Like_network.parquet' captures users' interactions through likes. Specifically, it contains the number of likes each user has given to original posts made by other users. It includes likes from 8,022 + 445 = 8,467 users, but it is not limited to interactions from these users alone.

The files 'Repost_network.parquet' and 'Quote_network.parquet' are similar to 'Like_network.parquet', but they capture users' interactions through reposts and quotes, respectively.
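
To compare the four-way human labels with the three-way LLM labels described above, the human codes first need to be collapsed. The following is a minimal Python sketch of that mapping, together with a check of the user and pair counts quoted in the file descriptions. The function names and the exact label strings are assumptions for illustration; the spelling used inside the actual files may differ.

```python
# Human annotation codes as listed in the description.
HUMAN_STANCES = {1: "favor", 2: "against", 3: "neutral", 4: "unrelated"}

# The two stance targets named in the description.
TARGETS = ("Trump", "Harris")


def to_three_way(code):
    """Collapse a four-way human code into favor / against / neither."""
    label = HUMAN_STANCES[code]
    return "neither" if label in ("neutral", "unrelated") else label


def pair_count(num_users):
    """Each user is labeled once per target, so pairs = users * targets."""
    return num_users * len(TARGETS)


print([to_three_way(c) for c in (1, 2, 3, 4)])
# prints ['favor', 'against', 'neither', 'neither']
print(pair_count(445), pair_count(8022), 8022 + 445)
# prints 890 16044 8467
```

The counts agree with the description: 445 validation users give 890 pairs, 8,022 remaining users give 16,044 pairs, and the two groups together make 8,467 users.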
50-million-bluesky-posts
huggingface.co
Updated Dec 21, 2024
Cite
Aranym (2024). 50-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/50-million-bluesky-posts
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 21, 2024
Authors
Aranym
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Nightsky 50M Dataset

~50 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

Request data deletion

A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/50-million-bluesky-posts.
40-million-bluesky-posts
huggingface.co
Updated Dec 21, 2024
Cite
Aranym (2024). 40-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/40-million-bluesky-posts
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 21, 2024
Authors
Aranym
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Nightsky 40M Dataset

~40 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

Request data deletion

A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/40-million-bluesky-posts.
A Blue Start: A large-scale pairwise and higher-order social network dataset...
socialmediaarchive.org
csv, json, pdf, zip
Updated May 9, 2025
Cite
(2025). A Blue Start: A large-scale pairwise and higher-order social network dataset [Dataset]. https://socialmediaarchive.org/record/78
Available download formats: json (619104394), zip (8502960667), json (132710350), pdf (73964), csv (85077755)
Dataset updated
May 9, 2025
Description
This dataset consists of all starter packs and all following network data available on Bluesky in January and February 2025. Starter packs can be created by any Bluesky user. They are lists of users and curated feeds with a minimum of 6 and a maximum of 150 users, curated by the starter pack creator. The creator typically names them and provides a description. Other users can use a single click to follow all users in the starter pack, or they can scroll through a specific starter pack to decide who to follow within that starter pack. In our dataset, all DIDs (persistent, unique identifiers) are anonymized with a non-reversible hash function; users in the network, as well as users who created starter packs, or appear in starter packs, are identified by their hashed DIDs. Similarly, starter packs themselves are identified by their hashed identifiers.

First, we include the Bluesky following network as it appeared in late January/early February 2025. This shows all available directed following relationships on Bluesky. We also include a network dataset of starter packs with information on creators and starter pack members. This is intended for users who wish to undertake a computational analysis of the networks created by starter packs or starter packs’ influences on networks.
Data from: POLITISKY24: U.S. Political Bluesky Dataset with User Stance...
zenodo.org
bin
Updated Jun 9, 2025
Cite
Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery (2025). POLITISKY24: U.S. Political Bluesky Dataset with User Stance Labels [Dataset]. http://doi.org/10.5281/zenodo.15616911
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15616911
Dataset updated
Jun 9, 2025
Dataset provided by
Zenodo
Authors
Peyman Rostami; Vahid Rahimzadeh; Ali Adibi; Azadeh Shakery
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
POLITISKY24 (Political Stance Analysis on Bluesky for 2024) is a first-of-its-kind dataset for stance detection, focused on the 2024 U.S. presidential election. It is designed for target-specific user-level stance detection and contains 16,044 user-target stance pairs centered on two key political figures, Kamala Harris and Donald Trump. In addition, this dataset includes detailed metadata, such as complete user posting histories and engagement graphs (likes, reposts, and quotes).

Stance labels were generated using a robust and evaluated pipeline that integrates state-of-the-art Information Retrieval (IR) techniques with Large Language Models (LLMs), offering confidence scores, reasoning explanations, and text spans for each label. With an LLM-assisted labeling accuracy of 81%, POLITISKY24 provides a rich resource for the target-specific stance detection task. This dataset enables the exploration of the Bluesky platform, paving the way for deeper insights into political opinions and social discourse, and addressing gaps left by traditional datasets constrained by platform policies.

In the uploaded files:

The file user_post_history_dataset.parquet includes the posting history of 8,561 active Bluesky users who have shared content related to American politics.

The file user_post_list_for_stance_detection.parquet contains a list of up to 1,000 recent English-language post IDs per user, intended for use in the stance detection task.

The file user_network_dataset.parquet captures users’ interactions through likes, reposts, and quotes.

The file human_annotated_validation_user_stance_dataset.parquet contains human-annotated stance labels for 445 validation users toward Trump and Harris, resulting in a total of 890 user-target pairs. The labels are divided into three stances: 1 (favor), 2 (against), and 3 (neither).

The file llm_annotated_validation_user_stance_dataset.parquet contains stance labels annotated by an LLM for the same 445 validation users toward Trump and Harris, also totaling 890 user-target pairs. In addition to stance labels, each pair includes an explanation of the reasoning, the source tweets, spans from the source tweets used in the reasoning, and a confidence score.

The file llm_annotated_full_user_stance_dataset.parquet is similar to the above LLM-annotated validation file but covers all dataset users excluding the validation set. It provides stance labels for 8,022 users toward Trump and Harris, totaling 16,044 user-target pairs.

The file human_annotated_validation_stance_relevancy_dataset (post-target entity pairs).parquet contains human-annotated stance labels for 175 validation posts toward Trump and Harris, resulting in 350 post-target pairs. The labels are divided into three stances: 1 (favor), 2 (against), and 3 (neither).

The file human_annotated_validation_stance_relevancy_dataset (query-post stance relevancy pairs).parquet contains 700 query-post stance relevancy pairs derived from the post-target entity pairs.
Data from: BlueTempNet: A Temporal Multi-network Dataset of Social...
ieee-dataport.org
Updated Oct 2, 2024
Cite
Ujun Jeong (2024). BlueTempNet: A Temporal Multi-network Dataset of Social Interactions in Bluesky Social [Dataset]. https://ieee-dataport.org/documents/bluetempnet-temporal-multi-network-dataset-social-interactions-bluesky-social
Dataset updated
Oct 2, 2024
Authors
Ujun Jeong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
including user-to-user interactions (following and blocking users) and user-to-community interactions (creating and joining communities).
bluesky-journalist-classification
huggingface.co
Updated Jul 8, 2025
Cite
Ruggero Marino Lazzaroni (2025). bluesky-journalist-classification [Dataset]. https://huggingface.co/datasets/ruggsea/bluesky-journalist-classification
Dataset updated
Jul 8, 2025
Authors
Ruggero Marino Lazzaroni
Description
Bluesky Journalist Classification Dataset

Dataset Description

This dataset contains Bluesky user profiles for training and evaluating journalist classification models. Created for the CSH Vienna Machine Learning Workshop, it includes comprehensive user data with human-verified labels for binary classification tasks.

Dataset Summary

Total Examples: 1,189
Test Split: 229 labeled examples
Unlabeled Split: 960 unlabeled examples
Languages: Primarily English… See the full description on the dataset page: https://huggingface.co/datasets/ruggsea/bluesky-journalist-classification.
ABoVE: MODIS-Derived Daily Mean Blue Sky Albedo for Northern North America,...
s.cnmilf.com
daac.ornl.gov
Updated Jun 28, 2025
Cite
ORNL_DAAC (2025). ABoVE: MODIS-Derived Daily Mean Blue Sky Albedo for Northern North America, 2000-2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/above-modis-derived-daily-mean-blue-sky-albedo-for-northern-north-america-2000-2017-7abac
Dataset updated
Jun 28, 2025
Dataset provided by
Oak Ridge National Laboratory Distributed Active Archive Center
Description
This dataset contains MODIS-derived daily mean shortwave blue sky albedo for northern North America (i.e., Canada and Alaska) and a set of quality control flags for each albedo value to aid in user interpretation. The data cover the period from February 24, 2000 through April 22, 2017. The blue sky albedo data were derived from the MODIS 500-m version 6 Bidirectional Reflectance Distribution Function and Albedo (BRDF/Albedo) Model Parameters MCD43A1 dataset (MCD43A1.006, https://doi.org/10.5067/MODIS/MCD43A1.006; Schaaf & Wang, 2015a); please refer to the MCD43 documentation and user guides for more information. Blue sky refers to albedo calculated under real-world conditions with a combination of both diffuse and direct lighting based on atmospheric and view-geometry conditions. Daily mean albedo was calculated by averaging hourly instantaneous blue sky albedo values weighted by the solar insolation for each time interval. Potter et al. (2019, https://doi.org/10.1111/gcb.14888) is the associated paper for this dataset. The actual extent of the dataset is shown in Figure 1 of the User Guide. Users are encouraged to refer to the User Guide for further important information about the use of this dataset.
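
The insolation-weighted daily mean mentioned in the description can be illustrated with a short Python sketch. This is only a generic weighted average under stated assumptions: the function name, the sample values, and the handling of a zero-insolation day are hypothetical, and the dataset's actual processing (units, gap filling, quality flags) is documented in its User Guide rather than here.

```python
def daily_mean_albedo(albedo, insolation):
    """Return the insolation-weighted mean of instantaneous albedo values.

    albedo[i] is the blue sky albedo for time interval i and insolation[i]
    is the solar insolation for the same interval (any consistent unit).
    """
    if len(albedo) != len(insolation):
        raise ValueError("albedo and insolation must have the same length")
    total = sum(insolation)
    if total <= 0:
        return None  # assumption: no sunlight means the mean is undefined
    return sum(a * w for a, w in zip(albedo, insolation)) / total


# Made-up values for three daylight hours (not taken from the dataset).
albedo = [0.50, 0.25, 0.50]
insolation = [100.0, 200.0, 100.0]
print(daily_mean_albedo(albedo, insolation))  # prints 0.375
```

In this example the unweighted mean would be about 0.417, while the weighted mean is 0.375, because the hour with the strongest sunlight has the lowest albedo and counts twice as much.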
30-million-bluesky-posts
huggingface.co
Updated Dec 21, 2024
Cite
30-million-bluesky-posts [Dataset]. https://huggingface.co/datasets/Aranym/30-million-bluesky-posts
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 21, 2024
Authors
Aranym
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Nightsky 30M Dataset

~30 million posts from the Bluesky Firehose API, reasonably anonymized. Licensed under CC0 and completely independently sourced to avoid licensing issues. Use it as you wish! Very little preprocessing.

Request data deletion

A user may request removal of their data by e-mailing nightsky-rm@proton.me with a subject line of "Delete My Data". As I don't collect usernames/DIDs, you must specify the position of every individual row you would like to be… See the full description on the dataset page: https://huggingface.co/datasets/Aranym/30-million-bluesky-posts.
Most frequent words appearing in each feed.
plos.figshare.com
xls
Updated Nov 5, 2024
Cite
Andrea Failla; Giulio Rossetti (2024). Most frequent words appearing in each feed. [Dataset]. http://doi.org/10.1371/journal.pone.0310330.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310330.t004
Dataset updated
Nov 5, 2024
Dataset provided by
PLOS ONE
Authors
Andrea Failla; Giulio Rossetti
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.
CLARA-A3: CM SAF cLoud, Albedo and surface RAdiation dataset from AVHRR data...
navigator.eumetsat.int
user.eumetsat.int
Updated Jan 1, 1979
Cite
CM SAF (1979). CLARA-A3: CM SAF cLoud, Albedo and surface RAdiation dataset from AVHRR data - Edition 3 [Dataset]. https://navigator.eumetsat.int/product/EO:EUM:DAT:0874
Dataset updated
Jan 1, 1979
Dataset authored and provided by
CM SAF
Measurement technique
Radiometer
Description
The CLARA-A3 record provides cloud properties and radiation parameters derived from the AVHRR sensor onboard polar orbiting NOAA and METOP satellites. CLARA-A3 is the latest edition of CLARA with previous editions documented in Karlsson et al. (2013) and Karlsson et al. (2017). CLARA-A3 covers the time period 1979/01/01 until 2020/12/31 as climate data record (CDR), but is operationally extended as interim climate data record (ICDR) to the present with a latency of 10 days. The AVHRR measurement input to the CLARA-A3 retrieval algorithms is the EUMETSAT PyGAC AVHRR Fundamental Data Record (FDR) Release 1 (DOI:10.15770/EUM_SEC_CLM_0060). CLARA-A3 features a range of cloud products: cloud mask, cloud top temperature/pressure/height, cloud thermodynamic phase, and (for liquid and ice clouds separately) cloud optical thickness, particle effective radius and cloud water path. Additionally, cloud droplet number concentration and cloud geometrical thickness are provided for liquid clouds. Furthermore, a range of radiation products are included in CLARA-A3: surface black-sky, white-sky and blue-sky albedo; surface downwelling short- and longwave radiation as well as surface net radiation; top-of-atmosphere (TOA) upwelling short- and longwave radiation. Cloud products are available as monthly and daily averages and histograms, as well as daily resampled global products (Level 2b) for individual satellites. Surface albedo is presented as monthly and pentad (5 day) averages. Surface and TOA radiation products are provided as daily and monthly averages. All averages are available on a 0.25° x 0.25° global grid. Surface albedo and selected cloud products are also provided on two equal area grids with a resolution of 25 km x 25 km covering the polar regions. Daily resampled cloud products (level 2b) are provided in a global grid with a resolution of 0.05°x0.05°. 
CLARA-A3 features a comprehensive set of documentation including User Manuals, Validation Reports and Algorithms Theoretical Baseline Documents. This is a Thematic Climate Data Record (TCDR).
CRAFTS wide-band datacube pre-release
scidb.cn
Updated Apr 1, 2025
Cite
Zheng Zheng; Chen Hao; Di Li; Pei Wang (2025). CRAFTS wide-band datacube pre-release [Dataset]. http://doi.org/10.57760/sciencedb.Fastro.00024
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.Fastro.00024
Dataset updated
Apr 1, 2025
Dataset provided by
Science Data Bank
Authors
Zheng Zheng; Chen Hao; Di Li; Pei Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are pleased to announce the pre-release of the CRAFTS wide-band spectral datacubes. This encompasses the frequency range of 1323 to 1419 MHz and includes almost all the drift scans conducted under the CRAFTS project between July 31, 2021 and January 30, 2025. In total, there are 182 drift scans, amounting to about 880 hours of data covering ~3500 square degrees of the sky (blue regions in Fig.1). Please note that observations prior to July 31, 2021 and a portion of later data (gray regions in Fig.1) are excluded from this pre-release due to damage by compressor RFI, being led by an external PI, or other specific considerations. The data have been processed into datacubes with the intention of facilitating extra-galactic spectral line research. The frequency range has been selected because data below 1323 MHz are likely to be affected by satellite radio frequency interference (RFI), while data above 1419 MHz are predominantly influenced by Galactic HI, which has already been incorporated into previously released narrow-band datacubes. The datasets are publicly available, with no collaboration required. Proper attribution through citation of the dataset DOI and related publications listed in the Reference section of the Readme document is appreciated. Detailed information about the dataset and subsequent releases can be found on the HIverse platform (https://hiverse.zero2x.org/wide). The HIverse platform features an integrated search engine, through which users can search by RA & Dec coordinates.