98 datasets found
  1. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gitanjali Wadhwa
    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted from location, starred repositories, employer, and e-mail address. The task associated with the graph is binary node classification: predict whether a GitHub user is a web developer or a machine learning developer. The target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
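As a quick sanity check on the properties listed above, undirected density follows directly from the node and edge counts via 2m / (n(n-1)). A minimal pure-Python sketch, using a toy edge list of mutual-follower pairs (the dataset's real edge file is not reproduced here):

```python
# Minimal sketch: compute undirected density and build an adjacency
# map from an edge list of mutual-follower pairs. The pairs below
# are toy data standing in for the dataset's real edge file.

from collections import defaultdict

def density(n_nodes: int, n_edges: int) -> float:
    """Undirected graph density: 2m / (n * (n - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical sample
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

n, m = len(adj), len(edges)
print(f"density = {density(n, m):.3f}")  # 4 nodes, 4 edges -> 0.667
```

The adjacency map is the natural starting point for the listed tasks (node classification features, link prediction candidates, community detection).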
  2. parsed_redd_-00003-of-00005.json

    • figshare.com
    json
    Updated Jun 15, 2023
    + more versions
    Cite
    Hana Matatov (2023). parsed_redd_-00003-of-00005.json [Dataset]. http://doi.org/10.6084/m9.figshare.19208616.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    figshare
    Authors
    Hana Matatov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset of the paper "Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media", by Hana Matatov, Mor Naaman, and Ofra Amir. See the GitHub repository for details.

  3. IMDB & Social Media Dataset

    • kaggle.com
    Updated Nov 5, 2023
    Cite
    momo5577 (2023). IMDB & Social Media Dataset [Dataset]. https://www.kaggle.com/datasets/momo5577/imdb-and-social-media-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    momo5577
    Description

    This dataset is compiled from a dataset hosted on GitHub.

    Data Description Table

    movie_title: Title of the movie
    duration: Duration in minutes
    director_name: Name of the director of the movie
    director_facebook_likes: Number of likes of the director on his Facebook page
    actor_1_name: Primary actor starring in the movie
    actor_1_facebook_likes: Number of likes of actor_1 on his/her Facebook page
    actor_2_name: Other actor starring in the movie
    actor_2_facebook_likes: Number of likes of actor_2 on his/her Facebook page
    actor_3_name: Other actor starring in the movie
    actor_3_facebook_likes: Number of likes of actor_3 on his/her Facebook page
    num_user_for_reviews: Number of users who gave a review
    num_critic_for_reviews: Number of critical reviews on IMDB
    num_voted_users: Number of people who voted for the movie
    cast_total_facebook_likes: Total number of Facebook likes of the entire cast of the movie
    movie_facebook_likes: Number of Facebook likes on the movie page
    plot_keywords: Keywords describing the movie plot
    facenumber_in_poster: Number of the actor who featured in the movie poster
    color: Film colorization, 'Black and White' or 'Color'
    genres: Film categorization, e.g. 'Animation', 'Comedy'
    title_year: The year in which the movie was released (1916–2016)
    language: Language, e.g. English, Arabic, Chinese
    country: Country where the movie was produced
    content_rating: Content rating of the movie
    aspect_ratio: Aspect ratio the movie was made in
    movie_imdb_link: IMDB link of the movie
    gross: Gross earnings of the movie in dollars
    budget: Budget of the movie in dollars
    imdb_score: IMDB score of the movie
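A brief sketch of working with a few of the columns described above, using only the standard library. The two inline rows are invented sample data, not taken from the real file, which has the full column set listed in the table:

```python
# Sketch: read movie_title, genres, and imdb_score from CSV text
# and average imdb_score per genre. The inline rows are invented
# examples; swap in the real CSV file for actual analysis.

import csv
import io
from collections import defaultdict

sample = io.StringIO(
    "movie_title,genres,imdb_score\n"
    "Movie A,Comedy,7.0\n"
    "Movie B,Comedy,5.0\n"
    "Movie C,Action,8.0\n"
)

scores = defaultdict(list)
for row in csv.DictReader(sample):
    scores[row["genres"]].append(float(row["imdb_score"]))

avg = {genre: sum(v) / len(v) for genre, v in scores.items()}
print(avg)  # {'Comedy': 6.0, 'Action': 8.0}
```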
  4. COVID-19 Twitter Dataset

    • figshare.com
    • borealisdata.ca
    zip
    Updated Oct 2, 2021
    Cite
    Social Media Lab (2021). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.16713448.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2021
    Dataset provided by
    figshare
    Authors
    Social Media Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) shared between March and July of 2020.

    Sampling Method: hourly requests sent to the Twitter Search API using Social Feed Manager, an open-source software tool that harvests social media data and related content from Twitter and other platforms.

    NOTE:
    1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
    2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc).
    3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because the tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
    4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as:
    https://github.com/thepanacealab/covid19_twitter
    https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset
    https://github.com/echen102/COVID-19-TweetIDs
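Rehydration tools such as Hydrator and Twarc send tweet IDs to Twitter's lookup endpoint, which accepts at most 100 IDs per request, so the ID list has to be batched. A minimal pure-Python batching sketch (no Twitter client involved; the sample IDs and helper name are illustrative, not part of the dataset):

```python
# Sketch of preparing tweet IDs for rehydration. Twitter's lookup
# endpoint accepts up to 100 IDs per request, so IDs are sent in
# batches. The sample IDs below are fabricated for illustration.

from typing import Iterable, Iterator, List

def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` tweet IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

sample_ids = [str(1_240_000_000_000_000_000 + i) for i in range(250)]
batches = list(batched(sample_ids))
print(len(batches), len(batches[-1]))  # 3 batches: 100 + 100 + 50
```

Each batch would then be passed to the hydration tool of choice, which returns full tweet objects for the IDs that are still available.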

  5. Social Media Data | LinkedIn, Facebook, Instagram X (Twitter), YouTube,...

    • datarade.ai
    .json, .csv
    Updated Aug 28, 2025
    Cite
    HitHorizons (2025). Social Media Data | LinkedIn, Facebook, Instagram X (Twitter), YouTube, TikTok, GitHub | European Coverage | 50 Countries | Monthly Refresh [Dataset]. https://datarade.ai/data-products/social-media-data-linkedin-facebook-instagram-x-twitter-hithorizons
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 28, 2025
    Dataset authored and provided by
    HitHorizons
    Area covered
    France, Monaco, Italy, Ukraine, Luxembourg, Malta, Norway, Montenegro, Estonia, Netherlands
    Description

    Social Media Data for European Companies offers a powerful tool for organizations looking to enhance their decision-making through informed strategies. By providing links to social media profiles across various platforms—including LinkedIn, Facebook, Instagram, X (formerly Twitter), YouTube, TikTok, and GitHub—this solution caters to the specific needs of industries ranging from sales to recruitment. Updated monthly and fully compliant with GDPR regulations, it ensures reliability, relevancy, and trustworthiness.

    LinkedIn – A leading network for businesses and professionals, ideal for B2B interactions.

    Facebook – A hub for business pages, reviews, and customer engagement.

    Instagram – A visually-driven platform for brand marketing and audience engagement.

    X (formerly Twitter) – Known for real-time updates and customer interactions.

    YouTube – A video powerhouse offering in-depth brand storytelling opportunities.

    TikTok – A rapidly growing platform for creative and viral content.

    GitHub – A crucial resource for tech professionals and organizations focused on open-source projects.

  6. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Oxylabs
    Area covered
    El Salvador, Tuvalu, Philippines, Bahamas, Marshall Islands, South Sudan, Djibouti, United Kingdom, Guyana, Saint Pierre and Miquelon
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  7. Social Media Profile Links by Name

    • openwebninja.com
    json
    Updated Feb 2, 2025
    Cite
    OpenWeb Ninja (2025). Social Media Profile Links by Name [Dataset]. https://www.openwebninja.com/api/social-links-search
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Worldwide
    Description

    This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks such as Facebook, TikTok, Instagram, Twitter, LinkedIn, YouTube, Pinterest, GitHub, and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in JSON format via a REST API.

  8. Data from: On the Role of Images for Analyzing Claims in Social Media

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). On the Role of Images for Analyzing Claims in Social Media [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4592249?locale=cs
    Explore at:
    Available download formats: unknown
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

    The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

    1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 by Barrón-Cedeño et al. [1]
    2. lesa is an English Twitter dataset for claim detection released by Gupta et al. [2]
    3. mediaeval is an English Twitter dataset for conspiracy detection released in the MediaEval 2020 Workshop by Pogorelov et al. [3]

    The dataset details, such as the data curation and annotation process, can be found in the cited papers. The datasets released here with corresponding images are relatively smaller than the original text-based tweet sets. The data statistics are as follows:

    1. clef_en: 281
    2. clef_ar: 2571
    3. lesa: 1395
    4. mediaeval: 1724

    Each folder has two sub-folders and a JSON file data.json that contains the crawled tweets. The two sub-folders are:

    1. images: contains crawled images, each with the same name as a tweet-id in data.json.
    2. splits: contains the 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns
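The description states that each image is named after a tweet-id from data.json, so images and tweets can be paired by file name. A hedged sketch of that pairing; the file extensions and the helper name are assumptions for illustration, and a toy directory is built in place of the real images/ folder:

```python
# Sketch of pairing tweet IDs with image files named after them,
# per the folder layout described above. Extensions checked here
# (.jpg, .png) are assumptions; verify against the real images/.

import pathlib
import tempfile
from typing import Optional

def image_for(tweet_id: str, images_dir: pathlib.Path,
              exts=(".jpg", ".png")) -> Optional[pathlib.Path]:
    """Return the image path for a tweet ID, or None if missing."""
    for ext in exts:
        candidate = images_dir / f"{tweet_id}{ext}"
        if candidate.exists():
            return candidate
    return None

# Toy demonstration with a temporary directory standing in for images/.
with tempfile.TemporaryDirectory() as tmp:
    images = pathlib.Path(tmp)
    (images / "123.jpg").touch()
    found = image_for("123", images)
    missing = image_for("456", images)
print(found is not None, missing)  # True None
```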

  9. Social media search stream data sets

    • fsadata.github.io
    csv
    Updated Mar 30, 2017
    + more versions
    Cite
    (2017). Social media search stream data sets [Dataset]. https://fsadata.github.io/social-media-search-stream-data-sets/
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 30, 2017
    Description

    The FSA Communications team tracked online and social data streams for pre-determined search topics, to capture data sets for the period between March 2016 and March 2017.

  10. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    figshare
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage notes:
    - Misinformation detection, classification, tracking, prediction.
    - Misinformation sentiment analysis.
    - Rumor veracity classification, comment stance classification.
    - Rumor tracking, social network analysis.

    Data pre-processing and data analysis codes are available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.

    Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @article{cheng2021covid, title={A COVID-19 Rumor Dataset}, author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul}, journal={Frontiers in Psychology}, volume={12}, pages={1566}, year={2021}, publisher={Frontiers} }

  11. Augmented dataset of rumours and non-rumours for rumour detection

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    json
    Updated Oct 22, 2023
    Cite
    (2023). Augmented dataset of rumours and non-rumours for rumour detection [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7551
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 22, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains a collection of Twitter rumours and non-rumours during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash

    The data set is an augmented data set of the PHEME dataset of rumours and non-rumours based on two data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078), and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).

    PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.

    aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:

    * 2013 Boston marathon bombings: 392 rumours and 784 non-rumours

    * 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours

    * 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours

    * 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours

    * 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours

    * 2015 Germanwings plane crash: 502 rumours and 604 non-rumours

    aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:

    * 2013 Boston marathon bombings: 323 rumours and 645 non-rumours

    * 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours

    * 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours

    * 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours

    * 2014 Ferguson unrest: 471 rumours and 949 non-rumours

    * 2015 Germanwings plane crash: 375 rumours and 402 non-rumours

    The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory has the set of tweets responding to that source tweet. Also, each folder contains 'aug_complete.csv' and 'reference.csv'.

    'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and filtering tweets without context (i.e., replies).

    'reference.csv' file contains manually annotated reference tweets [2, 3].
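The directory layout described above (event/rumours-or-non-rumours/tweet-ID/source-tweet and reactions) can be walked with a few lines of standard-library Python. A sketch under the assumption that the layout matches the description exactly; a toy tree is built in a temporary directory, and the real event folder names may differ:

```python
# Sketch of walking the PHEME-style layout described above:
#   <event>/<rumours|non-rumours>/<tweet_id>/{source-tweet,reactions}
# A toy tree stands in for a real event directory.

import pathlib
import tempfile

def count_threads(event_dir: pathlib.Path) -> dict:
    """Count tweet-ID folders under rumours/ and non-rumours/."""
    return {
        label: sum(1 for p in (event_dir / label).iterdir() if p.is_dir())
        for label in ("rumours", "non-rumours")
    }

with tempfile.TemporaryDirectory() as tmp:
    event = pathlib.Path(tmp) / "sydneysiege"   # hypothetical event name
    for label, ids in [("rumours", ["1", "2"]), ("non-rumours", ["3"])]:
        for tid in ids:
            (event / label / tid / "source-tweet").mkdir(parents=True)
            (event / label / tid / "reactions").mkdir()
    counts = count_threads(event)
print(counts)  # {'rumours': 2, 'non-rumours': 1}
```

The same walk, applied per event, reproduces the rumour/non-rumour statistics listed above.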

    If you use our augmented data (PHEME-Aug v2.0), please also cite:

    [1] Han S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019


    [2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.

    [3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM

  12. A Twitter Dataset of 70+ million tweets related to COVID-19

    • zenodo.org
    csv, tsv, zip
    Updated Apr 17, 2023
    Cite
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell (2023). A Twitter Dataset of 70+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3732460
    Explore at:
    Available download formats: csv, tsv, zip
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th, yielding over 4 million tweets a day.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

    More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter on re-distributing Twitter data. They need to be hydrated before use.

  13. ETHOS Hate Speech Dataset

    • kaggle.com
    Updated Jun 26, 2023
    + more versions
    Cite
    Konrad Banachewicz (2023). ETHOS Hate Speech Dataset [Dataset]. https://www.kaggle.com/datasets/konradb/ethos-hate-speech-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    From the project repo: https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset

    ETHOS: multi-labEl haTe speecH detectiOn dataSet. This repository contains a dataset for hate speech detection on social media platforms, called Ethos. There are two variations of the dataset:

    Ethos_Dataset_Binary.csv contains 998 comments, each with a label indicating the presence or absence of hate speech: 565 of them do not contain hate speech, while the remaining 433 do. Ethos_Dataset_Multi_Label.csv contains 8 labels for the 433 comments with hate speech content. These labels are violence (whether it incites violence (1) or not (0)), directed_vs_general (whether it is directed at a person (1) or a group (0)), and 6 labels for the category of hate speech: gender, race, national_origin, disability, religion, and sexual_orientation.
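A hedged sketch of splitting the binary variant into hate and non-hate comments. The column names and delimiter below are assumptions made for illustration; check the header of the repository's actual CSV before applying this to the real file:

```python
# Sketch: partition binary-labeled comments into hate / non-hate.
# Column names ("comment", "isHate") and the comma delimiter are
# assumptions; the inline rows are invented sample data.

import csv
import io

sample = io.StringIO(
    "comment,isHate\n"
    "nice people here,0\n"
    "some hateful text,1\n"
    "another neutral post,0\n"
)

hate, not_hate = [], []
for row in csv.DictReader(sample):
    target = hate if float(row["isHate"]) >= 0.5 else not_hate
    target.append(row["comment"])

print(len(hate), len(not_hate))  # 1 2
```

Thresholding at 0.5 also covers the case where labels are stored as fractional annotator agreement rather than hard 0/1 values.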

  14. Real Indian users on Github

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Archit Tyagi (2024). Real Indian users on Github [Dataset]. https://www.kaggle.com/datasets/archittyagi108/real-indian-users-on-github
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Archit Tyagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    📊 GitHub Indian Users Dataset

    Overview

    This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.

    🧑‍💻 Dataset Contents

    The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
    - Username: Unique identifier for each user (anonymized)
    - Location: City or region within India
    - Programming Languages: Most commonly used languages per user
    - Repositories: Public repositories owned and contributed to
    - Followers and Following: Social network connections within the platform
    - GitHub Join Date: Date the user joined GitHub
    - Organizations: Affiliated organizations (if publicly available)

    🌟 Source and Inspiration

    This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.

    Potential Use Cases

    1. Trend Analysis: Identify popular programming languages, tech stacks, and frameworks among Indian developers.
    2. Community Growth: Analyze how the Indian developer community has grown over time on GitHub.
    3. Social Network Analysis: Understand the follower and following patterns to uncover influential developers within the Indian tech community.
    4. Regional Insights: Discover which cities or regions in India have the most active GitHub users.
    5. Career Development: Insights for recruiters looking to identify and understand potential talent pools in India.

    💡 Ideal for

    This dataset is perfect for:
    - Data scientists looking to explore and visualize developer trends
    - Recruiters interested in talent scouting within the Indian tech ecosystem
    - Tech enthusiasts who want to explore the dynamics of India's open-source community
    - Students and educators looking for real-world data to practice analysis and modeling

  15. Data from: Two Computational Models for Analyzing Political Attention in...

    • openicpsr.org
    delimited
    Updated Mar 30, 2020
    Cite
    Libby Hemphill; Angela M. Schöpke-Gonzalez (2020). Two Computational Models for Analyzing Political Attention in Social Media [Dataset]. http://doi.org/10.3886/E118569V2
    Explore at:
    Available download formats: delimited
    Dataset updated
    Mar 30, 2020
    Dataset provided by
    University of Michigan
    Authors
    Libby Hemphill; Angela M. Schöpke-Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Using the Twitter Search API, we collected all tweets posted by official MC accounts (voting members only) during the 115th U.S. Congress, which ran from January 3, 2017, to January 3, 2019. We identified MCs' Twitter user names by combining the lists of MC social media accounts from the United States project (https://github.com/unitedstates/congress-legislators), George Washington Libraries (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UIVHQR), and the Sunlight Foundation (https://sunlightlabs.github.io/congress/index.html#legislator-spreadsheet). Throughout 2017 and 2018, we used the Twitter API to search for the user names in this composite list and retrieved the accounts' most recent tweets. Our final search occurred on January 3, 2019, shortly after the 115th U.S. Congress ended. In all, we collected 1,485,834 original tweets (i.e., we excluded retweets) from 524 accounts. The number of accounts differs from the total size of Congress because we included tweet data for MCs who resigned (e.g., Ryan Zinke) and those who joined off cycle (e.g., Rep. Conor Lamb); we were also unable to confirm accounts for every state and district.

    Twitter prohibits us from sharing the full tweet text, and so we have included tweet IDs when possible.

  16. FakeNewsNet

    • dataverse.harvard.edu
    • kaggle.com
    json, text/markdown +3
    Updated Jan 16, 2020
    Cite
    Harvard Dataverse (2020). FakeNewsNet [Dataset]. http://doi.org/10.7910/DVN/UEMMHS
    Explore at:
    Available download formats: text/x-python(2201), txt(546), json(637), text/x-python(2018), text/markdown(11574), tsv(13172624), tsv(20973070), text/x-python(4760), text/x-python(2891), text/x-python(2384), text/x-python(8673), text/x-python(1825), text/x-python(0), text/x-python(3516), json(104), tsv(8701109), tsv(3454648), text/x-python(281), text/x-python(2829)
    Dataset updated
    Jan 16, 2020
    Dataset provided by
    Harvard Dataverse
    Description

    FakeNewsNet is a multi-dimensional data repository that currently contains two datasets with news content, social context, and spatiotemporal information. The dataset is constructed using an end-to-end system, FakeNewsTracker. The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. Because of Twitter's data-sharing policy, we only share the news articles and tweet IDs as part of this dataset, and provide code in the accompanying repository to download complete tweet details, social engagements, and social networks. We describe and compare FakeNewsNet with other existing datasets in FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media (https://arxiv.org/abs/1809.01286). A more readable version of the dataset is available at https://github.com/KaiDMML/FakeNewsNet

  17. Amharic text dataset extracted from memes for hate speech detection or...

    • data.mendeley.com
    Updated Jun 8, 2023
    + more versions
    Cite
    Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
    Explore at:
    Dataset updated
    Jun 8, 2023
    Authors
    Mequanent Degu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from social media platforms such as Facebook and Telegram and then further processed. Three variants are provided: original_cleaned, which is neither stemmed nor stopword-removed; stopword_removed, in which stopwords are removed but no stemming is applied; and stemmed, which is both stemmed and stopword-removed. Stemming was done using HornMorpho, developed by Michael Gasser (available at https://github.com/hltdi/HornMorpho). All variants are normalized and free from noise such as punctuation marks and emojis.
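    The kind of noise removal described above (stripping punctuation and emojis) can be sketched as follows; this is a minimal illustration, not the authors' pipeline, and the sample string is hypothetical:

    ```python
    import re
    import string

    def remove_noise(text):
        """Strip punctuation marks and emoji/symbol characters, keeping letters and digits."""
        # Drop common emoji and symbol blocks (a rough illustrative range, not exhaustive).
        text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
        # Drop ASCII punctuation; Amharic punctuation such as "።" would need its own list.
        text = text.translate(str.maketrans("", "", string.punctuation))
        # Collapse any whitespace left behind.
        return " ".join(text.split())

    print(remove_noise("hello, world!! 🙂"))  # → hello world
    ```

    A production pipeline for Amharic would also handle Ge'ez-script punctuation and character normalization, which this sketch omits.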

  18. PHEME dataset for Rumour Detection and Veracity Classification

    • figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga (2023). PHEME dataset for Rumour Detection and Veracity Classification [Dataset]. http://doi.org/10.6084/m9.figshare.6392078.v1
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.

    The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours, which in turn contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory holds the set of tweets responding to that source tweet. Each tweet folder also contains 'annotation.json', with information about the veracity of the rumour, and 'structure.json', with information about the structure of the conversation.
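    A minimal sketch of loading one thread from that layout (the directory names follow the description above; the sample tweets are hypothetical, and 'structure.json' is omitted for brevity):

    ```python
    import json
    import tempfile
    from pathlib import Path

    def load_thread(tweet_dir):
        """Load one rumour/non-rumour thread: source tweet, reactions, and annotation."""
        tweet_dir = Path(tweet_dir)
        source = json.loads(next(tweet_dir.glob("source-tweet/*.json")).read_text())
        reactions = [json.loads(p.read_text()) for p in tweet_dir.glob("reactions/*.json")]
        annotation = json.loads((tweet_dir / "annotation.json").read_text())
        return source, reactions, annotation

    # Build a tiny hypothetical thread matching the described layout.
    root = Path(tempfile.mkdtemp()) / "event" / "rumours" / "1234"
    (root / "source-tweet").mkdir(parents=True)
    (root / "reactions").mkdir()
    (root / "source-tweet" / "1234.json").write_text(json.dumps({"id": 1234, "text": "claim"}))
    (root / "reactions" / "5678.json").write_text(json.dumps({"id": 5678, "text": "doubt"}))
    (root / "annotation.json").write_text(json.dumps({"true": "false"}))

    source, reactions, annotation = load_thread(root)
    print(source["text"], len(reactions))  # → claim 1
    ```

    Iterating `load_thread` over every tweet-ID folder under an event's rumours and non-rumours subfolders reconstructs the full event.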

    This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619). It contains rumours related to 9 events, and each rumour is annotated with its veracity value: True, False or Unverified.

    This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.

    Code using this dataset can be found on github (https://github.com/kochkinaelena/Multitask4Veracity).

    License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.

  19. BESOCIAL: Social media archiving tools comparison - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 1, 2024
    + more versions
    Cite
    (2024). BESOCIAL: Social media archiving tools comparison - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e084520a-90ac-5506-8bfe-5d3b6ae1796b
    Explore at:
    Dataset updated
    Aug 1, 2024
    Description

    This spreadsheet aims to provide an overview of tools and frameworks for social media archiving, since a variety of open-source tools exist, implemented in different programming languages and with different features. It is an adapted version of the "Comparison of web archiving software" by the Data Together initiative, which is licensed under CC BY 4.0: https://github.com/datatogether/research/tree/master/web_archiving. We kept the table structure with its clear column definitions and added new columns; these definitions can be found in the column definitions sheet. Some columns do not apply or are not our main focus, so we left them empty for now. The observations sheet contains a description of each tool, which also served as the basis for several column values in the comparison sheet. Initially we focus on dedicated social media harvesting tools. Please note that general web archiving tools such as those listed in the Data Together table may be used to harvest social media data too. However, these tools might require a specific setup to cope with the peculiarities of social media data, hence we did not initially include them. Contributions and feedback are welcome, and we also envision contributing results back to the Data Together initiative. This spreadsheet is part of work package 1 of the BeSocial project; the research has been funded by BELSPO, the Belgian Science Policy Office. Contact information: for BeSocial in general, Fien Messens from KBR (fien.messens@kbr.be); for the tool comparison, Sven Lieber from Ghent University - IDLab (sven.lieber@ugent.be).

  20. MultiSocial

    • zenodo.org
    • data.niaid.nih.gov
    Updated Aug 20, 2025
    Cite
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a dataset (described in a paper) for a multilingual (22 languages) machine-generated text detection benchmark in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

    Disclaimer

    Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all the sensitive/personal content; thus, use the data cautiously (if affected by such content, report the issues to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    1. Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    2. Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    3. Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    4. Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    5. WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we have pseudo-randomly sampled up to 1300 texts per platform for each of the selected 22 languages (using a combination of automated approaches to detect the language): up to 300 for the test split and, where available, the remaining up to 1000 for the train split. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters and on length, resulting in about 58k human-written texts.
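    The per-(language, platform) sampling and splitting can be sketched as follows; the record contents and counts are illustrative, not the authors' code:

    ```python
    import random

    def sample_split(texts, test_n=300, train_n=1000, seed=0):
        """Sample up to test_n + train_n texts for one (language, platform) group,
        assigning the first test_n to the test split and the rest to train."""
        rng = random.Random(seed)
        picked = rng.sample(texts, min(len(texts), test_n + train_n))
        return picked[:test_n], picked[test_n:]

    # Hypothetical group with 450 texts: 300 go to test, 150 remain for train.
    texts = [f"post-{i}" for i in range(450)]
    test, train = sample_split(texts)
    print(len(test), len(train))  # → 300 150
    ```

    Fixing the seed keeps the pseudo-random selection reproducible across runs.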

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    • 'text' - a text sample,

    • 'label' - 0 for human-written text, 1 for machine-generated text,

    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

    • 'length' - word count of the given text,

    • 'source' - a string identifying the source dataset / platform of the given text,

    • 'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
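    Given those fields, selecting, say, the English test split of human-written texts can be sketched as below (the records are hypothetical examples following the field schema, not taken from the dataset):

    ```python
    def select(records, **criteria):
        """Return records whose fields match all given criteria."""
        return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

    # Hypothetical records following the field schema above.
    records = [
        {"text": "hi", "label": 0, "multi_label": "human", "split": "test", "language": "en"},
        {"text": "yo", "label": 1, "multi_label": "vicuna-13b", "split": "test", "language": "en"},
        {"text": "ahoj", "label": 0, "multi_label": "human", "split": "train", "language": "cs"},
    ]

    human_en_test = select(records, label=0, split="test", language="en")
    print(len(human_en_test))  # → 1
    ```

    The same helper filters by generator via `multi_label` or by platform via `source`.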

    ToDo Statistics (under construction)
