98 datasets found
  1. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gitanjali Wadhwa
    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted from location, starred repositories, employer, and e-mail address. The task associated with the graph is binary node classification: predict whether a GitHub user is a web developer or a machine learning developer. The target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
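As a quick sanity check on the properties listed above, undirected density follows directly from the node and edge counts via 2m / (n(n-1)). A minimal pure-Python sketch, using a toy edge list of mutual-follower pairs (the dataset's real edge file is not reproduced here):

```python
# Minimal sketch: compute undirected density and build an adjacency
# map from an edge list of mutual-follower pairs. The pairs below
# are toy data standing in for the dataset's real edge file.

from collections import defaultdict

def density(n_nodes: int, n_edges: int) -> float:
    """Undirected graph density: 2m / (n * (n - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical sample
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

n, m = len(adj), len(edges)
print(f"density = {density(n, m):.3f}")  # 4 nodes, 4 edges -> 0.667
```

The adjacency map is the natural starting point for the listed tasks (node classification features, link prediction candidates, community detection).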
  2. parsed_redd_-00003-of-00005.json

    • figshare.com
    json
    Updated Jun 15, 2023
    + more versions
    Cite
    Hana Matatov (2023). parsed_redd_-00003-of-00005.json [Dataset]. http://doi.org/10.6084/m9.figshare.19208616.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    figshare
    Authors
    Hana Matatov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset of the paper "Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media", by Hana Matatov, Mor Naaman, and Ofra Amir. See the GitHub repository for details.

  3. IMDB & Social Media Dataset

    • kaggle.com
    Updated Nov 5, 2023
    Cite
    momo5577 (2023). IMDB & Social Media Dataset [Dataset]. https://www.kaggle.com/datasets/momo5577/imdb-and-social-media-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    momo5577
    Description

    This dataset is compiled from a dataset hosted on GitHub.

    Data Description Table

    movie_title: Title of the movie
    duration: Duration in minutes
    director_name: Name of the director of the movie
    director_facebook_likes: Number of likes of the director on his Facebook page
    actor_1_name: Primary actor starring in the movie
    actor_1_facebook_likes: Number of likes of actor_1 on his/her Facebook page
    actor_2_name: Other actor starring in the movie
    actor_2_facebook_likes: Number of likes of actor_2 on his/her Facebook page
    actor_3_name: Other actor starring in the movie
    actor_3_facebook_likes: Number of likes of actor_3 on his/her Facebook page
    num_user_for_reviews: Number of users who gave a review
    num_critic_for_reviews: Number of critical reviews on IMDB
    num_voted_users: Number of people who voted for the movie
    cast_total_facebook_likes: Total number of Facebook likes of the entire cast of the movie
    movie_facebook_likes: Number of Facebook likes on the movie page
    plot_keywords: Keywords describing the movie plot
    facenumber_in_poster: Number of the actor who featured in the movie poster
    color: Film colorization, 'Black and White' or 'Color'
    genres: Film categorization, e.g. 'Animation', 'Comedy'
    title_year: The year in which the movie was released (1916–2016)
    language: Language, e.g. English, Arabic, Chinese
    country: Country where the movie was produced
    content_rating: Content rating of the movie
    aspect_ratio: Aspect ratio the movie was made in
    movie_imdb_link: IMDB link of the movie
    gross: Gross earnings of the movie in dollars
    budget: Budget of the movie in dollars
    imdb_score: IMDB score of the movie
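A brief sketch of working with a few of the columns described above, using only the standard library. The two inline rows are invented sample data, not taken from the real file, which has the full column set listed in the table:

```python
# Sketch: read movie_title, genres, and imdb_score from CSV text
# and average imdb_score per genre. The inline rows are invented
# examples; swap in the real CSV file for actual analysis.

import csv
import io
from collections import defaultdict

sample = io.StringIO(
    "movie_title,genres,imdb_score\n"
    "Movie A,Comedy,7.0\n"
    "Movie B,Comedy,5.0\n"
    "Movie C,Action,8.0\n"
)

scores = defaultdict(list)
for row in csv.DictReader(sample):
    scores[row["genres"]].append(float(row["imdb_score"]))

avg = {genre: sum(v) / len(v) for genre, v in scores.items()}
print(avg)  # {'Comedy': 6.0, 'Action': 8.0}
```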
  4. COVID-19 Twitter Dataset

    • figshare.com
    • borealisdata.ca
    zip
    Updated Oct 2, 2021
    Cite
    Social Media Lab (2021). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.16713448.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2021
    Dataset provided by
    figshare
    Authors
    Social Media Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) shared between March and July of 2020.

    Sampling Method: hourly requests sent to the Twitter Search API using Social Feed Manager, an open-source software tool that harvests social media data and related content from Twitter and other platforms.

    NOTE:
    1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
    2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc).
    3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because the tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
    4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as:
    https://github.com/thepanacealab/covid19_twitter
    https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset
    https://github.com/echen102/COVID-19-TweetIDs
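Rehydration tools such as Hydrator and Twarc send tweet IDs to Twitter's lookup endpoint, which accepts at most 100 IDs per request, so the ID list has to be batched. A minimal pure-Python batching sketch (no Twitter client involved; the sample IDs and helper name are illustrative, not part of the dataset):

```python
# Sketch of preparing tweet IDs for rehydration. Twitter's lookup
# endpoint accepts up to 100 IDs per request, so IDs are sent in
# batches. The sample IDs below are fabricated for illustration.

from typing import Iterable, Iterator, List

def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` tweet IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

sample_ids = [str(1_240_000_000_000_000_000 + i) for i in range(250)]
batches = list(batched(sample_ids))
print(len(batches), len(batches[-1]))  # 3 batches: 100 + 100 + 50
```

Each batch would then be passed to the hydration tool of choice, which returns full tweet objects for the IDs that are still available.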

  5. Social Media Data | LinkedIn, Facebook, Instagram X (Twitter), YouTube,...

    • datarade.ai
    .json, .csv
    Updated Aug 28, 2025
    Cite
    HitHorizons (2025). Social Media Data | LinkedIn, Facebook, Instagram X (Twitter), YouTube, TikTok, GitHub | European Coverage | 50 Countries | Monthly Refresh [Dataset]. https://datarade.ai/data-products/social-media-data-linkedin-facebook-instagram-x-twitter-hithorizons
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 28, 2025
    Dataset authored and provided by
    HitHorizons
    Area covered
    France, Monaco, Italy, Ukraine, Luxembourg, Malta, Norway, Montenegro, Estonia, Netherlands
    Description

    Social Media Data for European Companies offers a powerful tool for organizations looking to enhance their decision-making through informed strategies. By providing links to social media profiles across various platforms—including LinkedIn, Facebook, Instagram, X (formerly Twitter), YouTube, TikTok, and GitHub—this solution caters to the specific needs of industries ranging from sales to recruitment. Updated monthly and fully compliant with GDPR regulations, it ensures reliability, relevancy, and trustworthiness.

    LinkedIn – A leading network for businesses and professionals, ideal for B2B interactions.

    Facebook – A hub for business pages, reviews, and customer engagement.

    Instagram – A visually-driven platform for brand marketing and audience engagement.

    X (formerly Twitter) – Known for real-time updates and customer interactions.

    YouTube – A video powerhouse offering in-depth brand storytelling opportunities.

    TikTok – A rapidly growing platform for creative and viral content.

    GitHub – A crucial resource for tech professionals and organizations focused on open-source projects.

  6. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Oxylabs
    Area covered
    El Salvador, Tuvalu, Philippines, Bahamas, Marshall Islands, South Sudan, Djibouti, United Kingdom, Guyana, Saint Pierre and Miquelon
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  7. Social Media Profile Links by Name

    • openwebninja.com
    json
    Updated Feb 2, 2025
    Cite
    OpenWeb Ninja (2025). Social Media Profile Links by Name [Dataset]. https://www.openwebninja.com/api/social-links-search
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Worldwide
    Description

    This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks such as Facebook, TikTok, Instagram, Twitter, LinkedIn, YouTube, Pinterest, GitHub, and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in JSON format via a REST API.

  8. Data from: On the Role of Images for Analyzing Claims in Social Media

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). On the Role of Images for Analyzing Claims in Social Media [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4592249?locale=cs
    Explore at:
    Available download formats: unknown
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021.

    The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images.

    1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 by Barrón-Cedeño et al. [1]
    2. lesa is an English Twitter dataset for claim detection released by Gupta et al. [2]
    3. mediaeval is an English Twitter dataset for conspiracy detection released in the MediaEval 2020 Workshop by Pogorelov et al. [3]

    The dataset details, such as the data curation and annotation process, can be found in the cited papers. The datasets released here with corresponding images are relatively smaller than the original text-based tweet sets. The data statistics are as follows:

    1. clef_en: 281
    2. clef_ar: 2571
    3. lesa: 1395
    4. mediaeval: 1724

    Each folder has two sub-folders and a JSON file data.json that contains the crawled tweets. The two sub-folders are:

    1. images: contains crawled images, each with the same name as a tweet-id in data.json.
    2. splits: contains the 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns
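The description states that each image is named after a tweet-id from data.json, so images and tweets can be paired by file name. A hedged sketch of that pairing; the file extensions and the helper name are assumptions for illustration, and a toy directory is built in place of the real images/ folder:

```python
# Sketch of pairing tweet IDs with image files named after them,
# per the folder layout described above. Extensions checked here
# (.jpg, .png) are assumptions; verify against the real images/.

import pathlib
import tempfile
from typing import Optional

def image_for(tweet_id: str, images_dir: pathlib.Path,
              exts=(".jpg", ".png")) -> Optional[pathlib.Path]:
    """Return the image path for a tweet ID, or None if missing."""
    for ext in exts:
        candidate = images_dir / f"{tweet_id}{ext}"
        if candidate.exists():
            return candidate
    return None

# Toy demonstration with a temporary directory standing in for images/.
with tempfile.TemporaryDirectory() as tmp:
    images = pathlib.Path(tmp)
    (images / "123.jpg").touch()
    found = image_for("123", images)
    missing = image_for("456", images)
print(found is not None, missing)  # True None
```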

  9. Social media search stream data sets

    • fsadata.github.io
    csv
    Updated Mar 30, 2017
    + more versions
    Cite
    (2017). Social media search stream data sets [Dataset]. https://fsadata.github.io/social-media-search-stream-data-sets/
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 30, 2017
    Description

    The FSA Communications team tracked online and social data streams for pre-determined search topics, to capture data sets for the period between March 2016 and March 2017.

  10. COVID-19 rumor dataset

    • figshare.com
    html
    Updated Jun 10, 2023
    Cite
    cheng (2023). COVID-19 rumor dataset [Dataset]. http://doi.org/10.6084/m9.figshare.14456385.v2
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    figshare
    Authors
    cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.

    Usage notes:
    - Misinformation detection, classification, tracking, prediction.
    - Misinformation sentiment analysis.
    - Rumor veracity classification, comment stance classification.
    - Rumor tracking, social network analysis.

    Data pre-processing and data analysis codes are available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see full info in our GitHub link.

    Cite us: Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.

    @article{cheng2021covid, title={A COVID-19 Rumor Dataset}, author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul}, journal={Frontiers in Psychology}, volume={12}, pages={1566}, year={2021}, publisher={Frontiers} }

  11. Augmented dataset of rumours and non-rumours for rumour detection

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    json
    Updated Oct 22, 2023
    Cite
    (2023). Augmented dataset of rumours and non-rumours for rumour detection [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7551
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 22, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains a collection of Twitter rumours and non-rumours during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash

    The data set is an augmented data set of the PHEME dataset of rumours and non-rumours based on two data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078), and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).

    PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aug-rnr-data_full.tar.bz2) contains augmented data for all six events.

    aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:

    * 2013 Boston marathon bombings: 392 rumours and 784 non-rumours

    * 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours

    * 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours

    * 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours

    * 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours

    * 2015 Germanwings plane crash: 502 rumours and 604 non-rumours

    aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:

    * 2013 Boston marathon bombings: 323 rumours and 645 non-rumours

    * 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours

    * 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours

    * 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours

    * 2014 Ferguson unrest: 471 rumours and 949 non-rumours

    * 2015 Germanwings plane crash: 375 rumours and 402 non-rumours

    The data structure follows the format of the PHEME data [2]. Each event has a directory with two subfolders, rumours and non-rumours. These two folders contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory has the set of tweets responding to that source tweet. Also, each folder contains 'aug_complete.csv' and 'reference.csv'.

    'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and filtering tweets without context (i.e., replies).

    'reference.csv' file contains manually annotated reference tweets [2, 3].
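The directory layout described above (event/rumours-or-non-rumours/tweet-ID/source-tweet and reactions) can be walked with a few lines of standard-library Python. A sketch under the assumption that the layout matches the description exactly; a toy tree is built in a temporary directory, and the real event folder names may differ:

```python
# Sketch of walking the PHEME-style layout described above:
#   <event>/<rumours|non-rumours>/<tweet_id>/{source-tweet,reactions}
# A toy tree stands in for a real event directory.

import pathlib
import tempfile

def count_threads(event_dir: pathlib.Path) -> dict:
    """Count tweet-ID folders under rumours/ and non-rumours/."""
    return {
        label: sum(1 for p in (event_dir / label).iterdir() if p.is_dir())
        for label in ("rumours", "non-rumours")
    }

with tempfile.TemporaryDirectory() as tmp:
    event = pathlib.Path(tmp) / "sydneysiege"   # hypothetical event name
    for label, ids in [("rumours", ["1", "2"]), ("non-rumours", ["3"])]:
        for tid in ids:
            (event / label / tid / "source-tweet").mkdir(parents=True)
            (event / label / tid / "reactions").mkdir()
    counts = count_threads(event)
print(counts)  # {'rumours': 2, 'non-rumours': 1}
```

The same walk, applied per event, reproduces the rumour/non-rumour statistics listed above.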

    If you use our augmented data (PHEME-Aug v2.0), please also cite:

    [1] Han S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019


    [2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.

    [3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM

  12. A Twitter Dataset of 70+ million tweets related to COVID-19

    • zenodo.org
    csv, tsv, zip
    Updated Apr 17, 2023
    Cite
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell (2023). A Twitter Dataset of 70+ million tweets related to COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.3732460
    Explore at:
    Available download formats: csv, tsv, zip
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan M. Banda; Ramya Tekumalla; Gerardo Chowell
    Description

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we filtered other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 29th, yielding over 4 million tweets a day.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.

    More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.

    As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter on re-distributing Twitter data. They need to be hydrated before use.

  13. ETHOS Hate Speech Dataset

    • kaggle.com
    Updated Jun 26, 2023
    + more versions
    Cite
    Konrad Banachewicz (2023). ETHOS Hate Speech Dataset [Dataset]. https://www.kaggle.com/datasets/konradb/ethos-hate-speech-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    From the project repo: https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset

    ETHOS: multi-labEl haTe speecH detectiOn dataSet. This repository contains a dataset for hate speech detection on social media platforms, called Ethos. There are two variations of the dataset:

    Ethos_Dataset_Binary.csv contains 998 comments, each with a label indicating the presence or absence of hate speech: 565 of them do not contain hate speech, while the remaining 433 do. Ethos_Dataset_Multi_Label.csv contains 8 labels for the 433 comments with hate speech content. These labels are violence (whether it incites violence (1) or not (0)), directed_vs_general (whether it is directed at a person (1) or a group (0)), and 6 labels for the category of hate speech: gender, race, national_origin, disability, religion, and sexual_orientation.
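A hedged sketch of splitting the binary variant into hate and non-hate comments. The column names and delimiter below are assumptions made for illustration; check the header of the repository's actual CSV before applying this to the real file:

```python
# Sketch: partition binary-labeled comments into hate / non-hate.
# Column names ("comment", "isHate") and the comma delimiter are
# assumptions; the inline rows are invented sample data.

import csv
import io

sample = io.StringIO(
    "comment,isHate\n"
    "nice people here,0\n"
    "some hateful text,1\n"
    "another neutral post,0\n"
)

hate, not_hate = [], []
for row in csv.DictReader(sample):
    target = hate if float(row["isHate"]) >= 0.5 else not_hate
    target.append(row["comment"])

print(len(hate), len(not_hate))  # 1 2
```

Thresholding at 0.5 also covers the case where labels are stored as fractional annotator agreement rather than hard 0/1 values.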

  14. Real Indian users on Github

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Archit Tyagi (2024). Real Indian users on Github [Dataset]. https://www.kaggle.com/datasets/archittyagi108/real-indian-users-on-github
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Archit Tyagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    📊 GitHub Indian Users Dataset

    Overview

    This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.

    🧑‍💻 Dataset Contents

    The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
    - Username: Unique identifier for each user (anonymized)
    - Location: City or region within India
    - Programming Languages: Most commonly used languages per user
    - Repositories: Public repositories owned and contributed to
    - Followers and Following: Social network connections within the platform
    - GitHub Join Date: Date the user joined GitHub
    - Organizations: Affiliated organizations (if publicly available)

    🌟 Source and Inspiration

    This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.

    Potential Use Cases

    1. Trend Analysis: Identify popular programming languages, tech stacks, and frameworks among Indian developers.
    2. Community Growth: Analyze how the Indian developer community has grown over time on GitHub.
    3. Social Network Analysis: Understand the follower and following patterns to uncover influential developers within the Indian tech community.
    4. Regional Insights: Discover which cities or regions in India have the most active GitHub users.
    5. Career Development: Insights for recruiters looking to identify and understand potential talent pools in India.

    💡 Ideal for

    This dataset is perfect for:
    - Data scientists looking to explore and visualize developer trends
    - Recruiters interested in talent scouting within the Indian tech ecosystem
    - Tech enthusiasts who want to explore the dynamics of India's open-source community
    - Students and educators looking for real-world data to practice analysis and modeling

  15. Data from: Two Computational Models for Analyzing Political Attention in...

    • openicpsr.org
    delimited
    Updated Mar 30, 2020
    Cite
    Libby Hemphill; Angela M. Schöpke-Gonzalez (2020). Two Computational Models for Analyzing Political Attention in Social Media [Dataset]. http://doi.org/10.3886/E118569V2
    Explore at:
    Available download formats: delimited
    Dataset updated
    Mar 30, 2020
    Dataset provided by
    University of Michigan
    Authors
    Libby Hemphill; Angela M. Schöpke-Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Using the Twitter Search API, we collected all tweets posted by official MC accounts (voting members only) during the 115th U.S. Congress, which ran from January 3, 2017, to January 3, 2019. We identified MCs' Twitter user names by combining the lists of MC social media accounts from the United States project (https://github.com/unitedstates/congress-legislators), George Washington Libraries (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UIVHQR), and the Sunlight Foundation (https://sunlightlabs.github.io/congress/index.html#legislator-spreadsheet). Throughout 2017 and 2018, we used the Twitter API to search for the user names in this composite list and retrieved the accounts' most recent tweets. Our final search occurred on January 3, 2019, shortly after the 115th U.S. Congress ended. In all, we collected 1,485,834 original tweets (i.e., we excluded retweets) from 524 accounts. The number of accounts differs from the total size of Congress because we included tweet data for MCs who resigned (e.g., Ryan Zinke) and those who joined off cycle (e.g., Rep. Conor Lamb); we were also unable to confirm accounts for every state and district.

    Twitter prohibits us from sharing the full tweet text, and so we have included tweet IDs when possible.

  16. FakeNewsNet

    • dataverse.harvard.edu
    • kaggle.com
    json, text/markdown +3
    Updated Jan 16, 2020
    Cite
    Harvard Dataverse (2020). FakeNewsNet [Dataset]. http://doi.org/10.7910/DVN/UEMMHS
    Explore at:
    Available download formats: text/x-python(2201), txt(546), json(637), text/x-python(2018), text/markdown(11574), tsv(13172624), tsv(20973070), text/x-python(4760), text/x-python(2891), text/x-python(2384), text/x-python(8673), text/x-python(1825), text/x-python(0), text/x-python(3516), json(104), tsv(8701109), tsv(3454648), text/x-python(281), text/x-python(2829)
    Dataset updated
    Jan 16, 2020
    Dataset provided by
    Harvard Dataverse
    Description

    FakeNewsNet is a multi-dimensional data repository that currently contains two datasets with news content, social context, and spatiotemporal information. The dataset is constructed using an end-to-end system, FakeNewsTracker. The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. Because of Twitter's data-sharing policy, we only share the news articles and tweet IDs as part of this dataset, and provide code in the accompanying repository to download complete tweet details, social engagements, and social networks. We describe and compare FakeNewsNet with other existing datasets in FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media (https://arxiv.org/abs/1809.01286). A more readable version of the dataset is available at https://github.com/KaiDMML/FakeNewsNet

  17. Amharic text dataset extracted from memes for hate speech detection or...

    • data.mendeley.com
    Updated Jun 8, 2023
    + more versions
    Cite
    Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
    Explore at:
    Dataset updated
    Jun 8, 2023
    Authors
    Mequanent Degu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from social media platforms such as Facebook and Telegram and then further processed. Three variants are provided: original_cleaned, which is neither stemmed nor stopword-removed; stopword_removed, in which stopwords are removed but no stemming is applied; and stemmed, which is both stemmed and stopword-removed. Stemming was done using HornMorpho, developed by Michael Gasser (available at https://github.com/hltdi/HornMorpho). All variants are normalized and free from noise such as punctuation marks and emojis.
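    The kind of noise removal described above (stripping punctuation and emojis) can be sketched as follows; this is a minimal illustration, not the authors' pipeline, and the sample string is hypothetical:

    ```python
    import re
    import string

    def remove_noise(text):
        """Strip punctuation marks and emoji/symbol characters, keeping letters and digits."""
        # Drop common emoji and symbol blocks (a rough illustrative range, not exhaustive).
        text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
        # Drop ASCII punctuation; Amharic punctuation such as "።" would need its own list.
        text = text.translate(str.maketrans("", "", string.punctuation))
        # Collapse any whitespace left behind.
        return " ".join(text.split())

    print(remove_noise("hello, world!! 🙂"))  # → hello world
    ```

    A production pipeline for Amharic would also handle Ge'ez-script punctuation and character normalization, which this sketch omits.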

  18. PHEME dataset for Rumour Detection and Veracity Classification

    • figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga (2023). PHEME dataset for Rumour Detection and Veracity Classification [Dataset]. http://doi.org/10.6084/m9.figshare.6392078.v1
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Elena Kochkina; Maria Liakata; Arkaitz Zubiaga
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.

    The data is structured as follows. Each event has a directory with two subfolders, rumours and non-rumours, which in turn contain folders named with a tweet ID. The tweet itself can be found in the 'source-tweet' directory of the tweet in question, and the 'reactions' directory holds the set of tweets responding to that source tweet. Each tweet folder also contains 'annotation.json', with information about the veracity of the rumour, and 'structure.json', with information about the structure of the conversation.
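    A minimal sketch of loading one thread from that layout (the directory names follow the description above; the sample tweets are hypothetical, and 'structure.json' is omitted for brevity):

    ```python
    import json
    import tempfile
    from pathlib import Path

    def load_thread(tweet_dir):
        """Load one rumour/non-rumour thread: source tweet, reactions, and annotation."""
        tweet_dir = Path(tweet_dir)
        source = json.loads(next(tweet_dir.glob("source-tweet/*.json")).read_text())
        reactions = [json.loads(p.read_text()) for p in tweet_dir.glob("reactions/*.json")]
        annotation = json.loads((tweet_dir / "annotation.json").read_text())
        return source, reactions, annotation

    # Build a tiny hypothetical thread matching the described layout.
    root = Path(tempfile.mkdtemp()) / "event" / "rumours" / "1234"
    (root / "source-tweet").mkdir(parents=True)
    (root / "reactions").mkdir()
    (root / "source-tweet" / "1234.json").write_text(json.dumps({"id": 1234, "text": "claim"}))
    (root / "reactions" / "5678.json").write_text(json.dumps({"id": 5678, "text": "doubt"}))
    (root / "annotation.json").write_text(json.dumps({"true": "false"}))

    source, reactions, annotation = load_thread(root)
    print(source["text"], len(reactions))  # → claim 1
    ```

    Iterating `load_thread` over every tweet-ID folder under an event's rumours and non-rumours subfolders reconstructs the full event.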

    This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619). It contains rumours related to 9 events, and each rumour is annotated with its veracity value: True, False or Unverified.

    This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.

    Code using this dataset can be found on github (https://github.com/kochkinaelena/Multitask4Veracity).

    License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.

  19. BESOCIAL: Social media archiving tools comparison - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 1, 2024
    + more versions
    Cite
    (2024). BESOCIAL: Social media archiving tools comparison - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e084520a-90ac-5506-8bfe-5d3b6ae1796b
    Explore at:
    Dataset updated
    Aug 1, 2024
    Description

    This spreadsheet aims to provide an overview of tools and frameworks for social media archiving, since a variety of open-source tools exist, implemented in different programming languages and with different features. It is an adapted version of the "Comparison of web archiving software" by the Data Together initiative, which is licensed under CC BY 4.0: https://github.com/datatogether/research/tree/master/web_archiving. We kept the table structure with its clear column definitions and added new columns; these definitions can be found in the column definitions sheet. Some columns do not apply or are not our main focus, so we left them empty for now. The observations sheet contains a description of each tool, which also served as the basis for several column values in the comparison sheet. Initially we focus on dedicated social media harvesting tools. Please note that general web archiving tools such as those listed in the Data Together table may be used to harvest social media data too. However, these tools might require a specific setup to cope with the peculiarities of social media data, hence we did not initially include them. Contributions and feedback are welcome, and we also envision contributing results back to the Data Together initiative. This spreadsheet is part of work package 1 of the BeSocial project; the research has been funded by BELSPO, the Belgian Science Policy Office. Contact information: for BeSocial in general, Fien Messens from KBR (fien.messens@kbr.be); for the tool comparison, Sven Lieber from Ghent University - IDLab (sven.lieber@ugent.be).

  20. MultiSocial

    • zenodo.org
    • data.niaid.nih.gov
    Updated Aug 20, 2025
    Cite
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a dataset (described in a paper) for a multilingual (22 languages) machine-generated text detection benchmark in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

    Disclaimer

    Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all the sensitive/personal content; thus, use the data cautiously (if affected by such content, report the issues to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    1. Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    2. Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    3. Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    4. Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    5. WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we have pseudo-randomly sampled up to 1300 texts per platform for each of the selected 22 languages (using a combination of automated approaches to detect the language): up to 300 for the test split and, where available, the remaining up to 1000 for the train split. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters and on length, resulting in about 58k human-written texts.
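    The per-(language, platform) sampling and splitting can be sketched as follows; the record contents and counts are illustrative, not the authors' code:

    ```python
    import random

    def sample_split(texts, test_n=300, train_n=1000, seed=0):
        """Sample up to test_n + train_n texts for one (language, platform) group,
        assigning the first test_n to the test split and the rest to train."""
        rng = random.Random(seed)
        picked = rng.sample(texts, min(len(texts), test_n + train_n))
        return picked[:test_n], picked[test_n:]

    # Hypothetical group with 450 texts: 300 go to test, 150 remain for train.
    texts = [f"post-{i}" for i in range(450)]
    test, train = sample_split(texts)
    print(len(test), len(train))  # → 300 150
    ```

    Fixing the seed keeps the pseudo-random selection reproducible across runs.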

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    • 'text' - a text sample,

    • 'label' - 0 for human-written text, 1 for machine-generated text,

    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

    • 'length' - word count of the given text,

    • 'source' - a string identifying the source dataset / platform of the given text,

    • 'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
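    Given those fields, selecting, say, the English test split of human-written texts can be sketched as below (the records are hypothetical examples following the field schema, not taken from the dataset):

    ```python
    def select(records, **criteria):
        """Return records whose fields match all given criteria."""
        return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

    # Hypothetical records following the field schema above.
    records = [
        {"text": "hi", "label": 0, "multi_label": "human", "split": "test", "language": "en"},
        {"text": "yo", "label": 1, "multi_label": "vicuna-13b", "split": "test", "language": "en"},
        {"text": "ahoj", "label": 0, "multi_label": "human", "split": "train", "language": "cs"},
    ]

    human_en_test = select(records, label=0, split="test", language="en")
    print(len(human_en_test))  # → 1
    ```

    The same helper filters by generator via `multi_label` or by platform via `source`.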

    ToDo Statistics (under construction)
