
 Facebook
Facebook Twitter
Twitter Email
Email
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at most minuscule 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location; repositories starred, employer and e-mail address. The task related to the graph is binary node classification - one has to predict whether the GitHub user is a web or a machine learning developer. This targeting feature was derived from the job title of each user.
Properties
Possible Tasks

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset of the paper: Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media'', by Hana Matatov, Mor Naaman, and Ofra Amir.See the Github repository for details.The dataset of the paper:Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media'', by Hana Matatov, Mor Naaman, and Ofra Amir.See the Github repository for details.

 Facebook
Facebook Twitter
Twitter Email
Email
This dataset is compiled using this dataset from GitHub.
Data Description Table
| Variable Name | Description | 
|---|---|
| movie_title | Title of the Movie | 
| duration | Duration in minutes | 
| director_name | Name of the Director of the Movie | 
| director_facebook_likes | Number of likes of the Director on his Facebook Page | 
| actor_1_name | Primary actor starring in the movie | 
| actor_1_facebook_likes | Number of likes of the Actor_1 on his/her Facebook Page | 
| actor_2_name | Other actor starring in the movie | 
| actor_2_facebook_likes | Number of likes of the Actor_2 on his/her Facebook Page | 
| actor_3_name | Other actor starring in the movie | 
| actor_3_facebook_likes | Number of likes of the Actor_3 on his/her Facebook Page | 
| num_user_for_reviews | Number of users who gave a review | 
| num_critic_for_reviews | Number of critical reviews on imdb | 
| num_voted_users | Number of people who voted for the movie | 
| cast_total_facebook_likes | Total number of facebook likes of the entire cast of the movie | 
| movie_facebook_likes | Number of Facebook likes in the movie page | 
| plot_keywords | Keywords describing the movie plot | 
| facenumber_in_poster | Number of the actor who featured in the movie poster | 
| color | Film colorization. ‘Black and White’ or ‘Color’ | 
| genres | Film categorization like ‘Animation’, ‘Comedy’, etc | 
| title_year | The year in which the movie is released (1916:2016) | 
| language | Languages like English, Arabic, Chinese, etc | 
| country | Country where the movie is produced | 
| content_rating | Content rating of the movie | 
| aspect_ratio | Aspect ratio the movie was made in | 
| movie_imdb_link | IMDB link of the movie | 
| gross | Gross earnings of the movie in Dollars | 
| budget | Budget of the movie in Dollars | 
| imdb_score | IMDB Score of the movie on IMDB | 

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) and shared between March and July of 2020.Sampling Method: hourly requests sent to Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms.NOTE: 1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset. 2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or Python library Twarc (https://github.com/DocNow/twarc). 3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV). 4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as: https://github.com/thepanacealab/covid19_twitter https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset https://github.com/echen102/COVID-19-TweetIDs

 Facebook
Facebook Twitter
Twitter Email
Email
Social Media Data for European Companies offers a powerful tool for organizations looking to enhance their decision-making through informed strategies. By providing links to social media profiles across various platforms—including LinkedIn, Facebook, Instagram, X (formerly Twitter), YouTube, TikTok, and GitHub—this solution caters to the specific needs of industries ranging from sales to recruitment. Updated monthly and fully compliant with GDPR regulations, it ensures reliability, relevancy, and trustworthiness.
LinkedIn – A leading network for businesses and professionals, ideal for B2B interactions.
Facebook – A hub for business pages, reviews, and customer engagement.
Instagram – A visually-driven platform for brand marketing and audience engagement.
X (formerly Twitter) – Known for real-time updates and customer interactions.
YouTube – A video powerhouse offering in-depth brand storytelling opportunities.
TikTok – A rapidly growing platform for creative and viral content.
GitHub – A crucial resource for tech professionals and organizations focused on open-source projects.

 Facebook
Facebook Twitter
Twitter Email
Email
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

 Facebook
Facebook Twitter
Twitter Email
Email
This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, Youtube, Pinterest, Github and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in a JSON format via REST API.

 Facebook
Facebook Twitter
Twitter Email
Email
This is a multimodal dataset used in the paper "On the Role of Images for Analyzing Claims in Social Media", accepted at CLEOPATRA-2021 (2nd International Workshop on Cross-lingual Event-centric Open Analytics), co-located with The Web Conference 2021. The four datasets are curated for two different tasks that broadly come under fake news detection. Originally, the datasets were released as part of challenges or papers for text-based NLP tasks and are further extended here with corresponding images. 1. clef_en and clef_ar are English and Arabic Twitter datasets for claim check-worthiness detection released in CLEF CheckThat! 2020 Barrón-Cedeno et al. [1]. 2. lesa is an English Twitter dataset for claim detection released by Gupta et al.[2] 3. mediaeval is an English Twitter dataset for conspiracy detection released in MediaEval 2020 Workshop by Pogorelov et al.[3] The dataset details like data curation and annotation process can be found in the cited papers. Datasets released here with corresponding images are relatively smaller than the original text-based tweets. The data statistics are as follows: 1. clef_en: 281 2. clef_ar: 2571 3. lesa: 1395 4. mediaeval: 1724 Each folder has two sub-folders and a json file data.json that consists of crawled tweets. Two sub-folders are: 1. images: This Contains crawled images with the same name as tweet-id in data.json. 2. splits: This contains 5-fold splits used for training and evaluation in our paper. Each file in this folder is a csv with two columns

 Facebook
Facebook Twitter
Twitter Email
Email
The FSA Communications team tracked online and social data streams for pre-determined search topics, to capture data sets for the period between March 2016 – March 2017.

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites. Usage note:Misinformation detection, classification, tracking, prediction.Misinformation sentiment analysis.Rumor veracity classification, comment stance classification.Rumor tracking, social network analysis.Data pre-processing and data analysis codes available at https://github.com/MickeysClubhouse/COVID-19-rumor-datasetPlease see full info in our GitHub link.Cite us:Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.@article{cheng2021covid, title={A COVID-19 Rumor Dataset}, author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul}, journal={Frontiers in Psychology}, volume={12}, pages={1566}, year={2021}, publisher={Frontiers} }

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains a collection of Twitter rumours and non-rumours during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash
The data set is an augmented data set of the PHEME dataset of rumours and non-rumours based on two data sets: the PHEME data [2] (downloaded via https://figshare.com/articles/PHEME_dataset_for_Rumour_Detection_and_Veracity_Classification/6392078), and the CrisisLexT26 data [3] (downloaded via https://github.com/sajao/CrisisLex/tree/master/data/CrisisLexT26/2013_Boston_bombings).
PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aur-rnr-data_full.tar.bz2) contains augmented data for all six events.
aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:
* 2013 Boston marathon bombings: 392 rumours and 784 non-rumours
* 2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours
* 2014 Sydney siege: 1,764 rumours and 3,530 non-rumours
* 2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours
* 2014 Ferguson unrest: 737 rumours and 1,476 non-rumours
* 2015 Germanwings plane crash: 502 rumours and 604 non-rumours
aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:
* 2013 Boston marathon bombings: 323 rumours and 645 non-rumours
* 2014 Ottawa shooting: 713 rumours and 1,420 non-rumours
* 2014 Sydney siege: 1,134 rumours and 2,262 non-rumours
* 2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours
* 2014 Ferguson unrest: 471 rumours and 949 non-rumours
* 2015 Germanwings plane crash: 375 rumours and 402 non-rumours
The data structure follows the format of the PHEME data [2]. Each event has a directory, with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found on the 'source-tweet' directory of the tweet in question, and the directory 'reactions' has the set of tweets responding to that source tweet. Also each folder contains ‘aug_complete.csv’ and ‘reference.csv'.
'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and filtering tweets without context (i.e., replies).
'reference.csv' file contains manually annotated reference tweets [2, 3].
If you use our augmented data (PHEME-Aug v2.0), please also cite:
[1] Han S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019
==============================================================================================
[2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.
[3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM

 Facebook
Facebook Twitter
Twitter Email
Email
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts as we filtered other data we were collecting for other research purposes, however, one can see the dramatic increase as the awareness for the virus spread. Dedicated data gathering started from March 11th to March 29th which yielded over 4 million tweets a day.
The data collected from the stream captures all languages, but the higher prevalence are: English, Spanish, and French. We release all tweets and retweets on the full_dataset.tsv file (70,569,368 unique tweets), and a cleaned version with no retweets on the full_dataset-clean.tsv file (13,535,912 unique tweets). There are several practical reasons for us to leave the retweets, tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster at: https://github.com/thepanacealab/covid19_twitter)
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to the terms and conditions of Twitter to re-distribute Twitter data. The need to be hydrated to be used.

 Facebook
Facebook Twitter
Twitter Email
Email
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
From the project repo: https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset
ETHOS: multi-labEl haTe speecH detectiOn dataSet. This repository contains a dataset for hate speech detection on social media platforms, called Ethos. There are two variations of the dataset:
Ethos_Dataset_Binary.csv[Ethos_Dataset_Binary.csv] contains 998 comments in the dataset alongside with a label about hate speech presence or absence. 565 of them do not contain hate speech, while the rest of them, 433, contain. Ethos_Dataset_Multi_Label.csv [Ethos_Dataset_Multi_Label.csv] which contains 8 labels for the 433 comments with hate speech content. These labels are violence (if it incites (1) or not (0) violence), directed_vs_general (if it is directed to a person (1) or a group (0)), and 6 labels about the category of hate speech like, gender, race, national_origin, disability, religion and sexual_orientation.

 Facebook
Facebook Twitter
Twitter Email
Email
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.
The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include: - Username: Unique identifier for each user (anonymized) - Location: City or region within India - Programming Languages: Most commonly used languages per user - Repositories: Public repositories owned and contributed to - Followers and Following: Social network connections within the platform - GitHub Join Date: Date the user joined GitHub - Organizations: Affiliated organizations (if publicly available)
This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.
This dataset is perfect for: - Data scientists looking to explore and visualize developer trends - Recruiters interested in talent scouting within the Indian tech ecosystem - Tech enthusiasts who want to explore the dynamics of India's open-source community - Students and educators looking for real-world data to practice analysis and modeling

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using the Twitter Search API, we collected all tweets posted by official MC accounts (voting members only) during the 115th U.S. Congress which ran January 3, 2017 to January 3, 2019. We identified MCs' Twitter user names by combining the lists of MC social media accounts from the United States project (https://github.com/unitedstates/congress-legislators), George Washington Libraries (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UIVHQR), and the Sunlight Foundation (https://sunlightlabs.github.io/congress/index.html#legislator-spreadsheet). Throughout 2017 and 2018, we used the Twitter API to search for the user names in this composite list and retrieved the accounts' most recent tweets. Our final search occurred on January 3, 2019, shortly after the 115th U.S. Congress ended. In all, we collected 1,485,834 original tweets (i.e., we excluded retweets) from 524 accounts. The accounts differ from the total size of Congress because we included tweet data for MCs who resigned (e.g., Ryan Zinke) and those who joined off cycle (e.g., Rep. Conor Lamb); we were also unable to confirm accounts for every state and district.Twitter prohibits us from sharing the full tweet text, and so we have included tweet IDs when possible.

 Facebook
Facebook Twitter
Twitter Email
Email
FakeNewsNet is a multi-dimensional data repository that currently contains two datasets with news content, social context, and spatiotemporal information. The dataset is constructed using an end-to-end system, FakeNewsTracker. The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news study. Because of the Twitter data sharing policy, we only share the news articles and tweet ids as part of this dataset and provide code along with repo to download complete tweet details, social engagements, and social networks. We describe and compare FakeNewsNet with other existing datasets in FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media (https://arxiv.org/abs/1809.01286). A more readable version of the dataset is available at https://github.com/KaiDMML/FakeNewsNet

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
the dataset is collected from social media such as facebook and telegram. the dataset is further processed. the collection are orginal_cleaned: this dataset is neither stemed nor stopword are remove: stopword_removed: in this dataset stopwords are removed but not stemmed and in stemed datset is stemmed and stopwords are removed. stemming is done using hornmorpho developed by Michael Gesser( available at https://github.com/hltdi/HornMorpho) all datasets are normalized and free from noise such as punctuation marks and emojs.

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of Twitter rumours and non-rumours posted during breaking news.
The data is structured as follows. Each event has a directory, with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found on the 'source-tweet' directory of the tweet in question, and the directory 'reactions' has the set of tweets responding to that source tweet. Also each folder contains ‘annotation.json’ which contains information about veracity of the rumour and ‘structure.json’, which contains information about structure of the conversation.
This dataset is an extension of the PHEME dataset of rumours and non-rumours (https://figshare.com/articles/PHEME_dataset_of_rumours_and_non-rumours/4010619), it contains rumours related to 9 events and each of the rumours is annotated with its veracity value, either True, False or Unverified.
This dataset was used in the paper 'All-in-one: Multi-task Learning for Rumour Verification'. For more details, please refer to the paper.
Code using this dataset can be found on github (https://github.com/kochkinaelena/Multitask4Veracity).
License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.

 Facebook
Facebook Twitter
Twitter Email
Email
This spreadsheet has the aim to provide an overview about tools and frameworks for social media archiving, as a variety of open source tools implemented in different programming languages and with different features exist. It is an adapted version of the "Comparison of web archiving software" by the Data Together Initiative which is licensed under CC BY 4.0: https://github.com/datatogether/research/tree/master/web_archiving. We kept the table structure with its clear column definitions and added new columns, these definitions can be found in the column definitions sheet; some columns do not apply or are not our main focus and thus we kept them empty for now. The observations sheet contains a description of each tool which also served as basis for several column values in the comparison sheet. Initially we focus on tools which are dedicated social media harvesting tools. Please note that general web archiving tools such as listed in the table of the Data Together initiative may be used to harvest social media data too. However, these tools might require a specific setup to cope with the peculiarities of social media data, hence we initially did not include these tools. Contributions and feedback are welcome and we also envision to contribute results back to the Data Together initiative. This spreadsheet is part of work package 1 of the BeSocial project, the conducted research has been funded by BELSPO, the Belgian Science Policy office. Contact information: — For BeSocial in general: Fien Messens from KBR fien.messens@kbr.be — For the tool comparison: Sven Lieber from Ghent University - IDLab sven.lieber@ugent.be

 Facebook
Facebook Twitter
Twitter Email
Email
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for multilingual (22 languages) machine-generated text detection benchmark in social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models by using 3 iterations of paraphrasing. The dataset has been anonymized to minimize amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.
Due to data source (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability to include machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hiden all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use if for non-commercial research purpose only.
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022, combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
ToDo Statistics (under construction)

 Facebook
Facebook Twitter
Twitter Email
Email
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at most minuscule 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location; repositories starred, employer and e-mail address. The task related to the graph is binary node classification - one has to predict whether the GitHub user is a web or a machine learning developer. This targeting feature was derived from the job title of each user.
Properties
Possible Tasks