This is a large-scale dataset collected from WhatsApp public groups. It has been created from 178 public groups containing around 45K users and 454K messages. This dataset allows researchers to ask questions like (i) Are WhatsApp groups a broadcast, multicast or unicast medium? (ii) How interactive are users, and how do these interactions emerge over time? (iii) What geographical span do WhatsApp groups have, and how does geographical placement impact interaction dynamics? (iv) What role does multimedia content play in WhatsApp groups, and how do users form interaction around multimedia content? (v) What is the potential of WhatsApp data in answering further social science questions, particularly in relation to bias and representability?
Overview: The QuitNowTXT text messaging program is designed as a resource that can be adapted to specific contexts, including those outside the United States and in languages other than English. Grounded in evidence-based practice, it is a smoking cessation intervention for smokers who are ready to quit. Although evidence supports text messaging as a platform for delivering cessation interventions, the program is expected to have its maximum effect when integrated with the other elements of a national tobacco control strategy.

The QuitNowTXT program delivers tips, motivation, encouragement, and fact-based information via unidirectional and interactive bidirectional message formats. The core of the program consists of messages sent to the user based on a quit day the user schedules. Messages are sent for up to two weeks before the quit date and up to six weeks after it. Messages assessing mood, craving, and smoking status are also sent at various intervals, and users receive replies based on the responses they submit. In addition, users can request help in dealing with cravings, stress/mood, and slips/relapses by texting specific keywords to QuitNowTXT; rotating automated messages are then returned based on the keyword. Details of the program are provided below.

Texting STOP to the service discontinues further texts. This option is offered every few messages, as required by United States cell phone carriers, and cannot be removed if the program is used within the US.

If web-based registration is used, it is suggested that users provide demographic information such as age, sex, and smoking frequency (daily or almost every day, most days, only a few days a week, only on weekends, a few times a month or less) in addition to their mobile phone number and quit date.
This information will be useful for assessing the reach of the program, as well as for identifying a possible need to develop message libraries tailored to specific groups. Using only a mobile phone-based registration system reduces barriers to participant entry but limits the collection of additional data. At a bare minimum, the quit date must be collected. At sign-up, participants have the option to choose a quit date up to one month out. Text messages start up to 14 days before the specified quit date, and users can change their quit date at any time. The program can also be modified to provide texts to users who have already quit within the last month.

One possible adaptation is a QuitNowTXT "light" version. This adaptation would allow individuals who do not have unlimited text messaging, but would still like to receive support, to participate by controlling the number of messages they receive. In the light program, users can text any of the programmed keywords without fully opting in to the program.

Program Design: The program is designed as a 14-day countdown to the quit date, followed by six weeks of daily messages. Each day within the program is identified as either a pre-quit day (Q-#) or a post-quit day (Q+#). If a user opts in fewer than 14 days before their quit date, the system begins sending messages on that day. For example, if they opt in four days before their quit date, the system sends a welcome message, recognizes that they are at Q-4 (four days before their quit date), and sends them the same message everyone else receives four days before their quit date. As users progress through the program, they receive the messages outlined in the text message library.
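The Q-#/Q+# day-labeling logic described above can be sketched as a small helper. This is an illustration only, not the program's actual implementation; the function name and the window bounds (taken from the 14-day countdown and six-week post-quit schedule described above) are our own.

```python
from datetime import date

def program_day(quit_date: date, today: date) -> str:
    """Label a calendar day relative to the user's quit date:
    Q-N before the quit date, Q+0 on it, Q+N after it."""
    delta = (today - quit_date).days
    # Program window: 14-day countdown plus six weeks (42 days) of daily messages
    if delta < -14 or delta > 42:
        return "outside program window"
    return f"Q{delta:+d}"

# A user opting in four days before their quit date is at Q-4
print(program_day(date(2024, 6, 10), date(2024, 6, 6)))  # Q-4
```

A user who opts in late simply enters the countdown mid-stream: the label depends only on the distance to the quit date, not on when they registered.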
Throughout the program, users will receive texts that cover a variety of content areas including tips, informational content, motivational messaging, and keyword responses. The frequency of messages incre
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"Digitally semi-literate" refers to people who face challenges with digital enablement and are not familiar with using smartphones for text-message communication; any progress in reducing the difficulty of smartphone usage can help them. There are over one billion such people worldwide. The dataset contains text messages in English (some of them translations of local-language messages) from semi-literate Indian users. It was derived primarily from face-to-face surveys; only 10% came from online surveys, since these users are not comfortable completing surveys online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount was generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool, or any other form, please cite the paper.
Disclaimer
Due to the nature of the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (ranging from 5% for WhatsApp to 10% for Twitter). Although we used data sources of older date (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously. The intended use is for non-commercial research purposes only.
Data Source
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we pseudo-randomly sampled up to 1,300 texts per platform for each of the 22 selected languages (detected using a combination of automated approaches): up to 300 for the test split and the remaining up to 1,000 for the train split, where available. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters and on length, resulting in about 58k human-written texts.
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
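As a quick illustration of the schema, records with the fields listed above can be partitioned by the binary 'label' and tallied per generator via 'multi_label'. The toy records below are invented for demonstration, not actual dataset samples, and the actual release may encode 'label' as strings rather than integers.

```python
from collections import Counter

# Toy records following the field schema listed above (not real samples)
rows = [
    {"text": "see you tmrw!", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 3,
     "source": "telegram", "potential_noise": 0},
    {"text": "Greetings, see you tomorrow.", "label": 1, "multi_label": "vicuna-13b",
     "split": "test", "language": "en", "length": 4,
     "source": "telegram", "potential_noise": 0},
]

# Split by the binary human/machine label, then count machine texts per generator
human = [r for r in rows if r["label"] == 0]
machine = [r for r in rows if r["label"] == 1]
per_generator = Counter(r["multi_label"] for r in machine)
```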
ToDo Statistics (under construction)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.
Description
For details on the EMOVOME database, please refer to the article:
"EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)
Content
The Zenodo repository contains four files:
EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.
labels.csv: ratings of the three non-expert annotators and the expert annotator, given independently and combined.
participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.
transcriptions.csv: transcriptions of each audio.
The repository also includes three folders:
Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:
Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.
Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.
Questionnaires: it contains three files: 1) sociodemographic_spanish.csv and sociodemographic_english.csv, the sociodemographic data of participants in Spanish and English, respectively; and 2) NEO-FFI_spanish.csv, the participants’ answers to the Spanish version of the NEO-FFI questionnaire. All three files include a column with the participant's ID to link the information.
Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition: the dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker are not included in both development and test. The development set includes 80 participants (40 female, 40 male) with the following label distribution: 241 negative, 305 neutral, and 261 positive valence; and 148 low, 328 neutral, and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the following label distribution: 57 negative, 62 neutral, and 73 positive valence; and 13 low, 70 neutral, and 109 high arousal. The files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The trained models for valence and arousal prediction are provided in the folders models_speech and models_text.
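The speaker-independent property described above is easy to verify programmatically. The sketch below assumes the partition is available as (participant_id, split) pairs; the exact column names in partitions.csv may differ.

```python
def is_speaker_independent(partitions):
    """Return True when no participant contributes samples to both the
    development and test sets (i.e., the split is speaker-independent)."""
    dev = {pid for pid, split in partitions if split == "dev"}
    test = {pid for pid, split in partitions if split == "test"}
    return dev.isdisjoint(test)

# Toy partition table: two speakers in dev, one in test
toy = [("P001", "dev"), ("P001", "dev"), ("P002", "dev"), ("P003", "test")]
```

Adding a dev sample for "P003" would make the check fail, since that speaker would then appear in both sets.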
Audio files in “Lecture” and “Emotions” are only provided to the users that complete the agreement file in section Usage Notes. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.
Usage Notes
All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data in the EMOVOME database is to be used for research purposes only; accordingly, the agreement states that recipients may not share the data with profit-making companies or organisations. They are also asked not to distribute the data to other research institutions, but instead to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms, recipients also commit to refraining from publishing the audio content in the media (such as television and radio), in scientific journals or any other publications, and on other internet platforms. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that all EMOVOME users can receive updates regarding additional materials included in the database.
The SMS Spam Collection is a set of SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
The files contain one message per line. Each line consists of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
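Assuming the original tab-separated distribution of the collection (the CSV variant instead exposes v1/v2 as named columns), one line can be split into its label and text like this:

```python
def parse_sms_line(line: str):
    """Split one collection line into (label, raw_text); the label and
    message are assumed to be tab-separated, as in the original release."""
    label, _, text = line.rstrip("\n").partition("\t")
    return label, text

# Example line in the assumed format
label, text = parse_sms_line("ham\tOk lar... Joking wif u oni\n")
```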
This corpus has been collected from free or free for research sources at the Internet:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Identifying the text of spam messages in these claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, and were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: [Web Link].
-> A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at [Web Link].
-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link].

This corpus has been used in the following academic research:
The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.
We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In email communication, messages can be sent to multiple recipients. In this dataset, nodes are email addresses at Enron, and a hyperedge is comprised of the sender and all recipients of the email. Only email addresses from a core set of employees are included. Timestamps are in ISO8601 format.
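A single email maps to one timestamped hyperedge as follows. This is a minimal sketch of the representation described above; the function and variable names are our own.

```python
from datetime import datetime

def make_hyperedge(sender: str, recipients: list, timestamp: str):
    """Build one timestamped hyperedge: the set of all addresses on the
    email (sender plus recipients) together with its ISO 8601 timestamp.
    Using a frozenset deduplicates addresses that appear more than once."""
    nodes = frozenset([sender, *recipients])
    return datetime.fromisoformat(timestamp), nodes
```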
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation.
The email dataset was later purchased by Leslie Kaelbling at MIT and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., the recipient is specified in some parseable format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
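The address normalization described above ("Doe, John" or "Mary K. Smith" to user@enron.com, falling back to no_address@enron.com) might look like the sketch below. The exact rules used in the released corpus are undocumented, so this is an assumption-laden illustration only.

```python
import re

def to_enron_address(recipient: str) -> str:
    """Illustrative normalization of a parseable recipient name into an
    @enron.com address; NOT the corpus's actual (undocumented) procedure."""
    recipient = recipient.strip()
    if not recipient:
        return "no_address@enron.com"
    if "," in recipient:                       # "Doe, John" -> john.doe
        last, first = [p.strip() for p in recipient.split(",", 1)]
        user = f"{first.split()[0]}.{last}"
    else:                                      # "Mary K. Smith" -> mary.smith
        parts = recipient.split()
        if len(parts) < 2:
            return "no_address@enron.com"
        user = f"{parts[0]}.{parts[-1]}"
    user = re.sub(r"[^a-z.]", "", user.lower())  # drop punctuation and digits
    return f"{user}@enron.com"
```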
Source: email-Enron dataset
If you use this dataset, please cite these references:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a collection of SMS messages, precisely 5,574 entries, each meticulously tagged as either 'ham' (legitimate) or 'spam'. It was compiled primarily for the purpose of SMS spam research and is invaluable for developing predictive models that accurately classify text messages. The collection offers a robust foundation for projects involving natural language processing (NLP) and binary classification tasks within the telecommunications domain.
The dataset consists of 5,574 individual SMS messages. Each message is presented on a single line, structured with two distinct columns. The data files are typically in a text-based format, suitable for processing. The distribution of messages is approximately 87% 'ham' (legitimate) messages and 13% 'spam' messages. There are 5,171 unique text values within the dataset.
This dataset is ideally suited for: * Developing and training machine learning models for SMS spam detection. * Conducting research in Natural Language Processing (NLP), particularly for text categorisation. * Implementing binary classification algorithms to distinguish between legitimate and unsolicited messages. * Exploring text analytics and pattern recognition in short message services.
The messages within this dataset originate from diverse sources, including a UK forum where users reported SMS spam, and a large collection of legitimate messages primarily from Singaporean university students. While the original collection points span specific regions, the dataset is globally relevant for research and application. A specific time range for the original data collection is not specified in the available information.
CC0
This dataset is beneficial for: * Data Scientists: To build and evaluate machine learning models for text classification and spam filtering. * Machine Learning Engineers: For developing and deploying automated spam detection systems in telecommunications. * Researchers: Engaged in natural language processing, data mining, and communication security studies. * Students: Working on academic projects that require text analysis and classification.
Original Data Source: Ham & Spam Messages Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To research the illegal activities of underground apps on Telegram, we have created a dataset called TUApps. TUApps is a progressively growing dataset of underground apps, collected from September 2023 to February 2024, consisting of a total of 1,000 underground apps and 200 million messages distributed across 71,332 Telegram channels.
In the process of creating this dataset, we followed strict ethical standards to ensure the lawful use of the data and the protection of user privacy. The dataset includes the following files:
(1) dataset.zip: We have packaged the underground app samples. The naming of Android app files is based on the SHA256 hash of the file, and the naming of iOS app files is based on the SHA256 hash of the publishing webpage.
(2) code.zip: We have packaged the code used for crawling data from Telegram and for performing data analysis.
(3) message.zip: We have packaged the messages crawled from Telegram; the files are named after the Telegram channel names.
Availability of code and messages
Upon acceptance of our research paper, the dataset containing user messages and the code used for data collection and analysis will only be made available upon request to researchers who agree to adhere to strict ethical principles and maintain the confidentiality of the data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
| Language   | Number of Samples |
|------------|-------------------|
| Java       | 153,119           |
| Ruby       | 233,710           |
| Go         | 137,998           |
| JavaScript | 373,598           |
| Python     | 472,469           |
| PHP        | 294,394           |
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Classification of online health messages. The dataset has 487 annotated messages taken from Medhelp, an online health forum with several health communities (https://www.medhelp.org/). It was built in a master's thesis entitled "Automatic categorization of health-related messages in online health communities" within the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto. It expands a dataset created in a previous work [see Relation metadata] whose objective was to propose a classification scheme for analyzing messages exchanged in online health forums.

A website was built to allow the classification of additional messages collected from Medhelp. After using a Python script to scrape the five most recent discussions from popular forums (https://www.medhelp.org/forums/list), we sampled 285 messages from them to annotate. Each message was classified three times by anonymous raters into 11 categories between April 2022 and the end of May 2022. For each message, the rater picked the categories associated with the message and its emotional polarity (positive, neutral, or negative).

The dataset is organized in two CSV files: one containing the 855 (= 3 × 285) classifications collected via crowdsourcing (CrowdsourcingClassification.csv) and the other containing the 487 messages with their final, consensual classifications (FinalClassification.csv). The readMe file provides detailed information about the two .csv files.
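One plausible way to derive a consensual label from three crowdsourced classifications of a message is a simple majority vote. This is our own reading of how the final labels could be computed, not necessarily the thesis's exact procedure.

```python
from collections import Counter

def consensus(labels, min_agreement=2):
    """Return the category chosen by at least `min_agreement` of the three
    raters for one message, or None when no category reaches agreement."""
    counts = Counter(labels)
    winner, n = counts.most_common(1)[0]
    return winner if n >= min_agreement else None
```

Messages where all three raters disagree would then need a tie-breaking rule or a further annotation round.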
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains 523 human feedback messages sent to daily smokers and vapers who were preparing to quit smoking/vaping with a virtual coach.
Study
Daily smokers and vapers recruited through the online crowdsourcing platform Prolific interacted with the text-based virtual coach Kai in up to five sessions between 1 February and 19 March 2024. The sessions were 3-5 days apart. In each session, participants were assigned a new preparatory activity for quitting smoking (e.g., listing reasons for quitting smoking, envisioning one's desired future self after quitting smoking, doing a breathing exercise). Between sessions, participants had a 20% chance of receiving a feedback message from one of two human coaches. More information on the study can be found in the Open Science Framework (OSF) pre-registration: https://doi.org/10.17605/OSF.IO/78CNR. The implementation of the virtual coach Kai can be found here: https://doi.org/10.5281/zenodo.11102861.
Feedback messages
All feedback messages were written by one of two Master's students in psychology. The two human coaches were directed to craft messages incorporating feedback, argument, and either a suggestion or reinforcement. They were also instructed to connect with individuals by referencing aspects of their lives, express empathy toward those with low confidence, and provide reinforcement when people were motivated.
When writing the feedback, the human coaches had access to data on people's baseline smoking and physical activity behavior (i.e., smoking/vaping frequency, weekly exercise amount, existence of previous quit attempts of at least 24 hours, and the number of such quit attempts in the last year), introduction texts from the first session with the virtual coach, the previous preparatory activity (i.e., activity formulation, effort spent on the activity and experience with it, return likelihood), current state (i.e., self-efficacy, perceived importance of preparing for quitting, human feedback appreciation), and the new activity formulation. Notably, the human coaches only had access to anonymized versions of the introduction texts and activity experience responses (e.g., names were removed). Except for the free-text responses describing participants' experiences with their previous activity and their introduction texts, all of this information is provided together with the feedback messages. For the previous and new activities, we provide only the titles, not the entire formulations that the human coaches had access to.
Before sending the messages to participants on Prolific, we added a greeting (i.e., "Best wishes, Karina & Goda on behalf of the Perfect Fit Smoking Cessation Team"), a disclaimer that the messages were not medical advice, and a link to confirm having read the message at the end. We also added "This is your feedback message from your human coaches Karina and Goda for preparing to quit [smoking/vaping]:" at the start of the message.
The human coaches approved publishing these feedback messages.
Additional data from the study
Additional data from the study, such as participants' free-text descriptions of their experiences with their activities and their introductions from the first session with the virtual coach, will also be published and linked to the OSF pre-registration of the study.
In the case of questions, please contact Nele Albers (n.albers@tudelft.nl) or Willem-Paul Brinkman (w.p.brinkman@tudelft.nl).
This dataset captures engagement patterns within the GDG Babcock data community, specifically focusing on the Data & AI Track. It is structured into two main files: one for message data and another for member-specific metrics. The message data includes details such as timestamps, usernames, and various derived features like message quality and word count. The member data provides insights into total messages sent, active days, and a classification of user activity levels based on multiple engagement factors. This dataset is designed to enable the analysis of user participation, message frequency, and behavioural trends within an online community. It can be used to identify trends in message frequency across different times, build models to predict user activity, conduct text analysis on message content, and investigate the relationship between message length and user activity.
Message Data File: * Date: The date the message was sent, in YYYY/MM/DD format. * Username: The identifier for the user who sent the message. * Hour: The hour during which the message was sent (in 24-hour format, ranging from 0-23). * Month: The month when the message was sent. * Quality: A derived measure of message quality, often based on the number of non-stopwords. * Weekday: The day of the week when the message was sent. * Weekend: A boolean indicator (True/False) if the message was sent during the weekend. * Wordcount: The total number of words in the message. * Message: The actual content of the message sent by the user.
Member Data File: * Username: The unique identifier for the user. * Total Messages: The total number of messages sent by the user. * Active Days: The number of days the user has been active in the group chat. * Weekend Activity: A boolean indicator (True/False) if the user is more active on weekends. * Activity Level: A classification of the user's activity level (e.g., High, Medium, Low) based on engagement metrics.
The dataset is typically provided as data files, commonly in CSV format. It consists of two distinct files: one for message-level data and another for member-level aggregated data. The message data file contains approximately 1,275 records based on aggregated date and other attribute counts. The exact number of records for the member data file is not specified but represents unique users within the community.
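The member-level schema above (total messages, active days, an activity-level class) lends itself to a small worked example. The snippet below is a minimal, self-contained sketch: the sample rows, the column names, and the scoring thresholds are all illustrative assumptions, not the dataset's actual classification rule.

```python
import csv
from io import StringIO

# Hypothetical sample mirroring the member data file schema described above;
# real column names and values may differ from the published CSV.
sample = """Username,Total Messages,Active Days,Weekend Activity
alice,120,45,False
bob,14,6,True
carol,3,2,False
"""

def activity_level(total_messages, active_days):
    """Toy classification rule (assumed, not the dataset's own) that
    combines message volume and active days into High/Medium/Low."""
    score = total_messages + 2 * active_days
    if score >= 100:
        return "High"
    if score >= 20:
        return "Medium"
    return "Low"

reader = csv.DictReader(StringIO(sample))
levels = {row["Username"]: activity_level(int(row["Total Messages"]),
                                          int(row["Active Days"]))
          for row in reader}
print(levels)  # {'alice': 'High', 'bob': 'Medium', 'carol': 'Low'}
```

The same `csv.DictReader` pattern applies directly to the message-level file by swapping in its column names.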
This dataset is ideal for: * Analysing user activity and engagement in online discussions. * Identifying trends in message frequency across different times of the day and week. * Building predictive models for user activity levels and engagement patterns. * Conducting sentiment analysis or text analysis on message content. * Investigating the relationship between message content length and user activity.
The dataset focuses on the GDG Babcock data community's Data & AI Track. It has a global regional scope. The time range for the collected data is from 2024-01-01 to 2024-12-23, covering approximately one year of community engagement. There are no specific notes on data availability for certain groups or years outside of this community and timeframe.
CC BY-SA
This dataset is suitable for: * Data Scientists and Analysts interested in community engagement and behavioural trends. * Researchers studying online communities, social dynamics, and communication patterns. * Community Managers looking to understand and improve engagement within their platforms. * Academics for educational purposes and case studies in data science and analytics. * Developers building tools for community management or engagement prediction.
Original Data Source: GDG Community Chat Dataset
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a creative collection of 7,000 one-on-one conversations, engineered to explore a diverse range of dialogue modes [1]. It allows for the exploration of conversations that exude personality, demonstrate empathy, and showcase knowledge [1]. Each record is meticulously structured with fields such as personas, additional context, previous utterance, free messages, guided messages, and suggestions, providing a rich foundation for stimulating topics and unique dialogues [1]. It is an invaluable resource for anyone looking to train and validate conversational models and delve into the capabilities of dynamic dialogue systems [1].
The dataset is structured with several key columns, each providing distinct information about the conversations [1, 2]: * Personas: Contains detailed information about the individuals or roles involved in the conversation [2]. This field has 960 unique values [2]. * Additional Context: Provides extra contextual information pertinent to the conversation [2]. * previous_utterance: Records the preceding statement in the dialogue [2]. * context: Offers contextual details for the conversation flow [2]. This field contains 975 unique values [3]. * free_messages: Includes unconstrained messages exchanged within the conversation [2]. This field contains 980 unique values [3]. * guided_messages: Features messages that may follow a specific guidance or structure [2]. This field contains 980 unique values [3]. * suggestions: Contains recommended messages, particularly useful for building knowledge-based conversations [1, 2]. This field contains 980 unique values [3].
The dataset is formatted for easy integration into machine learning frameworks, typically provided as CSV files (e.g. test.csv), with train, validation, and test splits available [1, 2]. It comprises 7,000 individual conversations [1]. Each conversation record is uniformly structured, making it suitable for training and evaluating models [1]. The data quality is rated 5 out of 5 [4].
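To make the record structure concrete, the sketch below models one conversation with the fields listed earlier and derives simple (input, target) training pairs. The field names follow the dataset description, but the record type, sample values, and pairing rule are illustrative assumptions rather than a verified schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record type mirroring the column layout described above.
@dataclass
class Conversation:
    personas: List[str]
    additional_context: str
    previous_utterance: List[str]
    free_messages: List[str]
    guided_messages: List[str]
    suggestions: List[str]

def to_training_pairs(conv):
    """Pair each free message with the guided response that follows it,
    one simple way to derive (input, target) pairs for a dialogue model."""
    return list(zip(conv.free_messages, conv.guided_messages))

conv = Conversation(
    personas=["I love hiking.", "I work as a chef."],
    additional_context="Outdoor cooking",
    previous_utterance=["Have you ever cooked on a campfire?"],
    free_messages=["Yes, last summer in the Rockies."],
    guided_messages=["That sounds amazing, what did you make?"],
    suggestions=["Campfire recipes often use cast iron."],
)
pairs = to_training_pairs(conv)
```

In practice the same fields could feed a sequence-to-sequence model, with `personas` and `additional_context` prepended to the input.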
This dataset is ideally suited for several applications [1]: * Training and validating conversational AI models, enhancing their ability to handle dynamic dialogues [1]. * Generating creative responses by leveraging the personas and additional context fields [1]. * Developing knowledge-based conversational agents by utilising information from the suggestions field [1]. * Building chatbots that offer personalised responses and empathetic support to users, drawing on the personas and free_messages fields [1]. * Exploring the nuances of dialogue modes, including personality expression, empathy, and knowledge demonstration [1].
The dataset is indicated to have a GLOBAL region coverage [4]. Specific time ranges or demographic scopes are not detailed in the provided sources.
CC0
This dataset is particularly beneficial for: * Data Scientists and Machine Learning Engineers: For developing and refining conversational AI models [1]. * NLP Researchers: To study dialogue systems, text mining, and the dynamics of human-like conversations [1]. * Chatbot Developers: For creating more sophisticated and human-centric conversational agents [1]. * Academics: For research into areas such as artificial intelligence, natural language processing, and human-computer interaction [1].
Original Data Source: Blended Skill Talk (1 On 1 Conversations)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The results of a survey of bibliometric and altmetric end-users, run between February and March 2018, inviting them to suggest ways in which suppliers of bibliometric and altmetric data might improve their services. There were 42 respondents and 149 data points. Data include the tools regularly used by respondents, demographic data, and free-text comments. The data have been coded and analysed, and are written up in Gadd, E.A. & Rowlands, I. (2018) How can bibliometric and altmetric vendors improve? Messages from the end-user community. Insights Journal. [In Press]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is designed to facilitate the training of machine learning models for classifying SMS messages as either spam or not spam, often referred to as 'ham'. It comprises a collection of real, English, and non-encoded SMS messages, each meticulously labelled to indicate its status as legitimate or unsolicited. This makes it particularly valuable for research into mobile phone spam, enabling the development of automated tools for identification and blocking, as well as providing a foundation for studying the characteristics of spam messages and devising strategies for avoidance.
The dataset is typically provided in a CSV file format, such as train.csv. It contains 5,574 individual SMS messages. The messages are structured with two key fields: the message text itself and its corresponding label (ham or spam).
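Since the dataset's stated purpose is training spam/ham classifiers, a minimal worked example helps show the two-field layout in use. The sketch below trains a tiny Naive Bayes classifier by hand on an invented four-row corpus; the sample messages and the uniform-prior assumption are illustrative, not drawn from the actual collection.

```python
import math
from collections import Counter

# Tiny invented corpus in the (label, message) layout described above;
# the real file has 5,574 such rows.
train = [
    ("ham",  "Are we still meeting for lunch today"),
    ("ham",  "Can you call me when you get home"),
    ("spam", "WINNER You have won a free prize call now"),
    ("spam", "Free entry claim your cash prize now"),
]

def tokenize(text):
    return text.lower().split()

# Per-class token counts for a minimal Naive Bayes sketch
counts = {"ham": Counter(), "spam": Counter()}
totals = {"ham": 0, "spam": 0}
for label, msg in train:
    toks = tokenize(msg)
    counts[label].update(toks)
    totals[label] += len(toks)

vocab = set(counts["ham"]) | set(counts["spam"])

def classify(text):
    """Naive Bayes with add-one smoothing and uniform class priors."""
    scores = {}
    for label in ("ham", "spam"):
        logp = 0.0
        for tok in tokenize(text):
            p = (counts[label][tok] + 1) / (totals[label] + len(vocab))
            logp += math.log(p)
        scores[label] = logp
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # prints 'spam' on this toy corpus
```

On the full dataset, a library implementation (e.g. a TF-IDF vectoriser plus multinomial Naive Bayes) would follow the same structure with better accuracy.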
This dataset covers SMS messages globally. The messages are in English, representing real and non-encoded content. While a specific time range for data collection isn't provided, it is a public set collected for mobile phone spam research.
CC0
Original Data Source: SMS Spam Collection (Text Classification)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed an application and solution approach (using this dataset) for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as a one-tap or one-click way of responding to various types of queries raised by faculty members and students in a university. The Office of Academic Affairs (OAA), the Office of Student Life (OSL), and the Information Technology Helpdesk (ITD) are support functions within a university that receive hundreds of email messages on a daily basis. Email is still the mode of communication these departments use most frequently. A large percentage of the emails they receive are frequent, commonly repeated queries or requests for information. Responding to every query by manually typing is a tedious and time-consuming task. Furthermore, a large percentage of emails and their responses consist of short messages. For example, the IT support department in our university receives several emails about Wi-Fi not working, someone needing help with a projector, or needing an HDMI cable or remote slide changer. Another example is emails from students asking the Office of Academic Affairs to add or drop courses, which they cannot do directly. The dataset consists of email messages of the kind generally received by ITD, OAA, and OSL at Ashoka University. The dataset also contains intermediate results from machine learning experiments conducted on it.
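One natural baseline for the response-suggestion task described above is to match an incoming query against a bank of template queries and return the associated canned reply. The sketch below does this with Jaccard similarity over word sets; the template queries, the replies, and the matching rule are all illustrative assumptions in the spirit of the system, not the authors' actual method.

```python
# Hypothetical query-to-reply templates; the real system's responses
# and matching logic may differ.
templates = {
    "wifi not working in my room":
        "Please restart your router; if the issue persists, ITD will visit your location.",
    "need an hdmi cable for the projector":
        "You can collect an HDMI cable from the ITD helpdesk counter.",
    "request to add or drop a course":
        "OAA has noted your request; the change will reflect within two working days.",
}

def tokens(text):
    return set(text.lower().split())

def suggest_reply(query):
    """Return the canned reply whose template query overlaps most with
    the incoming message (Jaccard similarity over word sets)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    q = tokens(query)
    best = max(templates, key=lambda t: jaccard(q, tokens(t)))
    return templates[best]

reply = suggest_reply("the wifi is not working in the lab")
print(reply)
```

A production system would likely replace the word-set overlap with TF-IDF or embedding similarity, but the one-click flow (query in, suggested reply out) is the same.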