18 datasets found
  1. WhatsApp, Doc? Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 3, 2018
    Cite
    Kiran Garimella; Gareth Tyson (2018). WhatsApp, Doc? Dataset [Dataset]. https://paperswithcode.com/dataset/whatsapp-doc
    Explore at:
    Dataset updated
    Apr 3, 2018
    Authors
    Kiran Garimella; Gareth Tyson
    Description

    This is a large-scale dataset collected from WhatsApp public groups. It has been created from 178 public groups containing around 45K users and 454K messages. This dataset allows researchers to ask questions like (i) Are WhatsApp groups a broadcast, multicast or unicast medium? (ii) How interactive are users, and how do these interactions emerge over time? (iii) What geographical span do WhatsApp groups have, and how does geographical placement impact interaction dynamics? (iv) What role does multimedia content play in WhatsApp groups, and how do users form interaction around multimedia content? (v) What is the potential of WhatsApp data in answering further social science questions, particularly in relation to bias and representability?

  2. QuitNowTXT Text Messaging Library

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Feb 22, 2025
    + more versions
    Cite
    National Cancer Institute (NCI), National Institutes of Health (NIH) (2025). QuitNowTXT Text Messaging Library [Dataset]. https://catalog.data.gov/dataset/quitnowtxt-text-messaging-library
    Explore at:
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    National Cancer Institute (NCI), National Institutes of Health (NIH)
    Description

    Overview: The QuitNowTXT text messaging program is designed as a resource that can be adapted to specific contexts, including those outside the United States and in languages other than English. Grounded in evidence-based practices, the program is a smoking cessation intervention for smokers who are ready to quit. Although evidence supports text messaging as a platform for delivering cessation interventions, the program is expected to have its maximum effect when integrated into other elements of a national tobacco control strategy.

    The QuitNowTXT program delivers tips, motivation, encouragement, and fact-based information via unidirectional and interactive bidirectional message formats. The core of the program consists of messages sent to the user based on a quit day the user schedules. Messages are sent for up to four weeks before the quit date and up to six weeks after it. Messages assessing mood, craving, and smoking status are also sent at various intervals, and the user receives replies based on the response they submit. In addition, users can request help dealing with cravings, stress/mood, and slips/relapses by texting specific keywords to the service; rotating automated messages are then returned based on the keyword. Details of the program are provided below.

    Texting STOP to the service discontinues further texts. This option is offered every few messages, as required by United States cell phone providers, and cannot be removed if the program is used within the US.

    If web-based registration is used, it is suggested that users provide demographic information such as age, sex, and smoking frequency (daily or almost every day, most days, only a few days a week, only on weekends, a few times a month or less) in addition to their mobile phone number and quit date.
    This information will be useful for assessing the reach of the program and for identifying a possible need to develop libraries for specific groups. Using only a mobile phone-based registration system reduces barriers to entry into the program but limits the collection of additional data; at a bare minimum, the quit date must be collected. At sign-up, participants can choose a quit date up to one month out. Text messages start up to 14 days before the specified quit date, and users can change their quit date at any time. The program can also be modified to provide texts to users who have already quit within the last month.

    One possible adaptation is a QuitNowTXT "light" version, which would allow individuals who do not have unlimited text messaging, but would still like to receive support, to participate by controlling the number of messages they receive. In the light program, users can text any of the programmed keywords without fully opting in.

    Program Design: The program is designed as a 14-day countdown to the quit date, followed by six weeks of daily messages. Each day in the program is identified as either pre-quit (Q-#, # days before the quit date) or post-quit (Q+#). If a user opts in fewer than 14 days before their quit date, the system begins sending messages on that day. For example, a user who opts in four days before their quit date receives a welcome message, is recognized as being at Q-4 (four days before their quit date), and then receives the message that everyone else receives four days before their quit date. As users progress through the program, they receive the messages outlined in the text message library.
Throughout the program, users will receive texts that cover a variety of content areas including tips, informational content, motivational messaging, and keyword responses. The frequency of messages incre
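    The Q-day countdown described above can be sketched in a few lines; quit_day_label is a hypothetical helper for illustration, not part of the QuitNowTXT materials:

```python
from datetime import date

def quit_day_label(today: date, quit_date: date) -> str:
    """Label a program day relative to the quit date, e.g. Q-4 or Q+10."""
    delta = (today - quit_date).days
    return f"Q+{delta}" if delta >= 0 else f"Q-{-delta}"

# A user who opts in four days before their quit date starts at Q-4.
print(quit_day_label(date(2025, 3, 1), date(2025, 3, 5)))  # Q-4
```

    The same label drives which library message is sent, so a late opt-in simply joins the countdown at the matching day.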

  3. digitally semi-literate text message dataset

    • data.mendeley.com
    Updated Aug 11, 2021
    + more versions
    Cite
    Prawaal Sharma (2021). digitally semi-literate text message dataset [Dataset]. http://doi.org/10.17632/4b53nj78tv.8
    Explore at:
    Dataset updated
    Aug 11, 2021
    Authors
    Prawaal Sharma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Digitally semi-literate refers to people who face challenges in digital enablement and are not familiar with using smartphones for text-message communication; over one billion people worldwide fall into this group, and any progress in reducing the difficulty of smartphone usage can help them. The dataset contains text messages in English (some of which are translations of local-language messages) from semi-literate Indian users. It was derived primarily from face-to-face surveys; only about 10% came from online surveys, since these users are not comfortable completing them.

  4. MultiSocial

    • data.niaid.nih.gov
    Updated Oct 4, 2024
    Cite
    Moro, Robert (2024). MultiSocial [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13846151
    Explore at:
    Dataset updated
    Oct 4, 2024
    Dataset provided by
    Kopal, Jakub
    Srba, Ivan
    Moro, Robert
    Macko, Dominik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a benchmark dataset (described in a paper) for multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount was generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

    Disclaimer

    Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% on WhatsApp to 10% on Twitter). Although we used data sources of an older date (lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously. The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    'text' - a text sample,

    'label' - 0 for human-written text, 1 for machine-generated text,

    'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    'language' - the ISO 639-1 language code identifying the detected language of the given text,

    'length' - word count of the given text,

    'source' - a string identifying the source dataset / platform of the given text,

    'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
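    To make the schema concrete, here is a minimal sketch of filtering records carrying these fields; the records are invented and the dataset's actual serialization format is not assumed:

```python
# Invented records with the fields listed above; the real file format
# (e.g. JSON lines vs. CSV) is an assumption, not specified here.
records = [
    {"text": "hello there", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 2,
     "source": "telegram", "potential_noise": 0},
    {"text": "generated sample", "label": 1, "multi_label": "vicuna-13b",
     "split": "test", "language": "en", "length": 2,
     "source": "twitter", "potential_noise": 0},
]

# Keep clean test-split samples for evaluating a detector,
# and measure the share of human-written texts.
test_clean = [r for r in records
              if r["split"] == "test" and r["potential_noise"] == 0]
human_ratio = sum(r["label"] == 0 for r in records) / len(records)
```
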

    ToDo Statistics (under construction)

  5. Emotional Voice Messages (EMOVOME) database

    • data.niaid.nih.gov
    Updated Jun 13, 2024
    Cite
    Gómez-Zaragozá, Lucía (2024). Emotional Voice Messages (EMOVOME) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6453063
    Explore at:
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Parra Vargas, Elena
    Gómez-Zaragozá, Lucía
    Alcañiz Raya, Mariano
    Naranjo, Valery
    Marín-Morales, Javier
    del Amor, Rocío
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.

    Description

    For details on the EMOVOME database, please refer to the article:

    "EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (Pre-print available at https://doi.org/10.48550/arXiv.2403.02167.)

    Content

    The Zenodo repository contains four files:

    EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.

    labels.csv: ratings of the three non-experts and the expert annotator, independently and combined.

    participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.

    transcriptions.csv: transcriptions of each audio.

    The repository also includes three folders:

    Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:

    Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.

    Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.

    Questionnaires: it contains three files: sociodemographic_spanish.csv and sociodemographic_english.csv hold the sociodemographic data of participants in Spanish and English, respectively, including the demographic information; NEO-FFI_spanish.csv includes the participants’ answers to the Spanish version of the NEO-FFI questionnaire. All three files include a column with the participant's ID to link the information.

    Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition. Particularly, the dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker are not included in both development and test. The development set includes 80 participants (40 female, 40 male) containing the following distribution of labels: 241 negative, 305 neutral and 261 positive valence; and 148 low, 328 neutral and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the distribution of labels that follows: 57 negative, 62 neutral and 73 positive valence; and 13 low, 70 neutral and 109 high arousal. Files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The actual trained models for valence and arousal prediction are provided in folders models_speech and models_text.
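    A speaker-independent split like the one above can be sketched in plain Python (a toy illustration with invented speaker IDs, not the partition shipped in partitions.csv):

```python
import random

def speaker_independent_split(samples, dev_frac=0.8, seed=0):
    """Split (speaker_id, sample) pairs so no speaker appears in both sets."""
    speakers = sorted({spk for spk, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_dev = int(len(speakers) * dev_frac)
    dev_speakers = set(speakers[:n_dev])
    dev = [s for s in samples if s[0] in dev_speakers]
    test = [s for s in samples if s[0] not in dev_speakers]
    return dev, test

# Ten invented speakers with three messages each.
samples = [(f"spk{i}", f"msg{j}") for i in range(10) for j in range(3)]
dev, test = speaker_independent_split(samples)
```

    Splitting by speaker rather than by sample is what prevents a model from memorizing voice identity instead of emotion.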

    Audio files in “Lecture” and “Emotions” are only provided to the users that complete the agreement file in section Usage Notes. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.

    Usage Notes

    All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.

  6. SMS Spam Collection Dataset

    • kaggle.com
    • opendatalab.com
    zip
    Updated Dec 2, 2016
    + more versions
    Cite
    UCI Machine Learning (2016). SMS Spam Collection Dataset [Dataset]. https://www.kaggle.com/uciml/sms-spam-collection-dataset
    Explore at:
    Available download formats: zip (215934 bytes)
    Dataset updated
    Dec 2, 2016
    Dataset authored and provided by
    UCI Machine Learning
    Description

    Context

    The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains a single set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

    Content

    The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

    This corpus was collected from free or free-for-research sources on the Internet:

    • A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a hard, time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
    • A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University, and were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: [Web Link].
    • A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at [Web Link].
    • The SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link]. This corpus has been used in prior academic research.

    Acknowledgements

    The original dataset can be found here. The creators note that if you find the dataset useful, please cite the previous paper and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

    We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

    Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

    Inspiration

    • Can you use this dataset to build a prediction model that will accurately classify which texts are spam?
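    As a starting point for that question, a toy multinomial Naive Bayes over the v1/v2 columns might look like this (invented training rows for illustration; not the approach from the cited paper):

```python
from collections import Counter
import math

def train_nb(rows):
    """Train a tiny multinomial Naive Bayes over (label, text) pairs."""
    counts = {"ham": Counter(), "spam": Counter()}
    priors = Counter()
    for label, text in rows:
        priors[label] += 1
        counts[label].update(text.lower().split())
    return counts, priors

def predict(model, text):
    """Pick the label with the highest Laplace-smoothed log-likelihood."""
    counts, priors = model
    vocab = set(counts["ham"]) | set(counts["spam"])
    best, best_lp = None, -math.inf
    for label in ("ham", "spam"):
        lp = math.log(priors[label] / sum(priors.values()))
        total = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / total)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

rows = [("ham", "are we meeting for lunch"),
        ("ham", "see you at home tonight"),
        ("spam", "win a free prize call now"),
        ("spam", "free entry claim your prize")]
model = train_nb(rows)
print(predict(model, "free prize now"))  # spam
```
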
  7. email-Enron

    • zenodo.org
    json
    Updated Nov 19, 2023
    Cite
    Nicholas Landry; Nicholas Landry (2023). email-Enron [Dataset]. http://doi.org/10.5281/zenodo.10155819
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Landry; Nicholas Landry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This is a temporal hypergraph dataset: a sequence of timestamped hyperedges, where each hyperedge is a set of nodes. In email communication, messages can be sent to multiple recipients; in this dataset, nodes are email addresses at Enron, and a hyperedge comprises the sender and all recipients of an email. Only email addresses from a core set of employees are included. Timestamps are in ISO 8601 format.

    This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation.

    The email dataset was later purchased by Leslie Kaelbling at MIT and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., the recipient is specified in some parseable format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
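    The timestamped-hyperedge structure can be illustrated with a toy example and a union-find pass to recover connected components (the node names and the layout here are invented, not the dataset's actual JSON schema):

```python
# Toy hyperedges: (ISO 8601 timestamp, set of addresses).
hyperedges = [
    ("2001-05-10T09:15:00", {"alice@enron.com", "bob@enron.com", "carol@enron.com"}),
    ("2001-05-11T14:02:00", {"bob@enron.com", "dave@enron.com"}),
    ("2001-06-01T08:30:00", {"erin@enron.com"}),  # message with no other party
]

def connected_components(edges):
    """Union-find over nodes; nodes sharing a hyperedge are merged."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, nodes in edges:
        nodes = list(nodes)
        for n in nodes:
            find(n)  # register every node, even in single-node hyperedges
        for n in nodes[1:]:
            parent[find(n)] = find(nodes[0])
    comps = {}
    for n in parent:
        comps.setdefault(find(n), set()).add(n)
    return sorted(map(len, comps.values()), reverse=True)

sizes = connected_components(hyperedges)  # component sizes, largest first
```
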

    Statistics

    Some basic statistics of this dataset are:

    • number of nodes: 148
    • number of timestamped hyperedges: 10,885
    • distribution of the connected components (component size × count): one component of 143 nodes, and 5 isolated nodes

    Source of original data

    Source: email-Enron dataset

    References

    If you use this dataset, please cite these references:

  8. Ham & Spam SMS Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Ham & Spam SMS Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/a57d0ce7-c5e7-4048-8036-b502e1ed73ae
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Telecommunications & Network Data
    Description

    This dataset is a collection of SMS messages, precisely 5,574 entries, each meticulously tagged as either 'ham' (legitimate) or 'spam'. It was compiled primarily for the purpose of SMS spam research and is invaluable for developing predictive models that accurately classify text messages. The collection offers a robust foundation for projects involving natural language processing (NLP) and binary classification tasks within the telecommunications domain.

    Columns

    • v1: This column contains the label for each SMS message, indicating whether it is 'ham' (legitimate) or 'spam'. It comprises two distinct classes.
    • v2: This column holds the raw text content of the SMS message. It is the core textual data for analysis.

    Distribution

    The dataset consists of 5,574 individual SMS messages. Each message is presented on a single line, structured with two distinct columns. The data files are typically in a text-based format, suitable for processing. The distribution of messages is approximately 87% 'ham' (legitimate) messages and 13% 'spam' messages. There are 5,171 unique text values within the dataset.

    Usage

    This dataset is ideally suited for:

    • Developing and training machine learning models for SMS spam detection.
    • Conducting research in Natural Language Processing (NLP), particularly for text categorisation.
    • Implementing binary classification algorithms to distinguish between legitimate and unsolicited messages.
    • Exploring text analytics and pattern recognition in short message services.

    Coverage

    The messages within this dataset originate from diverse sources, including a UK forum where users reported SMS spam, and a large collection of legitimate messages primarily from Singaporean university students. While the original collection points span specific regions, the dataset is globally relevant for research and application. A specific time range for the original data collection is not specified in the available information.

    License

    CC0

    Who Can Use It

    This dataset is beneficial for:

    • Data Scientists: to build and evaluate machine learning models for text classification and spam filtering.
    • Machine Learning Engineers: for developing and deploying automated spam detection systems in telecommunications.
    • Researchers: engaged in natural language processing, data mining, and communication security studies.
    • Students: working on academic projects that require text analysis and classification.

    Dataset Name Suggestions

    • SMS Message Spam-Ham Classification
    • Text Message Spam Detection Dataset
    • Mobile SMS Content Classifier
    • Ham & Spam SMS Dataset
    • Short Message Service Categorisation Data

    Attributes

    Original Data Source: Ham & Spam Messages Dataset

  9. TUApps

    • zenodo.org
    zip
    Updated May 16, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). TUApps [Dataset]. http://doi.org/10.5281/zenodo.11201267
    Explore at:
    Available download formats: zip
    Dataset updated
    May 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To research the illegal activities of underground apps on Telegram, we have created a dataset called TUApps. TUApps is a progressively growing dataset of underground apps, collected from September 2023 to February 2024, consisting of a total of 1,000 underground apps and 200 million messages distributed across 71,332 Telegram channels.
    In the process of creating this dataset, we followed strict ethical standards to ensure the lawful use of the data and the protection of user privacy. The dataset includes the following files:
    (1) dataset.zip: We have packaged the underground app samples. The naming of Android app files is based on the SHA256 hash of the file, and the naming of iOS app files is based on the SHA256 hash of the publishing webpage.
    (2) code.zip: We have packaged the code used for crawling data from Telegram and for performing data analysis.
    (3) message.zip: We have packaged the messages crawled from Telegram, the files are named after the names of the channels in Telegram.
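    The SHA256-based naming convention for app samples can be sketched as follows (the .apk extension is an assumption for Android samples; the dataset's actual file names may omit extensions):

```python
import hashlib

def sample_filename(app_bytes: bytes, ext: str = ".apk") -> str:
    """Name a sample after the SHA256 hash of its contents, as in dataset.zip."""
    return hashlib.sha256(app_bytes).hexdigest() + ext

name = sample_filename(b"example app bytes")
```

    Content-addressed names like this deduplicate identical samples and make each file independently verifiable.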
    Availability of code and messages
    Upon acceptance of our research paper, the dataset containing user messages and the code used for data collection and analysis will only be made available upon request to researchers who agree to adhere to strict ethical principles and maintain the confidentiality of the data.

  10. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories whose licenses permit redistribution. We cover six programming languages: Java, Python, Go, JavaScript, PHP and Ruby. The commit messages, in natural language, are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples, produced using extensive quality-focused filtering techniques (e.g. excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve data quality and eliminate noise. Due to the original repository selection, we are restricted to the aforementioned programming languages. It was important to us, however, to provide a range of programming languages to accommodate any changes in the task due to the degree of hardware-relatedness of a language. The dataset is provided as a large CSV file containing all samples, with the following fields: Diff, Commit Message, Hash, Project, Split.
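    The CSV layout described above can be exercised with a miniature in-memory example (the exact column headers are an assumption based on the field list; check the released file before relying on them):

```python
import csv
import io

# Miniature stand-in for the CommitBench CSV; headers follow the field list
# (Diff, Commit Message, Hash, Project, Split) but are assumed here.
raw = io.StringIO(
    "Diff,Commit Message,Hash,Project,Split\n"
    "-old +new,Fix off-by-one in parser,abc123,example/repo,train\n"
    "+added test,Add regression test,def456,example/repo,test\n"
)
rows = list(csv.DictReader(raw))
train = [r for r in rows if r["Split"] == "train"]
```
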

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. For the number of samples for different programming languages, see the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down, or if someone with a software project in the dataset deletes their repository, some instances may become non-reproducible.

    LIMITATIONS

While our filters are meant to ensure high quality for each data sample, we cannot guarantee that only low-quality examples were removed, nor that our extensive filtering methods catch all low-quality examples; some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more), as well as our focus on English commit messages. Some people may write commit messages only in their own language, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax aligns better with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/ and was updated from the community Version 1 Markdown template by Leon Derczynski.

  11.

    Classification of online health messages - Dataset - CKAN

    • rdm.inesctec.pt
    Updated Jul 6, 2022
    Cite
    (2022). Classification of online health messages - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2022-008
    Explore at:
    Dataset updated
    Jul 6, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

The dataset has 487 annotated messages taken from Medhelp, an online health forum with several health communities (https://www.medhelp.org/). It was built for a master's thesis entitled "Automatic categorization of health-related messages in online health communities" in the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto. It expands a dataset created in a previous work [see Relation metadata] whose objective was to propose a classification scheme for analyzing messages exchanged in online health forums.

    A website was built to allow the classification of additional messages collected from Medhelp. After using a Python script to scrape the five most recent discussions from popular forums (https://www.medhelp.org/forums/list), we sampled 285 messages from them to annotate. Each message was classified three times by anonymous raters into 11 categories, from April 2022 until the end of May 2022. For each message, the rater picked the categories associated with the message and its emotional polarity (positive, neutral, or negative).

    Our dataset is organized in two CSV files: one containing information on the 855 (= 3 × 285) classifications collected via crowdsourcing (CrowdsourcingClassification.csv) and the other containing the 487 messages with their final, consensual classifications (FinalClassification.csv). The readMe file provides detailed information about the two CSV files.
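The dataset's exact consensus procedure is documented in its readMe file; purely as an illustration, a simple majority vote over the three ratings per message could be computed as below. The message IDs, category names, and the two-of-three threshold are all invented assumptions, not taken from the dataset:

```python
from collections import Counter

# Hypothetical ratings: each message was classified three times; here a
# category counts as consensual if at least two of three raters chose it.
ratings = {
    "msg-01": [{"support", "question"}, {"support"}, {"support", "experience"}],
    "msg-02": [{"question"}, {"information"}, {"question"}],
}

def consensus(category_sets, threshold=2):
    """Keep categories chosen by at least `threshold` raters."""
    counts = Counter(c for s in category_sets for c in s)
    return {c for c, n in counts.items() if n >= threshold}

final = {msg_id: consensus(sets) for msg_id, sets in ratings.items()}
print(final)  # {'msg-01': {'support'}, 'msg-02': {'question'}}
```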

  12.

    SMS Spam Collection Data Set Dataset

    • paperswithcode.com
    Updated Mar 13, 2022
    + more versions
    Cite
    (2022). SMS Spam Collection Data Set Dataset [Dataset]. https://paperswithcode.com/dataset/sms-spam-collection-data-set
    Explore at:
    Dataset updated
    Mar 13, 2022
    Description

This corpus has been collected from free or free-for-research sources on the Internet:

• A collection of 425 SMS spam messages manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages.
    • A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, and were collected from volunteers who were made aware that their contributions would be made publicly available.
    • A list of 450 SMS ham messages collected from Caroline Tag's PhD thesis.
    • The SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages.
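The subset sizes quoted above can be tallied to get the overall corpus composition:

```python
# (ham, spam) counts per source, as given in the corpus description.
subsets = {
    "Grumbletext spam": (0, 425),
    "NUS SMS Corpus ham": (3375, 0),
    "Caroline Tag thesis ham": (450, 0),
    "SMS Spam Corpus v.0.1 Big": (1002, 322),
}

ham = sum(h for h, s in subsets.values())
spam = sum(s for h, s in subsets.values())
print(ham, spam, ham + spam)  # 4827 747 5574
```

Summing gives 4,827 ham and 747 spam messages, 5,574 in total.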

  13.

    Human Feedback Messages for Preparing for Quitting Smoking: Dataset

    • data.4tu.nl
    zip
    Updated Sep 6, 2024
    Cite
    Nele Albers; Mark Neerincx; Willem-Paul Brinkman (2024). Human Feedback Messages for Preparing for Quitting Smoking: Dataset [Dataset]. http://doi.org/10.4121/7e88ca88-50e9-4e8d-a049-6266315a2ece.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Nele Albers; Mark Neerincx; Willem-Paul Brinkman
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2024 - Mar 19, 2024
    Description

    This repository contains 523 human feedback messages sent to daily smokers and vapers who were preparing to quit smoking/vaping with a virtual coach.

    Study

    Daily smokers and vapers recruited through the online crowdsourcing platform Prolific interacted with the text-based virtual coach Kai in up to five sessions between 1 February and 19 March 2024. The sessions were 3-5 days apart. In each session, participants were assigned a new preparatory activity for quitting smoking (e.g., listing reasons for quitting smoking, envisioning one's desired future self after quitting smoking, doing a breathing exercise). Between sessions, participants had a 20% chance of receiving a feedback message from one of two human coaches. More information on the study can be found in the Open Science Framework (OSF) pre-registration: https://doi.org/10.17605/OSF.IO/78CNR. The implementation of the virtual coach Kai can be found here: https://doi.org/10.5281/zenodo.11102861.

    Feedback messages

    All feedback messages were written by one of two Master's students in psychology. The two human coaches were directed to craft messages incorporating feedback, argument, and either a suggestion or reinforcement. They were also instructed to connect with individuals by referencing aspects of their lives, express empathy toward those with low confidence, and provide reinforcement when people were motivated.

When writing the feedback, the human coaches had access to data on people's baseline smoking and physical activity behavior (i.e., smoking/vaping frequency, weekly exercise amount, existence of previous quit attempts of at least 24 hours, and the number of such quit attempts in the last year), introduction texts from the first session with the virtual coach, previous preparatory activity (i.e., activity formulation, effort spent on the activity and experience with it, return likelihood), current state (i.e., self-efficacy, perceived importance of preparing for quitting, human feedback appreciation), and new activity formulation. Notably, the human coaches only had access to anonymized versions of the introduction texts and activity experience responses (e.g., names were removed). Except for the free-text responses describing participants' experiences with their previous activity and their introduction texts, all of this information is provided together with the feedback messages. For the previous and new activities, we provide only the titles, not the entire formulations that the human coaches had access to.

    Before sending the messages to participants on Prolific, we added a greeting (i.e., "Best wishes, Karina & Goda on behalf of the Perfect Fit Smoking Cessation Team"), a disclaimer that the messages were not medical advice, and a link to confirm having read the message at the end. We also added "This is your feedback message from your human coaches Karina and Goda for preparing to quit [smoking/vaping]:" at the start of the message.

    The human coaches approved publishing these feedback messages.

    Additional data from the study

    Additional data from the study such as participants' free-text descriptions of their experiences with their activities and their introductions from the first session with the virtual coach will also be published and linked to the OSF pre-registration of the study.

    In the case of questions, please contact Nele Albers (n.albers@tudelft.nl) or Willem-Paul Brinkman (w.p.brinkman@tudelft.nl).

  14.

    Online Community Chat Analytics Dataset

    • opendatabay.com
    .undefined
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Online Community Chat Analytics Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/e0da87d1-e83b-4f53-9b4a-d995d115e210
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    This dataset captures engagement patterns within the GDG Babcock data community, specifically focusing on the Data & AI Track. It is structured into two main files: one for message data and another for member-specific metrics. The message data includes details such as timestamps, usernames, and various derived features like message quality and word count. The member data provides insights into total messages sent, active days, and a classification of user activity levels based on multiple engagement factors. This dataset is designed to enable the analysis of user participation, message frequency, and behavioural trends within an online community. It can be used to identify trends in message frequency across different times, build models to predict user activity, conduct text analysis on message content, and investigate the relationship between message length and user activity.

    Columns

Message Data File:
    • Date: The date the message was sent, in YYYY/MM/DD format.
    • Username: The identifier for the user who sent the message.
    • Hour: The hour during which the message was sent (in 24-hour format, ranging from 0-23).
    • Month: The month when the message was sent.
    • Quality: A derived measure of message quality, often based on the number of non-stopwords.
    • Weekday: The day of the week when the message was sent.
    • Weekend: A boolean indicator (True/False) if the message was sent during the weekend.
    • Wordcount: The total number of words in the message.
    • Message: The actual content of the message sent by the user.

Member Data File:
    • Username: The unique identifier for the user.
    • Total Messages: The total number of messages sent by the user.
    • Active Days: The number of days the user has been active in the group chat.
    • Weekend Activity: A boolean indicator (True/False) if the user is more active on weekends.
    • Activity Level: A classification of the user's activity level (e.g., High, Medium, Low) based on engagement metrics.
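As a rough illustration of how such an Activity Level label might be derived, here is a sketch over toy member records; the scoring rule and thresholds are invented, since the dataset's actual classification combines engagement factors not fully specified here:

```python
# Hypothetical scoring: weight active days more heavily than raw volume.
# Both the formula and the cut-offs are assumptions for illustration only.
def activity_level(total_messages, active_days):
    score = total_messages + 2 * active_days
    if score >= 200:
        return "High"
    if score >= 50:
        return "Medium"
    return "Low"

members = [
    {"Username": "ada", "Total Messages": 320, "Active Days": 90},
    {"Username": "ben", "Total Messages": 40, "Active Days": 10},
    {"Username": "chi", "Total Messages": 5, "Active Days": 2},
]

for m in members:
    m["Activity Level"] = activity_level(m["Total Messages"], m["Active Days"])

print([m["Activity Level"] for m in members])  # ['High', 'Medium', 'Low']
```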

    Distribution

    The dataset is typically provided as data files, commonly in CSV format. It consists of two distinct files: one for message-level data and another for member-level aggregated data. The message data file contains approximately 1,275 records based on aggregated date and other attribute counts. The exact number of records for the member data file is not specified but represents unique users within the community.

    Usage

This dataset is ideal for:
    • Analysing user activity and engagement in online discussions.
    • Identifying trends in message frequency across different times of the day and week.
    • Building predictive models for user activity levels and engagement patterns.
    • Conducting sentiment analysis or text analysis on message content.
    • Investigating the relationship between message content length and user activity.

    Coverage

    The dataset focuses on the GDG Babcock data community's Data & AI Track. It has a global regional scope. The time range for the collected data is from 2024-01-01 to 2024-12-23, covering approximately one year of community engagement. There are no specific notes on data availability for certain groups or years outside of this community and timeframe.

    License

    CC BY-SA

    Who Can Use It

This dataset is suitable for:
    • Data Scientists and Analysts interested in community engagement and behavioural trends.
    • Researchers studying online communities, social dynamics, and communication patterns.
    • Community Managers looking to understand and improve engagement within their platforms.
    • Academics for educational purposes and case studies in data science and analytics.
    • Developers building tools for community management or engagement prediction.

    Dataset Name Suggestions

    • GDG Community Engagement Data
    • Online Community Chat Analytics Dataset
    • Data & AI Community Activity Log
    • User Engagement Chat Dataset

    Attributes

    Original Data Source: GDG Community Chat Dataset

  15.

    Blended Skill Talk Conversational Dataset

    • opendatabay.com
    .undefined
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Blended Skill Talk Conversational Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/abb8614a-0f90-4c6e-9cbe-ef04ee0b23bc
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset offers a creative collection of 7,000 one-on-one conversations, engineered to explore a diverse range of dialogue modes [1]. It allows for the exploration of conversations that exude personality, demonstrate empathy, and showcase knowledge [1]. Each record is meticulously structured with fields such as personas, additional context, previous utterance, free messages, guided messages, and suggestions, providing a rich foundation for stimulating topics and unique dialogues [1]. It is an invaluable resource for anyone looking to train and validate conversational models and delve into the capabilities of dynamic dialogue systems [1].

    Columns

The dataset is structured with several key columns, each providing distinct information about the conversations [1, 2]:
    • Personas: Contains detailed information about the individuals or roles involved in the conversation [2]. This field has 960 unique values [2].
    • Additional Context: Provides extra contextual information pertinent to the conversation [2].
    • previous_utterance: Records the preceding statement in the dialogue [2].
    • context: Offers contextual details for the conversation flow [2]. This field contains 975 unique values [3].
    • free_messages: Includes unconstrained messages exchanged within the conversation [2]. This field contains 980 unique values [3].
    • guided_messages: Features messages that may follow a specific guidance or structure [2]. This field contains 980 unique values [3].
    • suggestions: Contains recommended messages, particularly useful for building knowledge-based conversations [1, 2]. This field contains 980 unique values [3].
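One way to picture a single conversation record with the fields above; every value here is invented for illustration, and the real dataset ships these as columns in its CSV splits:

```python
# Hypothetical record mirroring the documented fields; all content invented.
record = {
    "personas": ["I love hiking.", "I work as a chef."],
    "additional_context": "Hiking",
    "previous_utterance": ["Do you get outside much?", "Every weekend, if I can."],
    "free_messages": ["I found a great new trail last week."],
    "guided_messages": ["As a chef, I always pack good trail snacks."],
    "suggestions": ["Maybe ask what food they bring on hikes?"],
}

print(sorted(record))
```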

    Distribution

    The dataset is formatted for easy integration into machine learning frameworks, typically provided in test.csv format, with validation, train, and test splits available [1, 2]. It comprises 7,000 individual conversations [1]. Each conversation record is uniformly structured, making it suitable for training and evaluating models [1]. The data quality is rated 5 out of 5 [4].

    Usage

This dataset is ideally suited for several applications [1]:
    • Training and validating conversational AI models, enhancing their ability to handle dynamic dialogues [1].
    • Generating creative responses by leveraging the personas and additional context fields [1].
    • Developing knowledge-based conversational agents by utilising information from the suggestions field [1].
    • Building chatbots that offer personalised responses and empathetic support to users, drawing on the personas and free message fields [1].
    • Exploring the nuances of dialogue modes, including personality expression, empathy, and knowledge demonstration [1].

    Coverage

    The dataset is indicated to have a GLOBAL region coverage [4]. Specific time ranges or demographic scopes are not detailed in the provided sources.

    License

    CC0

    Who Can Use It

This dataset is particularly beneficial for:
    • Data Scientists and Machine Learning Engineers: For developing and refining conversational AI models [1].
    • NLP Researchers: To study dialogue systems, text mining, and the dynamics of human-like conversations [1].
    • Chatbot Developers: For creating more sophisticated and human-centric conversational agents [1].
    • Academics: For research into areas such as artificial intelligence, natural language processing, and human-computer interaction [1].

    Dataset Name Suggestions

    • Blended Skill Talk Conversational Dataset
    • One-on-One Dialogue Collection
    • AI Chat Conversation Dataset
    • Empathetic Dialogue Data

    Attributes

    Original Data Source: Blended Skill Talk (1 On 1 Conversations)

  16.

    How can bibliometric and altmetric vendors improve? Messages from the...

    • repository.lboro.ac.uk
    xlsx
    Updated May 30, 2023
    Cite
    Elizabeth Gadd; Ian Rowlands (2023). How can bibliometric and altmetric vendors improve? Messages from the end-user community. Dataset. [Dataset]. http://doi.org/10.17028/rd.lboro.7022213.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Loughborough University
    Authors
    Elizabeth Gadd; Ian Rowlands
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The results of a survey of bibliometric and altmetric end-users run between Feb-Mar 2018 inviting them to suggest ways in which suppliers of bibliometric and altmetric data might improve their services. There were 42 respondents and 149 data points. Data include tools regularly used by respondents, demographic data and the free-text comments. The data have been coded and analysed. The data have been written up at Gadd, E.A. & Rowlands, I. (2018) How can bibliometric and altmetric vendors improve? Messages from the end-user community. Insights Journal. [In Press]

  17.

    Text Message Spam/Ham Dataset

    • opendatabay.com
    .undefined
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Text Message Spam/Ham Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/8d282a44-3d61-42d9-ae40-b749521de738
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset is designed to facilitate the training of machine learning models for classifying SMS messages as either spam or not spam, often referred to as 'ham'. It comprises a collection of real, English, and non-encoded SMS messages, each meticulously labelled to indicate its status as legitimate or unsolicited. This makes it particularly valuable for research into mobile phone spam, enabling the development of automated tools for identification and blocking, as well as providing a foundation for studying the characteristics of spam messages and devising strategies for avoidance.

    Columns

    • sms: This column contains the actual text content of the SMS message. (String)
    • label: This column provides the classification for each SMS message, indicating whether it is 'ham' (legitimate) or 'spam' (unsolicited). (String)
      • There are 5171 unique SMS message texts.
      • Label counts: 4,827 messages are labelled as 'ham' and 747 messages are labelled as 'spam'.

    Distribution

    The dataset is typically provided in a CSV file format, such as train.csv. It contains 5574 individual SMS messages. The messages are structured with two key fields: the message text itself and its corresponding label (ham or spam).

    Usage

    • Training machine learning models to effectively distinguish between legitimate and spam SMS messages.
    • Developing tools capable of automatically identifying and blocking unwanted messages on mobile phones.
    • Conducting academic or industry research into the evolving nature and characteristics of spam messages.
    • Formulating strategies and preventative measures for users to identify and avoid unsolicited communications.
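The first usage above, training a model to distinguish ham from spam, can be sketched with a tiny bag-of-words Naive Bayes classifier. The five training messages below are invented; a real model would train on the dataset's 5,574 labelled SMS messages:

```python
import math
from collections import Counter

# Toy labelled messages standing in for the real SMS corpus.
train = [
    ("ham", "are we still on for lunch today"),
    ("ham", "call me when you get home"),
    ("spam", "win a free prize call now"),
    ("spam", "free entry claim your prize today"),
    ("ham", "thanks see you at home"),
]

label_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in label_counts}
for label, text in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words do not zero out a class.
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("claim your free prize now"))  # spam
print(predict("see you at lunch"))           # ham
```

Production spam filters typically add better tokenisation, character n-grams, and held-out evaluation on the dataset's own splits.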

    Coverage

    This dataset covers SMS messages globally. The messages are in English, representing real and non-encoded content. While a specific time range for data collection isn't provided, it is a public set collected for mobile phone spam research.

    License

CC0

    Who Can Use It

    • Data Scientists and Machine Learning Engineers: For developing and refining text classification models.
    • Mobile Security Developers: To create or enhance spam filtering applications.
    • Academic Researchers: For studies on unsolicited communication patterns and natural language processing.
    • Analysts: To gain insights into the properties of spam messages.

    Dataset Name Suggestions

    • SMS Spam Collection
    • SMS Message Classifier Data
    • Mobile Spam Detection Dataset
    • Text Message Spam/Ham Data

    Attributes

    Original Data Source: SMS Spam Collection (Text Classification)

  18. Email Dataset for Automatic Response Suggestion within a University

    • figshare.com
    pdf
    Updated Feb 4, 2018
    Cite
    Aditya Singh; Dibyendu Mishra; Sanchit Bansal; Vinayak Agarwal; Anjali Goyal; Ashish Sureka (2018). Email Dataset for Automatic Response Suggestion within a University [Dataset]. http://doi.org/10.6084/m9.figshare.5853057.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Feb 4, 2018
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Aditya Singh; Dibyendu Mishra; Sanchit Bansal; Vinayak Agarwal; Anjali Goyal; Ashish Sureka
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

We have developed an application and solution approach (using this dataset) for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as a one-tap or one-click solution for responding to various types of queries raised by faculty members and students in a university. The Office of Academic Affairs (OAA), the Office of Student Life (OSL) and the Information Technology Helpdesk (ITD) are support functions within a university which receive hundreds of email messages on a daily basis. Email is still the most frequently used mode of communication by these departments. A large percentage of the emails received by these departments are frequent and commonly used queries or requests for information. Responding to every query by manually typing is a tedious and time-consuming task. Furthermore, a large percentage of emails and their responses consist of short messages. For example, the IT support department in our university receives several emails about Wi-Fi not working, or from someone needing help with a projector, or requiring an HDMI cable or remote slide changer. Another example is emails from students requesting the Office of Academic Affairs to add and drop courses, which they cannot do directly. The dataset consists of email messages generally received by ITD, OAA and OSL at Ashoka University. The dataset also contains intermediate results from the machine learning experiments.

