CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset captures suggestions for improving a game, sourced from a Discord server where members can submit their ideas. Users have the ability to upvote these suggestions, and any suggestion accumulating 35 or more upvotes is forwarded to the game's developers. The primary purpose of compiling this dataset was for natural language processing (NLP) practice, but it also offers opportunities for applying statistical analysis to understand factors that contribute to a suggestion being sent to the developers. The dataset provides valuable insights into community feedback and engagement.
The dataset is typically provided in a tabular format, such as a CSV file. It contains a total of 158 individual records or rows, each representing a unique game improvement suggestion. The data includes suggestion dates ranging from 1st April 2022 to 26th April 2022. The character count for suggestions varies widely, from 1 to 1831 characters. Categories include 'Feature' (53%), 'Item' (32%), and 'Other' (16%). A small percentage of suggestions (16%) were reported to the developers.
This dataset is ideally suited for various analytical tasks. It can be used for natural language processing (NLP) exercises, such as sentiment analysis of suggestions, topic modelling, or text summarisation. Additionally, it is suitable for statistical analysis to identify correlations between suggestion characteristics (e.g., length, category, keywords) and their likelihood of receiving upvotes or being reported to the developers. Game developers, community managers, and data analysts can utilise this data to gain actionable insights into player feedback and prioritise development efforts.
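A minimal sketch of such a statistical analysis using pandas. The column names below are hypothetical, since the actual CSV schema is not documented here; with the real file you would use pd.read_csv and the dataset's own columns:

```python
import pandas as pd

# Hypothetical records standing in for the real CSV (columns are assumptions).
df = pd.DataFrame({
    "suggestion": ["add a map", "new sword item pls", "fix lag"],
    "category": ["Feature", "Item", "Other"],
    "reported": [1, 0, 0],  # 1 = forwarded to the developers (35+ upvotes)
})

# Character count of each suggestion
df["length"] = df["suggestion"].str.len()

# Share of suggestions reported to the developers, per category
report_rate = df.groupby("category")["reported"].mean()
print(report_rate)

# Simple check: do longer suggestions get reported more often?
print(df["length"].corr(df["reported"]))
```

The same groupby/correlation pattern applies directly once the real 158-row CSV is loaded in place of the toy frame.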
The dataset's geographic coverage is global, as the Discord server from which suggestions were drawn is accessible worldwide. The time range for the suggestions captured spans from 1st April 2022 to 26th April 2022. The demographic scope includes any member of the specific Discord server who submitted a suggestion. There are no specific notes on data availability limitations for particular groups or years within the provided information.
CC0
Original Data Source: Grounded Suggestions via Discord Server
Results of a survey of 403 Discord users. Servers and respondents were selected at random; many people declined to take part, but some agreed. Only Russian-speaking users were surveyed. When creating the survey, I notified users that after completion I would analyze the data and publish the results publicly. No personal user data was collected.
In general, you can see that I like Discord, as well as a certain psychological focus in the questions. I have no experience doing something like this, but I tried to do everything as correctly as possible.
This version is translated into English. The data has also been cleaned, with unneeded content removed or changed.
The original extracted versions (in .srt and .ass format) are also included in this release (which Kaggle, for some reason, decompressed >:U).
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets and can be streamed as a TextLineDataset in the TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaning algorithms to make sentences read more like natural language than like formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)
The files are all compressed to save space and are compatible with TensorFlow datasets. You can initialize a dataset function as follows:
import functools
import os

import tensorflow as tf

# nq_tsv_path is assumed to be a dict mapping split names to directories.
def dataset_fn_local(split, shuffle_files=False):
    global nq_tsv_path
    del shuffle_files
    # Load lines from the gzip-compressed TSV files as examples.
    files_to_read = [os.path.join(nq_tsv_path[split], filename)
                     for filename in os.listdir(nq_tsv_path[split])
                     if filename.startswith(split)]
    print(f"Split {split} contains {len(files_to_read)} files. "
          f"First 10: {files_to_read[:10]}")
    # Drop empty lines, shuffle, and parse each line as a tab-separated pair.
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP")
    ds = ds.filter(lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
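For reference, the on-disk format that pipeline expects is one question/answer pair per line, tab-separated and gzip-compressed. A minimal, standard-library-only sketch of writing and reading one such file (file name and contents are illustrative):

```python
import gzip
import os
import tempfile

# Write and read back one "question<TAB>answer" line, gzip-compressed,
# mirroring the TSV layout the TensorFlow pipeline streams.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "train.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("What anime is this from?\tIt's from a subtitle corpus.\n")

    with gzip.open(path, "rt", encoding="utf-8") as f:
        question, answer = f.readline().rstrip("\n").split("\t")

print(question)
print(answer)
```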
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope there are people out there willing to find, add, and clean data, and to help grow this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Dataset is described in Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access (2020), a study that aims to map and assess the extent of cryptocurrency manipulations within and across the online ecosystems of Twitter, Telegram, and Discord. Starting from tweets mentioning cryptocurrencies, we leveraged and followed invite URLs from platform to platform, building the invite-link network, in order to study the invite link diffusion process.
Please refer to the paper below for more details.
Nizzoli, L., Tardelli, S., Avvenuti, M., Cresci, S., Tesconi, M. & Ferrara, E. (2020). Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access (2020).
This dataset is composed of:
~16M tweet ids shared between March and May 2019, mentioning at least one of the 3,822 cryptocurrencies (cashtags) provided by the CryptoCompare public API;
~13k nodes of the invite-link network, i.e., the information about the Telegram/Discord channels and Twitter users involved in the cryptocurrency discussion (e.g., id, name, audience, invite URL);
~62k edges of the invite-link network, i.e., the information about the flow of invites (e.g., source id, target id, weight).
With such information, one can easily retrieve the content of channels and messages through Twitter, Telegram, and Discord public APIs.
Please refer to the README file for more details about the fields.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount is generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Disclaimer
Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we used data sources of an older date (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously. The intended use is for non-commercial research purposes only.
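As an illustration of the kind of anonymization described above (the dataset's actual procedure is not published here, so the patterns below are a rough sketch, not the authors' code): replace e-mail addresses, @-style usernames, and phone-like numbers with placeholder tokens.

```python
import re

# Rough, illustrative patterns -- not the dataset's actual anonymizer.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION = re.compile(r"@\w+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    # Order matters: strip e-mails first so their "@" is not
    # misread as a username mention.
    text = EMAIL.sub("[EMAIL]", text)
    text = MENTION.sub("[USER]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(anonymize("Contact jane.doe@example.com or @jane at +1 555 123 4567"))
```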
Data Source
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we pseudo-randomly sampled up to 1,300 texts per platform for each of the selected 22 languages (using a combination of automated approaches to detect the language): up to 300 for the test split and the remaining up to 1,000 for the train split, where available. This process yielded 61,592 human-written texts, which were further filtered based on the occurrence of certain characters and on their length, resulting in about 58k human-written texts.
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
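As a sketch of how these fields might be used, here is a selection of English, human-written, train-split texts without flagged noise. The records are made up for illustration (the hosting location of the dataset is not given here), but they follow the field schema above:

```python
# Hypothetical records following the documented field schema.
records = [
    {"text": "hello there", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 2,
     "source": "telegram", "potential_noise": 0},
    {"text": "generated reply", "label": 1, "multi_label": "vicuna-13b",
     "split": "train", "language": "en", "length": 2,
     "source": "telegram", "potential_noise": 0},
    {"text": "ahoj", "label": 0, "multi_label": "human",
     "split": "test", "language": "cs", "length": 1,
     "source": "whatsapp", "potential_noise": 0},
]

# English human-written train-split texts without potential noise
subset = [r["text"] for r in records
          if r["label"] == 0 and r["language"] == "en"
          and r["split"] == "train" and r["potential_noise"] == 0]
print(subset)
```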
ToDo Statistics (under construction)
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Midjourney User Prompts & Generated Images (250k)
Dataset Summary
General Context
Midjourney is an independent research lab whose broad mission is to "explore new mediums of thought". In 2022, they launched a text-to-image service that, given a natural language prompt, produces visual depictions that are faithful to the description. Their service is accessible via a public Discord server, where users interact with a Midjourney bot. When issued… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/midjourney-texttoimage.
Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discordance between multiple modalities, such as facial and vocal expressions and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly within real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H.
This paper introduces the first Behavioural Ambivalence/Hesitancy (BAH) dataset, collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, spanning a range of ages and ethnicities. Through our web platform, we recruited participants to answer 7 questions, some designed to elicit A/H, while recording themselves via webcam and microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours, 1.5 hours of which contain A/H. Our behavioural team annotated timestamped segments to indicate where A/H occurs, and provided frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces for each frame and a variety of participant metadata.
Additionally, this paper provides preliminary benchmarking results for baseline models on BAH at frame- and video-level recognition in mono- and multi-modal setups. It also includes results for zero-shot prediction and for personalization using unsupervised domain adaptation. The limited performance of the baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Midjourney is an independent research lab whose broad mission is to "explore new mediums of thought". In 2022, they launched a text-to-image service that, given a natural language prompt, produces visual depictions that are faithful to the description. Their service is accessible via a public Discord server: users issue a query in natural language, and the Midjourney bot returns AI-generated images that follow the given description. The raw dataset (with Discord messages) can be found on… See the full description on the dataset page: https://huggingface.co/datasets/succinctly/midjourney-prompts.
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:
from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Ai4Privacy Community
Join our community at https://discord.gg/FmzWshaaQT to help build open datasets for privacy masking.
Purpose and Features
Previously the world's largest open dataset for privacy masking; it has since been succeeded by pii-masking-300k. The purpose of the dataset is to train models to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.
Dataset Card for Unsupervised Peoples Speech
Dataset Description
Dataset Summary
The Unsupervised Peoples Speech Dataset is a compilation of audio files extracted from Archive.org, licensed for academic and commercial usage under CC-BY and CC-BY-SA. It includes more than one million hours of audio with a diverse set of speakers.
Point of Contact: MLCommons Datasets Discord
Dataset Structure
This dataset is a collection of audio… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/unsupervised_peoples_speech.