This dataset was created by Emperiums
http://www.gnu.org/licenses/agpl-3.0.html
This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small.
Want your server to be a part of the next release? Want access to the raw data? Contact me at contact@j-fan.ml
The raw data for this version contained 51,826,268 messages.
[v1] 5,103,788 (regex) + 696,161 (toxic) of 51,826,268 messages were removed, a fraction of about 0.11 (roughly 11.2%).
[v2] 6,737,000 (regex) + 946,778 (toxic) of 90,841,631 messages were removed, a fraction of about 0.08 (roughly 8.5%).
The dataset's final size is 46,026,319 (v1) + 64,345,492 (v2) [110,371,811] messages across 456,810 (v1) + 750,416 (v2) [1,207,226] conversations, reduced from 89.6 GB of raw JSON data to just under 2 GB.
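The removal rates can be rechecked from the counts quoted above. The following is a small sketch of that arithmetic; the dictionary keys and the function name are chosen for illustration only. Expressed as percentages, the quoted fractions come out at roughly 11.2% (v1) and 8.5% (v2).

```python
# Counts copied from the description above; "total" is the denominator quoted there.
releases = {
    "v1": {"total": 51_826_268, "regex": 5_103_788, "toxic": 696_161},
    "v2": {"total": 90_841_631, "regex": 6_737_000, "toxic": 946_778},
}


def removal_rate(total: int, regex: int, toxic: int) -> float:
    """Fraction of messages removed by the regex and toxicity filters combined."""
    return (regex + toxic) / total


rates = {name: removal_rate(**counts) for name, counts in releases.items()}

for name, rate in rates.items():
    # Print both the raw fraction and the same value as a percentage.
    print(f"{name}: fraction {rate:.4f}, i.e. {rate:.1%} of messages removed")

# Messages left in v1 after both filters.
v1 = releases["v1"]
v1_remaining = v1["total"] - v1["regex"] - v1["toxic"]
print(f"v1 remaining: {v1_remaining:,}")
```

The v1 remainder matches the 46,026,319 messages reported for the final v1 set.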
There is a wide variety of NLP datasets that cover a huge number of different interactions between users and can be used for pretraining; Google's C4 covers web text and an extremely diverse range of data for the majority of language tasks. Reddit crawls cover structured, forum-style text. However, despite this abundance of data, there is a lack of clean long-context data specifically for conversational purposes. In a search for potential sources of data, I discovered that Discord has a long-standing history of interesting and diverse conversations, and a relatively open API. With the collaboration of a large number of Discord moderators, server owners, and members of the community, this data was successfully downloaded and cleaned.
To create a diverse, structured dataset of turn-by-turn conversation that can be used to pretrain a model oriented specifically for conversational purposes.
Files containing -detox are cleaned files that used an LSTM network to analyze each message and evaluate whether it is toxic, obscene, threatening, insulting, or identity hate.
All files were cleaned using https://github.com/JEF1056/clean-discord, mostly using the default settings.
The repo takes an automated, heuristic approach to removing unwanted, non-NLP, or toxic comments.
context.txt contains all data that has been cleaned using basic regex and some text replacement
context-pairs.txt contains pairs of data using only Discord's recent replies feature. As the feature is so new, its yield is very low. It has also been cleaned using basic regex and some text replacement.
A massive thanks to https://github.com/codemicro for working on multithreading code for the clean-discord repo!
Cite this dataset:
@misc{discord-data,
author = {Jess Fan},
title = {Discord Dataset},
contact = {jeefan@ucsc.edu, contact@j-fan.ml},
year = {2021},
howpublished = {\url{https://www.kaggle.com/jef1056/discord-data}},
note = {V5}
}
This dataset was created by Deep Sarda
https://creativecommons.org/publicdomain/zero/1.0/
Any member of the discord server can submit suggestions for improving the game. Others have the option of upvoting said suggestions. If a suggestion reaches 35 or more upvotes, it is sent to the developers. I compiled this dataset mainly to practice NLP, but there are use cases for applying statistical tests to see what has a better chance of getting sent to devs.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Top.gg is one of the most popular websites that lists bots that people can share and add to their Discord servers. It has over 10,000 listings.
This dataset has been scraped from all top.gg Top bots pages in JSON format, cleaned and converted to CSV. I have included both formats. It was collected July 29th, 2020. The included features:
The data belongs to Top.gg and shall not be used for any commercial purposes.
I may decide to update this in the future, as I've learned about other attributes that could possibly be incorporated.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Trystynpatrick
Released under CC0: Public Domain
Just a silly test for me and my friends
Results of a survey of 403 Discord users. The selection was random and the servers were random; many people declined to take part, but some agreed. Only Russian-speaking people were surveyed. When creating the survey, I notified users that after completion I was going to analyze the data and post the results in the public domain. No personal user data was collected.
In general, you can see that I like Discord, as well as the psychological focus of some of the questions. I have no experience in doing something like this, but I still tried to do everything as correctly as possible.
This version is translated into English. The data was also cleaned, and anything that wasn't needed was removed or changed.
This dataset was created by Matthew Weinberger
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Roman Matveev
Released under MIT
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains real-world messages from my Discord server, labeled to support the fine-tuning of BERT/DistilBERT base for phishing and scam detection.
This dataset was created by first collecting 80,000 raw messages from my Discord server using the Discord API. To ensure quality and relevance, messages of fewer than 3 words, messages from bots, non-text messages, and mass-duplicated messages (such as those consisting of more than 70% emojis) were removed. This filtering process reduced the dataset to fewer than 20,000 high-quality messages. Afterward, additional preprocessing was applied (as described below), and more targeted scam messages were manually added to improve model exposure to common phishing tactics and keyword variations. This ensures the dataset remains both realistic and effective for fine-tuning models used in live moderation settings.
Discord/External links → <URL>
User mentions → <USER>
Custom emojis → <EMOJI>
Discord invite links → <DISCORD_INVITE>
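The placeholder substitutions listed above, together with the short-message and bot filters, could be sketched roughly as follows. This is an illustration only and not the author's actual pipeline: the regular expressions, the function names `normalize` and `keep`, and the `is_bot` flag are all assumptions, and the emoji-ratio and duplicate filters are left out.

```python
import re

# Assumed patterns for illustration; the dataset's real rules are not given in the text.
INVITE_RE = re.compile(
    r"(?:https?://)?(?:www\.)?(?:discord\.gg|discord(?:app)?\.com/invite)/[A-Za-z0-9-]+"
)
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"<@!?\d+>")            # raw user mentions such as <@123> or <@!123>
CUSTOM_EMOJI_RE = re.compile(r"<a?:\w+:\d+>")  # raw custom emojis such as <:name:123>


def normalize(message: str) -> str:
    """Replace links, mentions and custom emojis with placeholder tokens."""
    # Invites are handled first so the generic URL rule does not swallow them.
    message = INVITE_RE.sub("<DISCORD_INVITE>", message)
    message = URL_RE.sub("<URL>", message)
    message = USER_RE.sub("<USER>", message)
    message = CUSTOM_EMOJI_RE.sub("<EMOJI>", message)
    return message


def keep(message: str, is_bot: bool = False, min_words: int = 3) -> bool:
    """Drop bot messages and messages shorter than min_words words."""
    return not is_bot and len(message.split()) >= min_words


raw = "hey <@!123456> free nitro at https://discord.gg/abc123 or https://example.com/claim <:pepe:987654321>"
cleaned = normalize(raw)
print(cleaned)
```

Applying the invite rule before the generic URL rule matters, since an invite link is also a valid URL and would otherwise be tagged `<URL>`.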
Messages were collected via the Discord API from my community server, which has been active for eight years and comprises ~11,000 members and over 20 million messages.
Traditional Discord moderation bots rely on static keyword rules set by server owners, but scammers easily evade these filters by subtly altering spellings, using homoglyphs, and more. I therefore built an NLP-powered moderation bot by fine-tuning DistilBERT base uncased on labelled chat data to recognize phishing and scam patterns beyond simple keywords. The bot is deployed and scans every incoming message in real time, automatically flagging or deleting malicious content. You can find out more here → https://github.com/wang-yuancheng/shibemod
This dataset was created by fishiv
This dataset was created by joshua
This dataset was created by MonFire
Released under Data files © Original Authors
This dataset was created by Irpanko122
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Python Discord community is an active group with interactive events from time to time. The Pixels event (inspired by Reddit's r/place project) gives members access to a heavily-limited API to place colored pixels on a black canvas, which are often overwritten by others.
This data was gathered using the Python Discord Pixels API and some Python scripting to automate the analysis of the raw data.
All pixels were placed by members of the Python Discord community.
The analyses in this dataset are licensed under CC BY-SA 4.0, meaning you can share or adapt them as long as appropriate credit is provided and you redistribute your changes under the same license.
What simple but fun statistics can you gather from this data?
This dataset was created by Landon King
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains anonymized chat records from a real university-level computer science course conducted via a Discord server. It captures authentic student interactions across multiple semesters, providing valuable insight into informal, peer-to-peer educational discussions.
To protect privacy, all usernames and personal identifiers have been replaced with randomly generated pseudonyms. The dataset includes message content, timestamps, dates, and emoji usage, preserving the conversational and temporal structure of the original data.
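As a rough illustration of the pseudonymization described above, the sketch below replaces each distinct author with a randomly generated alias while keeping the mapping stable within a run. The record fields (`author`, `content`), the `user_` prefix, and the function name are hypothetical; the dataset's actual procedure is not documented here.

```python
import secrets


def pseudonymize(records, prefix="user_"):
    """Return copies of records with each distinct author replaced by a random alias."""
    mapping = {}   # original author -> alias, so one author always gets the same alias
    used = set()   # aliases already handed out, to avoid collisions
    result = []
    for record in records:
        author = record["author"]
        if author not in mapping:
            alias = prefix + secrets.token_hex(4)
            while alias in used:
                alias = prefix + secrets.token_hex(4)
            used.add(alias)
            mapping[author] = alias
        # Copy the record so the input list is left untouched.
        result.append({**record, "author": mapping[author]})
    return result


sample = [
    {"author": "alice", "content": "Is the assignment due Friday?"},
    {"author": "bob", "content": "Yes, at midnight."},
    {"author": "alice", "content": "Thanks!"},
]
anonymized = pseudonymize(sample)
for row in anonymized:
    print(row["author"], "|", row["content"])
```

Note that this sketch only rewrites the author field; names mentioned inside message text would need separate handling.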
This resource is well-suited for a variety of research and machine learning tasks, including:
Educational dialogue analysis
Toxic language detection in academic settings
Role-based interaction modeling
Temporal pattern recognition in online discourse
All data was collected from a publicly available source and processed to ensure ethical usage and compliance with data privacy norms.
Dataset inspired by Midjourney User Prompts & Generated Images (250k)
Collected messages are only for upscale requests.
This dataset was created by Renja Grotemeyer
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Irakli P
Released under CC0: Public Domain
This dataset was created by Emperiums