63 datasets found

discord chat
kaggle.com
zip
Updated Mar 28, 2023
Cite
Emperiums (2023). discord chat [Dataset]. https://www.kaggle.com/datasets/emperiums/discord-chat
Available download formats: zip (10062 bytes)
Dataset updated
Mar 28, 2023
Authors
Emperiums
Description
Dataset

This dataset was created by Emperiums

Contents
Discord-Data
kaggle.com
zip
Updated Apr 16, 2021
Cite
Jess Fan (2021). Discord-Data [Dataset]. https://www.kaggle.com/datasets/jef1056/discord-data/code
Available download formats: zip (8155868013 bytes)
Dataset updated
Apr 16, 2021
Authors
Jess Fan
License
http://www.gnu.org/licenses/agpl-3.0.html
Description

This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small.

Want your server to be a part of the next release? Want access to the raw data? Contact me at contact@j-fan.ml

Some statistics

The raw data for this version contained 51,826,268 messages.

[v1] 5,103,788 (regex) + 696,161 (toxic) of 51,826,268 messages were removed, a fraction of about 0.11 (roughly 11%).

[v2] 6,737,000 (regex) + 946,778 (toxic) of 90,841,631 messages were removed, a fraction of about 0.08 (roughly 8%).

The dataset's final size is 46,026,319 (v1) + 64,345,492 (v2) = 110,371,811 messages across 456,810 (v1) + 750,416 (v2) = 1,207,226 conversations, reduced from 89.6 GB of raw JSON data to just under 2 GB.
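
The removal figures can be checked directly from the quoted counts. The sketch below is only an arithmetic check; the variable and function names are illustrative and not part of the dataset. It suggests the quoted 0.11 and 0.08 are fractions (about 11% and 8%) rather than percentages.

```python
# Message counts quoted in the statistics above.
V1_RAW, V1_REGEX, V1_TOXIC = 51_826_268, 5_103_788, 696_161
V2_RAW, V2_REGEX, V2_TOXIC = 90_841_631, 6_737_000, 946_778


def removed_fraction(raw: int, regex: int, toxic: int) -> float:
    """Share of raw messages dropped by the regex and toxicity filters."""
    return (regex + toxic) / raw


v1 = removed_fraction(V1_RAW, V1_REGEX, V1_TOXIC)
v2 = removed_fraction(V2_RAW, V2_REGEX, V2_TOXIC)
print(f"v1 removed: {v1:.2%}")
print(f"v2 removed: {v2:.2%}")

# Messages left in v1 after both filters; this matches the quoted v1 size.
v1_kept = V1_RAW - V1_REGEX - V1_TOXIC
print(v1_kept)  # 46026319
```

The same subtraction for v2 gives 83,157,853 rather than the listed 64,345,492; the description does not explain the difference.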

Inspiration

There is a wide variety of NLP datasets that cover a huge number of different interactions between users and can be used for pretraining; Google's C4 covers web text and an extremely diverse range of data for the majority of language tasks. Reddit crawls cover structured, forum-style text. However, despite this abundance of data, there is a lack of clean long-context data specifically for conversational purposes. In a search for potential sources of data, I discovered that Discord has a long-standing history of interesting and diverse conversations, and a relatively open API. With the collaboration of a large number of Discord moderators, server owners, and members of the community, this data was successfully downloaded and cleaned.

Goal

To create a diverse, structured dataset of turn-by-turn conversation that can be used to pretrain a model oriented specifically toward conversational purposes.

Content

Files containing -detox are cleaned files that used an LSTM network to analyze each message and evaluate whether it is toxic, obscene, threatening, insulting, or identity hate.

All files were cleaned using https://github.com/JEF1056/clean-discord, mostly with the default settings. The repo takes an automated, heuristic approach to removing unwanted, non-NLP, or toxic comments.

context.txt contains all data that has been cleaned using basic regex and some text replacement.

context-pairs.txt contains pairs of data using only Discord's recent replies feature. As the feature is so new, its yield is very low. It has also been cleaned using basic regex and some text replacement.

Acknowledgements

A massive thanks to https://github.com/codemicro for working on multithreading code for the clean-discord repo!

Cite this dataset:

@misc{discord-data,
  author = {Jess Fan},
  title = {Discord Dataset},
  contact = {jeefan@ucsc.edu, contact@j-fan.ml},
  year = {2021},
  howpublished = {\url{https://www.kaggle.com/jef1056/discord-data}},
  note = {V5}
}
discord-messages
kaggle.com
zip
Updated Dec 10, 2021
Cite
Deep Sarda (2021). discord-messages [Dataset]. https://www.kaggle.com/datasets/deepsarda/discordmessages
Available download formats: zip (2776830801 bytes)
Dataset updated
Dec 10, 2021
Authors
Deep Sarda
Description
Dataset

This dataset was created by Deep Sarda

Contents
Grounded Suggestions via Discord Server
kaggle.com
zip
Updated Apr 27, 2022
Cite
Brandon Conrady (2022). Grounded Suggestions via Discord Server [Dataset]. https://www.kaggle.com/datasets/brandonconrady/grounded-suggestions-via-discord-server
Available download formats: zip (33097 bytes)
Dataset updated
Apr 27, 2022
Authors
Brandon Conrady
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Any member of the discord server can submit suggestions for improving the game. Others have the option of upvoting said suggestions. If a suggestion reaches 35 or more upvotes, it is sent to the developers. I compiled this dataset mainly to practice NLP, but there are use cases for applying statistical tests to see what has a better chance of getting sent to devs.
10,000+ Discord Bot Listings
kaggle.com
zip
Updated Jul 30, 2020
Cite
dotslan (2020). 10,000+ Discord Bot Listings [Dataset]. https://www.kaggle.com/dotslan/discord-bots-on-topgg
Available download formats: zip (14229830 bytes)
Dataset updated
Jul 30, 2020
Authors
dotslan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Context

Top.gg is one of the most popular websites that lists bots that people can share and add to their Discord servers. It has over 10,000 listings.

Content

This dataset has been scraped from all top.gg Top bots pages in JSON format, cleaned and converted to CSV. I have included both formats. It was collected July 29th, 2020. The included features:

Bot name

short description

Number of servers

Number of votes

Tags

Bot's Website URL

Invite link

Support server

Creator

Long description

Prefix

image url

Acknowledgements

The data belongs to Top.gg and shall not be used for any commercial purposes.

Notes

I may decide to update this in the future, as I've learned about other attributes that could possibly be incorporated.
Lily Discord
kaggle.com
zip
Updated Jun 11, 2023
Cite
Trystynpatrick (2023). Lily Discord [Dataset]. https://www.kaggle.com/datasets/trystynpatrick/lily-discord/suggestions?status=pending&yourSuggestions=true
Available download formats: zip (534 bytes)
Dataset updated
Jun 11, 2023
Authors
Trystynpatrick
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Trystynpatrick

Released under CC0: Public Domain

Contents

Just a silly test for me and my friends
Discord Survey
kaggle.com
zip
Updated Oct 11, 2021
Cite
Yonko (Czeslaw Meyer) (2021). Discord Survey [Dataset]. https://www.kaggle.com/yonkotoshiro/discord-survey
Available download formats: zip (13490 bytes)
Dataset updated
Oct 11, 2021
Authors
Yonko (Czeslaw Meyer)
Description
Results of a survey of 403 Discord users. The selection of users and servers was random; many people declined to take part, but some agreed. Only Russian-speaking people were surveyed. When creating the survey, I notified users that after completion I was going to analyze the data and post the results in the public domain. No personal user data was collected.

In general, you can see that I like Discord, as well as the somewhat psychological focus of the questions. I have no experience in doing something like this, but I tried to do everything as correctly as possible.

This version is translated into English. The data was also cleaned, and anything that wasn't needed was removed or changed.
long discord
kaggle.com
zip
Updated Sep 23, 2024
Cite
Matthew Weinberger (2024). long discord [Dataset]. https://www.kaggle.com/datasets/matthewweinberger/long-discord
Available download formats: zip (281434690 bytes)
Dataset updated
Sep 23, 2024
Authors
Matthew Weinberger
Description
Dataset

This dataset was created by Matthew Weinberger

Contents
discord-rag-files
kaggle.com
zip
Updated Apr 20, 2025
Cite
Roman Matveev (2025). discord-rag-files [Dataset]. https://www.kaggle.com/datasets/matveevromanjob/discord-rag-files
Available download formats: zip (8831 bytes)
Dataset updated
Apr 20, 2025
Authors
Roman Matveev
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Roman Matveev

Released under MIT

Contents
discord-phishing-scam
kaggle.com
zip
Updated Jul 11, 2025
Cite
Shibe123 (2025). discord-phishing-scam [Dataset]. https://www.kaggle.com/datasets/shibe123/discord-phishing-scam/data
Available download formats: zip (52875 bytes)
Dataset updated
Jul 11, 2025
Authors
Shibe123
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Context

This dataset contains real-world messages from my Discord server, labeled to support the fine-tuning of BERT/distilBERT base for phishing and scam detection.

Content

This dataset was created by first collecting 80,000 raw messages from my Discord server using the Discord API. To ensure quality and relevance, short messages of fewer than 3 words, messages from bots, non-text messages, and mass-duplicated messages, such as those consisting of over 70% emojis, were removed. This filtering process reduced the dataset to fewer than 20,000 high-quality messages. Afterward, additional preprocessing was applied (as described below), and more targeted scam messages were manually added to improve model exposure to common phishing tactics and keyword variations. This ensures the dataset remains both realistic and effective for fine-tuning models used in live moderation settings.
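
The filtering rules above might be sketched roughly as follows. This is not the author's code: the message layout (`content`, `is_bot`), the emoji matcher, the choice to measure the 70% threshold over non-whitespace characters, and the case-insensitive duplicate check are all assumptions made for illustration.

```python
import re

# Rough emoji matcher (assumption): Discord custom emoji markup plus a few
# common Unicode emoji ranges. It is not an exhaustive emoji definition.
EMOJI_RE = re.compile(
    r"<a?:\w+:\d+>"
    r"|[\U0001F300-\U0001FAFF\u2600-\u27BF]"
)


def emoji_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that belong to emoji matches."""
    stripped = "".join(text.split())
    if not stripped:
        return 0.0
    emoji_chars = sum(len(m.group()) for m in EMOJI_RE.finditer(stripped))
    return emoji_chars / len(stripped)


def keep_message(content: str, is_bot: bool) -> bool:
    """Apply the bot, length, and emoji rules to a single message."""
    if is_bot:
        return False
    if len(content.split()) < 3:  # also drops empty (non-text) messages
        return False
    if emoji_ratio(content) > 0.70:
        return False
    return True


def filter_messages(messages):
    """Return kept message texts, dropping case-insensitive exact duplicates."""
    seen = set()
    kept = []
    for msg in messages:
        text = msg["content"].strip()
        if not keep_message(text, msg.get("is_bot", False)):
            continue
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    {"content": "hi there", "is_bot": False},
    {"content": "free nitro click here now", "is_bot": True},
    {"content": "😀 😀 😀 😀 k", "is_bot": False},
    {"content": "does anyone know the answer", "is_bot": False},
    {"content": "Does anyone know the answer", "is_bot": False},
]
result = filter_messages(sample)
print(result)
```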

Data Pre-processing

Discord/External links → <URL>

User mentions → <USER>

Custom emojis → <EMOJI>

Discord invite links → <DISCORD_INVITE>
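
A minimal sketch of these substitutions using regular expressions is shown below. The patterns are assumptions, not the author's implementation: they rely on Discord's usual markup for mentions (`<@id>` or `<@!id>`) and custom emojis (`<:name:id>` or `<a:name:id>`), and they replace invite links before generic URLs so that invites are not swallowed by the `<URL>` rule.

```python
import re

# (pattern, placeholder) pairs, applied in order. Invite links come first
# because they would otherwise also match the generic URL pattern.
RULES = [
    (
        re.compile(
            r"(?:https?://)?(?:www\.)?"
            r"(?:discord\.gg|discord(?:app)?\.com/invite)/[\w-]+",
            re.IGNORECASE,
        ),
        "<DISCORD_INVITE>",
    ),
    (re.compile(r"https?://\S+", re.IGNORECASE), "<URL>"),
    (re.compile(r"<@!?\d+>"), "<USER>"),
    (re.compile(r"<a?:\w+:\d+>"), "<EMOJI>"),
]


def normalize(text: str) -> str:
    """Replace links, mentions, and custom emojis with placeholder tokens."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text


example = (
    "hey <@!123456> join https://discord.gg/abc123 "
    "or see https://example.com/x <:pepe:987654>"
)
normalized = normalize(example)
print(normalized)
```

Replacing each link or mention with a fixed token keeps the structure of a message (for example, "a stranger posted an invite link") while removing the specific identifiers.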

Acknowledgements

Messages were collected via the Discord API from my community server active for eight years, comprising ~11 000 members and over 20 million messages.

Inspiration

Traditional Discord moderation bots rely on static keyword rules set by server owners, but scammers easily evade these filters by subtly altering spellings, using homoglyphs, and more. Thus, I built an NLP-powered moderation bot by fine-tuning DistilBERT base uncased on labelled chat data to recognize phishing and scam patterns beyond simple keywords. The bot is deployed and scans every incoming message in real time, automatically flagging or deleting malicious content. You can find out more here → https://github.com/wang-yuancheng/shibemod
BrandonRTalks Discord Scrape
kaggle.com
zip
Updated Jan 14, 2023
Cite
fishiv (2023). BrandonRTalks Discord Scrape [Dataset]. https://www.kaggle.com/datasets/fishiv/brandonrtalkschatbot
Available download formats: zip (10861090 bytes)
Dataset updated
Jan 14, 2023
Authors
fishiv
Description
Dataset

This dataset was created by fishiv

Contents
quinton discord logs
kaggle.com
zip
Updated Mar 15, 2022
Cite
joshua (2022). quinton discord logs [Dataset]. https://www.kaggle.com/joshuat2/quinton-discord-logs
Available download formats: zip (2774543 bytes)
Dataset updated
Mar 15, 2022
Authors
joshua
Description
Dataset

This dataset was created by joshua

Contents
MonFire
kaggle.com
zip
Updated Mar 14, 2022
Cite
MonFire (2022). MonFire [Dataset]. https://www.kaggle.com/datasets/monfire/monfire/code
Available download formats: zip (36769 bytes)
Dataset updated
Mar 14, 2022
Authors
MonFire
Description
Dataset

This dataset was created by MonFire

Released under Data files © Original Authors

Contents
discord-irfan-ta2
kaggle.com
zip
Updated Jun 13, 2024
Cite
Irpanko122 (2024). discord-irfan-ta2 [Dataset]. https://www.kaggle.com/irpanko122/discord
Available download formats: zip (58579 bytes)
Dataset updated
Jun 13, 2024
Authors
Irpanko122
Description
Dataset

This dataset was created by Irpanko122

Contents
Python Discord Pixels Analysis
kaggle.com
zip
Updated May 26, 2021
Cite
Ben Soyka (2021). Python Discord Pixels Analysis [Dataset]. https://www.kaggle.com/bsoyka3/python-discord-pixels-analysis
Available download formats: zip (61559 bytes)
Dataset updated
May 26, 2021
Authors
Ben Soyka
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

The Python Discord community is an active group with interactive events from time to time. The Pixels event (inspired by Reddit's r/place project) gives members access to a heavily-limited API to place colored pixels on a black canvas, which are often overwritten by others.

Content

This data was gathered using the Python Discord Pixels API and some Python scripting to automate the analysis of the raw data.

Acknowledgements

All pixels were placed by members of the Python Discord community.

The analyses in this dataset are licensed under CC BY-SA 4.0, meaning you can share or adapt them as long as appropriate credit is provided and you redistribute your changes under the same license.

Inspiration

What simple but fun statistics can you gather from this data?
Studbot
kaggle.com
zip
Updated Jul 25, 2024
Cite
Landon King (2024). Studbot [Dataset]. https://www.kaggle.com/datasets/landonking/studbot/code
Available download formats: zip (1863 bytes)
Dataset updated
Jul 25, 2024
Authors
Landon King
Description
Dataset

This dataset was created by Landon King

Contents
AnonymousChatLogInDiscordClass
kaggle.com
zip
Updated Jun 26, 2025
Cite
Xiujie Feng (2025). AnonymousChatLogInDiscordClass [Dataset]. https://www.kaggle.com/datasets/xiujiefeng/anonymouschatlogindiscordclass
Available download formats: zip (616832 bytes)
Dataset updated
Jun 26, 2025
Authors
Xiujie Feng
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains anonymized chat records from a real university-level computer science course conducted via a Discord server. It captures authentic student interactions across multiple semesters, providing valuable insight into informal, peer-to-peer educational discussions.

To protect privacy, all usernames and personal identifiers have been replaced with randomly generated pseudonyms. The dataset includes message content, timestamps, dates, and emoji usage, preserving the conversational and temporal structure of the original data.

This resource is well-suited for a variety of research and machine learning tasks, including:

Educational dialogue analysis

Toxic language detection in academic settings

Role-based interaction modeling

Temporal pattern recognition in online discourse

All data was collected from a publicly available source and processed to ensure ethical usage and compliance with data privacy norms.
Midjourney_discord_messgaes
kaggle.com
zip
Updated Mar 22, 2023
Cite
Krzysztof Gonia (2023). Midjourney_discord_messgaes [Dataset]. https://www.kaggle.com/datasets/krzysztofgonia/midjourney-discord-messgaes
Available download formats: zip (2493161 bytes)
Dataset updated
Mar 22, 2023
Authors
Krzysztof Gonia
Description
Dataset inspired by Midjourney User Prompts & Generated Images (250k)

Collected messages are only for upscale requests.
ProtoTech Chat Transcipt
kaggle.com
zip
Updated Dec 15, 2021
Cite
Renja Grotemeyer (2021). ProtoTech Chat Transcipt [Dataset]. https://www.kaggle.com/datasets/renjagrotemeyer/prototech-chat-transcipt
Available download formats: zip (13459899 bytes)
Dataset updated
Dec 15, 2021
Authors
Renja Grotemeyer
Description
Dataset

This dataset was created by Renja Grotemeyer

Contents
Midjourney V5 Prompts and Links
kaggle.com
zip
Updated May 7, 2023
Cite
Irakli P (2023). Midjourney V5 Prompts and Links [Dataset]. https://www.kaggle.com/datasets/iraklip/midjourney-v5-prompts-and-links
Available download formats: zip (571044595 bytes)
Dataset updated
May 7, 2023
Authors
Irakli P
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Irakli P

Released under CC0: Public Domain

Contents
