63 datasets found
  1. discord chat

    • kaggle.com
    zip
    Updated Mar 28, 2023
    Cite
    Emperiums (2023). discord chat [Dataset]. https://www.kaggle.com/datasets/emperiums/discord-chat
    Explore at:
    Available download formats: zip (10062 bytes)
    Dataset updated
    Mar 28, 2023
    Authors
    Emperiums
    Description

    Dataset

    This dataset was created by Emperiums

    Contents

  2. Discord-Data

    • kaggle.com
    zip
    Updated Apr 16, 2021
    Cite
    Jess Fan (2021). Discord-Data [Dataset]. https://www.kaggle.com/datasets/jef1056/discord-data/code
    Explore at:
    Available download formats: zip (8155868013 bytes)
    Dataset updated
    Apr 16, 2021
    Authors
    Jess Fan
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small.

    Want your server to be a part of the next release? Want access to the raw data? Contact me at contact@j-fan.ml

    Some statistics

    The raw data for this version contained 51,826,268 messages.
    [v1] 5,103,788 (regex) + 696,161 (toxic) of 51,826,268 messages were removed, a fraction of 0.11 (about 11%).
    [v2] 6,737,000 (regex) + 946,778 (toxic) of 90,841,631 messages were removed, a fraction of 0.08 (about 8%).
    The dataset's final size is 46,026,319 (v1) + 64,345,492 (v2) = 110,371,811 messages across 456,810 (v1) + 750,416 (v2) = 1,207,226 conversations, reduced from 89.6 GB of raw JSON data to just under 2 GB.
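
    The removal figures quoted above can be rechecked with a few lines of arithmetic. This is a minimal sketch: the counts are copied from the description, while the variable and function names are illustrative and not part of the dataset. The ratios come out as fractions of roughly 0.11 and 0.08 of the messages, that is, about 11% and 8%.

```python
# Counts copied from the dataset description; names are illustrative only.
V1_RAW = 51_826_268
V1_REGEX_REMOVED = 5_103_788
V1_TOXIC_REMOVED = 696_161
V1_FINAL = 46_026_319

V2_RAW = 90_841_631
V2_REGEX_REMOVED = 6_737_000
V2_TOXIC_REMOVED = 946_778
V2_FINAL = 64_345_492


def removed_fraction(raw: int, regex_removed: int, toxic_removed: int) -> float:
    """Share of raw messages dropped by the regex and toxicity filters."""
    return (regex_removed + toxic_removed) / raw


v1_fraction = removed_fraction(V1_RAW, V1_REGEX_REMOVED, V1_TOXIC_REMOVED)
v2_fraction = removed_fraction(V2_RAW, V2_REGEX_REMOVED, V2_TOXIC_REMOVED)
total_messages = V1_FINAL + V2_FINAL

print(f"v1 removed: {v1_fraction:.2%}")
print(f"v2 removed: {v2_fraction:.2%}")
print(f"total messages: {total_messages:,}")
```

    For v1, the raw count minus the two removal counts reproduces the stated final size exactly. The same subtraction does not reproduce the v2 final size, so the v2 figure presumably reflects filtering steps that the description does not itemize.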

    Inspiration

    There is a wide variety of NLP datasets that cover a huge number of different interactions between users and can be used for pretraining; Google's C4 covers web text and an extremely diverse range of data for the majority of language tasks. Reddit crawls cover structured, forum-style text. However, despite this abundance of data, there is a lack of clean long-context data specifically for conversational purposes. In a search for potential sources of data, I discovered that Discord has a long-standing history of interesting and diverse conversations, and a relatively open API. With the collaboration of a large number of Discord moderators, server owners, and members of the community, this data was successfully downloaded and cleaned.

    Goal

    To create a diverse, structured dataset of turn-by-turn conversation that can be used to pretrain a model oriented specifically toward conversational purposes.

    Content

    Files containing -detox are cleaned files for which an LSTM network analyzed each message to evaluate whether it is toxic, obscene, threatening, insulting, or identity hate.
    All files were cleaned using https://github.com/JEF1056/clean-discord, mostly with the default settings. The repo takes an automated, heuristic approach to removing unwanted, non-NLP, or toxic comments.
    context.txt contains all data that has been cleaned using basic regex and some text replacement.
    context-pairs.txt contains pairs of data built using only Discord's recent replies feature. Because the feature is so new, its yield is very low. It has also been cleaned using basic regex and some text replacement.

    Acknowledgements

    A massive thanks to https://github.com/codemicro for working on multithreading code for the clean-discord repo!

    Cite this dataset: @misc{discord-data, author = {Jess Fan}, title = {Discord Dataset}, contact = {jeefan@ucsc.edu, contact@j-fan.ml}, year = {2021}, howpublished = {\url{https://www.kaggle.com/jef1056/discord-data}}, note = {V5} }

  3. discord-messages

    • kaggle.com
    zip
    Updated Dec 10, 2021
    Cite
    Deep Sarda (2021). discord-messages [Dataset]. https://www.kaggle.com/datasets/deepsarda/discordmessages
    Explore at:
    Available download formats: zip (2776830801 bytes)
    Dataset updated
    Dec 10, 2021
    Authors
    Deep Sarda
    Description

    Dataset

    This dataset was created by Deep Sarda

    Contents

  4. Grounded Suggestions via Discord Server

    • kaggle.com
    zip
    Updated Apr 27, 2022
    Cite
    Brandon Conrady (2022). Grounded Suggestions via Discord Server [Dataset]. https://www.kaggle.com/datasets/brandonconrady/grounded-suggestions-via-discord-server
    Explore at:
    Available download formats: zip (33097 bytes)
    Dataset updated
    Apr 27, 2022
    Authors
    Brandon Conrady
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Any member of the discord server can submit suggestions for improving the game. Others have the option of upvoting said suggestions. If a suggestion reaches 35 or more upvotes, it is sent to the developers. I compiled this dataset mainly to practice NLP, but there are use cases for applying statistical tests to see what has a better chance of getting sent to devs.

  5. 10,000+ Discord Bot Listings

    • kaggle.com
    zip
    Updated Jul 30, 2020
    Cite
    dotslan (2020). 10,000+ Discord Bot Listings [Dataset]. https://www.kaggle.com/dotslan/discord-bots-on-topgg
    Explore at:
    Available download formats: zip (14229830 bytes)
    Dataset updated
    Jul 30, 2020
    Authors
    dotslan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Context

    Top.gg is one of the most popular websites that lists bots that people can share and add to their Discord servers. It has over 10,000 listings.

    Content

    This dataset has been scraped from all top.gg Top bots pages in JSON format, cleaned and converted to CSV. I have included both formats. It was collected July 29th, 2020. The included features:

    • Bot name
    • short description
    • Number of servers
    • Number of votes
    • Tags
    • Bot's Website URL
    • Invite link
    • Support server
    • Creator
    • Long description
    • Prefix
    • image url

    Acknowledgements

    The data belongs to Top.gg and shall not be used for any commercial purposes.

    Notes

    I may decide to update this in the future, as I've learned about other attributes that could possibly be incorporated.

  6. Lily Discord

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Cite
    Trystynpatrick (2023). Lily Discord [Dataset]. https://www.kaggle.com/datasets/trystynpatrick/lily-discord/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Available download formats: zip (534 bytes)
    Dataset updated
    Jun 11, 2023
    Authors
    Trystynpatrick
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Trystynpatrick

    Released under CC0: Public Domain

    Contents

    Just a silly test for me and my friends

  7. Discord Survey

    • kaggle.com
    zip
    Updated Oct 11, 2021
    Cite
    Yonko (Czeslaw Meyer) (2021). Discord Survey [Dataset]. https://www.kaggle.com/yonkotoshiro/discord-survey
    Explore at:
    Available download formats: zip (13490 bytes)
    Dataset updated
    Oct 11, 2021
    Authors
    Yonko (Czeslaw Meyer)
    Description

    Results of a survey of 403 Discord users. The selection was random and the servers were random; many people declined to take part, but some agreed. Only Russian-speaking people were surveyed. When creating the survey, I notified users that after completion I was going to analyze the data and post the results in the public domain. No personal user data was collected.

    In general, you can see that I like Discord, as well as the psychological focus of some of the questions. I have no experience doing something like this, but I tried to do everything as correctly as possible.

    This version is translated into English. I also cleaned the data and removed or changed anything that wasn't needed.

  8. long discord

    • kaggle.com
    zip
    Updated Sep 23, 2024
    Cite
    Matthew Weinberger (2024). long discord [Dataset]. https://www.kaggle.com/datasets/matthewweinberger/long-discord
    Explore at:
    Available download formats: zip (281434690 bytes)
    Dataset updated
    Sep 23, 2024
    Authors
    Matthew Weinberger
    Description

    Dataset

    This dataset was created by Matthew Weinberger

    Contents

  9. discord-rag-files

    • kaggle.com
    zip
    Updated Apr 20, 2025
    Cite
    Roman Matveev (2025). discord-rag-files [Dataset]. https://www.kaggle.com/datasets/matveevromanjob/discord-rag-files
    Explore at:
    Available download formats: zip (8831 bytes)
    Dataset updated
    Apr 20, 2025
    Authors
    Roman Matveev
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Roman Matveev

    Released under MIT

    Contents

  10. discord-phishing-scam

    • kaggle.com
    zip
    Updated Jul 11, 2025
    Cite
    Shibe123 (2025). discord-phishing-scam [Dataset]. https://www.kaggle.com/datasets/shibe123/discord-phishing-scam/data
    Explore at:
    Available download formats: zip (52875 bytes)
    Dataset updated
    Jul 11, 2025
    Authors
    Shibe123
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    This dataset contains real-world messages from my Discord server, labeled to support the fine-tuning of BERT/distilBERT base for phishing and scam detection.

    Content

    This dataset was created by first collecting 80,000 raw messages from my Discord server using the Discord API. To ensure quality and relevance, short messages of fewer than 3 words, messages from bots, non-text messages, and mass-duplicated messages such as those containing over 70% emojis were removed. This filtering process reduced the dataset to fewer than 20,000 high-quality messages. Afterward, additional preprocessing was applied (as described below), and more targeted scam messages were manually added to improve model exposure to common phishing tactics and keyword variations. This ensures the dataset remains both realistic and effective for fine-tuning models used in live moderation settings.
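
    The filtering rules described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual pipeline: the Message class, the emoji pattern, the character-based reading of the 70% emoji rule, and exact-match duplicate removal are all guesses made for the example.

```python
import re
from dataclasses import dataclass

# Assumed emoji matcher: Discord custom emoji markup plus common Unicode emoji ranges.
EMOJI_RE = re.compile(
    r"<a?:\w+:\d+>"
    r"|[\U0001F300-\U0001FAFF\u2600-\u27BF]"
)


@dataclass
class Message:
    """Hypothetical minimal message record used only for this sketch."""
    content: str
    author_is_bot: bool = False


def emoji_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that belong to an emoji match."""
    stripped = "".join(text.split())
    if not stripped:
        return 0.0
    emoji_chars = sum(len(m.group()) for m in EMOJI_RE.finditer(stripped))
    return emoji_chars / len(stripped)


def keep_message(msg: Message, seen: set) -> bool:
    """Apply the four filters; `seen` tracks texts that were already kept."""
    text = msg.content.strip()
    if msg.author_is_bot:
        return False
    if len(text.split()) < 3:  # also drops empty (non-text) messages
        return False
    if emoji_ratio(text) > 0.7:
        return False
    key = text.casefold()
    if key in seen:  # assumed: duplicates are exact, case-insensitive repeats
        return False
    seen.add(key)
    return True


def filter_messages(messages):
    """Return the messages that survive all filters, in their original order."""
    seen = set()
    return [m for m in messages if keep_message(m, seen)]
```

    Keeping the first occurrence of a repeated message, rather than dropping every copy, is a design choice made for this sketch; the description does not say which the author did.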

    Data Pre-processing

    • Discord/External links → <URL>
    • User mentions → <USER>
    • Custom emojis → <EMOJI>
    • Discord invite links → <DISCORD_INVITE>
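
    A placeholder substitution of this kind might look like the sketch below. The four placeholder tokens come from the description; the regular expressions and the normalize function are assumptions written for illustration, based on Discord's usual markup for mentions (<@id>, <@!id>) and custom emojis (<:name:id>, <a:name:id>). Invite links are replaced before generic URLs, since otherwise every invite would be caught by the broader URL pattern.

```python
import re

# Assumed patterns; the author's actual expressions are not given in the description.
INVITE_RE = re.compile(
    r"(?:https?://)?(?:www\.)?"
    r"(?:discord\.gg|discord(?:app)?\.com/invite)/[A-Za-z0-9-]+",
    re.IGNORECASE,
)
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
USER_RE = re.compile(r"<@!?\d+>")
CUSTOM_EMOJI_RE = re.compile(r"<a?:\w+:\d+>")


def normalize(text: str) -> str:
    """Replace invites, links, mentions, and custom emojis with placeholder tokens."""
    text = INVITE_RE.sub("<DISCORD_INVITE>", text)  # must run before URL_RE
    text = URL_RE.sub("<URL>", text)
    text = USER_RE.sub("<USER>", text)
    text = CUSTOM_EMOJI_RE.sub("<EMOJI>", text)
    return text


sample = "hey <@123456> join https://discord.gg/abc123 or visit https://example.com/login <:pepe:987654321>"
print(normalize(sample))
```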

    Acknowledgements

    Messages were collected via the Discord API from my community server active for eight years, comprising ~11 000 members and over 20 million messages.

    Inspiration

    Traditional Discord moderation bots rely on static keyword rules set by server owners, but scammers easily evade these filters by subtly altering spellings, using homoglyphs, and more. Thus, I built an NLP-powered moderation bot by fine-tuning DistilBERT base uncased on labelled chat data to recognize phishing and scam patterns beyond simple keywords. The bot is deployed and scans every incoming message in real time, automatically flagging or deleting malicious content. You can find out more here → https://github.com/wang-yuancheng/shibemod

  11. BrandonRTalks Discord Scrape

    • kaggle.com
    zip
    Updated Jan 14, 2023
    Cite
    fishiv (2023). BrandonRTalks Discord Scrape [Dataset]. https://www.kaggle.com/datasets/fishiv/brandonrtalkschatbot
    Explore at:
    Available download formats: zip (10861090 bytes)
    Dataset updated
    Jan 14, 2023
    Authors
    fishiv
    Description

    Dataset

    This dataset was created by fishiv

    Contents

  12. quinton discord logs

    • kaggle.com
    zip
    Updated Mar 15, 2022
    Cite
    joshua (2022). quinton discord logs [Dataset]. https://www.kaggle.com/joshuat2/quinton-discord-logs
    Explore at:
    Available download formats: zip (2774543 bytes)
    Dataset updated
    Mar 15, 2022
    Authors
    joshua
    Description

    Dataset

    This dataset was created by joshua

    Contents

  13. MonFire

    • kaggle.com
    zip
    Updated Mar 14, 2022
    Cite
    MonFire (2022). MonFire [Dataset]. https://www.kaggle.com/datasets/monfire/monfire/code
    Explore at:
    Available download formats: zip (36769 bytes)
    Dataset updated
    Mar 14, 2022
    Authors
    MonFire
    Description

    Dataset

    This dataset was created by MonFire

    Released under Data files © Original Authors

    Contents

  14. discord-irfan-ta2

    • kaggle.com
    zip
    Updated Jun 13, 2024
    Cite
    Irpanko122 (2024). discord-irfan-ta2 [Dataset]. https://www.kaggle.com/irpanko122/discord
    Explore at:
    Available download formats: zip (58579 bytes)
    Dataset updated
    Jun 13, 2024
    Authors
    Irpanko122
    Description

    Dataset

    This dataset was created by Irpanko122

    Contents

  15. Python Discord Pixels Analysis

    • kaggle.com
    zip
    Updated May 26, 2021
    Cite
    Ben Soyka (2021). Python Discord Pixels Analysis [Dataset]. https://www.kaggle.com/bsoyka3/python-discord-pixels-analysis
    Explore at:
    Available download formats: zip (61559 bytes)
    Dataset updated
    May 26, 2021
    Authors
    Ben Soyka
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    The Python Discord community is an active group with interactive events from time to time. The Pixels event (inspired by Reddit's r/place project) gives members access to a heavily-limited API to place colored pixels on a black canvas, which are often overwritten by others.

    Content

    This data was gathered using the Python Discord Pixels API and some Python scripting to automate the analysis of the raw data.

    Acknowledgements

    All pixels were placed by members of the Python Discord community.

    The analyses in this dataset are licensed under CC BY-SA 4.0, meaning you can share or adapt them as long as appropriate credit is provided and you redistribute your changes under the same license.

    Inspiration

    What simple but fun statistics can you gather from this data?

  16. Studbot

    • kaggle.com
    zip
    Updated Jul 25, 2024
    Cite
    Landon King (2024). Studbot [Dataset]. https://www.kaggle.com/datasets/landonking/studbot/code
    Explore at:
    Available download formats: zip (1863 bytes)
    Dataset updated
    Jul 25, 2024
    Authors
    Landon King
    Description

    Dataset

    This dataset was created by Landon King

    Contents

  17. AnonymousChatLogInDiscordClass

    • kaggle.com
    zip
    Updated Jun 26, 2025
    Cite
    Xiujie Feng (2025). AnonymousChatLogInDiscordClass [Dataset]. https://www.kaggle.com/datasets/xiujiefeng/anonymouschatlogindiscordclass
    Explore at:
    Available download formats: zip (616832 bytes)
    Dataset updated
    Jun 26, 2025
    Authors
    Xiujie Feng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains anonymized chat records from a real university-level computer science course conducted via a Discord server. It captures authentic student interactions across multiple semesters, providing valuable insight into informal, peer-to-peer educational discussions.

    To protect privacy, all usernames and personal identifiers have been replaced with randomly generated pseudonyms. The dataset includes message content, timestamps, dates, and emoji usage, preserving the conversational and temporal structure of the original data.
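
    One common way to replace usernames with randomly generated pseudonyms while keeping conversations coherent is a consistent lookup table, sketched below. This is an assumption about how such anonymization could work, not a description of how this dataset was produced; the Pseudonymizer class and the "user_" prefix are hypothetical.

```python
import secrets


class Pseudonymizer:
    """Map each real username to one stable, randomly generated alias."""

    def __init__(self, prefix: str = "user_"):
        self._prefix = prefix
        self._mapping = {}   # real username -> alias
        self._used = set()   # aliases already handed out

    def alias(self, username: str) -> str:
        """Return the alias for `username`, creating a new one on first sight."""
        if username not in self._mapping:
            candidate = self._prefix + secrets.token_hex(4)
            while candidate in self._used:  # retry on the rare collision
                candidate = self._prefix + secrets.token_hex(4)
            self._used.add(candidate)
            self._mapping[username] = candidate
        return self._mapping[username]


pseudonymizer = Pseudonymizer()
first = pseudonymizer.alias("alice#1234")
second = pseudonymizer.alias("bob#5678")
again = pseudonymizer.alias("alice#1234")
```

    Because the alias is random rather than derived from the username, it cannot be reversed without the lookup table, and the same speaker still appears under a single name throughout a conversation.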

    This resource is well-suited for a variety of research and machine learning tasks, including:

    Educational dialogue analysis

    Toxic language detection in academic settings

    Role-based interaction modeling

    Temporal pattern recognition in online discourse

    All data was collected from a publicly available source and processed to ensure ethical usage and compliance with data privacy norms.

  18. Midjourney_discord_messgaes

    • kaggle.com
    zip
    Updated Mar 22, 2023
    Cite
    Krzysztof Gonia (2023). Midjourney_discord_messgaes [Dataset]. https://www.kaggle.com/datasets/krzysztofgonia/midjourney-discord-messgaes
    Explore at:
    Available download formats: zip (2493161 bytes)
    Dataset updated
    Mar 22, 2023
    Authors
    Krzysztof Gonia
    Description

    Dataset inspired by Midjourney User Prompts & Generated Images (250k)

    Collected messages are only for upscale requests.

  19. ProtoTech Chat Transcipt

    • kaggle.com
    zip
    Updated Dec 15, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Renja Grotemeyer (2021). ProtoTech Chat Transcipt [Dataset]. https://www.kaggle.com/datasets/renjagrotemeyer/prototech-chat-transcipt
    Explore at:
    Available download formats: zip (13459899 bytes)
    Dataset updated
    Dec 15, 2021
    Authors
    Renja Grotemeyer
    Description

    Dataset

    This dataset was created by Renja Grotemeyer

    Contents

  20. Midjourney V5 Prompts and Links

    • kaggle.com
    zip
    Updated May 7, 2023
    Cite
    Irakli P (2023). Midjourney V5 Prompts and Links [Dataset]. https://www.kaggle.com/datasets/iraklip/midjourney-v5-prompts-and-links
    Explore at:
    Available download formats: zip (571044595 bytes)
    Dataset updated
    May 7, 2023
    Authors
    Irakli P
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Irakli P

    Released under CC0: Public Domain

    Contents
