43 datasets found
  1. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split; the data is downloaded on first use.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    # Print the first four examples.
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  2. jigsaw-multilingual-toxic-comment-classification

    • kaggle.com
    zip
    Updated Nov 14, 2021
    + more versions
    Cite
    Julián Peller (dataista0) (2021). jigsaw-multilingual-toxic-comment-classification [Dataset]. https://www.kaggle.com/datasets/julian3833/jigsaw-multilingual-toxic-comment-classification
    Explore at:
    Available download formats: zip (1159031079 bytes)
    Dataset updated
    Nov 14, 2021
    Authors
    Julián Peller (dataista0)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data from Jigsaw Multilingual Toxic Comment Classification

    For use in Jigsaw Rate Severity of Toxic Comments.

    Please, DO upvote if you use the dataset!

  3. Toxic Comment Classification Challenge dataset

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Waseem et al.; Kwok and Wang (2024). Toxic Comment Classification Challenge dataset [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdG94aWMtY29tbWVudC1jbGFzc2lmaWNhdGlvbi1jaGFsbGVuZ2UtZGF0YXNldA==
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Waseem et al.; Kwok and Wang
    Description

    The Toxic Comment Classification Challenge dataset contains comments from Wikipedia organized in six classes: toxic, severe toxic, obscene, threat, insult, and identity hate.
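
Because one comment can carry several of the six classes at once, each class is usually stored as its own 0/1 column. The sketch below is only an illustration: the column names follow the layout of the original Kaggle train.csv as an assumption, and the two sample rows are hypothetical placeholders, not real data.

```python
import csv
import io

# Assumed column names, following the Kaggle train.csv layout.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny hypothetical sample; these are not real rows from the dataset.
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,"Thanks for the edit.",0,0,0,0,0,0
b2,"placeholder rude comment",1,0,1,0,1,0
"""


def active_labels(row):
    """Return the names of the labels set to 1 for one CSV row."""
    return [name for name in LABELS if row[name] == "1"]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row["id"], active_labels(row))
```

A row with no active label is a non-toxic comment, and a row may have any combination of the six.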

  4. Depression Severity Toxic Comments Dataset

    • kaggle.com
    zip
    Updated Jul 14, 2025
    Cite
    Muhammad Mugees Asif (2025). Depression Severity Toxic Comments Dataset [Dataset]. https://www.kaggle.com/datasets/thestartupboy/depression-severity-toxic-comments-dataset
    Explore at:
    Available download formats: zip (54512484 bytes)
    Dataset updated
    Jul 14, 2025
    Authors
    Muhammad Mugees Asif
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a customized and re-labeled version of the original Jigsaw Toxic Comment Classification Challenge dataset.

    Instead of toxic behavior categories, the comments are now annotated with depression severity levels, aiming to support mental health research and AI-based early detection of psychological distress.

    🗂️ Label Categories: Each comment has been carefully annotated into one of the following classes:

    - psychotic_depression
    - severe_depression
    - moderate_depression
    - mild_depression
    - toxic_depression
    - major_depression

    These labels help transform the original problem into a multi-class depression severity classification task.

    👨‍💻 Project Contributors:

    - Muhammad Mugees Asif: Lead Annotator & AI Researcher
    - Dr. Arfan Ali Nagra: Computational Intelligence Expert
    - Sana Asif: Mental Health Research Support & Dataset Coordination

    This dataset was created with the intention to help data scientists, researchers, and students work on AI solutions for mental health support.

    ⚠️ Acknowledgement: The original dataset was sourced from the Jigsaw Toxic Comment Classification Challenge hosted on Kaggle. Full credit to the creators of the original dataset. This re-labeled version is shared for educational and research purposes only.

  5. jigsaw-toxic-comment-classification-challenge

    • huggingface.co
    Cite
    Thomas capelle, jigsaw-toxic-comment-classification-challenge [Dataset]. https://huggingface.co/datasets/tcapelle/jigsaw-toxic-comment-classification-challenge
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Thomas capelle
    Description

    tcapelle/jigsaw-toxic-comment-classification-challenge dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. jigsaw-toxic-comment-classification-challenge

    • kaggle.com
    zip
    Updated Nov 11, 2021
    Cite
    Julián Peller (dataista0) (2021). jigsaw-toxic-comment-classification-challenge [Dataset]. https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge
    Explore at:
    Available download formats: zip (55956177 bytes)
    Dataset updated
    Nov 11, 2021
    Authors
    Julián Peller (dataista0)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data from Toxic Comment Classification Challenge without modification

    For use in Jigsaw Rate Severity of Toxic Comments.

    Example usage: ☣️ Jigsaw - Super Simple Naive Bayes [LB=0.768]

    Please, DO upvote if you use the dataset!

  7. jigsaw-toxic-comments

    • huggingface.co
    + more versions
    Cite
    anitamaxvim, jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments
    Explore at:
    Authors
    anitamaxvim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Jigsaw Toxic Comments

    Dataset

    Dataset Description

    The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.

    Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
    Shared by: Kaggle
    Language(s) (NLP): English
    License: CC0 1.0…

    See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.

  8. Toxic Comment Classification Challenge

    • kaggle.com
    zip
    Updated Jan 14, 2022
    + more versions
    Cite
    Yochino (2022). Toxic Comment Classification Challenge [Dataset]. https://www.kaggle.com/yochino/toxic-comment-classification-challenge
    Explore at:
    Available download formats: zip (55956177 bytes)
    Dataset updated
    Jan 14, 2022
    Authors
    Yochino
    Description

    Dataset

    This dataset was created by Yochino

    Contents

  9. jigsaw_toxicity_pred

    • huggingface.co
    Updated Dec 14, 2020
    Cite
    Google (2020). jigsaw_toxicity_pred [Dataset]. https://huggingface.co/datasets/google/jigsaw_toxicity_pred
    Explore at:
    Dataset updated
    Dec 14, 2020
    Dataset authored and provided by
    Google (http://google.com/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    This dataset consists of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior.

  10. toxic comment - merge train and test with label

    • kaggle.com
    zip
    Updated Apr 30, 2019
    Cite
    Teng Lei (2019). toxic comment - merge train and test with label [Dataset]. https://www.kaggle.com/datasets/nichaoku/toxic-comment-merge-train-and-test-with-label/discussion
    Explore at:
    Available download formats: zip (39757567 bytes)
    Dataset updated
    Apr 30, 2019
    Authors
    Teng Lei
    Description

    Data from the Toxic Comment Classification Challenge. The train data is merged with the labeled test data; unlabeled test data is removed.
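
The merge described above can be sketched roughly as follows. This is a minimal illustration, not the uploader's actual code: it assumes the test comments and their labels live in two files joined by an `id` column, and that unscored rows are marked with `-1` in the labels file, as in the original competition release. The tiny CSV strings and the `merge_labeled` helper are hypothetical.

```python
import csv
import io

# Hypothetical miniature stand-ins for test.csv and test_labels.csv.
TEST_CSV = """id,comment_text
t1,"first placeholder comment"
t2,"second placeholder comment"
t3,"third placeholder comment"
"""
TEST_LABELS_CSV = """id,toxic
t1,0
t2,-1
t3,1
"""


def merge_labeled(test_text, labels_text, label="toxic", unscored="-1"):
    """Join comments to labels by id, skipping rows marked as unscored."""
    labels = {r["id"]: r[label] for r in csv.DictReader(io.StringIO(labels_text))}
    merged = []
    for row in csv.DictReader(io.StringIO(test_text)):
        value = labels.get(row["id"])
        if value is None or value == unscored:
            continue  # no usable label for this comment
        merged.append(
            {"id": row["id"], "comment_text": row["comment_text"], label: int(value)}
        )
    return merged


merged = merge_labeled(TEST_CSV, TEST_LABELS_CSV)
print([r["id"] for r in merged])
```

The surviving rows can then be appended to the train rows, since both share the same columns.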

  11. processed-jigsaw-toxic-comments

    • huggingface.co
    Updated Oct 10, 2023
    Cite
    K Koushik Reddy (2023). processed-jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments
    Explore at:
    Dataset updated
    Oct 10, 2023
    Authors
    K Koushik Reddy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Processed Jigsaw Toxic Comments Dataset

    This is a preprocessed and tokenized version of the original Jigsaw Toxic Comment Classification Challenge dataset, prepared for multi-label toxicity classification using transformer-based models like BERT. ⚠️ Important Note: I am not the original creator of the dataset. This dataset is a cleaned and restructured version made for quick use in PyTorch deep learning models.

      📦 Dataset Features
    

    Each example contains:

    text: The… See the full description on the dataset page: https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments.

  12. Tox

    • huggingface.co
    Updated Aug 18, 2023
    Cite
    Victor Luz (2023). Tox [Dataset]. https://huggingface.co/datasets/vluz/Tox
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 18, 2023
    Authors
    Victor Luz
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    A cleaned-up version of the train dataset from the Kaggle Toxic Comment Classification Challenge:

    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip

    The alt_format directory contains an alternate format intended for a tutorial.

    What was done:

    - Removed extra spaces and new lines
    - Removed non-printing characters
    - Removed punctuation except apostrophe…

    See the full description on the dataset page: https://huggingface.co/datasets/vluz/Tox.
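
A rough sketch of the three cleaning steps named above, using only the Python standard library. It is not the dataset author's script: the listing is truncated, so any further steps are unknown and are not reproduced here, and the `clean_comment` helper and its order of operations are assumptions.

```python
import re
import string

# Every ASCII punctuation character except the apostrophe.
PUNCT_TO_DROP = string.punctuation.replace("'", "")
DROP_TABLE = str.maketrans("", "", PUNCT_TO_DROP)


def clean_comment(text):
    """Apply the three listed cleaning steps to one comment."""
    # Swap non-printing characters (including newlines and tabs) for spaces.
    text = "".join(ch if ch.isprintable() else " " for ch in text)
    # Delete punctuation, keeping apostrophes.
    text = text.translate(DROP_TABLE)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_comment("Don't\n\n  do   that!!\x00 "))
```

Non-ASCII punctuation is left untouched in this sketch, because `string.punctuation` only covers ASCII.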

  13. ukr-toxicity-dataset-translated-jigsaw

    • huggingface.co
    Updated Feb 17, 2024
    Cite
    Ukrainian Texts Classification (2024). ukr-toxicity-dataset-translated-jigsaw [Dataset]. https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 17, 2024
    Dataset authored and provided by
    Ukrainian Texts Classification
    License

    https://choosealicense.com/licenses/openrail++/

    Description

    Ukrainian Toxicity Dataset (translated)

    In addition to the Twitter-filtered data, we provide the English Jigsaw Toxicity Classification Dataset translated into Ukrainian.

    Dataset formation:

    - English data source: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/
    - Working with data to get only two labels: a toxic and a non-toxic sentence.
    - Translation into Ukrainian using the model: https://huggingface.co/Helsinki-NLP/opus-mt-en-uk

    Labels: 0 -… See the full description on the dataset page: https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw.

  14. Cleaned Toxic Comments

    • kaggle.com
    zip
    Updated Mar 12, 2018
    Cite
    Zafar (2018). Cleaned Toxic Comments [Dataset]. https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments
    Explore at:
    Available download formats: zip (45799147 bytes)
    Dataset updated
    Mar 12, 2018
    Authors
    Zafar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Preprocessed Toxic Comments Classification Dataset

    The obstacle I faced in Toxic Comments Classification Challenge was the preprocessing part. One can easily improve their LB performance if the preprocessing is done right.

    This is the preprocessed version of Toxic Comments Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing

  15. jigsaw-toxic-comment-classification-challenges

    • kaggle.com
    zip
    Updated May 12, 2022
    + more versions
    Cite
    Rahul Jain (2022). jigsaw-toxic-comment-classification-challenges [Dataset]. https://www.kaggle.com/datasets/rahul247jain/jigsawtoxiccommentclassificationchallenges
    Explore at:
    Available download formats: zip (55956177 bytes)
    Dataset updated
    May 12, 2022
    Authors
    Rahul Jain
    Description

    Dataset

    This dataset was created by Rahul Jain

    Contents

  16. toxicity-multi-label-classifier

    • huggingface.co
    Cite
    raj, toxicity-multi-label-classifier [Dataset]. https://huggingface.co/datasets/acloudfan/toxicity-multi-label-classifier
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    raj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Part of a course titled "Generative AI application design & development": https://genai.acloudfan.com/

    Created from a dataset available on Kaggle: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data

  17. Wiki Toxic

    • opendatalab.com
    • huggingface.co
    Updated Jan 17, 2024
    Cite
    (2024). Wiki Toxic [Dataset]. https://opendatalab.com/OpenDataLab/Wiki%20Toxic
    Explore at:
    Dataset updated
    Jan 17, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Wiki Toxic dataset is a modified, cleaned version of the dataset used in the Kaggle Toxic Comment Classification challenge from 2017/18. The dataset contains comments collected from Wikipedia forums and classifies them into two categories, toxic and non-toxic.
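
One plausible way to reduce the challenge's multi-label annotations to two categories is sketched below. The description does not state the exact rule, so the choice here (a comment counts as toxic if any subtype label is set) is an assumption, as are the column names and the `to_binary` helper; the records are hypothetical.

```python
# Assumed label columns, following the original Kaggle challenge files.
SUBTYPE_LABELS = [
    "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate",
]


def to_binary(record):
    """Return 1 if any subtype label is set, otherwise 0 (an assumed rule)."""
    return int(any(record[name] == 1 for name in SUBTYPE_LABELS))


# Hypothetical records for illustration only.
clean = dict.fromkeys(SUBTYPE_LABELS, 0)
flagged = {**clean, "insult": 1}

print(to_binary(clean), to_binary(flagged))
```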

  18. Toxic Comment Classification Challenge for colab

    • kaggle.com
    zip
    Updated Nov 10, 2025
    Cite
    Shilovsky Dmitry (2025). Toxic Comment Classification Challenge for colab [Dataset]. https://www.kaggle.com/datasets/bobbyshmurda31/toxic-comment-classification-challenge-for-colab
    Explore at:
    Available download formats: zip (55956177 bytes)
    Dataset updated
    Nov 10, 2025
    Authors
    Shilovsky Dmitry
    Description

    Dataset

    This dataset was created by Shilovsky Dmitry

    Contents

  19. Navigating News Narratives: A Media Bias Analysis Dataset

    • data-staging.niaid.nih.gov
    Updated Nov 8, 2023
    Cite
    Raza, Shaina (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10037860
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Vector Institute
    Authors
    Raza, Shaina
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

    Data format:

    - ID: Numeric unique identifier.
    - Text: Main content.
    - Dimension: Categorical descriptor of the text.
    - Biased_Words: List of words considered biased.
    - Aspect: Specific topic within the text.
    - Label: Bias True/False value.
    - Aggregate Label: Calculated through multiple weighted formulae.

    Annotation scheme: The annotation scheme is based on active learning, which is Manual Labeling --> Semi-Supervised Learning --> Human Verification (iterative process).

    - Bias Label: Indicates the presence/absence of bias (e.g., no bias, mild, strong).
    - Words/Phrases Level Biases: Identifies specific biased words/phrases.
    - Subjective Bias (Aspect): Captures biases related to content aspects.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith/general) using RSS to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral, slightly biased, or highly biased) and to pick biased words from the news. We also use publicly available data from the following sources, with attribution:

    - MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    - Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    - Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.
    - Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
    - Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    - Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    - Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0

  20. bert-toxic-comment-classification-challenge

    • kaggle.com
    Updated Feb 1, 2022
    Cite
    abxmaster (2022). bert-toxic-comment-classification-challenge [Dataset]. https://www.kaggle.com/abxmaster/berttoxiccommentclassificationchallenge
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    abxmaster
    Description

    Dataset

    This dataset was created by abxmaster

    Contents
