91 datasets found
  1. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Oct 4, 2021
    Cite
    (2021). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Dataset updated
    Oct 4, 2021
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.
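The subtypes named in the description make this a multi-label problem: one comment can carry several labels at once. A minimal sketch of packing such labels into a multi-hot vector follows. The label names, the 0.5 cut-off, and the sample record are assumptions for illustration and should be checked against the dataset's actual features.

```python
# Hypothetical label names, assumed to mirror the subtypes in the description.
LABELS = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack"]


def to_multi_hot(example, labels=LABELS, threshold=0.5):
    """Return one 0/1 flag per label, in a fixed order; missing labels count as 0."""
    return [1 if float(example.get(name, 0.0)) >= threshold else 0 for name in labels]


# Made-up record; a real example would come from the loaded dataset.
sample = {"text": "example comment", "toxicity": 1.0, "insult": 1.0, "threat": 0.0}
vector = to_multi_hot(sample)
print(vector)
```

Keeping the label order fixed in one list means the same vector position always refers to the same subtype, which is what a multi-label classifier head typically expects.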

  2. jigsaw-toxic-comments

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    anitamaxvim (2025). jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments
    Dataset updated
    Mar 26, 2025
    Authors
    anitamaxvim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Jigsaw Toxic Comments

      Dataset Description

    The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.

    Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
    Shared by: Kaggle
    Language(s) (NLP): English
    License: CC0 1.0…

    See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.

  3. toxic-comments

    • huggingface.co
    Updated Oct 13, 2024
    Cite
    AI Robotics Ethics Society (PUCRS) (2024). toxic-comments [Dataset]. https://huggingface.co/datasets/AiresPucrs/toxic-comments
    Dataset updated
    Oct 13, 2024
    Dataset authored and provided by
    AI Robotics Ethics Society (PUCRS)
    License

    https://choosealicense.com/licenses/other/

    Description

    Toxic-comments (Teeny-Tiny Castle)

    This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.

      How to Use
    

    from datasets import load_dataset

    # Repository id taken from the dataset URL cited above.
    dataset = load_dataset("AiresPucrs/toxic-comments", split='train')

  4. jigsaw-toxic-comment-classification-challenge

    • kaggle.com
    Updated Nov 11, 2021
    Cite
    dataista0 (Julián Peller) (2021). jigsaw-toxic-comment-classification-challenge [Dataset]. https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge/code
    Dataset updated
    Nov 11, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    dataista0 (Julián Peller)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data from the Toxic Comment Classification Challenge, without modification.

    Intended for use in the Jigsaw Rate Severity of Toxic Comments competition.

    Example usage: ☣️ Jigsaw - Super Simple Naive Bayes [LB=0.768]

    Please, DO upvote if you use the dataset!

  5. toxic-comments

    • huggingface.co
    Cite
    Tillmann Schwörer, toxic-comments [Dataset]. https://huggingface.co/datasets/tillschwoerer/toxic-comments
    Authors
    Tillmann Schwörer
    Description

    tillschwoerer/toxic-comments dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Cleaned Toxic Comments

    • kaggle.com
    zip
    Updated Mar 12, 2018
    Cite
    Zafar (2018). Cleaned Toxic Comments [Dataset]. https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments
    Available download formats: zip (45799147 bytes)
    Dataset updated
    Mar 12, 2018
    Authors
    Zafar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Preprocessed Toxic Comments Classification Dataset

    The main obstacle I faced in the Toxic Comments Classification Challenge was preprocessing. Getting the preprocessing right can readily improve leaderboard (LB) performance.

    This is the preprocessed version of the Toxic Comments Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing

  7. Civil Comments Dataset

    • library.toponeai.link
    • paperswithcode.com
    Updated Nov 15, 2022
    Cite
    Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2022). Civil Comments Dataset [Dataset]. https://library.toponeai.link/dataset/civil-comments
    Dataset updated
    Nov 15, 2022
    Authors
    Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman
    Description

    At the end of 2017, the Civil Comments platform shut down and chose to make their ~2m public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.

    In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute (and all others) are fractional values which represent the fraction of human raters who believed the attribute applied to the given comment.
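The fractional labels described above can be illustrated with a short sketch. The vote list, the function names, and the 0.5 cut-off used to binarize a label are assumptions for illustration, not part of the dataset's documentation.

```python
def rater_fraction(votes):
    """Fraction of raters who said the attribute applies (votes are booleans)."""
    if not votes:
        raise ValueError("at least one rater vote is required")
    return sum(1 for v in votes if v) / len(votes)


def is_positive(fraction, threshold=0.5):
    """Binarize a fractional label; the 0.5 cut-off is an assumed convention."""
    return fraction >= threshold


votes = [True, False, True, True]  # made-up votes from four raters
target = rater_fraction(votes)     # 3 of 4 raters agreed
print(target, is_positive(target))
```

Keeping the fraction rather than a hard label preserves rater disagreement, so a model can be trained on the soft target and thresholded only at evaluation time.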

    The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. Subtype attributes are:

    severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit

    Additionally, a subset of comments have been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation. These identities are shown in bold.

    male, female, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian, bisexual, other_sexual_orientation, christian, jewish, muslim, hindu, buddhist, atheist, other_religion, black, white, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness, other_disability
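As a rough sketch of the 500-example rule mentioned above, the following counts how many rows mention each identity and keeps those above a cut-off. The 0.5 mention threshold, the function name, and the toy rows are assumptions; the competition's actual evaluation code may differ.

```python
from collections import Counter


def frequent_identities(rows, identity_columns, min_examples=500, threshold=0.5):
    """Return identity columns mentioned in more than `min_examples` rows."""
    counts = Counter()
    for row in rows:
        for col in identity_columns:
            value = row.get(col)
            # Unlabelled rows carry no value for identity columns and are skipped.
            if value is not None and value >= threshold:
                counts[col] += 1
    return sorted(col for col in identity_columns if counts[col] > min_examples)


# Toy data with a lowered cut-off so the example stays small.
rows = [{"male": 1.0, "female": 0.0}] * 3 + [{"male": 0.0, "female": 0.6}]
selected = frequent_identities(rows, ["male", "female"], min_examples=2)
print(selected)
```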

  8. Toxicity Dataset

    • figshare.com
    bin
    Updated Oct 16, 2024
    Cite
    Vincent Maladiere (2024). Toxicity Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.27240072.v2
    Available download formats: bin
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    figshare
    Authors
    Vincent Maladiere
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Toxicity Dataset, by Surge AI

    This dataset contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

    Columns

    text: the text of the comment
    is_toxic: whether or not the comment is toxic

  9. jigsaw-toxic-comment-classification-challenge

    • huggingface.co
    Cite
    Thomas capelle, jigsaw-toxic-comment-classification-challenge [Dataset]. https://huggingface.co/datasets/tcapelle/jigsaw-toxic-comment-classification-challenge
    Authors
    Thomas capelle
    Description

    tcapelle/jigsaw-toxic-comment-classification-challenge dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. toxic comments

    • kaggle.com
    Updated Feb 12, 2024
    Cite
    chinmay Das (2024). toxic comments [Dataset]. https://www.kaggle.com/datasets/chinmaydas/toxic-comments
    Dataset updated
    Feb 12, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    chinmay Das
    Description

    Dataset

    This dataset was created by chinmay Das


  11. Data from: Detoxify

    • kaggle.com
    Updated Feb 8, 2022
    Cite
    Shohei Maruyama (2022). Detoxify [Dataset]. https://www.kaggle.com/datasets/maruyama/detoxify
    Dataset updated
    Feb 8, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Shohei Maruyama
    Description

    About this dataset

    Trained models to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. https://github.com/unitaryai/detoxify

    How to use

    import transformers

    # Each variant loads a tokenizer and keeps only the encoder
    # (the classification head is discarded). Pick one variant; inside a
    # model class, assign to self.tokenizer / self.encoder instead.

    # original
    model_path = "../input/detoxify/original"
    tokenizer = transformers.BertTokenizer.from_pretrained(model_path)
    encoder = transformers.BertForTokenClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.BertConfig.from_pretrained(f"{model_path}/config.json"),
    ).bert

    # unbiased
    model_path = "../input/detoxify/unbiased"
    tokenizer = transformers.RobertaTokenizer.from_pretrained(model_path)
    encoder = transformers.RobertaForSequenceClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.RobertaConfig.from_pretrained(f"{model_path}/config.json"),
    ).roberta

    # multilingual
    model_path = "../input/detoxify/multilingual"
    tokenizer = transformers.XLMRobertaTokenizer.from_pretrained(model_path)
    encoder = transformers.XLMRobertaForSequenceClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.XLMRobertaConfig.from_pretrained(f"{model_path}/config.json"),
    ).roberta
    
  12. Youtube toxic comments

    • opendatabay.com
    .csv
    Updated Jun 9, 2025
    Cite
    Datasimple (2025). Youtube toxic comments [Dataset]. https://www.opendatabay.com/data/dataset/0a34ae04-d822-4ac2-aaa1-1f445579500b
    Available download formats: .csv
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    About Dataset

    This is a hand-labelled toxicity data set containing 1000 comments crawled from YouTube videos about the Ferguson unrest in 2014. In addition to toxicity, this data set contains labels for multiple subclassifications of toxicity which form a hierarchical structure. Each comment can have multiple of these labels assigned. The structure can be seen in the following enumeration:

    IsToxic
      - IsAbusive
          IsThreat
          IsProvocative
          IsObscene
      - IsHatespeech
          IsRacist
          IsNationalist
          IsSexist
          IsHomophobic
          IsReligiousHate
      - IsRadicalism
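The hierarchy in the enumeration above can be captured as a small parent-to-children mapping. This is a sketch only: the nesting is an assumption read from the dash-separated enumeration, and the helper name is hypothetical.

```python
# Assumed nesting, read from the dash-separated enumeration in the description.
HIERARCHY = {
    "IsToxic": ["IsAbusive", "IsHatespeech", "IsRadicalism"],
    "IsAbusive": ["IsThreat", "IsProvocative", "IsObscene"],
    "IsHatespeech": ["IsRacist", "IsNationalist", "IsSexist", "IsHomophobic", "IsReligiousHate"],
}

# Invert the mapping to child -> parent for upward lookups.
PARENT = {child: parent for parent, children in HIERARCHY.items() for child in children}


def with_ancestors(labels):
    """Return the given labels plus every ancestor implied by the hierarchy."""
    result = set()
    for label in labels:
        while label is not None:
            result.add(label)
            label = PARENT.get(label)
    return result


expanded = sorted(with_ancestors({"IsRacist"}))
print(expanded)
```

Expanding labels this way gives a quick consistency check: if a row is marked with a leaf label but not its ancestors, the annotation and the assumed hierarchy disagree.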

    Original Data Source: Youtube toxic comments

  13. processed-jigsaw-toxic-comments

    • huggingface.co
    Updated Oct 10, 2023
    Cite
    K Koushik Reddy (2023). processed-jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments
    Dataset updated
    Oct 10, 2023
    Authors
    K Koushik Reddy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Processed Jigsaw Toxic Comments Dataset

    This is a preprocessed and tokenized version of the original Jigsaw Toxic Comment Classification Challenge dataset, prepared for multi-label toxicity classification using transformer-based models like BERT.

    ⚠️ Important Note: I am not the original creator of the dataset. This dataset is a cleaned and restructured version made for quick use in PyTorch deep learning models.

      📦 Dataset Features
    

    Each example contains:

    text: The… See the full description on the dataset page: https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments.

  14. Toxic Comment Classification labelled languages

    • kaggle.com
    Updated Dec 22, 2017
    Cite
    AllHailSammy (2017). Toxic Comment Classification labelled languages [Dataset]. https://www.kaggle.com/datasets/wangshangsam/toxic-comment-classification-labelled-languages/data
    Dataset updated
    Dec 22, 2017
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    AllHailSammy
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by AllHailSammy

    Released under GPL 2


  15. UIT-ViCTSD (UIT Vietnamese Constructive and Toxic Speech Detection)

    • opendatalab.com
    zip
    Updated Mar 24, 2023
    Cite
    University of Information Technology (2023). UIT-ViCTSD (UIT Vietnamese Constructive and Toxic Speech Detection) [Dataset]. https://opendatalab.com/OpenDataLab/UIT-ViCTSD
    Available download formats: zip
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Vietnam National University, Ho Chi Minh City: https://vnuhcm.edu.vn/
    University of Information Technology
    Description

    UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.

  16. jigsaw-toxic-comment-train-processed-seqlen128

    • huggingface.co
    Cite
    akcit ijf, jigsaw-toxic-comment-train-processed-seqlen128 [Dataset]. https://huggingface.co/datasets/akcit-ijf/jigsaw-toxic-comment-train-processed-seqlen128
    Dataset authored and provided by
    akcit ijf
    Description

    akcit-ijf/jigsaw-toxic-comment-train-processed-seqlen128 dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. toxic comment vietnamese

    • kaggle.com
    Updated Jul 1, 2024
    Cite
    trandong2932002 (2024). toxic comment vietnamese [Dataset]. https://www.kaggle.com/trandong2932002/toxic-comment-vietnamese/discussion
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    trandong2932002
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by trandong2932002

    Released under MIT


  18. HOT Speech: Comments from Political News Posts and Videos that were...

    • socialmediaarchive.org
    csv, pdf
    Updated Apr 20, 2023
    Cite
    (2023). HOT Speech: Comments from Political News Posts and Videos that were Annotated for Hateful, Offensive, and Toxic Content [Dataset]. http://doi.org/10.3886/45fc-9c8f
    Available download formats: csv (1591949), csv (21665), pdf (729152)
    Dataset updated
    Apr 20, 2023
    Description

    This text dataset includes 3,481 social media user comments posted in response to political news posts and videos on Twitter, YouTube, and Reddit in August 2021. The dataset also includes MTurk workers’ annotations of these comments as hateful, offensive, and/or toxic, as well as codes assigned by researchers describing various rhetorical dimensions of these comments.

  19. GATE: Toxic Language Classifier

    • live.european-language-grid.eu
    Updated Aug 16, 2021
    Cite
    (2021). GATE: Toxic Language Classifier [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/8022
    Dataset updated
    Aug 16, 2021
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This classifier is a fine-tuned Roberta-base model using the simpletransformers toolkit for classifying toxic language. We use the Kaggle Toxic Comments Challenge dataset as training data. This dataset contains Wikipedia comments classified as toxic or non-toxic.

  20. toxic comments

    • kaggle.com
    zip
    Updated Sep 16, 2021
    Cite
    Rowan Curry (2021). toxic comments [Dataset]. https://www.kaggle.com/rowancurry/toxic-comments
    Available download formats: zip (52968821 bytes)
    Dataset updated
    Sep 16, 2021
    Authors
    Rowan Curry
    Description

    Dataset

    This dataset was created by Rowan Curry

