63 datasets found
  1. jigsaw-toxic-comments

    • huggingface.co
    Cite
    anitamaxvim, jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments
    Authors
    anitamaxvim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Jigsaw Toxic Comments

    Dataset Description

    The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.

    Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
    Shared by: Kaggle
    Language(s) (NLP): English
    License: CC0 1.0…

    See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.

  2. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Oct 4, 2021
    Cite
    (2021). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Dataset updated
    Oct 4, 2021
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. Nanoparticle Toxicity Dataset

    • kaggle.com
    Updated Jul 22, 2024
    Cite
    UCI Machine Learning (2024). Nanoparticle Toxicity Dataset [Dataset]. https://www.kaggle.com/datasets/ucimachinelearning/nanoparticle-toxicity-dataset
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This toxicity dataset consists of several columns capturing attributes of nanoparticles (NPs) and their toxicological effects. Here is an overview of the columns in the dataset:

    NPs: Type of nanoparticles (e.g., Al2O3).
    coresize: Core size of the nanoparticles.
    hydrosize: Hydrodynamic size of the nanoparticles.
    surfcharge: Surface charge of the nanoparticles.
    surfarea: Surface area of the nanoparticles.
    Ec: Electric charge.
    Expotime: Exposure time.
    Dosage: Amount of material used.
    e: Energy-related feature.
    NOxygen: Number of oxygen atoms.
    class: Class label indicating whether the nanoparticles are toxic or non-toxic.

  4. kaggle-toxic-annotated

    • huggingface.co
    Cite
    Thomas capelle, kaggle-toxic-annotated [Dataset]. https://huggingface.co/datasets/tcapelle/kaggle-toxic-annotated
    Authors
    Thomas capelle
    Description

    Kaggle toxic dataset annotated with gpt-4o-mini, using the same prompt that was used to annotate Toxic-Commons Celadon.

  5. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers. The toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
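
    Because each tag is a fraction of annotators rather than a hard label, a classifier usually needs it converted to 0 or 1 first. The following is a minimal sketch of one way to do that. The 0.5 threshold, the function name binarize, and the sample value are assumptions made for illustration and do not come from the dataset documentation.

```python
# Illustrative sketch only: the threshold and sample value are assumptions.

def binarize(fraction: float, threshold: float = 0.5) -> int:
    """Map an annotator fraction in [0, 1] to a 0/1 label."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction must be in [0, 1], got {fraction}")
    return int(fraction >= threshold)


# Hypothetical example: 3 of 10 annotators marked the comment as toxic.
toxicity = 3 / 10
label = binarize(toxicity)
```

    A different threshold may suit tasks where false negatives are more costly than false positives.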

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
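
    The leakage risk noted above can be checked before training. The sketch below assumes each example is a plain dict with an "id" key and an optional "parent_id" key. The "id" key, the function name, and the sample records are hypothetical and used only for illustration.

```python
from typing import Dict, List, Set


def leaked_parents(train: List[Dict], test: List[Dict]) -> Set[str]:
    """Return ids of test examples whose parent comment is in the train split."""
    train_ids = {ex["id"] for ex in train}
    return {ex["id"] for ex in test if ex.get("parent_id") in train_ids}


# Hypothetical records: test comment "c3" replies to train comment "c1".
train_split = [{"id": "c1"}, {"id": "c2", "parent_id": "c1"}]
test_split = [{"id": "c3", "parent_id": "c1"}, {"id": "c4"}]
leaked = leaked_parents(train_split, test_split)
```

    Examples flagged this way could be evaluated without the parent_text feature, or reported separately.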

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  6. kaggle-toxic-annotated-filtered

    • huggingface.co
    Cite
    Thomas capelle, kaggle-toxic-annotated-filtered [Dataset]. https://huggingface.co/datasets/tcapelle/kaggle-toxic-annotated-filtered
    Authors
    Thomas capelle
    Description

    tcapelle/kaggle-toxic-annotated-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Cleaned Toxic Comments

    • kaggle.com
    zip
    Updated Mar 12, 2018
    Cite
    Zafar (2018). Cleaned Toxic Comments [Dataset]. https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments
    Available download formats: zip (45799147 bytes)
    Dataset updated
    Mar 12, 2018
    Authors
    Zafar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Preprocessed Toxic Comments Classification Dataset

    The obstacle I faced in the Toxic Comments Classification Challenge was the preprocessing part. One can easily improve their LB performance if the preprocessing is done right.

    This is the preprocessed version of the Toxic Comments Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing

  8. Tox

    • huggingface.co
    Updated Aug 18, 2023
    Cite
    Victor Luz (2023). Tox [Dataset]. https://huggingface.co/datasets/vluz/Tox
    Dataset updated
    Aug 18, 2023
    Authors
    Victor Luz
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    A cleaned-up version of the train dataset from Kaggle's Toxic Comment Classification Challenge.

    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip

    The alt_format directory contains an alternate format intended for a tutorial.

    What was done:

    - Removed extra spaces and new lines
    - Removed non-printing characters
    - Removed punctuation except apostrophe…

    See the full description on the dataset page: https://huggingface.co/datasets/vluz/Tox.
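
    The three cleaning steps visible in the truncated list above can be approximated as follows. This is a sketch and not the dataset author's code. The function name clean_comment, the order of the steps, and the use of string.punctuation as the definition of "punctuation" are all assumptions.

```python
import re
import string

# Assumption: "punctuation" means string.punctuation minus the apostrophe.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def clean_comment(text: str) -> str:
    """Approximate the listed cleaning steps on a single comment."""
    # Drop non-printing characters, keeping whitespace for the last step.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Remove punctuation except the apostrophe.
    text = text.translate(_PUNCT_TABLE)
    # Collapse runs of spaces and new lines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_comment("Don't  do\nthat!!\x00")
```

    Note that deleting punctuation outright joins tokens such as "a,b" into "ab". Replacing punctuation with a space instead would be an equally plausible reading of the description.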

  9. Toxic data

    • kaggle.com
    Updated Mar 4, 2023
    Cite
    x2022gmv (2023). Toxic data [Dataset]. https://www.kaggle.com/datasets/x2022gmv/toxic-data
    Dataset updated
    Mar 4, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    x2022gmv
    Description

    Dataset

    This dataset was created by x2022gmv


  10. external data toxic comments

    • kaggle.com
    Updated Feb 1, 2024
    Cite
    Roshan Velpula (2024). external data toxic comments [Dataset]. https://www.kaggle.com/datasets/roshanvelpula/external-data-toxic-comments/code
    Dataset updated
    Feb 1, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Roshan Velpula
    Description

    Dataset

    This dataset was created by Roshan Velpula


  11. toxic_conversations_50k

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    Massive Text Embedding Benchmark (2022). toxic_conversations_50k [Dataset]. https://huggingface.co/datasets/mteb/toxic_conversations_50k
    Dataset updated
    Jun 29, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ToxicConversationsClassification: an MTEB (Massive Text Embedding Benchmark) dataset

    A collection of comments from the Civil Comments platform, together with annotations indicating whether each comment is toxic.

    Task category: t2c

    Domains: Social, Written
    Reference: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview

    How to evaluate on this task

    You can evaluate an embedding model on this dataset using the following code: import…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k.

  12. Toxic Comment Classification- Toxic

    • kaggle.com
    Updated May 23, 2023
    Cite
    Gaurav Dutta (2023). Toxic Comment Classification- Toxic [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/toxic-comment-classification-toxic/code
    Dataset updated
    May 23, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Gaurav Dutta
    Description

    Dataset

    This dataset was created by Gaurav Dutta


  13. Toxic comment classification 2

    • kaggle.com
    Updated May 19, 2025
    Cite
    Juntao Liang (2025). Toxic comment classification 2 [Dataset]. https://www.kaggle.com/datasets/babypeach/toxic-comment-classification-2/code
    Dataset updated
    May 19, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Juntao Liang
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Juntao Liang

    Released under CC0: Public Domain


  14. GATE: Toxic Language Classifier

    • live.european-language-grid.eu
    Updated Aug 16, 2021
    Cite
    (2021). GATE: Toxic Language Classifier [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/8022
    Dataset updated
    Aug 16, 2021
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This classifier is a fine-tuned Roberta-base model using the simpletransformers toolkit for classifying toxic language. We use the Kaggle Toxic Comments Challenge dataset as training data. This dataset contains Wikipedia comments classified as toxic or non-toxic.

  15. toxic-data-helpers

    • kaggle.com
    zip
    Updated May 11, 2020
    Cite
    Ilya Ezepov (2020). toxic-data-helpers [Dataset]. https://www.kaggle.com/iezepov/toxic-data-helpers
    Available download formats: zip (332461808 bytes)
    Dataset updated
    May 11, 2020
    Authors
    Ilya Ezepov
    Description

    Dataset

    This dataset was created by Ilya Ezepov


  16. FinToxicityClassification

    • huggingface.co
    Cite
    Massive Text Embedding Benchmark, FinToxicityClassification [Dataset]. https://huggingface.co/datasets/mteb/FinToxicityClassification
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    FinToxicityClassification: an MTEB (Massive Text Embedding Benchmark) dataset

    This dataset is a DeepL-based machine-translated version of the Jigsaw toxicity dataset for Finnish. The dataset is originally from a Kaggle competition: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data.
    The original dataset poses a multi-label text classification problem and includes the labels identity_attack, insult, obscene, severe_toxicity, threat and toxicity.
    Here…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/FinToxicityClassification.
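
    In a multi-label problem like the one described above, a comment can carry several of the six labels at once, so targets are commonly stored as a fixed-length 0/1 vector. The sketch below shows one such encoding. The label order, the LABELS constant, and the function name to_multi_hot are assumptions for illustration, not part of the dataset.

```python
from typing import Iterable, List

# Label names come from the description above; the order is an assumption.
LABELS = [
    "identity_attack",
    "insult",
    "obscene",
    "severe_toxicity",
    "threat",
    "toxicity",
]


def to_multi_hot(active: Iterable[str]) -> List[int]:
    """Encode the set of active labels as a 0/1 vector aligned with LABELS."""
    active_set = set(active)
    unknown = active_set - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [int(name in active_set) for name in LABELS]


# Hypothetical comment annotated as both insulting and toxic.
vector = to_multi_hot(["insult", "toxicity"])
```

    A comment with no active labels encodes to an all-zero vector, which is how a non-toxic example would appear under this scheme.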
    
  17. Depression Severity Toxic Comments Dataset

    • kaggle.com
    Updated Jul 14, 2025
    Cite
    Muhammad Mugees Asif (2025). Depression Severity Toxic Comments Dataset [Dataset]. https://www.kaggle.com/datasets/thestartupboy/depression-severity-toxic-comments-dataset/code
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Muhammad Mugees Asif
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a customized and re-labeled version of the original Jigsaw Toxic Comment Classification Challenge dataset.

    Instead of toxic behavior categories, the comments are now annotated with depression severity levels, aiming to support mental health research and AI-based early detection of psychological distress.

    🗂️ Label Categories: Each comment has been carefully annotated into one of the following classes:
    - psychotic_depression
    - severe_depression
    - moderate_depression
    - mild_depression
    - toxic_depression
    - major_depression

    These labels help transform the original problem into a multi-class depression severity classification task.

    👨‍💻 Project Contributors:
    - Muhammad Mugees Asif: Lead Annotator & AI Researcher
    - Dr. Arfan Ali Nagra: Computational Intelligence Expert
    - Sana Asif: Mental Health Research Support & Dataset Coordination

    This dataset was created with the intention to help data scientists, researchers, and students work on AI solutions for mental health support.

    ⚠️ Acknowledgement: The original dataset was sourced from the Jigsaw Toxic Comment Classification Challenge hosted on Kaggle. Full credit to the creators of the original dataset. This re-labeled version is shared for educational and research purposes only.

  18. jigsaw_toxicity_pred_fi

    • huggingface.co
    Updated Jun 2, 2023
    Cite
    TurkuNLP Research Group (2023). jigsaw_toxicity_pred_fi [Dataset]. https://huggingface.co/datasets/TurkuNLP/jigsaw_toxicity_pred_fi
    Dataset updated
    Jun 2, 2023
    Dataset authored and provided by
    TurkuNLP Research Group
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    This dataset is a DeepL-based machine-translated version of the Jigsaw toxicity dataset for Finnish. The dataset is originally from a Kaggle competition: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data. The dataset poses a multi-label text classification problem and includes the labels identity_attack, insult, obscene, severe_toxicity, threat and toxicity.

    Example data

    { "label_identity_attack": 0, "label_insult": 0…

    See the full description on the dataset page: https://huggingface.co/datasets/TurkuNLP/jigsaw_toxicity_pred_fi.

  19. toxic_tweets_and_comments

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Masab Anees (2025). toxic_tweets_and_comments [Dataset]. https://huggingface.co/datasets/Masabanees619/toxic_tweets_and_comments
    Dataset updated
    Jul 8, 2025
    Authors
    Masab Anees
    Description

    Misalignment Toxic Comments Dataset

    A curated collection of toxic comments and tweets for LLM misalignment research.

      Dataset Description
    

    This dataset contains only toxic comments and tweets, drawn from two established sources:

    - Hate Speech and Offensive Language Dataset: https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/data
    - Wikipedia Talk Labels: Personal…

    See the full description on the dataset page: https://huggingface.co/datasets/Masabanees619/toxic_tweets_and_comments.

  20. Toxicity

    • kaggle.com
    Updated Oct 24, 2024
    Cite
    DataNerd (2024). Toxicity [Dataset]. https://www.kaggle.com/datanerd2233/toxicity/discussion
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    DataNerd
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by DataNerd

    Released under CC0: Public Domain

    Contents

    All things toxic about the datasets and more.

jigsaw-toxic-comments

30 scholarly articles cite this dataset (View in Google Scholar)