The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
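If you want the individual labels rather than whole example dicts, something like the following sketch should work; the feature keys here are assumed from the subtype list above rather than taken verbatim from the dataset card:

import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in tfds.as_numpy(ds.take(1)):
    text = ex['text'].decode('utf-8')  # comment text is stored as bytes
    subtype_keys = ('severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack')
    subtypes = {k: float(ex[k]) for k in subtype_keys}
    print(float(ex['toxicity']), subtypes, text[:80])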
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Jigsaw Toxic Comments
Dataset
Dataset Description
The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.
Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
Shared by: Kaggle
Language(s) (NLP): English
License: CC0 1.0… See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.
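To pull this dataset from the Hub, a minimal loading sketch (the split name is an assumption; check the dataset page for the exact configuration):

from datasets import load_dataset

jigsaw = load_dataset("anitamaxvim/jigsaw-toxic-comments", split="train")
print(jigsaw[0])  # one record with its toxicity columns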
https://choosealicense.com/licenses/other/
Toxic-comments (Teeny-Tiny Castle)
This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.
How to Use
from datasets import load_dataset
dataset = load_dataset("AiresPucrs/toxic_content", split='train')
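To sanity-check what was loaded:

print(dataset.features)  # column names and types
print(dataset[0])        # first record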
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
tillschwoerer/toxic-comments dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The obstacle I faced in the Toxic Comment Classification Challenge was the preprocessing part. One can easily improve their leaderboard (LB) performance if the preprocessing is done right.
This is the preprocessed version of the Toxic Comment Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing
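The linked kernel contains the full pipeline; as a rough illustration of the kind of cleaning involved (the exact steps in the kernel may differ), a minimal sketch:

import re

def clean_comment(text):
    # Lowercase, drop URLs and IP addresses, keep simple tokens, collapse whitespace.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", " ", text)
    text = re.sub(r"[^a-z0-9' ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("Check 192.168.0.1 or https://example.com NOW!!!"))  # -> check or now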
At the end of 2017 the Civil Comments platform shut down and chose to make their ~2 million public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.
In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in the training data has a toxicity label (target), and models should predict the target toxicity for the test data. This attribute (and all the others) is a fractional value representing the fraction of human raters who believed the attribute applied to the given comment.
The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. The subtype attributes are:
severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit
Additionally, a subset of comments has been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation; these are marked with an asterisk:
male*, female*, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian*, bisexual, other_sexual_orientation, christian*, jewish*, muslim*, hindu, buddhist, atheist, other_religion, black*, white*, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness*, other_disability
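Because target and the subtype columns are rater fractions, a common first step is to binarize them at 0.5 before training a standard classifier; a pandas sketch (the file path is an assumption about your local setup):

import pandas as pd

train = pd.read_csv("train.csv")
train["toxic_label"] = (train["target"] >= 0.5).astype(int)
subtypes = ["severe_toxicity", "obscene", "threat", "insult",
            "identity_attack", "sexual_explicit"]
print(train[["target"] + subtypes].describe())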
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
text: the text of the comment
is_toxic: whether or not the comment is toxic
tcapelle/jigsaw-toxic-comment-classification-challenge dataset hosted on Hugging Face and contributed by the HF Datasets community
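With that schema, filtering down to the toxic subset is straightforward (assuming is_toxic is stored as a boolean or 0/1 flag, and that a train split exists):

from datasets import load_dataset

ds = load_dataset("tcapelle/jigsaw-toxic-comment-classification-challenge", split="train")
toxic_only = ds.filter(lambda ex: bool(ex["is_toxic"]))
print(len(toxic_only), "toxic comments")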
This dataset was created by Chinmay Das
Trained models to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. https://github.com/unitaryai/detoxify
import transformers

# Each checkpoint ships a pytorch_model.bin plus config.json; the
# classification head is dropped and only the underlying encoder is kept.

# original (BERT)
model_path = "../input/detoxify/original"
tokenizer = transformers.BertTokenizer.from_pretrained(model_path)
encoder = transformers.BertForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.BertConfig.from_pretrained(f"{model_path}/config.json"),
).bert

# unbiased (RoBERTa)
model_path = "../input/detoxify/unbiased"
tokenizer = transformers.RobertaTokenizer.from_pretrained(model_path)
encoder = transformers.RobertaForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.RobertaConfig.from_pretrained(f"{model_path}/config.json"),
).roberta

# multilingual (XLM-RoBERTa)
model_path = "../input/detoxify/multilingual"
tokenizer = transformers.XLMRobertaTokenizer.from_pretrained(model_path)
encoder = transformers.XLMRobertaForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.XLMRobertaConfig.from_pretrained(f"{model_path}/config.json"),
).roberta
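If you only need predictions rather than the raw encoders, the detoxify package from the linked repository wraps all three checkpoints behind a single call:

from detoxify import Detoxify

# Also accepts 'unbiased' and 'multilingual'; downloads the checkpoint on first use.
results = Detoxify('original').predict("You are a wonderful person.")
print(results)  # dict of per-attribute scores, e.g. toxicity, insult, threat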
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
About Dataset This is a hand-labelled toxicity data set containing 1000 comments crawled from YouTube videos about the Ferguson unrest in 2014. In addition to toxicity, this data set contains labels for multiple subclassifications of toxicity which form a hierarchical structure. Each comment can have multiple of these labels assigned. The structure can be seen in the following enumeration:
- IsToxic
  - IsAbusive
    - IsThreat
    - IsProvocative
    - IsObscene
  - IsHatespeech
    - IsRacist
    - IsNationalist
    - IsSexist
    - IsHomophobic
    - IsReligiousHate
  - IsRadicalism
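One convenient way to work with these labels in code is to store each label's parent, so a label set can be closed upward (a child label implies its ancestors); the mapping below follows the enumeration above:

PARENT = {
    "IsAbusive": "IsToxic",
    "IsThreat": "IsAbusive", "IsProvocative": "IsAbusive", "IsObscene": "IsAbusive",
    "IsHatespeech": "IsToxic",
    "IsRacist": "IsHatespeech", "IsNationalist": "IsHatespeech",
    "IsSexist": "IsHatespeech", "IsHomophobic": "IsHatespeech",
    "IsReligiousHate": "IsHatespeech",
    "IsRadicalism": "IsToxic",
}

def implied_labels(labels):
    # Add every ancestor of each assigned label.
    closed = set(labels)
    for label in labels:
        while label in PARENT:
            label = PARENT[label]
            closed.add(label)
    return closed

print(implied_labels({"IsRacist"}))  # {'IsRacist', 'IsHatespeech', 'IsToxic'}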
Original Data Source: YouTube toxic comments
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Processed Jigsaw Toxic Comments Dataset
This is a preprocessed and tokenized version of the original Jigsaw Toxic Comment Classification Challenge dataset, prepared for multi-label toxicity classification using transformer-based models like BERT. ⚠️ Important Note: I am not the original creator of the dataset. This dataset is a cleaned and restructured version made for quick use in PyTorch deep learning models.
📦 Dataset Features
Each example contains:
text: The… See the full description on the dataset page: https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments.
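For reference, a typical multi-label configuration in transformers for the six Jigsaw labels (the model name and setup here are illustrative assumptions, not necessarily what this dataset's author used):

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss, one sigmoid per label
)
enc = tokenizer("example comment", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)
print(dict(zip(LABELS, probs.squeeze().tolist())))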
GNU GPL v2: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by AllHailSammy
Released under GPL 2
UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.
akcit-ijf/jigsaw-toxic-comment-train-processed-seqlen128 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by trandong2932002
Released under MIT
This text dataset includes 3,481 social media user comments posted in response to political news posts and videos on Twitter, YouTube, and Reddit in August 2021. The dataset also includes MTurk workers' annotations of these comments as hateful, offensive, and/or toxic, as well as codes assigned by researchers describing various rhetorical dimensions of these comments.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This classifier is a fine-tuned RoBERTa-base model, built with the simpletransformers toolkit, for classifying toxic language. We use the Kaggle Toxic Comment Classification Challenge dataset as training data. This dataset contains Wikipedia comments classified as toxic or non-toxic.
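A minimal reproduction of that setup with simpletransformers (the hyperparameters and toy data are placeholders, not those of the published classifier):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Expected input: a DataFrame with 'text' and integer 'labels' columns.
train_df = pd.DataFrame({"text": ["you are great", "you are an idiot"],
                         "labels": [0, 1]})
model = ClassificationModel("roberta", "roberta-base", num_labels=2,
                            args={"num_train_epochs": 1}, use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["some new comment"])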
This dataset was created by Rowan Curry