The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
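If you want the individual labels rather than whole example dicts, something like the following sketch should work; the feature keys here are assumed from the subtype list above rather than taken verbatim from the dataset card:

import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in tfds.as_numpy(ds.take(1)):
    text = ex['text'].decode('utf-8')  # comment text is stored as bytes
    subtype_keys = ('severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack')
    subtypes = {k: float(ex[k]) for k in subtype_keys}
    print(float(ex['toxicity']), subtypes, text[:80])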
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Jigsaw Toxic Comments
Dataset
Dataset Description
The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.
Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
Shared by: Kaggle
Language(s) (NLP): English
License: CC0 1.0… See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.
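To pull this dataset from the Hub, a minimal loading sketch (the split name is an assumption; check the dataset page for the exact configuration):

from datasets import load_dataset

jigsaw = load_dataset("anitamaxvim/jigsaw-toxic-comments", split="train")
print(jigsaw[0])  # one record with its toxicity columns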
https://choosealicense.com/licenses/other/
Toxic-comments (Teeny-Tiny Castle)
This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.
How to Use
from datasets import load_dataset
dataset = load_dataset("AiresPucrs/toxic_content", split='train')
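To sanity-check what was loaded:

print(dataset.features)  # column names and types
print(dataset[0])        # first record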
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
tillschwoerer/toxic-comments dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The obstacle I faced in the Toxic Comment Classification Challenge was the preprocessing part. One can easily improve their leaderboard (LB) performance if the preprocessing is done right.
This is the preprocessed version of the Toxic Comment Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing
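The linked kernel contains the full pipeline; as a rough illustration of the kind of cleaning involved (the exact steps in the kernel may differ), a minimal sketch:

import re

def clean_comment(text):
    # Lowercase, drop URLs and IP addresses, keep simple tokens, collapse whitespace.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", " ", text)
    text = re.sub(r"[^a-z0-9' ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("Check 192.168.0.1 or https://example.com NOW!!!"))  # -> check or now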
At the end of 2017 the Civil Comments platform shut down and chose to make their ~2 million public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.
In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in the training data has a toxicity label (target), and models should predict the target toxicity for the test data. This attribute (and all the others) is a fractional value representing the fraction of human raters who believed the attribute applied to the given comment.
The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. The subtype attributes are:
severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit
Additionally, a subset of comments has been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation; these are marked with an asterisk:
male*, female*, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian*, bisexual, other_sexual_orientation, christian*, jewish*, muslim*, hindu, buddhist, atheist, other_religion, black*, white*, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness*, other_disability
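Because target and the subtype columns are rater fractions, a common first step is to binarize them at 0.5 before training a standard classifier; a pandas sketch (the file path is an assumption about your local setup):

import pandas as pd

train = pd.read_csv("train.csv")
train["toxic_label"] = (train["target"] >= 0.5).astype(int)
subtypes = ["severe_toxicity", "obscene", "threat", "insult",
            "identity_attack", "sexual_explicit"]
print(train[["target"] + subtypes].describe())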
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
text: the text of the comment
is_toxic: whether or not the comment is toxic
tcapelle/jigsaw-toxic-comment-classification-challenge dataset hosted on Hugging Face and contributed by the HF Datasets community
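With that schema, filtering down to the toxic subset is straightforward (assuming is_toxic is stored as a boolean or 0/1 flag, and that a train split exists):

from datasets import load_dataset

ds = load_dataset("tcapelle/jigsaw-toxic-comment-classification-challenge", split="train")
toxic_only = ds.filter(lambda ex: bool(ex["is_toxic"]))
print(len(toxic_only), "toxic comments")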
This dataset was created by Chinmay Das
Trained models to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. https://github.com/unitaryai/detoxify
import transformers

# Each checkpoint ships a pytorch_model.bin plus config.json; the
# classification head is dropped and only the underlying encoder is kept.

# original (BERT)
model_path = "../input/detoxify/original"
tokenizer = transformers.BertTokenizer.from_pretrained(model_path)
encoder = transformers.BertForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.BertConfig.from_pretrained(f"{model_path}/config.json"),
).bert

# unbiased (RoBERTa)
model_path = "../input/detoxify/unbiased"
tokenizer = transformers.RobertaTokenizer.from_pretrained(model_path)
encoder = transformers.RobertaForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.RobertaConfig.from_pretrained(f"{model_path}/config.json"),
).roberta

# multilingual (XLM-RoBERTa)
model_path = "../input/detoxify/multilingual"
tokenizer = transformers.XLMRobertaTokenizer.from_pretrained(model_path)
encoder = transformers.XLMRobertaForSequenceClassification.from_pretrained(
    f"{model_path}/pytorch_model.bin",
    config=transformers.XLMRobertaConfig.from_pretrained(f"{model_path}/config.json"),
).roberta
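If you only need predictions rather than the raw encoders, the detoxify package from the linked repository wraps all three checkpoints behind a single call:

from detoxify import Detoxify

# Also accepts 'unbiased' and 'multilingual'; downloads the checkpoint on first use.
results = Detoxify('original').predict("You are a wonderful person.")
print(results)  # dict of per-attribute scores, e.g. toxicity, insult, threat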
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
About Dataset This is a hand-labelled toxicity data set containing 1000 comments crawled from YouTube videos about the Ferguson unrest in 2014. In addition to toxicity, this data set contains labels for multiple subclassifications of toxicity which form a hierarchical structure. Each comment can have multiple of these labels assigned. The structure can be seen in the following enumeration:
- IsToxic
  - IsAbusive
    - IsThreat
    - IsProvocative
    - IsObscene
  - IsHatespeech
    - IsRacist
    - IsNationalist
    - IsSexist
    - IsHomophobic
    - IsReligiousHate
  - IsRadicalism
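One convenient way to work with these labels in code is to store each label's parent, so a label set can be closed upward (a child label implies its ancestors); the mapping below follows the enumeration above:

PARENT = {
    "IsAbusive": "IsToxic",
    "IsThreat": "IsAbusive", "IsProvocative": "IsAbusive", "IsObscene": "IsAbusive",
    "IsHatespeech": "IsToxic",
    "IsRacist": "IsHatespeech", "IsNationalist": "IsHatespeech",
    "IsSexist": "IsHatespeech", "IsHomophobic": "IsHatespeech",
    "IsReligiousHate": "IsHatespeech",
    "IsRadicalism": "IsToxic",
}

def implied_labels(labels):
    # Add every ancestor of each assigned label.
    closed = set(labels)
    for label in labels:
        while label in PARENT:
            label = PARENT[label]
            closed.add(label)
    return closed

print(implied_labels({"IsRacist"}))  # {'IsRacist', 'IsHatespeech', 'IsToxic'}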
Original Data Source: YouTube toxic comments
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Processed Jigsaw Toxic Comments Dataset
This is a preprocessed and tokenized version of the original Jigsaw Toxic Comment Classification Challenge dataset, prepared for multi-label toxicity classification using transformer-based models like BERT. ⚠️ Important Note: I am not the original creator of the dataset. This dataset is a cleaned and restructured version made for quick use in PyTorch deep learning models.
📦 Dataset Features
Each example contains:
text: The… See the full description on the dataset page: https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments.
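For reference, a typical multi-label configuration in transformers for the six Jigsaw labels (the model name and setup here are illustrative assumptions, not necessarily what this dataset's author used):

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss, one sigmoid per label
)
enc = tokenizer("example comment", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)
print(dict(zip(LABELS, probs.squeeze().tolist())))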
GNU GPL v2: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by AllHailSammy
Released under GPL 2
UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.
akcit-ijf/jigsaw-toxic-comment-train-processed-seqlen128 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by trandong2932002
Released under MIT
This text dataset includes 3,481 social media user comments posted in response to political news posts and videos on Twitter, YouTube, and Reddit in August 2021. The dataset also includes MTurk workers' annotations of these comments as hateful, offensive, and/or toxic, as well as codes assigned by researchers describing various rhetorical dimensions of these comments.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This classifier is a fine-tuned RoBERTa-base model, built with the simpletransformers toolkit, for classifying toxic language. We use the Kaggle Toxic Comment Classification Challenge dataset as training data. This dataset contains Wikipedia comments classified as toxic or non-toxic.
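A minimal reproduction of that setup with simpletransformers (the hyperparameters and toy data are placeholders, not those of the published classifier):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Expected input: a DataFrame with 'text' and integer 'labels' columns.
train_df = pd.DataFrame({"text": ["you are great", "you are an idiot"],
                         "labels": [0, 1]})
model = ClassificationModel("roberta", "roberta-base", num_labels=2,
                            args={"num_train_epochs": 1}, use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["some new comment"])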
This dataset was created by Rowan Curry