This version of the CivilComments dataset provides access to the primary seven labels annotated by crowd workers. The toxicity and the other tags are each a value between 0 and 1 indicating the fraction of annotators who assigned that attribute to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created between 2015 and 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
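As a follow-up, a minimal sketch of binarizing the fractional toxicity score at the 0.5 threshold used in the Kaggle challenge; the feature names 'text' and 'toxicity' are taken from the TFDS catalog description above:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')

def to_binary(ex):
  # 'toxicity' is the fraction of annotators who marked the comment toxic;
  # treat >= 0.5 as the positive class, as in the Kaggle challenge.
  return {'text': ex['text'],
          'label': tf.cast(ex['toxicity'] >= 0.5, tf.int32)}

binary_ds = ds.map(to_binary)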
https://choosealicense.com/licenses/cc0-1.0/
A collection of comments from the defunct Civil Comments platform that have been annotated for their toxicity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
ToxicConversationsClassification: an MTEB (Massive Text Embedding Benchmark) dataset
A collection of comments from the Civil Comments platform, together with annotations indicating whether each comment is toxic.
Task category: t2c
Domains: Social, Written
Reference: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview
How to evaluate on this task
You can evaluate an embedding model on this dataset with the MTEB library; see the full description on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k
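A minimal sketch, assuming the mteb and sentence-transformers packages are installed; the embedding model here is an illustrative stand-in, not one prescribed by the task:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works; this small public model is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Restrict the benchmark to this single task and write scores to disk.
evaluation = MTEB(tasks=["ToxicConversationsClassification"])
results = evaluation.run(model, output_folder="results")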
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data repository contains the output files from the analysis of the paper "Supporting Online Toxicity Detection with Knowledge Graphs" presented at the International Conference on Web and Social Media 2022 (ICWSM-2022).
The data contains annotations of gender and sexual orientation entities provided by the Gender, Sex, and Sexual Orientation (GSSO) ontology (https://bioportal.bioontology.org/ontologies/GSSO).
We analyse demographic group samples from the Civil Comments Identities dataset (https://www.tensorflow.org/datasets/catalog/civil_comments).
Toxic Conversation
This is a version of the Jigsaw Unintended Bias in Toxicity Classification dataset. It contains comments from the Civil Comments platform together with annotations indicating whether each comment is toxic. Ten annotators annotated each example and, as recommended on the task page, a comment is labeled toxic when target >= 0.5. The dataset is imbalanced, with only about 8% of the comments marked as toxic.
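As an illustration of that class balance, a minimal sketch using the Hugging Face datasets library; the repository id comes from the dataset page above, while the 'label' column name (1 = toxic) is an assumption, not a confirmed field name:

from collections import Counter
from datasets import load_dataset

# Repo id from the dataset page; the 'label' column is assumed.
ds = load_dataset("mteb/toxic_conversations_50k", split="train")
counts = Counter(ds["label"])
total = sum(counts.values())
# Expect roughly 8% of comments in the toxic class.
print({label: f"{count / total:.1%}" for label, count in counts.items()})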
MIT License: https://opensource.org/licenses/MIT
Indeterminacy Research Datasets
Datasets used in "Validating LLM-as-a-Judge Systems under Rating Indeterminacy".
Datasets
Civil Comments: Toxicity detection
ChaosNLI: Natural language inference (MNLI, SNLI, AlphaNLI)
SummEval: Summarization quality (relevance, coherence, consistency, fluency)
QAGS: Factuality assessment
TopicalChat: Dialogue evaluation