This version of the CivilComments dataset provides access to the primary seven labels annotated by crowd workers. The toxicity and the other tags are each a value between 0 and 1 indicating the fraction of annotators who assigned that attribute to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created between 2015 and 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
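As a follow-up, a minimal sketch of binarizing the fractional toxicity score at the 0.5 threshold used in the Kaggle challenge; the feature names 'text' and 'toxicity' are taken from the TFDS catalog description above:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')

def to_binary(ex):
  # 'toxicity' is the fraction of annotators who marked the comment toxic;
  # treat >= 0.5 as the positive class, as in the Kaggle challenge.
  return {'text': ex['text'],
          'label': tf.cast(ex['toxicity'] >= 0.5, tf.int32)}

binary_ds = ds.map(to_binary)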
https://choosealicense.com/licenses/cc0-1.0/
A collection of comments from the defunct Civil Comments platform that have been annotated for their toxicity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
ToxicConversationsClassification: an MTEB (Massive Text Embedding Benchmark) dataset
A collection of comments from the Civil Comments platform, together with annotations indicating whether each comment is toxic.
Task category: t2c
Domains: Social, Written
Reference: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview
How to evaluate on this task
You can evaluate an embedding model on this dataset with the MTEB library; see the full description on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k
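A minimal sketch, assuming the mteb and sentence-transformers packages are installed; the embedding model here is an illustrative stand-in, not one prescribed by the task:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works; this small public model is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Restrict the benchmark to this single task and write scores to disk.
evaluation = MTEB(tasks=["ToxicConversationsClassification"])
results = evaluation.run(model, output_folder="results")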
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data repository contains the output files from the analysis of the paper "Supporting Online Toxicity Detection with Knowledge Graphs" presented at the International Conference on Web and Social Media 2022 (ICWSM-2022).
The data contains annotations of gender and sexual orientation entities provided by the Gender, Sex, and Sexual Orientation (GSSO) ontology (https://bioportal.bioontology.org/ontologies/GSSO).
We analyse demographic group samples from the Civil Comments Identities dataset (https://www.tensorflow.org/datasets/catalog/civil_comments).
Toxic Conversation
This is a version of the Jigsaw Unintended Bias in Toxicity Classification dataset. It contains comments from the Civil Comments platform together with annotations indicating whether each comment is toxic. Ten annotators annotated each example and, as recommended on the task page, a comment is labeled toxic when target >= 0.5. The dataset is imbalanced, with only about 8% of the comments marked as toxic.
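As an illustration of that class balance, a minimal sketch using the Hugging Face datasets library; the repository id comes from the dataset page above, while the 'label' column name (1 = toxic) is an assumption, not a confirmed field name:

from collections import Counter
from datasets import load_dataset

# Repo id from the dataset page; the 'label' column is assumed.
ds = load_dataset("mteb/toxic_conversations_50k", split="train")
counts = Counter(ds["label"])
total = sum(counts.values())
# Expect roughly 8% of comments in the toxic class.
print({label: f"{count / total:.1%}" for label, count in counts.items()})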
MIT License: https://opensource.org/licenses/MIT
Indeterminacy Research Datasets
Datasets used in "Validating LLM-as-a-Judge Systems under Rating Indeterminacy".
Datasets
Civil Comments: Toxicity detection
ChaosNLI: Natural language inference (MNLI, SNLI, AlphaNLI)
SummEval: Summarization quality (relevance, coherence, consistency, fluency)
QAGS: Factuality assessment
TopicalChat: Dialogue evaluation