6 datasets found
  1. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments dataset provides access to the primary seven labels annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned those attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
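    The fractional labels described above can be reproduced from raw annotator votes. A minimal sketch, using hypothetical vote data rather than the dataset's actual annotation format:

```python
# Hypothetical per-comment annotator votes: each annotator marks (1) or does
# not mark (0) an attribute such as toxicity. The dataset's label for that
# attribute is the fraction of annotators who assigned it.

def fraction_label(votes):
    """Fraction of annotators who assigned the attribute (a float in [0, 1])."""
    return sum(votes) / len(votes)

votes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 10 annotators, 7 marked "toxic"
print(fraction_label(votes))  # 0.7
```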

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, and covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge, and is released under CC0, as is the underlying comment text.

    For comments whose parent_id is also in the Civil Comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak information across splits. The annotators did not have access to the parent text when making the labels.
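    The parent_text lookup described above amounts to a self-join on comment IDs. A minimal sketch with hypothetical records (the real data has many more fields):

```python
# Hypothetical comment records; parent_id refers to another comment's id,
# or to a comment that is not present in the data.
comments = [
    {"id": 1, "parent_id": None, "text": "Great article."},
    {"id": 2, "parent_id": 1, "text": "I disagree."},
    {"id": 3, "parent_id": 99, "text": "Reply to a comment outside the data."},
]

# Build an id -> text index, then attach parent_text where the parent exists.
by_id = {c["id"]: c["text"] for c in comments}
for c in comments:
    c["parent_text"] = by_id.get(c["parent_id"])

print(comments[1]["parent_text"])  # Great article.
print(comments[2]["parent_text"])  # None (parent not in the data)
```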

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  2. jigsaw_unintended_bias

    • huggingface.co
    Updated Nov 18, 2021
    Cite
    Google (2021). jigsaw_unintended_bias [Dataset]. https://huggingface.co/datasets/google/jigsaw_unintended_bias
    Explore at:
    Dataset updated
    Nov 18, 2021
    Dataset authored and provided by
    Google (http://google.com/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    A collection of comments from the defunct Civil Comments platform that have been annotated for their toxicity.

  3. toxic_conversations_50k

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    Massive Text Embedding Benchmark (2022). toxic_conversations_50k [Dataset]. https://huggingface.co/datasets/mteb/toxic_conversations_50k
    Explore at:
    Dataset updated
    Jun 29, 2022
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    ToxicConversationsClassification, an MTEB dataset (Massive Text Embedding Benchmark)

    A collection of comments from the Civil Comments platform, together with annotations indicating whether each comment is toxic.

    Task category: t2c

    Domains: Social, Written
    Reference: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/overview

      How to evaluate on this task

    You can evaluate an embedding model on this dataset; see the full description and example code on the dataset page: https://huggingface.co/datasets/mteb/toxic_conversations_50k.
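    MTEB classification tasks score an embedding model by training a lightweight classifier on frozen embeddings (MTEB itself uses logistic regression). A rough, dependency-free sketch of the idea, with a nearest-centroid classifier standing in for the real one and hypothetical 2-d vectors in place of real embeddings:

```python
# Toy sketch of embedding-based toxicity classification. Embeddings here are
# hypothetical 2-d vectors; a real run would embed comment text with a model.

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid_predict(train_vecs, train_labels, test_vec):
    """Assign test_vec the label of the nearest class centroid."""
    by_label = {}
    for vec, lab in zip(train_vecs, train_labels):
        by_label.setdefault(lab, []).append(vec)
    cents = {lab: centroid(vecs) for lab, vecs in by_label.items()}

    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(cents, key=lambda lab: dist2(cents[lab], test_vec))

# Hypothetical embeddings: toxic comments cluster away from non-toxic ones.
train = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = ["toxic", "toxic", "not_toxic", "not_toxic"]
print(nearest_centroid_predict(train, labels, [0.85, 0.7]))   # toxic
print(nearest_centroid_predict(train, labels, [0.15, 0.15]))  # not_toxic
```

    The quality of such a classifier tracks how well the embedding space separates the classes, which is what the benchmark is measuring.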

  4. Supporting Online Toxicity Detection with Knowledge Graphs: Data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 24, 2022
    + more versions
    Cite
    Paula Reyero Lobo; Paula Reyero Lobo (2022). Supporting Online Toxicity Detection with Knowledge Graphs: Data [Dataset]. http://doi.org/10.5281/zenodo.6379344
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paula Reyero Lobo; Paula Reyero Lobo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data repository contains the output files from the analysis of the paper "Supporting Online Toxicity Detection with Knowledge Graphs" presented at the International Conference on Web and Social Media 2022 (ICWSM-2022).

    The data contains annotations of gender and sexual orientation entities provided by the Gender and Sexual Orientation Ontology (https://bioportal.bioontology.org/ontologies/GSSO).

    We analyse demographic group samples from the Civil Comments Identities dataset (https://www.tensorflow.org/datasets/catalog/civil_comments).

  5. toxic_conversations

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Cite
    SetFit (2022). toxic_conversations [Dataset]. https://huggingface.co/datasets/SetFit/toxic_conversations
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2022
    Dataset authored and provided by
    SetFit
    Description

    Toxic Conversation

    This is a version of the Jigsaw Unintended Bias in Toxicity Classification dataset. It contains comments from the Civil Comments platform, together with annotations indicating whether each comment is toxic. Ten annotators annotated each example and, as recommended on the task page, a comment is labeled toxic when target >= 0.5. The dataset is imbalanced, with only about 8% of comments marked as toxic.
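    Binarizing the fractional target at 0.5 and measuring the class imbalance, as described above, is straightforward. A minimal sketch with hypothetical scores (the real split has roughly 8% toxic comments):

```python
# Hypothetical fractional toxicity targets (fraction of annotators who marked
# each comment toxic). A comment counts as toxic when target >= 0.5.
targets = [0.0, 0.1, 0.6, 0.2, 0.9, 0.0, 0.3, 0.5]

labels = [1 if t >= 0.5 else 0 for t in targets]
toxic_rate = sum(labels) / len(labels)

print(labels)      # [0, 0, 1, 0, 1, 0, 0, 1]
print(toxic_rate)  # 0.375
```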

  6. indeterminacy-datasets

    • huggingface.co
    Updated Oct 2, 2025
    Cite
    Luke Guerdan (2025). indeterminacy-datasets [Dataset]. https://huggingface.co/datasets/lguerdan/indeterminacy-datasets
    Explore at:
    Dataset updated
    Oct 2, 2025
    Authors
    Luke Guerdan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Indeterminacy Research Datasets

    Datasets used in Validating LLM-as-a-Judge Systems under Rating Indeterminacy

      Datasets
    

    Civil Comments: Toxicity detection
    ChaosNLI: Natural language inference (MNLI, SNLI, AlphaNLI)
    SummEval: Summarization quality (relevance, coherence, consistency, fluency)
    QAGS: Factuality assessment
    TopicalChat: Dialogue evaluation


civil_comments: 26 scholarly articles cite this dataset (view in Google Scholar).