35 datasets found
  1. jigsaw-toxic-comment-classification-challenge

    • huggingface.co
    Updated Feb 8, 2025
    Cite
    Giulio Starace (2025). jigsaw-toxic-comment-classification-challenge [Dataset]. https://huggingface.co/datasets/thesofakillers/jigsaw-toxic-comment-classification-challenge
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2025
    Authors
    Giulio Starace
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Description

    You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

    toxic, severe_toxic, obscene, threat, insult, identity_hate

    You must create a model which predicts a probability of each type of toxicity for each comment.

      File descriptions
    

    • train.csv - the training set; contains comments with their binary labels
    • test.csv - the test set; you must predict the toxicity… See the full description on the dataset page: https://huggingface.co/datasets/thesofakillers/jigsaw-toxic-comment-classification-challenge.
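    The task described above is a standard multi-label text classification problem: six independent binary labels per comment, with a probability required for each. A minimal sketch with scikit-learn, using toy comments and labels in place of the real train.csv (the column layout is assumed from the file descriptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# The six label columns named in the challenge description.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-ins for train.csv's comment_text and binary label columns.
comments = [
    "have a lovely day",
    "you utter fool, I will hurt you",
    "what a disgusting idiot",
    "thanks for the thoughtful reply",
]
y = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

# One independent binary classifier per label over TF-IDF features.
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(comments, y)

# One probability per label for each comment, as the challenge requires.
probs = clf.predict_proba(["you fool"])
print(probs.shape)  # (1, 6)
```

    A real submission would train on the full train.csv and write one probability column per label for each row of test.csv.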

  2. jigsaw-toxic-comment-classification-challenge

    • huggingface.co
    Cite
    Thomas Capelle. jigsaw-toxic-comment-classification-challenge [Dataset]. https://huggingface.co/datasets/tcapelle/jigsaw-toxic-comment-classification-challenge
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Thomas Capelle
    Description

    tcapelle/jigsaw-toxic-comment-classification-challenge dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  4. jigsaw-toxic-comments

    • huggingface.co
    Updated Mar 26, 2025
    + more versions
    Cite
    anitamaxvim (2025). jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments
    Explore at:
    Dataset updated
    Mar 26, 2025
    Authors
    anitamaxvim
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Jigsaw Toxic Comments

      Dataset Description

    The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.

    • Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
    • Shared by: Kaggle
    • Language(s) (NLP): English
    • License: CC0 1.0… See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments.

  5. Data from: Detoxify

    • kaggle.com
    Updated Feb 8, 2022
    Cite
    Shohei Maruyama (2022). Detoxify [Dataset]. https://www.kaggle.com/datasets/maruyama/detoxify
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shohei Maruyama
    Description

    About this dataset

    Trained models to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. https://github.com/unitaryai/detoxify

    How to use

    import transformers

    # original (BERT-based)
    model_path = "../input/detoxify/original"
    tokenizer = transformers.BertTokenizer.from_pretrained(model_path)
    encoder = transformers.BertForTokenClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.BertConfig.from_pretrained(f"{model_path}/config.json"),
    ).bert

    # unbiased (RoBERTa-based)
    model_path = "../input/detoxify/unbiased"
    tokenizer = transformers.RobertaTokenizer.from_pretrained(model_path)
    encoder = transformers.RobertaForSequenceClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.RobertaConfig.from_pretrained(f"{model_path}/config.json"),
    ).roberta

    # multilingual (XLM-RoBERTa-based)
    model_path = "../input/detoxify/multilingual"
    tokenizer = transformers.XLMRobertaTokenizer.from_pretrained(model_path)
    encoder = transformers.XLMRobertaForSequenceClassification.from_pretrained(
        f"{model_path}/pytorch_model.bin",
        config=transformers.XLMRobertaConfig.from_pretrained(f"{model_path}/config.json"),
    ).roberta
    
  6. processed-jigsaw-toxic-comments

    • huggingface.co
    Updated Oct 10, 2023
    Cite
    K Koushik Reddy (2023). processed-jigsaw-toxic-comments [Dataset]. https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments
    Explore at:
    Dataset updated
    Oct 10, 2023
    Authors
    K Koushik Reddy
    License

    Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Processed Jigsaw Toxic Comments Dataset

    This is a preprocessed and tokenized version of the original Jigsaw Toxic Comment Classification Challenge dataset, prepared for multi-label toxicity classification using transformer-based models like BERT. ⚠️ Important Note: I am not the original creator of the dataset. This dataset is a cleaned and restructured version made for quick use in PyTorch deep learning models.

      📦 Dataset Features
    

    Each example contains:

    text: The… See the full description on the dataset page: https://huggingface.co/datasets/Koushim/processed-jigsaw-toxic-comments.

  7. Cleaned Toxic Comments

    • kaggle.com
    zip
    Updated Mar 12, 2018
    Cite
    Zafar (2018). Cleaned Toxic Comments [Dataset]. https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments
    Explore at:
    zip (45799147 bytes). Available download formats
    Dataset updated
    Mar 12, 2018
    Authors
    Zafar
    License

    CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Preprocessed Toxic Comments Classification Dataset

    The obstacle I faced in the Toxic Comments Classification Challenge was the preprocessing part. One can easily improve their leaderboard performance if the preprocessing is done right.

    This is the preprocessed version of Toxic Comments Classification Challenge dataset. The code for preprocessing: https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing

  8. Toxic Comment Classification labelled languages

    • kaggle.com
    Updated Dec 22, 2017
    Cite
    AllHailSammy (2017). Toxic Comment Classification labelled languages [Dataset]. https://www.kaggle.com/datasets/wangshangsam/toxic-comment-classification-labelled-languages/data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 22, 2017
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    AllHailSammy
    License

    GPL 2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by AllHailSammy

    Released under GPL 2

    Contents

  9. Civil Comments Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated Nov 15, 2022
    Cite
    Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2022). Civil Comments Dataset [Dataset]. https://paperswithcode.com/dataset/civil-comments
    Explore at:
    Dataset updated
    Nov 15, 2022
    Authors
    Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman
    Description

    At the end of 2017 the Civil Comments platform shut down and chose to make their ~2m public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.

    In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute (and all the others) is a fractional value representing the fraction of human raters who believed the attribute applied to the given comment.

    The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. The subtype attributes are:

    severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit

    Additionally, a subset of comments have been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation. These identities are shown in bold.

    male, female, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian, bisexual, other_sexual_orientation, christian, jewish, muslim, hindu, buddhist, atheist, other_religion, black, white, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness, other_disability
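    Because every attribute above is the fraction of raters who applied it, downstream work often binarizes the columns before computing classification metrics. A small pandas sketch with toy rows (the 0.5 cutoff is a common convention, assumed here rather than taken from the description above):

```python
import pandas as pd

# Toy rows mirroring the described columns: comment_text, the target
# toxicity, and a couple of subtype attributes (all rater fractions).
df = pd.DataFrame({
    "comment_text": ["great point", "you fool", "I'll find you"],
    "target": [0.0, 0.6, 0.9],
    "insult": [0.0, 0.7, 0.1],
    "threat": [0.0, 0.0, 0.8],
})

# Binarize each fractional column at an assumed 0.5 cutoff.
for col in ["target", "insult", "threat"]:
    df[col + "_bin"] = (df[col] >= 0.5).astype(int)

print(df["target_bin"].tolist())  # [0, 1, 1]
```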

  10. jigsaw-toxic-comment-classification

    • kaggle.com
    Updated Jun 22, 2019
    Cite
    Zhiyu Li (2019). jigsaw-toxic-comment-classification [Dataset]. https://www.kaggle.com/zhiyuli000/jigsawtoxiccommentclassification/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zhiyu Li
    Description

    Dataset

    This dataset was created by Zhiyu Li

    Contents

  11. Toxicity Dataset

    • figshare.com
    bin
    Updated Oct 16, 2024
    Cite
    Vincent Maladiere (2024). Toxicity Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.27240072.v2
    Explore at:
    bin. Available download formats
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    figshare
    Authors
    Vincent Maladiere
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Toxicity Dataset, by Surge AI. This dataset contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

    Columns

    • text: the text of the comment
    • is_toxic: whether or not the comment is toxic

  12. Tox

    • huggingface.co
    Updated Aug 18, 2023
    Cite
    Victor Luz (2023). Tox [Dataset]. https://huggingface.co/datasets/vluz/Tox
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2023
    Authors
    Victor Luz
    License

    CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/

    Description

    A cleaned-up version of the train dataset from Kaggle's Toxic Comment Classification Challenge.

    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip
    The alt_format directory contains an alternate format intended for a tutorial.

    What was done:

    • Removed extra spaces and new lines
    • Removed non-printing characters
    • Removed punctuation except apostrophe… See the full description on the dataset page: https://huggingface.co/datasets/vluz/Tox.

  13. Online Comment Toxicity Labels Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    + more versions
    Cite
    Datasimple (2025). Online Comment Toxicity Labels Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/0a34ae04-d822-4ac2-aaa1-1f445579500b
    Explore at:
    Available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Social Media and Networking
    Description

    This dataset contains hand-labelled toxicity data from 1000 comments, which were crawled from YouTube videos related to the Ferguson unrest in 2014. It is designed to assist in categorising online comment toxicity, featuring labels for multiple subclassifications that form a hierarchical structure. Each comment can have one or more of these labels assigned.

    Columns

    • CommentId: A unique identifier for each comment.
    • VideoId: The YouTube video identifier from which the comment originated.
    • Text: The full text of the comment.
    • IsToxic: A boolean indicating whether the comment is considered toxic.
    • IsAbusive: A boolean indicating if the comment is abusive.
    • IsThreat: A boolean indicating if the comment contains a threat.
    • IsProvocative: A boolean indicating if the comment is provocative.
    • IsObscene: A boolean indicating if the comment is obscene.
    • IsHatespeech: A boolean indicating if the comment contains hate speech.
    • IsRacist: A boolean indicating if the comment is racist.

    Distribution

    The dataset comprises 1000 unique comments and is typically provided in CSV format. It details the toxicity subclassifications with their respective distributions:

    • IsToxic: 46% of comments are labelled as true.
    • IsAbusive: 35% of comments are labelled as true.
    • IsThreat: 2% of comments are labelled as true.
    • IsProvocative: 16% of comments are labelled as true.
    • IsObscene: 10% of comments are labelled as true.
    • IsHatespeech: 14% of comments are labelled as true.
    • IsRacist: 13% of comments are labelled as true.
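    The quoted percentages are simply the per-column means of the boolean flags; a small pandas sketch, with toy rows standing in for the real 1000-comment CSV:

```python
import pandas as pd

# Four toy rows in place of the 1000 hand-labelled comments.
df = pd.DataFrame({
    "IsToxic": [True, False, True, False],
    "IsAbusive": [True, False, False, False],
    "IsThreat": [False, False, False, False],
})

# Fraction of comments labelled True for each flag.
prevalence = df[["IsToxic", "IsAbusive", "IsThreat"]].mean()
print(prevalence["IsToxic"])  # 0.5
```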

    Usage

    This dataset is ideal for a variety of applications, including:

    • Developing and evaluating machine learning models for natural language processing (NLP).
    • Training systems for multiclass classification of text data.
    • Performing text mining operations to identify patterns in online discourse.
    • Building tools for automated content moderation and detection of abusive language or hate speech.

    Coverage

    The dataset covers YouTube comments from videos related to the Ferguson unrest in 2014. It has a global region scope, focusing on comments from this specific period and event.

    License

    CC-BY

    Who Can Use It

    • Researchers: For academic studies on online social behaviour, hate speech, and natural language processing.
    • Data Scientists and Machine Learning Engineers: For building and refining models for content moderation, sentiment analysis, and toxicity detection.
    • Developers: To integrate toxicity analysis features into social media platforms or other applications.
    • Organisations/Companies: For enhancing platform safety and managing user-generated content.

    Dataset Name Suggestions

    • YouTube Toxic Comments Dataset
    • Online Comment Toxicity Labels
    • Ferguson Unrest YouTube Comments
    • Hand-Labelled Toxicity Data

    Attributes

    Original Data Source: Youtube toxic comments

  14. bert-toxic-comment-classification-challenge

    • kaggle.com
    Updated Feb 1, 2022
    Cite
    abxmaster (2022). bert-toxic-comment-classification-challenge [Dataset]. https://www.kaggle.com/datasets/abxmaster/berttoxiccommentclassificationchallenge/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 1, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    abxmaster
    Description

    Dataset

    This dataset was created by abxmaster

    Contents

  15. civil_comments

    • tensorflow.org
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers, the toxicity and other tags are a value between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
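    If you re-split this data yourself, the leakage noted above can be avoided by keeping whole comment threads on one side of the split. A sketch using scikit-learn's GroupShuffleSplit (the per-comment thread id is an assumption for illustration; the dataset itself only exposes parent_id links, from which a thread key would be derived):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy comment indices and a thread id per comment; in the real data the
# group key would come from following parent_id chains to the root.
comment_idx = np.arange(8)
threads = np.array([0, 0, 1, 1, 2, 2, 3, 3])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(comment_idx, groups=threads))

# No thread straddles the train/test boundary.
print(set(threads[train_idx]).isdisjoint(threads[test_idx]))  # True
```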

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  16. UIT-ViCTSD (UIT Vietnamese Constructive and Toxic Speech Detection)

    • opendatalab.com
    zip
    Updated Mar 24, 2023
    + more versions
    Cite
    University of Information Technology (2023). UIT-ViCTSD (UIT Vietnamese Constructive and Toxic Speech Detection) [Dataset]. https://opendatalab.com/OpenDataLab/UIT-ViCTSD
    Explore at:
    zip. Available download formats
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Vietnam National University, Ho Chi Minh City (https://vnuhcm.edu.vn/)
    University of Information Technology
    Description

    UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.

  17. Toxic Comment Classification- Toxic

    • kaggle.com
    Updated May 23, 2023
    Cite
    Gaurav Dutta (2023). Toxic Comment Classification- Toxic [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/toxic-comment-classification-toxic/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Dutta
    Description

    Dataset

    This dataset was created by Gaurav Dutta

    Contents

  18. FinToxicityClassification

    • huggingface.co
    Cite
    Massive Text Embedding Benchmark, FinToxicityClassification [Dataset]. https://huggingface.co/datasets/mteb/FinToxicityClassification
    Explore at:
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    FinToxicityClassification: an MTEB (Massive Text Embedding Benchmark) dataset.

    This dataset is a DeepL-based machine-translated version of the Jigsaw toxicity dataset for Finnish. The dataset originally comes from a Kaggle competition: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data.
    The original dataset poses a multi-label text classification problem and includes the labels identity_attack, insult, obscene, severe_toxicity, threat and toxicity.
    Here… See the full description on the dataset page: https://huggingface.co/datasets/mteb/FinToxicityClassification.
    
  19. Toxic Comment Detection Multilingual [Extended]

    • kaggle.com
    Updated Jun 6, 2020
    Cite
    Alan Sun (2020). Toxic Comment Detection Multilingual [Extended] [Dataset]. https://www.kaggle.com/alansun17904/toxic-comment-detection-multilingual-extended/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alan Sun
    Description

  20. Navigating News Narratives: A Media Bias Analysis Dataset

    • figshare.com
    txt
    Updated Dec 8, 2023
    + more versions
    Cite
    Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
    Explore at:
    txt. Available download formats
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    figshare
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset does not contain any personally identifiable information (PII).

    The data structure is tabulated as follows:

    • Text: The main content.
    • Dimension: Descriptive category of the text.
    • Biased_Words: A compilation of words regarded as biased.
    • Aspect: Specific sub-topic within the main content.
    • Label: Indicates the degree of bias. The label is ternary: highly biased, slightly biased, and neutral.
    • Toxicity: Indicates the presence (True) or absence (False) of toxicity.
    • Identity_mention: Mention of any identity based on word match.

    Annotation scheme: The labels and annotations in the dataset are generated through a system of Active Learning, cycling through manual labeling, semi-supervised learning, and human verification. The scheme comprises:

    • Bias label: Specifies the degree of bias (e.g., no bias, mild, or strong).
    • Word/phrase-level biases: Pinpoints specific biased terms or phrases.
    • Subjective bias (aspect): Highlights biases pertinent to content dimensions.

    Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS feeds to capture different dimensions of news media bias. Annotation is performed using active learning to label each sentence (neutral / slightly biased / highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources, with attribution:

    • MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC - A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    • Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    • Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. "Toxic Comment Classification Challenge." Kaggle, 2017. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    • Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle, 2019. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    • Age bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data, Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    • Multidimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    • Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing the data should be straightforward to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute, is licensed under CC BY-NC 4.0.
