You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
toxic
severe_toxic
obscene
threat
insult
identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.
File descriptions
train.csv - the training set; contains comments with their binary labels
test.csv - the test set; you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
sample_submission.csv - a sample submission file in the correct format
test_labels.csv - labels for the test data; a value of -1 indicates the row was not used for scoring (note: this file was added after the competition closed)

Usage
The dataset is released under CC0, with the underlying comment text governed by Wikipedia's CC BY-SA 3.0 license.
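As a concrete illustration of the task, here is a minimal baseline sketch: TF-IDF features feeding one logistic regression per label, writing the per-comment probabilities the submission format expects, then scored offline against test_labels.csv with the non-scored (-1) rows dropped. The column layout (id, comment_text, and the six label columns) and the mean column-wise ROC AUC metric follow the competition's conventions, but treat them as assumptions to verify against your local copies of the files.

```python
# Minimal baseline: shared TF-IDF features, one logistic regression per
# toxicity label. Column names (id, comment_text, six label columns) are
# assumed from the file descriptions above; verify against your copies.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Word-level TF-IDF shared by all six per-label classifiers.
vec = TfidfVectorizer(max_features=50_000, sublinear_tf=True)
X_train = vec.fit_transform(train["comment_text"])
X_test = vec.transform(test["comment_text"])

submission = pd.DataFrame({"id": test["id"]})
for label in LABELS:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train[label])
    submission[label] = clf.predict_proba(X_test)[:, 1]  # P(label = 1)

submission.to_csv("submission.csv", index=False)

# Offline scoring against the post-close test_labels.csv: rows marked -1
# were excluded from official scoring, so drop them before computing the
# mean column-wise ROC AUC (the competition's metric).
truth = pd.read_csv("test_labels.csv")
merged = submission.merge(truth, on="id", suffixes=("_pred", ""))
scored = merged[LABELS].ne(-1).all(axis=1)
aucs = [roc_auc_score(merged.loc[scored, label], merged.loc[scored, f"{label}_pred"])
        for label in LABELS]
print(f"mean column-wise ROC AUC: {sum(aucs) / len(aucs):.4f}")
```

One independent classifier per label fits the problem because the six labels are not mutually exclusive: a single comment can be toxic, obscene, and insulting at once.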
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Jigsaw Toxic Comments
Dataset Description
The Jigsaw Toxic Comments dataset is a benchmark dataset created for the Toxic Comment Classification Challenge on Kaggle. It is designed to help develop machine learning models that can identify and classify toxic online comments across multiple categories of toxicity.
Curated by: Jigsaw (a technology incubator within Alphabet Inc.)
Shared by: Kaggle
Language(s) (NLP): English
License: CC0 1.0

See the full description on the dataset page: https://huggingface.co/datasets/anitamaxvim/jigsaw-toxic-comments
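For the mirrored copy on the Hugging Face Hub, a loading sketch follows. The repo id comes from the URL above; the available splits and feature names are not confirmed by this card excerpt, so inspect them before relying on specific names.

```python
# Load the mirrored dataset from the Hugging Face Hub.
# The repo id is taken from the dataset-card URL above; the splits and
# feature names printed below are assumptions to verify, not guarantees.
from datasets import load_dataset

ds = load_dataset("anitamaxvim/jigsaw-toxic-comments")
print(ds)  # inspect splits and features before relying on specific names
```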
This dataset was created by Chenghong Hu
This dataset was created by Rahul Jain
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a collection of datasets from different sources related to the automatic detection of cyberbullying. The data come from several platforms, including Kaggle, Twitter, Wikipedia Talk pages, and YouTube. Each record contains text labeled as bullying or not, and the collection covers different types of cyberbullying, including hate speech, aggression, insults, and toxicity.
Elsafoury, Fatma (2020), “Cyberbullying datasets”, Mendeley Data, V1, doi: 10.17632/jf4pzyvnpj.1