https://choosealicense.com/licenses/undefined/
HateBR is the first large-scale expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The HateBR corpus was collected from Brazilian Instagram comments on politicians' accounts and manually annotated by specialists. It is composed of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, achieving high inter-annotator agreement. Furthermore, baseline experiments were implemented, reaching an F1-score of 85% and outperforming the current literature models for the Portuguese language. Accordingly, we hope that the proposed expertly annotated corpus may foster research on hate speech and offensive language detection in the Natural Language Processing area.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dataset published in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer".
This is Dataset v1:
The original dataset includes 100,000 tweets in English. Annotations with more than 60% agreement are included. Fields:
TweetID - tweet ID from the Twitter API
LangID - 1 (English)
TopicID - domain of the topic: 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports
HateLabel - final hate label decision: 0-Normal, 1-Offensive, 2-Hate
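The integer codes above can be decoded with a small helper. The column names follow the description; the actual file layout on the dataset page may differ:

```python
# Hypothetical decoder for the v1 annotation scheme described above.
TOPICS = {0: "Religion", 1: "Gender", 2: "Race", 3: "Politics", 4: "Sports"}
LABELS = {0: "Normal", 1: "Offensive", 2: "Hate"}

def decode_row(row: dict) -> dict:
    """Map the integer codes of one record to human-readable values."""
    return {
        "tweet_id": row["TweetID"],
        "topic": TOPICS[row["TopicID"]],
        "label": LABELS[row["HateLabel"]],
    }

print(decode_row({"TweetID": "123", "LangID": 1, "TopicID": 3, "HateLabel": 2}))
# {'tweet_id': '123', 'topic': 'Politics', 'label': 'Hate'}
```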
GitHub Repo:
NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ctoraman/large-scale-hate-speech-v1.
https://creativecommons.org/publicdomain/zero/1.0/
We use a tool called "Instant Data Scraper," a web crawler designed to collect text data efficiently. It gathers information from Twitter pages, pausing 1 to 20 seconds between requests. Once started, the crawler collects everything on the page and saves it to a spreadsheet. Because the raw collection includes unwanted content and potential biases, we carefully filter it, keeping only the relevant parts: usernames and the text of tweets or comments. This process yields a clean dataset of 21,010 entries, each consisting of a username, a tweet, and a label.
Labeling each tweet requires careful attention, so we label manually, using a ternary classification: 1 for hate speech, 2 for offensive speech, and 3 for normal (free) speech.
When labeling a tweet as hate speech, we look for dehumanization, violence, and incitement of others to violence; anything related to sexuality is also treated as hate speech. For offensive speech, we look for negativity, strong language, criticism, potentially offensive or mean language, dismissiveness or trivialization, and statements that sound threatening. Normal speech includes talking about politics, expressing frustration or excitement, and anything that does not spread hate.
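The ternary scheme above can be sketched as a simple tally; the 1/2/3 encoding follows the description:

```python
# Minimal sketch of the ternary labeling scheme described above
# (1 = hate, 2 = offensive, 3 = normal/free speech).
from collections import Counter

LABEL_NAMES = {1: "hate", 2: "offensive", 3: "normal"}

def label_distribution(labels):
    """Return per-class counts for a list of integer labels."""
    counts = Counter(labels)
    return {LABEL_NAMES[k]: counts.get(k, 0) for k in sorted(LABEL_NAMES)}

print(label_distribution([1, 3, 3, 2, 1, 3]))
# {'hate': 2, 'offensive': 1, 'normal': 3}
```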
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Three sets were annotated: a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and one out of context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), each based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_en.
The dataset consists of the following fields:
Video_ID - YouTube ID of the video under which the comment was posted
Comment_ID - YouTube ID of the comment
Text - text of the comment
Type - type of hate speech
Target - the target of hate speech
Annotator - code of the human annotator
In the second quarter of 2024, the number of restored content items that were originally actioned for hate speech on Facebook worldwide amounted to 157,000, up from 148,000 of such restored items in the preceding quarter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research on both hate speech and the darknet has grown significantly in the previous decade. Nonetheless, there is a dearth of empirical research exploring how hate speech manifests within the darknet and which groups it targets. This study seeks to fill this gap in the literature by investigating the different targets of hate speech within the darknet forum Dread and how posts within this forum are affected by hate-motivated events. Through analysis of posts (n = 1,047) 3 months before and after major hate-motivated events, this study finds that approximately 13% (n = 135) of posts in our sample contain hate speech targeting several groups. We also examined the variations in targets between forum-specific subjects (internal) and targets outside of the forum (external). Our findings suggest that there is limited conversation on Dread surrounding hate-motivated events discussed in mainstream media. However, instances of hate speech, predominantly targeting religious, racial, and gender-related groups, are present at a lower percentage than reported in research on hate speech on social media platforms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data project includes a large-scale longitudinal analysis (2015-2020) of online hate speech on Twitter (N = 847,978). A tweet database was generated by collecting tweets through Twitter's Application Programming Interface (API) (v2 full-archive search endpoint, Academic Research product track), which provides access to the historical archive of messages since Twitter was created in 2006. To download the tweets, we first defined the search filter by keyword and geographic zone using the Python programming language and the NLTK, TensorFlow, Keras and NumPy libraries. We established generic words directly related to the topic, taking into account linguistic agreement in Spanish (i.e., gender and number inflections) but excluding adjectives, for instance: migrant, migrants, immigrant, immigrants, refugee and refugees (both in masculine and feminine forms in Spanish), asylum seeker, asylum seekers (the keywords are available as supplementary materials here). For hate speech detection in tweets, we used as a basis a tool created and validated by Vrysis et al. (2021). For this research, the tool was retrained with supervised dictionary-based term detection and with an unsupervised approach (machine learning with neural networks), using a corpus of 90,977 short messages, of which 15,761 were in Greek (5,848 with hate toward immigrants), 46,012 in Spanish (11,117 with hate toward immigrants) and 29,204 in Italian (5,848 with hate toward immigrants).
This corpus comes from two sources. The first is the import of messages already classified in other databases (n = 57,328, of which 5,362 are generic messages in Greek, 23,787 are generic messages and 9,727 are messages with hate toward immigrants in Spanish, and 18,452 are generic messages in Italian). The second is messages manually coded by trained local analysts (in Spain, Greece and Italy), using at least two coders with total agreement between them (the level of agreement in the tests was 94%), dismissing those without 100% intercoder agreement (n = 33,649, of which 6,040 are messages about immigration without hate and 4,359 are messages with hate toward immigrants in Greek; 11,108 are messages about immigration without hate and 1,390 are messages with hate toward immigrants in Spanish; and 4,904 are messages about immigration without hate and 5,848 are messages with hate toward immigrants in Italian). The corpus was divided into 80% training and 20% test. In the models, embeddings were used for language representation and Recurrent Neural Networks (RNN) for supervised text classification. Specifically, the embeddings were created over the 1,000 most repeated words with 8 dimensions (first input layer), followed by two hidden Gated Recurrent Unit (GRU) layers with 64 neurons each and a dense output layer with one neuron and softmax activation (the model is compiled with the Adam optimizer and the Sparse Categorical Crossentropy loss).
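A minimal tf.keras sketch of the described architecture follows. The vocabulary size (1,000), embedding dimension (8) and the two 64-unit GRU layers come from the text; the output layer is an assumption, since a one-neuron softmax as described would emit a constant, so a one-neuron sigmoid with binary cross-entropy, its conventional binary equivalent, is used here:

```python
# Sketch of the RNN classifier described above (assumptions noted in the lead-in).
import tensorflow as tf

model = tf.keras.Sequential([
    # Embeddings over the 1,000 most repeated words, 8 dimensions each.
    tf.keras.layers.Embedding(input_dim=1000, output_dim=8),
    # Two hidden GRU layers with 64 neurons each.
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.GRU(64),
    # One output neuron; sigmoid instead of the described softmax (see above).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

With an 80/20 train/test split as described, `model.fit` on the tokenized corpus and `model.evaluate` on the held-out test portion would complete the pipeline.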
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Institute For the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset:
The ISCA project has compiled this dataset using an annotation portal, which was used to label tweets as either antisemitic or non-antisemitic, among other labels. Please note that the annotation was done with live data, including images and the context, such as threads. The original data was sourced from annotationportal.com.
# Content:
This dataset contains 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021. The dataset is drawn from representative samples during this period with relevant keywords. 1,250 tweets (18%) meet the IHRA definition of antisemitic messages.
The distribution of messages by year is as follows: 1,499 (22%) from 2019, 3,716 (54%) from 2020, and 1,726 (25%) from 2021. 4,605 (66%) contain the keyword "Jews," 1,524 (22%) include "Israel," 529 (8%) feature the derogatory term "ZioNazi*," and 283 (4%) use the slur "K---s." Some tweets may contain multiple keywords.
483 out of the 4,605 tweets with the keyword "Jews" (11%) and 203 out of the 1,524 tweets with the keyword "Israel" (13%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its usage. In contrast, the majority of tweets with the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.
File Description:
The dataset is provided in CSV format, with each row representing a single tweet, including replies, quotes, and retweets. The file contains the following columns:
‘TweetID’: the tweet ID.
‘Username’: the username that published the tweet.
‘Text’: the full text of the tweet.
‘CreateDate’: the date the tweet was created.
‘Biased’: whether our annotators labeled the tweet as antisemitic or non-antisemitic.
‘Keyword’: the keyword used in the query. The keyword can appear in the text, including mentioned names, or in the username.
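As a minimal illustration, the columns above can be read with Python's csv module. The sample row and the 0/1 encoding of the ‘Biased’ column are assumptions for illustration; the actual file may encode labels differently:

```python
# Hypothetical loader for the column layout described above.
import csv
import io

sample = io.StringIO(
    "TweetID,Username,Text,CreateDate,Biased,Keyword\n"
    "1,user1,example tweet text,2020-05-01,0,Jews\n"
)
rows = list(csv.DictReader(sample))
# Filter rows flagged as antisemitic (assumed to be encoded as "1").
antisemitic = [r for r in rows if r["Biased"] == "1"]
print(len(rows), len(antisemitic))  # 1 0
```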
Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
R code is published under the terms of the "MIT" licence (https://opensource.org/licenses/MIT)
Acknowledgements
We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algerian Dialect Dataset Targeted Hate Speech, Offensive Language and Cyberbullying.
To cite this dataset, refer to: Mazari, A. C., & Kheddar, H. (2023). "Deep Learning-based Analysis of Algerian Dialect Dataset Targeted Hate Speech, Offensive Language and Cyberbullying." IJCDS, 13(1). http://dx.doi.org/10.12785/ijcds/130177
Due to the nature of this dataset, comments contain offensive and hateful language. This does not reflect the authors' values; rather, the aim is to provide a resource to help detect and prevent the spread of such harmful content.
Features
Algerian Dialect
Cyberbullying
Hate speech
Offensive Language
Dialect Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Library of hate expressions detected in digital news media in Spain, a result of the "Hatemedia" project (project PID2020-114584GB-I00), funded by the State Research Agency - Ministry of Science and Innovation.
The library contains the 7,210 most frequently repeated simple and compound lemmas that, from a semantic point of view, tend toward hate in Spanish digital news media. Producing this final document required the following phases:
LABELING OF EXPRESSIONS AND LEMMA EXTRACTION: In the first phase, a total of 476,753 messages associated with digital news media in Spain were reviewed; approximately 4.5% contained expressions tending toward hate. From these messages, stop-words were removed, anomalous data (tokens not belonging to a known language, or diminutives thereof) were identified, and the messages were manually reviewed to identify both simple and compound lemmas tending toward hate. IDENTIFICATION OF DUPLICATES: The first phase produced two lists, one of simple lemmas and one of compound lemmas. These lists were filtered for repeated lemmas, yielding two libraries in which each lemma appears exactly once. DATABASE INTEGRATION: In the third phase, the two libraries were merged into a final library integrating all lemmas, both simple and compound. A final filtering pass ensured that no lemma is repeated.
Authors: - Elias Said-Hung, Max Römer Pieretti, Julio Montero-Díaz, Alberto De Lucas, Javier Martínez Torres.
Supported by: - POSIBLE S.L.
For more information: - https://www.hatemedia.es/, or contact elias.said@unir.net
In the third quarter of 2023, hate speech content on Meta's Facebook had a prevalence rate of 0.02 percent. For every 10,000 content views on the social media platform, about two pieces of content would contain hate speech. Overall, the prevalence of what Facebook considers to be hate speech has remained steady since the first quarter of 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 3,000 labelled comments and posts scraped from the Reddit, Twitter, and 4chan social media websites in 2022. In this dataset, 2,400 comments are labelled as non-hateful ('0') and 600 comments are labelled as hateful ('1'), making an even 80/20 split.
This dataset's primary purpose is for use in machine-learning classification of hateful speech in the online sphere.
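The stated class balance can be checked with quick arithmetic:

```python
# Sanity check of the 80/20 split described above.
non_hateful, hateful = 2400, 600
total = non_hateful + hateful
assert total == 3000
print(non_hateful / total, hateful / total)  # 0.8 0.2
```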
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HateBR dataset was collected from the comment section of Brazilian politicians’ accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
******* percent of Poles experienced hate speech online in 2021, with men and 18- to 24-year-olds most likely to be affected.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset card for Measuring Hate Speech
This is a public release of the dataset described in Kennedy et al. (2020) and Sachdeva et al. (2022), consisting of 39,565 comments annotated by 7,912 annotators, for 135,556 combined rows. The primary outcome variable is the "hate speech score" but the 10 constituent ordinal labels (sentiment, (dis)respect, insult, humiliation, inferior status, violence, dehumanization, genocide, attack/defense, hate speech benchmark) can also be treated as… See the full description on the dataset page: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech.
With the rise of social media, a rise in hateful content can be observed. Even though understandings and definitions of hate speech vary, platforms, communities, and legislatures all acknowledge the problem. At the same time, adolescents are a new and active group of social media users, and the majority of adolescents experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes a modern, anonymized hate speech youth language dataset consisting of 88,395 annotated chat messages. The dataset consists of publicly available online messages from the chat platform Discord. Approximately 6.42% of the messages were classified as hate speech by a self-developed annotation schema. For 35,553 messages, the user profiles provided age annotations, setting the average author age to under 20 years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was built from Twitter and contains tweets based on common Nigerian hate words and stereotypes. It features 20,176 tweets classified into a binary class: Hate Speech, with polarity 1, containing 4,801 tweets, and Non-Hate Speech, with polarity 0, containing 15,375 tweets. The dataset is presented in CSV format, arranged in the following columns: 1. ʽId’, the serial number of the tweet; 2. ʽTweets’, the content of the tweet; and 3. 'Polarity', the label of the tweet (1 for Hate Speech, 0 for Non-Hate Speech).
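The stated class sizes can be sanity-checked against the corpus total:

```python
# Class counts as reported in the description.
hate, non_hate = 4801, 15375
total = hate + non_hate
assert total == 20176
# Share of hate-speech tweets, in percent.
print(round(100 * hate / total, 1))  # 23.8
```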
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present an Italian YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the Italian YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Two sets were annotated: a training set with 59,870 comments (IMSyPP_IT_YouTube_comments_train.csv) and an evaluation set with 10,536 comments (IMSyPP_IT_YouTube_comments_evaluation.csv). The dataset was annotated by 8 annotators with each comment being annotated by two annotators. It was used to train a classification model for hate speech types detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_it.
The dataset consists of the following fields:
ID_Commento - YouTube ID of the comment
ID_Video - YouTube ID of the video under which the comment was posted
Testo - text of the comment
Tipo - type of hate speech
Target - the target of hate speech
Additionally, we have included the Italian YouTube data (SR_YT_comments.csv) which was collected in the same period as the training data and was annotated using the aforementioned model. The automatically labeled data was used to analyze the relationship between hate speech and misinformation on Italian YouTube. The results of this analysis are presented in the associated paper.
The analyzed data are represented with the following fields:
ID_Commento - YouTube ID of the comment
Label - label automatically assigned by the model
is_questionable - the type of channel the comment was collected from; channels are categorized as spreading either reliable or questionable information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SHAJ is an annotated Albanian dataset for hate speech and offensive speech that has been constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval. Paper: https://arxiv.org/abs/2107.13592