https://choosealicense.com/licenses/undefined/
HateBR is the first large-scale expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The HateBR corpus was collected from Brazilian Instagram comments on politicians' accounts and manually annotated by specialists. It is composed of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, achieving high inter-annotator agreement. Furthermore, baseline experiments were implemented, reaching an F1-score of 85% and outperforming the current literature models for the Portuguese language. Accordingly, we hope that the proposed expertly annotated corpus may foster research on hate speech and offensive language detection in the Natural Language Processing area.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dataset published in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer".
This is Dataset v1:
The original dataset includes 100,000 tweets in English. Annotations with more than 60% agreement are included. Fields:
TweetID - tweet ID from the Twitter API
LangID - 1 (English)
TopicID - domain of the topic: 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports
HateLabel - final hate label decision: 0-Normal, 1-Offensive, 2-Hate
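The integer codes above can be decoded with a small helper. The column names follow the description; the actual file layout on the dataset page may differ:

```python
# Hypothetical decoder for the v1 annotation scheme described above.
TOPICS = {0: "Religion", 1: "Gender", 2: "Race", 3: "Politics", 4: "Sports"}
LABELS = {0: "Normal", 1: "Offensive", 2: "Hate"}

def decode_row(row: dict) -> dict:
    """Map the integer codes of one record to human-readable values."""
    return {
        "tweet_id": row["TweetID"],
        "topic": TOPICS[row["TopicID"]],
        "label": LABELS[row["HateLabel"]],
    }

print(decode_row({"TweetID": "123", "LangID": 1, "TopicID": 3, "HateLabel": 2}))
# {'tweet_id': '123', 'topic': 'Politics', 'label': 'Hate'}
```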
GitHub Repo:
NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ctoraman/large-scale-hate-speech-v1.
https://creativecommons.org/publicdomain/zero/1.0/
We use a tool called "Instant Data Scraper," a web crawler designed to collect text data efficiently. It gathers information from Twitter pages, pausing 1 to 20 seconds between requests. Once started, the crawler collects everything on the page and saves it to a spreadsheet. Because the raw collection includes unwanted content and potential biases, we carefully filter it, keeping only the relevant parts: usernames and the text of tweets or comments. This process yields a clean dataset of 21,010 entries, each consisting of a username, a tweet, and a label.
Labeling each tweet requires careful attention, so we label manually, using a ternary classification: 1 for hate speech, 2 for offensive speech, and 3 for normal (free) speech.
When labeling a tweet as hate speech, we look for dehumanization, violence, and incitement of others to violence; anything related to sexuality is also treated as hate speech. For offensive speech, we look for negativity, strong language, criticism, potentially offensive or mean language, dismissiveness or trivialization, and statements that sound threatening. Normal speech includes talking about politics, expressing frustration or excitement, and anything that does not spread hate.
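The ternary scheme above can be sketched as a simple tally; the 1/2/3 encoding follows the description:

```python
# Minimal sketch of the ternary labeling scheme described above
# (1 = hate, 2 = offensive, 3 = normal/free speech).
from collections import Counter

LABEL_NAMES = {1: "hate", 2: "offensive", 3: "normal"}

def label_distribution(labels):
    """Return per-class counts for a list of integer labels."""
    counts = Counter(labels)
    return {LABEL_NAMES[k]: counts.get(k, 0) for k in sorted(LABEL_NAMES)}

print(label_distribution([1, 3, 3, 2, 1, 3]))
# {'hate': 2, 'offensive': 1, 'normal': 3}
```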
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Three sets were annotated: a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and one out of context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), each based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_en.
The dataset consists of the following fields:
Video_ID - YouTube ID of the video under which the comment was posted
Comment_ID - YouTube ID of the comment
Text - text of the comment
Type - type of hate speech
Target - the target of hate speech
Annotator - code of the human annotator
In the second quarter of 2024, the number of restored content items that were originally actioned for hate speech on Facebook worldwide amounted to 157,000, up from 148,000 of such restored items in the preceding quarter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research on both hate speech and the darknet has grown significantly in the previous decade. Nonetheless, there is a dearth of empirical research exploring how hate speech manifests within the darknet and which groups it targets. This study seeks to fill this gap in the literature by investigating the different targets of hate speech within the darknet forum Dread and how posts within this forum are affected by hate-motivated events. Through analysis of posts (n = 1,047) 3 months before and after major hate-motivated events, this study finds that approximately 13% (n = 135) of posts in our sample contain hate speech targeting several groups. We also examined the variations in targets between forum-specific subjects (internal) and targets outside of the forum (external). Our findings suggest that there is limited conversation on Dread surrounding hate-motivated events discussed in mainstream media. However, instances of hate speech, predominantly targeting religious, racial, and gender-related groups, are present at a lower percentage than reported in research on hate speech on social media platforms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data project includes a large-scale longitudinal analysis (2015-2020) of online hate speech on Twitter (N = 847,978). A tweet database was generated by collecting tweets through Twitter's Application Programming Interface (API) (v2 full-archive search endpoint, Academic Research product track), which provides access to the historical archive of messages since Twitter was created in 2006. To download the tweets, we first defined the search filter by keyword and geographic zone using the Python programming language and the NLTK, TensorFlow, Keras and NumPy libraries. We established generic words directly related to the topic, taking into account linguistic agreement in Spanish (i.e., gender and number inflections) but excluding adjectives, for instance: migrant, migrants, immigrant, immigrants, refugee and refugees (both in masculine and feminine forms in Spanish), asylum seeker, asylum seekers (the keywords are available as supplementary materials here). For hate speech detection in tweets, we used as a basis a tool created and validated by Vrysis et al. (2021). For this research, the tool was retrained with supervised dictionary-based term detection and with an unsupervised approach (machine learning with neural networks), using a corpus of 90,977 short messages, of which 15,761 were in Greek (5,848 with hate toward immigrants), 46,012 in Spanish (11,117 with hate toward immigrants) and 29,204 in Italian (5,848 with hate toward immigrants).
This corpus comes from two sources. The first is the import of messages already classified in other databases (n = 57,328, of which 5,362 are generic messages in Greek, 23,787 are generic messages and 9,727 are messages with hate toward immigrants in Spanish, and 18,452 are generic messages in Italian). The second is messages manually coded by trained local analysts (in Spain, Greece and Italy), using at least two coders with total agreement between them (the level of agreement in the tests was 94%), dismissing those without 100% intercoder agreement (n = 33,649, of which 6,040 are messages about immigration without hate and 4,359 are messages with hate toward immigrants in Greek; 11,108 are messages about immigration without hate and 1,390 are messages with hate toward immigrants in Spanish; and 4,904 are messages about immigration without hate and 5,848 are messages with hate toward immigrants in Italian). The corpus was divided into 80% training and 20% test. In the models, embeddings were used for language representation and Recurrent Neural Networks (RNN) for supervised text classification. Specifically, the embeddings were created over the 1,000 most repeated words with 8 dimensions (first input layer), followed by two hidden Gated Recurrent Unit (GRU) layers with 64 neurons each and a dense output layer with one neuron and softmax activation (the model is compiled with the Adam optimizer and the Sparse Categorical Crossentropy loss).
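A minimal tf.keras sketch of the described architecture follows. The vocabulary size (1,000), embedding dimension (8) and the two 64-unit GRU layers come from the text; the output layer is an assumption, since a one-neuron softmax as described would emit a constant, so a one-neuron sigmoid with binary cross-entropy, its conventional binary equivalent, is used here:

```python
# Sketch of the RNN classifier described above (assumptions noted in the lead-in).
import tensorflow as tf

model = tf.keras.Sequential([
    # Embeddings over the 1,000 most repeated words, 8 dimensions each.
    tf.keras.layers.Embedding(input_dim=1000, output_dim=8),
    # Two hidden GRU layers with 64 neurons each.
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.GRU(64),
    # One output neuron; sigmoid instead of the described softmax (see above).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

With an 80/20 train/test split as described, `model.fit` on the tokenized corpus and `model.evaluate` on the held-out test portion would complete the pipeline.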
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Institute For the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset:
The ISCA project has compiled this dataset using an annotation portal, which was used to label tweets as either antisemitic or non-antisemitic, among other labels. Please note that the annotation was done with live data, including images and the context, such as threads. The original data was sourced from annotationportal.com.
# Content:
This dataset contains 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021. The dataset is drawn from representative samples during this period with relevant keywords. 1,250 tweets (18%) meet the IHRA definition of antisemitic messages.
The distribution of messages by year is as follows: 1,499 (22%) from 2019, 3,716 (54%) from 2020, and 1,726 (25%) from 2021. 4,605 (66%) contain the keyword "Jews," 1,524 (22%) include "Israel," 529 (8%) feature the derogatory term "ZioNazi*," and 283 (4%) use the slur "K---s." Some tweets may contain multiple keywords.
483 out of the 4,605 tweets with the keyword "Jews" (11%) and 203 out of the 1,524 tweets with the keyword "Israel" (13%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its usage. In contrast, the majority of tweets with the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.
File Description:
The dataset is provided in CSV format, with each row representing a single tweet, including replies, quotes, and retweets. The file contains the following columns:
‘TweetID’: the tweet ID.
‘Username’: the username that published the tweet.
‘Text’: the full text of the tweet.
‘CreateDate’: the date the tweet was created.
‘Biased’: whether our annotators labeled the tweet as antisemitic or non-antisemitic.
‘Keyword’: the keyword used in the query. The keyword can appear in the text, including mentioned names, or in the username.
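As a minimal illustration, the columns above can be read with Python's csv module. The sample row and the 0/1 encoding of the ‘Biased’ column are assumptions for illustration; the actual file may encode labels differently:

```python
# Hypothetical loader for the column layout described above.
import csv
import io

sample = io.StringIO(
    "TweetID,Username,Text,CreateDate,Biased,Keyword\n"
    "1,user1,example tweet text,2020-05-01,0,Jews\n"
)
rows = list(csv.DictReader(sample))
# Filter rows flagged as antisemitic (assumed to be encoded as "1").
antisemitic = [r for r in rows if r["Biased"] == "1"]
print(len(rows), len(antisemitic))  # 1 0
```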
Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
R code is published under the terms of the "MIT" licence (https://opensource.org/licenses/MIT)
Acknowledgements
We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algerian Dialect Dataset Targeted Hate Speech, Offensive Language and Cyberbullying.
To cite this dataset, refer to: Mazari, A. C., & Kheddar, H. (2023). "Deep Learning-based Analysis of Algerian Dialect Dataset Targeted Hate Speech, Offensive Language and Cyberbullying." IJCDS, 13(1). http://dx.doi.org/10.12785/ijcds/130177
Due to the nature of this dataset, comments contain offensive and hateful language. This does not reflect the authors' values; rather, the aim is to provide a resource to help detect and prevent the spread of such harmful content.
Features
Algerian Dialect
Cyberbullying
Hate speech
Offensive Language
Dialect Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Library of hate expressions detected in digital news media in Spain, a result of the "Hatemedia" project (project PID2020-114584GB-I00), funded by the State Research Agency - Ministry of Science and Innovation.
The library contains the 7,210 most frequently repeated simple and compound lemmas that, from a semantic point of view, tend toward hate in Spanish digital news media. Producing this final document required the following phases:
LABELING OF EXPRESSIONS AND LEMMA EXTRACTION: In the first phase, a total of 476,753 messages associated with digital news media in Spain were reviewed; approximately 4.5% contained expressions tending toward hate. From these messages, stop-words were removed, anomalous data (tokens not belonging to a known language, or diminutives thereof) were identified, and the messages were manually reviewed to identify both simple and compound lemmas tending toward hate. IDENTIFICATION OF DUPLICATES: The first phase produced two lists, one of simple lemmas and one of compound lemmas. These lists were filtered for repeated lemmas, yielding two libraries in which each lemma appears exactly once. DATABASE INTEGRATION: In the third phase, the two libraries were merged into a final library integrating all lemmas, both simple and compound. A final filtering pass ensured that no lemma is repeated.
Authors: - Elias Said-Hung, Max Römer Pieretti, Julio Montero-Díaz, Alberto De Lucas, Javier Martínez Torres.
Supported by: - POSIBLE S.L.
For more information: - https://www.hatemedia.es/, or contact elias.said@unir.net
In the third quarter of 2023, hate speech content on Meta's Facebook had a prevalence rate of 0.02 percent. For every 10,000 content views on the social media platform, about two pieces of content would contain hate speech. Overall, the prevalence of what Facebook considers to be hate speech has remained steady since the first quarter of 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 3,000 labelled comments and posts scraped from the Reddit, Twitter, and 4chan social media websites in 2022. In this dataset, 2,400 comments are labelled as non-hateful ('0') and 600 comments are labelled as hateful ('1'), making an even 80/20 split.
This dataset's primary purpose is for use in machine-learning classification of hateful speech in the online sphere.
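The stated class balance can be checked with quick arithmetic:

```python
# Sanity check of the 80/20 split described above.
non_hateful, hateful = 2400, 600
total = non_hateful + hateful
assert total == 3000
print(non_hateful / total, hateful / total)  # 0.8 0.2
```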
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HateBR dataset was collected from the comment section of Brazilian politicians’ accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
******* percent of Poles experienced hate speech online in 2021, with men and 18- to 24-year-olds most likely to be affected.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset card for Measuring Hate Speech
This is a public release of the dataset described in Kennedy et al. (2020) and Sachdeva et al. (2022), consisting of 39,565 comments annotated by 7,912 annotators, for 135,556 combined rows. The primary outcome variable is the "hate speech score" but the 10 constituent ordinal labels (sentiment, (dis)respect, insult, humiliation, inferior status, violence, dehumanization, genocide, attack/defense, hate speech benchmark) can also be treated as… See the full description on the dataset page: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech.
With the rise of social media, a rise in hateful content can be observed. Even though understandings and definitions of hate speech vary, platforms, communities, and legislatures all acknowledge the problem. At the same time, adolescents are a new and active group of social media users, and the majority of adolescents experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes a modern, anonymized hate speech youth language dataset consisting of 88,395 annotated chat messages. The dataset consists of publicly available online messages from the chat platform Discord. Approximately 6.42% of the messages were classified as hate speech by a self-developed annotation schema. For 35,553 messages, the user profiles provided age annotations, setting the average author age to under 20 years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was built from Twitter and contains tweets based on common Nigerian hate words and stereotypes. It features 20,176 tweets classified into a binary class: Hate Speech, with polarity 1, containing 4,801 tweets, and Non-Hate Speech, with polarity 0, containing 15,375 tweets. The dataset is presented in CSV format, arranged in the following columns: 1. ʽId’, the serial number of the tweet; 2. ʽTweets’, the content of the tweet; and 3. 'Polarity', the label of the tweet (1 for Hate Speech, 0 for Non-Hate Speech).
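The stated class sizes can be sanity-checked against the corpus total:

```python
# Class counts as reported in the description.
hate, non_hate = 4801, 15375
total = hate + non_hate
assert total == 20176
# Share of hate-speech tweets, in percent.
print(round(100 * hate / total, 1))  # 23.8
```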
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present an Italian YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the Italian YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Two sets were annotated: a training set with 59,870 comments (IMSyPP_IT_YouTube_comments_train.csv) and an evaluation set with 10,536 comments (IMSyPP_IT_YouTube_comments_evaluation.csv). The dataset was annotated by 8 annotators with each comment being annotated by two annotators. It was used to train a classification model for hate speech types detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_it.
The dataset consists of the following fields:
ID_Commento - YouTube ID of the comment
ID_Video - YouTube ID of the video under which the comment was posted
Testo - text of the comment
Tipo - type of hate speech
Target - the target of hate speech
Additionally, we have included the Italian YouTube data (SR_YT_comments.csv) which was collected in the same period as the training data and was annotated using the aforementioned model. The automatically labeled data was used to analyze the relationship between hate speech and misinformation on Italian YouTube. The results of this analysis are presented in the associated paper.
The analyzed data are represented with the following fields:
ID_Commento - YouTube ID of the comment
Label - label automatically assigned by the model
is_questionable - the type of channel the comment was collected from; channels are categorized as spreading either reliable or questionable information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SHAJ is an annotated Albanian dataset for hate speech and offensive speech that has been constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval. Paper: https://arxiv.org/abs/2107.13592