Business jargon.
The pandemic might have brought office life to a shuddering halt across much of the world, but it hasn't allowed workers to breathe a sigh of relief and escape arguably one of the most annoying aspects of office culture: business jargon.
This infographic focuses on the 15 worst offenders, with the survey's respondents hating the term "Synergy" the most. It was followed by the seemingly innocuous "Teamwork" and the possibly more irritating "Touch base".
Niall McCarthy - Data Journalist
Photo by Courtney Nuss on Unsplash
The Office sitcom television series.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**About this Data:** Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. Detecting hate speech on social media according to current trends requires an extensive dataset containing emoticons, emojis, hashtags, slang, and contractions. This dataset contains hate speech sentences in English, divided into two classes: one representing hateful content and the other representing non-hateful content.
| Specifications table | |
|---|---|
| Subject | Natural Language Processing (NLP) |
| Specific subject area | A curated dataset comprising emojis, emoticons, and contractions bundled into two classes, hateful and non-hateful, to detect hate speech in text. |
| Type of data | Text |
| Data format | Annotated, Analysed, Filtered Data |
| Data Article | A curated dataset for hate speech detection on social media text |
| Data source location | https://data.mendeley.com/datasets/9sxpkmm8xn/1 |
**Value of this Data:**
1. This dataset is useful for training machine learning models to identify hate speech in social media text. It reflects current social media trends and the modern ways of writing hateful text using emojis, emoticons, or slang. It will help social media managers, administrators, or companies develop automatic systems that filter out hateful content on social media by categorizing a text as hateful or non-hateful speech.
2. Deep Learning (DL) and Natural Language Processing (NLP) practitioners can be the target beneficiaries, as this dataset can be used for detecting hateful speech through DL and NLP techniques. Here the samples are composed of text sentences and labels belonging to two categories: "0" for non-hateful and "1" for hateful.
3. Additionally, this dataset can be used as a benchmark dataset for hate speech detection.
4. The dataset is neutralized in such a way that it can be used by anyone: it does not include any entities or names that could cause harm (including cyber harm) to the users who generated the content. Researchers can take advantage of the pre-processed dataset for their projects, as it maintains and follows the policy guidelines.
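Since the dataset emphasizes contractions, hashtags, and slang, a preprocessing step is usually applied before training. The sketch below is a minimal illustration, not the dataset authors' actual pipeline; the contraction table and rules are invented for demonstration.

```python
import re

# Hypothetical, minimal contraction table; a real pipeline would use a
# much larger mapping or a dedicated library.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}

def normalize(text: str) -> str:
    """Lowercase, expand a few contractions, and strip hashtag markers."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"#(\w+)", r"\1", text)   # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()

print(normalize("I'm sure you don't mean that #JustSaying"))
# → i am sure you do not mean that justsaying
```

Normalized text like this can then be tokenized and paired with the 0/1 labels for model training.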
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.
General
This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset includes sentences with binary bias labels (processed: biased or not biased) as well as the single-player annotations used for the majority vote. It includes all game-collected data. All data is completely anonymous: it is not possible to identify individuals, and the dataset neither identifies sub-populations nor contains information that could be considered sensitive to them.
Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.
Description of the Data Files
This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:
ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.
AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), whether the game label and expert label match (Game VS Expert), whether differing labels are false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), whether the Expert and BABE labels match (Expert VS BABE), and whether the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).
demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
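A typical first analysis on ExportNewsNinja.csv is to measure how often the players' majority vote agrees with the BABE ground truth. The sketch below uses the column names described above (sentence, ground_Truth, majority_vote) but invented toy rows, since the real file is not reproduced here.

```python
import csv
import io

# Toy stand-in for ExportNewsNinja.csv; the label values are invented
# for illustration only.
sample = """sentence,ground_Truth,majority_vote
"Critics slammed the reckless plan.",biased,biased
"The council met on Tuesday.",not biased,not biased
"The senator's absurd scheme failed.",biased,not biased
"""

def agreement_rate(rows, col_a, col_b):
    """Fraction of rows on which two label columns agree."""
    rows = list(rows)
    hits = sum(1 for r in rows if r[col_a] == r[col_b])
    return hits / len(rows)

rows = csv.DictReader(io.StringIO(sample))
print(round(agreement_rate(rows, "ground_Truth", "majority_vote"), 2))  # → 0.67
```

The same function applies to the AnalysisNewsNinja.xlsx comparisons (Game VS Expert, Expert VS BABE) once the sheet is exported to CSV.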
Collection Process
Data was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period ran from 20 February 2023 to 28 February 2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.
The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Hate speech has become a major social issue, damaging our relationships and threatening members of our society. However, most Korean hate speech datasets contain very few samples related to LGBT people (and other minorities).
This dataset contains a contents column and a label column, where 1 labels hate speech and 0 labels negative samples (non-hate-speech comments).
This dataset has NOT been cross-validated by several researchers. It will be updated with further validation.
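Given the described schema, a quick sanity check is to count the class balance between hate (1) and non-hate (0) rows. The rows below are invented placeholders mirroring the contents/label columns.

```python
from collections import Counter

# Hypothetical rows mirroring the described schema: a "contents" column
# and a "label" column (1 = hate speech, 0 = not hate speech).
rows = [
    {"contents": "comment a", "label": 1},
    {"contents": "comment b", "label": 0},
    {"contents": "comment c", "label": 0},
]

counts = Counter(r["label"] for r in rows)
print(counts[1], counts[0])  # hate vs. non-hate counts → 1 2
```

Checking this balance matters here because minority-related samples are explicitly scarce in existing Korean datasets.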
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hate Crime Statistics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/2014-hate-crime-statisticse on 13 February 2022.
--- Dataset description provided by original source is as follows ---
The Uniform Crime Reporting Program collects data about both single-bias and multiple-bias hate crimes. For each offense type reported, law enforcement must indicate at least one bias motivation. A single-bias incident is defined as an incident in which one or more offense types are motivated by the same bias. As of 2013, a multiple-bias incident is defined as an incident in which one or more offense types are motivated by two or more biases.

Overview
In 2014, 15,494 law enforcement agencies participated in the Hate Crime Statistics Program. Of these agencies, 1,666 reported 5,479 hate crime incidents involving 6,418 offenses.
There were 5,462 single-bias incidents that involved 6,385 offenses, 6,681 victims, and 5,176 known offenders.
The 17 multiple-bias incidents reported in 2014 involved 33 offenses, 46 victims, and 16 offenders. (See Tables 1 and 12.)

Source: FBI Hate Crime Statistics. More about the Hate Crime Statistics: https://ucr.fbi.gov/about-us/cjis/ucr/hate-crime/2014/resource-pages/download-files
This dataset was created by Uniform Crime Reports and contains around 0 samples along with Unnamed: 13, Unnamed: 3, technical information and other features such as: - Unnamed: 12 - Unnamed: 5 - and more.
- Analyze Unnamed: 14 in relation to Unnamed: 9
- Study the influence of Unnamed: 15 on Unnamed: 4
- More datasets
If you use this dataset in your research, please credit Uniform Crime Reports
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hate speech detection in Arabic poses a complex challenge due to the dialectal diversity across the Arab world. Most existing hate speech datasets for Arabic cover only one dialect or one hate speech category. They also lack balance across dialects, topics, and hate/non-hate classes. In this paper, we address this gap by presenting ADHAR—a comprehensive multi-dialect, multi-category hate speech corpus for Arabic. ADHAR contains 70,369 words and spans five language variants: Modern Standard Arabic (MSA), Egyptian, Levantine, Gulf, and Maghrebi. It covers four key hate speech categories: nationality, religion, ethnicity, and race. A major contribution is that ADHAR is carefully curated to maintain balance across dialects, categories, and hate/non-hate classes to enable unbiased dataset evaluation. We describe the systematic data collection methodology, followed by a rigorous annotation process involving multiple annotators per dialect. Extensive qualitative and quantitative analyses demonstrate the quality and usefulness of ADHAR. Our experiments with various classical and deep learning models demonstrate that our dataset enables the development of robust hate speech classifiers for Arabic, achieving accuracy and F1-scores of up to 90% for hate speech detection and up to 92% for category detection. When trained with AraBERT, we achieved an accuracy and F1-score of 94% for hate speech detection, as well as 95% for category detection.
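The balance ADHAR maintains across dialects, categories, and classes can be sketched as downsampling every (dialect, category, label) cell to a common size. This is an illustrative reconstruction of that idea, not the authors' actual curation code; the field names and toy samples are invented.

```python
from collections import defaultdict

# Invented toy samples; real entries would carry the tweet/sentence text too.
samples = [
    {"dialect": "MSA", "category": "religion", "label": 1},
    {"dialect": "MSA", "category": "religion", "label": 1},
    {"dialect": "MSA", "category": "religion", "label": 0},
    {"dialect": "Egyptian", "category": "race", "label": 1},
    {"dialect": "Egyptian", "category": "race", "label": 0},
]

def balance(samples):
    """Downsample every (dialect, category, label) cell to the same size."""
    cells = defaultdict(list)
    for s in samples:
        cells[(s["dialect"], s["category"], s["label"])].append(s)
    n = min(len(v) for v in cells.values())
    return [s for cell in cells.values() for s in cell[:n]]

balanced = balance(samples)
print(len(balanced))  # one sample per occupied cell here → 4
```

Keeping each cell the same size prevents a classifier from exploiting dialect or category frequency as a shortcut for the hate/non-hate decision.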
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Three sets were annotated: a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in-context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and one out-of-context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), each based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at https://huggingface.co/IMSyPP/hate_speech_en.
The dataset consists of the following fields:
- Video_ID - YouTube ID of the video under which the comment was posted
- Comment_ID - YouTube ID of the comment
- Text - text of the comment
- Type - type of hate speech
- Target - the target of hate speech
- Annotator - code of the human annotator
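The six fields listed above map naturally onto a small record type when processing the CSVs. This is a hypothetical sketch; the sample values (and the exact spelling of the label values in the real files) are invented placeholders.

```python
from dataclasses import dataclass

# One annotated YouTube comment, following the field list above.
@dataclass
class Comment:
    video_id: str    # Video_ID
    comment_id: str  # Comment_ID
    text: str        # Text
    type: str        # Type - type of hate speech (placeholder value below)
    target: str      # Target - target of hate speech
    annotator: str   # Annotator - annotator code

c = Comment("vid123", "cmt456", "example comment", "some-type", "some-target", "A03")
print(c.comment_id)  # → cmt456
```

Rows read with `csv.DictReader` can be unpacked into this record before aggregating per-annotator labels.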
It is important for the community to understand what is – and is not – a hate crime. First and foremost, the incident must be a crime. Although that may seem obvious, most speech is not a hate crime, regardless of how offensive it may be. In addition, hate is not itself a crime, but a possible motive for a crime. It can be difficult to establish a motive for a crime. Therefore, the classification as a hate crime is subject to change as an investigation proceeds – even as prosecutors continue an investigation. If a person is found guilty of a hate crime, the court may fine the offender up to 1½ times the maximum fine and imprison him or her for up to 1½ times the maximum term authorized for the underlying crime.

While the District strives to reduce crime for all residents of and visitors to the city, hate crimes can make a particular community feel vulnerable and more fearful. This is unacceptable, and is the reason everyone must work together not just to address allegations of hate crimes, but also to proactively educate the public about hate crimes.

The figures in this data align with DC Official Code 22-3700. Because the DC statute differs from the FBI Uniform Crime Reporting (UCR) and National Incident-Based Reporting System (NIBRS) definitions, these figures may be higher than those reported to the FBI. Each month, an MPD team reviews crimes that have been identified as potentially motivated by hate/bias to determine whether there is sufficient information to support that designation. The data in this document is current through the end of the most recent month. The hate crimes dataset is not an official MPD database of record and may not match details in records pulled from the official Records Management System (RMS). Unknown or blank values in the Targeted Group field may be present prior to 2016 data. As of January 2022, an offense with multiple bias categories is reflected as such. Data is updated on the 15th of every month.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data was collected using the standard Twitter API on Arabic tweets and code-mixed datasets. Collection was carried out over a duration of three months, from April 2023 to June 2023, via a combination of keyword-based, thread-based, and profile-based search approaches. A total of 120 terms, including various versions, were used to identify tweets containing code-mixing concerning regional hate speech. For the thread-based search, we incorporated hashtags related to contentious subjects that are deemed essential markers of hateful speech. Throughout the data-gathering phase, we monitored Twitter trends and designated ten hashtags for information retrieval. Given that hateful tweets are usually less common than regular tweets, we expanded our dataset and improved the representation of the hate class by incorporating the most impactful terms from a lexicon of religious hate terms (Albadi et al., 2018). We gathered exclusively original Arabic tweets for all queries, excluding retweets and non-Arabic tweets. In all, we obtained 200,000 tweets, of which we sampled 35k for annotation.
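The keyword/hashtag filtering with retweet exclusion described above can be sketched as a simple predicate over collected tweets. The keyword set and tweet records below are invented placeholders, not the actual 120-term query set.

```python
# Hypothetical keyword/hashtag list standing in for the real query terms.
KEYWORDS = {"#topic1", "term1", "term2"}

def matches(tweet):
    """Keep original (non-retweet) tweets containing any query term."""
    return (not tweet["is_retweet"]) and any(
        k in tweet["text"].lower() for k in KEYWORDS
    )

tweets = [
    {"text": "Something about term1 here", "is_retweet": False},
    {"text": "RT copy with term1", "is_retweet": True},
    {"text": "Unrelated chatter", "is_retweet": False},
]
print(sum(matches(t) for t in tweets))  # → 1
```

In practice the same predicate would also drop non-Arabic tweets, e.g. via a language-identification step, before sampling for annotation.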
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LREC 2022
For more detail, you can check our paper.
For codes and other information, you can check our official GitHub repo.
abbreviations used:
- HS => hate speech
- NH => not hate speech
- ind => individual
| Task | Task description | # labels | Labels | Nature of classification |
|---|---|---|---|---|
| TaskA | HS detection | 02 | HS, NH | Binary (HS or NH) |
| TaskB | HS target detection | 04 | ind, male, female, group | Multilabel |
| TaskC | HS type detection | 04 | slander, religion, gender, callToViolence | Multilabel |
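For the multilabel tasks (TaskB and TaskC), each comment can carry several labels at once, which is commonly encoded as a 0/1 indicator vector. The sketch below uses the TaskB label set from the table; the encoding scheme itself is an illustrative assumption, not necessarily the authors' exact representation.

```python
# TaskB label set, taken from the table above.
TARGETS = ["ind", "male", "female", "group"]

def encode(labels):
    """Return a binary indicator vector over the TaskB label set."""
    return [1 if t in labels else 0 for t in TARGETS]

print(encode({"female", "ind"}))  # → [1, 0, 1, 0]
```

The same function works for TaskC by swapping in its label set (slander, religion, gender, callToViolence).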
@InProceedings{romim-EtAl:2022:LREC,
author = {Romim, Nauros and Ahmed, Mosahed and Islam, Md Saiful and Sen Sharma, Arnab and Talukder, Hriteshwar and Amin, Mohammad Ruhul},
title = {BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {5153--5162},
abstract = {Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques are found to be feasible and cost-effective to tackle this problem. However, linguistically diverse datasets covering different social contexts in which offensive language is typically used are required to train generalizable models. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce a large manually labeled dataset BD-SHS that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which is the first of its kind in Bangla HS to the best of our knowledge. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60\% larger than any existing Bangla HS datasets. We present the benchmark result of our dataset by training different NLP models resulting in the best one achieving an F1-score of 91.0\%. In our experiments, we found that a word embedding trained exclusively using 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection in comparison to other pre-trained embeddings. Our dataset and all accompanying codes is publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media},
url = {https://aclanthology.org/2022.lrec-1.552}
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Posting hateful, abusive, or insulting comments, commonly known as hate speech, is currently a major issue on social media websites. It would therefore be useful to have datasets that can be used to detect such abusive comments on the internet. Although several datasets are available for English, there are currently no datasets based on Sinhala, a native language of Sri Lanka. This dataset will therefore help to detect hateful, abusive, or insulting comments posted on social media platforms using Sinhala Unicode.
This dataset contains alphabetically ordered comments made using Sinhala Unicode on Facebook, together with a label that identifies whether each comment is hate speech or not.
Can this dataset be used to predict hateful, abusive or insulting comments which are made in the Sinhala language?
The dataset is provided with the understanding that the Author is not herein engaged in rendering legal, accounting, tax, or other professional advice and services. As such, it should not be used as a substitute for consultation with professional accounting, tax, legal or other competent advisers. In no event shall the Author be liable for any special, incidental, indirect, or consequential damages whatsoever arising out of or in connection with your access or use or inability to access or use the Dataset.
Note: Due to a system migration, this data will cease to update on March 14th, 2023. The current projection is to restart the updates within 30 days of the system migration, on or around April 13th, 2023.

This data is a subset of the Incident data provided by the open data portal. It specifically identifies crimes that meet the elements outlined under the FBI Hate Crimes program since 2010. For more information on the FBI hate crime overview, please visit https://www.fbi.gov/about-us/investigate/civilrights/hate_crimes

Data Dictionary:
- ID - the row number
- INCIDENT_NUMBER - the number associated with either the incident or used as reference to store the items in our evidence rooms; can be used to connect the dataset to other LMPD datasets
- DATE_REPORTED - the date the incident was reported to LMPD
- DATE_OCCURED - the date the incident actually occurred
- CRIME_TYPE - the crime type category
- BIAS_MOTIVATION_GROUP - victim group that was targeted by the criminal act
- BIAS_TARGETED_AGAINST - whether the criminal act was against a person or property
- UOR_DESC - Uniform Offense Reporting code for the criminal act committed
- NIBRS_CODE - the code that follows the guidelines of the National Incident Based Reporting System. For more details visit https://ucr.fbi.gov/nibrs/2011/resources/nibrs-offense-codes/view
- UCR_HIERARCHY - hierarchy that follows the guidelines of the FBI Uniform Crime Reporting. For more details visit https://ucr.fbi.gov/
- ATT_COMP - status indicating whether the incident was an attempted crime or a completed crime
- LMPD_DIVISION - the LMPD division in which the incident actually occurred
- LMPD_BEAT - the LMPD beat in which the incident actually occurred
- PREMISE_TYPE - the type of location in which the incident occurred (e.g. Restaurant)
- BLOCK_ADDRESS - the location the incident occurred
- CITY - the city associated with the incident block location
- ZIP_CODE - the zip code associated with the incident block location
Anti-Jewish attacks were the most common form of anti-religious group hate crimes in the United States in 2023, with ***** cases. Anti-Islamic hate crimes were the second most common anti-religious hate crimes in that year, with *** incidents.
The survey ´Topic-specific Information Behaviour on the Corona Pandemic´ conducted by the market and opinion research institute INFO GmbH on behalf of the Press and Information Office of the Federal Government examines the attitudes of the population towards the COVID-19 pandemic, their information behaviour and their handling of the topic as well as the assessment of the reporting on the COVID-19 pandemic. The current survey wave in April 2021 thus builds on a first survey on the topic in November 2020. Political interest; political informativeness; general attitude towards politics in Germany (satisfaction with politics, influence on politics, attitude towards parties and politicians, understanding of politics); frequency of use of certain media for political information (e.g. public television, private television programmes, printed newspapers, news in social networks, etc.); use of various information offers of the federal government; assessment of the credibility of the information of the federal government on political topics; topic that currently arouses the most interest (open); currently most annoying topic; used to be more interested in this annoying topic; interest in the topic Corona; annoyed by the topic Corona pandemic; most annoying in connection with Corona; informedness about the Corona pandemic; personal handling of the topic Corona pandemic (information behaviour, e.g. 
complete reading of articles about Corona, conversations about the topic, arguments about Corona, etc.); assessment of reporting on the topic of the Corona pandemic; behaviour in social media on the topic of Corona (own opinion expressed, other opinions on the topic read, no opinions on the topic yet read in social media); attitude towards public discussion in social media on Corona; comparison of one's own view regarding the Corona pandemic with the views of the environment in social networks and of family and friends; credibility of the information provided by the federal government on the Corona pandemic; satisfaction with the way the federal government provides information on the following aspects of the Corona pandemic: Corona measures such as contact restrictions, Corona vaccinations, Corona financial aid, Corona testing; reasons for dissatisfaction with the information provided by the federal government on individual aspects of the Corona pandemic; assessment of current policy measures against Corona (appropriate, go too far, do not go far enough). Demography: age (year of birth, average age and grouped); sex; education; occupation, type of degree; household size; number of persons in household under 16; federal state; Berlin West/East; place of residence West/East; city size; party sympathies; former non-German citizenship of respondent and parents (migration background); net household income (grouped). Additionally coded were: respondent ID; coarse clustering (not politically disaffected, politically disaffected); clustering annoyed by the Corona pandemic issue; weight; Nielsen areas.

Sampling: multistage probability (random) sample. Mode: telephone interview, computer-assisted (CATI).
For the study ´Topic-specific Information Behaviour on the COVID-19 Pandemic´, the market and opinion research institute INFO GmbH surveyed a total of 2,012 persons of the German-speaking resident population aged 16 and over from 6 to 25 November 2020 on behalf of the Press and Information Office of the Federal Government. The subject of the survey were attitudes of the population to the topic of the coronavirus pandemic, their information behaviour and handling of the topic as well as the reporting on the topic of the coronavirus pandemic.

1. General questions: interest in politics; self-assessment of being politically informed; agreement with statements on disenchantment with politics (e.g. satisfied with politics in Germany all in all, parties only want the voters´ votes, they are not interested in their views, etc.); frequency of media use on political topics; statements on information processing (I specifically look for information on a political topic that interests me, I read through an article on a political event in its entirety, I read through a background report on a political topic in its entirety); perception of information from the federal government on selected information channels in recent months (e.g. federal government websites, interviews of government politicians on television, etc.); credibility of information from the federal government on political topics.

2. Current interest in the topic: currently most interesting political or social topic (open question); currently most annoying political or social topic (open question); previously greater interest in the currently most annoying topic.

3. Attitudes towards the topic of the coronavirus pandemic (e.g. the topic interests me, the topic is socially relevant, bores me, annoys me, etc.).

4. Information behaviour and dealing with the topic of the coronavirus pandemic: self-assessment of current level of information about the coronavirus pandemic; frequency of certain behaviours in dealing with this topic (have myself frequently searched for information on the topic, have watched video reports on the topic on the Internet or television, have read articles on the topic in newspapers or on the Internet in full, have only skimmed articles on the topic, have talked about the topic with friends or acquaintances, have tried to change the topic when talking about the topic, have avoided the topic as much as possible).

5. Coverage of the coronavirus pandemic: agreement with statements on media coverage (too detailed, too complicated, too extensive, I consider credible, balanced, correct, only aims to influence people, contains many opinions I disagree with, I feel it is one-sided, does not reflect my own opinion on the topic at all, distracts from other important issues); statements on opinions in social media (already expressed own opinion on the topic of the coronavirus pandemic in social media, only read opinions of others on this topic there, have not yet read any opinions on this topic in social media); statements on public discussion in social media (contains many opinions with which I do not agree, I perceive as one-sided, does not reflect my own opinion on the topic at all, I perceive as factual, I perceive as helpful to hear new arguments); credibility of the information of the federal government on the topic of the coronavirus pandemic.

Demography: age (year of birth, average age, age groups); sex; education; vocational training; employment status; household size; number of children/adolescents under 16 in the household; federal state; former district classification Berlin (West/East); place of residence (West/East); Nielsen areas; city size; political orientation; migration background of the respondent or his/her parents; net household income.
Additionally coded were: respondent ID; weight; political disenchantment (rough cluster: not disenchanted with politics, disenchanted with politics); information processing (2-cluster: rather thorough, rather superficial); information processing (3-cluster: thorough, occasionally thorough, superficial).
This is a collection of data from the Leicester Hate Crime Project. It includes interview transcripts, survey data, reports, blogs, presentations, end of project conference details, the original proposal, and other information. To investigate victims’ experiences of hate and prejudice, the study used a mixed methods approach that included: (1) an online and hard-copy survey, translated into eight different languages; (2) in-depth, semi-structured face-to-face interviews; and (3) personal and reflective researcher field diary observations. From the outset we realised that for practical and logistical reasons we would not be able to attain a statistically representative sample of each of the myriad communities we wanted to hear from. We therefore developed a dual method of administering our survey – via hard-copy questionnaires (distributed through dozens of community locations in the city, and through educational establishments, charitable institutions and other liaison points) and online – in order to gain as many responses, from as diverse a range of people, as we possibly could. The research team worked with Ipsos MORI, a leading market research company in the UK and Ireland, to develop the survey instrument. This two-year study examined the experiences and expectations of those who are victimised because of their identity or perceived 'difference' in the eyes of the perpetrator. By exploring hate crime in the broader sense of 'targeted victimisation', the project aimed to investigate the experiences of the more ‘recognised’ hate crime victim communities, including those who experience racist, religiously motivated, homophobic, disablist and transphobic victimisation, as well as those who are marginalised from existing hate crime scholarly and policy frameworks. The study also investigated respondents’ perceptions of criminal justice agencies and other service providers in order to assess the needs of victims and to identify lessons for effective service delivery.
The site for the research was Leicester, one of the most plural cities in the UK containing a diverse range of established and emerging minority communities. The research team administered online and written surveys to victims of hate crime within these communities and conducted in-depth interviews to probe issues in greater depth. Within this project we employed a ‘softer’, more subtle approach to locating and engaging with a wide range of diverse communities. This approach involved the research team spending prolonged periods of time in public spaces and buildings across the city, including international supermarkets, cafes and restaurants, charity shops, community and neighbourhood centres, libraries, health centres, places of worship, pubs and clubs, taxi ranks, and shelters and drug and alcohol services that support ‘hard to reach’ groups. Adopting this method enabled us to engage with over 4,000 members of established and emerging communities in order to raise awareness of the project itself, and to promote further recognition of the harms of hate and available pathways of support for victims. A total of 1,106 questionnaires were completed by people aged 16 and over who had experienced a hate crime in accordance with the definition employed within this study. Of these questionnaires, 808 were completed on paper and 298 were completed online. Ipsos MORI entered the resultant survey data into data analysis software and worked with the research team in interrogating it. The project used in-depth face-to-face qualitative interviews to further explore the nature, extent and impact of hate crime victimisation. Depending on the individual or group, interviews were conducted either individually or in the presence of family members, friends or carers as appropriate. Overall, interviews were carried out with 374 victims, 59 of whom had also completed a survey. Therefore, in total we heard from 1,421 victims over the duration of the study. 
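The participant totals reported above are internally consistent, as a quick arithmetic cross-check shows (808 paper plus 298 online questionnaires, and interviewees minus the overlap with survey respondents):

```python
paper, online = 808, 298
surveys = paper + online               # completed questionnaires
interviews = 374
overlap = 59                           # interviewees who also completed a survey
total_victims = surveys + interviews - overlap
print(surveys, total_victims)          # 1106 1421
```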
Additionally, the Lead Researcher kept a field-note diary throughout the research process. The diary was used to detail observations and informal conversations with community groups, participants and practitioners and provided additional insight into the context and impact of victimisation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains:
The outbreak of COVID-19 has transformed societies across the world as governments tackle the health, economic and social costs of the pandemic. It has also raised concerns about the spread of hateful language and prejudice online, especially hostility directed against East Asia. This data repository is for a classifier that detects and categorizes social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice and a neutral class. The classifier achieves an F1 score of 0.83 across all four classes. We provide our final model (coded in Python), as well as a new 20,000 tweet training dataset used to make the classifier, two analyses of hashtags associated with East Asian prejudice and the annotation codebook. The classifier can be implemented by other researchers, assisting with both online content moderation processes and further research into the dynamics, prevalence and impact of East Asian prejudice online during this global pandemic.
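The repository's own evaluation code is not reproduced here. As an illustration only, an F1 score "across all four classes" is commonly computed as a macro average of per-class F1 scores; the sketch below implements that from scratch, with placeholder label names rather than the dataset's actual tags:

```python
# Placeholder class names -- the dataset's real tags may differ.
LABELS = ["hostility", "criticism", "meta_discussion", "neutral"]

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the four classes."""
    scores = []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```

scikit-learn's `f1_score(..., average="macro")` computes the same quantity; note that whether the reported 0.83 is macro- or micro-averaged is not stated in this summary.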
This work is a collaboration between The Alan Turing Institute and the Oxford Internet Institute. It was funded by the Criminal Justice Theme of The Alan Turing Institute under Wave 1 of the UKRI Strategic Priorities Fund, EPSRC Grant EP/T001569/1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Children express preferences for a wide range of options, such as objects, and frequently observe the preferences that others express towards these things. However, little is known about how these initial preferences develop. The present research investigated whether one particular type of social information – other children’s preferences – influences children’s own preferences. Four-year-old children observed, via video, two boys and two girls display the same preference for one of two stickers. Each child (peer) expressed liking for one sticker and dislike for the other. Then children completed two rounds of the Dictator Game, a classic resource distribution task. In each round, children distributed either 10 liked stickers or 10 disliked stickers (counterbalanced) between themselves and another child who was not present. If the preferences expressed by their peers influenced children’s own preferences, children should keep more of the liked than the disliked stickers for themselves. In line with this prediction, more children kept more liked than disliked stickers, indicating their distribution patterns were influenced by their peers’ preferences. This finding suggests that children extracted informational content about the value of the stickers from their peers and used that information to guide their own preferences. Children might also have aligned their preferences with those of their peers to facilitate social bonding and group membership. This research demonstrates the strong influence of peers on children’s developing preferences, and reveals the effect of peer influence via video – a medium that young children are frequently exposed to but often struggle to learn from in other contexts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
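For the VADER labels, the mapping from score to class can be sketched as follows. VADER's `SentimentIntensityAnalyzer` returns a compound score in [-1, 1], and the vaderSentiment documentation recommends ±0.05 as the positive/negative cut-offs; the thresholds below follow that convention, though whether the paper used exactly these values is an assumption here:

```python
def label_from_compound(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The +/-0.05 thresholds follow the vaderSentiment documentation's
    recommendation; the paper's exact choice may differ.
    """
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

With the vaderSentiment package installed, the score itself would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`.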
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are in 161 different languages, of which the top 10 by frequency are English (343,041 posts), Spanish (30,220), Hindi (15,832), Portuguese (15,779), Indonesian (11,491), Tamil (9,592), Arabic (9,416), German (7,822), Italian (5,162), and Turkish (4,632).
There are 535,021 distinct hashtags in this dataset, with the top 10 by frequency being #covid19 (169,865 posts), #covid (132,485), #coronavirus (117,518), #covid_19 (104,069), #covidtesting (95,095), #coronavirusupdates (75,439), #corona (39,416), #healthcare (38,975), #staysafe (36,740), and #coronavirusoutbreak (34,567).
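Hashtag frequencies like those above can be tallied directly from post descriptions. A self-contained sketch (the sample posts are invented, not drawn from the dataset):

```python
import re
from collections import Counter

# \w+ also matches underscores, so tags like #covid_19 are captured whole.
HASHTAG_RE = re.compile(r"#\w+")

def hashtag_counts(posts):
    """Count hashtag occurrences across post texts, case-insensitively."""
    counts = Counter()
    for text in posts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

sample = [
    "Stay safe everyone #covid19 #staysafe",
    "Testing site open today #COVID19 #covidtesting",
]
print(hashtag_counts(sample).most_common(1))  # [('#covid19', 2)]
```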
The following is a description of the attributes present in this dataset.
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
All the Instagram posts collected during the data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view them (at the time of writing the paper).