100+ datasets found

h
Dutch-Government-Data-for-Bias-detection
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Milena, Dutch-Government-Data-for-Bias-detection [Dataset]. https://huggingface.co/datasets/milenamileentje/Dutch-Government-Data-for-Bias-detection
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Milena
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Politics of the Netherlands, Netherlands
Description
milenamileentje/Dutch-Government-Data-for-Bias-detection dataset hosted on Hugging Face and contributed by the HF Datasets community
Opinion on mitigating AI data bias in healthcare worldwide 2024
statista.com
Updated Jul 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Opinion on mitigating AI data bias in healthcare worldwide 2024 [Dataset]. https://www.statista.com/statistics/1559311/ways-to-mitigate-ai-bias-in-healthcare-worldwide/
Explore at:
Dataset updated
Jul 18, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Dec 2023 - Mar 2024
Area covered
Worldwide
Description
According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that by making AI more transparent and interpretable, this would mitigate the risk of data bias in AI applications for healthcare. Furthermore, ** percent of healthcare leaders thought there should be continuous training and education in AI.
Z
Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...
data.niaid.nih.gov
zenodo.org
Updated Mar 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Haak, Fabian (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
Explore at:
Dataset updated
Mar 1, 2023
Dataset provided by
Haak, Fabian
Schaer, Philipp
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, and slant other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship, left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags, that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represents the search query suggestions at the respective positions returned by the search engines at the given time of search "datetime". We scraped our data from a US server saved in "location".

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool, that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, that scrapes all available AllSides news articles and gathers available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
f
Data_Sheet_1_Data and model bias in artificial intelligence for healthcare...
frontiersin.figshare.com
zip
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith (2023). Data_Sheet_1_Data and model bias in artificial intelligence for healthcare applications in New Zealand.zip [Dataset]. http://doi.org/10.3389/fcomp.2022.1070493.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fcomp.2022.1070493.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New Zealand
Description
IntroductionDevelopments in Artificial Intelligence (AI) are adopted widely in healthcare. However, the introduction and use of AI may come with biases and disparities, resulting in concerns about healthcare access and outcomes for underrepresented indigenous populations. In New Zealand, Māori experience significant inequities in health compared to the non-Indigenous population. This research explores equity concepts and fairness measures concerning AI for healthcare in New Zealand.MethodsThis research considers data and model bias in NZ-based electronic health records (EHRs). Two very distinct NZ datasets are used in this research, one obtained from one hospital and another from multiple GP practices, where clinicians obtain both datasets. To ensure research equality and fair inclusion of Māori, we combine expertise in Artificial Intelligence (AI), New Zealand clinical context, and te ao Māori. The mitigation of inequity needs to be addressed in data collection, model development, and model deployment. In this paper, we analyze data and algorithmic bias concerning data collection and model development, training and testing using health data collected by experts. We use fairness measures such as disparate impact scores, equal opportunities and equalized odds to analyze tabular data. Furthermore, token frequencies, statistical significance testing and fairness measures for word embeddings, such as WEAT and WEFE frameworks, are used to analyze bias in free-form medical text. The AI model predictions are also explained using SHAP and LIME.ResultsThis research analyzed fairness metrics for NZ EHRs while considering data and algorithmic bias. We show evidence of bias due to the changes made in algorithmic design. Furthermore, we observe unintentional bias due to the underlying pre-trained models used to represent text data. This research addresses some vital issues while opening up the need and opportunity for future research.DiscussionsThis research takes early steps toward developing a model of socially responsible and fair AI for New Zealand's population. We provided an overview of reproducible concepts that can be adopted toward any NZ population data. Furthermore, we discuss the gaps and future research avenues that will enable more focused development of fairness measures suitable for the New Zealand population's needs and social structure. One of the primary focuses of this research was ensuring fair inclusions. As such, we combine expertise in AI, clinical knowledge, and the representation of indigenous populations. This inclusion of experts will be vital moving forward, proving a stepping stone toward the integration of AI for better outcomes in healthcare.
mediabias
kaggle.com
Updated Jul 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Max Tegmark (2022). mediabias [Dataset]. http://doi.org/10.34740/kaggle/dsv/3966214
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/3966214
Dataset updated
Jul 20, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Max Tegmark
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc. https://space.mit.edu/home/tegmark/phrasebias.jpg" alt="Media bias landscape">

Our algorithm analyzed over a million articles from over a hundred newspapers. It first audo-identifies phrases that help predict which newspaper a givens article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape below. This means that although news bias is inherently political, its measurement need not be.

Here's our paper: arXiv:2109.00024. Our Kaggle data set here contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data: - The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis ones only the ones passing all our screenings, i.e., with ones in columns D, E and F. - The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase in occurs in each newspaper's coverage of that topic. - The file blacklist.csv contains journalist names and other phrases that were discarded because they helped revealed the identity of a newspaper without reflecting any political bias.

If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.
NewsMediaBias-Plus Dataset
zenodo.org
huggingface.co
bin, zip
Updated Nov 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza; Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13961155
Dataset updated
Nov 29, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shaina Raza; Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NewsMediaBias-Plus Dataset

Overview

The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

Dataset Description

NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

Contents

unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.

outlet: The publisher of the article.

headline: The headline of the article.

article_text: The full content of the news article.

image_description: Description of the paired image.

image: The file path of the associated image.

date_published: The date the article was published.

source_url: The original URL of the article.

canonical_link: The canonical URL of the article.

new_categories: Categories assigned to the article.

news_categories_confidence_scores: Confidence scores for each category.

Annotation Labels

text_label: Indicates the likelihood of the article being disinformation:

Likely: Likely to be disinformation.

Unlikely: Unlikely to be disinformation.

multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

Likely: Likely to be disinformation.

Unlikely: Unlikely to be disinformation.

Getting Started

Prerequisites

Python 3.6+

Pandas

Hugging Face Datasets

Hugging Face Hub

Installation

Load the dataset into Python:

python

Copy code

from datasets import load_dataset ds = load_dataset("vector-institute/newsmediabias-plus") print(ds) # View structure and splits print(ds['train'][0]) # Access the first record of the train split print(ds['train'][:5]) # Access the first five records

Load a Few Records

python

Copy code

from datasets import load_dataset # Load the dataset in streaming mode streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True) # Get an iterable dataset dataset_iterable = streamed_dataset['train'].take(5) # Print the records for record in dataset_iterable: print(record)

Contributions

Contributions are welcome! You can:

Add Data: Contribute more data points.

Refine Annotations: Improve annotation accuracy.

Share Usage Examples: Help others use the dataset effectively.

To contribute, fork the repository and create a pull request with your changes.

License

This dataset is released under a non-commercial license. See the LICENSE file for more details.

Citation

Please cite the dataset using this BibTeX entry:

bibtex

Copy code

@misc{vector_institute_2024_newsmediabias_plus, title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias}, author={Vector Institute Research Team}, year={2024}, url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus} }

Contact

For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

Disclaimer and User Guidance

Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
f
Navigating News Narratives: A Media Bias Analysis Dataset
figshare.com
txt
Updated Dec 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122.v4
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media.Data description: This is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).The data structure is tabulated as follows:Text: The main content.Dimension: Descriptive category of the text.Biased_Words: A compilation of words regarded as biased.Aspect: Specific sub-topic within the main content.Label: Indicates the presence (True) or absence (False) of bias. The label is ternary - highly biased, slightly biased and neutralToxicity: Indicates the presence (True) or absence (False) of bias.Identity_mention: Mention of any identity based on words match.Annotation SchemeThe labels and annotations in the dataset are generated through a system of Active Learning, cycling through:Manual LabelingSemi-Supervised LearningHuman VerificationThe scheme comprises:Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.List of datasets used : We curated different news categories like Climate crisis news summaries , occupational, spiritual/faith/ general using RSS to capture different dimensions of the news media biases. The annotation is performed using active learning to label the sentence (either neural/ slightly biased/ highly biased) and to pick biased words from the news.We also utilize publicly available data from the following links. Our Attribution to others.MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detectionToxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.Age Bias : Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtVSocial biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/Goal of this dataset :We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage.If you use this dataset, please cite us.Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
h
debiased_dataset
huggingface.co
Updated Sep 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
News Media Biases (2023). debiased_dataset [Dataset]. http://doi.org/10.57967/hf/1050
Explore at:
Unique identifier
https://doi.org/10.57967/hf/1050
Dataset updated
Sep 18, 2023
Dataset authored and provided by
News Media Biases
License
https://choosealicense.com/licenses/creativeml-openrail-m/https://choosealicense.com/licenses/creativeml-openrail-m/
Description
Dataset Description

About the Dataset: This dataset contains text data that has been processed to identify biased statements based on dimensions and aspects. Each entry has been processed using the GPT-4 language model and manually verified by 5 human annotators for quality assurance. Purpose: The dataset aims to help train and evaluate machine learning models in detecting, classifying, and correcting biases in text content, making it essential for NLP research related to fairness… See the full description on the dataset page: https://huggingface.co/datasets/newsmediabias/debiased_dataset.
o
Data from: Deconstructing Bias in Social Preferences Reveals Groupy and Not...
openicpsr.org
stata
Updated Aug 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rachel Kranton; Matthew Pease; Seth Sanders; Scott Heutell (2020). Deconstructing Bias in Social Preferences Reveals Groupy and Not Groupy Behavior [Dataset]. http://doi.org/10.3886/E120555V1
Explore at:
stataAvailable download formats
Unique identifier
https://doi.org/10.3886/E120555V1
Dataset updated
Aug 5, 2020
Dataset provided by
Cornell University
Duke University
UPMC
Authors
Rachel Kranton; Matthew Pease; Seth Sanders; Scott Heutell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2010 - 2020
Area covered
Durham, NC
Description
Group divisions are a continual feature of human history, with biases toward people’s own groups shown in both experimental and natural settings. Using a novel within-subject design, this work deconstructs group biases to find significant and robust individual differences; some individuals consistently respond to group divisions, while others do not. We examined individual behavior in two treatments in which subjects make pairwise decisions that determine own and others’ income. In a political treatment, which divided subjects into groups based on their political leanings, political party members showed more ingroup bias than Independents who professed the same political opinions. But this greater bias was also present in a minimal group treatment, showing that stronger group identification was not the driver of higher favoritism in the political setting. Analyzing individual choices across the experiment, we categorize participants as “groupy” or “not groupy,” such that groupy participants have social preferences that change for ingroup and outgroup recipients, while not-groupy participants’ preferences do not change across group context. Demonstrating further that the group identity of the recipient mattered less to their choices, strongly not-groupy subjects made allocation decisions faster. We conclude that observed ingroup biases build on a foundation of heterogeneity in individual groupiness.
h
political-bias
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Jones, political-bias [Dataset]. https://huggingface.co/datasets/cajcodes/political-bias
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Christopher Jones
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Political Bias Dataset

Overview

The Political Bias dataset contains 658 synthetic statements, each annotated with a bias rating ranging from 0 to 4. These ratings represent a spectrum from highly conservative (0) to highly liberal (4). The dataset was generated using GPT-4, aiming to facilitate research and development in bias detection and reduction in textual data. Special emphasis was placed on distinguishing between moderate biases on both sides, as this has proven to… See the full description on the dataset page: https://huggingface.co/datasets/cajcodes/political-bias.
NewsUnravel Dataset
zenodo.org
csv
Updated Sep 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anonymous; anonymous (2023). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344882
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8344882
Dataset updated
Sep 14, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
anonymous; anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About the Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

Description of the data files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain following data:

NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentences and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article meta data, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
h
md_gender_bias
huggingface.co
opendatalab.com
Updated Mar 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AI at Meta (2021). md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
Explore at:
Dataset updated
Mar 26, 2021
Dataset authored and provided by
AI at Meta
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shed light on offensive language in terms of genderedness.
o
Data and Code for: Confidence, Self-Selection and Bias in the Aggregate
openicpsr.org
delimited
Updated Mar 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benjamin Enke; Thomas Graeber; Ryan Oprea (2023). Data and Code for: Confidence, Self-Selection and Bias in the Aggregate [Dataset]. http://doi.org/10.3886/E185741V1
Explore at:
delimitedAvailable download formats
Unique identifier
https://doi.org/10.3886/E185741V1
Dataset updated
Mar 2, 2023
Dataset provided by
American Economic Association
Authors
Benjamin Enke; Thomas Graeber; Ryan Oprea
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The influence of behavioral biases on aggregate outcomes depends in part on self-selection: whether rational people opt more strongly into aggregate interactions than biased individuals. In betting market, auction and committee experiments, we document that some errors are strongly reduced through self-selection, while others are not affected at all or even amplified. A large part of this variation is explained by differences in the relationship between confidence and performance. In some tasks, they are positively correlated, such that self-selection attenuates errors. In other tasks, rational and biased people are equally confident, such that self-selection has no effects on aggregate quantities.
H
Age Bias Training and Testing Data
dataverse.harvard.edu
Updated Jul 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Diaz (2020). Age Bias Training and Testing Data [Dataset]. http://doi.org/10.7910/DVN/F6EMTS
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/F6EMTS
Dataset updated
Jul 10, 2020
Dataset provided by
Harvard Dataverse
Authors
Mark Diaz
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Training and testing data annotated by a panel survey sample of older adults (aged 50+) and used to create a basic, maximum entropy bag-of-words sentiment classifier. Testing data is scraped from blog posts discussing aging authored by older adults (see published work: https://doi.org/10.1145/3173574.3173986). Training data is a re-annotated subset of the Sentiment140 training data set containing the strings "old" and "young" (see Sentiment140: http://help.sentiment140.com/for-students). For model building, the subset was re-introduce into the full data set, replacing the original annotations.
Z
Data from: Diversity matters: Robustness of bias measurements in Wikidata
data.niaid.nih.gov
Updated May 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sai Keerthana Karnam (2023). Diversity matters: Robustness of bias measurements in Wikidata [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7881057
Explore at:
Dataset updated
May 1, 2023
Dataset provided by
Anirban Panda
Bhanu Prakash Reddy Guda
Paramita das
Soumya Sarkar
Sai Keerthana Karnam
Animesh Mukherjee
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring the biases. However, such studies lack the systematic exploration of the sensitivity of the bias measurements, through varying sources of data, or the embedding algorithms used. To address this research gap, in this work, we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms - TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KG can be altered by specific algorithm bias as incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. We observe that the similarity of the biased occupations across demographics is minimal which reflects the socio-cultural differences around the globe. We believe that this full-scale audit of the bias measurement pipeline will raise awareness among the community while deriving insights related to design choices of data and algorithms both and refrain from the popular dogma of ``one-size-fits-all''.
Z
NewsUnravel Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anon (2024). NewsUnravel Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8344890
Explore at:
Dataset updated
Jul 11, 2024
Dataset authored and provided by
anon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About the NUDA DatasetMedia bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

General

This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

The dataset consists of text, namely biased sentences with binary bias labels (processed, biased or not biased) as well as metadata about the article. It includes all feedback that was given. The single ratings (unprocessed) used to create the labels with correlating User IDs are included.

For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset does not identify sub-populations or can be considered sensitive to them, nor is it possible to identify individuals.

Description of the Data Files

This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labelsStatistics.png: contains all Umami statistics for NewsUnravel's usage dataFeedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasonsContent.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if givenArticle.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %Participant.csv: holds the participant IDs and data processing consent

Collection Process

Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.

Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.

The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
H
Replication Data for: More Human than Human: Measuring ChatGPT Political...
dataverse.harvard.edu
search.dataone.org
Updated Aug 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabio Motoki; Valdemar Pinho Neto; Victor Rodrigues (2023). Replication Data for: More Human than Human: Measuring ChatGPT Political Bias [Dataset]. http://doi.org/10.7910/DVN/KGMEYI
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/KGMEYI
Dataset updated
Aug 18, 2023
Dataset provided by
Harvard Dataverse
Authors
Fabio Motoki; Valdemar Pinho Neto; Victor Rodrigues
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
A standing issue is how to measure bias in Large Language Models (LLMs) like ChatGPT. We devise a novel method of sampling, bootstrapping, and impersonation that addresses concerns about the inherent randomness of LLMs and test if it can capture political bias in ChatGPT. Our results indicate that, by default, ChatGPT is aligned with Democrats in the US. Placebo tests indicate that our results are due to bias, not noise or spurious relationships. Robustness tests show that our findings are valid also for Brazil and the UK, different professions, and different numerical scales and questionnaires.
f
news-media-bias.data.json
figshare.com
txt
Updated Dec 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza (2023). news-media-bias.data.json [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122.v1
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media.Data descriptionThis is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution.The dataset used for this project does not contain any personally identifiable information (PII).Data Format:- ID: Numeric unique identifier.- Text: Main content.- Dimension: Categorical descriptor of the text.- Biased_Words: List of words considered biased.- Aspect: Specific topic within the text.- Label: Bias True/False value- Aggregate Label: Calculated through multiple weighted formulaeAnnotation Scheme:1. Bias Label: Indicate the presence/absence of bias (e.g., no bias, mild, strong).2. Words/Phrases Level Biases: Identify specific biased words/phrases.3. Subjective Bias (Aspect): Capture biases related to content aspects.Annotation Process:Manual Labeling --> LLM based labelling -->Semi-Supervised Learning --> Human Verifications (iterative process)The scheme employs a mix of manual labeling, GPT-based labeling, human verification, and semi-supervised learning for refined and accurate annotation.We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage.
I
Global AI Bias Audit Services Market Growth Opportunities 2025-2032
statsndata.org
excel, pdf
Updated Jun 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stats N Data (2025). Global AI Bias Audit Services Market Growth Opportunities 2025-2032 [Dataset]. https://www.statsndata.org/report/ai-bias-audit-services-market-375694
Explore at:
excel, pdfAvailable download formats
Dataset updated
Jun 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-orderhttps://www.statsndata.org/how-to-order
Area covered
Global
Description
The AI Bias Audit Services market has emerged as a critical sector in response to the increasing reliance on artificial intelligence (AI) across various industries. As organizations integrate AI systems into their operations, concerns about bias-stemming from data quality, algorithmic fairness, and ethical considera
g
EartH2Observe, WFDEI and ERA-Interim data Merged and Bias-corrected for...
dataservices.gfz-potsdam.de
explore.openaire.eu
Updated 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stefan Lange (2016). EartH2Observe, WFDEI and ERA-Interim data Merged and Bias-corrected for ISIMIP (EWEMBI) [Dataset]. http://doi.org/10.5880/pik.2016.004
Explore at:
Unique identifier
https://doi.org/10.5880/pik.2016.004
Dataset updated
2016
Dataset provided by
GFZ Data Services
datacite
Authors
Stefan Lange
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Description
The EWEMBI dataset was compiled to support the bias correction of climate input data for the impact assessments carried out in phase 2b of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP2b; Frieler et al., 2017), which will contribute to the 2018 IPCC special report on the impacts of global warming of 1.5°C above pre-industrial levels and related global greenhouse gas emission pathways. The EWEMBI data cover the entire globe at 0.5° horizontal and daily temporal resolution from 1979 to 2013. Data sources of EWEMBI are ERA-Interim reanalysis data (ERAI; Dee et al., 2011), WATCH forcing data methodology applied to ERA-Interim reanalysis data (WFDEI; Weedon et al., 2014), eartH2Observe forcing data (E2OBS; Calton et al., 2016) and NASA/GEWEX Surface Radiation Budget data (SRB; Stackhouse Jr. et al., 2011). The SRB data were used to bias-correct E2OBS shortwave and longwave radiation (Lange, 2018). Variables included in the EWEMBI dataset are Near Surface Relative Humidity, Near Surface Specific Humidity, Precipitation, Snowfall Flux, Surface Air Pressure, Surface Downwelling Longwave Radiation, Surface Downwelling Shortwave Radiation, Near Surface Wind Speed, Near-Surface Air Temperature, Daily Maximum Near Surface Air Temperature, Daily Minimum Near Surface Air Temperature, Eastward Near-Surface Wind and Northward Near-Surface Wind. For data sources, units and short names of all variables see Frieler et al. (2017, Table 1).

Facebook

Twitter

Click to copy link

Link copied

Cite

Milena, Dutch-Government-Data-for-Bias-detection [Dataset]. https://huggingface.co/datasets/milenamileentje/Dutch-Government-Data-for-Bias-detection

Dutch-Government-Data-for-Bias-detection

Dutch Government Dataset for Bias detectio

milenamileentje/Dutch-Government-Data-for-Bias-detection

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Authors

Milena

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Area covered

Politics of the Netherlands, Netherlands

Description

milenamileentje/Dutch-Government-Data-for-Bias-detection dataset hosted on Hugging Face and contributed by the HF Datasets community

Clear search

Close search

Google apps

Main menu

Dutch-Government-Data-for-Bias-detection

Opinion on mitigating AI data bias in healthcare worldwide 2024

Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

Data_Sheet_1_Data and model bias in artificial intelligence for healthcare...

mediabias

NewsMediaBias-Plus Dataset

NewsMediaBias-Plus Dataset

Overview

Dataset Description

Contents

Annotation Labels

Getting Started

Prerequisites

Installation

Load a Few Records

Contributions

License

Citation

Contact

Disclaimer and User Guidance

Navigating News Narratives: A Media Bias Analysis Dataset

debiased_dataset

Data from: Deconstructing Bias in Social Preferences Reveals Groupy and Not...

political-bias

NewsUnravel Dataset

md_gender_bias

Data and Code for: Confidence, Self-Selection and Bias in the Aggregate

Age Bias Training and Testing Data

Data from: Diversity matters: Robustness of bias measurements in Wikidata

NewsUnravel Dataset

Replication Data for: More Human than Human: Measuring ChatGPT Political...

news-media-bias.data.json

Global AI Bias Audit Services Market Growth Opportunities 2025-2032

EartH2Observe, WFDEI and ERA-Interim data Merged and Bias-corrected for...

Dutch-Government-Data-for-Bias-detection

Dutch Government Dataset for Bias detectio

milenamileentje/Dutch-Government-Data-for-Bias-detection