Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc.
Figure: Media bias landscape (https://space.mit.edu/home/tegmark/phrasebias.jpg)
Our algorithm analyzed over a million articles from over a hundred newspapers. It first auto-identifies phrases that help predict which newspaper a given article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape shown above. This means that although news bias is inherently political, its measurement need not be.
Here's our paper: arXiv:2109.00024. Our Kaggle data set here contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data:
- The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis uses only the phrases passing all our screenings, i.e., those with ones in columns D, E and F.
- The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase occurs in each newspaper's coverage of that topic.
- The file blacklist.csv contains journalist names and other phrases that were discarded because they revealed the identity of a newspaper without reflecting any political bias.
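For reference, here is a minimal sketch (assuming pandas and the directory layout above) of loading the phrase tables and counts; treating Excel columns D, E and F as the fourth through sixth CSV columns is an assumption, not something specified by the files themselves.
import pandas as pd
# Load one topic's phrase table and its per-newspaper phrase counts.
phrases = pd.read_csv("phrase_selection/immigration_phrases.csv")
counts = pd.read_csv("counts/immigration_counts.csv")
# Keep only phrases that passed every screening, i.e. have a 1 in the
# fourth, fifth and sixth columns (Excel columns D, E, F - assumed positions).
screen_cols = phrases.columns[3:6]
passing = phrases[(phrases[screen_cols] == 1).all(axis=1)]
print(f"{len(passing)} of {len(phrases)} phrases pass all screenings")
print(counts.head())  # per-newspaper occurrence counts for each phrase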
If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.
https://choosealicense.com/licenses/other/
Persona-bias
Data accompanying the paper Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs at ICLR 2024. Paper || Code || Project website || License
Motivation
This is a dataset of model outputs supporting our extensive study of biases in persona-assigned LLMs. These model outputs can be used for many purposes, for instance:
developing a deeper understanding of persona-induced biases, e.g. by analyzing the inhibiting assumptions underlying model… See the full description on the dataset page: https://huggingface.co/datasets/allenai/persona-bias.
Article-Bias-Prediction Dataset
The articles crawled from www.allsides.com are available in the ./data folder, along with the different evaluation splits.
The dataset consists of a total of 37,554 articles. Each article is stored as a JSON object in the ./data/jsons directory and contains the following fields:
1. ID: an alphanumeric identifier.
2. topic: the topic being discussed in the article.
3. source: the name of the article's source (example: New York Times)
4. source_url: the URL to the source's homepage (example: www.nytimes.com)
5. url: the link to the actual article.
6. date: the publication date of the article.
7. authors: a comma-separated list of the article's authors.
8. title: the article's title.
9. content_original: the original body of the article, as returned by the newspaper3k Python library.
10. content: the processed and tokenized content, which is used as input to the different models.
11. bias_text: the label of the political bias annotation of the article (left, center, or right).
12. bias: the numeric encoding of the political bias of the article (0, 1, or 2).
The ./data/splits directory contains the two types of splits, as discussed in the paper: random and media-based. For each of these types, we provide the train, validation and test files that contain the articles' IDs belonging to each set, along with their numeric bias label.
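A minimal sketch of reading one article JSON and a split file; the split file name and its ID/label layout are assumptions for illustration, while the ./data layout and field names come from the description above.
import json
from pathlib import Path
jsons_dir = Path("data/jsons")
# Load a single article and inspect its bias annotation.
with open(next(jsons_dir.glob("*.json"))) as f:
    article = json.load(f)
print(article["source"], article["bias_text"], article["bias"])
# A split file is assumed to map article IDs to numeric bias labels,
# one "ID<TAB>bias" pair per line (hypothetical format and file name).
split_path = Path("data/splits/random/train.tsv")
train_ids = [line.split("\t")[0] for line in split_path.read_text().splitlines()]
print(f"{len(train_ids)} training articles")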
Code
Under maintenance. To be available soon.
Citation
@inproceedings{baly2020we,
  author = {Baly, Ramy and Da San Martino, Giovanni and Glass, James and Nakov, Preslav},
  title = {We Can Detect Your Bias: Predicting the Political Ideology of News Articles},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  series = {EMNLP~'20},
  month = {November},
  year = {2020},
  pages = {4982--4991},
  publisher = {Association for Computational Linguistics}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII). The data structure is tabulated as follows:
- Text: The main content.
- Dimension: Descriptive category of the text.
- Biased_Words: A compilation of words regarded as biased.
- Aspect: Specific sub-topic within the main content.
- Label: The bias label, which is ternary: highly biased, slightly biased, or neutral.
- Toxicity: Indicates the presence (True) or absence (False) of toxicity.
- Identity_mention: Mention of any identity based on word match.
Annotation scheme: The labels and annotations in the dataset are generated through a system of Active Learning, cycling through manual labeling, semi-supervised learning, and human verification. The scheme comprises:
- Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).
- Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.
- Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.
Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.
List of datasets used: We curated different news categories such as climate-crisis news summaries, occupational, and spiritual/faith/general news using RSS feeds to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral / slightly biased / highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources (our attribution to others):
- MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC -- A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
- Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
- Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
- Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
- Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
- Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
- Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/
Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
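As a hedged illustration, here is a short pandas sketch for inspecting the tabulated fields; the file name is hypothetical, and only the column names listed above are taken from the description.
import pandas as pd
# Hypothetical file name; columns follow the data structure described above.
df = pd.read_csv("news_media_bias.csv")
# Distribution of the ternary bias label and of the bias dimensions.
print(df["Label"].value_counts())
print(df["Dimension"].value_counts())
# Rows flagged as toxic that also carry an identity mention.
toxic_identity = df[(df["Toxicity"] == True) & (df["Identity_mention"].notna())]
print(f"{len(toxic_identity)} toxic rows with an identity mention")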
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
Dataset fields:
- unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
- outlet: The publisher of the article.
- headline: The headline of the article.
- article_text: The full content of the news article.
- image_description: Description of the paired image.
- image: The file path of the associated image.
- date_published: The date the article was published.
- source_url: The original URL of the article.
- canonical_link: The canonical URL of the article.
- new_categories: Categories assigned to the article.
- news_categories_confidence_scores: Confidence scores for each category.
- text_label: Indicates the likelihood of the article being disinformation: Likely (likely to be disinformation) or Unlikely (unlikely to be disinformation).
- multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content: Likely (likely to be disinformation) or Unlikely (unlikely to be disinformation).
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
print(record)
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Political Bias Dataset
Overview
The Political Bias dataset contains 658 synthetic statements, each annotated with a bias rating ranging from 0 to 4. These ratings represent a spectrum from highly conservative (0) to highly liberal (4). The dataset was generated using GPT-4, aiming to facilitate research and development in bias detection and reduction in textual data. Special emphasis was placed on distinguishing between moderate biases on both sides, as this has proven to… See the full description on the dataset page: https://huggingface.co/datasets/cajcodes/political-bias.
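A minimal sketch of loading this dataset from the Hugging Face Hub; the column name "label" and the bucketing of the 0-4 ratings into coarse leanings are assumptions for illustration, while the rating scale itself comes from the description above.
from datasets import load_dataset
ds = load_dataset("cajcodes/political-bias")
print(ds)
# Map the 0-4 rating (0 = highly conservative, 4 = highly liberal) to a coarse bucket.
def bucket(example):
    rating = example["label"]  # assumed column name
    example["leaning"] = "conservative" if rating < 2 else ("neutral" if rating == 2 else "liberal")
    return example
ds = ds.map(bucket)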
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times (approximately half of the total number of topics). The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, recorded in the "location" column.
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
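A short sketch of working with suggestions.csv using the column names documented above, and of reconstructing the a-z root-query expansion just described; the example root term is arbitrary.
import pandas as pd
import string
suggestions = pd.read_csv("suggestions.csv")
# Average number of suggestions per root term and search engine.
print(suggestions.groupby(["root_term", "search_engine"])["query_suggestion"].count().mean())
# Reconstruct the a-z expansion used to generate query inputs,
# e.g. "democrats" -> "democrats a", "democrats b", ...
root = "democrats"
expanded_inputs = [root] + [f"{root} {letter}" for letter in string.ascii_lowercase]
print(expanded_inputs[:5])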
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).
Data format:
- ID: Numeric unique identifier.
- Text: Main content.
- Dimension: Categorical descriptor of the text.
- Biased_Words: List of words considered biased.
- Aspect: Specific topic within the text.
- Label: Bias True/False value.
- Aggregate Label: Calculated through multiple weighted formulae.
Annotation scheme:
1. Bias Label: Indicates the presence/absence of bias (e.g., no bias, mild, strong).
2. Words/Phrases Level Biases: Identifies specific biased words/phrases.
3. Subjective Bias (Aspect): Captures biases related to content aspects.
Annotation process: Manual Labeling --> LLM-based Labeling --> Semi-Supervised Learning --> Human Verification (iterative process). The scheme employs a mix of manual labeling, GPT-based labeling, human verification, and semi-supervised learning for refined and accurate annotation.
We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage.
The 'shape bias' dataset was introduced in Geirhos et al. (ICLR 2019) and consists of 224x224 images with conflicting texture and shape information (e.g., cat shape with elephant texture). This is used to measure the shape vs. texture bias of image classifiers.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
There has been a continuous growth in the number of metrics used to analyze fairness and biases in artificial intelligence (AI) platforms since 2016. Diagnostic metrics have consistently been adapted more than benchmarks, with a peak of ** in 2019. It is quite likely that this is simply because more diagnostics need to be run to analyze data to create more accurate benchmarks, i.e. the diagnostics lead to benchmarks.
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to find the names, pronouns, and occupations. The twenty-eight most frequent occupations were identified based on their appearances. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, professor is the most frequent, with 118,400 biographies, while rapper is the least frequent, with 1,406 biographies. Important information about the biographies: 1. The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens. 2. It should be noted that the demographics of online biographies' subjects differ from those of the overall workforce and that this dataset does not contain all biographies on the Internet.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data 1: Dataset with articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles. We have collected news articles to understand political expression through the shared news articles. Data 2: Dataset with articles posted in the Liberal, Conservative, and Restricted (private or banned) subreddits. In total, we collected a corpus of 1.3 million articles. We have collected news articles to understand radicalized communities through the shared news articles.
Part 1 has Data 1 (all) and Data 2 (Raw and Labeled Data - Restricted.json). Part 2 has Data 2 (Raw and Labeled Data - Liberal.json and Conservative.json) and Data 2 (Raw and Unlabeled Data - first 40 of the 76 .json files). Part 3 has Data 2 (Raw and Unlabeled Data - remaining 36 of the 76 .json files).
Each study in the PTSD-Repository was coded for risk of bias (ROB) in specific domains as well as an overall risk of bias for the study. Detailed information about the specific coding strategy is available in Comparative Effectiveness Review [CER] No. 207, Psychological and Pharmacological Treatments for Adults with Posttraumatic Stress Disorder. (See our "Risk of Bias" data story.) Most domains were rated as Yes (i.e., minimal risk of bias), No (i.e., high risk of bias), or Unclear (i.e., bias could not be determined). The overall study rating is based on the domains and is coded as low, medium, or high risk of bias. The 4 domains included as components of the overall rating are:
- Selection bias: randomization adequate, allocation concealment adequate, groups similar at baseline, and whether intention-to-treat analyses were used;
- Performance bias: care providers masked and patients masked;
- Detection bias: outcome assessors masked;
- Attrition bias: overall attrition less than or equal to 20% vs. over 20%; differential attrition less than or equal to 15% vs. over 15%.
Additional items assessed (but not considered as part of the overall rating) include: Reporting bias (all prespecified outcomes reported); Reporting bias (method for handling dropouts); outcome measures equal, valid, and reliable; and whether the study reports adequate treatment fidelity based on measurements by independent raters.
Dataset Card: deepseek_geopolitical_bias_dataset
Dataset Summary
The deepseek_geopolitical_bias_dataset is a collection of geopolitical questions and model responses. It focuses on historical incidents spanning multiple regions (e.g., China, India, Pakistan, Russia, Taiwan, and the USA) and provides an in-depth look at how different Large Language Models (LLMs), including DeepSeek, respond to these sensitive topics. The dataset aims to support research in bias detection… See the full description on the dataset page: https://huggingface.co/datasets/enkryptai/deepseek-geopolitical-bias-dataset.
According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that making AI more transparent and interpretable would mitigate the risk of data bias in AI applications for healthcare. Furthermore, 46 percent of healthcare leaders thought there should be continuous training and education in AI.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data generated as part of research to determine the influence of 'dumping bias' on listener ratings of timbral clarity. Data comprise audio files, listening test interfaces, results and MATLAB code for plot generation.
References
AES 139 (2015): Hermes, K., Brookes, T., Hummersone, C., “The influence of dumping bias on timbral clarity ratings”, 139th Audio Engineering Society Convention, New York, USA, November 2015.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Developments in Artificial Intelligence (AI) are adopted widely in healthcare. However, the introduction and use of AI may come with biases and disparities, resulting in concerns about healthcare access and outcomes for underrepresented indigenous populations. In New Zealand, Māori experience significant inequities in health compared to the non-Indigenous population. This research explores equity concepts and fairness measures concerning AI for healthcare in New Zealand.
Methods: This research considers data and model bias in NZ-based electronic health records (EHRs). Two very distinct NZ datasets are used in this research, one obtained from one hospital and another from multiple GP practices; both datasets were obtained by clinicians. To ensure research equality and fair inclusion of Māori, we combine expertise in Artificial Intelligence (AI), New Zealand clinical context, and te ao Māori. The mitigation of inequity needs to be addressed in data collection, model development, and model deployment. In this paper, we analyze data and algorithmic bias concerning data collection and model development, training, and testing using health data collected by experts. We use fairness measures such as disparate impact scores, equal opportunity, and equalized odds to analyze tabular data. Furthermore, token frequencies, statistical significance testing, and fairness measures for word embeddings, such as the WEAT and WEFE frameworks, are used to analyze bias in free-form medical text. The AI model predictions are also explained using SHAP and LIME.
Results: This research analyzed fairness metrics for NZ EHRs while considering data and algorithmic bias. We show evidence of bias due to the changes made in algorithmic design. Furthermore, we observe unintentional bias due to the underlying pre-trained models used to represent text data. This research addresses some vital issues while opening up the need and opportunity for future research.
Discussion: This research takes early steps toward developing a model of socially responsible and fair AI for New Zealand's population. We provided an overview of reproducible concepts that can be adopted toward any NZ population data. Furthermore, we discuss the gaps and future research avenues that will enable more focused development of fairness measures suitable for the New Zealand population's needs and social structure. One of the primary focuses of this research was ensuring fair inclusion. As such, we combine expertise in AI, clinical knowledge, and the representation of indigenous populations. This inclusion of experts will be vital moving forward, providing a stepping stone toward the integration of AI for better outcomes in healthcare.
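To make one of the fairness measures mentioned above concrete, here is a minimal sketch of a disparate impact score for binary predictions; the toy data, group labels, and the 0.8 rule-of-thumb threshold noted in the comment are illustrative assumptions, not part of the study.
import numpy as np
def disparate_impact(y_pred, group, unprivileged, privileged):
    # Ratio of positive-prediction rates: P(pred=1 | unprivileged) / P(pred=1 | privileged).
    rate_unpriv = y_pred[group == unprivileged].mean()
    rate_priv = y_pred[group == privileged].mean()
    return rate_unpriv / rate_priv
# Illustrative toy example (hypothetical predictions and group membership).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
di = disparate_impact(y_pred, group, unprivileged="B", privileged="A")
print(f"Disparate impact: {di:.2f}")  # values below ~0.8 are often read as adverse impact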
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis Bias Flipper 2018 (Webis-Bias-Flipper-2018) comprises 2,781 events from allsides.com, covering June 1, 2012 through February 10, 2018. For each event, the title, the summary, all news portals belonging to the event, and the links to the news portals with their respective bias were recorded. After that, we crawled the news portals via the given links to retrieve their headlines and the content of all articles, because the content is not provided on allsides.com. For each event we collected the corresponding news articles. A total of 6,458 news articles were collected.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial intelligence (AI) technologies have been applied in various medical domains to predict patient outcomes with high accuracy. As AI becomes more widely adopted, the problem of model bias is increasingly apparent. In this study, we investigate the model bias that can occur when training a model using datasets for only one particular gender and aim to present new insights into the bias issue. For the investigation, we considered an AI model that predicts severity at an early stage based on the medical records of coronavirus disease (COVID-19) patients. For 5,601 confirmed COVID-19 patients, we used 37 medical records, namely, basic patient information, physical index, initial examination findings, clinical findings, comorbidity diseases, and general blood test results at an early stage. To investigate the gender-based AI model bias, we trained and evaluated two separate models—one that was trained using only the male group, and the other using only the female group. When the model trained by the male-group data was applied to the female testing data, the overall accuracy decreased—sensitivity from 0.93 to 0.86, specificity from 0.92 to 0.86, accuracy from 0.92 to 0.86, balanced accuracy from 0.93 to 0.86, and area under the curve (AUC) from 0.97 to 0.94. Similarly, when the model trained by the female-group data was applied to the male testing data, once again, the overall accuracy decreased—sensitivity from 0.97 to 0.90, specificity from 0.96 to 0.91, accuracy from 0.96 to 0.91, balanced accuracy from 0.96 to 0.90, and AUC from 0.97 to 0.95. Furthermore, when we evaluated each gender-dependent model with the test data from the same gender used for training, the resultant accuracy was also lower than that from the unbiased model.
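A minimal sketch of the gender-split training and cross-gender evaluation pattern described above, assuming a numeric feature table with a gender column and a binary severity label; the file name, column names, classifier choice, and the omission of a held-out test split are simplifying assumptions, not the study's actual pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
# Hypothetical table: one row per patient, a "gender" column, a binary
# "severe" outcome, and the remaining numeric columns as early-stage records.
df = pd.read_csv("covid_records.csv")
features = df.columns.difference(["gender", "severe"])
def fit_on(gender):
    sub = df[df["gender"] == gender]
    return LogisticRegression(max_iter=1000).fit(sub[features], sub["severe"])
def evaluate(model, gender):
    sub = df[df["gender"] == gender]
    scores = model.predict_proba(sub[features])[:, 1]
    preds = model.predict(sub[features])
    return roc_auc_score(sub["severe"], scores), balanced_accuracy_score(sub["severe"], preds)
# Train on one gender and evaluate on the other to expose the accuracy drop.
male_model = fit_on("M")
print("M->F:", evaluate(male_model, "F"))
female_model = fit_on("F")
print("F->M:", evaluate(female_model, "M"))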