Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc.
Figure: Media bias landscape (https://space.mit.edu/home/tegmark/phrasebias.jpg)
Our algorithm analyzed over a million articles from over a hundred newspapers. It first auto-identifies phrases that help predict which newspaper a given article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape shown above. This means that although news bias is inherently political, its measurement need not be.
Here's our paper: arXiv:2109.00024. Our Kaggle data set here contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data:
- The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis uses only the phrases passing all our screenings, i.e., those with ones in columns D, E and F.
- The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase occurs in each newspaper's coverage of that topic.
- The file blacklist.csv contains journalist names and other phrases that were discarded because they revealed the identity of a newspaper without reflecting any political bias.
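For reference, here is a minimal sketch (assuming pandas and the directory layout above) of loading the phrase tables and counts; treating Excel columns D, E and F as the fourth through sixth CSV columns is an assumption, not something specified by the files themselves.
import pandas as pd
# Load one topic's phrase table and its per-newspaper phrase counts.
phrases = pd.read_csv("phrase_selection/immigration_phrases.csv")
counts = pd.read_csv("counts/immigration_counts.csv")
# Keep only phrases that passed every screening, i.e. have a 1 in the
# fourth, fifth and sixth columns (Excel columns D, E, F - assumed positions).
screen_cols = phrases.columns[3:6]
passing = phrases[(phrases[screen_cols] == 1).all(axis=1)]
print(f"{len(passing)} of {len(phrases)} phrases pass all screenings")
print(counts.head())  # per-newspaper occurrence counts for each phrase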
If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.
https://choosealicense.com/licenses/other/
Persona-bias
Data accompanying the paper Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs at ICLR 2024. Paper || Code || Project website || License
Motivation
This is a dataset of model outputs supporting our extensive study of biases in persona-assigned LLMs. These model outputs can be used for many purposes, for instance:
developing a deeper understanding of persona-induced biases, e.g. by analyzing the inhibiting assumptions underlying model… See the full description on the dataset page: https://huggingface.co/datasets/allenai/persona-bias.
Article-Bias-Prediction Dataset
The articles crawled from www.allsides.com are available in the ./data folder, along with the different evaluation splits.
The dataset consists of a total of 37,554 articles. Each article is stored as a JSON object in the ./data/jsons directory and contains the following fields:
1. ID: an alphanumeric identifier.
2. topic: the topic being discussed in the article.
3. source: the name of the article's source (example: New York Times)
4. source_url: the URL to the source's homepage (example: www.nytimes.com)
5. url: the link to the actual article.
6. date: the publication date of the article.
7. authors: a comma-separated list of the article's authors.
8. title: the article's title.
9. content_original: the original body of the article, as returned by the newspaper3k Python library.
10. content: the processed and tokenized content, which is used as input to the different models.
11. bias_text: the label of the political bias annotation of the article (left, center, or right).
12. bias: the numeric encoding of the political bias of the article (0, 1, or 2).
The ./data/splits directory contains the two types of splits, as discussed in the paper: random and media-based. For each of these types, we provide the train, validation and test files that contain the articles' IDs belonging to each set, along with their numeric bias label.
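A minimal sketch of reading one article JSON and a split file; the split file name and its ID/label layout are assumptions for illustration, while the ./data layout and field names come from the description above.
import json
from pathlib import Path
jsons_dir = Path("data/jsons")
# Load a single article and inspect its bias annotation.
with open(next(jsons_dir.glob("*.json"))) as f:
    article = json.load(f)
print(article["source"], article["bias_text"], article["bias"])
# A split file is assumed to map article IDs to numeric bias labels,
# one "ID<TAB>bias" pair per line (hypothetical format and file name).
split_path = Path("data/splits/random/train.tsv")
train_ids = [line.split("\t")[0] for line in split_path.read_text().splitlines()]
print(f"{len(train_ids)} training articles")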
Code
Under maintenance. To be available soon.
Citation
@inproceedings{baly2020we,
  author = {Baly, Ramy and Da San Martino, Giovanni and Glass, James and Nakov, Preslav},
  title = {We Can Detect Your Bias: Predicting the Political Ideology of News Articles},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  series = {EMNLP~'20},
  month = {November},
  year = {2020},
  pages = {4982--4991},
  publisher = {Association for Computational Linguistics}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII). The data structure is tabulated as follows:
- Text: The main content.
- Dimension: Descriptive category of the text.
- Biased_Words: A compilation of words regarded as biased.
- Aspect: Specific sub-topic within the main content.
- Label: The bias label, which is ternary: highly biased, slightly biased, or neutral.
- Toxicity: Indicates the presence (True) or absence (False) of toxicity.
- Identity_mention: Mention of any identity based on word match.
Annotation scheme: The labels and annotations in the dataset are generated through a system of Active Learning, cycling through manual labeling, semi-supervised learning, and human verification. The scheme comprises:
- Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).
- Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.
- Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.
Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.
List of datasets used: We curated different news categories such as climate-crisis news summaries, occupational, and spiritual/faith/general news using RSS feeds to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral / slightly biased / highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources (our attribution to others):
- MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC -- A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
- Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
- Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
- Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
- Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
- Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
- Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/
Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
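As a hedged illustration, here is a short pandas sketch for inspecting the tabulated fields; the file name is hypothetical, and only the column names listed above are taken from the description.
import pandas as pd
# Hypothetical file name; columns follow the data structure described above.
df = pd.read_csv("news_media_bias.csv")
# Distribution of the ternary bias label and of the bias dimensions.
print(df["Label"].value_counts())
print(df["Dimension"].value_counts())
# Rows flagged as toxic that also carry an identity mention.
toxic_identity = df[(df["Toxicity"] == True) & (df["Identity_mention"].notna())]
print(f"{len(toxic_identity)} toxic rows with an identity mention")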
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
Dataset fields:
- unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
- outlet: The publisher of the article.
- headline: The headline of the article.
- article_text: The full content of the news article.
- image_description: Description of the paired image.
- image: The file path of the associated image.
- date_published: The date the article was published.
- source_url: The original URL of the article.
- canonical_link: The canonical URL of the article.
- new_categories: Categories assigned to the article.
- news_categories_confidence_scores: Confidence scores for each category.
- text_label: Indicates the likelihood of the article being disinformation: Likely (likely to be disinformation) or Unlikely (unlikely to be disinformation).
- multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content: Likely (likely to be disinformation) or Unlikely (unlikely to be disinformation).
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
print(record)
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Political Bias Dataset
Overview
The Political Bias dataset contains 658 synthetic statements, each annotated with a bias rating ranging from 0 to 4. These ratings represent a spectrum from highly conservative (0) to highly liberal (4). The dataset was generated using GPT-4, aiming to facilitate research and development in bias detection and reduction in textual data. Special emphasis was placed on distinguishing between moderate biases on both sides, as this has proven to… See the full description on the dataset page: https://huggingface.co/datasets/cajcodes/political-bias.
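A minimal sketch of loading this dataset from the Hugging Face Hub; the column name "label" and the bucketing of the 0-4 ratings into coarse leanings are assumptions for illustration, while the rating scale itself comes from the description above.
from datasets import load_dataset
ds = load_dataset("cajcodes/political-bias")
print(ds)
# Map the 0-4 rating (0 = highly conservative, 4 = highly liberal) to a coarse bucket.
def bucket(example):
    rating = example["label"]  # assumed column name
    example["leaning"] = "conservative" if rating < 2 else ("neutral" if rating == 2 else "liberal")
    return example
ds = ds.map(bucket)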
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times (approximately half of the total number of topics). The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, recorded in the "location" column.
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
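A short sketch of working with suggestions.csv using the column names documented above, and of reconstructing the a-z root-query expansion just described; the example root term is arbitrary.
import pandas as pd
import string
suggestions = pd.read_csv("suggestions.csv")
# Average number of suggestions per root term and search engine.
print(suggestions.groupby(["root_term", "search_engine"])["query_suggestion"].count().mean())
# Reconstruct the a-z expansion used to generate query inputs,
# e.g. "democrats" -> "democrats a", "democrats b", ...
root = "democrats"
expanded_inputs = [root] + [f"{root} {letter}" for letter in string.ascii_lowercase]
print(expanded_inputs[:5])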
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).
Data format:
- ID: Numeric unique identifier.
- Text: Main content.
- Dimension: Categorical descriptor of the text.
- Biased_Words: List of words considered biased.
- Aspect: Specific topic within the text.
- Label: Bias True/False value.
- Aggregate Label: Calculated through multiple weighted formulae.
Annotation scheme:
1. Bias Label: Indicates the presence/absence of bias (e.g., no bias, mild, strong).
2. Words/Phrases Level Biases: Identifies specific biased words/phrases.
3. Subjective Bias (Aspect): Captures biases related to content aspects.
Annotation process: Manual Labeling --> LLM-based Labeling --> Semi-Supervised Learning --> Human Verification (iterative process). The scheme employs a mix of manual labeling, GPT-based labeling, human verification, and semi-supervised learning for refined and accurate annotation.
We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage.
The 'shape bias' dataset was introduced in Geirhos et al. (ICLR 2019) and consists of 224x224 images with conflicting texture and shape information (e.g., cat shape with elephant texture). This is used to measure the shape vs. texture bias of image classifiers.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
There has been a continuous growth in the number of metrics used to analyze fairness and biases in artificial intelligence (AI) platforms since 2016. Diagnostic metrics have consistently been adapted more than benchmarks, with a peak of ** in 2019. It is quite likely that this is simply because more diagnostics need to be run to analyze data to create more accurate benchmarks, i.e. the diagnostics lead to benchmarks.
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to find the names, pronouns, and occupations. The twenty-eight most frequent occupations were identified based on their appearances. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, professor is the most frequent, with 118,400 biographies, while rapper is the least frequent, with 1,406 biographies. Important information about the biographies: 1. The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens. 2. It should be noted that the demographics of online biographies' subjects differ from those of the overall workforce and that this dataset does not contain all biographies on the Internet.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data 1: Dataset with articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles. We have collected news articles to understand political expression through the shared news articles. Data 2: Dataset with articles posted in the Liberal, Conservative, and Restricted (private or banned) subreddits. In total, we collected a corpus of 1.3 million articles. We have collected news articles to understand radicalized communities through the shared news articles.
Part 1 has Data 1 (all) and Data 2 (Raw and Labeled Data - Restricted.json). Part 2 has Data 2 (Raw and Labeled Data - Liberal.json and Conservative.json) and Data 2 (Raw and Unlabeled Data - first 40 of the 76 .json files). Part 3 has Data 2 (Raw and Unlabeled Data - remaining 36 of the 76 .json files).
Each study in the PTSD-Repository was coded for risk of bias (ROB) in specific domains as well as an overall risk of bias for the study. Detailed information about the specific coding strategy is available in Comparative Effectiveness Review [CER] No. 207, Psychological and Pharmacological Treatments for Adults with Posttraumatic Stress Disorder. (See our "Risk of Bias" data story.) Most domains were rated as Yes (i.e., minimal risk of bias), No (i.e., high risk of bias), or Unclear (i.e., bias could not be determined). The overall study rating is based on the domains and is coded as low, medium, or high risk of bias. The 4 domains included as components of the overall rating are:
- Selection bias: randomization adequate, allocation concealment adequate, groups similar at baseline, and whether intention-to-treat analyses were used;
- Performance bias: care providers masked and patients masked;
- Detection bias: outcome assessors masked;
- Attrition bias: overall attrition less than or equal to 20% vs. over 20%; differential attrition less than or equal to 15% vs. over 15%.
Additional items assessed (but not considered as part of the overall rating) include: Reporting bias (all prespecified outcomes reported); Reporting bias (method for handling dropouts); outcome measures equal, valid, and reliable; and whether the study reports adequate treatment fidelity based on measurements by independent raters.
Dataset Card: deepseek_geopolitical_bias_dataset
Dataset Summary
The deepseek_geopolitical_bias_dataset is a collection of geopolitical questions and model responses. It focuses on historical incidents spanning multiple regions (e.g., China, India, Pakistan, Russia, Taiwan, and the USA) and provides an in-depth look at how different Large Language Models (LLMs), including DeepSeek, respond to these sensitive topics. The dataset aims to support research in bias detection… See the full description on the dataset page: https://huggingface.co/datasets/enkryptai/deepseek-geopolitical-bias-dataset.
According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that making AI more transparent and interpretable would mitigate the risk of data bias in AI applications for healthcare. Furthermore, 46 percent of healthcare leaders thought there should be continuous training and education in AI.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data generated as part of research to determine the influence of 'dumping bias' on listener ratings of timbral clarity. Data comprise audio files, listening test interfaces, results and MATLAB code for plot generation.
References
AES 139 (2015): Hermes, K., Brookes, T., Hummersone, C., “The influence of dumping bias on timbral clarity ratings”, 139th Audio Engineering Society Convention, New York, USA, November 2015.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Developments in Artificial Intelligence (AI) are adopted widely in healthcare. However, the introduction and use of AI may come with biases and disparities, resulting in concerns about healthcare access and outcomes for underrepresented indigenous populations. In New Zealand, Māori experience significant inequities in health compared to the non-Indigenous population. This research explores equity concepts and fairness measures concerning AI for healthcare in New Zealand.
Methods: This research considers data and model bias in NZ-based electronic health records (EHRs). Two very distinct NZ datasets are used in this research, one obtained from one hospital and another from multiple GP practices; both datasets were obtained by clinicians. To ensure research equality and fair inclusion of Māori, we combine expertise in Artificial Intelligence (AI), New Zealand clinical context, and te ao Māori. The mitigation of inequity needs to be addressed in data collection, model development, and model deployment. In this paper, we analyze data and algorithmic bias concerning data collection and model development, training, and testing using health data collected by experts. We use fairness measures such as disparate impact scores, equal opportunity, and equalized odds to analyze tabular data. Furthermore, token frequencies, statistical significance testing, and fairness measures for word embeddings, such as the WEAT and WEFE frameworks, are used to analyze bias in free-form medical text. The AI model predictions are also explained using SHAP and LIME.
Results: This research analyzed fairness metrics for NZ EHRs while considering data and algorithmic bias. We show evidence of bias due to the changes made in algorithmic design. Furthermore, we observe unintentional bias due to the underlying pre-trained models used to represent text data. This research addresses some vital issues while opening up the need and opportunity for future research.
Discussion: This research takes early steps toward developing a model of socially responsible and fair AI for New Zealand's population. We provided an overview of reproducible concepts that can be adopted toward any NZ population data. Furthermore, we discuss the gaps and future research avenues that will enable more focused development of fairness measures suitable for the New Zealand population's needs and social structure. One of the primary focuses of this research was ensuring fair inclusion. As such, we combine expertise in AI, clinical knowledge, and the representation of indigenous populations. This inclusion of experts will be vital moving forward, providing a stepping stone toward the integration of AI for better outcomes in healthcare.
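To make one of the fairness measures mentioned above concrete, here is a minimal sketch of a disparate impact score for binary predictions; the toy data, group labels, and the 0.8 rule-of-thumb threshold noted in the comment are illustrative assumptions, not part of the study.
import numpy as np
def disparate_impact(y_pred, group, unprivileged, privileged):
    # Ratio of positive-prediction rates: P(pred=1 | unprivileged) / P(pred=1 | privileged).
    rate_unpriv = y_pred[group == unprivileged].mean()
    rate_priv = y_pred[group == privileged].mean()
    return rate_unpriv / rate_priv
# Illustrative toy example (hypothetical predictions and group membership).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
di = disparate_impact(y_pred, group, unprivileged="B", privileged="A")
print(f"Disparate impact: {di:.2f}")  # values below ~0.8 are often read as adverse impact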
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis Bias Flipper 2018 (Webis-Bias-Flipper-2018) comprises 2,781 events from allsides.com, covering June 1, 2012 through February 10, 2018. For each event, the title, the summary, all news portals belonging to the event, and the links to the news portals with their respective bias were recorded. After that, we crawled the news portals via the given links to retrieve their headlines and the content of all articles, because the content is not provided on allsides.com. For each event we collected the corresponding news articles. A total of 6,458 news articles were collected.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial intelligence (AI) technologies have been applied in various medical domains to predict patient outcomes with high accuracy. As AI becomes more widely adopted, the problem of model bias is increasingly apparent. In this study, we investigate the model bias that can occur when training a model using datasets for only one particular gender and aim to present new insights into the bias issue. For the investigation, we considered an AI model that predicts severity at an early stage based on the medical records of coronavirus disease (COVID-19) patients. For 5,601 confirmed COVID-19 patients, we used 37 medical records, namely, basic patient information, physical index, initial examination findings, clinical findings, comorbidity diseases, and general blood test results at an early stage. To investigate the gender-based AI model bias, we trained and evaluated two separate models—one that was trained using only the male group, and the other using only the female group. When the model trained by the male-group data was applied to the female testing data, the overall accuracy decreased—sensitivity from 0.93 to 0.86, specificity from 0.92 to 0.86, accuracy from 0.92 to 0.86, balanced accuracy from 0.93 to 0.86, and area under the curve (AUC) from 0.97 to 0.94. Similarly, when the model trained by the female-group data was applied to the male testing data, once again, the overall accuracy decreased—sensitivity from 0.97 to 0.90, specificity from 0.96 to 0.91, accuracy from 0.96 to 0.91, balanced accuracy from 0.96 to 0.90, and AUC from 0.97 to 0.95. Furthermore, when we evaluated each gender-dependent model with the test data from the same gender used for training, the resultant accuracy was also lower than that from the unbiased model.
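A minimal sketch of the gender-split training and cross-gender evaluation pattern described above, assuming a numeric feature table with a gender column and a binary severity label; the file name, column names, classifier choice, and the omission of a held-out test split are simplifying assumptions, not the study's actual pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
# Hypothetical table: one row per patient, a "gender" column, a binary
# "severe" outcome, and the remaining numeric columns as early-stage records.
df = pd.read_csv("covid_records.csv")
features = df.columns.difference(["gender", "severe"])
def fit_on(gender):
    sub = df[df["gender"] == gender]
    return LogisticRegression(max_iter=1000).fit(sub[features], sub["severe"])
def evaluate(model, gender):
    sub = df[df["gender"] == gender]
    scores = model.predict_proba(sub[features])[:, 1]
    preds = model.predict(sub[features])
    return roc_auc_score(sub["severe"], scores), balanced_accuracy_score(sub["severe"], preds)
# Train on one gender and evaluate on the other to expose the accuracy drop.
male_model = fit_on("M")
print("M->F:", evaluate(male_model, "F"))
female_model = fit_on("F")
print("F->M:", evaluate(female_model, "M"))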