100+ datasets found
  1. mediabias

    • kaggle.com
    Updated Jul 20, 2022
    Cite
    Max Tegmark (2022). mediabias [Dataset]. http://doi.org/10.34740/kaggle/dsv/3966214
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 20, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Max Tegmark
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc. (Figure: "Media bias landscape", https://space.mit.edu/home/tegmark/phrasebias.jpg)

    Our algorithm analyzed over a million articles from over a hundred newspapers. It first auto-identifies phrases that help predict which newspaper a given article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape below. This means that although news bias is inherently political, its measurement need not be.

    Here's our paper: arXiv:2109.00024. Our Kaggle data set here contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data:

    • The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis uses only the ones passing all our screenings, i.e., those with ones in columns D, E and F.
    • The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase occurs in each newspaper's coverage of that topic.
    • The file blacklist.csv contains journalist names and other phrases that were discarded because they helped reveal the identity of a newspaper without reflecting any political bias.
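
    The tables are plain CSV, so they can be explored directly. A minimal sketch, assuming the dataset has been downloaded from Kaggle into a local mediabias/ directory (the paths and exact screening-column layout are assumptions):

    python
    import pandas as pd

    # Phrases most informative for predicting an article's newspaper,
    # sorted by decreasing utility (per the dataset description)
    phrases = pd.read_csv("mediabias/phrase_selection/immigration_phrases.csv")

    # Per-newspaper counts of each phrase within that topic's coverage
    counts = pd.read_csv("mediabias/counts/immigration_counts.csv")

    print(phrases.head())  # inspect the screening columns (D, E, F in Excel)
    print(counts.head())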

    If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.

  2. persona-bias

    • huggingface.co
    Cite
    Ai2, persona-bias [Dataset]. https://huggingface.co/datasets/allenai/persona-bias
    Explore at:
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/other/

    Description

    Persona-bias

    Data accompanying the paper Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs at ICLR 2024. Paper || Code || Project website || License

      Motivation
    

    This is a dataset of model outputs supporting our extensive study of biases in persona-assigned LLMs. These model outputs can be used for many purposes, for instance:

    developing a deeper understanding of persona-induced biases, e.g. by analyzing the inhibiting assumptions underlying model… See the full description on the dataset page: https://huggingface.co/datasets/allenai/persona-bias.
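
    Since the data is hosted on the Hugging Face Hub, it can be pulled with the datasets library. A minimal sketch, assuming the default configuration is accessible without gating (config and split names are assumptions; check the dataset page):

    python
    from datasets import load_dataset

    # Repo id taken from the citation above
    ds = load_dataset("allenai/persona-bias")
    print(ds)  # inspect the available splits and columns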

  3. political-bias

    • huggingface.co
    Updated May 20, 2024
    Cite
    Christopher Jones (2024). political-bias [Dataset]. https://huggingface.co/datasets/cajcodes/political-bias
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 20, 2024
    Authors
    Christopher Jones
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Political Bias Dataset

      Overview
    

    The Political Bias dataset contains 658 synthetic statements, each annotated with a bias rating ranging from 0 to 4. These ratings represent a spectrum from highly conservative (0) to highly liberal (4). The dataset was generated using GPT-4, aiming to facilitate research and development in bias detection and reduction in textual data. Special emphasis was placed on distinguishing between moderate biases on both sides, as this has proven to… See the full description on the dataset page: https://huggingface.co/datasets/cajcodes/political-bias.
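
    A minimal loading sketch, assuming the statements and their 0-4 ratings are exposed as plain columns (the exact column names are assumptions):

    python
    from datasets import load_dataset

    ds = load_dataset("cajcodes/political-bias", split="train")
    print(ds.column_names)  # check the real field names first
    print(ds[0])            # e.g. a statement with its bias rating (0-4)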

  4. Article Bias Prediction Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 10, 2020
    Cite
    Ramy Baly; Giovanni Da San Martino; James Glass; Preslav Nakov (2020). Article Bias Prediction Dataset [Dataset]. https://paperswithcode.com/dataset/article-bias-prediction
    Explore at:
    Dataset updated
    Oct 10, 2020
    Authors
    Ramy Baly; Giovanni Da San Martino; James Glass; Preslav Nakov
    Description

    Article-Bias-Prediction Dataset

    The articles crawled from www.allsides.com are available in the ./data folder, along with the different evaluation splits.

    The dataset consists of a total of 37,554 articles. Each article is stored as a JSON object in the ./data/jsons directory and contains the following fields:

    1. ID: an alphanumeric identifier.
    2. topic: the topic being discussed in the article.
    3. source: the name of the article's source (example: New York Times).
    4. source_url: the URL to the source's homepage (example: www.nytimes.com).
    5. url: the link to the actual article.
    6. date: the publication date of the article.
    7. authors: a comma-separated list of the article's authors.
    8. title: the article's title.
    9. content_original: the original body of the article, as returned by the newspaper3k Python library.
    10. content: the processed and tokenized content, which is used as input to the different models.
    11. bias_text: the label of the political bias annotation of the article (left, center, or right).
    12. bias: the numeric encoding of the political bias of the article (0, 1, or 2).

    The ./data/splits directory contains the two types of splits, as discussed in the paper: random and media-based. For each of these types, we provide the train, validation and test files that contain the articles' IDs belonging to each set, along with their numeric bias label.
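
    Given this layout, a single article can be read with the standard library alone. A minimal sketch, assuming the repository's ./data directory structure described above:

    python
    import json
    import pathlib

    # Read one article JSON from the ./data/jsons directory
    article_path = next(pathlib.Path("data/jsons").glob("*.json"))
    with open(article_path) as f:
        article = json.load(f)

    print(article["source"], article["bias_text"])  # e.g. "New York Times", "left"
    print(article["title"])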

    Code: under maintenance. To be available soon.

    Citation:

    @inproceedings{baly2020we,
      author = {Baly, Ramy and Da San Martino, Giovanni and Glass, James and Nakov, Preslav},
      title = {We Can Detect Your Bias: Predicting the Political Ideology of News Articles},
      booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      series = {EMNLP~'20},
      NOmonth = {November},
      year = {2020},
      pages = {4982--4991},
      NOpublisher = {Association for Computational Linguistics}
    }

  5. Navigating News Narratives: A Media Bias Analysis Dataset

    • figshare.com
    txt
    Updated Dec 8, 2023
    + more versions
    Cite
    Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    figshare
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

    The data structure is tabulated as follows:

    • Text: The main content.
    • Dimension: Descriptive category of the text.
    • Biased_Words: A compilation of words regarded as biased.
    • Aspect: Specific sub-topic within the main content.
    • Label: The bias label; it is ternary: highly biased, slightly biased, or neutral.
    • Toxicity: Indicates the presence (True) or absence (False) of toxicity.
    • Identity_mention: Mention of any identity based on word match.

    Annotation scheme: The labels and annotations in the dataset are generated through a system of Active Learning, cycling through:

    • Manual labeling
    • Semi-supervised learning
    • Human verification

    The scheme comprises:

    • Bias label: Specifies the degree of bias (e.g., no bias, mild, or strong).
    • Words/phrases-level biases: Pinpoints specific biased terms or phrases.
    • Subjective bias (aspect): Highlights biases pertinent to content dimensions.

    Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral, slightly biased, or highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources (our attribution to others):

    • MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    • Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    • Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    • Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    • Age bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data, Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    • Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    • Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward to facilitate usage.

    If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
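
    A minimal exploration sketch, assuming the figshare export is a comma-separated file with the columns listed above (the local filename is hypothetical, and the label values follow the ternary scheme described):

    python
    import pandas as pd

    df = pd.read_csv("news_media_bias.csv")  # hypothetical local filename
    # Keep only rows labeled as biased (label scheme per the description above)
    biased = df[df["Label"] != "neutral"]
    print(biased[["Text", "Dimension", "Biased_Words"]].head())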

  6. md_gender_bias

    • huggingface.co
    • opendatalab.com
    Updated Mar 26, 2021
    Cite
    AI at Meta (2021). md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
    Explore at:
    Dataset updated
    Mar 26, 2021
    Dataset authored and provided by
    AI at Meta
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
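
    A minimal loading sketch; the dataset ships several configurations corresponding to the annotated source datasets, so a configuration name is required ("funpedia" here is an assumption; check the dataset page for the full list):

    python
    from datasets import load_dataset

    ds = load_dataset("facebook/md_gender_bias", "funpedia", split="train")
    print(ds[0])  # a text sample with its gender annotation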

  7. NewsMediaBias-Plus Dataset

    • zenodo.org
    • huggingface.co
    bin, zip
    Updated Nov 29, 2024
    Cite
    Shaina Raza; Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shaina Raza; Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NewsMediaBias-Plus Dataset

    Overview

    The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

    Dataset Description

    NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

    Contents

    • unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
    • outlet: The publisher of the article.
    • headline: The headline of the article.
    • article_text: The full content of the news article.
    • image_description: Description of the paired image.
    • image: The file path of the associated image.
    • date_published: The date the article was published.
    • source_url: The original URL of the article.
    • canonical_link: The canonical URL of the article.
    • new_categories: Categories assigned to the article.
    • news_categories_confidence_scores: Confidence scores for each category.

    Annotation Labels

    • text_label: Indicates the likelihood of the article being disinformation:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
    • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.

    Getting Started

    Prerequisites

    • Python 3.6+
    • Pandas
    • Hugging Face Datasets
    • Hugging Face Hub

    Installation

    Load the dataset into Python:

    python
    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus")
    print(ds)               # view structure and splits
    print(ds['train'][0])   # access the first record of the train split
    print(ds['train'][:5])  # access the first five records

    Load a Few Records

    python
    from datasets import load_dataset

    # Load the dataset in streaming mode
    streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

    # Take the first five records as an iterable
    dataset_iterable = streamed_dataset['train'].take(5)

    # Print the records
    for record in dataset_iterable:
        print(record)

    Contributions

    Contributions are welcome! You can:

    • Add Data: Contribute more data points.
    • Refine Annotations: Improve annotation accuracy.
    • Share Usage Examples: Help others use the dataset effectively.

    To contribute, fork the repository and create a pull request with your changes.

    License

    This dataset is released under a non-commercial license. See the LICENSE file for more details.

    Citation

    Please cite the dataset using this BibTeX entry:

    bibtex
    @misc{vector_institute_2024_newsmediabias_plus,
      title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
      author={Vector Institute Research Team},
      year={2024},
      url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
    }

    Contact

    For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

    Disclaimer and User Guidance

    Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

    Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.

  8. holistic-bias

    • huggingface.co
    Updated Feb 6, 2024
    Cite
    FairNLP (2024). holistic-bias [Dataset]. https://huggingface.co/datasets/fairnlp/holistic-bias
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 6, 2024
    Dataset authored and provided by
    FairNLP
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Usage

    When downloading, specify which files you want to download and set the split to train (required by datasets):

    from datasets import load_dataset

    nouns = load_dataset("fairnlp/holistic-bias", data_files=["nouns.csv"], split="train")
    sentences = load_dataset("fairnlp/holistic-bias", data_files=["sentences.csv"], split="train")

      Dataset Card for Holistic Bias
    

    This dataset contains the source data of the Holistic Bias dataset as described by Smith et al. (2022).… See the full description on the dataset page: https://huggingface.co/datasets/fairnlp/holistic-bias.

  9. BiasBios Dataset

    • paperswithcode.com
    Updated Jan 26, 2019
    + more versions
    Cite
    Maria De-Arteaga; Alexey Romanov; Hanna Wallach; Jennifer Chayes; Christian Borgs; Alexandra Chouldechova; Sahin Geyik; Krishnaram Kenthapadi; Adam Tauman Kalai (2019). BiasBios Dataset [Dataset]. https://paperswithcode.com/dataset/biasbios
    Explore at:
    Dataset updated
    Jan 26, 2019
    Authors
    Maria De-Arteaga; Alexey Romanov; Hanna Wallach; Jennifer Chayes; Christian Borgs; Alexandra Chouldechova; Sahin Geyik; Krishnaram Kenthapadi; Adam Tauman Kalai
    Description

    The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to find names, pronouns, and occupations. The twenty-eight most frequent occupations were identified based on their appearances. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, professor is the most frequent, with 118,400 biographies, while rapper is the least frequent, with 1,406 biographies. Important information about the biographies:

    1. The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens.
    2. The demographics of online biographies' subjects differ from those of the overall workforce, and this dataset does not contain all biographies on the Internet.

  10. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 1, 2023
    Cite
    Schaer, Philipp (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Haak, Fabian
    Schaer, Philipp
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, saved in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
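
    A minimal sketch for exploring both files with pandas, assuming they sit in the working directory under the names given above (the exact header names should be verified against the CSVs first):

    python
    import pandas as pd

    news = pd.read_csv("allsides_balanced_news_headlines-texts.csv")
    print(news.columns)  # inspect headers; a bias label column is expected

    sugg = pd.read_csv("suggestions.csv")
    # Suggestions for one root term, ordered per search engine and rank
    dem = sugg[sugg["root_term"] == "democrats"]
    print(dem.sort_values(["search_engine", "rank"]).head(10))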

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.

  11. deepseek-geopolitical-bias-dataset

    • huggingface.co
    Updated Jan 31, 2025
    Cite
    Enkrypt AI (2025). deepseek-geopolitical-bias-dataset [Dataset]. https://huggingface.co/datasets/enkryptai/deepseek-geopolitical-bias-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 31, 2025
    Dataset authored and provided by
    Enkrypt AI
    Description

    Dataset Card: deepseek_geopolitical_bias_dataset

      Dataset Summary
    

    The deepseek_geopolitical_bias_dataset is a collection of geopolitical questions and model responses. It focuses on historical incidents spanning multiple regions (e.g., China, India, Pakistan, Russia, Taiwan, and the USA) and provides an in-depth look at how different Large Language Models (LLMs), including DeepSeek, respond to these sensitive topics. The dataset aims to support research in bias detection… See the full description on the dataset page: https://huggingface.co/datasets/enkryptai/deepseek-geopolitical-bias-dataset.

  12. Number of new AI fairness and bias metrics worldwide 2016-2022, by type

    • statista.com
    Updated Jul 1, 2025
    Cite
    Statista (2025). Number of new AI fairness and bias metrics worldwide 2016-2022, by type [Dataset]. https://www.statista.com/statistics/1378864/ai-fairness-bias-metrics-growth-worlwide/
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    There has been a continuous growth in the number of metrics used to analyze fairness and biases in artificial intelligence (AI) platforms since 2016. Diagnostic metrics have consistently been adapted more than benchmarks, with a peak of ** in 2019. It is quite likely that this is simply because more diagnostics need to be run to analyze data to create more accurate benchmarks, i.e. the diagnostics lead to benchmarks.

  13. bias-lexicon

    • huggingface.co
    Updated Mar 13, 2024
    Cite
    Media Bias Group (2024). bias-lexicon [Dataset]. https://huggingface.co/datasets/mediabiasgroup/bias-lexicon
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2024
    Dataset authored and provided by
    Media Bias Group
    Description

    Description:

    The bias_lexicon file is a comprehensive dictionary of biased words. This lexicon is designed to assist in identifying and analyzing biased language in various texts. The dictionary encompasses a wide range of words that are often associated with biased expressions, including those related to gender, race, age, and other social categories.

      Usage:
    

    This resource can be pivotal for crafting features in natural language processing (NLP) tasks, sentiment analysis… See the full description on the dataset page: https://huggingface.co/datasets/mediabiasgroup/bias-lexicon.
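
    A minimal sketch of a lexicon-based feature, assuming the lexicon loads as a single table of words (the column name "word" is an assumption; check ds.column_names on the real data):

    python
    from datasets import load_dataset

    lexicon = load_dataset("mediabiasgroup/bias-lexicon", split="train")
    bias_words = set(lexicon["word"])  # assumed column name

    def bias_word_ratio(text: str) -> float:
        # Fraction of tokens in the text that appear in the bias lexicon
        tokens = text.lower().split()
        return sum(t in bias_words for t in tokens) / max(len(tokens), 1)

    print(bias_word_ratio("The radical scheme was pushed through"))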

  14. Opinion on mitigating AI data bias in healthcare worldwide 2024

    • statista.com
    Updated Mar 20, 2025
    Cite
    Statista (2025). Opinion on mitigating AI data bias in healthcare worldwide 2024 [Dataset]. https://www.statista.com/statistics/1559311/ways-to-mitigate-ai-bias-in-healthcare-worldwide/
    Explore at:
    Dataset updated
    Mar 20, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that making AI more transparent and interpretable would mitigate the risk of data bias in AI applications for healthcare. Furthermore, 46 percent of healthcare leaders thought there should be continuous training and education in AI.

  15. Risk Of Bias

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Jun 2, 2025
    + more versions
    Cite
    National Center for PTSD (2025). Risk Of Bias [Dataset]. https://catalog.data.gov/dataset/risk-of-bias-af03c
    Explore at:
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    National Center for PTSD
    Description

    Each study in the PTSD-Repository was coded for risk of bias (ROB) in specific domains as well as an overall risk of bias for the study. Detailed information about the specific coding strategy is available in Comparative Effectiveness Review [CER] No. 207, Psychological and Pharmacological Treatments for Adults with Posttraumatic Stress Disorder. (See our "Risk of Bias" data story.) Most domains were rated as Yes (i.e., minimal risk of bias), No (i.e., high risk of bias), or Unclear (i.e., bias could not be determined). The overall study rating is based on the domains and is coded as low, medium, or high risk of bias. The 4 domains included as components of the overall rating are:

    • Selection bias: randomization adequate, allocation concealment adequate, groups similar at baseline, and whether intention-to-treat analyses were used
    • Performance bias: care providers masked and patients masked
    • Detection bias: outcome assessors masked
    • Attrition bias: overall attrition less than or equal to 20% vs. over 20%; differential attrition less than or equal to 15% vs. over 15%

    Additional items assessed (but not considered as part of the overall rating) include: reporting bias (all prespecified outcomes reported; method for handling dropouts), whether outcome measures were equally valid and reliable, and whether the study reports adequate treatment fidelity based on measurements by independent raters.

  16. CLEAR-Bias Dataset

    • paperswithcode.com
    Updated Apr 9, 2025
    Cite
    Riccardo Cantini; Alessio Orsino; Massimo Ruggiero; Domenico Talia (2025). CLEAR-Bias Dataset [Dataset]. https://paperswithcode.com/dataset/clear-bias
    Explore at:
    Dataset updated
    Apr 9, 2025
    Authors
    Riccardo Cantini; Alessio Orsino; Massimo Ruggiero; Domenico Talia
    Description

    CLEAR-Bias is a benchmark dataset designed to evaluate the robustness of large language models (LLMs) against bias elicitation, particularly under adversarial conditions. It comprises 4,400 prompts across two task formats: multiple-choice and sentence completion. These prompts span seven core bias categories—age, disability, ethnicity, gender, religion, sexual orientation, and socioeconomic status—as well as three intersectional categories, enabling the exploration of overlapping social biases often overlooked in standard evaluations. Each category includes 20 carefully crafted base prompts (10 per task type), which are further expanded using seven jailbreak techniques: machine translation, obfuscation, prefix and prompt injection, refusal suppression, reward incentives, and role-playing—each implemented with three variants.

    CLEAR-Bias is intended for researchers, developers, and practitioners seeking to assess or enhance the ethical behavior of language models. It serves as a benchmarking tool for measuring how effectively different models resist producing biased outputs in both standard and adversarial scenarios. By supporting the evaluation of ethical reliability before real-world deployment, CLEAR-Bias contributes to the development of safer and more responsible LLMs.

  17. Reddit Ideological and Extreme Bias Dataset - Part 1

    • data.mendeley.com
    Updated Feb 28, 2024
    + more versions
    Cite
    Kamalakkannan Ravi (2024). Reddit Ideological and Extreme Bias Dataset - Part 1 [Dataset]. http://doi.org/10.17632/2tdr9sjd83.3
    Explore at:
    Dataset updated
    Feb 28, 2024
    Authors
    Kamalakkannan Ravi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data 1: Dataset with articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles. We collected these news articles to understand political expression through shared news articles.

    Data 2: Dataset with articles posted in the Liberal, Conservative, and Restricted (private or banned) subreddits. In total, we collected a corpus of 1.3 million articles. We collected these news articles to understand radicalized communities through shared news articles.

    Part 1 has Data 1 (all) and Data 2 (Raw and Labeled Data: Restricted.json). Part 2 has Data 2 (Raw and Labeled Data: Liberal.json and Conservative.json) and Data 2 (Raw and Unlabeled Data: first 40 of the 76 .json files). Part 3 has Data 2 (Raw and Unlabeled Data: remaining 36 of the 76 .json files).
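
    The parts ship as plain .json files, so a labeled file can be inspected with the standard library. A minimal sketch, assuming each file holds a list of article records (the record fields are not documented here, so inspect the first one):

    python
    import json

    with open("Restricted.json") as f:
        articles = json.load(f)

    print(len(articles))
    print(articles[0])  # inspect the actual record structure before relying on fields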

  18. CLEAR-Bias

    • huggingface.co
    Updated Apr 12, 2025
    Cite
    Riccardo Cantini (2025). CLEAR-Bias [Dataset]. https://huggingface.co/datasets/RCantini/CLEAR-Bias
    Explore at:
    Dataset updated
    Apr 12, 2025
    Authors
    Riccardo Cantini
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for CLEAR-Bias

    CLEAR-Bias (Corpus for Linguistic Evaluation of Adversarial Robustness against Bias) is a benchmark dataset designed to assess the robustness of large language models (LLMs) against bias elicitation, especially under adversarial conditions. It consists of carefully curated prompts that test the ability of LLMs to resist generating biased content when exposed to both standard and adversarial inputs. The dataset targets a broad spectrum of social biases… See the full description on the dataset page: https://huggingface.co/datasets/RCantini/CLEAR-Bias.
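
    A minimal loading sketch, assuming the prompts are reachable under the default configuration (the configs/splits for the two task formats may differ; check the dataset page):

    python
    from datasets import load_dataset

    ds = load_dataset("RCantini/CLEAR-Bias")
    print(ds)  # inspect the available task formats and bias categories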

  19. Webis-Bias-Flipper-18

    • data.niaid.nih.gov
    • webis.de
    • +1more
    Updated Jan 24, 2020
    Cite
    Chen, Wei-Fan (2020). Webis-Bias-Flipper-18 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3250685
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Chen, Wei-Fan
    Stein, Benno
    Patrick Saad
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis Bias Flipper 2018 (Webis-Bias-Flipper-18) comprises 2,781 events from allsides.com from June 1, 2012 to February 10, 2018. For each event, the title, the summary, all news portals belonging to the event, and the links to the news portals with their respective bias were recorded. After that, we crawled the news portals via the given links to retrieve the headlines and content of all articles, because the content is not provided on allsides.com. For each event we collected the corresponding news articles; a total of 6,458 news articles were collected.

  20. shape bias Dataset

    • paperswithcode.com
    Updated Jul 7, 2024
    Cite
    Robert Geirhos; Patricia Rubisch; Claudio Michaelis; Matthias Bethge; Felix A. Wichmann; Wieland Brendel (2024). shape bias Dataset [Dataset]. https://paperswithcode.com/dataset/shape-bias
    Explore at:
    Dataset updated
    Jul 7, 2024
    Authors
    Robert Geirhos; Patricia Rubisch; Claudio Michaelis; Matthias Bethge; Felix A. Wichmann; Wieland Brendel
    Description

    The 'shape bias' dataset was introduced in Geirhos et al. (ICLR 2019) and consists of 224x224 images with conflicting texture and shape information (e.g., cat shape with elephant texture). This is used to measure the shape vs. texture bias of image classifiers.
