100+ datasets found
  1. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 1, 2023
    Cite
    Haak, Fabian (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Haak, Fabian
    Schaer, Philipp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often exhibiting spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains more recent, regularly updated versions of the dataset with additional tags (such as the URL of each article). We chose to publish the version used for fine-tuning the models on Zenodo to enable reproduction of the results of our study.
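
    For orientation, here is a minimal pandas sketch for loading this file and tallying the bias labels described above. It is illustrative only; the column names used (e.g., "bias_rating") are assumptions and should be checked against the actual CSV header.

    import pandas as pd

    # Load the AllSides balanced news dataset (Dataset 1).
    # Column names below are assumptions; print df.columns to confirm the real schema.
    df = pd.read_csv("allsides_balanced_news_headlines-texts.csv")
    print(df.columns.tolist())

    # Tally articles per bias label; the description above reports
    # 10,273 left, 7,222 right, and 4,252 center articles.
    print(df["bias_rating"].value_counts())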

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the suggested queries and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; the server location is stored in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
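
    As a usage sketch (not part of the release), the suggestions file can be explored with pandas via the columns named above; the exact header spellings should be verified against suggestions.csv.

    import pandas as pd

    # Columns per the description: root_term, query_input, query_suggestion,
    # rank, search_engine, datetime, location.
    sugg = pd.read_csv("suggestions.csv")

    # Suggestions per engine (the text reports 318,185 from Google and 353,484 from Bing).
    print(sugg["search_engine"].value_counts())

    # Average number of suggestions collected per root term (~469 per topic per the text).
    print(sugg.groupby("root_term").size().mean())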

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.

  2. Opinion on mitigating AI data bias in healthcare worldwide 2024

    • statista.com
    Updated Mar 20, 2025
    Cite
    Statista (2025). Opinion on mitigating AI data bias in healthcare worldwide 2024 [Dataset]. https://www.statista.com/statistics/1559311/ways-to-mitigate-ai-bias-in-healthcare-worldwide/
    Dataset updated
    Mar 20, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that making AI more transparent and interpretable would mitigate the risk of data bias in AI applications for healthcare. Furthermore, 46 percent of healthcare leaders thought there should be continuous training and education in AI.

  3. Data_Sheet_1_Gender Bias in Artificial Intelligence: Severity Prediction at an Early Stage of COVID-19

    • frontiersin.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Heewon Chung; Chul Park; Wu Seong Kang; Jinseok Lee (2023). Data_Sheet_1_Gender Bias in Artificial Intelligence: Severity Prediction at an Early Stage of COVID-19.docx [Dataset]. http://doi.org/10.3389/fphys.2021.778720.s001
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Heewon Chung; Chul Park; Wu Seong Kang; Jinseok Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial intelligence (AI) technologies have been applied in various medical domains to predict patient outcomes with high accuracy. As AI becomes more widely adopted, the problem of model bias is increasingly apparent. In this study, we investigate the model bias that can occur when training a model using datasets for only one particular gender and aim to present new insights into the bias issue. For the investigation, we considered an AI model that predicts severity at an early stage based on the medical records of coronavirus disease (COVID-19) patients. For 5,601 confirmed COVID-19 patients, we used 37 medical records, namely, basic patient information, physical index, initial examination findings, clinical findings, comorbidity diseases, and general blood test results at an early stage. To investigate the gender-based AI model bias, we trained and evaluated two separate models—one that was trained using only the male group, and the other using only the female group. When the model trained by the male-group data was applied to the female testing data, the overall accuracy decreased—sensitivity from 0.93 to 0.86, specificity from 0.92 to 0.86, accuracy from 0.92 to 0.86, balanced accuracy from 0.93 to 0.86, and area under the curve (AUC) from 0.97 to 0.94. Similarly, when the model trained by the female-group data was applied to the male testing data, once again, the overall accuracy decreased—sensitivity from 0.97 to 0.90, specificity from 0.96 to 0.91, accuracy from 0.96 to 0.91, balanced accuracy from 0.96 to 0.90, and AUC from 0.97 to 0.95. Furthermore, when we evaluated each gender-dependent model with the test data from the same gender used for training, the resultant accuracy was also lower than that from the unbiased model.
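
    For readers who want to run the same style of cross-gender evaluation on their own data, a hedged scikit-learn sketch follows. It is illustrative only, not the authors' code, and the array names are hypothetical.

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 recall_score, roc_auc_score)

    def cross_group_report(y_true, y_pred, y_score):
        """Metrics named in the description: sensitivity, specificity, accuracy,
        balanced accuracy, and AUC, for a model evaluated on the other gender group."""
        return {
            "sensitivity": recall_score(y_true, y_pred),               # true positive rate
            "specificity": recall_score(y_true, y_pred, pos_label=0),  # true negative rate
            "accuracy": accuracy_score(y_true, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_score),
        }

    # Example: y_true holds severity labels of the female test set, while y_pred and
    # y_score come from a model trained only on the male group (hypothetical arrays).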

  4. Data and Code for: Confidence, Self-Selection and Bias in the Aggregate

    • openicpsr.org
    delimited
    Updated Mar 2, 2023
    Cite
    Benjamin Enke; Thomas Graeber; Ryan Oprea (2023). Data and Code for: Confidence, Self-Selection and Bias in the Aggregate [Dataset]. http://doi.org/10.3886/E185741V1
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    American Economic Association
    Authors
    Benjamin Enke; Thomas Graeber; Ryan Oprea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The influence of behavioral biases on aggregate outcomes depends in part on self-selection: whether rational people opt more strongly into aggregate interactions than biased individuals. In betting market, auction and committee experiments, we document that some errors are strongly reduced through self-selection, while others are not affected at all or even amplified. A large part of this variation is explained by differences in the relationship between confidence and performance. In some tasks, they are positively correlated, such that self-selection attenuates errors. In other tasks, rational and biased people are equally confident, such that self-selection has no effects on aggregate quantities.

  5. Navigating News Narratives: A Media Bias Analysis Dataset

    • figshare.com
    txt
    Updated Dec 8, 2023
    + more versions
    Cite
    Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    figshare
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset on news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

    The data structure is tabulated as follows:

    Text: The main content.
    Dimension: Descriptive category of the text.
    Biased_Words: A compilation of words regarded as biased.
    Aspect: Specific sub-topic within the main content.
    Label: Indicates the degree of bias; the label is ternary (highly biased, slightly biased, or neutral).
    Toxicity: Indicates the presence (True) or absence (False) of toxicity.
    Identity_mention: Mention of any identity based on word match.

    Annotation scheme: The labels and annotations in the dataset are generated through a system of active learning, cycling through manual labeling, semi-supervised learning, and human verification. The scheme comprises:

    Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).
    Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.
    Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.

    Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.

    List of datasets used: We curated different news categories (e.g., climate crisis news summaries, occupational, spiritual/faith, general) using RSS feeds to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral, slightly biased, or highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources (our attribution to others):

    MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC - A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    Age bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage.

    If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0.
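
    Purely to illustrate the tabular structure listed above, a small pandas sketch; the file name and the exact column spellings (Text, Dimension, Biased_Words, Aspect, Label, Toxicity, Identity_mention) are assumptions to verify against the released file.

    import pandas as pd

    # Hypothetical file name; the figshare release is distributed as text/tabular data.
    df = pd.read_csv("news_media_bias.txt", sep="\t")

    # Distribution of the ternary bias label (neutral / slightly biased / highly biased).
    print(df["Label"].value_counts())

    # Sentences flagged as toxic within one dimension, e.g. climate change.
    subset = df[(df["Toxicity"] == True) & (df["Dimension"] == "climate change")]
    print(subset[["Text", "Biased_Words", "Aspect"]].head())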

  6. NewsUnravel Dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv, png
    Updated Jul 11, 2024
    Cite
    anon; anon (2024). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344891
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anon; anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the NUDA Dataset
    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    General

    This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset consists of text, namely sentences with processed binary bias labels (biased or not biased), as well as metadata about the articles. It includes all feedback that was given. The individual unprocessed ratings used to create the labels, together with the corresponding user IDs, are also included.

    For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering, as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor can it be considered sensitive to them, and it is not possible to identify individuals.

    Description of the Data Files

    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
    Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent

    Collection Process

    Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
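
    A rough sketch of how such majority-vote labels could be recomputed from the released tables; the column names follow the file descriptions above, but their exact spelling (e.g., "bias_rating") is an assumption.

    import pandas as pd

    # Feedback.csv holds one row per feedback: participant ID, sentence ID (contentId),
    # bias rating, and provided reasons.
    feedback = pd.read_csv("Feedback.csv")

    # Majority vote per sentence, mirroring the label-generation step described above.
    majority_label = (
        feedback.groupby("contentId")["bias_rating"]
        .agg(lambda votes: votes.mode().iloc[0])
    )
    print(majority_label.head())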

    Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

    So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.
    The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.

  7. NewsMediaBias-Plus Dataset

    • zenodo.org
    • huggingface.co
    bin, zip
    Updated Nov 29, 2024
    Cite
    Shaina Raza; Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NewsMediaBias-Plus Dataset

    Overview

    The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

    Dataset Description

    NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

    Contents

    • unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
    • outlet: The publisher of the article.
    • headline: The headline of the article.
    • article_text: The full content of the news article.
    • image_description: Description of the paired image.
    • image: The file path of the associated image.
    • date_published: The date the article was published.
    • source_url: The original URL of the article.
    • canonical_link: The canonical URL of the article.
    • new_categories: Categories assigned to the article.
    • news_categories_confidence_scores: Confidence scores for each category.

    Annotation Labels

    • text_label: Indicates the likelihood of the article being disinformation:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
    • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.

    Getting Started

    Prerequisites

    • Python 3.6+
    • Pandas
    • Hugging Face Datasets
    • Hugging Face Hub

    Installation

    Load the dataset into Python:

    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus")
    print(ds)               # View structure and splits
    print(ds['train'][0])   # Access the first record of the train split
    print(ds['train'][:5])  # Access the first five records

    Load a Few Records

    from datasets import load_dataset

    # Load the dataset in streaming mode
    streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

    # Get an iterable dataset
    dataset_iterable = streamed_dataset['train'].take(5)

    # Print the records
    for record in dataset_iterable:
        print(record)
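
    Building on the loading examples, a hedged filtering sketch that uses the text_label annotation described above; field access assumes the column names listed under Contents.

    import itertools
    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus", split="train", streaming=True)

    # Keep only items annotated as likely disinformation (see Annotation Labels).
    likely = (rec for rec in ds if rec.get("text_label") == "Likely")
    for rec in itertools.islice(likely, 3):
        print(rec["unique_id"], rec["outlet"], rec["text_label"])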

    Contributions

    Contributions are welcome! You can:

    • Add Data: Contribute more data points.
    • Refine Annotations: Improve annotation accuracy.
    • Share Usage Examples: Help others use the dataset effectively.

    To contribute, fork the repository and create a pull request with your changes.

    License

    This dataset is released under a non-commercial license. See the LICENSE file for more details.

    Citation

    Please cite the dataset using this BibTeX entry:

    @misc{vector_institute_2024_newsmediabias_plus,
      title  = {NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
      author = {Vector Institute Research Team},
      year   = {2024},
      url    = {https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
    }

    Contact

    For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

    Disclaimer and User Guidance

    Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

    Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.

  8. Data from: Reliable species distributions are obtainable with sparse, patchy and biased data by leveraging over species and data types

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 31, 2019
    Cite
    Samantha L. Peel; Nicole A. Hill; Scott D. Foster; Simon J. Wotherspoon; Claudio Ghiglione; Stefano Schiaparelli (2019). Reliable species distributions are obtainable with sparse, patchy and biased data by leveraging over species and data types [Dataset]. http://doi.org/10.5061/dryad.2226v8m
    Dataset updated
    May 31, 2019
    Dataset provided by
    University of Tasmania
    Commonwealth Scientific and Industrial Research Organisation
    Italian National Antarctic Museum (MNA, Section of Genoa) Genoa Italy
    Authors
    Samantha L. Peel; Nicole A. Hill; Scott D. Foster; Simon J. Wotherspoon; Claudio Ghiglione; Stefano Schiaparelli
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description
    1. New methods for species distribution models (SDMs) utilise presence‐absence (PA) data to correct the sampling bias of presence‐only (PO) data in a spatial point process setting. These have been shown to improve species estimates when both data sets are large and dense. However, is a PA data set that is smaller and patchier than hitherto examined able to do the same? Furthermore, when both data sets are relatively small, is there enough information contained within them to produce a useful estimate of species’ distributions? These attributes are common in many applications.

    2. A stochastic simulation was conducted to assess the ability of a pooled data SDM to estimate the distribution of species from increasingly sparser and patchier data sets. The simulated data sets were varied by changing the number of presence‐absence sample locations, the degree of patchiness of these locations, the number of PO observations, and the level of sampling bias within the PO observations. The performance of the pooled data SDM was compared to a PA SDM and a PO SDM to assess the strengths and limitations of each SDM.

    3. The pooled data SDM successfully removed the sampling bias from the PO observations even when the presence‐absence data was sparse and patchy, and the PO observations formed the majority of the data. The pooled data SDM was, in general, more accurate and more precise than either the PA SDM or the PO SDM. All SDMs were more precise for the species responses than they were for the covariate coefficients.

    4. The emerging SDM methodology that pools PO and PA data will facilitate more certainty around species’ distribution estimates, which in turn will allow more relevant and concise management and policy decisions to be enacted. This work shows that it is possible to achieve this result even in relatively data‐poor regions.

  9. Data_Sheet_1_Data and model bias in artificial intelligence for healthcare applications in New Zealand

    • frontiersin.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith (2023). Data_Sheet_1_Data and model bias in artificial intelligence for healthcare applications in New Zealand.zip [Dataset]. http://doi.org/10.3389/fcomp.2022.1070493.s001
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New Zealand
    Description

    Introduction: Developments in Artificial Intelligence (AI) are adopted widely in healthcare. However, the introduction and use of AI may come with biases and disparities, resulting in concerns about healthcare access and outcomes for underrepresented indigenous populations. In New Zealand, Māori experience significant inequities in health compared to the non-Indigenous population. This research explores equity concepts and fairness measures concerning AI for healthcare in New Zealand.

    Methods: This research considers data and model bias in NZ-based electronic health records (EHRs). Two very distinct NZ datasets are used in this research, one obtained from one hospital and another from multiple GP practices, with both datasets collected by clinicians. To ensure research equality and fair inclusion of Māori, we combine expertise in Artificial Intelligence (AI), the New Zealand clinical context, and te ao Māori. The mitigation of inequity needs to be addressed in data collection, model development, and model deployment. In this paper, we analyze data and algorithmic bias concerning data collection and model development, training, and testing using health data collected by experts. We use fairness measures such as disparate impact scores, equal opportunity, and equalized odds to analyze tabular data. Furthermore, token frequencies, statistical significance testing, and fairness measures for word embeddings, such as the WEAT and WEFE frameworks, are used to analyze bias in free-form medical text. The AI model predictions are also explained using SHAP and LIME.

    Results: This research analyzed fairness metrics for NZ EHRs while considering data and algorithmic bias. We show evidence of bias due to the changes made in algorithmic design. Furthermore, we observe unintentional bias due to the underlying pre-trained models used to represent text data. This research addresses some vital issues while opening up the need and opportunity for future research.

    Discussion: This research takes early steps toward developing a model of socially responsible and fair AI for New Zealand's population. We provided an overview of reproducible concepts that can be adopted toward any NZ population data. Furthermore, we discuss the gaps and future research avenues that will enable more focused development of fairness measures suitable for the New Zealand population's needs and social structure. One of the primary focuses of this research was ensuring fair inclusion. As such, we combine expertise in AI, clinical knowledge, and the representation of indigenous populations. This inclusion of experts will be vital moving forward, providing a stepping stone toward the integration of AI for better outcomes in healthcare.
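
    As an illustration of one of the tabular fairness measures mentioned above (disparate impact), here is a hedged sketch; it is not the authors' code, and the column names and threshold are assumptions.

    import pandas as pd

    def disparate_impact(df: pd.DataFrame, group_col: str, pred_col: str,
                         protected: str, reference: str) -> float:
        """Ratio of positive-prediction rates: protected group vs. reference group."""
        p_prot = df.loc[df[group_col] == protected, pred_col].mean()
        p_ref = df.loc[df[group_col] == reference, pred_col].mean()
        return p_prot / p_ref

    # Synthetic example; a ratio below roughly 0.8 is commonly read as adverse impact.
    demo = pd.DataFrame({
        "group": ["A", "A", "B", "B", "B", "A"],
        "prediction": [1, 0, 1, 1, 1, 1],
    })
    print(disparate_impact(demo, "group", "prediction", protected="A", reference="B"))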

  10. Number of new AI fairness and bias metrics worldwide 2016-2022, by type

    • statista.com
    Updated Jul 1, 2025
    Cite
    Statista (2025). Number of new AI fairness and bias metrics worldwide 2016-2022, by type [Dataset]. https://www.statista.com/statistics/1378864/ai-fairness-bias-metrics-growth-worlwide/
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    There has been a continuous growth in the number of metrics used to analyze fairness and biases in artificial intelligence (AI) platforms since 2016. Diagnostic metrics have consistently been adapted more than benchmarks, with a peak of ** in 2019. It is quite likely that this is simply because more diagnostics need to be run to analyze data to create more accurate benchmarks, i.e. the diagnostics lead to benchmarks.

  11. Data from: Deconstructing Bias in Social Preferences Reveals Groupy and Not Groupy Behavior

    • openicpsr.org
    stata
    Updated Aug 5, 2020
    Cite
    Rachel Kranton; Matthew Pease; Seth Sanders; Scott Heutell (2020). Deconstructing Bias in Social Preferences Reveals Groupy and Not Groupy Behavior [Dataset]. http://doi.org/10.3886/E120555V1
    Dataset updated
    Aug 5, 2020
    Dataset provided by
    Cornell University
    Duke University
    UPMC
    Authors
    Rachel Kranton; Matthew Pease; Seth Sanders; Scott Heutell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2010 - 2020
    Area covered
    Durham, NC
    Description

    Group divisions are a continual feature of human history, with biases toward people’s own groups shown in both experimental and natural settings. Using a novel within-subject design, this work deconstructs group biases to find significant and robust individual differences; some individuals consistently respond to group divisions, while others do not. We examined individual behavior in two treatments in which subjects make pairwise decisions that determine own and others’ income. In a political treatment, which divided subjects into groups based on their political leanings, political party members showed more ingroup bias than Independents who professed the same political opinions. But this greater bias was also present in a minimal group treatment, showing that stronger group identification was not the driver of higher favoritism in the political setting. Analyzing individual choices across the experiment, we categorize participants as “groupy” or “not groupy,” such that groupy participants have social preferences that change for ingroup and outgroup recipients, while not-groupy participants’ preferences do not change across group context. Demonstrating further that the group identity of the recipient mattered less to their choices, strongly not-groupy subjects made allocation decisions faster. We conclude that observed ingroup biases build on a foundation of heterogeneity in individual groupiness.

  12. Global AI Bias Audit Services Market Growth Opportunities 2025-2032

    • statsndata.org
    excel, pdf
    Updated Jun 2025
    Cite
    Stats N Data (2025). Global AI Bias Audit Services Market Growth Opportunities 2025-2032 [Dataset]. https://www.statsndata.org/report/ai-bias-audit-services-market-375694
    Dataset updated
    Jun 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The AI Bias Audit Services market has emerged as a critical sector in response to the increasing reliance on artificial intelligence (AI) across various industries. As organizations integrate AI systems into their operations, concerns about bias-stemming from data quality, algorithmic fairness, and ethical considera

  13. Data from: Confirmation Bias in Web-Based Search: A Randomized Online Study...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Stefan Schweiger; Stefan Schweiger; Ulrike Cress; Aileen Oeberst; Ulrike Cress; Aileen Oeberst (2020). Confirmation Bias in Web-Based Search: A Randomized Online Study on the Effects of Expert Information and Social Tags on Information Search and Evaluation [Dataset]. http://doi.org/10.5281/zenodo.3358127
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefan Schweiger; Ulrike Cress; Aileen Oeberst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT

    Background: The public typically believes psychotherapy to be more effective than pharmacotherapy for depression treatments. This is not consistent with current scientific evidence, which shows that both types of treatment are about equally effective.

    Objective: The study investigates whether this bias towards psychotherapy guides online information search and whether the bias can be reduced by explicitly providing expert information (in a blog entry) and by providing tag clouds that implicitly reveal experts’ evaluations.

    Methods: A total of 174 participants completed a fully automated Web-based study after we invited them via mailing lists. First, participants read two blog posts by experts that either challenged or supported the bias towards psychotherapy. Subsequently, participants searched for information about depression treatment in an online environment that provided more experts’ blog posts about the effectiveness of treatments based on alleged research findings. These blogs were organized in a tag cloud; both psychotherapy tags and pharmacotherapy tags were popular. We measured tag and blog post selection, efficacy ratings of the presented treatments, and participants’ treatment recommendation after information search.

    Results: Participants demonstrated a clear bias towards psychotherapy (mean 4.53, SD 1.99) compared to pharmacotherapy (mean 2.73, SD 2.41; t173=7.67, P<.001, d=0.81) when rating treatment efficacy prior to the experiment. Accordingly, participants exhibited biased information search and evaluation. This bias was significantly reduced, however, when participants were exposed to tag clouds with challenging popular tags. Participants facing popular tags challenging their bias (n=61) showed significantly less biased tag selection (F2,168=10.61, P<.001, partial eta squared=0.112), blog post selection (F2,168=6.55, P=.002, partial eta squared=0.072), and treatment efficacy ratings (F2,168=8.48, P<.001, partial eta squared=0.092), compared to bias-supporting tag clouds (n=56) and balanced tag clouds (n=57). Challenging (n=93) explicit expert information as presented in blog posts, compared to supporting expert information (n=81), decreased the bias in information search with regard to blog post selection (F1,168=4.32, P=.04, partial eta squared=0.025). No significant effects were found for treatment recommendation (Ps>.33).

    Conclusions: We conclude that the psychotherapy bias is most effectively attenuated—and even eliminated—when popular tags implicitly point to blog posts that challenge the widespread view. Explicit expert information (in a blog entry) was less successful in reducing biased information search and evaluation. Since tag clouds have the potential to counter biased information processing, we recommend their insertion.

  14. Bias feature containing proxy-datum bias information to be used in the Digital Shoreline Analysis System for the central coast of North Carolina from Cape Hatteras to Cape Lookout (NCcentral)

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Bias feature containing proxy-datum bias information to be used in the Digital Shoreline Analysis System for the central coast of North Carolina from Cape Hatteras to Cape Lookout (NCcentral) [Dataset]. https://catalog.data.gov/dataset/bias-feature-containing-proxy-datum-bias-information-to-be-used-in-the-digital-shoreline-a-a918a
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Hatteras Island, North Carolina, Cape Lookout, Cape Hatteras, Cape Lookout
    Description

    The U.S. Geological Survey (USGS) has compiled national shoreline data for more than 20 years to document coastal change and serve the needs of research, management, and the public. Maintaining a record of historical shoreline positions is an effective method to monitor national shoreline evolution over time, enabling scientists to identify areas most susceptible to erosion or accretion. These data can help coastal managers and planners understand which areas of the coast are vulnerable to change. This data release includes one new mean high water (MHW) shoreline extracted from lidar data collected in 2017 for the entire coastal region of North Carolina, which is divided into four subregions: northern North Carolina (NCnorth), central North Carolina (NCcentral), southern North Carolina (NCsouth), and western North Carolina (NCwest). Previously published historical shorelines for North Carolina (Kratzmann and others, 2017) were combined with the new lidar shoreline to calculate long-term (up to 169 years) and short-term (up to 20 years) rates of change. Files associated with the long-term and short-term rates are appended with "LT" and "ST", respectively. A proxy-datum bias reference line that accounts for the positional difference between a proxy shoreline (e.g., High Water Line (HWL) shoreline) and a datum shoreline (e.g., MHW shoreline) is also included in this release.

  15. NewsUnravel Dataset

    • zenodo.org
    csv
    Updated Sep 14, 2023
    Cite
    anonymous; anonymous (2023). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344882
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous; anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset
    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    Description of the data files
    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
    Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent

  16. Data from: Decisions reduce sensitivity to subsequent information

    • search.dataone.org
    • datadryad.org
    Updated Apr 2, 2025
    Cite
    Zohar Z. Bronfman; Noam Brezis; Rani Moran; Konstantinos Tsetsos; Tobias Donner; Marius Usher (2025). Decisions reduce sensitivity to subsequent information [Dataset]. http://doi.org/10.5061/dryad.40f6v
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Zohar Z. Bronfman; Noam Brezis; Rani Moran; Konstantinos Tsetsos; Tobias Donner; Marius Usher
    Time period covered
    Jun 20, 2020
    Description

    Behavioural studies over half a century indicate that making categorical choices alters beliefs about the state of the world. People seem biased to confirm previous choices, and to suppress contradicting information. These choice-dependent biases imply a fundamental bound of human rationality. However, it remains unclear whether these effects extend to lower level decisions, and only little is known about the computational mechanisms underlying them. Building on the framework of sequential-sampling models of decision-making, we developed novel psychophysical protocols that enable us to dissect quantitatively how choices affect the way decision-makers accumulate additional noisy evidence. We find robust choice-induced biases in the accumulation of abstract numerical (experiment 1) and low-level perceptual (experiment 2) evidence. These biases deteriorate estimations of the mean value of the numerical sequence (experiment 1) and reduce the likelihood to revise decisions (experiment 2). Co...

  17. Data from: Bias in tree searches and its consequences for measuring group supports

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 24, 2014
    Cite
    Pablo A. Goloboff; Mark P. Simmons (2014). Bias in tree searches and its consequences for measuring group supports [Dataset]. http://doi.org/10.5061/dryad.tm80k
    Dataset updated
    Jul 24, 2014
    Dataset provided by
    Dryad
    Authors
    Pablo A. Goloboff; Mark P. Simmons
    Time period covered
    2014
    Description

    Supplementary_Material_Search_bias: C code and Windows executables to calculate bias for Wagner trees and branch-swapping starting from random trees.

  18. Data from: Young children seek out biased information about social groups

    • datacatalogue.cessda.eu
    Updated Jun 1, 2025
    Cite
    Over, H; Eggleston, A; Bell, J; Dunham, Y (2025). Young children seek out biased information about social groups [Dataset]. http://doi.org/10.5255/UKDA-SN-852858
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Yale University
    University of York
    Authors
    Over, H; Eggleston, A; Bell, J; Dunham, Y
    Time period covered
    Dec 31, 2013 - Jun 30, 2017
    Area covered
    United Kingdom
    Variables measured
    Individual
    Measurement technique
    Each participant was invited into the testing area and asked to sit at a small table. After a brief warm-up period, the experimenter explained that there were two groups – the Yellow group and the Green group – and that children in the Yellow group got yellow scarves to wear and children in the Green group got green scarves to wear. She then asked children to reach inside a bag and pull out a token, explaining that if the token was yellow then they would be in the Yellow group, and if the token was green, then they would be in the Green group. (Although this process appeared random to the child, it was actually fixed such that half of the children were allocated to the Yellow group and half to the Green group.) Once children had chosen a token, the experimenter checked that children understood which group they were in by asking 'What colour token did you get?' and 'What colour group are you in?' In order to check that children could visually identify the two colour groups, they were then asked to take the appropriate colour scarf (yellow or green) from the table in front of them and put it on.

    Following the group allocation, children were asked how much they liked the two groups. The experimenter explained that children could show her using the scale. She placed the scale in front of children and, pointing at each face in turn, asked, 'Do you really like them, kind of like them, think they're OK, kind of don't like them, or really don't like them?' Once children had answered, the experimenter asked them how much they wanted to play with their own group and encouraged them to answer again using the scale: 'Do you really want to play with them, kind of want to play with them, think playing with them would be OK, kind of don't want to play with them, or really don't want to play with them?' Children were then asked the same two questions, following the same procedure, about the other group.

    Following this, the story choice measure was introduced. The specific nature of this story choice varied depending on the particular study in question. In Studies 1 and 2, children were offered a choice between hearing a story that favoured their own group and disfavoured the other group or a story that favoured the other group and disfavoured their own group. In Study 3, children were asked which story another child ought to hear: one that favoured the participant's own group and disfavoured the other group or a story that favoured the other group and disfavoured the participant's own group. In Study 4, children were offered choices between four stories that contained positive information about their own group, positive information about the other group, negative information about their own group, and negative information about the other group. In Study 5, children were offered a single choice between stories that favoured their own group, favoured the other group, or provided balanced information. In Studies 1 and 2, children also completed preference measures for the two groups after being read the story of their choice. The experimenter asked children to rate once more how much they liked and wanted to play with each of the two groups in the same manner described above.

    In all five studies, the experimenter concluded the session by thanking children for their participation. To ensure that the procedure ended on a positive note, the experimenter told them that, although children in both groups could be mean, they were usually nice. As she told them this, she showed them a final picture in which the Yellow and Green groups played nicely together. Children were then told that the groups did not matter anymore and that they could take off their scarves.
    Description

    Understanding the origins of prejudice necessitates exploring the ways in which children participate in the construction of biased representations of social groups. We investigate whether young children actively seek out information that supports and extends their initial intergroup biases. In Studies 1 and 2, we show that children choose to hear a story that contains positive information about their own group and negative information about another group rather than a story that contains negative information about their own group and positive information about the other group. In a third study, we show that children choose to present biased information to others, thus demonstrating that the effects of information selection can start to propagate through social networks. In Studies 4 and 5, we further investigate the nature of children's selective information seeking and show that children prefer ingroup-favouring information to other types of biased information and even to balanced, unbiased information. Together, this work shows that children are not merely passive recipients of social information; they play an active role in the creation and transmission of intergroup attitudes.


  19. Data from: Racial Bias in AI-Generated Images

    • datacatalogue.cessda.eu
    • openicpsr.org
    • +1more
    Updated Sep 3, 2024
    + more versions
    Cite
    Y. Yang (2024). Racial Bias in AI-Generated Images [Dataset]. http://doi.org/10.17026/SS/O9M6VR
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    Radboud University
    Authors
    Y. Yang
    Time period covered
    Jul 16, 2023 - Jul 23, 2023
    Description

    This file is supplementary material for the manuscript Racial Bias in AI-Generated Images, which has been submitted to a peer-reviewed journal. This dataset/paper examined the image-to-image generation accuracy (i.e., whether the original race and gender of a person's image were replicated in the new AI-generated image) of a Chinese AI-powered image generator. We examined the image-to-image generation models transforming the racial and gender categories of the original photos of White, Black and East Asian people (N = 1,260) in three different racial photo contexts: a single person, two people of the same race, and two people of different races. The dataset includes the original images (e.g., WW1), AI-generated images (e.g., AM1_1, AM1_2, AM1_3), and SPSS files (Yang 230801 Racial bias in Meitu_Accuracy Paper.sav).

  20. Data from: Inferential Selection Bias in a Study of Racial Bias: Revisiting "Working Twice as Hard to Get Half as Far"

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Zigerell, L.J (2023). Inferential Selection Bias in a Study of Racial Bias: Revisiting "Working Twice as Hard to Get Half as Far" [Dataset]. http://doi.org/10.7910/DVN/28043
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Zigerell, L.J
    Description

    A recent article reported evidence from a survey experiment indicating that Americans reward whites more than blacks for hard work but penalize blacks more than whites for laziness. However, the present study demonstrates that these inferences were based on an unrepresentative selection of possible analyses: the strength of inferences from results reported in the original article was weakened when combined with results from equivalent or relevant analyses not reported in the original article. Moreover, newly reported evidence revealed heterogeneity in racial bias: respondents given a direct choice between equivalent targets of different races favored the black target over the white target. These results illustrate how researcher degrees of freedom can foster the production of inferences that are not representative of all inferences that could have been produced from a set of data, illustrating the value of preregistering research design protocols and requiring public posting of data.
