100+ datasets found
  1. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 1, 2023
    Cite
    Fabian Haak; Philipp Schaer (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. http://doi.org/10.5281/zenodo.7682915
    Available download formats: csv
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabian Haak; Philipp Schaer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. Each AllSides balanced news roundup features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories and to educate users about bias. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of each article.
    Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
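    As a quick sanity check, the label distribution above can be tallied with a Counter once the CSV is loaded. The column name "bias" and the label values in this sketch are assumptions for illustration; the exact header of allsides_balanced_news_headlines-texts.csv is not listed here:

```python
from collections import Counter

def tally_bias_labels(rows, label_column="bias"):
    """Count how many articles carry each bias label."""
    return Counter(row[label_column] for row in rows)

# Toy rows standing in for the real CSV; the "bias" column name is a guess,
# not confirmed by the dataset description.
rows = [
    {"headline": "a", "bias": "left"},
    {"headline": "b", "bias": "left"},
    {"headline": "c", "bias": "right"},
    {"headline": "d", "bias": "center"},
]
counts = tally_bias_labels(rows)
print(counts)
```

    On the published dataset, the same tally should reproduce the 10,273 / 7,222 / 4,252 split reported above.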

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains more recent versions of the dataset with additional tags (such as the URL of each article). We chose to publish the version used for fine-tuning our models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries per topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engine at the given time of search ("datetime"). We scraped our data from a US server; the server location is saved in "location".
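    A minimal sketch of how the documented columns might be consumed, with toy rows standing in for suggestions.csv (the helper function is illustrative, not code shipped with the dataset):

```python
def suggestions_for(rows, root_term, search_engine):
    """Return the query suggestions for one root term and engine, ordered by rank."""
    hits = [r for r in rows
            if r["root_term"] == root_term and r["search_engine"] == search_engine]
    return [r["query_suggestion"] for r in sorted(hits, key=lambda r: int(r["rank"]))]

# Toy rows mirroring the columns described above.
rows = [
    {"root_term": "democrats", "query_input": "democrats", "search_engine": "google",
     "query_suggestion": "democrats vs republicans", "rank": "2",
     "datetime": "2022-11-01T12:00:00", "location": "US"},
    {"root_term": "democrats", "query_input": "democrats", "search_engine": "google",
     "query_suggestion": "democrats abroad", "rank": "1",
     "datetime": "2022-11-01T12:00:00", "location": "US"},
]
print(suggestions_for(rows, "democrats", "google"))
# ['democrats abroad', 'democrats vs republicans']
```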

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
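    The a-to-z extension described above is mechanical enough to sketch. The helper below is hypothetical (not the authors' scraper), but it shows how each root term yields 27 query inputs, which with up to 10 suggestions each gives the stated maximum of 270 suggestions per topic and engine:

```python
from string import ascii_lowercase

def expand_root_query(root_term):
    """Build the query inputs for one root term: the term itself plus 'term a'..'term z'."""
    return [root_term] + [f"{root_term} {letter}" for letter in ascii_lowercase]

inputs = expand_root_query("democrats")
print(len(inputs))   # 27 query inputs per root term
print(inputs[:3])    # ['democrats', 'democrats a', 'democrats b']
```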

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available in the AllSides balanced news headline roundups.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.

  2. Number of new AI fairness and bias metrics worldwide 2016-2022, by type

    • statista.com
    Updated Aug 12, 2024
    Cite
    Number of new AI fairness and bias metrics worldwide 2016-2022, by type [Dataset]. https://www.statista.com/statistics/1378864/ai-fairness-bias-metrics-growth-worlwide/
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    The number of metrics used to analyse fairness and bias in artificial intelligence (AI) platforms has grown continuously since 2016. Diagnostic metrics have consistently been adopted more than benchmarks, quite possibly because many diagnostics must be run to analyse data before more accurate benchmarks can be created; that is, the diagnostics lead to the benchmarks.

  3. Data from: Racial Bias in AI-Generated Images

    • ssh.datastations.nl
    • openicpsr.org
    Updated Aug 1, 2024
    + more versions
    Cite
    Y. Yang (2024). Racial Bias in AI-Generated Images [Dataset]. http://doi.org/10.17026/SS/7MQV4M
    Available download formats: text/x-fixed-field (28980), application/x-spss-sav (67438), application/x-spss-syntax (1998)
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Data Archiving and Networked Services
    Authors
    Y. Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file is supplementary material for the manuscript Racial Bias in AI-Generated Images, which has been submitted to a peer-reviewed journal. The dataset/paper examined the image-to-image generation accuracy of a Chinese AI-powered image generator, i.e., whether the original race and gender of a person’s image were replicated in the new AI-generated image. We examined the models' transformation of the racial and gender categories of original photos of White, Black, and East Asian people (N = 1,260) in three different racial photo contexts: a single person, two people of the same race, and two people of different races.

  4. Opinion on mitigating AI data bias in healthcare worldwide 2024

    • statista.com
    Updated Mar 21, 2025
    Cite
    Statista (2025). Opinion on mitigating AI data bias in healthcare worldwide 2024 [Dataset]. https://www.statista.com/statistics/1559311/ways-to-mitigate-ai-bias-in-healthcare-worldwide/
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that making AI more transparent and interpretable would mitigate the risk of data bias in AI applications for healthcare. Furthermore, 46 percent of healthcare leaders thought there should be continuous training and education in AI.

  5. NewsUnravel Dataset

    • zenodo.org
    csv
    Updated Sep 14, 2023
    + more versions
    Cite
    anonymous (2023). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344882
    Available download formats: csv
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset
    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    Description of the data files
    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
    Article.csv: holds the article ID, title, source, article meta data, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent
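    Given the Content.csv layout above, per-sentence labels can be derived by aggregating individual ratings. The sketch below uses a simple majority vote; the column roles match the description, but the aggregation rule and the tie-breaking choice are illustrative assumptions, not the authors' pipeline:

```python
from collections import defaultdict

def majority_labels(ratings):
    """Aggregate (participantId, contentId, bias_rating) triples into one label per sentence."""
    votes = defaultdict(list)
    for _participant, content_id, rating in ratings:
        votes[content_id].append(rating)
    # Majority vote per sentence; ties fall back to "biased" purely for illustration.
    return {cid: ("biased" if rs.count("biased") >= rs.count("not biased") else "not biased")
            for cid, rs in votes.items()}

ratings = [
    ("p1", "s1", "biased"),
    ("p2", "s1", "biased"),
    ("p3", "s1", "not biased"),
    ("p1", "s2", "not biased"),
]
print(majority_labels(ratings))  # {'s1': 'biased', 's2': 'not biased'}
```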

  6. Replication Data for: More Human than Human: Measuring ChatGPT Political...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 18, 2023
    Cite
    Fabio Motoki; Valdemar Pinho Neto; Victor Rodrigues (2023). Replication Data for: More Human than Human: Measuring ChatGPT Political Bias [Dataset]. http://doi.org/10.7910/DVN/KGMEYI
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Fabio Motoki; Valdemar Pinho Neto; Victor Rodrigues
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    A standing issue is how to measure bias in Large Language Models (LLMs) like ChatGPT. We devise a novel method of sampling, bootstrapping, and impersonation that addresses concerns about the inherent randomness of LLMs and test if it can capture political bias in ChatGPT. Our results indicate that, by default, ChatGPT is aligned with Democrats in the US. Placebo tests indicate that our results are due to bias, not noise or spurious relationships. Robustness tests show that our findings are valid also for Brazil and the UK, different professions, and different numerical scales and questionnaires.
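    The bootstrapping step of such a method can be illustrated generically: resample a set of per-question answer differences with replacement and read off a percentile confidence interval. This is a minimal sketch of the statistical idea only, with made-up numbers, not the paper's replication code:

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    boot_means = sorted(mean(rng.choices(values, k=len(values)))
                        for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-question differences between default answers and an impersonated persona.
diffs = [0.8, 1.2, 0.5, 1.0, 0.9, 1.1, 0.7, 1.3, 0.6, 1.0]
lo, hi = bootstrap_ci(diffs)
print(round(lo, 2), round(hi, 2))  # interval around the observed mean difference
```

    An interval that excludes zero would indicate a systematic difference rather than noise, which is the intuition behind the paper's placebo tests.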

  7. md_gender_bias

    • huggingface.co
    • opendatalab.com
    Updated Mar 26, 2021
    Cite
    md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
    Dataset updated
    Mar 26, 2021
    Dataset authored and provided by
    AI at Meta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.

  8. Bias and fact-checking in news in the U.S. 2022

    • statista.com
    Updated Nov 22, 2024
    Cite
    Statista (2024). Bias and fact-checking in news in the U.S. 2022 [Dataset]. https://www.statista.com/statistics/874821/news-media-bias-perceptions/
    Dataset updated
    Nov 22, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 31, 2022 - Jul 21, 2022
    Area covered
    United States
    Description

    A survey from July 2022 asked Americans how they felt about the effects of bias in news on their ability to sort out facts: 50 percent felt there was so much bias in the news that it was difficult to discern factual information from what was not. This was the highest share to say so across all years shown; at the same time, the 2022 survey showed the lowest share of respondents who believed there were enough sources available to sort out fact from fiction.

  9. Opinion on political bias in news U.S. 2022, by political affiliation

    • statista.com
    Updated Sep 6, 2023
    Cite
    Opinion on political bias in news U.S. 2022, by political affiliation [Dataset]. https://www.statista.com/statistics/802278/opinion-extent-political-bias-news-coverage-us-political-affiliation/
    Dataset updated
    Sep 6, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 31, 2022 - Jul 21, 2022
    Area covered
    United States
    Description

    According to a survey conducted in the United States in summer 2022, 79 percent of Republican respondents felt that news coverage had a great deal of political bias, making these voters the most likely to hold this opinion of the news media. Independents also felt strongly about this issue, whereas only 33 percent of Democrats said they saw a great deal of political bias in news.

    How politics affects news consumption

    Political bias in news can alienate consumers and may also be poorly received when coverage of a non-political topic leans too heavily towards one end of the spectrum. However, at the same time, personal politics in general are often closely interlinked with how a consumer perceives or engages with news and information. A clear example of this can be found when looking at political news sources used weekly in the U.S., with Republicans and Democrats opting for the national networks they most identify with. But what if audiences cannot find the content they want?

    A change in behavior

    Engaging with news aligning with one’s politics is not uncommon. That said, perceived bias in mainstream media may lead some consumers to look elsewhere and turn away from more “neutral” outlets if they believe the news is no longer impartial. Data shows that a number of leading conservative websites registered a substantial increase in visitors year over year. Looking at this data in the context of Republicans’ concern about bias in political news, it is likely that this trend will continue and that consumers will pursue the outlets they feel resonate with them most.

  10. Perceived bias of news on social media in the U.S. 2018

    • statista.com
    Updated Jun 11, 2019
    + more versions
    Cite
    Statista (2019). Perceived bias of news on social media in the U.S. 2018 [Dataset]. https://www.statista.com/statistics/874843/social-media-bias-perceptions/
    Dataset updated
    Jun 11, 2019
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 5, 2018 - Mar 11, 2018
    Area covered
    United States
    Description

    This statistic presents data on the perceived level of bias in the news seen on social media amongst consumers in the United States as of March 2018. During the survey, 66 percent of consumers stated that they believed 76 percent or more of the news on social media to be biased.

  11. How large is the bias in self-reported disability? (replication data)

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    txt
    Updated Dec 8, 2022
    + more versions
    Cite
    Hugo Benitez-Silva; Moshe Buchinsky; Hiu Man Chan; Sofia Cheidvasser; John Rust (2022). How large is the bias in self-reported disability? (replication data) [Dataset]. http://doi.org/10.15456/jae.2022319.0707261544
    Available download formats: txt (762420), txt (4844)
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Hugo Benitez-Silva; Moshe Buchinsky; Hiu Man Chan; Sofia Cheidvasser; John Rust
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A pervasive concern with the use of self-reported health measures in behavioural models is that individuals tend to exaggerate the severity of health problems in order to rationalize their decisions regarding labour force participation, application for disability benefits, etc. We re-examine this issue using a self-reported indicator of disability status from the Health and Retirement Study. We study a subsample of individuals who applied for disability benefits from the Social Security Administration (SSA), for whom we can also observe the SSA's decision. Using a battery of tests, we are unable to reject the hypothesis that self-reported disability is an unbiased indicator of the SSA's decision.

  12. Data and Code for: Confidence, Self-Selection and Bias in the Aggregate

    • openicpsr.org
    delimited
    Updated Mar 2, 2023
    Cite
    Benjamin Enke; Thomas Graeber; Ryan Oprea (2023). Data and Code for: Confidence, Self-Selection and Bias in the Aggregate [Dataset]. http://doi.org/10.3886/E185741V1
    Available download formats: delimited
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    American Economic Association (http://www.aeaweb.org/)
    Authors
    Benjamin Enke; Thomas Graeber; Ryan Oprea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The influence of behavioral biases on aggregate outcomes depends in part on self-selection: whether rational people opt more strongly into aggregate interactions than biased individuals. In betting market, auction, and committee experiments, we document that some errors are strongly reduced through self-selection, while others are not affected at all or are even amplified. A large part of this variation is explained by differences in the relationship between confidence and performance. In some tasks, they are positively correlated, such that self-selection attenuates errors. In other tasks, rational and biased people are equally confident, such that self-selection has no effect on aggregate quantities.

  13. Replication data for: Media Bias in China

    • openicpsr.org
    • test.openicpsr.org
    Updated Oct 12, 2019
    Cite
    Bei Qin; David Strömberg; Yanhui Wu (2019). Replication data for: Media Bias in China [Dataset]. http://doi.org/10.3886/E113189V1
    Dataset updated
    Oct 12, 2019
    Dataset provided by
    American Economic Association
    Authors
    Bei Qin; David Strömberg; Yanhui Wu
    Area covered
    China
    Description

    This paper examines whether and how market competition affected the political bias of government-owned newspapers in China from 1981 to 2011. We measure media bias based on coverage of government mouthpiece content (propaganda) relative to commercial content. We first find that a reform that forced newspaper exits (reduced competition) affected media bias by increasing product specialization, with some papers focusing on propaganda and others on commercial content. Second, lower-level governments produce less-biased content and launch commercial newspapers earlier, eroding higher-level governments' political goals. Third, bottom-up competition intensifies the politico-economic tradeoff, leading to product proliferation and less audience exposure to propaganda.

  14. Risk Of Bias

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Sep 30, 2024
    + more versions
    Cite
    National Center for PTSD (2024). Risk Of Bias [Dataset]. https://catalog.data.gov/dataset/risk-of-bias-af03c
    Dataset updated
    Sep 30, 2024
    Dataset provided by
    National Center for PTSD
    Description

    Each study in the PTSD-Repository was coded for risk of bias (ROB) in specific domains as well as an overall risk of bias for the study. Detailed information about the specific coding strategy is available in Comparative Effectiveness Review [CER] No. 207, Psychological and Pharmacological Treatments for Adults with Posttraumatic Stress Disorder. (See our "Risk of Bias" data story.) Most domains were rated as Yes (i.e., minimal risk of bias), No (i.e., high risk of bias), or Unclear (i.e., bias could not be determined). The overall study rating is based on the domains and is coded as low, medium, or high risk of bias. The 4 domains included as components of the overall rating are: Selection bias (randomization adequate; allocation concealment adequate; groups similar at baseline; intention-to-treat analyses used); Performance bias (care providers masked; patients masked); Detection bias (outcome assessors masked); and Attrition bias (overall attrition ≤20% vs. >20%; differential attrition ≤15% vs. >15%). Additional items assessed (but not considered part of the overall rating) include: Reporting bias (all prespecified outcomes reported; method for handling dropouts); outcome measures equal, valid, and reliable; and adequate treatment fidelity based on measurements by independent raters.
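    To make the coding scheme concrete, here is a sketch of how per-domain ratings might be rolled up into an overall rating. The aggregation rule below (all Yes → low; any No → high; otherwise medium) is a hypothetical illustration; the description above states that the overall rating is based on the domains but does not spell out the exact rule:

```python
def overall_rob(domain_ratings):
    """Roll up per-domain ratings ('Yes'/'No'/'Unclear') into an overall ROB rating.

    Hypothetical rule for illustration only: all Yes -> 'low',
    any No -> 'high', otherwise (some Unclear) -> 'medium'.
    """
    values = list(domain_ratings.values())
    if all(v == "Yes" for v in values):
        return "low"
    if any(v == "No" for v in values):
        return "high"
    return "medium"

domains = {
    "selection": "Yes",
    "performance": "Unclear",
    "detection": "Yes",
    "attrition": "Yes",
}
print(overall_rob(domains))  # medium
```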

  15. Data from: Bias feature containing proxy-datum bias information to be used...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Bias feature containing proxy-datum bias information to be used in the Digital Shoreline Analysis System for the western coast of North Carolina from Cape Fear to the South Carolina border (NCwest) [Dataset]. https://catalog.data.gov/dataset/bias-feature-containing-proxy-datum-bias-information-to-be-used-in-the-digital-shoreline-a-f4889
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    The U.S. Geological Survey (USGS) has compiled national shoreline data for more than 20 years to document coastal change and serve the needs of research, management, and the public. Maintaining a record of historical shoreline positions is an effective method to monitor national shoreline evolution over time, enabling scientists to identify areas most susceptible to erosion or accretion. These data can help coastal managers and planners understand which areas of the coast are vulnerable to change. This data release includes one new mean high water (MHW) shoreline extracted from lidar data collected in 2017 for the entire coastal region of North Carolina which is divided into four subregions: northern North Carolina (NCnorth), central North Carolina (NCcentral), southern North Carolina (NCsouth), and western North Carolina (NCwest). Previously published historical shorelines for North Carolina (Kratzmann and others, 2017) were combined with the new lidar shoreline to calculate long-term (up to 169 years) and short-term (up to 20 years) rates of change. Files associated with the long-term and short-term rates are appended with "LT" and "ST", respectively. A proxy-datum bias reference line that accounts for the positional difference in a proxy shoreline (e.g. High Water Line (HWL) shoreline) and a datum shoreline (e.g. MHW shoreline) is also included in this release.

  16. NewsUnravel Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    anon (2024). NewsUnravel Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8344890
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the NUDA Dataset
    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    General

    This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset consists of text, namely biased sentences with binary bias labels (processed: biased or not biased), as well as metadata about the articles. It includes all feedback that was given. The single unprocessed ratings used to create the labels, together with the corresponding user IDs, are included.

    For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering, as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor contains data sensitive to them, and it is not possible to identify individuals.

    Description of the Data Files

    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
    Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent

    Collection Process

    Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.

    Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

    So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face. The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.

  17. Judgement Bias Data - Datasets - data.bris

    • data.bris.ac.uk
    Updated Feb 27, 2025
    Cite
    (2025). Judgement Bias Data - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/3hl3of08pc6432hubhr8nwesgj
    Explore at:
    Dataset updated
    Feb 27, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets from the study "An exploratory study of associations between judgement bias, demographic and behavioural characteristics, and detection task performance in medical detection dogs," including "Sample details" and "Judgement Bias Data" files. Complete download (zip, 33.7 KiB)

  18. Data from: Questioning Bias: Validating a Bias Crime Assessment Tool in...

    • catalog.data.gov
    • icpsr.umich.edu
    • +1more
    Updated Mar 12, 2025
    + more versions
    Cite
    National Institute of Justice (2025). Questioning Bias: Validating a Bias Crime Assessment Tool in California and New Jersey, 2016-2017 [Dataset]. https://catalog.data.gov/dataset/questioning-bias-validating-a-bias-crime-assessment-tool-in-california-and-new-jersey-2016-a062f
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    California, New Jersey
    Description

    These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study investigates experiences surrounding hate and bias crimes and incidents and reasons and factors affecting reporting and under-reporting among youth and adults in LGBT, immigrant, Hispanic, Black, and Muslim communities in New Jersey and Los Angeles County, California. The collection includes 1 SPSS data file (QB_FinalDataset-Revised.sav (n=1,326; 513 variables)). The collection also contains 24 qualitative data files of transcripts from focus groups and interviews with key informants, which are not included in this release.

  19. News Ninja Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Feb 20, 2024
    Cite
    anon; anon (2024). News Ninja Dataset [Dataset]. http://doi.org/10.5281/zenodo.10683029
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anon; anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About
    Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.

    General
    This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset includes sentences with binary bias labels (processed: biased or not biased) as well as the annotations of single players used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset neither identifies sub-populations nor contains information sensitive to them, and it is not possible to identify individuals.

    Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.

    Description of the Data Files
    This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:

    ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.

    AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), whether the game label and expert label match (Game VS Expert), whether differing labels are false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), whether expert and BABE labels match (Expert VS BABE), and whether the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).

    demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
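    A quick way to work with ExportNewsNinja.csv is to read it with the standard csv module and compare the player majority vote against the BABE ground truth. The column names below follow the description above; the sample rows themselves are illustrative stand-ins, not sentences from the dataset.

```python
import csv
import io

# Minimal stand-in for a few rows of ExportNewsNinja.csv, using the columns
# documented above (sentence, words, ground_Truth, majority_vote); the
# sentences and labels here are made up for illustration.
sample = io.StringIO(
    "sentence,words,ground_Truth,majority_vote\n"
    "Sentence A,loaded,1,1\n"
    "Sentence B,,0,0\n"
    "Sentence C,slanted,1,0\n"
)
rows = list(csv.DictReader(sample))

# Raw agreement between the player majority vote and the BABE ground truth
agreement = sum(r["majority_vote"] == r["ground_Truth"] for r in rows) / len(rows)
print(f"Game vs. BABE agreement: {agreement:.2f}")  # -> 0.67
```

    Replacing the in-memory sample with `open("ExportNewsNinja.csv")` would run the same comparison over the 370 re-annotated BABE sentences.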

    Collection Process
    Data was collected through interactions with the News Ninja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period ran from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.

    The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.

  20. Data from: Diversity matters: Robustness of bias measurements in Wikidata

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 1, 2023
    Cite
    Sai Keerthana Karnam (2023). Diversity matters: Robustness of bias measurements in Wikidata [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7881057
    Explore at:
    Dataset updated
    May 1, 2023
    Dataset provided by
    Paramita das
    Anirban Panda
    Sai Keerthana Karnam
    Animesh Mukherjee
    Bhanu Prakash Reddy Guda
    Soumya Sarkar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring those biases. However, such studies lack a systematic exploration of the sensitivity of the bias measurements to varying sources of data or the embedding algorithms used. To address this research gap, in this work we present a holistic analysis of bias measurement on knowledge graphs. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KGs can be altered by the specific algorithmic bias incorporated by KG embedding learning algorithms. Further, we show that the choice of state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations, irrespective of gender. We observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. We believe that this full-scale audit of the bias measurement pipeline will raise awareness among the community, derive insights related to the design choices of both data and algorithms, and help move away from the popular dogma of "one-size-fits-all".

Cite
Fabian Haak; Fabian Haak; Philipp Schaer; Philipp Schaer (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. http://doi.org/10.5281/zenodo.7682915

Data from: Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions

Explore at:
Available download formats: csv
Dataset updated
Mar 1, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Fabian Haak; Fabian Haak; Philipp Schaer; Philipp Schaer
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news aim to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles.
Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains more recent versions of the dataset with additional tags (such as the URL of the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). The "query_suggestion" and "rank" columns hold each suggested query and its position as returned by the search engine at the given time of search ("datetime"). We scraped our data from a US server, whose location is saved in "location".

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
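The root-query extension described above can be sketched in a few lines. This is an illustrative reconstruction of the enumeration step only, under the assumption that each root term plus its 26 single-letter extensions form the query inputs; the function name is ours, not from the Qbias tooling.

```python
import string

def expand_root_query(root_term):
    """Build the query inputs submitted to an autocomplete endpoint:
    the root term itself plus the root extended by each letter a-z,
    as described for suggestions.csv."""
    return [root_term] + [f"{root_term} {letter}" for letter in string.ascii_lowercase]

inputs = expand_root_query("democrats")
print(len(inputs))  # -> 27
print(inputs[1])    # -> democrats a
```

With ten suggestions requested per input, the 27 inputs per root term yield the up to 270 suggestions per topic and search engine mentioned above.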

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available in the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks, it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
