100+ datasets found
  1. NewsMediaBias-Plus Dataset

    • zenodo.org
    • huggingface.co
    bin, zip
    Updated Nov 29, 2024
    Cite
    Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NewsMediaBias-Plus Dataset

    Overview

    The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

    Dataset Description

    NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

    Contents

    • unique_id: Unique identifier for each news item; each unique_id is paired with the image for the same article.
    • outlet: The publisher of the article.
    • headline: The headline of the article.
    • article_text: The full content of the news article.
    • image_description: Description of the paired image.
    • image: The file path of the associated image.
    • date_published: The date the article was published.
    • source_url: The original URL of the article.
    • canonical_link: The canonical URL of the article.
    • new_categories: Categories assigned to the article.
    • news_categories_confidence_scores: Confidence scores for each category.

    Annotation Labels

    • text_label: Indicates the likelihood of the article being disinformation:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
    • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
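
    The two label fields above can be combined to slice the data, for example to isolate items flagged only when text and image are considered together. A minimal sketch, using a few hypothetical records that mimic the schema (the real records come from the Hugging Face dataset loaded in Getting Started):

```python
# Hypothetical records mimicking the dataset schema; the real records come
# from load_dataset("vector-institute/newsmediabias-plus").
records = [
    {"unique_id": "a1", "text_label": "Likely",   "multimodal_label": "Likely"},
    {"unique_id": "b2", "text_label": "Unlikely", "multimodal_label": "Likely"},
    {"unique_id": "c3", "text_label": "Unlikely", "multimodal_label": "Unlikely"},
]

# Items whose text alone was annotated as likely disinformation.
likely_text = [r for r in records if r["text_label"] == "Likely"]

# Items flagged only when text and image are considered together.
multimodal_only = [r for r in records
                   if r["multimodal_label"] == "Likely" and r["text_label"] == "Unlikely"]

print([r["unique_id"] for r in likely_text])      # ['a1']
print([r["unique_id"] for r in multimodal_only])  # ['b2']
```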

    Getting Started

    Prerequisites

    • Python 3.6+
    • Pandas
    • Hugging Face Datasets
    • Hugging Face Hub

    Installation

    Load the dataset into Python:

    python
    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus")
    print(ds)               # View structure and splits
    print(ds['train'][0])   # Access the first record of the train split
    print(ds['train'][:5])  # Access the first five records

    Load a Few Records

    python
    from datasets import load_dataset

    # Load the dataset in streaming mode
    streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

    # Get an iterable over the first five records
    dataset_iterable = streamed_dataset['train'].take(5)

    # Print the records
    for record in dataset_iterable:
        print(record)
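
    Since Pandas is listed as a prerequisite, a handful of streamed records can be collected into a DataFrame for quick inspection. A sketch, using hypothetical stand-in records (field names follow the Contents list; a real run would collect the dicts yielded by take(5)):

```python
import pandas as pd

# Hypothetical stand-in records; a real run would collect the dicts
# yielded by streamed_dataset['train'].take(5).
sample = [
    {"unique_id": "a1", "outlet": "Outlet A", "date_published": "2024-01-02"},
    {"unique_id": "b2", "outlet": "Outlet B", "date_published": "2024-03-15"},
]

df = pd.DataFrame(sample)
df["date_published"] = pd.to_datetime(df["date_published"])  # parse dates

print(df.dtypes)
print(df.groupby("outlet").size())  # articles per outlet
```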

    Contributions

    Contributions are welcome! You can:

    • Add Data: Contribute more data points.
    • Refine Annotations: Improve annotation accuracy.
    • Share Usage Examples: Help others use the dataset effectively.

    To contribute, fork the repository and create a pull request with your changes.

    License

    This dataset is released under a non-commercial license. See the LICENSE file for more details.

    Citation

    Please cite the dataset using this BibTeX entry:

    bibtex
    @misc{vector_institute_2024_newsmediabias_plus,
      title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
      author={Vector Institute Research Team},
      year={2024},
      url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
    }

    Contact

    For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

    Disclaimer and User Guidance

    Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

    Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.

  2. Data from: Improving short-term grade block models: alternative for...

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Cristina da Paixão Araújo; João Felipe Coimbra Leite Costa; Vanessa Cerqueira Koppe (2023). Improving short-term grade block models: alternative for correcting soft data [Dataset]. http://doi.org/10.6084/m9.figshare.5772303.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO journals
    Authors
    Cristina da Paixão Araújo; João Felipe Coimbra Leite Costa; Vanessa Cerqueira Koppe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract Short-term mining planning typically relies on samples obtained from channels or less-accurate sampling methods. The results may include larger sampling errors than those derived from diamond drill hole core samples. The aim of this paper is to evaluate the impact of the sampling error on grade estimation and propose a method of correcting the imprecision and bias in the soft data. In addition, this paper evaluates the benefits of using soft data in mining planning. These concepts are illustrated via a gold mine case study, where two different data types are presented. The study used Au grades collected via diamond drilling (hard data) and channels (soft data). Four methodologies were considered for estimation of the Au grades of each block to be mined: ordinary kriging with hard and soft data pooled without considering differences in data quality; ordinary kriging with only hard data; standardized ordinary kriging with pooled hard and soft data; and standardized ordinary cokriging. The results show that even biased samples collected using poor sampling protocols improve the estimates more than a limited number of precise and unbiased samples. A well-designed estimation method corrects the biases embedded in the samples, mitigating their propagation to the block model.

  3. Data from: Reliable species distributions are obtainable with sparse, patchy...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    Updated May 30, 2022
    Cite
    Samantha L. Peel; Nicole A. Hill; Scott D. Foster; Simon J. Wotherspoon; Claudio Ghiglione; Stefano Schiaparelli (2022). Data from: Reliable species distributions are obtainable with sparse, patchy and biased data by leveraging over species and data types [Dataset]. http://doi.org/10.5061/dryad.2226v8m
    Explore at:
    Dataset updated
    May 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Samantha L. Peel; Nicole A. Hill; Scott D. Foster; Simon J. Wotherspoon; Claudio Ghiglione; Stefano Schiaparelli
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    1. New methods for species distribution models (SDMs) utilise presence‐absence (PA) data to correct the sampling bias of presence‐only (PO) data in a spatial point process setting. These have been shown to improve species estimates when both data sets are large and dense. However, is a PA data set that is smaller and patchier than hitherto examined able to do the same? Furthermore, when both data sets are relatively small, is there enough information contained within them to produce a useful estimate of species' distributions? These attributes are common in many applications.

    2. A stochastic simulation was conducted to assess the ability of a pooled data SDM to estimate the distribution of species from increasingly sparser and patchier data sets. The simulated data sets were varied by changing the number of presence‐absence sample locations, the degree of patchiness of these locations, the number of PO observations, and the level of sampling bias within the PO observations. The performance of the pooled data SDM was compared to a PA SDM and a PO SDM to assess the strengths and limitations of each SDM.

    3. The pooled data SDM successfully removed the sampling bias from the PO observations even when the presence‐absence data was sparse and patchy, and the PO observations formed the majority of the data. The pooled data SDM was, in general, more accurate and more precise than either the PA SDM or the PO SDM. All SDMs were more precise for the species responses than they were for the covariate coefficients.

    4. The emerging SDM methodology that pools PO and PA data will facilitate more certainty around species' distribution estimates, which in turn will allow more relevant and concise management and policy decisions to be enacted. This work shows that it is possible to achieve this result even in relatively data‐poor regions.

  4. Data from: Citizen science can complement professional invasive plant...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 11, 2024
    Cite
    Monica Dimson (2024). Citizen science can complement professional invasive plant surveys and improve estimates of suitable habitat [Dataset]. http://doi.org/10.5068/D1769Q
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    University of California, Los Angeles
    Authors
    Monica Dimson
    License

    CC0 1.0 Universal (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Aim: Citizen science is a cost-effective potential source of invasive species occurrence data. However, data quality issues due to unstructured sampling approaches may discourage the use of these observations by science and conservation professionals. This study explored the utility of low-structure iNaturalist citizen science data in invasive plant monitoring. We first examined the prevalence of invasive taxa in iNaturalist plant observations and sampling biases associated with those data. Using four invasive species as examples, we then compared iNaturalist and professional agency observations and used the two datasets to model suitable habitat for each species.

    Location: Hawaiʻi, USA

    Methods: To estimate the prevalence of invasive plant data, we compared the number of species and observations recorded in iNaturalist to botanical checklists for Hawaiʻi. Sampling bias was quantified along gradients of site accessibility, protective status, and vegetation disturbance using a bias index. Habitat suitability for four invasive species was modeled in Maxent, using observations from iNaturalist, professional agencies, and stratified subsets of iNaturalist data.

    Results: iNaturalist plant observations were biased toward invasive species, which were frequently recorded in areas with higher road/trail density and vegetation disturbance. Professional observations of four example invasive species tended to occur in less accessible, native-dominated sites. Habitat suitability models based on iNaturalist versus professional data showed moderate overlap and different distributions of suitable habitat across vegetation disturbance classes. Stratifying iNaturalist observations had little effect on how suitable habitat was distributed for the species modeled in this study.

    Main conclusions: Opportunistic iNaturalist observations have the potential to complement and expand professional invasive plant monitoring, which we found was often affected by inverse sampling biases. Invasive species represented a high proportion of iNaturalist plant observations, and were recorded in environments that were not captured by professional surveys. Combining the datasets thus led to more comprehensive estimates of suitable habitat.

  5. Cross-cultural differences in biased cognition - Pilot task data - Dataset -...

    • b2find.eudat.eu
    Updated Mar 30, 2014
    Cite
    (2014). Cross-cultural differences in biased cognition - Pilot task data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/545c1da9-93df-58a4-9cb2-41c48d1170cb
    Explore at:
    Dataset updated
    Mar 30, 2014
    Description

    This data collection consists of pilot data measuring task equivalence for measures of attention and interpretation bias. Congruent Mandarin and English emotional Stroop, attention probe (both measuring attention bias), similarity ratings task and scrambled sentence task (both measuring interpretation bias) were developed using back-translation and decentering procedures. Tasks were then completed by 47 bilingual Mandarin-English speakers. Presented are data detailing personal characteristics, task scores and bias scores.

    The way in which we process information in the world around us has a significant effect on our health and well being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and assume the worst, when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

    Participants: Fluent bilingual Mandarin and English speakers, aged 16-65 with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.
    Sampling procedure: Participants were recruited using circular emails which are sent to all university staff and students as well as through flyers around campuses. Relevant societies and language schools in central London were also contacted.
    Data collection: Participants completed four cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in both English and Mandarin. Order of language presentation and task presentation were counterbalanced.

  6. Data from: Bivariate Analysis of Distribution Functions Under Biased...

    • tandf.figshare.com
    txt
    Updated Apr 17, 2024
    Cite
    Hsin-wen Chang; Shu-Hsiang Wang (2024). Bivariate Analysis of Distribution Functions Under Biased Sampling [Dataset]. http://doi.org/10.6084/m9.figshare.23998414.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Hsin-wen Chang; Shu-Hsiang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article compares distribution functions among pairs of locations in their domains, in contrast to the typical approach of univariate comparison across individual locations. This bivariate approach is studied in the presence of sampling bias, which has been gaining attention in COVID-19 studies that over-represent more symptomatic people. In cases with either known or unknown sampling bias, we introduce Anderson–Darling-type tests based on both the univariate and bivariate formulation. A simulation study shows the superior performance of the bivariate approach over the univariate one. We illustrate the proposed methods using real data on the distribution of the number of symptoms suggestive of COVID-19.

  7. Replication data for: Testing for Publication Bias in Political Science

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 20, 2023
    Cite
    Alan Gerber; Donald Green; David Nickerson (2023). Replication data for: Testing for Publication Bias in Political Science [Dataset]. http://doi.org/10.7910/DVN/DQC9KV
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Alan Gerber; Donald Green; David Nickerson
    Description

    If the publication decisions of journals are a function of the statistical significance of research findings, the published literature may suffer from “publication bias.” This paper describes a method for detecting publication bias. We point out that to achieve statistical significance, the effect size must be larger in small samples. If publications tend to be biased against statistically insignificant results, we should observe that the effect size diminishes as sample sizes increase. This proposition is tested and confirmed using the experimental literature on voter mobilization.
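
    The paper's diagnostic lends itself to a quick numerical sketch. Under the proposition above, reported effect sizes should shrink roughly in proportion to 1/√n; the numbers below are illustrative, not the paper's data:

```python
import numpy as np

# Illustrative (hypothetical) study results: sample sizes and the effect
# sizes those studies reported. Not the paper's data.
n = np.array([50, 100, 200, 400, 800, 1600])
effect = np.array([0.90, 0.62, 0.45, 0.33, 0.21, 0.16])

# The effect needed to reach significance scales roughly with 1/sqrt(n),
# so regress reported effect on 1/sqrt(n): a strong positive slope is
# consistent with selection on statistical significance.
slope, intercept = np.polyfit(1 / np.sqrt(n), effect, 1)
print(f"slope = {slope:.2f}")  # positive under publication bias
```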

  8. Replication Data (A) for 'Biased Programmers or Biased Data?': Individual...

    • dataverse.harvard.edu
    Updated Sep 2, 2020
    Cite
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau (2020). Replication Data (A) for 'Biased Programmers or Biased Data?': Individual Measures of Numeracy, Literacy and Problem Solving Skill -- and Biographical Data -- for a Representative Sample of 200K OECD Residents [Dataset]. http://doi.org/10.7910/DVN/JAJ3CP
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP

    Description

    This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies. The data contains individual person-measures of several basic skills including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by a representative sample of all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, size, number and quality of biographical features per subject and representativeness of the population at large.

  9. Data from: Robust inference under r-size-biased sampling without replacement...

    • tandf.figshare.com
    xlsx
    Updated Nov 28, 2023
    Cite
    P. Economou; G. Tzavelas; A. Batsidis (2023). Robust inference under r-size-biased sampling without replacement from finite population [Dataset]. http://doi.org/10.6084/m9.figshare.11542974.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    P. Economou; G. Tzavelas; A. Batsidis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The case of size-biased sampling of known order from a finite population without replacement is considered. The behavior of such a sampling scheme is studied with respect to the sampling fraction. Based on a simulation study, it is concluded that such a sample cannot be treated either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution; as the sampling fraction increases, the bias in the sample decreases, resulting in a transition from an r-size-biased sample to a random sample. A modified version of a likelihood-free method is adopted for making statistical inference for the unknown population parameters, as well as for the size of the population when it is unknown. A simulation study, which takes the sampling fraction into consideration, demonstrates that the proposed method presents better and more robust behavior compared to the approaches which treat the r-size-biased sample either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution. Finally, a numerical example which motivates this study illustrates our results.
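
    For intuition, size-biased sampling without replacement (the r = 1 case) can be sketched as repeated draws with probability proportional to size among the remaining units; the population below is illustrative:

```python
import random

# Illustrative population of unit sizes; with r = 1 each draw picks a
# remaining unit with probability proportional to its size, so large
# units are over-represented relative to simple random sampling.
random.seed(1)
population = [2.0, 5.0, 1.0, 8.0, 3.0]

sample, remaining = [], list(population)
for _ in range(3):
    pick = random.choices(remaining, weights=remaining, k=1)[0]
    sample.append(pick)
    remaining.remove(pick)

print(sample)  # three distinct units, larger sizes favored
```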

  10. Replication data for: Looking Beyond Demographics: Panel Attrition in the...

    • dataverse.harvard.edu
    Updated Oct 1, 2014
    Cite
    Harvard Dataverse (2014). Replication data for: Looking Beyond Demographics: Panel Attrition in the ANES and GSS [Dataset]. http://doi.org/10.7910/DVN/RRDHGR
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 1, 2014
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2006 - 2010
    Area covered
    United States
    Description

    Longitudinal or panel surveys offer unique benefits for social science research, but they typically suffer from attrition, which reduces sample size and can result in biased inferences. Previous research tends to focus on the demographic predictors of attrition, conceptualizing attrition propensity as a stable, individual-level characteristic—some individuals (e.g., young, poor, residentially mobile) are more likely to drop out of a study than others. We argue that panel attrition reflects both the characteristics of the individual respondent as well as her survey experience, a factor shaped by the design and implementation features of the study. In this paper, we examine and compare the predictors of panel attrition in the 2008-2009 American National Election Study, an online panel, and the 2006-2010 General Social Survey, a face-to-face panel. In both cases, survey experience variables are predictive of panel attrition above and beyond the standard demographic predictors, but the particular measures of relevance differ across the two surveys. The findings inform statistical corrections for panel attrition bias and provide study design insights for future panel data collections.

  11. Opinion on political bias in news U.S. 2022, by political affiliation

    • statista.com
    Updated Sep 6, 2023
    Cite
    Statista (2023). Opinion on political bias in news U.S. 2022, by political affiliation [Dataset]. https://www.statista.com/statistics/802278/opinion-extent-political-bias-news-coverage-us-political-affiliation/
    Explore at:
    Dataset updated
    Sep 6, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 31, 2022 - Jul 21, 2022
    Area covered
    United States
    Description

    According to a survey conducted in the United States in summer 2022, 79 percent of Republican respondents felt that news coverage had a great deal of political bias, making these voters the most likely to hold this opinion of the news media. Independents also felt strongly about this issue, whereas only 33 percent of Democrats said they saw a great deal of political bias in news.

    How politics affects news consumption

    Political bias in news can alienate consumers and may also be poorly received when coverage of a non-political topic leans too heavily towards one end of the spectrum. However, at the same time, personal politics in general are often closely interlinked with how a consumer perceives or engages with news and information. A clear example of this can be found when looking at political news sources used weekly in the U.S., with Republicans and Democrats opting for the national networks they most identify with. But what if audiences cannot find the content they want?

    A change in behavior

    Engaging with news aligning with one’s politics is not uncommon. That said, perceived bias in mainstream media may lead some consumers to look elsewhere and turn away from more “neutral” outlets if they believe the news is no longer impartial. Data shows that a number of leading conservative websites registered a substantial increase in visitors year over year. Looking at this data in context of Republicans’ concern about bias in political news, it is likely that this trend will continue and consumers will pursue outlets they feel resonate with them most.

  12. Data from: Sampling schemes and drift can bias admixture proportions...

    • datadryad.org
    zip
    Updated Jul 27, 2020
    Cite
    Ken Toyama; Pierre-André Crochet; Raphaël Leblois (2020). Sampling schemes and drift can bias admixture proportions inferred by STRUCTURE [Dataset]. http://doi.org/10.5061/dryad.gf1vhhmkw
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2020
    Dataset provided by
    Dryad
    Authors
    Ken Toyama; Pierre-André Crochet; Raphaël Leblois
    Time period covered
    May 20, 2020
    Description

    The interbreeding of individuals coming from genetically differentiated but incompletely isolated populations can lead to the formation of admixed populations, having important implications in ecology and evolution. In this simulation study, we evaluate how individual admixture proportions estimated by the software structure are quantitatively affected by different factors. Using various scenarios of admixture between two diverging populations, we found that unbalanced sampling from parental populations may seriously bias the inferred admixture proportions; moreover, proportionally large samples from the admixed population can also decrease the accuracy and precision of the inferences. As expected, weak differentiation between parental populations and drift after the admixture event strongly increase the biases caused by uneven sampling. We also show that admixture proportions are generally more biased when parental populations unequally contributed to the admixed population. Finally, w...

  13. Replication data for: Selection Bias and Continuous-Time Duration Models:...

    • dataverse.harvard.edu
    Updated Jan 21, 2009
    Cite
    Frederick J. Boehmke; Daniel S. Morey; Megan Shannon (2009). Replication data for: Selection Bias and Continuous-Time Duration Models: Consequences and a Proposed Solution [Dataset]. http://doi.org/10.7910/DVN/DUW1FA
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 21, 2009
    Dataset provided by
    Harvard Dataverse
    Authors
    Frederick J. Boehmke; Daniel S. Morey; Megan Shannon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This article analyzes the consequences of nonrandom sample selection for continuous-time duration analyses and develops a new estimator to correct for it when necessary. We conduct a series of Monte Carlo analyses that estimate common duration models as well as our proposed duration model with selection. These simulations show that ignoring sample selection issues can lead to biased parameter estimates, including the appearance of (nonexistent) duration dependence. In addition, our proposed estimator is found to be superior in root mean-square error terms when nontrivial amounts of selection are present. Finally, we provide an empirical application of our method by studying whether self-selectivity is a problem for studies of leaders' survival during and following militarized conflicts.

  14. Resampling methods.

    • plos.figshare.com
    bin
    Updated Jul 27, 2023
    + more versions
    Cite
    Annie Kim; Inkyung Jung (2023). Resampling methods. [Dataset]. http://doi.org/10.1371/journal.pone.0288540.t001
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Annie Kim; Inkyung Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class imbalance is a major problem in classification, wherein the decision boundary is easily biased toward the majority class. A data-level solution (resampling) is one possible solution to this problem. However, several studies have shown that resampling methods can deteriorate the classification performance. This is because of the overgeneralization problem, which occurs when samples produced by the oversampling technique that should be represented in the minority class domain are introduced into the majority-class domain. This study shows that the overgeneralization problem is aggravated in complex data settings and introduces two alternate approaches to mitigate it. The first approach involves incorporating a filtering method into oversampling. The second approach is to apply undersampling. The main objective of this study is to provide guidance on selecting optimal resampling methods in imbalanced and complex datasets to improve classification performance. Simulation studies and real data analyses were performed to compare the resampling results in various scenarios with different complexities, imbalances, and sample sizes. In the case of noncomplex datasets, undersampling was found to be optimal. However, in the case of complex datasets, applying a filtering method to delete misallocated examples was optimal. In conclusion, this study can aid researchers in selecting the optimal method for resampling complex datasets.
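
    The undersampling remedy described above can be sketched in a few lines; the labels below are synthetic, and a real pipeline would resample features and labels together:

```python
import random

# Synthetic imbalanced data: 90 majority-class (0) and 10 minority-class (1)
# examples, as (id, label) pairs.
random.seed(0)
data = [(f"x{i}", 0) for i in range(90)] + [(f"y{i}", 1) for i in range(10)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Random undersampling: shrink the majority class to the minority's size,
# so the decision boundary is no longer dominated by the majority class.
balanced = minority + random.sample(majority, len(minority))
random.shuffle(balanced)

print(len(balanced))  # 20 examples, 10 per class
```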

  15. Biased cognition in East Asian and Western Cultures: Behavioural data...

    • b2find.eudat.eu
    Updated Mar 30, 2014
    Cite
    (2014). Biased cognition in East Asian and Western Cultures: Behavioural data 2016-2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b985488d-0ccb-51dc-81a1-1829b5df68e4
    Explore at:
    Dataset updated
    Mar 30, 2014
    Description

    This data collection consists of behavioural task data for measures of attention and interpretation bias, specifically: emotional Stroop and attention probe (both measuring attention bias), and similarity ratings task and scrambled sentence task (both measuring interpretation bias). Data on the following six participant groups are included in the dataset: native UK (n=36), native HK (n=39), UK migrants to HK (short term = 31, long term = 28) and HK migrants to UK (short term = 37, long term = 31). Also included are personal characteristics and questionnaire measures.

    The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past, and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations, and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

    Participants: Local Hong Kong and UK natives, and short-term and long-term migrants in each country, aged 16-65 with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

    Sampling procedure: Participants were recruited using circular emails, public flyers and other advertisements in local venues, universities and clubs.

    Data collection: Participants completed four previously developed and validated cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in their native language. They also completed socio-demographic information and questionnaires.

  16. Data from: Integrated population models: bias and inference

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin
    Updated Jun 1, 2022
    Cite
    Thomas V. Riecke; Perry J. Williams; Tessa L. Behnke; Daniel Gibson; Alan G. Leach; Benjamin S. Sedinger; Phillip A. Street; James S. Sedinger; Thomas V. Riecke; Perry J. Williams; Tessa L. Behnke; Daniel Gibson; Alan G. Leach; Benjamin S. Sedinger; Phillip A. Street; James S. Sedinger (2022). Data from: Integrated population models: bias and inference [Dataset]. http://doi.org/10.5061/dryad.fd28113
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas V. Riecke; Perry J. Williams; Tessa L. Behnke; Daniel Gibson; Alan G. Leach; Benjamin S. Sedinger; Phillip A. Street; James S. Sedinger; Thomas V. Riecke; Perry J. Williams; Tessa L. Behnke; Daniel Gibson; Alan G. Leach; Benjamin S. Sedinger; Phillip A. Street; James S. Sedinger
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1. Integrated population models (hereafter, IPMs) have become increasingly popular for the modeling of populations, as investigators seek to combine survey and demographic data to understand processes governing population dynamics. These models are particularly useful for identifying and exploring knowledge gaps within datasets, because they allow investigators to estimate biologically meaningful parameters, such as immigration and reproduction, that are uninformed by data. As IPMs have been developed relatively recently, model behavior remains relatively poorly understood. Much attention has been paid to parameter behavior such as parameter estimates near boundaries, as well as the consequences of dependent datasets. However, the movement of bias among parameters remains underexamined, particularly when models include parameters that are estimated without data.

    2. To examine the distribution of bias among model parameters, we simulated stable populations closed to immigration and emigration. We simulated two scenarios that might induce bias into survival estimates: marker-induced bias in the capture-mark-recapture data, and heterogeneity in the mortality process. We subsequently ran appropriate capture-mark-recapture, state-space, and fecundity models, as well as integrated population models.

    3. Simulation results suggest that when sampling bias exists in datasets, parameters that are not informed by data are extremely susceptible to bias. For example, in the presence of marker effects on survival of 0.1, estimates of immigration rate from an integrated population model were biased high (0.09). When heterogeneity in the mortality process was simulated, inducing bias in estimates of adult (-0.04) and juvenile (-0.097) survival rates, estimates of fecundity were biased by 46.2%.

    4. We believe our results have important implications for biological inference when using integrated population models, as well as future model development and implementation. Specifically, parameters that are estimated without data absorb ~90% of the bias in integrated modelling frameworks. We suggest that investigators interpret posterior distributions of these parameters as a combination of biological process and systematic bias.

  17. Data and Code for The Short- and the Long-Run Impact of Gender-Biased...

    • openicpsr.org
    Updated Sep 4, 2022
    Cite
    Victor Lavy; Rigissa Megalokonomou (2022). Data and Code for The Short- and the Long-Run Impact of Gender-Biased Teachers [Dataset]. http://doi.org/10.3886/E179241V1
    Explore at:
    Dataset updated
    Sep 4, 2022
    Dataset provided by
    American Economic Association
    Authors
    Victor Lavy; Rigissa Megalokonomou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2002 - 2012
    Area covered
    Greece
    Description

    We examine the persistence of teachers' gender biases by following teachers over time in different classes. We find a very high correlation of gender biases for teachers across their classes. Based on out-of-sample measures of these biases, we estimate the substantial effects of these biases on students' performance in university admission exams, choice of university field of study, and quality of the enrolled program. The effects on university choice outcomes are larger for girls, explaining some gender differences in STEM majors. Part of these effects, which are more prevalent among less effective teachers, are mediated by changing school attendance. These are the data that produce the results found in the related paper.

  18. Bias and fact-checking in news in the U.S. 2022

    • statista.com
    Updated Nov 22, 2024
    Cite
    Statista (2024). Bias and fact-checking in news in the U.S. 2022 [Dataset]. https://www.statista.com/statistics/874821/news-media-bias-perceptions/
    Explore at:
    Dataset updated
    Nov 22, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 31, 2022 - Jul 21, 2022
    Area covered
    United States
    Description

    A survey from July 2022 asked Americans how they felt about the effects of bias in news on their ability to sort out facts, and revealed that 50 percent felt there was so much bias in the news that it was difficult to discern what was factual from information that was not. This was the highest share who said so across all years shown, and at the same time, the 2022 survey showed the lowest share of respondents who believed there were enough sources to be able to sort out fact from fiction.

  19. md_gender_bias

    • huggingface.co
    • opendatalab.com
    Updated Mar 26, 2021
    Cite
    AI at Meta (2021). md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
    Explore at:
    Dataset updated
    Mar 26, 2021
    Dataset authored and provided by
    AI at Meta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.

  20. Replication data for: Split-Sample Instrumental Variables Estimates of the...

    • dataverse.harvard.edu
    Updated Jan 21, 2009
    Cite
    Joshua D. Angrist (2009). Replication data for: Split-Sample Instrumental Variables Estimates of the Return to Schooling [Dataset]. http://doi.org/10.7910/DVN/6LX9OE
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 21, 2009
    Dataset provided by
    Harvard Dataverse
    Authors
    Joshua D. Angrist
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This article reevaluates recent instrumental variables (IV) estimates of the returns to schooling in light of the fact that two-stage least squares is biased in the same direction as ordinary least squares (OLS) even in very large samples. We propose a split-sample instrumental variables (SSIV) estimator that is not biased toward OLS. SSIV uses one-half of a sample to estimate parameters of the first-stage equation. Estimated first-stage parameters are then used to construct fitted values and second-stage parameter estimates in the other half sample. SSIV is biased toward 0, but this bias can be corrected. The split-sample estimators confirm and reinforce some previous findings on the returns to schooling but fail to confirm others.
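    The split-sample idea can be illustrated with a small simulation. The sketch below is a hypothetical NumPy example (the coefficients, sample size, and variable names are assumptions for illustration, not taken from the paper): the first-stage slope is estimated on one half of the sample, and fitted values built from that estimate drive the second stage on the other half.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10000
    z = rng.normal(size=n)                # instrument
    u = rng.normal(size=n)                # unobserved confounder
    x = 0.5 * z + u + rng.normal(size=n)  # endogenous regressor ("schooling")
    y = 2.0 * x + u + rng.normal(size=n)  # outcome; true coefficient is 2.0

    # Split the sample into two halves
    half = n // 2
    z_a, x_a = z[:half], x[:half]
    z_b, y_b = z[half:], y[half:]

    # First stage estimated on half A only
    pi_hat = np.sum(z_a * x_a) / np.sum(z_a ** 2)

    # Fitted values constructed in half B, then the second stage
    x_hat = pi_hat * z_b
    beta_ssiv = np.sum(x_hat * y_b) / np.sum(x_hat ** 2)
    print(beta_ssiv)  # close to the true coefficient of 2.0
    ```

    Because the first-stage estimate comes from an independent half of the data, sampling noise in `pi_hat` attenuates the estimate toward 0 rather than toward OLS, which is the bias direction the abstract describes.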

Cite
Shaina Raza; Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155

NewsMediaBias-Plus Dataset

Explore at:
Available download formats: bin, zip
Dataset updated
Nov 29, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shaina Raza; Shaina Raza
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

NewsMediaBias-Plus Dataset

Overview

The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

Dataset Description

NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

Contents

  • unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
  • outlet: The publisher of the article.
  • headline: The headline of the article.
  • article_text: The full content of the news article.
  • image_description: Description of the paired image.
  • image: The file path of the associated image.
  • date_published: The date the article was published.
  • source_url: The original URL of the article.
  • canonical_link: The canonical URL of the article.
  • new_categories: Categories assigned to the article.
  • news_categories_confidence_scores: Confidence scores for each category.

Annotation Labels

  • text_label: Indicates the likelihood of the article being disinformation:

    • Likely: Likely to be disinformation.
    • Unlikely: Unlikely to be disinformation.
  • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

    • Likely: Likely to be disinformation.
    • Unlikely: Unlikely to be disinformation.
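As a sketch of how these two annotation fields might be used together, the snippet below filters hypothetical records by label (the record values here are made up for illustration and are not from the dataset):

```python
# Hypothetical records mirroring the annotation fields described above
records = [
    {"unique_id": "a1", "text_label": "Likely",   "multimodal_label": "Likely"},
    {"unique_id": "a2", "text_label": "Unlikely", "multimodal_label": "Likely"},
    {"unique_id": "a3", "text_label": "Unlikely", "multimodal_label": "Unlikely"},
]

# Keep items flagged as likely disinformation by either annotation
flagged = [r["unique_id"] for r in records
           if "Likely" in (r["text_label"], r["multimodal_label"])]
print(flagged)  # ['a1', 'a2']
```

Note the exact-equality check against the tuple of labels: a substring test would wrongly match "Unlikely" as well.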

Getting Started

Prerequisites

  • Python 3.6+
  • Pandas
  • Hugging Face Datasets
  • Hugging Face Hub

Installation

Load the dataset into Python:

python

from datasets import load_dataset

ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds)               # View structure and splits
print(ds['train'][0])   # Access the first record of the train split
print(ds['train'][:5])  # Access the first five records

Load a Few Records

python

from datasets import load_dataset

# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

# Take the first five records from the train split
dataset_iterable = streamed_dataset['train'].take(5)

# Print the records
for record in dataset_iterable:
    print(record)

Contributions

Contributions are welcome! You can:

  • Add Data: Contribute more data points.
  • Refine Annotations: Improve annotation accuracy.
  • Share Usage Examples: Help others use the dataset effectively.

To contribute, fork the repository and create a pull request with your changes.

License

This dataset is released under a non-commercial license. See the LICENSE file for more details.

Citation

Please cite the dataset using this BibTeX entry:

bibtex

@misc{vector_institute_2024_newsmediabias_plus,
  title  = {NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
  author = {Vector Institute Research Team},
  year   = {2024},
  url    = {https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}

Contact

For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

Disclaimer and User Guidance

Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
