Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
outlet: The publisher of the article.
headline: The headline of the article.
article_text: The full content of the news article.
image_description: Description of the paired image.
image: The file path of the associated image.
date_published: The date the article was published.
source_url: The original URL of the article.
canonical_link: The canonical URL of the article.
new_categories: Categories assigned to the article.
news_categories_confidence_scores: Confidence scores for each category.
text_label: Indicates the likelihood of the article being disinformation:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
    print(record)
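The same streaming API can be chained for quick filtering. A minimal sketch, assuming the text_label values are the strings "Likely" and "Unlikely" as documented above:
from datasets import load_dataset
# Stream the dataset and keep only records annotated as likely disinformation
streamed = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
likely_only = streamed['train'].filter(lambda r: r['text_label'] == 'Likely')
for record in likely_only.take(3):
    print(record['outlet'], '-', record['headline'])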
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Short-term mining planning typically relies on samples obtained from channels or less-accurate sampling methods. The results may include larger sampling errors than those derived from diamond drill hole core samples. The aim of this paper is to evaluate the impact of the sampling error on grade estimation and propose a method of correcting the imprecision and bias in the soft data. In addition, this paper evaluates the benefits of using soft data in mining planning. These concepts are illustrated via a gold mine case study, where two different data types are presented. The study used Au grades collected via diamond drilling (hard data) and channels (soft data). Four methodologies were considered for estimation of the Au grades of each block to be mined: ordinary kriging with hard and soft data pooled without considering differences in data quality; ordinary kriging with only hard data; standardized ordinary kriging with pooled hard and soft data; and standardized ordinary cokriging. The results show that even biased samples collected using poor sampling protocols improve the estimates more than a limited number of precise and unbiased samples. A well-designed estimation method corrects the biases embedded in the samples, mitigating their propagation to the block model.
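To make the comparison concrete, here is a minimal sketch of two of the four set-ups (hard data only versus naively pooled hard and soft data) using the third-party pykrige package. The coordinates, grades, and variogram choice are all hypothetical, and the standardized (co)kriging variants the paper proposes are not shown:
import numpy as np
from pykrige.ok import OrdinaryKriging  # pip install pykrige

rng = np.random.default_rng(0)
# Hypothetical Au grades: few precise drill-core samples (hard), many noisier channel samples (soft)
hard_xy = rng.uniform(0, 100, (30, 2));  hard_au = rng.lognormal(0.5, 0.4, 30)
soft_xy = rng.uniform(0, 100, (200, 2)); soft_au = rng.lognormal(0.7, 0.6, 200)
targets = np.mgrid[5:100:10, 5:100:10].reshape(2, -1).T  # block centroids to estimate

# Ordinary kriging with hard data only
ok_hard = OrdinaryKriging(hard_xy[:, 0], hard_xy[:, 1], hard_au, variogram_model="spherical")
est_hard, _ = ok_hard.execute("points", targets[:, 0], targets[:, 1])

# Ordinary kriging with hard and soft data pooled, ignoring quality differences
xy = np.vstack([hard_xy, soft_xy]); au = np.concatenate([hard_au, soft_au])
ok_pool = OrdinaryKriging(xy[:, 0], xy[:, 1], au, variogram_model="spherical")
est_pool, _ = ok_pool.execute("points", targets[:, 0], targets[:, 1])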
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
New methods for species distribution models (SDMs) utilise presence‐absence (PA) data to correct the sampling bias of presence‐only (PO) data in a spatial point process setting. These have been shown to improve species estimates when both data sets are large and dense. However, is a PA data set that is smaller and patchier than hitherto examined able to do the same? Furthermore, when both data sets are relatively small, is there enough information contained within them to produce a useful estimate of species' distributions? These attributes are common in many applications.
A stochastic simulation was conducted to assess the ability of a pooled data SDM to estimate the distribution of species from increasingly sparser and patchier data sets. The simulated data sets were varied by changing the number of presence‐absence sample locations, the degree of patchiness of these locations, the number of PO observations, and the level of sampling bias within the PO observations. The performance of the pooled data SDM was compared to a PA SDM and a PO SDM to assess the strengths and limitations of each SDM.
The pooled data SDM successfully removed the sampling bias from the PO observations even when the presence‐absence data was sparse and patchy, and the PO observations formed the majority of the data. The pooled data SDM was, in general, more accurate and more precise than either the PA SDM or the PO SDM. All SDMs were more precise for the species responses than they were for the covariate coefficients.
The emerging SDM methodology that pools PO and PA data will facilitate more certainty around species' distribution estimates, which in turn will allow more relevant and concise management and policy decisions to be enacted. This work shows that it is possible to achieve this result even in relatively data‐poor regions.
https://spdx.org/licenses/CC0-1.0.html
Aim: Citizen science is a cost-effective potential source of invasive species occurrence data. However, data quality issues due to unstructured sampling approaches may discourage the use of these observations by science and conservation professionals. This study explored the utility of low-structure iNaturalist citizen science data in invasive plant monitoring. We first examined the prevalence of invasive taxa in iNaturalist plant observations and sampling biases associated with those data. Using four invasive species as examples, we then compared iNaturalist and professional agency observations and used the two datasets to model suitable habitat for each species.

Location: Hawaiʻi, USA

Methods: To estimate the prevalence of invasive plant data, we compared the number of species and observations recorded in iNaturalist to botanical checklists for Hawaiʻi. Sampling bias was quantified along gradients of site accessibility, protective status, and vegetation disturbance using a bias index. Habitat suitability for four invasive species was modeled in Maxent, using observations from iNaturalist, professional agencies, and stratified subsets of iNaturalist data.

Results: iNaturalist plant observations were biased toward invasive species, which were frequently recorded in areas with higher road/trail density and vegetation disturbance. Professional observations of four example invasive species tended to occur in less accessible, native-dominated sites. Habitat suitability models based on iNaturalist versus professional data showed moderate overlap and different distributions of suitable habitat across vegetation disturbance classes. Stratifying iNaturalist observations had little effect on how suitable habitat was distributed for the species modeled in this study.

Main conclusions: Opportunistic iNaturalist observations have the potential to complement and expand professional invasive plant monitoring, which we found was often affected by inverse sampling biases. Invasive species represented a high proportion of iNaturalist plant observations, and were recorded in environments that were not captured by professional surveys. Combining the datasets thus led to more comprehensive estimates of suitable habitat.
This data collection consists of pilot data measuring task equivalence for measures of attention and interpretation bias. Congruent Mandarin and English emotional Stroop, attention probe (both measuring attention bias), similarity ratings task and scrambled sentence task (both measuring interpretation bias) were developed using back-translation and decentering procedures. Tasks were then completed by 47 bilingual Mandarin-English speakers. Presented are data detailing personal characteristics, task scores and bias scores.

The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

Participants: Fluent bilingual Mandarin and English speakers, aged 16-65, with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

Sampling procedure: Participants were recruited using circular emails sent to all university staff and students, as well as through flyers around campuses. Relevant societies and language schools in central London were also contacted.

Data collection: Participants completed four cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in both English and Mandarin. Order of language presentation and task presentation were counterbalanced.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article compares distribution functions among pairs of locations in their domains, in contrast to the typical approach of univariate comparison across individual locations. This bivariate approach is studied in the presence of sampling bias, which has been gaining attention in COVID-19 studies that over-represent more symptomatic people. In cases with either known or unknown sampling bias, we introduce Anderson–Darling-type tests based on both the univariate and bivariate formulation. A simulation study shows the superior performance of the bivariate approach over the univariate one. We illustrate the proposed methods using real data on the distribution of the number of symptoms suggestive of COVID-19.
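The article's bivariate, bias-adjusted statistics are not available in standard libraries, but the classical k-sample Anderson-Darling test they build on is. A minimal sketch with hypothetical symptom-count samples from two locations:
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(1)
loc_a = rng.poisson(2.0, 300)  # sample from one location
loc_b = rng.poisson(2.6, 300)  # sample from another, e.g. over-representing symptomatic people
# Tests the null hypothesis that both samples come from the same distribution
result = anderson_ksamp([loc_a, loc_b])
print(result.statistic, result.significance_level)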
If the publication decisions of journals are a function of the statistical significance of research findings, the published literature may suffer from “publication bias.” This paper describes a method for detecting publication bias. We point out that to achieve statistical significance, the effect size must be larger in small samples. If publications tend to be biased against statistically insignificant results, we should observe that the effect size diminishes as sample sizes increase. This proposition is tested and confirmed using the experimental literature on voter mobilization.
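A toy simulation of that detection logic, with all numbers hypothetical: if journals keep only significant estimates, the surviving effect sizes must shrink as sample size grows, so regressing published effects on 1/sqrt(n) yields a clearly positive slope:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = rng.integers(50, 5000, 2000)                     # sample sizes of 2000 hypothetical studies
est = 0.05 + rng.normal(0, 1, n.size) / np.sqrt(n)   # estimated effects around a small true effect
se = 1 / np.sqrt(n)
published = np.abs(est / se) > 1.96                  # publication filter: significant results only

X = sm.add_constant(1 / np.sqrt(n[published]))
print(sm.OLS(est[published], X).fit().params)        # slope well above 0 signals publication bias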
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP
This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies (PIAAC). The data contains individual person-measures of several basic skills, including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by representative samples from all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, large size, the number and quality of biographical features per subject, and its representativeness of the population at large.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The case of size-biased sampling of known order from a finite population without replacement is considered. The behavior of such a sampling scheme is studied with respect to the sampling fraction. Based on a simulation study, it is concluded that such a sample cannot be treated either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution; as the sampling fraction increases, the bias in the sample decreases, resulting in a transition from an r-size-biased sample to a random sample. A modified version of a likelihood-free method is adopted for making statistical inference about the unknown population parameters, as well as about the size of the population when it is unknown. A simulation study that takes the sampling fraction into consideration demonstrates that the proposed method shows better and more robust behavior than approaches that treat the r-size-biased sample either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution. Finally, a numerical example which motivates this study illustrates our results.
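A minimal sketch of the r = 1 case with a hypothetical lognormal population: units are drawn without replacement with probability proportional to size, and the sample mean drifts back toward the population mean as the sampling fraction grows:
import numpy as np

rng = np.random.default_rng(3)
population = rng.lognormal(0.0, 0.75, 1000)  # hypothetical unit sizes

def size_biased_sample(pop, n, rng):
    # Successive draws without replacement, probability proportional to size
    idx = rng.choice(pop.size, size=n, replace=False, p=pop / pop.sum())
    return pop[idx]

for frac in (0.05, 0.25, 0.75):
    s = size_biased_sample(population, int(frac * population.size), rng)
    print(f"fraction {frac:.2f}: sample mean {s.mean():.3f} vs population mean {population.mean():.3f}")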
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Longitudinal or panel surveys offer unique benefits for social science research, but they typically suffer from attrition, which reduces sample size and can result in biased inferences. Previous research tends to focus on the demographic predictors of attrition, conceptualizing attrition propensity as a stable, individual-level characteristic: some individuals (e.g., young, poor, residentially mobile) are more likely to drop out of a study than others. We argue that panel attrition reflects both the characteristics of the individual respondent and her survey experience, a factor shaped by the design and implementation features of the study. In this paper, we examine and compare the predictors of panel attrition in the 2008-2009 American National Election Study, an online panel, and the 2006-2010 General Social Survey, a face-to-face panel. In both cases, survey experience variables are predictive of panel attrition above and beyond the standard demographic predictors, but the particular measures of relevance differ across the two surveys. The findings inform statistical corrections for panel attrition bias and provide study design insights for future panel data collections.
According to a survey conducted in the United States in summer 2022, 79 percent of Republican respondents felt that news coverage had a great deal of political bias, making these voters the most likely to hold this opinion of the news media. Independents also felt strongly about this issue, whereas only 33 percent of Democrats said they saw a great deal of political bias in news.
How politics affects news consumption
Political bias in news can alienate consumers and may also be poorly received when coverage of a non-political topic leans too heavily towards one end of the spectrum. However, at the same time, personal politics in general are often closely interlinked with how a consumer perceives or engages with news and information. A clear example of this can be found when looking at political news sources used weekly in the U.S., with Republicans and Democrats opting for the national networks they most identify with. But what if audiences cannot find the content they want?
A change in behavior
Engaging with news that aligns with one's politics is not uncommon. That said, perceived bias in mainstream media may lead some consumers to look elsewhere and turn away from more "neutral" outlets if they believe the news is no longer impartial. Data shows that a number of leading conservative websites registered a substantial increase in visitors year over year. Looking at this data in the context of Republicans' concern about bias in political news, it is likely that this trend will continue and consumers will pursue the outlets they feel resonate with them most.
The interbreeding of individuals coming from genetically differentiated but incompletely isolated populations can lead to the formation of admixed populations, having important implications in ecology and evolution. In this simulation study, we evaluate how individual admixture proportions estimated by the software structure are quantitatively affected by different factors. Using various scenarios of admixture between two diverging populations, we found that unbalanced sampling from parental populations may seriously bias the inferred admixture proportions; moreover, proportionally large samples from the admixed population can also decrease the accuracy and precision of the inferences. As expected, weak differentiation between parental populations and drift after the admixture event strongly increase the biases caused by uneven sampling. We also show that admixture proportions are generally more biased when parental populations unequally contributed to the admixed population. Finally, w...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This article analyzes the consequences of nonrandom sample selection for continuous-time duration analyses and develops a new estimator to correct for it when necessary. We conduct a series of Monte Carlo analyses that estimate common duration models as well as our proposed duration model with selection. These simulations show that ignoring sample selection issues can lead to biased parameter estimates, including the appearance of (nonexistent) duration dependence. In addition, our proposed estimator is found to be superior in root mean-square error terms when nontrivial amounts of selection are present. Finally, we provide an empirical application of our method by studying whether self-selectivity is a problem for studies of leaders' survival during and following militarized conflicts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Class imbalance is a major problem in classification, wherein the decision boundary is easily biased toward the majority class. A data-level solution (resampling) is one possible solution to this problem. However, several studies have shown that resampling methods can deteriorate the classification performance. This is because of the overgeneralization problem, which occurs when samples produced by the oversampling technique that should be represented in the minority class domain are introduced into the majority-class domain. This study shows that the overgeneralization problem is aggravated in complex data settings and introduces two alternate approaches to mitigate it. The first approach involves incorporating a filtering method into oversampling. The second approach is to apply undersampling. The main objective of this study is to provide guidance on selecting optimal resampling methods in imbalanced and complex datasets to improve classification performance. Simulation studies and real data analyses were performed to compare the resampling results in various scenarios with different complexities, imbalances, and sample sizes. In the case of noncomplex datasets, undersampling was found to be optimal. However, in the case of complex datasets, applying a filtering method to delete misallocated examples was optimal. In conclusion, this study can aid researchers in selecting the optimal method for resampling complex datasets.
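A minimal sketch of the options above on a hypothetical imbalanced dataset, using scikit-learn and the third-party imbalanced-learn package; the filtering variant shown is SMOTE combined with Edited Nearest Neighbours, one of several possible filters:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN            # oversampling followed by an ENN filtering step
from imblearn.under_sampling import RandomUnderSampler

# 95:5 imbalance with some label noise as a stand-in for "complexity"
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.05, random_state=0)

for name, sampler in [("SMOTE only", SMOTE(random_state=0)),
                      ("SMOTE + ENN filter", SMOTEENN(random_state=0)),
                      ("undersampling", RandomUnderSampler(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))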
This data collection consists of behavioural task data for measures of attention and interpretation bias, specifically: emotional Stroop, attention probe (both measuring attention bias), similarity ratings task and scrambled sentence task (both measuring interpretation bias). Data on the following six participant groups are included in the dataset: native UK (n=36), native HK (n=39), UK migrants to HK (short term n=31, long term n=28) and HK migrants to UK (short term n=37, long term n=31). Also included are personal characteristics and questionnaire measures.

The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

Participants: Local Hong Kong and UK natives, plus short-term and long-term migrants in each country, aged 16-65, with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

Sampling procedure: Participants were recruited using circular emails, public flyers and other advertisements in local venues, universities and clubs.

Data collection: Participants completed four previously developed and validated cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in their native language. They also provided socio-demographic information and completed questionnaires.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
1. Integrated population models (hereafter, IPMs) have become increasingly popular for the modeling of populations, as investigators seek to combine survey and demographic data to understand the processes governing population dynamics. These models are particularly useful for identifying and exploring knowledge gaps within datasets, because they allow investigators to estimate biologically meaningful parameters, such as immigration and reproduction, that are uninformed by data. As IPMs have been developed relatively recently, model behavior remains relatively poorly understood. Much attention has been paid to parameter behavior, such as parameter estimates near boundaries, as well as to the consequences of dependent datasets. However, the movement of bias among parameters remains underexamined, particularly when models include parameters that are estimated without data.

2. To examine the distribution of bias among model parameters, we simulated stable populations closed to immigration and emigration. We simulated two scenarios that might induce bias in survival estimates: marker-induced bias in the capture-mark-recapture data, and heterogeneity in the mortality process. We subsequently ran appropriate capture-mark-recapture, state-space, and fecundity models, as well as integrated population models.

3. Simulation results suggest that when sampling bias exists in datasets, parameters that are not informed by data are extremely susceptible to bias. For example, in the presence of marker effects on survival of 0.1, estimates of immigration rate from an integrated population model were biased high (0.09). When heterogeneity in the mortality process was simulated, inducing bias in estimates of adult (-0.04) and juvenile (-0.097) survival rates, estimates of fecundity were biased by 46.2%.

4. We believe our results have important implications for biological inference when using integrated population models, as well as for future model development and implementation. Specifically, parameters that are estimated without data absorb ~90% of the bias in integrated modelling frameworks. We suggest that investigators interpret posterior distributions of these parameters as a combination of biological process and systematic bias.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We examine the persistence of teachers' gender biases by following teachers over time in different classes. We find a very high correlation of gender biases for teachers across their classes. Based on out-of-sample measures of these biases, we estimate substantial effects of these biases on students' performance in university admission exams, choice of university field of study, and quality of the enrolled program. The effects on university choice outcomes are larger for girls, explaining some gender differences in STEM majors. Part of these effects, which are more prevalent among less effective teachers, is mediated by changes in school attendance. These are the data that produce the results found in the related paper.
A survey from July 2022 asked Americans how they felt about the effects of bias in news on their ability to sort out facts, and revealed that 50 percent felt there was so much bias in the news that it was difficult to discern what was factual from information that was not. This was the highest share who said so across all years shown, and at the same time, the 2022 survey showed the lowest share of respondents who believed there were enough sources to be able to sort out fact from fiction.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This article reevaluates recent instrumental variables (IV) estimates of the returns to schooling in light of the fact that two-stage least squares is biased in the same direction as ordinary least squares (OLS) even in very large samples. We propose a split-sample instrumental variables (SSIV) estimator that is not biased toward OLS. SSIV uses one-half of a sample to estimate parameters of the first-stage equation. Estimated first-stage parameters are then used to construct fitted values and second-stage parameter estimates in the other half-sample. SSIV is biased toward 0, but this bias can be corrected. The split-sample estimators confirm and reinforce some previous findings on the returns to schooling but fail to confirm others.
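A minimal simulated sketch of the split-sample idea with a single instrument (all data hypothetical; the paper's bias correction is not applied): the first stage is fit on one half-sample, and its coefficients generate the fitted values used in the second stage on the other half:
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unobserved confounder
x = 0.5 * z + u + rng.normal(size=n)          # endogenous regressor
y = 1.0 * x + 2.0 * u + rng.normal(size=n)    # true coefficient on x is 1.0

half = n // 2
Z1 = np.column_stack([np.ones(half), z[:half]])
gamma = np.linalg.lstsq(Z1, x[:half], rcond=None)[0]    # first stage on half 1

Z2 = np.column_stack([np.ones(n - half), z[half:]])
x_hat = Z2 @ gamma                                      # fitted values in half 2
X2 = np.column_stack([np.ones(n - half), x_hat])
beta = np.linalg.lstsq(X2, y[half:], rcond=None)[0]     # second stage on half 2
print(beta[1])  # SSIV estimate: biased toward 0, but correctable as described above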