Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
outlet: The publisher of the article.
headline: The headline of the article.
article_text: The full content of the news article.
image_description: Description of the paired image.
image: The file path of the associated image.
date_published: The date the article was published.
source_url: The original URL of the article.
canonical_link: The canonical URL of the article.
new_categories: Categories assigned to the article.
news_categories_confidence_scores: Confidence scores for each category.
text_label: Indicates the likelihood of the article being disinformation:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
    print(record)
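The same streaming API can be chained for quick filtering. A minimal sketch, assuming the text_label values are the strings "Likely" and "Unlikely" as documented above:
from datasets import load_dataset
# Stream the dataset and keep only records annotated as likely disinformation
streamed = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
likely_only = streamed['train'].filter(lambda r: r['text_label'] == 'Likely')
for record in likely_only.take(3):
    print(record['outlet'], '-', record['headline'])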
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Short-term mining planning typically relies on samples obtained from channels or less-accurate sampling methods. The results may include larger sampling errors than those derived from diamond drill hole core samples. The aim of this paper is to evaluate the impact of the sampling error on grade estimation and propose a method of correcting the imprecision and bias in the soft data. In addition, this paper evaluates the benefits of using soft data in mining planning. These concepts are illustrated via a gold mine case study, where two different data types are presented. The study used Au grades collected via diamond drilling (hard data) and channels (soft data). Four methodologies were considered for estimation of the Au grades of each block to be mined: ordinary kriging with hard and soft data pooled without considering differences in data quality; ordinary kriging with only hard data; standardized ordinary kriging with pooled hard and soft data; and standardized ordinary cokriging. The results show that even biased samples collected using poor sampling protocols improve the estimates more than a limited number of precise and unbiased samples. A well-designed estimation method corrects the biases embedded in the samples, mitigating their propagation to the block model.
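To make the comparison concrete, here is a minimal sketch of two of the four set-ups (hard data only versus naively pooled hard and soft data) using the third-party pykrige package. The coordinates, grades, and variogram choice are all hypothetical, and the standardized (co)kriging variants the paper proposes are not shown:
import numpy as np
from pykrige.ok import OrdinaryKriging  # pip install pykrige

rng = np.random.default_rng(0)
# Hypothetical Au grades: few precise drill-core samples (hard), many noisier channel samples (soft)
hard_xy = rng.uniform(0, 100, (30, 2));  hard_au = rng.lognormal(0.5, 0.4, 30)
soft_xy = rng.uniform(0, 100, (200, 2)); soft_au = rng.lognormal(0.7, 0.6, 200)
targets = np.mgrid[5:100:10, 5:100:10].reshape(2, -1).T  # block centroids to estimate

# Ordinary kriging with hard data only
ok_hard = OrdinaryKriging(hard_xy[:, 0], hard_xy[:, 1], hard_au, variogram_model="spherical")
est_hard, _ = ok_hard.execute("points", targets[:, 0], targets[:, 1])

# Ordinary kriging with hard and soft data pooled, ignoring quality differences
xy = np.vstack([hard_xy, soft_xy]); au = np.concatenate([hard_au, soft_au])
ok_pool = OrdinaryKriging(xy[:, 0], xy[:, 1], au, variogram_model="spherical")
est_pool, _ = ok_pool.execute("points", targets[:, 0], targets[:, 1])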
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
New methods for species distribution models (SDMs) utilise presence‐absence (PA) data to correct the sampling bias of presence‐only (PO) data in a spatial point process setting. These have been shown to improve species estimates when both data sets are large and dense. However, is a PA data set that is smaller and patchier than hitherto examined able to do the same? Furthermore, when both data sets are relatively small, is there enough information contained within them to produce a useful estimate of species' distributions? These attributes are common in many applications.
A stochastic simulation was conducted to assess the ability of a pooled data SDM to estimate the distribution of species from increasingly sparser and patchier data sets. The simulated data sets were varied by changing the number of presence‐absence sample locations, the degree of patchiness of these locations, the number of PO observations, and the level of sampling bias within the PO observations. The performance of the pooled data SDM was compared to a PA SDM and a PO SDM to assess the strengths and limitations of each SDM.
The pooled data SDM successfully removed the sampling bias from the PO observations even when the presence‐absence data was sparse and patchy, and the PO observations formed the majority of the data. The pooled data SDM was, in general, more accurate and more precise than either the PA SDM or the PO SDM. All SDMs were more precise for the species responses than they were for the covariate coefficients.
The emerging SDM methodology that pools PO and PA data will facilitate more certainty around species' distribution estimates, which in turn will allow more relevant and concise management and policy decisions to be enacted. This work shows that it is possible to achieve this result even in relatively data‐poor regions.
https://spdx.org/licenses/CC0-1.0.html
Aim: Citizen science is a cost-effective potential source of invasive species occurrence data. However, data quality issues due to unstructured sampling approaches may discourage the use of these observations by science and conservation professionals. This study explored the utility of low-structure iNaturalist citizen science data in invasive plant monitoring. We first examined the prevalence of invasive taxa in iNaturalist plant observations and sampling biases associated with those data. Using four invasive species as examples, we then compared iNaturalist and professional agency observations and used the two datasets to model suitable habitat for each species.

Location: Hawaiʻi, USA

Methods: To estimate the prevalence of invasive plant data, we compared the number of species and observations recorded in iNaturalist to botanical checklists for Hawaiʻi. Sampling bias was quantified along gradients of site accessibility, protective status, and vegetation disturbance using a bias index. Habitat suitability for four invasive species was modeled in Maxent, using observations from iNaturalist, professional agencies, and stratified subsets of iNaturalist data.

Results: iNaturalist plant observations were biased toward invasive species, which were frequently recorded in areas with higher road/trail density and vegetation disturbance. Professional observations of four example invasive species tended to occur in less accessible, native-dominated sites. Habitat suitability models based on iNaturalist versus professional data showed moderate overlap and different distributions of suitable habitat across vegetation disturbance classes. Stratifying iNaturalist observations had little effect on how suitable habitat was distributed for the species modeled in this study.

Main conclusions: Opportunistic iNaturalist observations have the potential to complement and expand professional invasive plant monitoring, which we found was often affected by inverse sampling biases. Invasive species represented a high proportion of iNaturalist plant observations, and were recorded in environments that were not captured by professional surveys. Combining the datasets thus led to more comprehensive estimates of suitable habitat.
This data collection consists of pilot data measuring task equivalence for measures of attention and interpretation bias. Congruent Mandarin and English emotional Stroop, attention probe (both measuring attention bias), similarity ratings task and scrambled sentence task (both measuring interpretation bias) were developed using back-translation and decentering procedures. Tasks were then completed by 47 bilingual Mandarin-English speakers. Presented are data detailing personal characteristics, task scores and bias scores.

The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

Participants: Fluent bilingual Mandarin and English speakers, aged 16-65, with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

Sampling procedure: Participants were recruited using circular emails sent to all university staff and students, as well as through flyers around campuses. Relevant societies and language schools in central London were also contacted.

Data collection: Participants completed four cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in both English and Mandarin. Order of language presentation and task presentation were counterbalanced.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article compares distribution functions among pairs of locations in their domains, in contrast to the typical approach of univariate comparison across individual locations. This bivariate approach is studied in the presence of sampling bias, which has been gaining attention in COVID-19 studies that over-represent more symptomatic people. In cases with either known or unknown sampling bias, we introduce Anderson–Darling-type tests based on both the univariate and bivariate formulation. A simulation study shows the superior performance of the bivariate approach over the univariate one. We illustrate the proposed methods using real data on the distribution of the number of symptoms suggestive of COVID-19.
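The article's bivariate, bias-adjusted statistics are not available in standard libraries, but the classical k-sample Anderson-Darling test they build on is. A minimal sketch with hypothetical symptom-count samples from two locations:
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(1)
loc_a = rng.poisson(2.0, 300)  # sample from one location
loc_b = rng.poisson(2.6, 300)  # sample from another, e.g. over-representing symptomatic people
# Tests the null hypothesis that both samples come from the same distribution
result = anderson_ksamp([loc_a, loc_b])
print(result.statistic, result.significance_level)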
If the publication decisions of journals are a function of the statistical significance of research findings, the published literature may suffer from “publication bias.” This paper describes a method for detecting publication bias. We point out that to achieve statistical significance, the effect size must be larger in small samples. If publications tend to be biased against statistically insignificant results, we should observe that the effect size diminishes as sample sizes increase. This proposition is tested and confirmed using the experimental literature on voter mobilization.
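A toy simulation of that detection logic, with all numbers hypothetical: if journals keep only significant estimates, the surviving effect sizes must shrink as sample size grows, so regressing published effects on 1/sqrt(n) yields a clearly positive slope:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = rng.integers(50, 5000, 2000)                     # sample sizes of 2000 hypothetical studies
est = 0.05 + rng.normal(0, 1, n.size) / np.sqrt(n)   # estimated effects around a small true effect
se = 1 / np.sqrt(n)
published = np.abs(est / se) > 1.96                  # publication filter: significant results only

X = sm.add_constant(1 / np.sqrt(n[published]))
print(sm.OLS(est[published], X).fit().params)        # slope well above 0 signals publication bias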
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP
This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies (PIAAC). The data contains individual person-measures of several basic skills, including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by representative samples from all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, large size, the number and quality of biographical features per subject, and its representativeness of the population at large.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The case of size-biased sampling of known order from a finite population without replacement is considered. The behavior of such a sampling scheme is studied with respect to the sampling fraction. Based on a simulation study, it is concluded that such a sample cannot be treated either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution; as the sampling fraction increases, the bias in the sample decreases, resulting in a transition from an r-size-biased sample to a random sample. A modified version of a likelihood-free method is adopted for making statistical inference about the unknown population parameters, as well as about the size of the population when it is unknown. A simulation study that takes the sampling fraction into consideration demonstrates that the proposed method shows better and more robust behavior than approaches that treat the r-size-biased sample either as a random sample from the parent distribution or as a random sample from the corresponding r-size weighted distribution. Finally, a numerical example which motivates this study illustrates our results.
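A minimal sketch of the r = 1 case with a hypothetical lognormal population: units are drawn without replacement with probability proportional to size, and the sample mean drifts back toward the population mean as the sampling fraction grows:
import numpy as np

rng = np.random.default_rng(3)
population = rng.lognormal(0.0, 0.75, 1000)  # hypothetical unit sizes

def size_biased_sample(pop, n, rng):
    # Successive draws without replacement, probability proportional to size
    idx = rng.choice(pop.size, size=n, replace=False, p=pop / pop.sum())
    return pop[idx]

for frac in (0.05, 0.25, 0.75):
    s = size_biased_sample(population, int(frac * population.size), rng)
    print(f"fraction {frac:.2f}: sample mean {s.mean():.3f} vs population mean {population.mean():.3f}")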
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Longitudinal or panel surveys offer unique benefits for social science research, but they typically suffer from attrition, which reduces sample size and can result in biased inferences. Previous research tends to focus on the demographic predictors of attrition, conceptualizing attrition propensity as a stable, individual-level characteristic: some individuals (e.g., young, poor, residentially mobile) are more likely to drop out of a study than others. We argue that panel attrition reflects both the characteristics of the individual respondent and her survey experience, a factor shaped by the design and implementation features of the study. In this paper, we examine and compare the predictors of panel attrition in the 2008-2009 American National Election Study, an online panel, and the 2006-2010 General Social Survey, a face-to-face panel. In both cases, survey experience variables are predictive of panel attrition above and beyond the standard demographic predictors, but the particular measures of relevance differ across the two surveys. The findings inform statistical corrections for panel attrition bias and provide study design insights for future panel data collections.
According to a survey conducted in the United States in summer 2022, 79 percent of Republican respondents felt that news coverage had a great deal of political bias, making these voters the most likely to hold this opinion of the news media. Independents also felt strongly about this issue, whereas only 33 percent of Democrats said they saw a great deal of political bias in news.
How politics affects news consumption
Political bias in news can alienate consumers and may also be poorly received when coverage of a non-political topic leans too heavily towards one end of the spectrum. However, at the same time, personal politics in general are often closely interlinked with how a consumer perceives or engages with news and information. A clear example of this can be found when looking at political news sources used weekly in the U.S., with Republicans and Democrats opting for the national networks they most identify with. But what if audiences cannot find the content they want?
A change in behavior
Engaging with news that aligns with one's politics is not uncommon. That said, perceived bias in mainstream media may lead some consumers to look elsewhere and turn away from more "neutral" outlets if they believe the news is no longer impartial. Data shows that a number of leading conservative websites registered a substantial increase in visitors year over year. Looking at this data in the context of Republicans' concern about bias in political news, it is likely that this trend will continue and consumers will pursue the outlets they feel resonate with them most.
The interbreeding of individuals coming from genetically differentiated but incompletely isolated populations can lead to the formation of admixed populations, having important implications in ecology and evolution. In this simulation study, we evaluate how individual admixture proportions estimated by the software structure are quantitatively affected by different factors. Using various scenarios of admixture between two diverging populations, we found that unbalanced sampling from parental populations may seriously bias the inferred admixture proportions; moreover, proportionally large samples from the admixed population can also decrease the accuracy and precision of the inferences. As expected, weak differentiation between parental populations and drift after the admixture event strongly increase the biases caused by uneven sampling. We also show that admixture proportions are generally more biased when parental populations unequally contributed to the admixed population. Finally, w...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This article analyzes the consequences of nonrandom sample selection for continuous-time duration analyses and develops a new estimator to correct for it when necessary. We conduct a series of Monte Carlo analyses that estimate common duration models as well as our proposed duration model with selection. These simulations show that ignoring sample selection issues can lead to biased parameter estimates, including the appearance of (nonexistent) duration dependence. In addition, our proposed estimator is found to be superior in root mean-square error terms when nontrivial amounts of selection are present. Finally, we provide an empirical application of our method by studying whether self-selectivity is a problem for studies of leaders' survival during and following militarized conflicts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Class imbalance is a major problem in classification, wherein the decision boundary is easily biased toward the majority class. A data-level solution (resampling) is one possible solution to this problem. However, several studies have shown that resampling methods can deteriorate the classification performance. This is because of the overgeneralization problem, which occurs when samples produced by the oversampling technique that should be represented in the minority class domain are introduced into the majority-class domain. This study shows that the overgeneralization problem is aggravated in complex data settings and introduces two alternate approaches to mitigate it. The first approach involves incorporating a filtering method into oversampling. The second approach is to apply undersampling. The main objective of this study is to provide guidance on selecting optimal resampling methods in imbalanced and complex datasets to improve classification performance. Simulation studies and real data analyses were performed to compare the resampling results in various scenarios with different complexities, imbalances, and sample sizes. In the case of noncomplex datasets, undersampling was found to be optimal. However, in the case of complex datasets, applying a filtering method to delete misallocated examples was optimal. In conclusion, this study can aid researchers in selecting the optimal method for resampling complex datasets.
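A minimal sketch of the options above on a hypothetical imbalanced dataset, using scikit-learn and the third-party imbalanced-learn package; the filtering variant shown is SMOTE combined with Edited Nearest Neighbours, one of several possible filters:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN            # oversampling followed by an ENN filtering step
from imblearn.under_sampling import RandomUnderSampler

# 95:5 imbalance with some label noise as a stand-in for "complexity"
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.05, random_state=0)

for name, sampler in [("SMOTE only", SMOTE(random_state=0)),
                      ("SMOTE + ENN filter", SMOTEENN(random_state=0)),
                      ("undersampling", RandomUnderSampler(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))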
This data collection consists of behavioural task data for measures of attention and interpretation bias, specifically: emotional Stroop, attention probe (both measuring attention bias), similarity ratings task and scrambled sentence task (both measuring interpretation bias). Data on the following six participant groups are included in the dataset: native UK (n=36), native HK (n=39), UK migrants to HK (short term n=31, long term n=28) and HK migrants to UK (short term n=37, long term n=31). Also included are personal characteristics and questionnaire measures.

The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

Participants: Local Hong Kong and UK natives, plus short-term and long-term migrants in each country, aged 16-65, with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

Sampling procedure: Participants were recruited using circular emails, public flyers and other advertisements in local venues, universities and clubs.

Data collection: Participants completed four previously developed and validated cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in their native language. They also provided socio-demographic information and completed questionnaires.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
1. Integrated population models (hereafter, IPMs) have become increasingly popular for the modeling of populations, as investigators seek to combine survey and demographic data to understand the processes governing population dynamics. These models are particularly useful for identifying and exploring knowledge gaps within datasets, because they allow investigators to estimate biologically meaningful parameters, such as immigration and reproduction, that are uninformed by data. As IPMs have been developed relatively recently, model behavior remains relatively poorly understood. Much attention has been paid to parameter behavior, such as parameter estimates near boundaries, as well as to the consequences of dependent datasets. However, the movement of bias among parameters remains underexamined, particularly when models include parameters that are estimated without data.

2. To examine the distribution of bias among model parameters, we simulated stable populations closed to immigration and emigration. We simulated two scenarios that might induce bias in survival estimates: marker-induced bias in the capture-mark-recapture data, and heterogeneity in the mortality process. We subsequently ran appropriate capture-mark-recapture, state-space, and fecundity models, as well as integrated population models.

3. Simulation results suggest that when sampling bias exists in datasets, parameters that are not informed by data are extremely susceptible to bias. For example, in the presence of marker effects on survival of 0.1, estimates of immigration rate from an integrated population model were biased high (0.09). When heterogeneity in the mortality process was simulated, inducing bias in estimates of adult (-0.04) and juvenile (-0.097) survival rates, estimates of fecundity were biased by 46.2%.

4. We believe our results have important implications for biological inference when using integrated population models, as well as for future model development and implementation. Specifically, parameters that are estimated without data absorb ~90% of the bias in integrated modelling frameworks. We suggest that investigators interpret posterior distributions of these parameters as a combination of biological process and systematic bias.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We examine the persistence of teachers' gender biases by following teachers over time in different classes. We find a very high correlation of gender biases for teachers across their classes. Based on out-of-sample measures of these biases, we estimate substantial effects of these biases on students' performance in university admission exams, choice of university field of study, and quality of the enrolled program. The effects on university choice outcomes are larger for girls, explaining some gender differences in STEM majors. Part of these effects, which are more prevalent among less effective teachers, is mediated by changes in school attendance. These are the data that produce the results found in the related paper.
A survey from July 2022 asked Americans how they felt about the effects of bias in news on their ability to sort out facts, and revealed that 50 percent felt there was so much bias in the news that it was difficult to discern what was factual from information that was not. This was the highest share who said so across all years shown, and at the same time, the 2022 survey showed the lowest share of respondents who believed there were enough sources to be able to sort out fact from fiction.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This article reevaluates recent instrumental variables (IV) estimates of the returns to schooling in light of the fact that two-stage least squares is biased in the same direction as ordinary least squares (OLS) even in very large samples. We propose a split-sample instrumental variables (SSIV) estimator that is not biased toward OLS. SSIV uses one-half of a sample to estimate parameters of the first-stage equation. Estimated first-stage parameters are then used to construct fitted values and second-stage parameter estimates in the other half-sample. SSIV is biased toward 0, but this bias can be corrected. The split-sample estimators confirm and reinforce some previous findings on the returns to schooling but fail to confirm others.
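A minimal simulated sketch of the split-sample idea with a single instrument (all data hypothetical; the paper's bias correction is not applied): the first stage is fit on one half-sample, and its coefficients generate the fitted values used in the second stage on the other half:
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unobserved confounder
x = 0.5 * z + u + rng.normal(size=n)          # endogenous regressor
y = 1.0 * x + 2.0 * u + rng.normal(size=n)    # true coefficient on x is 1.0

half = n // 2
Z1 = np.column_stack([np.ones(half), z[:half]])
gamma = np.linalg.lstsq(Z1, x[:half], rcond=None)[0]    # first stage on half 1

Z2 = np.column_stack([np.ones(n - half), z[half:]])
x_hat = Z2 @ gamma                                      # fitted values in half 2
X2 = np.column_stack([np.ones(n - half), x_hat])
beta = np.linalg.lstsq(X2, y[half:], rcond=None)[0]     # second stage on half 2
print(beta[1])  # SSIV estimate: biased toward 0, but correctable as described above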