100+ datasets found
  1. Emails to authors and responses and overall unclear risk of bias (data sets 3 and 4)

    • figshare.com
    xlsx
    Updated Feb 18, 2016
    Cite
    Kieran Shah (2016). Emails to authors and responses and overall unclear risk of bias (data sets 3 and 4) [Dataset]. http://doi.org/10.6084/m9.figshare.2324599.v1
    Available download formats: xlsx
    Dataset updated
    Feb 18, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kieran Shah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Emails to authors and responses for data sets 3 and 4. Assigned risk of bias in data sets 3 and 4. Articles initially assigned unclear risk were subsequently determined to be at low, high, or still unclear risk based on themes in author responses and group consensus for data sets 3 and 4.

  2. NewsUnravel Dataset

    • zenodo.org
    csv
    Updated Sep 14, 2023
    Cite
    anonymous; anonymous (2023). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344882
    Available download formats: csv
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous; anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset
    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    Description of the data files
    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence, the bias rating, and the reason, if given
    Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent

  3. NewsUnravel Dataset

    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    anon (2024). NewsUnravel Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8344890
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the NUDA Dataset

    Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

    General

    This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset consists of text, namely biased sentences with binary bias labels (processed: biased or not biased), as well as metadata about the article. It includes all feedback that was given. The unprocessed single ratings used to create the labels, with the corresponding user IDs, are included.

    For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor contains information sensitive to them, and it is not possible to identify individuals.

    Description of the Data Files

    This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

    NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
    Statistics.png: contains all Umami statistics for NewsUnravel's usage data
    Feedback.csv: holds the participant ID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
    Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence, the bias rating, and the reason, if given
    Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
    Participant.csv: holds the participant IDs and data processing consent

    Collection Process

    Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
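
    A minimal sketch of that majority-vote step follows, not the authors' code: it collapses the individual ratings in Content.csv (described above) into one binary label per sentence. The exact rating column name is an assumption.

    ```python
    # Sketch: derive binary bias labels from single reader ratings by majority vote.
    # "rating" is a hypothetical column name; contentId comes from the description above.
    import pandas as pd

    ratings = pd.read_csv("Content.csv")  # one row per participant rating

    labels = (
        ratings.groupby("contentId")["rating"]
        .agg(lambda votes: int(votes.mean() >= 0.5))  # majority of binary votes
        .rename("is_biased")
    )
    print(labels.head())
    ```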

    Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

    So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face. The dataset will be open source. Upon acceptance, a link with all details and contact information will be provided. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.

  4. Data from: Data: Heuristics and Biases in Home Care Package Resource Allocation

    • researchdata.edu.au
    Updated Apr 1, 2025
    Cite
    Professor Tracy Comans; Professor Tracy Comans (2025). Data : Heuristics and Biases in Home Care Package Resource Allocation [Dataset]. http://doi.org/10.48610/B2B5DE7
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    The University of Queensland
    Authors
    Professor Tracy Comans; Professor Tracy Comans
    License

    http://guides.library.uq.edu.au/deposit_your_data/terms_and_conditions

    Description

    This dataset contains anonymised experiment data downloaded from a survey instrument. The experiment was designed to assess framing bias, and data were collected via an online survey. The survey was designed in three parts: information and consent, demographic questions, and case study vignettes. Demographic questions were identical in both forms of the survey. For the vignettes, respondents were randomised to one of two frames, with frame one presented from a medical assessment perspective (ACAT) and frame two presented from a service provider perspective. There are four vignettes detailing real-world choices in home-care packages, changing only the services and equipment suggested by either ACAT assessors or service providers (treatments).

  5. Replication Data for: Reducing Political Bias in Political Science Estimates

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Zigerell, Lawrence (2023). Replication Data for: Reducing Political Bias in Political Science Estimates [Dataset]. http://doi.org/10.7910/DVN/PZLCJM
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Zigerell, Lawrence
    Description

    Political science researchers have flexibility in how to analyze data, how to report data, and whether to report on data. Review of examples of reporting flexibility from the race and sex discrimination literature illustrates how research design choices can influence estimates and inferences. This reporting flexibility—coupled with the political imbalance among political scientists—creates the potential for political bias in reported political science estimates, but this potential for political bias can be reduced or eliminated through preregistration and preacceptance, in which researchers commit to a research design before completing data collection. Removing the potential for reporting flexibility can raise the credibility of political science research.

  6. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
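
    The feature-selection pitfall described above is easy to demonstrate. The sketch below is illustrative only (not the authors' code): on pure-noise labels, selecting features on the pooled data inflates the K-fold CV accuracy estimate above chance, while keeping selection inside each training fold via a pipeline keeps it honest.

    ```python
    # Sketch: biased vs. unbiased cross-validation with in-fold feature selection.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 1000))  # small n, high dimensionality
    y = rng.integers(0, 2, size=40)  # pure-noise labels: true accuracy ~0.5

    # Biased protocol: select features using ALL samples, then cross-validate.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    biased = cross_val_score(SVC(), X_sel, y, cv=5).mean()

    # Unbiased protocol: selection happens inside each training fold.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
    unbiased = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"pooled selection: {biased:.2f}, in-fold selection: {unbiased:.2f}")
    ```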

  7. Data from: Questioning Bias: Validating a Bias Crime Assessment Tool in California and New Jersey, 2016-2017

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). Questioning Bias: Validating a Bias Crime Assessment Tool in California and New Jersey, 2016-2017 [Dataset]. https://catalog.data.gov/dataset/questioning-bias-validating-a-bias-crime-assessment-tool-in-california-and-new-jersey-2016-a062f
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    New Jersey, California
    Description

    These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study investigates experiences surrounding hate and bias crimes and incidents and reasons and factors affecting reporting and under-reporting among youth and adults in LGBT, immigrant, Hispanic, Black, and Muslim communities in New Jersey and Los Angeles County, California. The collection includes 1 SPSS data file (QB_FinalDataset-Revised.sav (n=1,326; 513 variables)). The collection also contains 24 qualitative data files of transcripts from focus groups and interviews with key informants, which are not included in this release.

  8. News Ninja Dataset

    • data.niaid.nih.gov
    Updated Feb 20, 2024
    Cite
    anon (2024). News Ninja Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8346881
    Dataset updated
    Feb 20, 2024
    Dataset authored and provided by
    anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.

    General

    This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

    The dataset includes sentences with binary bias labels (processed: biased or not biased) as well as the annotations of single players used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset neither identifies sub-populations nor contains information sensitive to them, and it is not possible to identify individuals.

    Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.

    Description of the Data Files

    This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:

    ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.

    AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), if the game label and expert label match (Game VS Expert), if differing labels are a false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), if Expert and BABE labels match (Expert VS BABE), and if the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).

    demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
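
    Because AnalysisNewsNinja.xlsx carries the labels of all three rater categories, the agreement analysis mentioned above can be approximated with pairwise Cohen's kappa. A minimal sketch, assuming the column labels quoted in the description:

    ```python
    # Sketch: pairwise Cohen's kappa between the Game, Expert, and BABE labels.
    import pandas as pd
    from sklearn.metrics import cohen_kappa_score

    df = pd.read_excel("AnalysisNewsNinja.xlsx")

    # Column names as quoted above; adjust to the actual spreadsheet headers.
    raters = {
        "Game": df["isBiased Game"],
        "Expert": df["isBiased Expert"],
        "BABE": df["isBiasedBABE"],
    }
    for a, b in [("Game", "Expert"), ("Expert", "BABE"), ("Game", "BABE")]:
        print(f"{a} vs {b}: kappa = {cohen_kappa_score(raters[a], raters[b]):.3f}")
    ```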

    Collection Process

    Data was collected through interactions with the News Ninja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period ran from February 20 to February 28, 2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.

    The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.

    The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.

  9. Cross-classified meta-regression models estimating the relationship between survey characteristics and NBI.

    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Adam Rybak (2023). Cross-classified meta-regression models estimating the relationship between survey characteristics and NBI. [Dataset]. http://doi.org/10.1371/journal.pone.0283092.t002
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Adam Rybak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cross-classified meta-regression models estimating the relationship between survey characteristics and NBI.

  10. Distribution of survey modes in analyzed dataset by the United Nations world regions.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Adam Rybak (2023). Distribution of survey modes in analyzed dataset by the United Nations world regions. [Dataset]. http://doi.org/10.1371/journal.pone.0283092.t001
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Adam Rybak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World, United States, United Nations
    Description

    Distribution of survey modes in analyzed dataset by the United Nations world regions.

  11. Data from: Preregistration in experimental linguistics: Applications, challenges, and limitations

    • psyarxiv.com
    Updated Feb 17, 2022
    Cite
    Timo Roettger (2022). Preregistration in experimental linguistics: Applications, challenges, and limitations [Dataset]. http://doi.org/10.31234/osf.io/vc9hu
    Dataset updated
    Feb 17, 2022
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Timo Roettger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our current publication system incentivizes neither publishing null results nor direct replication attempts, which biases the scientific record toward novel findings that appear to support presented hypotheses (referred to as “publication bias”). Moreover, flexibility in data collection, measurement, and analysis (referred to as “researcher degrees of freedom”) can lead to overconfident beliefs in the robustness of a statistical relationship. One way to systematically decrease publication bias and researcher degrees of freedom is preregistration. A preregistration is a time-stamped document that specifies how data are to be collected, measured, and analyzed prior to data collection. While preregistration is a powerful tool to reduce bias, it comes with certain challenges and limitations which have to be evaluated for each scientific discipline individually. This paper discusses applications, challenges, and limitations of preregistration for experimental linguistic research.

  12. Data from: Assessing and Correcting Neighborhood Socioeconomic Spatial Sampling Biases in Citizen Science Mosquito Data Collection

    • zenodo.org
    bin
    Updated Jul 1, 2024
    Cite
    Álvaro Padilla-Pozo; Frederic Bartumeus; Frederic Bartumeus; Tomás Montalvo; Tomás Montalvo; Isis Sanpera-Calbet; Isis Sanpera-Calbet; Andrea Valsecchi; John R.B. Palmer; John R.B. Palmer; Álvaro Padilla-Pozo; Andrea Valsecchi (2024). Assessing and Correcting Neighborhood Socioeconomic Spatial Sampling Biases in Citizen Science Mosquito Data Collection [Dataset]. http://doi.org/10.5281/zenodo.12605540
    Available download formats: bin
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Álvaro Padilla-Pozo; Frederic Bartumeus; Frederic Bartumeus; Tomás Montalvo; Tomás Montalvo; Isis Sanpera-Calbet; Isis Sanpera-Calbet; Andrea Valsecchi; John R.B. Palmer; John R.B. Palmer; Álvaro Padilla-Pozo; Andrea Valsecchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reporting data from the Mosquito Alert citizen science system, active catch basin surveillance, and mosquito trap surveillance used in "Assessing and Correcting Neighborhood Socioeconomic Spatial Sampling Biases in Citizen Science Mosquito Data Collection."

    The file named mosquito_alert_adult_bite_reports_Barcelona_2014_2023.Rds includes all adult mosquito and mosquito bite reports received from Barcelona Municipality from the start of the Mosquito Alert project in 2014 through the end of 2023. The file named mosquito_alert_validated_albopictus_reports_Barcelona_2014_23.Rds includes all expert-validated Ae. albopictus reports received from Barcelona Municipality during the same time period. The data are stored as RDS files that contain the following fields:

    • year - the year in which the report was made. Class = dbl.
    • date - the date on which the report was made. Class = date.
    • type - the report type, either adult mosquito ("adult") or mosquito breeding site ("site"). Class = chr.
    • lon - the longitude of the report location. Class = dbl.
    • lat - the latitude of the report location. Class = dbl.
    • validation_score - Entolab validation score. Either 1 (possible Ae. albopictus) or 2 (probable Ae. albopictus). This field is present only in the validated reports data.

    The file named active_catch_basin_drain_data.Rds includes information about all catch basin drains in Barcelona Municipality in which the Barcelona Public Health Agency (ASPB) detected mosquito activity as part of its continuous monitoring and control of mosquitoes from 2019 through 2023. The data is stored in an RDS file with the following fields:

    • any_reports - dummy variable indicating whether any Mosquito Alert adult mosquito or mosquito bite reports were sent through Mosquito Alert from within 200 m of the catch basin drain during the year in which the ASPB detected mosquito activity in the catch basin drain. Class = lgl.
    • se_expected - sampling effort for the 0.025 degree lon/lat sampling cell in which the catch basin drain lies during the year in which the ASPB detected mosquito activity in the drain. This value is taken from the SE_expected variable in the sampling_effort_daily_cellres_025.csv.gz file available at https://zenodo.org/records/12602985. Sampling effort is estimated as the expected number of participants sending at least one report from the cell during the day in question, given the number of participants recorded in the cell that day and the amount of time elapsed since each one began participating in the project. Class = dbl.
    • p_singlehh - proportion of single-member households in the population of the census tract in which the catch basin drain is located. Class = dbl.
    • mean_age - mean age of the population of the census tract in which the catch basin drain is located. Class = dbl.
    • mean_rent_consumption_unit - mean income per consumption unit in the census tract in which the catch basin drain is located. Class = dbl.
    • popd - population density of the census tract in which the catch basin drain is located. Class = dbl.
    • id_item - unique identifier given to the catch basin drain. Drain identifiers appear multiple times in the data when the ASPB detected activity in the drain in multiple years. Class = dbl.

    The file named trap_data.Rds includes information on the adult mosquito trap surveillance analyzed in this article. The data is stored in an RDS file with the following fields:

    • females - number of Ae. albopictus females found in the trap. Class = dbl.
    • trap_name - unique identifier for the trap. Class = chr.
    • trapping_effort - number of days from when the trap was set to when it was checked. Class = dbl.
    • date - date on which the trap was checked. Class = date.
    • mean_tm30 - mean temperature for the 30 days leading up to the date on which the trap was checked. Class = dbl.
    • mean_rent_consumption_unit - mean income per consumption unit for the census tract in which the trap was located. Class = dbl.
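
    To work with these files outside R, the RDS containers can be read into pandas data frames, for example with the third-party pyreadr package. A minimal sketch under that assumption:

    ```python
    # Sketch: load the RDS files described above with pyreadr (pip install pyreadr).
    import pyreadr

    # .Rds files hold a single unnamed object, exposed under the key None.
    reports = pyreadr.read_r(
        "mosquito_alert_adult_bite_reports_Barcelona_2014_2023.Rds"
    )[None]
    traps = pyreadr.read_r("trap_data.Rds")[None]

    print(reports[["year", "date", "type", "lon", "lat"]].head())
    print(traps[["females", "trap_name", "trapping_effort", "date"]].head())
    ```
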
  13. Data from: Relative bee abundance varies by collection method and flowering richness: implications for understanding patterns in bee community data

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Apr 27, 2021
    Cite
    Philip Hahn; Marirose Kuhlman; Skylar Burrows; Dan Mummey; Philip Ramsey (2021). Relative bee abundance varies by collection method and flowering richness: implications for understanding patterns in bee community data [Dataset]. http://doi.org/10.5061/dryad.2z34tmpmd
    Available download formats: zip
    Dataset updated
    Apr 27, 2021
    Dataset provided by
    Dryad
    Authors
    Philip Hahn; Marirose Kuhlman; Skylar Burrows; Dan Mummey; Philip Ramsey
    Time period covered
    Apr 27, 2021
    Description

    See metadata file for additional details.

  14. Replication data for: Measuring Immigration Policy

    • dataverse.harvard.edu
    • datasetcatalog.nlm.nih.gov
    Updated Feb 9, 2012
    Cite
    Suzanna Challen (2012). Replication data for: Measuring Immigration Policy [Dataset]. http://doi.org/10.7910/DVN/N9NT1C
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2012
    Dataset provided by
    Harvard Dataverse
    Authors
    Suzanna Challen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/N9NT1C

    Time period covered
    1965 - 2009
    Area covered
    US
    Description

    The dissertation consists of three chapters relating to the measurement of immigration policies, which developed out of my work as an initial co-author of the International Migration Policy and Law Analysis (IMPALA) Database Project. The first chapter entitled, “Brain Gain? Measuring skill bias in U.S. migrant admissions policy,” develops a conceptual and operational definition of skill bias. I apply the measure to new data revealing the level of skill bias in U.S. migrant admissions policy between 1965 and 2008. Skill bias in U.S. migrant admissions policy is both a critical determinant of the skill composition of the migrant population and a response to economic and public demand for highly skilled migrants. However, despite its central role, this is the first direct, comprehensive, annual measure of skill bias in U.S. migrant admissions policy. The second chapter entitled, “Stalled in the Senate: Explaining change in US migrant admissions policy since 1965,” presents new data characterizing change in U.S. migrant admissions policy as both expansive and infrequent over recent decades. I present a new theory of policy change in U.S. migrant admissions policy that incorporates the role of supermajoritarian decision making procedures and organized anti-immigration groups to better account for both the expansive nature and the infrequency of policy change. The theory highlights the importance of a coalition of immigrant advocacy groups, employers and unions in achieving policy change and identifies the conditions under which this coalition is most likely to form and least likely to be blocked by an anti-immigration group opposition. The third chapter entitled, “Post-coding aggregation: A methodological principle for independent data collection,” presents a new technique developed to enable independent collection of flexible, high quality data: post-coding aggregation. Post-coding aggregation is a methodological principle that minimizes data loss, increases transparency, and grants data analysts the ability to decide how best to aggregate information to produce measures. I demonstrate how it increases the flexibility of data use by expanding the utility of data collections for a wider range of research objectives and improves the reliability and the content validity of measures in data analysis.

  15. Analysis Scripts for 'Balancing Bias and Burden in Personal Network Studies'

    • dataverse.nl
    bin, html, png +2
    Updated Dec 1, 2021
    Cite
    M. Stadel; M. Stadel; G Stulp; G Stulp (2021). Analysis Scripts for 'Balancing Bias and Burden in Personal Network Studies' [Dataset]. http://doi.org/10.34894/JNVWNH
    Available download formats: html(1254713), bin(41902), type/x-r-syntax(10676), bin(63487), png(29651), text/x-sh(148), bin(20470), type/x-r-syntax(6398), html(1344090), type/x-r-syntax(11732)
    Dataset updated
    Dec 1, 2021
    Dataset provided by
    DataverseNL
    Authors
    M. Stadel; M. Stadel; G Stulp; G Stulp
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In the study 'Balancing Bias and Burden in Personal Network Studies' we use a sample of 701 Dutch women and their personal networks of 25 alters to investigate two strategies for reducing respondent burden in personal network data collection: (1) eliciting fewer alters and (2) selecting a random subsample from the original set of elicited alters for full assessment. We present the amount of bias in structural and compositional network characteristics connected to applying these strategies, as well as the potential study time gain for every possible network size (2 to 24 alters) as a proxy for respondent burden reduction. Our results can aid researchers designing a personal network study to balance respondent burden and bias in estimates for a range of compositional and structural network characteristics. This folder contains the analysis scripts for the analyses presented in the paper:

    1) data preparation (R Markdown)
    2) emulations and error calculation for dropping alters completely (R Markdown)
    3) emulations and error calculations for randomly picking a subsample of all generated alters (R scripts)
    4) summarizing errors and plotting them for randomly picking a subsample of all generated alters (R Markdown)
    5) the accompanying shiny app (R script)
    6) a figure on how the scripts relate to each other
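
    As a rough illustration of strategy (2), the sketch below, in Python rather than the authors' R, randomly subsamples k alters from a toy 25-alter network and measures the resulting error in one structural characteristic, density:

    ```python
    # Sketch: bias in network density when only k of 25 alters are fully assessed.
    import random
    import networkx as nx

    def density_bias(G: nx.Graph, k: int, n_draws: int = 1000) -> float:
        """Mean density error across random k-alter subsamples."""
        full = nx.density(G)
        errors = [
            nx.density(G.subgraph(random.sample(list(G.nodes), k))) - full
            for _ in range(n_draws)
        ]
        return sum(errors) / n_draws

    # Toy network standing in for one respondent's 25-alter personal network.
    G = nx.gnp_random_graph(25, 0.3, seed=1)
    for k in (5, 10, 15, 20):
        print(f"k={k:2d}: mean density error = {density_bias(G, k):+.4f}")
    ```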

  16. Data from: Approach-induced biases in human information sampling

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Laurence T. Hunt; Robb B. Rutledge; W. M. Nishantha Malalasekera; Steven W. Kennerley; Raymond J. Dolan; Laurence T. Hunt; Robb B. Rutledge; W. M. Nishantha Malalasekera; Steven W. Kennerley; Raymond J. Dolan (2024). Data from: Approach-induced biases in human information sampling [Dataset]. http://doi.org/10.5061/dryad.nb41c
    Available download formats: zip, pdf
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laurence T. Hunt; Robb B. Rutledge; W. M. Nishantha Malalasekera; Steven W. Kennerley; Raymond J. Dolan; Laurence T. Hunt; Robb B. Rutledge; W. M. Nishantha Malalasekera; Steven W. Kennerley; Raymond J. Dolan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Information sampling is often biased towards seeking evidence that confirms one's prior beliefs. Despite such biases being a pervasive feature of human behavior, their underlying causes remain unclear. Many accounts of these biases appeal to limitations of human hypothesis testing and cognition, de facto evoking notions of bounded rationality, but neglect more basic aspects of behavioral control. Here, we investigated a potential role for Pavlovian approach in biasing which information humans will choose to sample. We collected a large novel dataset from 32,445 human subjects, making over 3 million decisions, who played a gambling task designed to measure the latent causes and extent of information-sampling biases. We identified three novel approach-related biases, formalized by comparing subject behavior to a dynamic programming model of optimal information gathering. These biases reflected the amount of information sampled ("positive evidence approach"), the selection of which information to sample ("sampling the favorite"), and the interaction between information sampling and subsequent choices ("rejecting unsampled options"). The prevalence of all three biases was related to a Pavlovian approach-avoid parameter quantified within an entirely independent economic decision task. Our large dataset also revealed that individual differences in the amount of information gathered are a stable trait across multiple gameplays and can be related to demographic measures, including age and educational attainment. As well as revealing limitations in cognitive processing, our findings suggest information sampling biases reflect the expression of primitive, yet potentially ecologically adaptive, behavioral repertoires. One such behavior is sampling from options that will eventually be chosen, even when other sources of information are more pertinent for guiding future action.

  17. Data from: Bias estimation for seven precipitation datasets for the eastern MENA region

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 30, 2025
    Cite
    U.S. Geological Survey (2025). Bias estimation for seven precipitation datasets for the eastern MENA region [Dataset]. https://catalog.data.gov/dataset/bias-estimation-for-seven-precipitation-datasets-for-the-eastern-mena-region
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Middle East and North Africa
    Description

    Information on the spatio-temporal distribution of rainfall is critical for addressing water-related disasters, especially in the Middle East and North Africa's (MENA) arid to semi-arid regions. However, the availability of reliable rainfall datasets for most river basins is limited. In this study, we utilized observations from satellite-based rainfall data, in situ rain gauge observations, and rainfall climatology to determine the most suitable precipitation dataset in the MENA region. This dataset includes the supporting data and graphics for the analysis. The collection includes a spreadsheet containing all the data for the tables and charts, as well as the text file for the in situ data collected and used for the analysis.

  18. CRITEO FAIRNESS IN JOB ADS DATASET

    • kaggle.com
    zip
    Updated Jul 1, 2024
    Cite
    Md. Abdur Rahman (2024). CRITEO FAIRNESS IN JOB ADS DATASET [Dataset]. https://www.kaggle.com/datasets/borhanitrash/fairness-in-job-ads-dataset
    Available download formats: zip (201430692 bytes)
    Dataset updated
    Jul 1, 2024
    Authors
    Md. Abdur Rahman
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset is released by Criteo to foster research and innovation on Fairness in Advertising and AI systems in general. See also Criteo pledge for Fairness in Advertising.

    The dataset is intended for learning click prediction models and evaluating by how much their predictions are biased between different gender groups.

    Data description

    The dataset contains pseudonymized user context and publisher features collected from a job targeting campaign run for 5 months by the AdTech company Criteo. Each line represents a product that was shown to a user. Each user has an impression session where they can see several products at the same time. Each product can be clicked or not clicked by the user. The dataset consists of 1,072,226 rows and 55 columns.

    • features
      • user_id is a unique identifier assigned to each user. This identifier has been anonymized and does not contain any information related to the real users.
      • product_id is a unique identifier assigned to each product, i.e. job offer.
      • impression_id is a unique identifier assigned to each impression, i.e. online session that can have several products at the same time.
      • cat0 to cat5 are anonymized categorical user features.
      • cat6 to cat12 are anonymized categorical product features.
      • num13 to num47 are anonymized numerical user features.
    • labels
      • protected_attribute is a binary feature that describes user gender proxy, i.e. female is 0, male is 1. The detailed description on the meaning can be found below.
      • senior is a binary feature that describes the seniority of the job position, i.e. an assistant role is 0, a managerial role is 1. This feature was created during the data processing step from the product title feature: if the product title contains words describing a managerial role (e.g. 'president', 'ceo', and others), it is assigned 1, otherwise 0.
      • rank is a numerical feature that corresponds to the positional rank of the product on the display for a given impression_id. The position on the display usually biases clicks: a lower rank means a higher position of the product on the display.
      • displayrandom is a binary feature that equals 1 if the display position on the banner of the products associated with the same impression_id was randomized. The click-rank metric should be computed on displayrandom = 1 to avoid positional bias.
      • click is a binary feature that equals 1 if the product product_id in the impression impression_id was clicked by the user user_id.

    Data statistics

    dimension            average
    click                0.077
    protected attribute  0.500
    senior               0.704
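
    As a starting point for such an evaluation, the sketch below (the file name is an assumption; column names are from the description above) computes the observed click-rate gap between the two protected-attribute groups, restricted to displayrandom = 1 to avoid positional bias:

    ```python
    # Sketch: demographic parity gap in click rates on randomized displays.
    import pandas as pd

    df = pd.read_csv("criteo_fairness_job_ads.csv")  # hypothetical file name

    rnd = df[df["displayrandom"] == 1]  # keep only randomized product orders
    rates = rnd.groupby("protected_attribute")["click"].mean()

    print(rates)
    print(f"demographic parity gap: {abs(rates[1] - rates[0]):.4f}")
    ```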

    License

    The data is released under the CC-BY-NC-SA 4.0 license. You are free to Share and Adapt this data provided that you respect the Attribution, NonCommercial, and ShareAlike conditions. Please read the full license carefully before using the data.

    Protected attribute

    As Criteo does not have access to user demographics, we report a proxy of gender as the protected attribute. This proxy is reported as binary for simplicity, yet we acknowledge gender is not necessarily binary.

    The value of the proxy is computed as the majority of the gender attributes of products seen in the user timeline. Products having a gender attribute are typically fashion and clothing items. We acknowledge that this proxy does not necessarily represent how users relate to a given gender, yet we believe it to be a realistic approximation for research purposes.

    We encourage research in Fairness defined with respect to other attributes as well.

    Limitations and interpretations

    We remark that the proposed gender proxy does not give a definition of gender. Since we do not have access to the sensitive information, this is the best solution we have identified at this stage to identify bias on pseudonymized data, and we encourage any discussion on better approximations. This proxy is reported as binary for simplicity, yet we acknowledge gender is not necessarily binary. Although our research focuses on gender, this should not diminish the importance of investigating other types of algorithmic discrimination. While this dataset provides an important application of fairness-aware algorithms in a high-risk domain, there are several fundamental limitations that cannot be addressed easily through data collection or curation processes. These limitations in...

  19. Data from: Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Dec 6, 2019
    Cite
    Thomas Neyens; Peter Diggle; Christel Faes; Natalie Beenaerts; Tom Artois; Emanuele Giorgi (2019). Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg [Dataset]. http://doi.org/10.5061/dryad.brv15dv5r
    Available download formats: zip
    Dataset updated
    Dec 6, 2019
    Dataset provided by
    Dryad
    Authors
    Thomas Neyens; Peter Diggle; Christel Faes; Natalie Beenaerts; Tom Artois; Emanuele Giorgi
    Time period covered
    Dec 5, 2019
    Area covered
    Belgium
    Description

    In species richness studies, citizen-science surveys where participants make individual decisions regarding sampling strategies provide a cost-effective approach to collect a large amount of data. However, it is unclear to what extent the bias inherent to opportunistically collected samples may invalidate our inferences. Here, we compare spatial predictions of forest ground-floor bryophyte species richness in Limburg (Belgium), based on crowd- and expert-sourced data, where the latter are collected by adhering to a rigorous geographical randomisation and data collection protocol. We develop a log-Gaussian Cox process model to analyse the opportunistic sampling process of the crowd-sourced data and assess its sampling bias. We then fit two geostatistical Poisson models to both data-sets and compare the parameter estimates and species richness predictions. We find that the citizens had a higher propensity for locations that were close to their homes and environmentally more valuable. The ...

  20. Logistic regressions for each selection variable.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Arthur A. Stone; Stefan Schneider; Joshua M. Smyth; Doerte U. Junghaenel; Cheng Wen; Mick P. Couper; Sarah Goldstein (2023). Logistic regressions for each selection variable. [Dataset]. http://doi.org/10.1371/journal.pone.0282591.t001
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Arthur A. Stone; Stefan Schneider; Joshua M. Smyth; Doerte U. Junghaenel; Cheng Wen; Mick P. Couper; Sarah Goldstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although the potential for participant selection bias is readily acknowledged in the momentary data collection literature, very little is known about uptake rates in these studies or about differences in the people that participate versus those who do not. This study analyzed data from an existing Internet panel of older people (age 50 and greater) who were offered participation into a momentary study (n = 3,169), which made it possible to compute uptake and to compare many characteristics of participation status. Momentary studies present participants with brief surveys multiple times a day over several days; these surveys ask about immediate or recent experiences. A 29.1% uptake rate was observed when all respondents were considered, whereas a 39.2% uptake rate was found when individuals who did not have eligible smartphones (necessary for ambulatory data collection) were eliminated from the analyses. Taking into account the participation rate for being in this Internet panel, we estimate uptake rates for the general population to be about 5%. A consistent pattern of differences emerged between those who accepted the invitation to participate versus those who did not (in univariate analyses): participants were more likely to be female, younger, have higher income, have higher levels of education, rate their health as better, be employed, not be retired, not be disabled, have better self-rated computer skills, and to have participated in more prior Internet surveys (all p < .0026). Many variables were not associated with uptake including race, big five personality scores, and subjective well-being. For several of the predictors, the magnitude of the effects on uptake was substantial. These results indicate the possibility that, depending upon the associations being investigated, person selection bias could be present in momentary data collection studies.
