100+ datasets found
  1. Utrecht Fairness Recruitment dataset

    • kaggle.com
    zip
    Updated Mar 11, 2025
    Cite
    ICT Institute (2025). Utrecht Fairness Recruitment dataset [Dataset]. https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset
    Explore at:
    zip (47198 bytes)
    Dataset updated
    Mar 11, 2025
    Authors
    ICT Institute
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Utrecht
    Description

    This is a purely synthetic dataset created to help educators and researchers understand fairness definitions. It is a convenient way to illustrate differences between definitions such as fairness through unawareness, group fairness, statistical parity, predictive parity, equalised odds, or treatment equality. The dataset contains multiple sensitive features: age, gender and lives-near-by. These can be combined to define many different sensitive groups. The dataset contains the decisions of five example decision methods that can be evaluated. When using this dataset, you do not need to train your own methods; instead, you can focus on evaluating the existing models.

    This dataset is described and analysed in the following paper. Please cite this paper when using this dataset:

    Burda, P. and Van Otterloo, S. (2024). Fairness definitions explained and illustrated with examples. Computers and Society Research Journal, 2025 (2). https://doi.org/10.54822/PASR6281
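
    Since the dataset ships with pre-computed decisions, a typical exercise is to compare selection rates across sensitive groups rather than train new models. Below is a minimal statistical-parity sketch using pandas; the file name and the column names ("gender" and a 0/1-coded "decision_1") are illustrative assumptions, not taken from the dataset documentation.

    import pandas as pd

    # Hypothetical file and column names; adjust to the actual CSV in the download.
    df = pd.read_csv("utrecht_recruitment.csv")

    # Statistical parity: compare selection rates of one decision method across groups.
    selection_rates = df.groupby("gender")["decision_1"].mean()
    print(selection_rates)

    # A common summary is the ratio of the lowest to the highest selection rate
    # (the "80% rule" asks whether this ratio is at least 0.8).
    print("parity ratio:", selection_rates.min() / selection_rates.max())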

  2. Marketing Bias data

    • kaggle.com
    zip
    Updated Oct 29, 2023
    Cite
    Ahmad (2023). Marketing Bias data [Dataset]. https://www.kaggle.com/datasets/pypiahmad/marketing-bias-data
    Explore at:
    zip (50328 bytes)
    Dataset updated
    Oct 29, 2023
    Authors
    Ahmad
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Marketing Bias dataset encapsulates the interactions between users and products on ModCloth and Amazon Electronics, emphasizing the potential marketing bias inherent in product recommendations. This bias is explored through attributes related to product marketing and user/item interactions.

    Basic Statistics:

    • ModCloth:
      • Reviews: 99,893
      • Items: 1,020
      • Users: 44,783
      • Bias Type: Body Shape

    • Amazon Electronics:
      • Reviews: 1,292,954
      • Items: 9,560
      • Users: 1,157,633
      • Bias Type: Gender

    Metadata:
    • Ratings
    • Product Images
    • User Identities
    • Item Sizes, User Genders

    Example (ModCloth): The data example provided showcases a snippet from ModCloth data with columns like item_id, user_id, rating, timestamp, size, fit, user_attr, model_attr, and others.

    Download Links: Visit the project page for download links.

    Citation: If you utilize this dataset, please cite the following:

    Title: Addressing Marketing Bias in Product Recommendations
    Authors: Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley
    Published in: WSDM, 2020

    Dataset Files:
    • df_electronics.csv
    • df_modcloth.csv

    The dataset is structured to provide a comprehensive overview of user-item interactions and attributes that may contribute to marketing bias, making it a valuable resource for anyone investigating marketing strategies and recommendation systems.
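
    As a rough illustration of how such an investigation might start, the sketch below cross-tabulates average ratings by user attribute and model attribute in the ModCloth file; the column names follow the example above, but the exact value encodings are assumptions.

    import pandas as pd

    df = pd.read_csv("df_modcloth.csv")

    # Average rating by user attribute vs. the attribute of the model used to
    # market the item; systematic gaps across cells hint at marketing bias.
    bias_table = df.pivot_table(index="user_attr", columns="model_attr",
                                values="rating", aggfunc="mean")
    print(bias_table)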

  3. Replication data for: Selection Bias in Comparative Research: The Case of...

    • dataverse.harvard.edu
    Updated Mar 8, 2010
    Cite
    Simon Hug (2010). Replication data for: Selection Bias in Comparative Research: The Case of Incomplete Data Sets [Dataset]. http://doi.org/10.7910/DVN/QO28VG
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2010
    Dataset provided by
    Harvard Dataverse
    Authors
    Simon Hug
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Selection bias is an important but often neglected problem in comparative research. While comparative case studies pay some attention to this problem, this is less the case in broader cross-national studies, where this problem may appear through the way the data used are generated. The article discusses three examples: studies of the success of newly formed political parties, research on protest events, and recent work on ethnic conflict. In all cases the data at hand are likely to be afflicted by selection bias. Failing to take into consideration this problem leads to serious biases in the estimation of simple relationships. Empirical examples illustrate a possible solution (a variation of a Tobit model) to the problems in these cases. The article also discusses results of Monte Carlo simulations, illustrating under what conditions the proposed estimation procedures lead to improved results.

  4. Marketing Bias data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Marketing Bias data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.

    Metadata includes

    • ratings

    • product images

    • user identities

    • item sizes, user genders

  5. md_gender_bias

    • huggingface.co
    • opendatalab.com
    Updated Mar 26, 2021
    Cite
    AI at Meta (2021). md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
    Explore at:
    Dataset updated
    Mar 26, 2021
    Dataset authored and provided by
    AI at Meta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
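
    A minimal sketch for loading the dataset with the Hugging Face datasets library; the configuration name "funpedia" is an assumption, so check the dataset page for the full list of available configurations.

    from datasets import load_dataset

    # Each configuration corresponds to one of the annotated source datasets.
    ds = load_dataset("facebook/md_gender_bias", name="funpedia", split="train")
    print(ds[0])  # inspect one annotated example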

  6. Women in Headlines: Bias

    • kaggle.com
    zip
    Updated Jan 22, 2023
    Cite
    The Devastator (2023). Women in Headlines: Bias [Dataset]. https://www.kaggle.com/datasets/thedevastator/women-in-headlines-bias
    Explore at:
    zip (30108592 bytes)
    Dataset updated
    Jan 22, 2023
    Authors
    The Devastator
    Description

    Women in Headlines: Bias

    Investigating Gendered Language, Temporal Trends, and Themes

    By Amber Thomas [source]

    About this dataset

    This dataset contains all of the data used in the Pudding essay When Women Make Headlines published in January 2022. This dataset was created to analyze gendered language, bias, and language themes in news headlines from across the world. It contains headlines from the top 50 news publications and news agencies in four major countries - USA, UK, India and South Africa - as published by SimilarWeb (as of 2021-06-06).

    To collect this data we used RapidAPI's Google News API to query headlines containing one or more keywords selected based on existing research by Huimin Xu & team and The Swaddle team. We analyzed the words used in headlines by manually curating two dictionaries: gendered words about women (words that are explicitly gendered) and words that denote societal/behavioral stereotypes about women. To calculate bias scores, we utilized technology developed through Yasmeen Hitti & team's research on gender bias text analysis. To categorize the words used into themes (violence/crime, empowerment, race/ethnicity/identity, etc.), we manually curated four dictionaries, utilizing Natural Language Processing packages for Python like spaCy and NLTK for our analysis. In addition, inverting polarity scores with the vaderSentiment algorithm helped us shed light on differences between women-centered and non-women-centered polarity levels, as well as differences from the global polarity baselines of each country's most visited publications and news agencies according to SimilarWeb 2020 statistics.

    This dataset gives journalists, researchers, and educators studying gender equity in media outlets around the world further insight into potential disparities with just a few lines of code. Discoveries made using this data should provide valuable support for evidence-based argumentation and for advocating greater awareness of female representation and better-quality coverage.

    How to use the dataset

    This dataset provides a comprehensive look at the portrayal of women in headlines from 2010-2020. Using this dataset, researchers and data scientists can explore a range of topics including language used to describe women, bias associated with different topics or publications, and temporal patterns in headlines about women over time.

    To use this dataset effectively, it is helpful to understand the structure of the data. The columns include headline_no_site (the text of the headline without any information about which publication it is from), time (the date and time that the article was published), country (the country where it was published), bias score (calculated using Gender Bias Taxonomy V1.0) and year (the year that the article was published).

    By exploring these columns individually or combining them into groups, such as by publication or by topic, there are many ways to make meaningful discoveries using this dataset. For example, one could explore whether certain news outlets employ more gender-biased language when writing about female subjects than other outlets, or investigate whether female-centric stories have higher or lower bias scores than average for a particular topic across multiple countries over time. This type of analysis helps researchers gain insight into how our culture's dialogue has evolved over recent years as it relates to women in media coverage worldwide.
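
    As a rough illustration of the analyses described above, the sketch below computes average bias scores per country and year with pandas; the column names follow the description, but the exact header of the bias-score column is an assumption.

    import pandas as pd

    df = pd.read_csv("headlines_reduced_temporal.csv", parse_dates=["time"])

    # Average bias score per country and year to look at temporal trends.
    trend = df.groupby(["country", "year"])["bias_score"].mean().unstack("country")
    print(trend.tail())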

    Research Ideas

    • A comparative, cross-country study of the usage of gendered language and the prevalence of gender bias in headlines to better understand regional differences.
    • Creating an interactive visualization showing the evolution of headline bias scores over time with respect to a certain topic or population group (such as women).
    • Analyzing how different themes are covered in headlines featuring women compared to those without, such as crime or violence versus empowerment or race and ethnicity, to see if there’s any difference in how they are portrayed by the media

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: headlines_reduced_temporal.csv
    | Column name | Description |
    |:---------------------|:-------------------------------------------------------------------------------------...

  7. Addressing sample selection bias for machine learning methods (replication...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Dylan Brewer; Alyssa Carlson (2025). Addressing sample selection bias for machine learning methods (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9hZGRyZXNzaW5nLXNhbXBsZS1zZWxlY3Rpb24tYmlhcy1mb3ItbWFjaGluZS1sZWFybmluZy1tZXRob2RzLXJlcGxpY2F0aW9uLWRhdGE=
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    ZBW
    ZBW Journal Data Archive
    Journal of Applied Econometrics
    Authors
    Dylan Brewer; Alyssa Carlson
    Description

    Addressing sample selection bias for machine learning methods (replication data)

    Dylan Brewer and Alyssa Carlson

    Accepted at Journal of Applied Econometrics, 2023

    Overview

    This replication package contains files required to reproduce results, tables, and figures using Matlab and Stata. We divide the project into instructions to replicate the simulation, the result from Huang et al. (2006), and the application.

    Simulation

    For reproducing the simulation results

    Included files in *\Simulation with short descriptions:

    • SSML_simfunc: function that produces individual simulation runs
    • SSML_simulation: script that loops over the SSML_simfunc for different DGP and multiple simulation runs
    • SSML_figures: script that generates all figures for the paper
    • SSML_compilefunc: function that compiles the results from SSML_simulation for the SSML_figures script

    Steps for replicating simulation:

    1. Save SSML_simfunc, SSML_simulation, SSML_figures, SSML_compilefunc to the same folder. This location will be referred to as the FILEPATH.
    2. Create OUTPUT folder inside the FILEPATH location.
    3. Change the FILEPATH location inside SSML_simulation and SSML_figures.
    4. Run SSML_simulation to produce simulation data and results.
    5. Run SSML_figures to produce figures.

    Huang et al. replication

    For reproducing the Huang et al. (2006) replication results.

    Included files in *\HuangetalReplication with short descriptions:

    • SSML_huangrep: script that replicates the results from Huang et al. (2006)

    Obtaining the dataset:

    Go to https://archive.ics.uci.edu/dataset/14/breast+cancer and save file as "breast-cancer-wisconsin.data"

    Steps for replicating results:

    1. Save SSML_huangrep and the breast cancer data to the same folder. This location will be referred to as the FILEPATH.
    2. Change the FILEPATH location inside SSML_huangrep
    3. Run SSML_huangrep to produce results and figures.

    Application

    For reproducing the application section results.

    Included program files in *\Application with short descriptions:

    • G0_main_202308.do: Stata wrapper code that will run all application replication files
    • G1_cqclean_202308.do: Cleans election outcomes data
    • G2_cqopen_202308.do: Cleans open elections data
    • G3_demographics_cainc30_202308.do: Cleans demographics data
    • G4_fips_202308.do: Cleans FIPS code data
    • G5_klarnerclean_202308.do: Cleans Klarner gubernatorial data
    • G6_merge_202308.do: Merges cleaned datasets together
    • G7_summary_202308.do: Generates summary statistics tables and figures
    • G8_firststage_202308.do: Runs L1 penalized probit for the first stage
    • G9_prediction_202308.m: Trains learners and makes predictions
    • G10_figures_202308.m: Generates figures of prediction patterns
    • G11_final_202308.do: Generates final figures and tables of results
    • r1_lasso_alwayskeepCF_202308.do: Examines the effect of requiring that the control function not be dropped by LASSO
    • latexTable.m: Code by Eli Duenisch to write LaTeX tables from Matlab (https://www.mathworks.com/matlabcentral/fileexchange/44274-latextable)

    Included non-confidential data in subdirectory `*\Application\Data`:

    Confidential data suppressed in subdirectory `*\Application\CD`:

    These data cannot be transferred as part of the data use agreement with the CQ Press. Thus, the files are not included.

    There is no batch download--downloads for each year must be done by hand. For each year, download as many state outcomes as possible and name the files YYYYa.csv, YYYYb.csv, etc. (Example: 1970a.csv, 1970b.csv, 1970c.csv, 1970d.csv). See line 18 of G1_cqclean_202308.do for file structure information.

    Steps for replicating application:

    1. Download confidential data from the CQ Press.
    2. Change the working directory in G0_main_202308.do on line 18 to the application folder.
    3. Change local matlabpath in G0_main_202308.do on line 18 to the appropriate location.
    4. Set directory and file path in G9_prediction_202308.m and G10_figures_202308.m as necessary.
    5. Run G0_main_202308.do in Stata to run all programs.
    6. All output (figures and tables) will be saved to subdirectory *\Application\Output.

    Contact

    Contact Dylan Brewer (brewer@gatech.edu) or Alyssa Carlson (carlsonah@missouri.edu) for help with replication.

  8. Indic-Bias

    • huggingface.co
    Updated Jul 1, 2025
    Cite
    AI4Bharat (2025). Indic-Bias [Dataset]. https://huggingface.co/datasets/ai4bharat/Indic-Bias
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    AI4Bharat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

    Warning: This dataset includes content that may be considered offensive or upsetting. We present Indic-Bias, a comprehensive benchmark to evaluate the fairness of LLMs across 85 Indian Identity groups, focusing on Bias and Stereotypes. We create three tasks (Plausibility, Judgment, and Generation) and evaluate 14 popular LLMs to identify allocative and representational harms. Please… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Indic-Bias.

  9. Data from: Compliance with mandatory reporting of clinical trial results on...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 4, 2012
    Cite
    Andrew P. Prayle; Matthew N. Hurley; Alan R. Smyth (2012). Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study [Dataset]. http://doi.org/10.5061/dryad.j512f21p
    Explore at:
    zip
    Dataset updated
    Jan 4, 2012
    Dataset provided by
    Dryad
    Authors
    Andrew P. Prayle; Matthew N. Hurley; Alan R. Smyth
    Time period covered
    Dec 13, 2011
    Area covered
    United States
    Description

    • clinicaltrials.gov_search: the complete original dataset.
    • identify completed trials: the R script which, when run on "clinicaltrials.gov_search.txt", will produce a .csv file listing all the completed trials.
    • FDA_table_with_sens: the final dataset after cross-referencing the trials. An explanation of the variables is included in the supplementary file "2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in the dataset".
    • analysis_after_FDA_categorization_and_sens: the R script that reproduces the analysis from the paper, including the tables and statistical tests. The comments should make it self-explanatory.
    • 2011-11-02 prayle hurley smyth supplementary file 1 STROBE checklist: a STROBE checklist for the study.
    • 2011-10-31 Prayle Hurley Smyth Supplementary file 2 examples of categorization: a supplementary file which illustrates some of the decisions that had to be made when categorizing trials.
    • 2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in th...

  10. Effect of selection bias on estimates of relative CFR on the risk ratio (RR)...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Marc Lipsitch; Christl A. Donnelly; Christophe Fraser; Isobel M. Blake; Anne Cori; Ilaria Dorigatti; Neil M. Ferguson; Tini Garske; Harriet L. Mills; Steven Riley; Maria D. Van Kerkhove; Miguel A. Hernán (2023). Effect of selection bias on estimates of relative CFR on the risk ratio (RR) and odds ratio (OR) scale. [Dataset]. http://doi.org/10.1371/journal.pntd.0003846.t003
    Explore at:
    xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marc Lipsitch; Christl A. Donnelly; Christophe Fraser; Isobel M. Blake; Anne Cori; Ilaria Dorigatti; Neil M. Ferguson; Tini Garske; Harriet L. Mills; Steven Riley; Maria D. Van Kerkhove; Miguel A. Hernán
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Subscript P represents the population values, while subscript D represents the values measured for those cases included in the database; selection bias produces the discrepancy. The extent of selection bias may be measured as OR_S = (S_00 × S_11) / (S_01 × S_10), where S_ij is the probability that a case with exposure i (hospitalization at day 8) and outcome j (mortality) appears in the database. In this example, selection bias spuriously enhances the negative association between hospitalization on day 8 and death on all scales: RR, OR, and RD.
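
    A toy illustration of this odds-ratio measure of selection bias, with made-up inclusion probabilities that are not taken from the paper:

    # S[(i, j)] = probability that a case with exposure i and outcome j enters the database.
    S = {(0, 0): 0.30, (0, 1): 0.60, (1, 0): 0.50, (1, 1): 0.40}

    OR_s = (S[(0, 0)] * S[(1, 1)]) / (S[(0, 1)] * S[(1, 0)])
    print(f"selection odds ratio OR_s = {OR_s:.2f}")  # 1 would indicate no selection bias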

  11. 8d synthetic dataset labels from Clustering: how much bias do we need?

    • rs.figshare.com
    txt
    Updated Jun 3, 2023
    + more versions
    Cite
    Tom Lorimer; Jenny Held; Ruedi Stoop (2023). 8d synthetic dataset labels from Clustering: how much bias do we need?. [Dataset]. http://doi.org/10.6084/m9.figshare.4806571.v2
    Explore at:
    txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    The Royal Society
    Authors
    Tom Lorimer; Jenny Held; Ruedi Stoop
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific investigations in medicine and beyond increasingly require observations to be described by more features than can be simultaneously visualized. Simply reducing the dimensionality by projections destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce data bias that prevents detection of natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, where in particular we focus on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results reveal that already for very basic questions, minimizing clustering bias is essential, but that results can benefit further from biased post-processing.

  12. The 'Call me sexist but' Dataset (CMSB)

    • search.gesis.org
    Updated Oct 4, 2024
    + more versions
    Cite
    Samory, Mattia (2024). The 'Call me sexist but' Dataset (CMSB) [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2251
    Explore at:
    Dataset updated
    Oct 4, 2024
    Dataset provided by
    GESIS, Köln
    GESIS search
    Authors
    Samory, Mattia
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    This dataset consists of three types of 'short-text' content:

    1. social media posts (tweets)
    2. psychological survey items, and
    3. synthetic adversarial modifications of the former two categories.

    The tweet data can be further divided into 3 separate datasets based on their source:

    1.1 the hostile sexism dataset,
    1.2 the benevolent sexism dataset, and
    1.3 the callme sexism dataset.

    1.1 and 1.2 are pre-existing datasets obtained from Waseem, Z., & Hovy, D. (2016) and Jha, A., & Mamidi, R. (2017) that we re-annotated (see our paper and data statement for further information). The rationale for including these datasets specifically is that they feature a variety of sexist expressions in real conversational (social media) settings. In particular, they feature expressions that range from overtly antagonizing the minority gender through negative stereotypes (1.1) to leveraging positive stereotypes to subtly dismiss it as less capable and fragile (1.2).

    The callme sexism dataset (1.3) was collected by us based on the presence of the phrase 'call me sexist but' in tweets. The rationale behind this choice of query was that many Twitter users voice potentially sexist opinions and signal them with this phrase, which arguably serves as a disclaimer for sexist opinions.

    The survey items (2) pertain to attitudinal surveys that are designed to measure sexist attitudes and gender bias in participants. We provide a detailed account of our selection procedure in our paper.

    Finally, the adversarial examples are generated by crowdworkers from Amazon Mechanical Turk by making minimal changes to tweets and scale items, in order to change sexist examples to non-sexist ones. We hope that these examples will help us control for typical confounds in non-sexist data (e.g., topic, civility) and lead to datasets with fewer biases, and consequently allow us to train more robust machine learning models. We only asked to turn sexist examples into non-sexist ones, and not vice versa, for ethical reasons.

    The dataset is annotated to capture cases where text is sexist because of its content (what the speaker believes) or its phrasing (the speaker's choice of words). We explain the rationale for this codebook in our paper cited below.

  13. Addressing publication bias in meta-analysis: Empirical findings from...

    • demo-b2find.dkrz.de
    Updated Jun 11, 2019
    Cite
    (2019). Addressing publication bias in meta-analysis: Empirical findings from community-augmented meta-analyses of infant language development [Dataset]. http://demo-b2find.dkrz.de/dataset/035516de-986a-5dd6-8b4e-cc997812a7b3
    Explore at:
    Dataset updated
    Jun 11, 2019
    Description

    Meta-analyses have long been an indispensable research synthesis tool for characterizing bodies of literature and advancing theories. However, they have been facing the same challenges as primary literature in the context of the replication crisis: a meta-analysis is only as good as the data it contains, and which data end up in the final sample can be influenced at various stages of the process. Early on, the selection of topic and search strategies might be biased by the meta-analyst's subjective decisions. Further, publication bias towards significant outcomes in primary studies might skew the search outcome, where grey, unpublished literature might not show up. Additional challenges might arise during data extraction from articles in the final search sample, for example because some articles might not contain sufficient detail for computing effect sizes and correctly characterizing moderator variables, or due to specific decisions of the meta-analyst during data extraction from multi-experiment papers.

    Community-augmented meta-analyses (CAMAs; Tsuji, Bergmann, & Cristia, 2014) have received increasing interest as a tool for countering the above-mentioned problems. CAMAs are open-access, online meta-analyses. In the original proposal, they allow the use and addition of data points by the research community, making it possible to collectively shape the scope of a meta-analysis and encouraging the submission of unpublished or inaccessible data points. As such, CAMAs can counter biases introduced by data (in)availability and by the researcher. In addition, their dynamic nature keeps a meta-analysis, otherwise crystallized at the time of publication and quickly outdated, up to date.

    We have now been implementing CAMAs over the past four years in MetaLab (metalab.stanford.edu), a database gathering meta-analyses in Developmental Psychology and focused on infancy. Meta-analyses are updated through centralized, active curation. We here describe our successes and failures with gathering missing data, as well as quantify how the addition of these data points changes the outcomes of meta-analyses. First, we ask which strategies to counter publication bias are fruitful. To answer this question, we evaluate efforts to gather data not readily accessible by database searches, which applies both to unpublished literature and to data not reported in published articles. Based on this investigation, we conclude that classical tools like database and citation searches can already contribute an important amount of grey literature. Furthermore, directly contacting authors is a fruitful way to get access to missing information. We then address whether and how including or excluding grey literature from a selection of meta-analyses impacts results, both in terms of indices of publication bias and in terms of main meta-analytic outcomes. Here, we find no differences in funnel plot asymmetry, but (as could be expected) a decrease in meta-analytic effect sizes. Based on these experiences, we finish with lessons learned and recommendations that can be generalized for meta-analysts beyond the field of infant research, in order to get the most out of the CAMA framework and to gather maximally unbiased datasets.

  14. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
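
    As a rough companion to the strategies described in the guide, the sketch below contrasts listwise deletion with mean imputation on a toy data frame using scikit-learn; the example data are illustrative only.

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [23, 31, np.nan, 45, 52],
                       "income": [48000, np.nan, 61000, 75000, np.nan]})

    # Listwise deletion: drop any row with a missing value (unbiased under MCAR,
    # potentially biased under MAR/MNAR).
    deleted = df.dropna()

    # Mean imputation: replace each missing value with the column mean.
    mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                                columns=df.columns)

    print(deleted)
    print(mean_imputed)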

  15. Data from: Sampling bias exaggerates a textbook example of a trophic cascade...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Nov 3, 2021
    Cite
    Elaine Brice; Eric Larsen; Daniel MacNulty (2021). Sampling bias exaggerates a textbook example of a trophic cascade [Dataset]. http://doi.org/10.5061/dryad.2z34tmpnj
    Explore at:
    zip
    Dataset updated
    Nov 3, 2021
    Dataset provided by
    Dryad
    Authors
    Elaine Brice; Eric Larsen; Daniel MacNulty
    Time period covered
    Oct 14, 2021
    Description

    We measured browsing and height of young aspen (≥ 1 year-old) in 113 plots distributed randomly across the study area (Fig. 1). Each plot was a 1 × 20 m belt transect located randomly within an aspen stand that was itself randomly selected from an inventory of stands with respect to high and low wolf-use areas (Ripple et al. 2001). The inventory was a list of 992 grid cells (240 × 360 m) that contained at least one stand (Appendix S1). A “stand” was a group of tree-size aspen (>10 cm diameter at breast height) in which each tree was ≤ 30 m from every other tree. One hundred and thirteen grid cells were randomly selected from the inventory (~11% of 992 cells), one stand was randomly selected from each cell, and one plot was randomly established in each stand. Each plot likely represented a genetically-independent sample (Appendix S1).

    We measured aspen at the end of the growing season (late July to September), focusing on plants ≤ 600 cm tall, which we termed “young aspen.” For each ...

  16. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This dataset provides an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (a combination of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
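
    The minimal scikit-learn sketch below ties several of these concepts together: a train/test split, a supervised learner, and accuracy as an evaluation metric.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold out test data to estimate generalization rather than training performance.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))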

  17. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  18. MARB

    • demo.researchdata.se
    • researchdata.se
    Updated Jun 5, 2025
    Cite
    Södahl Bladsjö, Tom; Muñoz Sánchez, Ricardo (2025). MARB [Dataset]. http://doi.org/10.23695/V3WP-6C64
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    University of Gothenburg
    Authors
    Södahl Bladsjö, Tom; Muñoz Sánchez, Ricardo
    Description

    Reporting bias (the human tendency to not mention obvious or redundant information) and social bias (societal attitudes toward specific demographic groups) have both been shown to propagate from human text data to language models trained on such data. However, the two phenomena have not previously been studied in combination. The MARB dataset was developed to begin to fill this gap by studying the interaction between social biases and reporting bias in language models. Unlike many existing benchmark datasets, MARB does not rely on artificially constructed templates or crowdworkers to create contrasting examples. Instead, the templates used in MARB are based on naturally occurring written language from the 2021 version of the enTenTen corpus (Jakubíček et al., 2013).

  19. Replication Data for: Reducing Political Bias in Political Science Estimates...

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Zigerell, Lawrence (2023). Replication Data for: Reducing Political Bias in Political Science Estimates [Dataset]. http://doi.org/10.7910/DVN/PZLCJM
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Zigerell, Lawrence
    Description

    Political science researchers have flexibility in how to analyze data, how to report data, and whether to report on data. Review of examples of reporting flexibility from the race and sex discrimination literature illustrates how research design choices can influence estimates and inferences. This reporting flexibility—coupled with the political imbalance among political scientists—creates the potential for political bias in reported political science estimates, but this potential for political bias can be reduced or eliminated through preregistration and preacceptance, in which researchers commit to a research design before completing data collection. Removing the potential for reporting flexibility can raise the credibility of political science research.

  20. Fair RecSys Datasets

    • data.niaid.nih.gov
    Updated Feb 22, 2023
    Cite
    Kowald Dominik (2023). Fair RecSys Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6123878
    Explore at:
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    Know-Center (http://know-center.at/)
    Authors
    Kowald Dominik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Four multimedia recommender systems datasets to study popularity bias and fairness:

    Last.fm (lfm.zip), based on the LFM-1b dataset of JKU Linz (http://www.cp.jku.at/datasets/LFM-1b/)

    MovieLens (ml.zip), based on MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/)

    BookCrossing (book.zip), based on the BookCrossing dataset of Uni Freiburg (http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

    MyAnimeList (anime.zip), based on the MyAnimeList dataset of Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database)

    Each dataset consists of user interactions (user_events.txt) and three user groups that differ in their inclination to popular/mainstream items: LowPop (low_main_users.txt), MedPop (med_main_users.txt), and HighPop (high_main_users.txt).

    The format of the three user files is "user,mainstreaminess"

    The format of the user-events files is "user,item,preference"

    Example Python code for analyzing the datasets, as well as more information on the user groups, can be found on GitHub (https://github.com/domkowald/FairRecSys) and arXiv (https://arxiv.org/abs/2203.00376).
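
    A minimal sketch for loading one of the datasets based on the file formats described above; whether the files include header rows, and the name of the unzipped folder, are assumptions.

    import pandas as pd

    events = pd.read_csv("lfm/user_events.txt", names=["user", "item", "preference"])
    low_pop = pd.read_csv("lfm/low_main_users.txt", names=["user", "mainstreaminess"])

    # Restrict interactions to the LowPop user group, e.g. to compare how often
    # each group interacts with popular items.
    low_pop_events = events[events["user"].isin(low_pop["user"])]
    print(low_pop_events.head())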
