Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a purely synthetic dataset created to help educators and researchers understand fairness definitions. It is a convenient way to illustrate differences between definitions such as fairness through unawareness, group fairness, statistical parity, predictive parity, equalised odds, and treatment equality. The dataset contains multiple sensitive features: age, gender, and lives-near-by. These can be combined to define many different sensitive groups. The dataset also contains the decisions of five example decision methods that can be evaluated, so you do not need to train your own methods; instead, you can focus on evaluating the existing models.
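As a rough illustration of how such definitions can be checked against one of the example decision methods, the sketch below computes per-group positive-decision rates (statistical parity) and true/false-positive rates (equalised odds) on a toy data frame; the column names gender, decision, and outcome are placeholders rather than the dataset's actual schema.

import pandas as pd

df = pd.DataFrame({                    # hypothetical stand-in for the dataset
    "gender":   ["m", "m", "f", "f", "m", "f"],
    "decision": [1, 0, 1, 0, 1, 0],    # output of one example decision method
    "outcome":  [1, 0, 0, 0, 1, 1],    # ground-truth label
})

# Statistical parity: P(decision = 1 | group) should be equal across groups.
parity = df.groupby("gender")["decision"].mean()
print("positive-decision rate per group:\n", parity)

# Equalised odds: compare true-positive and false-positive rates per group.
for g, grp in df.groupby("gender"):
    tpr = grp.loc[grp.outcome == 1, "decision"].mean()
    fpr = grp.loc[grp.outcome == 0, "decision"].mean()
    print(f"group={g}: TPR={tpr:.2f}, FPR={fpr:.2f}")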
This dataset is described and analysed in the following paper. Please cite this paper when using this dataset:
Burda, P. and Van Otterloo, S. 2024. Fairness definitions explained and illustrated with examples. Computers and Society Research Journal, 2025 (2). https://doi.org/10.54822/PASR6281
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Marketing Bias dataset captures the interactions between users and products on ModCloth and Amazon Electronics, emphasizing the potential marketing bias inherent in product recommendations. This bias is explored through attributes related to product marketing and user/item interactions.
Basic Statistics:
- ModCloth:
- Reviews: 99,893
- Items: 1,020
- Users: 44,783
- Bias Type: Body Shape
Metadata:
- Ratings
- Product Images
- User Identities
- Item Sizes, User Genders
Example (ModCloth): The data example provided showcases a snippet from ModCloth data with columns like item_id, user_id, rating, timestamp, size, fit, user_attr, model_attr, and others.
Download Links: Visit the project page for download links.
Citation: If you utilize this dataset, please cite the following:
Title: Addressing Marketing Bias in Product Recommendations
Authors: Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley
Published In: WSDM, 2020
Dataset Files:
- df_electronics.csv
- df_modcloth.csv
The dataset is structured to provide a comprehensive overview of user-item interactions and attributes that may contribute to marketing bias, making it a valuable resource for anyone investigating marketing strategies and recommendation systems.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Selection bias is an important but often neglected problem in comparative research. While comparative case studies pay some attention to this problem, this is less the case in broader cross-national studies, where this problem may appear through the way the data used are generated. The article discusses three examples: studies of the success of newly formed political parties, research on protest events, and recent work on ethnic conflict. In all cases the data at hand are likely to be afflicted by selection bias. Failing to take into consideration this problem leads to serious biases in the estimation of simple relationships. Empirical examples illustrate a possible solution (a variation of a Tobit model) to the problems in these cases. The article also discusses results of Monte Carlo simulations, illustrating under what conditions the proposed estimation procedures lead to improved results.
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
Metadata includes
ratings
product images
user identities
item sizes, user genders
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
By Amber Thomas [source]
This dataset contains all of the data used in the Pudding essay When Women Make Headlines, published in January 2022. It was created to analyze gendered language, bias, and language themes in news headlines from across the world. It contains headlines from the top 50 news publications and news agencies of four major countries - the USA, the UK, India, and South Africa - as listed by SimilarWeb (as of 2021-06-06).
To collect this data, we used RapidAPI's Google News API to query headlines containing one or more keywords selected based on existing research done by Huimin Xu & team and The Swaddle team. We analyzed the words used in headlines by manually curating two dictionaries: gendered words about women (words that are explicitly gendered) and words that denote societal/behavioral stereotypes about women. To calculate bias scores, we utilized technology developed through Yasmeen Hitti & team's research on gender bias text analysis. To categorize words into themes (violence/crime, empowerment, race/ethnicity/identity, etc.), we manually curated four dictionaries, using Python Natural Language Processing packages such as spaCy and NLTK for our analysis. Polarity scores from the vaderSentiment algorithm helped us examine differences between women-centered and non-women-centered headlines, as well as differences from the global polarity baselines of each country's most visited publications and news agencies according to SimilarWeb 2020 statistics.
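For reference, a minimal sketch of the polarity step using the vaderSentiment package; the headline string here is invented for illustration only.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headline = "Woman praised for heroic rescue"   # hypothetical example headline
scores = analyzer.polarity_scores(headline)    # dict with 'neg', 'neu', 'pos', 'compound'
print(scores["compound"])                      # compound polarity in [-1, 1]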
This dataset gives journalists, researchers, and educators investigating gender equity in media outlets around the world further insight into potential disparities with just a few lines of code. Discoveries made with this data should provide valuable support for evidence-based argumentation and for advocating greater awareness of female representation and better-quality coverage.
This dataset provides a comprehensive look at the portrayal of women in headlines from 2010-2020. Using this dataset, researchers and data scientists can explore a range of topics including language used to describe women, bias associated with different topics or publications, and temporal patterns in headlines about women over time.
To use this dataset effectively, it is helpful to understand the structure of the data. The columns include headline_no_site (the text of the headline without any information about which publication it is from), time (the date and time that the article was published), country (the country where it was published), bias score (calculated using Gender Bias Taxonomy V1.0) and year (the year that the article was published).
By exploring these columns individually or combining them into groups such as by publication or by topic, there are many ways to make meaningful discoveries using this dataset. For example, one could explore whether certain news outlets employ more gender-biased language when writing about female subjects than other outlets, or investigate whether female-centric stories have higher or lower bias scores than average for a particular topic across multiple countries over time. This type of analysis helps researchers gain insight into how our culture's dialogue about women in media coverage worldwide has evolved over recent years.
- A comparative, cross-country study of the usage of gendered language and the prevalence of gender bias in headlines to better understand regional differences.
- Creating an interactive visualization showing the evolution of headline bias scores over time with respect to a certain topic or population group (such as women).
- Analyzing how different themes are covered in headlines featuring women compared to those without, such as crime or violence versus empowerment or race and ethnicity, to see if there is any difference in how they are portrayed by the media.
If you use this dataset in your research, please credit the original authors and the data source.
See the dataset description for more information.
File: headlines_reduced_temporal.csv
Dylan Brewer and Alyssa Carlson
Accepted at Journal of Applied Econometrics, 2023
This replication package contains files required to reproduce results, tables, and figures using Matlab and Stata. We divide the project into instructions to replicate the simulation, the results from Huang et al. (2006), and the application.
For reproducing the simulation results
SSML_simfunc: function that produces individual simulation runs
SSML_simulation: script that loops over SSML_simfunc for different DGPs and multiple simulation runs
SSML_figures: script that generates all figures for the paper
SSML_compilefunc: function that compiles the results from SSML_simulation for the SSML_figures script
Save SSML_simfunc, SSML_simulation, SSML_figures, and SSML_compilefunc to the same folder; this location will be referred to as the FILEPATH. Set the FILEPATH location inside SSML_simulation and SSML_figures. Run SSML_simulation to produce simulation data and results, then run SSML_figures to produce figures.
For reproducing the Huang et al. (2006) replication results
Files in *\HuangetalReplication with short descriptions:
SSML_huangrep: script that replicates the results from Huang et al. (2006)
Go to https://archive.ics.uci.edu/dataset/14/breast+cancer and save the file as "breast-cancer-wisconsin.data". Save SSML_huangrep and the breast cancer data to the same folder; this location will be referred to as the FILEPATH. Set the FILEPATH location inside SSML_huangrep, then run SSML_huangrep to produce results and figures.
For reproducing the application section results
Files in *\Application with short descriptions:
G0_main_202308.do: Stata wrapper code that will run all application replication files
G1_cqclean_202308.do: Cleans election outcomes data
G2_cqopen_202308.do: Cleans open elections data
G3_demographics_cainc30_202308.do: Cleans demographics data
G4_fips_202308.do: Cleans FIPS code data
G5_klarnerclean_202308.do: Cleans Klarner gubernatorial data
G6_merge_202308.do: Merges cleaned datasets together
G7_summary_202308.do: Generates summary statistics tables and figures
G8_firststage_202308.do: Runs L1-penalized probit for the first stage
G9_prediction_202308.m: Trains learners and makes predictions
G10_figures_202308.m: Generates figures of prediction patterns
G11_final_202308.do: Generates final figures and tables of results
r1_lasso_alwayskeepCF_202308.do: Examines the effect of requiring that the control function is not dropped from LASSO
latexTable.m: Code by Eli Duenisch to write LaTeX tables from Matlab (https://www.mathworks.com/matlabcentral/fileexchange/44274-latextable)
\CAINC30: County-level income and demographics data from the BEA
\CPI: CPI data from the BLS
\KlarnerGovernors: Carl Klarner's Governors Dataset, available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/20408
\CQ_county: County-level election outcomes, available from http://library.cqpress.com/elections/login.php?requested=%2Felections%2Fdownload-data.php
\CQ_open: Open elections data, available from http://library.cqpress.com/elections/advsearch/elections-with-open-seats-results.php?open_year1=1968&open_year2=2019&open_office=4
The CQ Press data cannot be transferred as part of the data use agreement with CQ Press, so those files are not included. There is no batch download; downloads for each year must be done by hand. For each year, download as many state outcomes as possible and name the files YYYYa.csv, YYYYb.csv, etc. (example: 1970a.csv, 1970b.csv, 1970c.csv, 1970d.csv). See line 18 of G1_cqclean_202308.do for file structure information.
Set the path on line 18 of G0_main_202308.do to the application folder, and set matlabpath in G0_main_202308.do to the appropriate location. Adjust paths in G9_prediction_202308.m and G10_figures_202308.m as necessary. Run G0_main_202308.do in Stata to run all programs; output is written to *\Application\Output. Contact Dylan Brewer (brewer@gatech.edu) or Alyssa Carlson (carlsonah@missouri.edu) for help with replication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Warning: This dataset includes content that may be considered offensive or upsetting. We present Indic-Bias, a comprehensive benchmark to evaluate the fairness of LLMs across 85 Indian identity groups, focusing on bias and stereotypes. We create three tasks - Plausibility, Judgment, and Generation - and evaluate 14 popular LLMs to identify allocative and representational harms. Please… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Indic-Bias.
clinicaltrials.gov_search: This is the complete original dataset.
identify completed trials: This is the R script which, when run on "clinicaltrials.gov_search.txt", will produce a .csv file that lists all the completed trials.
FDA_table_with_sens: This is the final dataset after cross-referencing the trials. An explanation of the variables is included in the supplementary file "2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in the dataset".
analysis_after_FDA_categorization_and_sens: This R script reproduces the analysis from the paper, including the tables and statistical tests. The comments should make it self-explanatory.
2011-11-02 prayle hurley smyth supplementary file 1 STROBE checklist: This is a STROBE checklist for the study.
2011-10-31 Prayle Hurley Smyth Supplementary file 2 examples of categorization: This is a supplementary file which illustrates some of the decisions that had to be made when categorizing trials.
2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in th...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subscript P represents the population values, while subscript D represents the values measured for those cases included in the database; selection bias produces the discrepancy. The extent of selection bias may be measured as the selection odds ratio OR_S = (S_00 × S_11) / (S_01 × S_10), where S_ij is the probability that a case with exposure i (hospitalization at day 8) and outcome j (mortality) appears in the database. In this example, selection bias spuriously enhances the negative association between hospitalization on day 8 and death on all scales: RR, OR, and RD.
Effect of selection bias on estimates of relative CFR on the risk ratio (RR) and odds ratio (OR) scale.
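As a toy numeric illustration of this formula (the selection probabilities below are made up, not taken from the study):

# Selection odds ratio OR_S = (S_00 * S_11) / (S_01 * S_10), where S_ij is the
# probability that a case with exposure i and outcome j enters the database.
S = {(0, 0): 0.50, (0, 1): 0.80,   # unexposed cases, by outcome (hypothetical)
     (1, 0): 0.70, (1, 1): 0.40}   # exposed (day-8 hospitalised) cases (hypothetical)

or_s = (S[(0, 0)] * S[(1, 1)]) / (S[(0, 1)] * S[(1, 0)])
print(f"selection odds ratio OR_S = {or_s:.2f}")   # OR_S != 1 signals selection bias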
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific investigations in medicine and beyond increasingly require observations to be described by more features than can be simultaneously visualized. Simply reducing the dimensionality by projections destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce data bias that prevents detection of natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, focusing in particular on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results reveal that already for very basic questions, minimizing clustering bias is essential, but that results can benefit further from biased post-processing.
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
This dataset consists of three types of 'short-text' content:
1. social media posts (tweets)
2. psychological survey items, and
3. synthetic adversarial modifications of the former two categories.
The tweet data can be further divided into 3 separate datasets based on their source:
1.1 the hostile sexism dataset,
1.2 the benevolent sexism dataset, and
1.3 the callme sexism dataset.
1.1 and 1.2 are pre-existing datasets obtained from Waseem, Z., & Hovy, D. (2016) and Jha, A., & Mamidi, R. (2017) that we re-annotated (see our paper and data statement for further information). The rationale for including these datasets specifically is that they feature a variety of sexist expressions in real conversational (social media) settings. In particular, they feature expressions that range from overtly antagonizing the minority gender through negative stereotypes (1.1) to leveraging positive stereotypes to subtly dismiss it as less capable and fragile (1.2).
The callme sexism dataset (1.3) was collected by us based on the presence of the phrase 'call me sexist but' in tweets. The rationale behind this choice of query was that several Twitter users voice potentially sexist opinions and signal them with this phrase, which arguably serves as a disclaimer for sexist opinions.
The survey items (2) pertain to attitudinal surveys that are designed to measure sexist attitudes and gender bias in participants. We provide a detailed account of our selection procedure in our paper.
Finally, the adversarial examples were generated by crowdworkers from Amazon Mechanical Turk, who made minimal changes to tweets and scale items in order to turn sexist examples into non-sexist ones. We hope that these examples will help control for typical confounds in non-sexist data (e.g., topic, civility), lead to datasets with fewer biases, and consequently allow us to train more robust machine learning models. For ethical reasons, we only asked workers to turn sexist examples into non-sexist ones, and not vice versa.
The dataset is annotated to capture cases where text is sexist because of its content (what the speaker believes) or its phrasing (the speaker's choice of words). We explain the rationale for this codebook in our paper cited below.
Meta-analyses have long been an indispensable research synthesis tool for characterizing bodies of literature and advancing theories. However, they have been facing the same challenges as primary literature in the context of the replication crisis: a meta-analysis is only as good as the data it contains, and which data end up in the final sample can be influenced at various stages of the process. Early on, the selection of topic and search strategies might be biased by the meta-analyst's subjective decisions. Further, publication bias towards significant outcomes in primary studies might skew the search outcome, where grey, unpublished literature might not show up. Additional challenges might arise during data extraction from articles in the final search sample, for example because some articles might not contain sufficient detail for computing effect sizes and correctly characterizing moderator variables, or due to specific decisions of the meta-analyst during data extraction from multi-experiment papers. Community-augmented meta-analyses (CAMAs; Tsuji, Bergmann, & Cristia, 2014) have received increasing interest as a tool for countering the above-mentioned problems. CAMAs are open-access, online meta-analyses. In the original proposal, they allow the use and addition of data points by the research community, making it possible to collectively shape the scope of a meta-analysis and encouraging the submission of unpublished or inaccessible data points. As such, CAMAs can counter biases introduced by data (in)availability and by the researcher. In addition, their dynamic nature serves to keep a meta-analysis, otherwise crystallized at the time of publication and quickly outdated, up to date. We have now been implementing CAMAs over the past four years in MetaLab (metalab.stanford.edu), a database gathering meta-analyses in Developmental Psychology and focused on infancy. Meta-analyses are updated through centralized, active curation. We here describe our successes and failures with gathering missing data, as well as quantify how the addition of these data points changes the outcomes of meta-analyses. First, we ask which strategies to counter publication bias are fruitful. To answer this question, we evaluate efforts to gather data not readily accessible by database searches, which applies both to unpublished literature and to data not reported in published articles. Based on this investigation, we conclude that classical tools like database and citation searches can already contribute an important amount of grey literature. Furthermore, directly contacting authors is a fruitful way to get access to missing information. We then address whether and how including or excluding grey literature from a selection of meta-analyses impacts results, both in terms of indices of publication bias and in terms of main meta-analytic outcomes. Here, we find no differences in funnel plot asymmetry, but (as could be expected) a decrease in meta-analytic effect sizes. Based on these experiences, we finish with lessons learned and recommendations that can be generalized for meta-analysts beyond the field of infant research in order to get the most out of the CAMA framework and to gather maximally unbiased datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
We measured browsing and height of young aspen (≥ 1 year-old) in 113 plots distributed randomly across the study area (Fig. 1). Each plot was a 1 × 20 m belt transect located randomly within an aspen stand that was itself randomly selected from an inventory of stands with respect to high and low wolf-use areas (Ripple et al. 2001). The inventory was a list of 992 grid cells (240 × 360 m) that contained at least one stand (Appendix S1). A "stand" was a group of tree-size aspen (>10 cm diameter at breast height) in which each tree was ≤ 30 m from every other tree. One hundred and thirteen grid cells were randomly selected from the inventory (~11% of 992 cells), one stand was randomly selected from each cell, and one plot was randomly established in each stand. Each plot likely represented a genetically-independent sample (Appendix S1).
We measured aspen at the end of the growing season (late July to September), focusing on plants ≤ 600 cm tall, which we termed “young aspen.” For each ...
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
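As a minimal sketch of this split, assuming scikit-learn and its bundled Iris data (neither is mentioned above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% training data, 20% test data
print(X_train.shape, X_test.shape)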
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and the F1 score (the harmonic mean of precision and recall).
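A short sketch of these metrics, assuming scikit-learn and a made-up pair of label/prediction vectors:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels (e.g., spam = 1)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))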
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
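A rough illustration of this idea, assuming scikit-learn and synthetic noisy data; the degrees 1, 4, and 15 are arbitrary choices meant to show underfitting, a reasonable fit, and overfitting:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]            # one input feature
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # A large gap between train and test scores indicates overfitting.
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))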
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
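For instance, a minimal sketch of fitting one of these algorithms (logistic regression) with scikit-learn, using its bundled breast-cancer data as a stand-in task:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # larger max_iter helps convergence here
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))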
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
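A brief sketch combining two of these techniques (k-means clustering and PCA) on synthetic data, assuming scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # cluster assignments
X_2d = PCA(n_components=2).fit_transform(X)   # reduce 5 features to 2 for plotting
print(labels[:10], X_2d.shape)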
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Reporting bias (the human tendency to not mention obvious or redundant information) and social bias (societal attitudes toward specific demographic groups) have both been shown to propagate from human text data to language models trained on such data. However, the two phenomena have not previously been studied in combination. The MARB dataset was developed to begin to fill this gap by studying the interaction between social biases and reporting bias in language models. Unlike many existing benchmark datasets, MARB does not rely on artificially constructed templates or crowdworkers to create contrasting examples. Instead, the templates used in MARB are based on naturally occurring written language from the 2021 version of the enTenTen corpus (Jakubíček et al., 2013).
Political science researchers have flexibility in how to analyze data, how to report data, and whether to report on data. Review of examples of reporting flexibility from the race and sex discrimination literature illustrates how research design choices can influence estimates and inferences. This reporting flexibility, coupled with the political imbalance among political scientists, creates the potential for political bias in reported political science estimates, but this potential for political bias can be reduced or eliminated through preregistration and preacceptance, in which researchers commit to a research design before completing data collection. Removing the potential for reporting flexibility can raise the credibility of political science research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four multimedia recommender systems datasets to study popularity bias and fairness:
Last.fm (lfm.zip), based on the LFM-1b dataset of JKU Linz (http://www.cp.jku.at/datasets/LFM-1b/)
MovieLens (ml.zip), based on MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/)
BookCrossing (book.zip), based on the BookCrossing dataset of Uni Freiburg (http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
MyAnimeList (anime.zip), based on the MyAnimeList dataset of Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database)
Each dataset consists of user interactions (user_events.txt) and three user groups that differ in their inclination to popular/mainstream items: LowPop (low_main_users.txt), MedPop (med_main_users.txt), and HighPop (high_main_users.txt).
The format of the three user files is "user,mainstreaminess".
The format of the user-events file is "user,item,preference".
Example Python code for analyzing the datasets, as well as more information on the user groups, can be found on GitHub (https://github.com/domkowald/FairRecSys) and on arXiv (https://arxiv.org/abs/2203.00376).
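For example, a minimal loading sketch based on the file formats stated above; whether the files include a header row is an assumption here, so column names are set explicitly:

import pandas as pd

events = pd.read_csv("user_events.txt", names=["user", "item", "preference"])
low_pop = pd.read_csv("low_main_users.txt", names=["user", "mainstreaminess"])

# Restrict interactions to the LowPop user group, e.g. to study popularity bias.
low_pop_events = events[events["user"].isin(low_pop["user"])]
print(len(events), len(low_pop_events))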