Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a purely synthetic dataset created to help educators and researchers understand fairness definitions. It is a convenient way to illustrate differences between definitions such as fairness through unawareness, group fairness, statistical parity, predictive parity, equalised odds, and treatment equality. The dataset contains multiple sensitive features: age, gender, and lives-near-by. These can be combined to define many different sensitive groups. The dataset also contains the decisions of five example decision methods that can be evaluated, so you do not need to train your own methods; instead, you can focus on evaluating the existing models.
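As a rough illustration of how such definitions can be checked against one of the example decision methods, the sketch below computes per-group positive-decision rates (statistical parity) and true/false-positive rates (equalised odds) on a toy data frame; the column names gender, decision, and outcome are placeholders rather than the dataset's actual schema.

import pandas as pd

df = pd.DataFrame({                    # hypothetical stand-in for the dataset
    "gender":   ["m", "m", "f", "f", "m", "f"],
    "decision": [1, 0, 1, 0, 1, 0],    # output of one example decision method
    "outcome":  [1, 0, 0, 0, 1, 1],    # ground-truth label
})

# Statistical parity: P(decision = 1 | group) should be equal across groups.
parity = df.groupby("gender")["decision"].mean()
print("positive-decision rate per group:\n", parity)

# Equalised odds: compare true-positive and false-positive rates per group.
for g, grp in df.groupby("gender"):
    tpr = grp.loc[grp.outcome == 1, "decision"].mean()
    fpr = grp.loc[grp.outcome == 0, "decision"].mean()
    print(f"group={g}: TPR={tpr:.2f}, FPR={fpr:.2f}")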
This dataset is described and analysed in the following paper. Please cite this paper when using this dataset:
Burda, P. and Van Otterloo, S. 2024. Fairness definitions explained and illustrated with examples. Computers and Society Research Journal, 2025 (2). https://doi.org/10.54822/PASR6281
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Marketing Bias dataset captures the interactions between users and products on ModCloth and Amazon Electronics, emphasizing the potential marketing bias inherent in product recommendations. This bias is explored through attributes related to product marketing and user/item interactions.
Basic Statistics:
- ModCloth:
- Reviews: 99,893
- Items: 1,020
- Users: 44,783
- Bias Type: Body Shape
Metadata:
- Ratings
- Product Images
- User Identities
- Item Sizes, User Genders
Example (ModCloth): The data example provided showcases a snippet from ModCloth data with columns like item_id, user_id, rating, timestamp, size, fit, user_attr, model_attr, and others.
Download Links: Visit the project page for download links.
Citation: If you utilize this dataset, please cite the following:
Title: Addressing Marketing Bias in Product Recommendations
Authors: Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley
Published In: WSDM, 2020
Dataset Files:
- df_electronics.csv
- df_modcloth.csv
The dataset is structured to provide a comprehensive overview of user-item interactions and attributes that may contribute to marketing bias, making it a valuable resource for anyone investigating marketing strategies and recommendation systems.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Selection bias is an important but often neglected problem in comparative research. While comparative case studies pay some attention to this problem, this is less the case in broader cross-national studies, where this problem may appear through the way the data used are generated. The article discusses three examples: studies of the success of newly formed political parties, research on protest events, and recent work on ethnic conflict. In all cases the data at hand are likely to be afflicted by selection bias. Failing to take into consideration this problem leads to serious biases in the estimation of simple relationships. Empirical examples illustrate a possible solution (a variation of a Tobit model) to the problems in these cases. The article also discusses results of Monte Carlo simulations, illustrating under what conditions the proposed estimation procedures lead to improved results.
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
Metadata includes
ratings
product images
user identities
item sizes, user genders
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.
By Amber Thomas [source]
This dataset contains all of the data used in the Pudding essay When Women Make Headlines, published in January 2022. It was created to analyze gendered language, bias, and language themes in news headlines from across the world. It contains headlines from the top 50 news publications and news agencies of four major countries - the USA, the UK, India, and South Africa - as listed by SimilarWeb (as of 2021-06-06).
To collect this data, we used RapidAPI's Google News API to query headlines containing one or more keywords selected based on existing research done by Huimin Xu & team and The Swaddle team. We analyzed the words used in headlines by manually curating two dictionaries: gendered words about women (words that are explicitly gendered) and words that denote societal/behavioral stereotypes about women. To calculate bias scores, we utilized technology developed through Yasmeen Hitti & team's research on gender bias text analysis. To categorize words into themes (violence/crime, empowerment, race/ethnicity/identity, etc.), we manually curated four dictionaries, using Python Natural Language Processing packages such as spaCy and NLTK for our analysis. Polarity scores from the vaderSentiment algorithm helped us examine differences between women-centered and non-women-centered headlines, as well as differences from the global polarity baselines of each country's most visited publications and news agencies according to SimilarWeb 2020 statistics.
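For reference, a minimal sketch of the polarity step using the vaderSentiment package; the headline string here is invented for illustration only.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headline = "Woman praised for heroic rescue"   # hypothetical example headline
scores = analyzer.polarity_scores(headline)    # dict with 'neg', 'neu', 'pos', 'compound'
print(scores["compound"])                      # compound polarity in [-1, 1]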
This dataset gives journalists, researchers, and educators investigating gender equity in media outlets around the world further insight into potential disparities with just a few lines of code. Discoveries made with this data should provide valuable support for evidence-based argumentation and for advocating greater awareness of female representation and better-quality coverage.
This dataset provides a comprehensive look at the portrayal of women in headlines from 2010-2020. Using this dataset, researchers and data scientists can explore a range of topics including language used to describe women, bias associated with different topics or publications, and temporal patterns in headlines about women over time.
To use this dataset effectively, it is helpful to understand the structure of the data. The columns include headline_no_site (the text of the headline without any information about which publication it is from), time (the date and time that the article was published), country (the country where it was published), bias score (calculated using Gender Bias Taxonomy V1.0) and year (the year that the article was published).
By exploring these columns individually or combining them into groups such as by publication or by topic, there are many ways to make meaningful discoveries using this dataset. For example, one could explore whether certain news outlets employ more gender-biased language when writing about female subjects than other outlets, or investigate whether female-centric stories have higher or lower bias scores than average for a particular topic across multiple countries over time. This type of analysis helps researchers gain insight into how our culture's dialogue about women in media coverage worldwide has evolved over recent years.
- A comparative, cross-country study of the usage of gendered language and the prevalence of gender bias in headlines to better understand regional differences.
- Creating an interactive visualization showing the evolution of headline bias scores over time with respect to a certain topic or population group (such as women).
- Analyzing how different themes are covered in headlines featuring women compared to those without, such as crime or violence versus empowerment or race and ethnicity, to see if there is any difference in how they are portrayed by the media.
If you use this dataset in your research, please credit the original authors and the data source.
See the dataset description for more information.
File: headlines_reduced_temporal.csv
Dylan Brewer and Alyssa Carlson
Accepted at Journal of Applied Econometrics, 2023
This replication package contains files required to reproduce results, tables, and figures using Matlab and Stata. We divide the project into instructions to replicate the simulation, the results from Huang et al. (2006), and the application.
For reproducing the simulation results
SSML_simfunc: function that produces individual simulation runs
SSML_simulation: script that loops over SSML_simfunc for different DGPs and multiple simulation runs
SSML_figures: script that generates all figures for the paper
SSML_compilefunc: function that compiles the results from SSML_simulation for the SSML_figures script
Save SSML_simfunc, SSML_simulation, SSML_figures, and SSML_compilefunc to the same folder; this location will be referred to as the FILEPATH. Set the FILEPATH location inside SSML_simulation and SSML_figures. Run SSML_simulation to produce simulation data and results, then run SSML_figures to produce figures.
For reproducing the Huang et al. (2006) replication results
Files in *\HuangetalReplication with short descriptions:
SSML_huangrep: script that replicates the results from Huang et al. (2006)
Go to https://archive.ics.uci.edu/dataset/14/breast+cancer and save the file as "breast-cancer-wisconsin.data". Save SSML_huangrep and the breast cancer data to the same folder; this location will be referred to as the FILEPATH. Set the FILEPATH location inside SSML_huangrep, then run SSML_huangrep to produce results and figures.
For reproducing the application section results
Files in *\Application with short descriptions:
G0_main_202308.do: Stata wrapper code that will run all application replication files
G1_cqclean_202308.do: Cleans election outcomes data
G2_cqopen_202308.do: Cleans open elections data
G3_demographics_cainc30_202308.do: Cleans demographics data
G4_fips_202308.do: Cleans FIPS code data
G5_klarnerclean_202308.do: Cleans Klarner gubernatorial data
G6_merge_202308.do: Merges cleaned datasets together
G7_summary_202308.do: Generates summary statistics tables and figures
G8_firststage_202308.do: Runs L1-penalized probit for the first stage
G9_prediction_202308.m: Trains learners and makes predictions
G10_figures_202308.m: Generates figures of prediction patterns
G11_final_202308.do: Generates final figures and tables of results
r1_lasso_alwayskeepCF_202308.do: Examines the effect of requiring that the control function is not dropped from LASSO
latexTable.m: Code by Eli Duenisch to write LaTeX tables from Matlab (https://www.mathworks.com/matlabcentral/fileexchange/44274-latextable)
\CAINC30: County-level income and demographics data from the BEA
\CPI: CPI data from the BLS
\KlarnerGovernors: Carl Klarner's Governors Dataset, available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/20408
\CQ_county: County-level election outcomes, available from http://library.cqpress.com/elections/login.php?requested=%2Felections%2Fdownload-data.php
\CQ_open: Open elections data, available from http://library.cqpress.com/elections/advsearch/elections-with-open-seats-results.php?open_year1=1968&open_year2=2019&open_office=4
The CQ Press data cannot be transferred as part of the data use agreement with CQ Press, so those files are not included. There is no batch download; downloads for each year must be done by hand. For each year, download as many state outcomes as possible and name the files YYYYa.csv, YYYYb.csv, etc. (example: 1970a.csv, 1970b.csv, 1970c.csv, 1970d.csv). See line 18 of G1_cqclean_202308.do for file structure information.
Set the path on line 18 of G0_main_202308.do to the application folder, and set matlabpath in G0_main_202308.do to the appropriate location. Adjust paths in G9_prediction_202308.m and G10_figures_202308.m as necessary. Run G0_main_202308.do in Stata to run all programs; output is written to *\Application\Output. Contact Dylan Brewer (brewer@gatech.edu) or Alyssa Carlson (carlsonah@missouri.edu) for help with replication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Warning: This dataset includes content that may be considered offensive or upsetting. We present Indic-Bias, a comprehensive benchmark to evaluate the fairness of LLMs across 85 Indian identity groups, focusing on bias and stereotypes. We create three tasks - Plausibility, Judgment, and Generation - and evaluate 14 popular LLMs to identify allocative and representational harms. Please… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Indic-Bias.
clinicaltrials.gov_search: This is the complete original dataset.
identify completed trials: This is the R script which, when run on "clinicaltrials.gov_search.txt", will produce a .csv file that lists all the completed trials.
FDA_table_with_sens: This is the final dataset after cross-referencing the trials. An explanation of the variables is included in the supplementary file "2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in the dataset".
analysis_after_FDA_categorization_and_sens: This R script reproduces the analysis from the paper, including the tables and statistical tests. The comments should make it self-explanatory.
2011-11-02 prayle hurley smyth supplementary file 1 STROBE checklist: This is a STROBE checklist for the study.
2011-10-31 Prayle Hurley Smyth Supplementary file 2 examples of categorization: This is a supplementary file which illustrates some of the decisions that had to be made when categorizing trials.
2011-10-31 Prayle Hurley Smyth Supplementary file 3 variables in th...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subscript P represents the population values, while subscript D represents the values measured for those cases included in the database; selection bias produces the discrepancy. The extent of selection bias may be measured as the selection odds ratio OR_S = (S_00 × S_11) / (S_01 × S_10), where S_ij is the probability that a case with exposure i (hospitalization at day 8) and outcome j (mortality) appears in the database. In this example, selection bias spuriously enhances the negative association between hospitalization on day 8 and death on all scales: RR, OR, and RD.
Effect of selection bias on estimates of relative CFR on the risk ratio (RR) and odds ratio (OR) scale.
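As a toy numeric illustration of this formula (the selection probabilities below are made up, not taken from the study):

# Selection odds ratio OR_S = (S_00 * S_11) / (S_01 * S_10), where S_ij is the
# probability that a case with exposure i and outcome j enters the database.
S = {(0, 0): 0.50, (0, 1): 0.80,   # unexposed cases, by outcome (hypothetical)
     (1, 0): 0.70, (1, 1): 0.40}   # exposed (day-8 hospitalised) cases (hypothetical)

or_s = (S[(0, 0)] * S[(1, 1)]) / (S[(0, 1)] * S[(1, 0)])
print(f"selection odds ratio OR_S = {or_s:.2f}")   # OR_S != 1 signals selection bias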
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific investigations in medicine and beyond increasingly require observations to be described by more features than can be simultaneously visualized. Simply reducing the dimensionality by projections destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce data bias that prevents detection of natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, focusing in particular on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results reveal that already for very basic questions, minimizing clustering bias is essential, but that results can benefit further from biased post-processing.
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
This dataset consists of three types of 'short-text' content:
1. social media posts (tweets)
2. psychological survey items, and
3. synthetic adversarial modifications of the former two categories.
The tweet data can be further divided into 3 separate datasets based on their source:
1.1 the hostile sexism dataset,
1.2 the benevolent sexism dataset, and
1.3 the callme sexism dataset.
1.1 and 1.2 are pre-existing datasets obtained from Waseem, Z., & Hovy, D. (2016) and Jha, A., & Mamidi, R. (2017) that we re-annotated (see our paper and data statement for further information). The rationale for including these datasets specifically is that they feature a variety of sexist expressions in real conversational (social media) settings. In particular, they feature expressions that range from overtly antagonizing the minority gender through negative stereotypes (1.1) to leveraging positive stereotypes to subtly dismiss it as less capable and fragile (1.2).
The callme sexism dataset (1.3) was collected by us based on the presence of the phrase 'call me sexist but' in tweets. The rationale behind this choice of query was that several Twitter users voice potentially sexist opinions and signal them with this phrase, which arguably serves as a disclaimer for sexist opinions.
The survey items (2) pertain to attitudinal surveys that are designed to measure sexist attitudes and gender bias in participants. We provide a detailed account of our selection procedure in our paper.
Finally, the adversarial examples were generated by crowdworkers from Amazon Mechanical Turk, who made minimal changes to tweets and scale items in order to turn sexist examples into non-sexist ones. We hope that these examples will help control for typical confounds in non-sexist data (e.g., topic, civility), lead to datasets with fewer biases, and consequently allow us to train more robust machine learning models. For ethical reasons, we only asked workers to turn sexist examples into non-sexist ones, and not vice versa.
The dataset is annotated to capture cases where text is sexist because of its content (what the speaker believes) or its phrasing (the speaker's choice of words). We explain the rationale for this codebook in our paper cited below.
Meta-analyses have long been an indispensable research synthesis tool for characterizing bodies of literature and advancing theories. However, they have been facing the same challenges as primary literature in the context of the replication crisis: a meta-analysis is only as good as the data it contains, and which data end up in the final sample can be influenced at various stages of the process. Early on, the selection of topic and search strategies might be biased by the meta-analyst's subjective decisions. Further, publication bias towards significant outcomes in primary studies might skew the search outcome, where grey, unpublished literature might not show up. Additional challenges might arise during data extraction from articles in the final search sample, for example because some articles might not contain sufficient detail for computing effect sizes and correctly characterizing moderator variables, or due to specific decisions of the meta-analyst during data extraction from multi-experiment papers. Community-augmented meta-analyses (CAMAs; Tsuji, Bergmann, & Cristia, 2014) have received increasing interest as a tool for countering the above-mentioned problems. CAMAs are open-access, online meta-analyses. In the original proposal, they allow the use and addition of data points by the research community, making it possible to collectively shape the scope of a meta-analysis and encouraging the submission of unpublished or inaccessible data points. As such, CAMAs can counter biases introduced by data (in)availability and by the researcher. In addition, their dynamic nature serves to keep a meta-analysis, otherwise crystallized at the time of publication and quickly outdated, up to date. We have now been implementing CAMAs over the past four years in MetaLab (metalab.stanford.edu), a database gathering meta-analyses in Developmental Psychology and focused on infancy. Meta-analyses are updated through centralized, active curation. We here describe our successes and failures with gathering missing data, as well as quantify how the addition of these data points changes the outcomes of meta-analyses. First, we ask which strategies to counter publication bias are fruitful. To answer this question, we evaluate efforts to gather data not readily accessible by database searches, which applies both to unpublished literature and to data not reported in published articles. Based on this investigation, we conclude that classical tools like database and citation searches can already contribute an important amount of grey literature. Furthermore, directly contacting authors is a fruitful way to get access to missing information. We then address whether and how including or excluding grey literature from a selection of meta-analyses impacts results, both in terms of indices of publication bias and in terms of main meta-analytic outcomes. Here, we find no differences in funnel plot asymmetry, but (as could be expected) a decrease in meta-analytic effect sizes. Based on these experiences, we finish with lessons learned and recommendations that can be generalized for meta-analysts beyond the field of infant research in order to get the most out of the CAMA framework and to gather maximally unbiased datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
We measured browsing and height of young aspen (≥ 1 year-old) in 113 plots distributed randomly across the study area (Fig. 1). Each plot was a 1 × 20 m belt transect located randomly within an aspen stand that was itself randomly selected from an inventory of stands with respect to high and low wolf-use areas (Ripple et al. 2001). The inventory was a list of 992 grid cells (240 × 360 m) that contained at least one stand (Appendix S1). A "stand" was a group of tree-size aspen (>10 cm diameter at breast height) in which each tree was ≤ 30 m from every other tree. One hundred and thirteen grid cells were randomly selected from the inventory (~11% of 992 cells), one stand was randomly selected from each cell, and one plot was randomly established in each stand. Each plot likely represented a genetically-independent sample (Appendix S1).
We measured aspen at the end of the growing season (late July to September), focusing on plants ≤ 600 cm tall, which we termed “young aspen.” For each ...
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
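As a minimal sketch of this split, assuming scikit-learn and its bundled Iris data (neither is mentioned above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% training data, 20% test data
print(X_train.shape, X_test.shape)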
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and the F1 score (the harmonic mean of precision and recall).
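A short sketch of these metrics, assuming scikit-learn and a made-up pair of label/prediction vectors:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels (e.g., spam = 1)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))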
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
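A rough illustration of this idea, assuming scikit-learn and synthetic noisy data; the degrees 1, 4, and 15 are arbitrary choices meant to show underfitting, a reasonable fit, and overfitting:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]            # one input feature
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # A large gap between train and test scores indicates overfitting.
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))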
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
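For instance, a minimal sketch of fitting one of these algorithms (logistic regression) with scikit-learn, using its bundled breast-cancer data as a stand-in task:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # larger max_iter helps convergence here
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))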
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
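A brief sketch combining two of these techniques (k-means clustering and PCA) on synthetic data, assuming scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # cluster assignments
X_2d = PCA(n_components=2).fit_transform(X)   # reduce 5 features to 2 for plotting
print(labels[:10], X_2d.shape)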
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Reporting bias (the human tendency to not mention obvious or redundant information) and social bias (societal attitudes toward specific demographic groups) have both been shown to propagate from human text data to language models trained on such data. However, the two phenomena have not previously been studied in combination. The MARB dataset was developed to begin to fill this gap by studying the interaction between social biases and reporting bias in language models. Unlike many existing benchmark datasets, MARB does not rely on artificially constructed templates or crowdworkers to create contrasting examples. Instead, the templates used in MARB are based on naturally occurring written language from the 2021 version of the enTenTen corpus (Jakubíček et al., 2013).
Political science researchers have flexibility in how to analyze data, how to report data, and whether to report on data. Review of examples of reporting flexibility from the race and sex discrimination literature illustrates how research design choices can influence estimates and inferences. This reporting flexibility, coupled with the political imbalance among political scientists, creates the potential for political bias in reported political science estimates, but this potential for political bias can be reduced or eliminated through preregistration and preacceptance, in which researchers commit to a research design before completing data collection. Removing the potential for reporting flexibility can raise the credibility of political science research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four multimedia recommender systems datasets to study popularity bias and fairness:
Last.fm (lfm.zip), based on the LFM-1b dataset of JKU Linz (http://www.cp.jku.at/datasets/LFM-1b/)
MovieLens (ml.zip), based on MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/)
BookCrossing (book.zip), based on the BookCrossing dataset of Uni Freiburg (http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
MyAnimeList (anime.zip), based on the MyAnimeList dataset of Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database)
Each dataset consists of user interactions (user_events.txt) and three user groups that differ in their inclination to popular/mainstream items: LowPop (low_main_users.txt), MedPop (med_main_users.txt), and HighPop (high_main_users.txt).
The format of the three user files is "user,mainstreaminess".
The format of the user-events file is "user,item,preference".
Example Python code for analyzing the datasets, as well as more information on the user groups, can be found on GitHub (https://github.com/domkowald/FairRecSys) and on arXiv (https://arxiv.org/abs/2203.00376).
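For example, a minimal loading sketch based on the file formats stated above; whether the files include a header row is an assumption here, so column names are set explicitly:

import pandas as pd

events = pd.read_csv("user_events.txt", names=["user", "item", "preference"])
low_pop = pd.read_csv("low_main_users.txt", names=["user", "mainstreaminess"])

# Restrict interactions to the LowPop user group, e.g. to study popularity bias.
low_pop_events = events[events["user"].isin(low_pop["user"])]
print(len(events), len(low_pop_events))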