100+ datasets found

H
Age Bias Training and Testing Data
dataverse.harvard.edu
Updated Jul 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Diaz (2020). Age Bias Training and Testing Data [Dataset]. http://doi.org/10.7910/DVN/F6EMTS
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/F6EMTS
Dataset updated
Jul 10, 2020
Dataset provided by
Harvard Dataverse
Authors
Mark Diaz
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Training and testing data annotated by a panel survey sample of older adults (aged 50+) and used to create a basic, maximum entropy bag-of-words sentiment classifier. Testing data is scraped from blog posts discussing aging authored by older adults (see published work: https://doi.org/10.1145/3173574.3173986). Training data is a re-annotated subset of the Sentiment140 training data set containing the strings "old" and "young" (see Sentiment140: http://help.sentiment140.com/for-students). For model building, the subset was re-introduce into the full data set, replacing the original annotations.
f
Data_Sheet_1_Gender Bias in Artificial Intelligence: Severity Prediction at...
frontiersin.figshare.com
docx
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heewon Chung; Chul Park; Wu Seong Kang; Jinseok Lee (2023). Data_Sheet_1_Gender Bias in Artificial Intelligence: Severity Prediction at an Early Stage of COVID-19.docx [Dataset]. http://doi.org/10.3389/fphys.2021.778720.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fphys.2021.778720.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Heewon Chung; Chul Park; Wu Seong Kang; Jinseok Lee
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Artificial intelligence (AI) technologies have been applied in various medical domains to predict patient outcomes with high accuracy. As AI becomes more widely adopted, the problem of model bias is increasingly apparent. In this study, we investigate the model bias that can occur when training a model using datasets for only one particular gender and aim to present new insights into the bias issue. For the investigation, we considered an AI model that predicts severity at an early stage based on the medical records of coronavirus disease (COVID-19) patients. For 5,601 confirmed COVID-19 patients, we used 37 medical records, namely, basic patient information, physical index, initial examination findings, clinical findings, comorbidity diseases, and general blood test results at an early stage. To investigate the gender-based AI model bias, we trained and evaluated two separate models—one that was trained using only the male group, and the other using only the female group. When the model trained by the male-group data was applied to the female testing data, the overall accuracy decreased—sensitivity from 0.93 to 0.86, specificity from 0.92 to 0.86, accuracy from 0.92 to 0.86, balanced accuracy from 0.93 to 0.86, and area under the curve (AUC) from 0.97 to 0.94. Similarly, when the model trained by the female-group data was applied to the male testing data, once again, the overall accuracy decreased—sensitivity from 0.97 to 0.90, specificity from 0.96 to 0.91, accuracy from 0.96 to 0.91, balanced accuracy from 0.96 to 0.90, and AUC from 0.97 to 0.95. Furthermore, when we evaluated each gender-dependent model with the test data from the same gender used for training, the resultant accuracy was also lower than that from the unbiased model.
f
Data_Sheet_1_Data and model bias in artificial intelligence for healthcare...
frontiersin.figshare.com
zip
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith (2023). Data_Sheet_1_Data and model bias in artificial intelligence for healthcare applications in New Zealand.zip [Dataset]. http://doi.org/10.3389/fcomp.2022.1070493.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fcomp.2022.1070493.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Vithya Yogarajan; Gillian Dobbie; Sharon Leitch; Te Taka Keegan; Joshua Bensemann; Michael Witbrock; Varsha Asrani; David Reith
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New Zealand
Description
IntroductionDevelopments in Artificial Intelligence (AI) are adopted widely in healthcare. However, the introduction and use of AI may come with biases and disparities, resulting in concerns about healthcare access and outcomes for underrepresented indigenous populations. In New Zealand, Māori experience significant inequities in health compared to the non-Indigenous population. This research explores equity concepts and fairness measures concerning AI for healthcare in New Zealand.MethodsThis research considers data and model bias in NZ-based electronic health records (EHRs). Two very distinct NZ datasets are used in this research, one obtained from one hospital and another from multiple GP practices, where clinicians obtain both datasets. To ensure research equality and fair inclusion of Māori, we combine expertise in Artificial Intelligence (AI), New Zealand clinical context, and te ao Māori. The mitigation of inequity needs to be addressed in data collection, model development, and model deployment. In this paper, we analyze data and algorithmic bias concerning data collection and model development, training and testing using health data collected by experts. We use fairness measures such as disparate impact scores, equal opportunities and equalized odds to analyze tabular data. Furthermore, token frequencies, statistical significance testing and fairness measures for word embeddings, such as WEAT and WEFE frameworks, are used to analyze bias in free-form medical text. The AI model predictions are also explained using SHAP and LIME.ResultsThis research analyzed fairness metrics for NZ EHRs while considering data and algorithmic bias. We show evidence of bias due to the changes made in algorithmic design. Furthermore, we observe unintentional bias due to the underlying pre-trained models used to represent text data. This research addresses some vital issues while opening up the need and opportunity for future research.DiscussionsThis research takes early steps toward developing a model of socially responsible and fair AI for New Zealand's population. We provided an overview of reproducible concepts that can be adopted toward any NZ population data. Furthermore, we discuss the gaps and future research avenues that will enable more focused development of fairness measures suitable for the New Zealand population's needs and social structure. One of the primary focuses of this research was ensuring fair inclusions. As such, we combine expertise in AI, clinical knowledge, and the representation of indigenous populations. This inclusion of experts will be vital moving forward, proving a stepping stone toward the integration of AI for better outcomes in healthcare.
Opinion on mitigating AI data bias in healthcare worldwide 2024
statista.com
Updated Jul 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Opinion on mitigating AI data bias in healthcare worldwide 2024 [Dataset]. https://www.statista.com/statistics/1559311/ways-to-mitigate-ai-bias-in-healthcare-worldwide/
Explore at:
Dataset updated
Jul 18, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Dec 2023 - Mar 2024
Area covered
Worldwide
Description
According to a survey of healthcare leaders carried out globally in 2024, almost half of respondents believed that by making AI more transparent and interpretable, this would mitigate the risk of data bias in AI applications for healthcare. Furthermore, ** percent of healthcare leaders thought there should be continuous training and education in AI.
NewsMediaBias-Plus Dataset
zenodo.org
huggingface.co
bin, zip
Updated Nov 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza; Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13961155
Dataset updated
Nov 29, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shaina Raza; Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NewsMediaBias-Plus Dataset

Overview

The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

Dataset Description

NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

Contents

unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.

outlet: The publisher of the article.

headline: The headline of the article.

article_text: The full content of the news article.

image_description: Description of the paired image.

image: The file path of the associated image.

date_published: The date the article was published.

source_url: The original URL of the article.

canonical_link: The canonical URL of the article.

new_categories: Categories assigned to the article.

news_categories_confidence_scores: Confidence scores for each category.

Annotation Labels

text_label: Indicates the likelihood of the article being disinformation:

Likely: Likely to be disinformation.

Unlikely: Unlikely to be disinformation.

multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

Likely: Likely to be disinformation.

Unlikely: Unlikely to be disinformation.

Getting Started

Prerequisites

Python 3.6+

Pandas

Hugging Face Datasets

Hugging Face Hub

Installation

Load the dataset into Python:

python

Copy code

from datasets import load_dataset ds = load_dataset("vector-institute/newsmediabias-plus") print(ds) # View structure and splits print(ds['train'][0]) # Access the first record of the train split print(ds['train'][:5]) # Access the first five records

Load a Few Records

python

Copy code

from datasets import load_dataset # Load the dataset in streaming mode streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True) # Get an iterable dataset dataset_iterable = streamed_dataset['train'].take(5) # Print the records for record in dataset_iterable: print(record)

Contributions

Contributions are welcome! You can:

Add Data: Contribute more data points.

Refine Annotations: Improve annotation accuracy.

Share Usage Examples: Help others use the dataset effectively.

To contribute, fork the repository and create a pull request with your changes.

License

This dataset is released under a non-commercial license. See the LICENSE file for more details.

Citation

Please cite the dataset using this BibTeX entry:

bibtex

Copy code

@misc{vector_institute_2024_newsmediabias_plus, title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias}, author={Vector Institute Research Team}, year={2024}, url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus} }

Contact

For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

Disclaimer and User Guidance

Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
Artificial Intelligence (AI) Training Dataset Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Artificial Intelligence (AI) Training Dataset Market Outlook

According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.

One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.

Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.

The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.

As the AI training dataset market continues to evolve, the role of Perception Dataset Management Platforms is becoming increasingly crucial. These platforms are designed to handle the complexities of managing large-scale datasets, ensuring that data is not only collected and stored efficiently but also annotated and curated to meet the specific needs of AI models. By providing tools for data organization, quality control, and collaboration, these platforms enable organizations to streamline their data management processes and enhance the overall quality of their AI training datasets. This is particularly important as the demand for diverse and high-quality datasets grows, driven by the expanding scope of AI applications across various industries.

From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological
h
training-bias
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alice Mao, training-bias [Dataset]. https://huggingface.co/datasets/d4un/training-bias
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Alice Mao
Description
Dataset Card for Dataset Name

Dataset Summary

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is purely in English. Some of the responses were generated by ChatGPT.

Discussion of Biases

This dataset intentionally carries gender and job-related biases which reflect ones that exist in society, for… See the full description on the dataset page: https://huggingface.co/datasets/d4un/training-bias.
Z
NewsUnravel Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anon (2024). NewsUnravel Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8344890
Explore at:
Dataset updated
Jul 11, 2024
Dataset authored and provided by
anon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About the NUDA DatasetMedia bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

General

This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

The dataset consists of text, namely biased sentences with binary bias labels (processed, biased or not biased) as well as metadata about the article. It includes all feedback that was given. The single ratings (unprocessed) used to create the labels with correlating User IDs are included.

For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset does not identify sub-populations or can be considered sensitive to them, nor is it possible to identify individuals.

Description of the Data Files

This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labelsStatistics.png: contains all Umami statistics for NewsUnravel's usage dataFeedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasonsContent.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if givenArticle.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %Participant.csv: holds the participant IDs and data processing consent

Collection Process

Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.

Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.

The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
h
md_gender_bias
huggingface.co
opendatalab.com
Updated Mar 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AI at Meta (2021). md_gender_bias [Dataset]. https://huggingface.co/datasets/facebook/md_gender_bias
Explore at:
Dataset updated
Mar 26, 2021
Dataset authored and provided by
AI at Meta
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shed light on offensive language in terms of genderedness.
NewsUnravel Dataset
zenodo.org
csv
Updated Sep 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anonymous; anonymous (2023). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344882
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8344882
Dataset updated
Sep 14, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
anonymous; anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About the Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

Description of the data files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain following data:

NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentences and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article meta data, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
f
Implications for future LLM research.
plos.figshare.com
xls
Updated Jan 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Implications for future LLM research. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pdig.0000417.t002
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS Digital Health
Authors
Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.
h
debiased_dataset
huggingface.co
Updated Sep 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
News Media Biases (2023). debiased_dataset [Dataset]. http://doi.org/10.57967/hf/1050
Explore at:
Unique identifier
https://doi.org/10.57967/hf/1050
Dataset updated
Sep 18, 2023
Dataset authored and provided by
News Media Biases
License
https://choosealicense.com/licenses/creativeml-openrail-m/https://choosealicense.com/licenses/creativeml-openrail-m/
Description
Dataset Description

About the Dataset: This dataset contains text data that has been processed to identify biased statements based on dimensions and aspects. Each entry has been processed using the GPT-4 language model and manually verified by 5 human annotators for quality assurance. Purpose: The dataset aims to help train and evaluate machine learning models in detecting, classifying, and correcting biases in text content, making it essential for NLP research related to fairness… See the full description on the dataset page: https://huggingface.co/datasets/newsmediabias/debiased_dataset.
f
Navigating News Narratives: A Media Bias Analysis Dataset
figshare.com
txt
Updated Dec 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122.v4
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media.Data description: This is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).The data structure is tabulated as follows:Text: The main content.Dimension: Descriptive category of the text.Biased_Words: A compilation of words regarded as biased.Aspect: Specific sub-topic within the main content.Label: Indicates the presence (True) or absence (False) of bias. The label is ternary - highly biased, slightly biased and neutralToxicity: Indicates the presence (True) or absence (False) of bias.Identity_mention: Mention of any identity based on words match.Annotation SchemeThe labels and annotations in the dataset are generated through a system of Active Learning, cycling through:Manual LabelingSemi-Supervised LearningHuman VerificationThe scheme comprises:Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.List of datasets used : We curated different news categories like Climate crisis news summaries , occupational, spiritual/faith/ general using RSS to capture different dimensions of the news media biases. The annotation is performed using active learning to label the sentence (either neural/ slightly biased/ highly biased) and to pick biased words from the news.We also utilize publicly available data from the following links. Our Attribution to others.MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detectionToxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.Age Bias : Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtVSocial biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/Goal of this dataset :We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage.If you use this dataset, please cite us.Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
Z
News Ninja Dataset
data.niaid.nih.gov
Updated Feb 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anon (2024). News Ninja Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8346881
Explore at:
Dataset updated
Feb 20, 2024
Dataset authored and provided by
anon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
AboutRecent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.

GeneralThis dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

The dataset includes sentences with binary bias labels (processed, biased or not biased) as well as the annotations of single players used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset does not identify sub-populations or can be considered sensitive to them, nor is it possible to identify individuals.

Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.

Description of the Data FilesThis repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:

ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.

AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), if the game label and expert label match (Game VS Expert), if differing labels are a false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), if Expert and BABE labels match (Expert VS BABE), and if the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE).

demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.

Collection ProcessData was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a costume-built responsive website. The collection period was from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.

The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.

The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.
CrowS-Pairs (Social biases in MLMs)
kaggle.com
Updated Nov 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2022). CrowS-Pairs (Social biases in MLMs) [Dataset]. https://www.kaggle.com/datasets/thedevastator/a-dataset-for-measuring-social-biases-in-mlms
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 27, 2022
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
CrowS-Pairs (Social biases in MLMs)

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked LM

By [source]

About this dataset

The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of biases: race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. Each sentence pair is a minimal edit of the first sentence: The only words that change between them are those that identify the group. The first sentence can demonstrate or violate a stereotype. The other sentence is a minimal edit of the first sentence: The only words that change between them are those that identify the group. Each example has the following information:

Columns:,**sent_more**,sent_less,**stereo_antistereo**,bias_type,**annotations**,,anon_writer,,anon_annotators,,prompt,,source

The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of biases: race/color, gender/gender identity, sexual orientation, religion, age

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of biases: race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. Each sentence pair is a minimal edit of the first sentence: The only words that change between them are those that identify the group. The first sentence can demonstrate or violate a stereotype. The other sentence is a minimal edit of the first sentence: The only words that change between them are those that identify the group. Each example has the following information:

Columns:,**sent_less**sent_more,,stereo_antistereo,,bias_type,,annotations,,anon_writer,,anon_annotators,,,,prompt,,source

This dataset can be used to measure social biases in MLMs by training models on it and evaluating their performance

Research Ideas

Measuring the ability of MLMs to identify and avoid social biases;

Developing new methods for reducing social biases in MLMs; and

Investigating the impact of social biases on downstream tasks such as reading comprehension or question answering

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: crows_pairs_anonymized.csv | Column name | Description | |:----------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------| | sent_more | The first sentence in the pair, which can demonstrate or violate a stereotype. (String) | | sent_less | The second sentence in the pair, which is a minimal edit of the first sentence. The only words that change between them are those that identify the group. (String) | | stereo_antistereo | Whether the first sentence demonstrates or violates a stereotype. (String) | | bias_type | The type of bias represented in the sentence pair. (String) | | annotations | The annotations made by the crowdworkers on the sentence pair. (String) | | anon_writer | The anonymous writer of the sentence pair. (String) | | anon_annotators | The anonymous annotators of the sentence pair. (String) |

File: prompts.csv | Column name | Descripti...
Navigating News Narratives: A Media Bias Analysis Dataset
zenodo.org
csv
Updated Nov 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza; Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122
Dataset updated
Nov 30, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shaina Raza; Shaina Raza
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Time period covered
2023
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).
Data Format: The format of data is:
ID: Numeric unique identifier.
Text: Main content.
Dimension: Categorical descriptor of the text.
Biased_Words: List of words considered biased.
Aspect: Specific topic within the text.
Label: Neutral, Slightly Biased , Highly Biased

Annotation Scheme: The annotation scheme is based on Active learning, which is Manual Labeling --> Semi-Supervised Learning --> Human Verifications (iterative process)
Bias Label: Indicate the presence/absence of bias (e.g., no bias, mild, strong).
Words/Phrases Level Biases: Identify specific biased words/phrases.
Subjective Bias (Aspect): Capture biases related to content aspects.

List of datasets used : We curated different news categories like Climate crisis news summaries , occupational, spiritual/faith/ general using RSS to capture different dimensions of the news media biases. The annotation is performed using active learning to label the sentence (either neural/ slightly biased/ highly biased) and to pick biased words from the news.
We also utilize publicly available data from the following links. Our Attribution to others.
MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.
Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Age Bias : Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

Goal of this dataset :We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage.
If you use this dataset, please cite us.
Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
Police Officer Learning, Mentoring, and Racial Bias in Traffic Stops,...
catalog.data.gov
datasets.ai
+1more
Updated Mar 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Justice (2025). Police Officer Learning, Mentoring, and Racial Bias in Traffic Stops, Syracuse, New York, 2006-2009 [Dataset]. https://catalog.data.gov/dataset/police-officer-learning-mentoring-and-racial-bias-in-traffic-stops-syracuse-new-york-2006--7490b
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justicehttp://nij.ojp.gov/
Area covered
Syracuse, New York
Description
This project is concerned with understanding the determinants of racial bias in police traffic stops and in the city of Syracuse, New York. Using an officer-level panel of data on vehicle stops and vehicle searches by 512 officers from 2006 to 2009, the primary goal of this research is to better understand the effects of officer experience on their proclivities for racial bias in traffic stops, while controlling for officer, citizen, and neighborhood demographics. Included in these data are variables for census tracts as well as their racial and ethnic makeup, times and dates when traffic stops occurred, sunrise and sunset data for the City of Syracuse, and the racial and ethnic makeup of citizens involved in stops.
D
Data Collection And Labeling Report
datainsightsmarket.com
doc, pdf, ppt
Updated Aug 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Data Collection And Labeling Report [Dataset]. https://www.datainsightsmarket.com/reports/data-collection-and-labeling-1415734
Explore at:
pdf, doc, pptAvailable download formats
Dataset updated
Aug 12, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Data Collection and Labeling market is experiencing robust growth, driven by the increasing demand for high-quality training data to fuel the advancements in artificial intelligence (AI) and machine learning (ML) technologies. The market's expansion is fueled by the burgeoning adoption of AI across diverse sectors, including healthcare, automotive, finance, and retail. Companies are increasingly recognizing the critical role of accurate and well-labeled data in developing effective AI models. This has led to a surge in outsourcing data collection and labeling tasks to specialized companies, contributing to the market's expansion. The market is segmented by data type (image, text, audio, video), labeling technique (supervised, unsupervised, semi-supervised), and industry vertical. We project a steady CAGR of 20% for the period 2025-2033, reflecting continued strong demand across various applications. Key trends include the increasing use of automation and AI-powered tools to streamline the data labeling process, resulting in higher efficiency and lower costs. The growing demand for synthetic data generation is also emerging as a significant trend, alleviating concerns about data privacy and scarcity. However, challenges remain, including data bias, ensuring data quality, and the high cost associated with manual labeling for complex datasets. These restraints are being addressed through technological innovations and improvements in data management practices. The competitive landscape is characterized by a mix of established players and emerging startups. Companies like Scale AI, Appen, and others are leading the market, offering comprehensive solutions that span data collection, annotation, and model validation. The presence of numerous companies suggests a fragmented yet dynamic market, with ongoing competition driving innovation and service enhancements. The geographical distribution of the market is expected to be broad, with North America and Europe currently holding significant market share, followed by Asia-Pacific showing robust growth potential. Future growth will depend on technological advancements, increasing investment in AI, and the emergence of new applications that rely on high-quality data.
t
Eliminating Search Intent Bias in Learning to Rank - Dataset - LDM
service.tib.eu
Updated Jan 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Eliminating Search Intent Bias in Learning to Rank - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/eliminating-search-intent-bias-in-learning-to-rank
Explore at:
Dataset updated
Jan 3, 2025
Description
Click-through data has proven to be a valuable resource for improving search-ranking quality. Search engines can easily collect click data, but biases introduced in the data can make it difficult to use the data effectively.
Data for "Training data composition affects performance of protein structure...
zenodo.org
application/gzip
Updated Oct 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Derry; Alexander Derry; Kristy A. Carpenter; Kristy A. Carpenter; Russ B. Altman; Russ B. Altman (2021). Data for "Training data composition affects performance of protein structure analysis algorithms" by A. Derry, K. A. Carpenter, & R. B. Altman [Dataset]. http://doi.org/10.5281/zenodo.5542201
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5542201
Dataset updated
Oct 1, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander Derry; Alexander Derry; Kristy A. Carpenter; Kristy A. Carpenter; Russ B. Altman; Russ B. Altman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description

This repository contains all data used in "Training data composition affects performance of protein structure analysis algorithms", published in the Pacific Symposium on Biocomputing 2022 by A. Derry, K. A. Carpenter, & R. B. Altman.

The data consists of the following files:

ema_zenodo_data.tar.gz: train, validation, and test splits for Estimation of Model Accuracy task, in LMDB format

design_zenodo_data.tar.gz: train, validation, and test splits for Protein Sequence Design task, in JSON format

enz_cat_res_zenodo_data.tar.gz: train, validation, and test splits for Catalytic Residue and Enzyme Prediction task, in TF record format

Details on dataset construction can be found in our paper and dataloaders can be found in our Github repo.

Reference

A. Derry*, K. A. Carpenter*, & R. B. Altman, "Training data composition affects performance of protein structure analysis algorithms", 2021.

Dataset References

Datasets used were derived from the following works:

Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., & Moult, J. (2019). Critical assessment of methods of protein structure prediction (CASP)—Round XIII. In Proteins: Structure, Function and Bioinformatics (Vol. 87, Issue 12, pp. 1011–1020). https://doi.org/10.1002/prot.25823

Ingraham, J., Garg, V. K., Barzilay, R., & Jaakkola, T. (2019). Generative Models for Graph-Based Protein Design. https://openreview.net/pdf?id=SJgxrLLKOE

Furnham, N., Holliday, G. L., de Beer, T. A. P., Jacobsen, J. O. B., Pearson, W. R., & Thornton, J. M. (2014). The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Research, 42 (Database issue), D485–D489.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mark Diaz (2020). Age Bias Training and Testing Data [Dataset]. http://doi.org/10.7910/DVN/F6EMTS

Age Bias Training and Testing Data

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.7910/DVN/F6EMTS

Dataset updated

Jul 10, 2020

Dataset provided by

Harvard Dataverse

Authors

Mark Diaz

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Training and testing data annotated by a panel survey sample of older adults (aged 50+) and used to create a basic, maximum entropy bag-of-words sentiment classifier. Testing data is scraped from blog posts discussing aging authored by older adults (see published work: https://doi.org/10.1145/3173574.3173986). Training data is a re-annotated subset of the Sentiment140 training data set containing the strings "old" and "young" (see Sentiment140: http://help.sentiment140.com/for-students). For model building, the subset was re-introduce into the full data set, replacing the original annotations.

Clear search

Close search

Google apps

Main menu

Age Bias Training and Testing Data

Data_Sheet_1_Gender Bias in Artificial Intelligence: Severity Prediction at...

Data_Sheet_1_Data and model bias in artificial intelligence for healthcare...

Opinion on mitigating AI data bias in healthcare worldwide 2024

NewsMediaBias-Plus Dataset

NewsMediaBias-Plus Dataset

Overview

Dataset Description

Contents

Annotation Labels

Getting Started

Prerequisites

Installation

Load a Few Records

Contributions

License

Citation

Contact

Disclaimer and User Guidance

Artificial Intelligence (AI) Training Dataset Market Research Report 2033

Artificial Intelligence (AI) Training Dataset Market Outlook

training-bias

NewsUnravel Dataset

md_gender_bias

NewsUnravel Dataset

Implications for future LLM research.

debiased_dataset

Navigating News Narratives: A Media Bias Analysis Dataset

News Ninja Dataset

CrowS-Pairs (Social biases in MLMs)

CrowS-Pairs (Social biases in MLMs)

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked LM

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Navigating News Narratives: A Media Bias Analysis Dataset

Police Officer Learning, Mentoring, and Racial Bias in Traffic Stops,...

Data Collection And Labeling Report

Eliminating Search Intent Bias in Learning to Rank - Dataset - LDM

Data for "Training data composition affects performance of protein structure...

Age Bias Training and Testing Data