Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for TruthfulQA
Dataset Summary
TruthfulQA: Measuring How Models Mimic Human Falsehoods We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers… See the full description on the dataset page: https://huggingface.co/datasets/domenicrosati/TruthfulQA.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Description
TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. Here we provide the Romanian translation of the… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-Ro/ro_truthfulqa.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for truthful_qa_context
Dataset Summary
TruthfulQA Context is an extension of the TruthfulQA benchmark, specifically designed to enhance its utility for models that rely on Retrieval-Augmented Generation (RAG). This version includes the original questions and answers from TruthfulQA, along with the added context text directly associated with each question. This additional context aims to provide immediate reference material for models, making it particularly… See the full description on the dataset page: https://huggingface.co/datasets/portkey/truthful_qa_context.
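As a hedged illustration of the intended RAG-style use, the snippet below loads the dataset and assembles a prompt from the added context. The column names ("question", "context") and the available split are assumptions about the portkey/truthful_qa_context schema, not details confirmed by its card.

```python
# A minimal sketch, assuming the repo loads directly from the Hub and exposes
# "question" and "context" columns (assumed names, not confirmed by the card).
from datasets import load_dataset

ds = load_dataset("portkey/truthful_qa_context")
split = ds[next(iter(ds))]  # take whichever split the repo ships
row = split[0]

# Assemble a RAG-style prompt that grounds the answer in the provided context.
prompt = (
    "Answer the question using only the reference material.\n\n"
    f"Reference: {row['context']}\n\n"
    f"Question: {row['question']}\nAnswer:"
)
print(prompt)
```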
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Uhura-TruthfulQA
Dataset Summary
TruthfulQA is a widely recognized safety benchmark designed to measure the truthfulness of language model outputs across 38 categories, including health, law, finance, and politics. The English version of the benchmark originates from TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2022) and consists of 817 questions in both multiple-choice and generation formats, targeting common misconceptions and… See the full description on the dataset page: https://huggingface.co/datasets/masakhane/uhura-truthfulqa.
TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.
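For orientation, the original benchmark is straightforward to load with the `datasets` library; the sketch below uses the stock `truthful_qa` Hub repository, which ships "generation" and "multiple_choice" configurations with a single "validation" split.

```python
# Load the canonical TruthfulQA release from the Hugging Face Hub.
from datasets import load_dataset

gen = load_dataset("truthful_qa", "generation", split="validation")

# Each row pairs a question with best/correct/incorrect reference answers.
print(gen[0]["question"])
print(gen[0]["best_answer"])
```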
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for TruthfulQA
Dataset Details
Dataset Description
TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 790 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from… See the full description on the dataset page: https://huggingface.co/datasets/rahmanidashti/tiny-truthful-qa.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for TruthfulQA-multi
TruthfulQA-multi is a professionally translated extension of the original TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. The dataset enables evaluating the ability of Large Language Models (LLMs) to maintain truthfulness across multiple languages.
Dataset Details
Dataset Description
TruthfulQA-multi extends the original English TruthfulQA dataset to four additional languages… See the full description on the dataset page: https://huggingface.co/datasets/HiTZ/truthfulqa-multi.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TruthfulQA‑CFB · Measuring How Models Mimic Human Falsehoods (Conversation Fact Benchmark Format)
TruthfulQA‑CFB is an 817-example benchmark derived from the original TruthfulQA dataset, transformed and adapted for the Conversation Fact Benchmark framework. Each item is a question designed to test whether language models can distinguish truth from common human misconceptions and false beliefs. The dataset focuses on truthfulness evaluation: questions target areas where humans… See the full description on the dataset page: https://huggingface.co/datasets/onionmonster/truthful_qa.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. Supported Tasks and Leaderboards [Needs More Information]… See the full description on the dataset page: https://huggingface.co/datasets/YeBhoneLin10/Simbolo_data.
Dataset Card for truthfulqa
This is a preprocessed version of the truthfulqa dataset for running benchmarks in LM-Polygraph.
Dataset Details
Dataset Description
Curated by: https://huggingface.co/LM-Polygraph
License: https://github.com/IINemo/lm-polygraph/blob/main/LICENSE.md
Dataset Sources
Repository: https://github.com/IINemo/lm-polygraph
Uses
Direct Use
This dataset should be used for performing benchmarks on… See the full description on the dataset page: https://huggingface.co/datasets/LM-Polygraph/truthfulqa.
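As a minimal sketch of direct use, the preprocessed data should load straight from the Hub; split and column names are best inspected at runtime rather than assumed here.

```python
# A minimal sketch: load the preprocessed benchmark data and inspect its
# structure before wiring it into an LM-Polygraph run.
from datasets import load_dataset

ds = load_dataset("LM-Polygraph/truthfulqa")
print(ds)  # shows the available splits and their columns
```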
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Qwen2.5 TruthfulQA Inference Code
model_truthfulqa.py contains the inference code for running the TruthfulQA benchmark with the Qwen2.5 model. The benchmark focuses on evaluating generated answers for truthfulness and informativeness, or on measuring the model's accuracy on the multiple-choice task.
TruthfulQA Benchmark
The TruthfulQA benchmark comprises two tasks that share the same set of questions and reference answers:
1. Generation Task
Task description: given a question, generate a 1-2 sentence answer. Evaluation objectives: Primary: the overall truthfulness of answers (% true), i.e., the proportion of generated answers that are true. Secondary: the informativeness of answers (% info), which prevents a model from gaming the metric with uninformative answers such as "I have no comment."
Evaluation metrics: fine-tuned GPT-3 models (GPT-judge and GPT-info) predict the truthfulness and informativeness of each answer. Traditional similarity metrics (BLEURT, ROUGE, BLEU) compute the similarity of a generated answer to the true/false reference answers: score =… See the full description on the dataset page: https://huggingface.co/datasets/studymakesmehappyyyyy/TruthfulQA.
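As a hedged sketch of the generation task described above (not the repository's model_truthfulqa.py itself), the snippet below prompts a Qwen2.5 checkpoint with one TruthfulQA question and decodes a short answer; the model name, greedy decoding, and token budget are illustrative assumptions.

```python
# A sketch of the generation task, assuming the Qwen/Qwen2.5-7B-Instruct
# checkpoint; greedy decoding and the 64-token budget are illustrative choices.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Take one benchmark question and format it as a chat turn.
q = load_dataset("truthful_qa", "generation", split="validation")[0]["question"]
inputs = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a short (1-2 sentence) answer and strip the prompt tokens.
out = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```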
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
X-TruthfulQA
🤗 Paper | 📖 arXiv
Dataset Description
X-TruthfulQA is an evaluation benchmark for multilingual large language models (LLMs), including questions and answers in 5 languages (English, Chinese, Korean, Italian and Spanish). It is intended to evaluate the truthfulness of LLMs. The dataset is translated by GPT-4 from the original English-version TruthfulQA. In our paper, we evaluate LLMs in a zero-shot generative setting: prompt the instruction-tuned LLM with… See the full description on the dataset page: https://huggingface.co/datasets/zhihz0535/X-TruthfulQA_en_zh_ko_it_es.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Overview
The AraDiCE dataset is designed to evaluate dialectal and cultural capabilities in large language models (LLMs). The dataset consists of post-edited versions of various benchmark datasets, curated for validation in cultural and dialectal contexts relevant to Arabic. In this repository, we present the TruthfulQA split of the data.
Evaluation
We have used the lm-harness eval framework to… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/AraDiCE-TruthfulQA.
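As a hedged sketch of such a run, lm-evaluation-harness exposes a Python entry point. The task shown below is the stock English "truthfulqa_mc2"; the AraDiCE-specific task names are defined in the QCRI repository and are not reproduced here, and the checkpoint is illustrative.

```python
# A minimal sketch of an lm-evaluation-harness run; "truthfulqa_mc2" is the
# stock English task (standing in for the AraDiCE-specific task names), and
# the checkpoint is an illustrative assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["truthfulqa_mc2"],
)
print(results["results"])
```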
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is part of a series aimed at advancing Turkish LLM development by establishing rigorous Turkish benchmarks to evaluate the performance of LLMs produced in the Turkish language.
Dataset Card for truthful_qa-tr
malhajar/truthful_qa-tr is a translated version of truthful_qa, intended specifically for use in the OpenLLMTurkishLeaderboard. Developed by: Mohamad Alhajar
Dataset Summary
TruthfulQA is a benchmark to measure whether a language model is… See the full description on the dataset page: https://huggingface.co/datasets/malhajar/truthfull_qa-tr.
The all-processed dataset is a concatenation of the medical-meadow-* and chatdoctor_healthcaremagic datasets. The term "Chat Doctor" is replaced with "chatbot" in the chatdoctor_healthcaremagic dataset. Following the literature, the medical_meadow_cord19 dataset is subsampled to 50,000 samples. truthful-qa-* is a benchmark dataset for evaluating the truthfulness of models in text generation, used in the Llama 2 paper. Within this dataset, there are 55 and 16 questions related to Health and… See the full description on the dataset page: https://huggingface.co/datasets/lavita/medical-qa-datasets.
Dataset Card for truthful_qa_indic
Dataset Description
Dataset Summary
truthful_qa_indic is an extension of the TruthfulQA dataset, focusing on generating truthful answers in Indic languages. The benchmark comprises 817 questions spanning 38 categories, challenging models to avoid generating false answers learned from imitating human texts.
Creation Process
It's a high-quality translation of TruthfulQA, meticulously crafted with a beam width of 5… See the full description on the dataset page: https://huggingface.co/datasets/iitrsamrat/truthful_qa_indic_gen.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for jtruthful_qa
Dataset Summary
JTruthfulQA is a Japanese counterpart to TruthfulQA (Lin et al., 2022). This dataset is not a translation of the original TruthfulQA; rather, it was constructed from the ground up. The purpose of this benchmark is to gauge the truthfulness of a language model's generated responses to various questions. The benchmark encompasses a total of 604 questions, distributed across three categories: Fact… See the full description on the dataset page: https://huggingface.co/datasets/andrijdavid/jtruthful_qa.