68 datasets found

rag_instruct_benchmark_tester
huggingface.co
opendatalab.com
Cite
llmware, rag_instruct_benchmark_tester [Dataset]. https://huggingface.co/datasets/llmware/rag_instruct_benchmark_tester
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset authored and provided by
llmware
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for RAG-Instruct-Benchmark-Tester

Dataset Summary

This is an updated benchmarking test dataset for "retrieval augmented generation" (RAG) use cases in the enterprise, especially in financial services and legal. The test dataset includes 200 questions with context passages pulled from common 'retrieval scenarios', e.g., financial news, earnings releases, contracts, invoices, technical articles, general news and short texts.
The questions are segmented… See the full description on the dataset page: https://huggingface.co/datasets/llmware/rag_instruct_benchmark_tester.
German-RAG-LLM-EASY-BENCHMARK
huggingface.co
Updated Feb 6, 2025
Cite
Avemio AG (2025). German-RAG-LLM-EASY-BENCHMARK [Dataset]. https://huggingface.co/datasets/avemio/German-RAG-LLM-EASY-BENCHMARK
Explore at:
Croissant
Dataset updated
Feb 6, 2025
Dataset authored and provided by
Avemio AG
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
German-RAG-LLM-EASY-BENCHMARK

German-RAG - German Retrieval Augmented Generation Dataset Summary

This German-RAG-LLM-BENCHMARK is a specialized collection for evaluating language models, with a focus on source citation and on stating time differences in RAG-specific tasks. To evaluate models compatible with OpenAI endpoints, you can refer to our GitHub repo: https://github.com/avemio-digital/German-RAG-LLM-EASY-BENCHMARK/ Most of the Subsets are synthetically… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-LLM-EASY-BENCHMARK.
SciRAG-QA: Multi-domain Closed-Question Benchmark Dataset for Scientific QA
zenodo.org
bin, csv, json
Updated Dec 15, 2024
Cite
Mahira Ibnath Joytu; Md Raisul Kibria; Sébastien Lafond (2024). SciRAG-QA: Multi-domain Closed-Question Benchmark Dataset for Scientific QA [Dataset]. http://doi.org/10.5281/zenodo.14390011
Explore at:
Available download formats: csv, bin, json
Unique identifier
https://doi.org/10.5281/zenodo.14390011
Dataset updated
Dec 15, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Mahira Ibnath Joytu; Md Raisul Kibria; Sébastien Lafond
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 2024
Description
In recent times, one of the most impactful applications of the growing capabilities of Large Language Models (LLMs) has been their use in Retrieval-Augmented Generation (RAG) systems. RAG applications are inherently more robust against LLM hallucinations and provide source traceability, which holds critical importance in the scientific reading and writing process. However, validating such systems is essential due to the stringent systematic requirements of the scientific domain. Existing benchmark datasets are limited in the scope of research areas they cover, often focusing on the natural sciences, which restricts their applicability and validation across other scientific fields.

To address this gap, we present a closed-question answering (QA) dataset for benchmarking scientific RAG applications. This dataset spans 34 research topics across 10 distinct areas of study. It includes 108 manually curated question-answer pairs, each annotated with answer type, difficulty level, and a gold reference along with a link to the source paper. Further details on each of these attributes can be found in the accompanying README.md file.

Please cite the following publication when using the dataset: TBD

The publication is available at: TBD

A preprint version of the publication is available at: TBD
RAG-Evaluation-Dataset-KO
huggingface.co
Updated Aug 9, 2024
Cite
allganize (2024). RAG-Evaluation-Dataset-KO [Dataset]. https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO
Explore at:
Croissant
Dataset updated
Aug 9, 2024
Dataset authored and provided by
allganize
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Allganize RAG Leaderboard

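The Allganize RAG Leaderboard evaluates Korean RAG performance across five domains (finance, public sector, healthcare, legal, and commerce). Typical RAG systems answer simple questions well but struggle with questions about tables and images in documents.
Many companies that want to adopt RAG are looking for Korean RAG performance results that reflect their own domains, document types, and question styles. Evaluation requires datasets of public documents, questions, and answers, but building them in-house takes considerable time and money. Allganize is now releasing all of its RAG evaluation data. RAG consists of three main parts: Parser, Retrieval, and Generation. Among the RAG leaderboards that are currently public, there is no Korean-language leaderboard that evaluates all three parts together. In the Allganize RAG Leaderboard, documents are… See the full description on the dataset page: https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO.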
nexa-rag-benchmark
huggingface.co
Updated Mar 11, 2025
Cite
zhanx (2025). nexa-rag-benchmark [Dataset]. https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark
Explore at:
Croissant
Dataset updated
Mar 11, 2025
Authors
zhanx
Description
nexa-rag-benchmark

The Nexa RAG Benchmark dataset is designed for evaluating Retrieval-Augmented Generation (RAG) models across multiple question-answering benchmarks. It includes a variety of datasets covering different domains. For evaluation, you can use the repository: 🔗 Nexa RAG Benchmark on GitHub

Dataset Structure

This benchmark integrates multiple datasets suitable for RAG performance. You can choose datasets based on context size, number of examples, or… See the full description on the dataset page: https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark.
rag-benchmark-dataset
huggingface.co
Cite
Onepane.ai, rag-benchmark-dataset [Dataset]. https://huggingface.co/datasets/onepaneai/rag-benchmark-dataset
Explore at:
Croissant
Dataset provided by
Onepane.ai
Description
onepaneai/rag-benchmark-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
frames-benchmark
huggingface.co
Cite
Google, frames-benchmark [Dataset]. https://huggingface.co/datasets/google/frames-benchmark
Explore at:
Croissant
Dataset authored and provided by
Google: http://google.com/
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
FRAMES: Factuality, Retrieval, And reasoning MEasurement Set

FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: https://arxiv.org/abs/2409.12941.

Dataset Overview

824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles
Questions span diverse topics… See the full description on the dataset page: https://huggingface.co/datasets/google/frames-benchmark.
Replication Data for: Advanced System Integration: Analyzing OpenAPI...
b2find.eudat.eu
Updated Jul 28, 2025
Cite
(2025). Replication Data for: Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/cef84fc3-1b58-5368-bef2-be7b1a3097a6
Dataset updated
Jul 28, 2025
Description
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first for the different chunking possibilities and parameters, measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While the results reveal high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results, as the agent splits the tasks into multiple fine-granular subtasks, improving the overall RAG performance in token count, precision, and F1 score.
Content:
code.zip: Python source code to perform the experiments.
  evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
  socrag/*: Source code for the RAG.
  benchmark/*: RestBench specification.
results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
  Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
  Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
  FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia and oai.
  Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
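The description above reports endpoint retrieval recall, precision, and F1 score. As a rough illustration of how these three metrics relate for a single query, here is a minimal Python sketch. It is not taken from the replication package: the function name `retrieval_scores` and the endpoint strings are hypothetical, and the actual evaluate.py may compute or aggregate the metrics differently.

```python
def retrieval_scores(retrieved, relevant):
    """Return (precision, recall, f1) for one query.

    retrieved: endpoints returned by the retriever (hypothetical input)
    relevant:  endpoints the gold solution actually needs (hypothetical input)
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # correctly retrieved endpoints

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0

    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up endpoint names, for illustration only.
retrieved = [
    "GET /movie/popular",
    "GET /movie/{id}",
    "GET /tv/popular",
    "GET /search/movie",
]
relevant = ["GET /movie/popular", "GET /search/movie"]

precision, recall, f1 = retrieval_scores(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case every required endpoint is retrieved (full recall), but half of the retrieved endpoints are unnecessary, which lowers precision and therefore F1. This is the kind of trade-off that a larger top-k setting would typically produce.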
REAL-MM-RAG_TechSlides
huggingface.co
Updated Mar 13, 2025
Cite
IBM Research (2025). REAL-MM-RAG_TechSlides [Dataset]. https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechSlides
Explore at:
Croissant
Dataset updated
Mar 13, 2025
Dataset provided by
IBM: http://ibm.com/
IBM Research
Authors
IBM Research
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
REAL-MM-RAG-Bench: A Real-World Multi-Modal Retrieval Benchmark

We introduced REAL-MM-RAG-Bench, a real-world multi-modal retrieval benchmark designed to evaluate retrieval models in reliable, challenging, and realistic settings. The benchmark was constructed using an automated pipeline, where queries were generated by a vision-language model (VLM), filtered by a large language model (LLM), and rephrased by an LLM to ensure high-quality retrieval evaluation. To simulate real-world… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechSlides.
silma-rag-qa-benchmark-v1.0
huggingface.co
Updated May 13, 2025
Cite
SILMA AI - Arabic Language Models (2025). silma-rag-qa-benchmark-v1.0 [Dataset]. https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0
Explore at:
Croissant
Dataset updated
May 13, 2025
Dataset authored and provided by
SILMA AI - Arabic Language Models
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
SILMA RAGQA Benchmark Dataset V1.0

SILMA RAGQA is a dataset and benchmark created by silma.ai to assess the effectiveness of Arabic Language Models in Extractive Question Answering tasks, with a specific emphasis on RAG applications.
The benchmark includes 17 bilingual datasets in Arabic and English, spanning various domains.

What capabilities does the benchmark test?

General Arabic and English QA capabilities
Ability to handle short and long contexts
Ability to… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0.
Supplementary file 1_Swedish Medical LLM Benchmark: development and...
frontiersin.figshare.com
pdf
Updated Jul 11, 2025
Cite
Birger Moëll; Fabian Farestam; Jonas Beskow (2025). Supplementary file 1_Swedish Medical LLM Benchmark: development and evaluation of a framework for assessing large language models in the Swedish medical domain.pdf [Dataset]. http://doi.org/10.3389/frai.2025.1557920.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/frai.2025.1557920.s001
Dataset updated
Jul 11, 2025
Dataset provided by
Frontiers
Authors
Birger Moëll; Fabian Farestam; Jonas Beskow
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing large language models (LLMs) in the Swedish medical domain.
Method: The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases.
Result: Our evaluation of 18 state-of-the-art LLMs reveals GPT-4-turbo, Claude-3.5 (October 2023), and the o3 model as top performers, demonstrating a strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy for clinical knowledge questions, highlighting promising directions for safe implementation.
Discussion: The SMLB provides not only an evaluation tool but also reveals fundamental insights about LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This study emphasizes the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly in non-English medical contexts where linguistic and cultural specificity are paramount.
Spreadsheet Manipulation using Large Language Models
figshare.com
zip
Updated Jul 19, 2025
Cite
Amila Indika (2025). Spreadsheet Manipulation using Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.29602751.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.29602751.v1
Dataset updated
Jul 19, 2025
Dataset provided by
figshare
Figshare: http://figshare.com/
Authors
Amila Indika
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spreadsheet manipulation code to text summary dataset description

The benchmark dataset comprises 111 instances of spreadsheet manipulation tasks, each accompanied by xwAPI code and corresponding subtasks in natural language. The YAML file (.yaml) within each directory contains xwAPI code ("refined response") and its corresponding natural language summary of subtasks ("intermediate response").
WixQA
huggingface.co
Updated Dec 2, 2024
Cite
Wix (2024). WixQA [Dataset]. https://huggingface.co/datasets/Wix/WixQA
Dataset updated
Dec 2, 2024
Dataset provided by
Wix.com: http://wix.com/
Authors
Wix
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
WixQA: Enterprise RAG Question-Answering Benchmark

📄 Full Paper Available: For comprehensive details on dataset design, methodology, evaluation results, and analysis, please see our complete research paper: "WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation", Cohen et al. (2025), arXiv:2505.08643

Dataset Summary

WixQA is a three-config collection for evaluating and training Retrieval-Augmented Generation (RAG) systems in enterprise… See the full description on the dataset page: https://huggingface.co/datasets/Wix/WixQA.
RAG-RewardBench
huggingface.co
Updated Dec 18, 2024
Cite
Zhuoran Jin (2024). RAG-RewardBench [Dataset]. https://huggingface.co/datasets/jinzhuoran/RAG-RewardBench
Explore at:
Croissant
Dataset updated
Dec 18, 2024
Authors
Zhuoran Jin
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/
350M Model
figshare.com
json
Updated May 23, 2025
Cite
Pavel Chizhov (2025). 350M Model [Dataset]. http://doi.org/10.6084/m9.figshare.29135096.v1
Explore at:
Available download formats: json
Unique identifier
https://doi.org/10.6084/m9.figshare.29135096.v1
Dataset updated
May 23, 2025
Dataset provided by
figshare
Authors
Pavel Chizhov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
350M Model

RAG-350M is a 350-million-parameter Small Reasoning Model, trained for retrieval-augmented generation (RAG), search and source summarization. Along with RAG-1B, it belongs to our family of specialized reasoning models.

RAG-350M outperforms most SLMs (4 billion parameters and below) on standardized benchmarks for retrieval-augmented generation (HotPotQA, 2wiki) and is a highly cost-effective alternative to popular larger models, including Qwen-2.5-7B, Llama-3.1-8B and Gemma-3-4B. It is the only SLM to date to maintain consistent RAG performance across leading European languages and to ensure systematic reference grounding for statements. Due to its size, ease of deployment on constrained infrastructure (including mobile phones) and built-in support for factual and accurate information, RAG-350M unlocks a range of new use cases for generative AI.

Features

RAG-350M is a specialized language model using a series of special tokens to process a structured input (query and sources) and generate a structured output (reasoning sequence and answer with sources). For easier implementation, we encourage using the associated API library.

Citation support

RAG-350M natively generates grounded answers on the basis of excerpts and citations extracted from the provided sources, using a custom syntax inspired by Wikipedia. It is one of a handful of open-weights models to date to have been developed with this feature and the first one designed for actual deployment. In contrast with Anthropic's approach (Citation mode), citations are integrally generated by the model and are not the product of external chunking. As a result, we can provide another desirable feature to simplify source checking: citation shortening for longer excerpts (using "(…)").

RAG reasoning

RAG-350M generates specific reasoning sequences incorporating several proto-agentic abilities for RAG applications. The model is able to make a series of decisions directly:
* Assessing whether the query is understandable.
* Assessing whether the query is trivial enough to not require a lengthy pre-analysis (adjustable reasoning).
* Assessing whether the sources contain enough input to generate a grounded answer.

The structured reasoning trace includes the following steps:
* Language detection of the query. The model will always strive to answer in the language of the original query.
* Query analysis and associated query report. The analysis can lead to a standard answer, a shortened reasoning trace/answer for trivial questions, a reformulated query or a refusal (which could, in the context of the application, be transformed into user input querying).
* Source analysis and associated source report. This step evaluates the coverage and depth of the provided sources with regard to the query.
* Draft of the final answer.

Multilinguality

RAG-350M is able to read and write in the main European languages: French, German, Italian, Spanish and, to a lesser extent, Polish, Latin and Portuguese.

To date, it is the only small language model with negligible loss of performance in leading European languages for RAG-related tasks. On a translated set of HotPotQA, we observed a significant drop of performance in most SLMs, from 10% to 30-35% for sub-1B models.

We do expect that the results of any standard English evaluation on our RAG models should be largely transferable to the main European languages, limiting the costs of evaluation and deployment in multilingual settings.

Training

RAG-350M is trained on a large synthetic dataset emulating retrieval of a wide variety of multilingual open sources from Common Corpus. They provide native support for citation and grounding with literal quotes. Following the latest trends of agentification, the models reintegrate multiple features associated with RAG workflows such as query routing, query reformulation and source reranking.

Evaluation

RAG-350M was evaluated on three standard RAG benchmarks: 2wiki, HotpotQA and MuSique.

All the benchmarks only assess the "trivial" mode on questions requiring some form of multi-hop reasoning over sources (answer disseminated into different sources) as well as discrimination of distractor sources.

RAG-350M is not simply a cost-effective version of larger models. We found it has been able to correctly answer several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.
The BABILong Benchmark
github.com
Updated Apr 15, 2025
Cite
(2025). The BABILong Benchmark [Dataset]. https://github.com/booydar/babilong
Dataset updated
Apr 15, 2025
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This repository contains code and instructions for the BABILong benchmark. The BABILong benchmark is designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. BABILong uses tasks with facts and questions from bAbI. PG-19 books are used as the source of long natural contexts.
open_ragbench
huggingface.co
Cite
Vectara, open_ragbench [Dataset]. https://huggingface.co/datasets/vectara/open_ragbench
Dataset authored and provided by
Vectara
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Open RAG Benchmark

The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes pure PDF content, meticulously extracting and generating queries on diverse modalities including text, tables, and images, even when they are intricately interwoven within a… See the full description on the dataset page: https://huggingface.co/datasets/vectara/open_ragbench.
Ger-RAG-eval
huggingface.co
Cite
Deutsche Telekom AG, Ger-RAG-eval [Dataset]. https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval
Explore at:
Croissant
Dataset provided by
Deutsche Telekom: http://www.telekom.de/
Authors
Deutsche Telekom AG
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
German RAG LLM Evaluation Dataset

This dataset is intended for the evaluation of German RAG (retrieval augmented generation) capabilities of LLM models. It is based on the test set of the deutsche-telekom/wikipedia-22-12-de-dpr data set (also see wikipedia-22-12-de-dpr on GitHub) and consists of 4 subsets or tasks.

Task Description

The dataset consists of 4 subsets for the following 4 tasks (each task with 1000 prompts):

choose_context_by_question (subset… See the full description on the dataset page: https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval.
AeroEngQA
zenodo.org
bin, json, txt
Updated Jun 24, 2025
Cite
Stuart E. Middleton (2025). AeroEngQA [Dataset]. http://doi.org/10.5281/zenodo.14215677
Explore at:
Available download formats: json, txt, bin
Unique identifier
https://doi.org/10.5281/zenodo.14215677
Dataset updated
Jun 24, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Stuart E. Middleton
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset name:

AeroEngQA

Description:

AeroEngQA is a low-volume, high-quality benchmark aircraft design Question Answer (QA) dataset to support qualitative evaluation of Large Language Models (LLMs).

Dataset DOI:

10.5281/zenodo.14215677

Paper citation:

Silva, E.A., Marsh, R., Yong, H.K., Middleton, S.E., Sóbester, A. Retrieval-Augmented Generation and In-Context Prompted Large Language Models in Aircraft Engineering, AIAA-2025, AIAA, doi:10.2514/6.2025-0700

Abstract:

With the aerospace industry taking its first steps towards exploiting the rapidly evolving technology of Large Language Models (LLMs), this study explores the potential of the latest generation of LLMs to become an effective link in the aircraft design tool chain of the future. Our focus is on the task of Question Answering (QA) in engineering, which has the potential to augment future aircraft design team meetings with an intelligent LLM-based agent able to engage with the team via a chatbot interface. We compare three of the most effective and popular classes of LLM QA prompting today: LLM zero-shot prompting, LLM in-context prompting and LLM-based Retrieval-Augmented Generation (RAG). We describe a new, low-volume, high-quality benchmark aircraft design QA dataset (AeroEngQA) and use it to qualitatively evaluate each class of LLM, exploring properties including answer accuracy and answer simplicity. We provide domain-specific insights into the usefulness of today’s LLMs for engineering design tasks such as aircraft design, and a view on how this might evolve in the future as the next generation of LLMs emerges.

Acknowledgements:

The DAWS 2 (Development of Advanced Wing Solutions 2) project is supported by the ATI Programme, a joint Government and industry investment to maintain and grow the UK’s competitive position in civil aerospace design and manufacture. The programme, delivered through a partnership between the Aerospace Technology Institute (ATI), Department for Business, Energy & Industrial Strategy (BEIS) and Innovate UK, addresses technology, capability and supply chain challenges.
TrustMus benchmark: The Role of Large Language Models in Musicology: Are We...
zenodo.org
Updated Sep 3, 2024
Cite
Pedro Ramoneda; Emilia Parada-Cabaleiro; Weck Benno; Serra Xavier (2024). TrustMus benchmark: The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? [Dataset]. http://doi.org/10.5281/zenodo.13644330
Unique identifier
https://doi.org/10.5281/zenodo.13644330
Dataset updated
Sep 3, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Pedro Ramoneda; Emilia Parada-Cabaleiro; Weck Benno; Serra Xavier
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TrustMus is an initial, rigorously validated benchmark designed to assess the accuracy and reliability of large language models (LLMs) in the domain of musicology. This dataset includes a collection of 400 human-validated multiple-choice questions, categorized into four thematic areas: People (Ppl), Instruments and Technology (I&T), Genres, Forms, and Theory (Thr), and Culture and History (C&H).

The questions are derived from The Grove Dictionary Online using a semi-automated methodology. The process involves generating initial questions with a fine-tuned retrieval-augmented generation (RAG) model, filtering them through a series of automated checks, and finally validating them through expert human annotation. TrustMus is introduced in an initial paper, providing a critical resource for researchers and developers aiming to evaluate and improve LLM performance in this specialized field of musicology.

This benchmark is discussed in the paper cited below.

BibTeX Citation:

@inproceedings{ramoneda2024trustmus,
title={The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?},
author={Ramoneda, Pedro and Parada-Cabaleiro, Emilia and Weck, Benno and Serra, Xavier},
booktitle={Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)},
year={2024},
month={November},
address={San Francisco, USA},
organization={Co-located with ISMIR'2024}
}
