Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for RAG-Instruct-Benchmark-Tester
Dataset Summary
This is an updated benchmarking test dataset for "retrieval augmented generation" (RAG) use cases in the enterprise, especially financial services and legal. The test dataset includes 200 questions with context passages pulled from common 'retrieval scenarios', e.g., financial news, earnings releases, contracts, invoices, technical articles, general news, and short texts.
The questions are segmented… See the full description on the dataset page: https://huggingface.co/datasets/llmware/rag_instruct_benchmark_tester.
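A minimal sketch of how a card like this can be exercised, using the Hugging Face datasets library; the split name and the column names "query", "context", and "answer" are assumptions here, since the excerpt above does not list the schema, so check the dataset viewer on the card page before relying on them.

```python
# Minimal sketch (assumed split and column names, not from the card):
# load the benchmark and iterate over question/context pairs.
from datasets import load_dataset

ds = load_dataset("llmware/rag_instruct_benchmark_tester", split="train")

for row in ds.select(range(3)):
    question = row.get("query")    # assumed column name
    context = row.get("context")   # assumed column name (retrieval passage)
    answer = row.get("answer")     # assumed column name (gold reference)
    print(question, "->", answer, f"({len(context or '')} chars of context)")
```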
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German-RAG-LLM-EASY-BENCHMARK
German-RAG - German Retrieval Augmented Generation
Dataset Summary
This German-RAG-LLM-EASY-BENCHMARK is a specialized collection for evaluating language models, with a focus on source citation and stating time differences in RAG-specific tasks. To evaluate models compatible with OpenAI endpoints, you can refer to our GitHub repo: https://github.com/avemio-digital/German-RAG-LLM-EASY-BENCHMARK/ Most of the subsets are synthetically… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-LLM-EASY-BENCHMARK.
In recent times, one of the most impactful applications of the growing capabilities of Large Language Models (LLMs) has been their use in Retrieval-Augmented Generation (RAG) systems. RAG applications are inherently more robust against LLM hallucinations and provide source traceability, which holds critical importance in the scientific reading and writing process. However, validating such systems is essential due to the stringent systematic requirements of the scientific domain. Existing benchmark datasets are limited in the scope of research areas they cover, often focusing on the natural sciences, which restricts their applicability and validation across other scientific fields. To address this gap, we present a closed-question answering (QA) dataset for benchmarking scientific RAG applications. This dataset spans 34 research topics across 10 distinct areas of study. It includes 108 manually curated question-answer pairs, each annotated with answer type, difficulty level, and a gold reference along with a link to the source paper. Further details on each of these attributes can be found in the accompanying README.md file. Please cite the following publication when using the dataset: TBD The publication is available at: TBD A preprint version of the publication is available at: TBD
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FRAMES: Factuality, Retrieval, And reasoning MEasurement Set
FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: https://arxiv.org/abs/2409.12941.
Dataset Overview
824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles. Questions span diverse topics… See the full description on the dataset page: https://huggingface.co/datasets/google/frames-benchmark.
nexa-rag-benchmark
The Nexa RAG Benchmark dataset is designed for evaluating Retrieval-Augmented Generation (RAG) models across multiple question-answering benchmarks. It includes a variety of datasets covering different domains. For evaluation, you can use the repository: 🔗 Nexa RAG Benchmark on GitHub
Dataset Structure
This benchmark integrates multiple datasets suitable for evaluating RAG performance. You can choose datasets based on context size, number of examples, or… See the full description on the dataset page: https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SILMA RAGQA Benchmark Dataset V1.0
SILMA RAGQA is a dataset and benchmark created by silma.ai to assess the effectiveness of Arabic Language Models in Extractive Question Answering tasks, with a specific emphasis on RAG applications. The benchmark includes 17 bilingual datasets in Arabic and English, spanning various domains.
What capabilities does the benchmark test?
General Arabic and English QA capabilities; ability to handle short and long contexts; ability to… See the full description on the dataset page: https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0.
BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches.
The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 19 different datasets:
MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touché-2020, CQADupStack, Quora Question Pairs, DBPedia, SciDocs, FEVER, Climate-FEVER, SciFact, Robust04
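As a sketch of how these datasets are typically accessed, the snippet below follows the documented loader pattern of the beir Python package; the dataset name, download URL, and split used here are illustrative choices, not prescriptions of the benchmark.

```python
# Minimal sketch following the documented usage of the `beir` package
# (pip install beir); the dataset name and split are examples only.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"  # any BEIR dataset name, e.g. "nfcorpus", "fiqa", "trec-covid"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "beir_datasets")

# corpus: doc_id -> {"title": ..., "text": ...}; queries: query_id -> text;
# qrels: query_id -> {doc_id: relevance grade}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "documents,", len(queries), "queries")
```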
Custom license: https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems' endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation, but they require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first for the different chunking possibilities and parameters, measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. While the results show high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results, as the agent splits the tasks into multiple fine-granular subtasks, improving the overall RAG performance in token count, precision, and F1 score.
Content:
code.zip: Python source code to perform the experiments.
  evaluate.py: script to execute the experiments (uncomment lines to select the embedding model).
  socrag/*: source code for the RAG.
  benchmark/*: RestBench specification.
results.zip: results of the RAG experiments (in the folder /results/data/ inside the zip file).
  Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
  Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
  FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia, and oai.
  Intermediate data of the LLM-based refinement methods required for exact reproduction of results: *_parser.json.
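The endpoint-retrieval recall, precision, and F1 mentioned in the abstract above reduce to set overlap between retrieved and gold endpoints; a minimal, self-contained sketch (not the code in code.zip) is shown below.

```python
# Minimal sketch (not the code in code.zip): endpoint-retrieval precision,
# recall, and F1 computed as set overlap between retrieved and gold endpoints.
def endpoint_retrieval_scores(retrieved: list[str], gold: list[str]) -> dict[str, float]:
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: RAG retrieved three endpoints, two of which the task actually requires.
print(endpoint_retrieval_scores(
    retrieved=["GET /pets", "GET /pets/{id}", "POST /orders"],
    gold=["GET /pets/{id}", "POST /orders"],
))
```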
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Open RAG Benchmark
The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes pure PDF content, meticulously extracting and generating queries on diverse modalities including text, tables, and images, even when they are intricately interwoven within a… See the full description on the dataset page: https://huggingface.co/datasets/vectara/open_ragbench.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
WixQA: Enterprise RAG Question-Answering Benchmark
📄 Full Paper Available: For comprehensive details on dataset design, methodology, evaluation results, and analysis, please see our complete research paper: WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation, Cohen et al. (2025), arXiv:2505.08643
Dataset Summary
WixQA is a three-config collection for evaluating and training Retrieval-Augmented Generation (RAG) systems in enterprise… See the full description on the dataset page: https://huggingface.co/datasets/Wix/WixQA.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains code and instructions for the BABILong benchmark. The BABILong benchmark is designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. BABILong uses tasks with facts and questions from bAbI; PG-19 books are used as the source of long natural contexts.
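The construction principle described above (bAbI facts and questions hidden inside long natural text) can be illustrated with a toy sketch; the fact placement below is only a schematic illustration under assumed inputs, not the benchmark's actual generation code.

```python
# Toy illustration (not BABILong's generation code): scatter task facts at
# random positions inside long distractor text, then pose the bAbI question.
import random

def embed_facts(facts, distractor_sentences, seed=0):
    rng = random.Random(seed)
    text = list(distractor_sentences)
    for fact in facts:
        text.insert(rng.randrange(len(text) + 1), fact)
    return " ".join(text)

facts = ["Mary moved to the bathroom.", "John went to the hallway."]
distractors = [f"Filler sentence number {i}." for i in range(1000)]  # stands in for PG-19 text
long_context = embed_facts(facts, distractors)
question = "Where is Mary?"  # expected answer: bathroom
print(len(long_context), "characters of context for one two-fact question")
```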
ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.
CDLA Permissive 2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/
REAL-MM-RAG-Bench: A Real-World Multi-Modal Retrieval Benchmark
We introduced REAL-MM-RAG-Bench, a real-world multi-modal retrieval benchmark designed to evaluate retrieval models in reliable, challenging, and realistic settings. The benchmark was constructed using an automated pipeline, where queries were generated by a vision-language model (VLM), filtered by a large language model (LLM), and rephrased by an LLM to ensure high-quality retrieval evaluation. To simulate real-world… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechSlides.
onepaneai/rag-benchmark-dataset: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset name:
AeroEngQA
Description:
AeroEngQA is a low-volume, high-quality benchmark aircraft design Question Answering (QA) dataset to support qualitative evaluation of Large Language Models (LLMs).
Dataset DOI:
10.5281/zenodo.14215677
Paper citation:
Silva, E.A.; Marsh, R.; Yong, H.K.; Middleton, S.E.; Sóbester, A. Retrieval-Augmented Generation and In-Context Prompted Large Language Models in Aircraft Engineering, AIAA-2025, AIAA, doi:10.2514/6.2025-0700
Abstract:
With the aerospace industry taking its first steps towards exploiting the rapidly evolving technology of Large Language Models (LLMs), this study explores the potential of the latest generation of LLMs to become an effective link in the aircraft design tool chain of the future. Our focus is on the task of Question Answering (QA) in engineering, which has the potential to augment future aircraft design team meetings with an intelligent LLM-based agent able to engage with the team via a chatbot interface. We compare three of the most effective and popular classes of LLM QA prompting today: LLM zero-shot prompting, LLM in-context prompting, and LLM-based Retrieval-Augmented Generation (RAG). We describe a new, low-volume, high-quality benchmark aircraft design QA dataset (AeroEngQA) and use it to qualitatively evaluate each class of LLM, exploring properties including answer accuracy and answer simplicity. We provide domain-specific insights into the usefulness of today's LLMs for engineering design tasks such as aircraft design, and a view on how this might evolve in the future as the next generation of LLMs emerges.
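To make the distinction between the three prompting classes compared in the abstract concrete, the sketch below expresses them as prompt-construction helpers; this is an illustrative sketch only, with wording and function names chosen here, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): the three prompting classes
# compared in the study, expressed as simple prompt-construction helpers.

def zero_shot_prompt(question: str) -> str:
    # Zero-shot: the model answers from its parametric knowledge alone.
    return f"Answer the following aircraft engineering question.\n\nQuestion: {question}\nAnswer:"

def in_context_prompt(question: str, reference_text: str) -> str:
    # In-context: reference material is supplied directly in the prompt.
    return ("Use the reference material below to answer the question.\n\n"
            f"Reference:\n{reference_text}\n\nQuestion: {question}\nAnswer:")

def rag_prompt(question: str, retrieved_passages: list) -> str:
    # RAG: only the top-k retrieved passages are placed in the prompt.
    context = "\n\n".join(retrieved_passages)
    return ("Use the retrieved passages below to answer the question.\n\n"
            f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:")

print(zero_shot_prompt("What is the typical aspect ratio of a modern airliner wing?"))
```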
Acknowledgements:
The DAWS 2 (Development of Advanced Wing Solutions 2) project is supported by the ATI Programme, a joint Government and industry investment to maintain and grow the UK’s competitive position in civil aerospace design and manufacture. The programme, delivered through a partnership between the Aerospace Technology Institute (ATI), Department for Business, Energy & Industrial Strategy (BEIS) and Innovate UK, addresses technology, capability and supply chain challenges.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
🔥Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
🤖️ Website • 🤗 VIF-RAG-QA-110K • 👉 VIF-RAG-QA-20K • 📖 Arxiv • 🤗 HF-Paper
We propose an instruction-following alignment pipeline named VIF-RAG and an auto-evaluation benchmark named FollowRAG:
VIF-RAG: It is the first automated, scalable, and verifiable data synthesis pipeline for aligning complex instruction-following in RAG scenarios. VIF-RAG integrates a verification process at each… See the full description on the dataset page: https://huggingface.co/datasets/dongguanting/VIF-RAG-QA-20K.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RAG Benchmark for Protein-Protein Interactions (RAGPPI)
📊 Overview
Retrieving expected therapeutic impacts in protein-protein interactions (PPIs) is crucial in drug development, enabling researchers to prioritize promising targets and improve success rates. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks accelerate discovery, no benchmark exists for identifying therapeutic impacts in PPIs. RAGPPI is the first factual QA benchmark… See the full description on the dataset page: https://huggingface.co/datasets/Youngseung/RAGPPI.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmented Generation
CORAL is a large-scale multi-turn conversational RAG benchmark that fulfills the critical features mentioned in our paper to systematically evaluate and advance conversational RAG systems. In CORAL, we evaluate conversational RAG systems across three essential tasks: (1) Conversational Passage Retrieval: assessing the system's ability to retrieve the relevant information from a large document set based… See the full description on the dataset page: https://huggingface.co/datasets/ariya2357/CORAL.