Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for RAG-Instruct-Benchmark-Tester
Dataset Summary
This is an updated benchmarking test dataset for "retrieval augmented generation" (RAG) use cases in the enterprise, especially for financial services and legal. This test dataset includes 200 questions with context passages pulled from common retrieval scenarios, e.g., financial news, earnings releases, contracts, invoices, technical articles, general news, and short texts.
The questions are segmented… See the full description on the dataset page: https://huggingface.co/datasets/llmware/rag_instruct_benchmark_tester.
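As a quick orientation, the dataset can be loaded with the Hugging Face datasets library; in the sketch below, the split and field names ("query", "context", "answer") are assumptions to verify against the dataset card.

# Sketch: load the 200-question benchmark and inspect a few items.
# Split and field names ("query", "context", "answer") are assumptions.
from datasets import load_dataset

ds = load_dataset("llmware/rag_instruct_benchmark_tester", split="train")
for row in ds.select(range(3)):
    print(row["query"], "| context length:", len(row["context"]), "| expected:", row["answer"])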
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German-RAG-LLM-EASY-BENCHMARK
German-RAG - German Retrieval Augmented Generation
Dataset Summary
This German-RAG-LLM-EASY-BENCHMARK is a specialized collection for evaluating language models, with a focus on source citation and stating time differences in RAG-specific tasks. To evaluate models compatible with OpenAI endpoints, you can refer to our GitHub repo: https://github.com/avemio-digital/German-RAG-LLM-EASY-BENCHMARK/ Most of the subsets are synthetically… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-LLM-EASY-BENCHMARK.
In recent times, one of the most impactful applications of the growing capabilities of Large Language Models (LLMs) has been their use in Retrieval-Augmented Generation (RAG) systems. RAG applications are inherently more robust against LLM hallucinations and provide source traceability, which holds critical importance in the scientific reading and writing process. However, validating such systems is essential due to the stringent systematic requirements of the scientific domain. Existing benchmark datasets are limited in the scope of research areas they cover, often focusing on the natural sciences, which restricts their applicability and validation across other scientific fields. To address this gap, we present a closed-question answering (QA) dataset for benchmarking scientific RAG applications. This dataset spans 34 research topics across 10 distinct areas of study. It includes 108 manually curated question-answer pairs, each annotated with answer type, difficulty level, and a gold reference along with a link to the source paper. Further details on each of these attributes can be found in the accompanying README.md file.
Please cite the following publication when using the dataset: TBD
The publication is available at: TBD
A preprint version of the publication is available at: TBD
nexa-rag-benchmark
The Nexa RAG Benchmark dataset is designed for evaluating Retrieval-Augmented Generation (RAG) models across multiple question-answering benchmarks. It includes a variety of datasets covering different domains. For evaluation, you can use the repository: 🔗 Nexa RAG Benchmark on GitHub.
Dataset Structure
This benchmark integrates multiple datasets suitable for RAG performance. You can choose datasets based on context size, number of examples, or… See the full description on the dataset page: https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark.
onepaneai/rag-benchmark-dataset is a dataset hosted on Hugging Face, contributed by the HF Datasets community.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FRAMES: Factuality, Retrieval, And reasoning MEasurement Set
FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: https://arxiv.org/abs/2409.12941.
Dataset Overview
824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles. Questions span diverse topics… See the full description on the dataset page: https://huggingface.co/datasets/google/frames-benchmark.
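A rough sketch of scoring a system on FRAMES follows; the split and field names ("Prompt", "Answer") are assumptions to check against the dataset card, and my_rag_system is a placeholder for the pipeline under test.

# Sketch: exact-match scoring over FRAMES; split and field names
# ("Prompt", "Answer") are assumptions; my_rag_system is a placeholder.
from datasets import load_dataset

frames = load_dataset("google/frames-benchmark", split="test")

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

# score = sum(exact_match(my_rag_system(r["Prompt"]), r["Answer"]) for r in frames) / len(frames)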
Custom license: https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems' endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation, but they require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first for the different chunking possibilities and parameters, measuring endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results, as the agent splits the task into multiple fine-granular subtasks, improving the overall RAG performance in token count, precision, and F1 score.
Content:
- code.zip: Python source code to perform the experiments.
  - evaluate.py: script to execute the experiments (uncomment lines to select the embedding model).
  - socrag/*: source code for the RAG.
  - benchmark/*: RestBench specification.
- results.zip: results of the RAG experiments (in the folder /results/data/ inside the zip file).
  - RAG experiment results: results_{embedding_model}_{top-k}.json.
  - Discovery Agent results: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
  - FAISS store (intermediate data required for exact reproduction of results; one folder per embedding model): bge_small, nvidia, and oai.
  - Intermediate data of the LLM-based refinement methods required for exact reproduction of results: *_parser.json.
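The format-specific chunking evaluated here can be pictured as one retrieval chunk per OpenAPI endpoint. Below is a minimal sketch of that idea, assuming a JSON-formatted OpenAPI document; it is an illustration only, not the authors' implementation from code.zip.

# Sketch: format-specific chunking of a JSON OpenAPI spec, one chunk per
# endpoint (method + path + summary/description). Illustration only; not the
# authors' implementation from code.zip.
import json

def chunk_openapi(spec_path):
    with open(spec_path) as f:
        spec = json.load(f)
    chunks = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if not isinstance(op, dict):
                continue  # skip path-level entries such as "parameters"
            text = " ".join(filter(None, [
                method.upper(), path, op.get("summary", ""), op.get("description", ""),
            ]))
            chunks.append({"endpoint": f"{method.upper()} {path}", "text": text})
    return chunks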
BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches.
The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 19 different datasets:
MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touche 2020, CQADupStack, Quora Question Pairs, DBPedia, SciDocs, FEVER, Climate-FEVER, SciFact, and Robust04.
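A typical zero-shot evaluation with the beir package looks roughly like the following sketch; the dataset (scifact) and retrieval model are placeholders, and the exact APIs may differ across beir versions.

# Sketch: zero-shot dense retrieval evaluation on one BEIR dataset (scifact).
# Dataset/model choices are placeholders; APIs may vary across beir versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)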
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
WixQA: Enterprise RAG Question-Answering Benchmark
📄 Full Paper Available: For comprehensive details on dataset design, methodology, evaluation results, and analysis, please see our complete research paper: WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation Cohen et al. (2025) - arXiv:2505.08643
Dataset Summary
WixQA is a four-config collection for evaluating and training Retrieval-Augmented Generation (RAG) systems in enterprise… See the full description on the dataset page: https://huggingface.co/datasets/Wix/WixQA.
CDLA-Permissive-2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/
REAL-MM-RAG-Bench: A Real-World Multi-Modal Retrieval Benchmark
We introduced REAL-MM-RAG-Bench, a real-world multi-modal retrieval benchmark designed to evaluate retrieval models in reliable, challenging, and realistic settings. The benchmark was constructed using an automated pipeline, where queries were generated by a vision-language model (VLM), filtered by a large language model (LLM), and rephrased by an LLM to ensure high-quality retrieval evaluation. To simulate real-world… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_FinReport.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains code and instructions for the BABILong benchmark. The BABILong benchmark is designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. BABILong uses tasks with facts and questions from bAbI; PG-19 books are used as the source of long natural contexts.
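The construction can be pictured as scattering task facts through long book text; the following toy sketch illustrates the idea and is not the official BABILong generator (the filler file name is a placeholder).

# Toy sketch of the BABILong construction: hide bAbI-style facts inside long
# filler text and ask a question whose answer requires finding them.
# Not the official generator; "pg19_book.txt" is a placeholder file.
import random

def build_long_context(facts, filler, target_words=4000):
    words = filler.split()[:target_words]
    positions = sorted(random.sample(range(len(words)), len(facts)))
    for pos, fact in zip(positions, facts):
        words[pos] = words[pos] + " " + fact
    return " ".join(words)

facts = ["Mary moved to the bathroom.", "John went to the hallway."]
context = build_long_context(facts, open("pg19_book.txt").read())
question = "Where is Mary?"  # expected answer: bathroom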
ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🔥Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
🤖️ Website • 🤗 VIF-RAG-QA-110K • 👉 VIF-RAG-QA-20K • 📖 Arxiv • 🤗 HF-Paper
We propose an instruction-following alignment pipeline named VIF-RAG and an auto-evaluation benchmark named FollowRAG:
VIF-RAG: the first automated, scalable, and verifiable data synthesis pipeline for aligning complex instruction-following in RAG scenarios. VIF-RAG integrates a verification process at each… See the full description on the dataset page: https://huggingface.co/datasets/dongguanting/ShareGPT-12K.
MuSiQue-Ans is a new multihop QA dataset with ~25K 2-4 hop questions using seed questions from 5 existing single-hop datasets.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Information
We introduce OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in the Financial Domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; a… See the full description on the dataset page: https://huggingface.co/datasets/RUC-NLPIR/OmniEval-KnowledgeCorpus.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
🔥Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
🤖️ Website • 🤗 VIF-RAG-QA-110K • 👉 VIF-RAG-QA-20K • 📖 Arxiv • 🤗 HF-Paper
We propose an instruction-following alignment pipeline named VIF-RAG and an auto-evaluation benchmark named FollowRAG:
VIF-RAG: the first automated, scalable, and verifiable data synthesis pipeline for aligning complex instruction-following in RAG scenarios. VIF-RAG integrates a verification process at each… See the full description on the dataset page: https://huggingface.co/datasets/dongguanting/VIF-RAG-QA-20K.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
German RAG LLM Evaluation Dataset
This dataset is intended for evaluating the German RAG (retrieval augmented generation) capabilities of LLMs. It is based on the test set of the deutsche-telekom/wikipedia-22-12-de-dpr dataset (also see wikipedia-22-12-de-dpr on GitHub) and consists of 4 subsets or tasks.
Task Description
The dataset consists of 4 subsets for the following 4 tasks (each task with 1000 prompts):
choose_context_by_question (subset… See the full description on the dataset page: https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval.
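Since each task is its own subset, a single task can be loaded directly; a minimal sketch using the subset named above:

# Minimal sketch: load one of the four Ger-RAG-eval tasks by subset name.
from datasets import load_dataset

task = load_dataset("deutsche-telekom/Ger-RAG-eval", "choose_context_by_question")
print(task)  # each task ships 1000 prompts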
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RAG Benchmark for Protein-Protein Interactions (RAGPPI)
📊 Overview
Retrieving expected therapeutic impacts in protein-protein interactions (PPIs) is crucial in drug development, enabling researchers to prioritize promising targets and improve success rates. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks accelerate discovery, no benchmark exists for identifying therapeutic impacts in PPIs. RAGPPI is the first factual QA benchmark… See the full description on the dataset page: https://huggingface.co/datasets/Youngseung/RAGPPI_Atomics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RAGBench
Dataset Overview
RAGBench is a large-scale RAG benchmark dataset of 100k RAG examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. RAGBench comprises 12 sub-component datasets, each split into train/validation/test.
Usage
from datasets import load_dataset
ragbench_hotpotqa = load_dataset("rungalileo/ragbench", "hotpotqa")  # repo id and config name are assumptions; see the dataset page for the 12 sub-components
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BiomixQA Dataset
Overview
BiomixQA is a curated biomedical question-answering dataset comprising two distinct components:
Multiple Choice Questions (MCQ) True/False Questions
This dataset has been utilized to validate the Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) framework across different Large Language Models (LLMs). The diverse nature of questions in this dataset, spanning multiple choice and true/false formats, along with its coverage of various… See the full description on the dataset page: https://huggingface.co/datasets/kg-rag/BiomixQA.
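Both components can presumably be loaded as separate configurations; in the sketch below, the config names ("mcq" and "true_false") and the split are assumptions to verify on the dataset card.

# Sketch: load the two BiomixQA components.
# Config names ("mcq", "true_false") and split are assumptions.
from datasets import load_dataset

mcq = load_dataset("kg-rag/BiomixQA", "mcq", split="train")
tf = load_dataset("kg-rag/BiomixQA", "true_false", split="train")
print(len(mcq), len(tf))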