Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains the manual execution results of the evaluation scenarios, the list of RAG repositories, the automatically generated question-answer pairs and the results of executing the evaluation scenarios across 5 open-source RAG pipelines using both our approach and the RAGAS approach, as well as automation scripts for generating question-answer pairs and running the generated questions against the selected RAG pipelines.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Allganize RAG Leaderboard
The Allganize RAG Leaderboard evaluates the performance of Korean RAG across five domains (finance, public sector, healthcare, legal, and commerce). A typical RAG answers simple questions well, but struggles with questions about tables and images in documents.
Many companies that want to adopt RAG are looking for a Korean RAG performance table that reflects their own domain, document types, and question formats. Evaluation requires a dataset of published documents, questions, and answers, but building one in-house takes considerable time and cost. Allganize is now releasing all of its RAG evaluation data.
RAG consists of three main parts: Parser, Retrieval, and Generation. Among the RAG leaderboards published so far, there is no Korean leaderboard that evaluates all three parts as a whole.
On the Allganize RAG Leaderboard, documents are… See the full description on the dataset page: https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of Acquired Podcast Transcripts, specifically curated for evaluating Retrieval-Augmented Generation (RAG) systems. It includes human-verified answers and AI model responses both with and without access to the transcripts, along with correctness ratings and quality assessments. The dataset's core purpose is to facilitate the development and testing of AI models, particularly in the domain of natural language processing and question-answering.
The dataset contains several key columns designed for RAG evaluation:
* question: The query posed for evaluation.
* human_answer: The reference answer provided by a human.
* ai_answer_without_the_transcript: The answer generated by an AI model when it does not have access to the transcript.
* ai_answer_without_the_transcript_correctness: A human-verified assessment of the factual accuracy of the AI answer without the transcript (e.g., CORRECT, INCORRECT, Other).
* ai_answer_with_the_transcript: The answer generated by an AI model when it does have access to the transcript.
* ai_answer_with_the_transcript_correctness: A human-verified assessment of the factual accuracy of the AI answer with the transcript (e.g., CORRECT, INCORRECT, Other).
* quality_rating_for_answer_with_transcript: A human rating of the quality of the AI answer when the model had access to the transcript.
* post_url: The URL of the specific Acquired Podcast episode related to the question.
* file_name: The name of the transcript file corresponding to the episode.
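As a rough illustration of how these columns might be used, the following sketch loads the QA table with pandas and compares the share of CORRECT labels with and without transcript access; the file name qa_dataset.csv is a placeholder for the actual CSV shipped with the dataset.

```python
import pandas as pd

# Minimal sketch: load the RAG-evaluation QA table and compare correctness
# with and without transcript access. "qa_dataset.csv" is a placeholder name.
df = pd.read_csv("qa_dataset.csv")

def correct_rate(column: str) -> float:
    """Share of rows whose human-verified label is CORRECT."""
    return (df[column].str.upper() == "CORRECT").mean()

print("Without transcript:", correct_rate("ai_answer_without_the_transcript_correctness"))
print("With transcript:   ", correct_rate("ai_answer_with_the_transcript_correctness"))
```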
The dataset comprises 200 Acquired Podcast Transcripts, totalling approximately 3.5 million words. This is roughly equivalent to 5,500 pages when formatted into a Word document. It also includes a dedicated QA dataset for RAG evaluation, structured as a CSV file.
This dataset is ideal for:
* Evaluating the factual accuracy and quality of AI models, particularly those employing RAG techniques.
* Developing and refining natural language processing (NLP) models.
* Training and testing question-answering systems.
* Benchmarking the performance of different AI models in information retrieval tasks.
* Conducting research in artificial intelligence and machine learning, focusing on generative AI.
The dataset's content is derived from 200 episodes of the Acquired Podcast, collected from its official website. It covers a range of topics typically discussed on the podcast, including business, technology, and finance. The data collection focused on transcripts available at the time of sourcing.
CC0
Original Data Source: Acquired Podcast Transcripts and RAG Evaluation
emirMb/RAG-EVALUATION-QA dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
German RAG LLM Evaluation Dataset
This dataset is intended for evaluating the German RAG (retrieval-augmented generation) capabilities of LLMs. It is based on the test set of the deutsche-telekom/wikipedia-22-12-de-dpr dataset (also see wikipedia-22-12-de-dpr on GitHub) and consists of 4 subsets, or tasks.
Task Description
The dataset consists of 4 subsets for the following 4 tasks (each task with 1000 prompts):
choose_context_by_question (subset… See the full description on the dataset page: https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval.
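For orientation, here is a minimal sketch of loading one of the four task subsets with the Hugging Face datasets library; the config name choose_context_by_question is taken from the description above, while the split name "test" is an assumption and may differ on the Hub.

```python
from datasets import load_dataset

# Minimal sketch: load one of the four Ger-RAG-eval task subsets.
# The config name comes from the description above; the split name "test"
# is an assumption and may differ on the Hub.
task = load_dataset("deutsche-telekom/Ger-RAG-eval", "choose_context_by_question", split="test")

print(len(task))   # each task is described as containing 1000 prompts
print(task[0])     # inspect a single prompt
```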
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file provides the evaluation metrics used to assess the performance of RAG pipelines in the various papers.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
LEANN-RAG Evaluation Data
This repository contains the necessary data to run the recall evaluation scripts for the LEANN-RAG project.
Dataset Components
This dataset is structured into three main parts:
Pre-built LEANN Indices:
dpr/: A pre-built index for the DPR dataset.
rpj_wiki/: A pre-built index for the RPJ-Wiki dataset.
These indices were created using the leann-core library and are required by the LeannSearcher.
Ground Truth Data:
ground_truth/: Contains the… See the full description on the dataset page: https://huggingface.co/datasets/LEANN-RAG/leann-rag-evaluation-data.
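Since the pre-built indices are meant to be consumed by the project's own evaluation scripts, the hedged sketch below only fetches the repository contents with huggingface_hub; the repo id is taken from this page, and the local directory name is an arbitrary choice.

```python
from huggingface_hub import snapshot_download

# Sketch: download the pre-built indices and ground-truth files so the
# LEANN-RAG evaluation scripts (and their LeannSearcher) can point at them.
# Only the repo id comes from this page; local_dir is an arbitrary choice.
local_path = snapshot_download(
    repo_id="LEANN-RAG/leann-rag-evaluation-data",
    repo_type="dataset",
    local_dir="leann_rag_eval_data",
)
print(local_path)
```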
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A classic retrieval augmented generation (RAG) Q&A bot that answers questions about the GVHD medical condition.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a summary of the experimental results obtained from an HCAI system implemented using a RAG framework and the Llama 3.0 model. A total of 125 hyperparameter configurations were defined by aggregating metrics based on the median of the results from 91 questions and their corresponding answers. These configurations represent the alternatives evaluated through Multi-Criteria Decision-Making (MCDM) methods.
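The aggregation step described above (reducing per-question metric values to one value per configuration via the median) could look roughly like the following pandas sketch; the column names config_id, metric, and value are hypothetical placeholders, not the schema of the published file.

```python
import pandas as pd

# Illustrative sketch of median aggregation per hyperparameter configuration.
# Column names ("config_id", "metric", "value") are hypothetical placeholders.
raw = pd.DataFrame({
    "config_id": [1, 1, 2, 2],
    "metric":    ["faithfulness"] * 4,
    "value":     [0.81, 0.77, 0.62, 0.70],
})

aggregated = (
    raw.groupby(["config_id", "metric"])["value"]
       .median()
       .reset_index(name="median_value")
)
print(aggregated)  # one median value per (configuration, metric) pair
```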
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the raw data of all collected papers.
Custom license: https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems' endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation, but they require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first for the different chunking possibilities and parameters, measuring endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that, for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results, as the agent splits the task into multiple fine-granular subtasks, improving the overall RAG performance in token count, precision, and F1 score.
Content:
code.zip: Python source code to perform the experiments.
evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
socrag/*: Source code for the RAG.
benchmark/*: RestBench specification.
results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia, and oai.
Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
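The endpoint-retrieval metrics named above (recall, precision, and F1 over retrieved versus required endpoints) can be computed as in the following sketch; the endpoint names are illustrative only and not taken from RestBench.

```python
# Sketch of the endpoint-retrieval metrics: compare the endpoints returned by
# the RAG step against the ground-truth endpoints of a task. The endpoint
# names below are illustrative only.
def retrieval_scores(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

retrieved = {"GET /movies", "GET /movies/{id}", "POST /tickets"}
relevant = {"GET /movies/{id}", "POST /tickets"}
print(retrieval_scores(retrieved, relevant))  # roughly (0.67, 1.0, 0.8)
```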
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Open RAG Benchmark
The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes pure PDF content, meticulously extracting and generating queries on diverse modalities including text, tables, and images, even when they are intricately interwoven within a… See the full description on the dataset page: https://huggingface.co/datasets/vectara/open_ragbench.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A detailed list of different RAG methods used in the surveyed studies.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset card for RAG-BENCH
Data Summary
RAG-bench aims to provide results for many commonly used RAG datasets. All results in this dataset were produced with the RAG evaluation tool Rageval and can easily be reproduced with it. Currently, we provide results for the ASQA, ELI5, and HotpotQA datasets.
Data Instance
ASQA
{ "ambiguous_question":"Who is the original artist of sound of silence?", "qa_pairs":[{… See the full description on the dataset page: https://huggingface.co/datasets/golaxy/rag-bench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
What is the Allganize RAG Leaderboard?
The Allganize RAG Leaderboard evaluates the performance of Japanese RAG across five industry domains (finance, information and communications, manufacturing, public sector, and distribution/retail). A typical RAG can answer simple questions, but in many cases it cannot answer questions about information contained in figures and tables. Many companies considering RAG adoption want a Japanese RAG performance evaluation that reflects their own industry domain, document types, and question formats. Evaluating RAG performance requires validation documents, question-and-answer datasets, and a validation environment; to help companies considering RAG adoption, Allganize has released the data needed for Japanese RAG performance evaluation. A RAG solution consists of three parts: Parser, Retrieval, and Generation. As of publication, no Japanese RAG leaderboard comprehensively evaluates these three parts. Allganize RAG… See the full description on the dataset page: https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-JA.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Information
We introduce an omnidirectional and automatic RAG benchmark in the financial domain, OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; a… See the full description on the dataset page: https://huggingface.co/datasets/RUC-NLPIR/OmniEval-Human-Questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package for the research paper "Conversing with business process-aware Large Language Models: the BPLLM framework".
The package includes the process models, the questions (and expected answers), the results of the qualitative evaluation, and the Hugging Face links to the fine-tuned versions of Llama 3.1 8B employed in the quantitative evaluation of the framework.
In particular, the process models are:
The natural language Directly-follows graph (DFG) of the Food Delivery process: food_delivery_activities.txt for the definition of the activities and food_delivery_flow.txt for the sequence flow.
The BPMN model of the Food Delivery, E-commerce, and Reimbursement processes: ecommerce.bpmn, food_delivery.bpmn, and reimbursement.bpmn.
The datasets with the questions and the expected answers are:
1_questions_answers_not_refined_for_DFG.csv;
1.1_questions_answers_refined_for_DFG.csv;
2_questions_answers_not_refined.csv;
3_questions_answers_refined.csv;
4_questions_answers_different_processes.csv;
5_questions_answers_similar_processes.csv;
6_questions_answers_refined_ft.csv.
The complete results of the qualitative evaluation are contained in the file qualitative_experiments_results.pdf.
The Hugging Face links to the fine-tuned versions of Llama 3.1 8B are reported in hf_links_finetuned_models.pdf.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data reflect the results of the experimentation with an HCAI system implemented using a RAG framework and the Llama 3.0 model. During the experimentation, 91 questions were utilized in the domain of legal advice and migrant rights. Metrics assessed included contextual enrichment, textual quality, discourse analysis, and sentiment evaluation. This allows for the analysis of sentiments and emotions, bias detection, content and toxicity classification, as well as an analysis of inclusion and diversity.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
German-RAG-LLM-EASY-BENCHMARK
German-RAG - German Retrieval Augmented Generation
Dataset Summary
This German-RAG-LLM-BENCHMARK is a specialized collection for evaluating language models, with a focus on source citation and stating time differences in RAG-specific tasks. To evaluate models compatible with OpenAI endpoints, you can refer to our GitHub repo: https://github.com/avemio-digital/German-RAG-LLM-EASY-BENCHMARK/ Most of the subsets are synthetically… See the full description on the dataset page: https://huggingface.co/datasets/avemio/German-RAG-LLM-EASY-BENCHMARK.
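As a generic, non-authoritative sketch of what querying an OpenAI-compatible endpoint with a citation-focused RAG prompt might look like (this is not the benchmark's official harness; base_url, api_key, and the model name are placeholders):

```python
from openai import OpenAI

# Generic sketch: call an OpenAI-compatible endpoint with a German RAG-style
# prompt that asks for a source citation. base_url, api_key, and model name
# are placeholders; see the GitHub repo above for the actual evaluation setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

context = "Dok. A (2023-05-01): Die Antragsfrist endet am 30. Juni 2023."
question = "Wann endet die Antragsfrist? Bitte mit Quellenangabe antworten."

response = client.chat.completions.create(
    model="my-german-rag-model",
    messages=[
        {"role": "system", "content": "Beantworte nur anhand des Kontexts und nenne die Quelle."},
        {"role": "user", "content": f"Kontext:\n{context}\n\nFrage: {question}"},
    ],
)
print(response.choices[0].message.content)
```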