88 datasets found
  1. Rag Instruct Benchmark Tester

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Rag Instruct Benchmark Tester [Dataset]. https://www.kaggle.com/datasets/thedevastator/rag-financial-legal-evaluation-dataset
    Explore at:
    Available download formats: zip (33777 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Rag Instruct Benchmark Tester

    200 Samples for Enterprise Core Q&A Tasks

    By Huggingface Hub [source]

    About this dataset

    This RAG: Financial & Legal Retrieval-Augmented-Generation Benchmark Evaluation Dataset gives professionals in the legal and financial industries a way to assess the latest retrieval-augmented generation (RAG) technology. With 200 diverse samples, each pairing a relevant context passage with a related question, it is an assessment tool for measuring the capabilities of RAG systems across enterprise use cases, whether you are looking to optimize core Q&A, classify Not Found topics, apply Boolean yes/no reasoning, work through math questions, explore complex Q&A, or summarize core principles. Built from realistic questions and context passages, it serves as a benchmark for advanced techniques across legal and financial services, giving decision-makers insight into retrieval-augmented generation technology.

    How to use the dataset

    • Explore the dataset by examining the columns described above (query, answer, sample_number, and tokens) and the category assigned to each sample.
    • Form a hypothesis around a sample question from a category you want to study, and write research questions that draw on these columns together with any external data relevant to your needs.
    • Note the limitations and assumptions of this dataset (and of any external sources you combine it with) before drawing conclusions, and double-check your results against reliable references.
    • Apply standard statistical tools such as correlation coefficients (r), linear regression (slope/intercept), and scatter plots or other visualizations as your research questions require, and keep records of all evidence for future reference.
    • Refine your questions into an experimental setup, break larger questions into smaller measurable subtasks, review failed experiments for analysis errors or outlier effects, and iterate, linking new findings back to the original questions. (A minimal loading sketch follows this list.)
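
    As a quick way to start exploring, the sketch below loads the dataset with pandas and inspects the columns named above. The file name rag_instruct_benchmark_tester.csv and the column name category are assumptions; adjust them to match the CSV inside the Kaggle download.

    # Minimal exploration sketch; file and column names are assumed, not confirmed.
    import pandas as pd

    df = pd.read_csv("rag_instruct_benchmark_tester.csv")

    print(df.columns.tolist())                 # expect query, answer, sample_number, tokens, category
    print(df["category"].value_counts())       # number of samples per task category
    print(df["tokens"].describe())             # rough size distribution of the samples
    print(df.sample(3)[["query", "answer"]])   # eyeball a few question-answer pairs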

    Research Ideas

    • Utilizing the tokens to create a sophisticated text-summarization network for automatic summarization of legal documents.
    • Training models to recognize problems for which there may not be established answers or solutions yet, and to estimate future outcomes based on data trends and patterns with machine learning algorithms.
    • Analyzing the dataset to determine keywords, common topics, or key issues related to financial and legal services that can be used in enterprise decision-making operations.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Unive...

  2. nexa-rag-benchmark

    • huggingface.co
    Updated Mar 11, 2025
    Cite
    zhanx (2025). nexa-rag-benchmark [Dataset]. https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2025
    Authors
    zhanx
    Description

    nexa-rag-benchmark

    The Nexa RAG Benchmark dataset is designed for evaluating Retrieval-Augmented Generation (RAG) models across multiple question-answering benchmarks. It includes a variety of datasets covering different domains. For evaluation, you can use the repository: 🔗 Nexa RAG Benchmark on GitHub

      Dataset Structure
    

    This benchmark integrates multiple datasets suitable for RAG performance. You can choose datasets based on context size, number of examples, or… See the full description on the dataset page: https://huggingface.co/datasets/zhanxxx/nexa-rag-benchmark.

  3. SciRAG-QA: Multi-domain Closed-Question Benchmark Dataset for Scientific QA

    • zenodo.org
    • data-staging.niaid.nih.gov
    bin, csv, json
    Updated Dec 15, 2024
    Cite
    Mahira Ibnath Joytu; Md Raisul Kibria; Sébastien Lafond (2024). SciRAG-QA: Multi-domain Closed-Question Benchmark Dataset for Scientific QA [Dataset]. http://doi.org/10.5281/zenodo.14390011
    Explore at:
    Available download formats: csv, bin, json
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahira Ibnath Joytu; Md Raisul Kibria; Sébastien Lafond
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 2024
    Description

    In recent times, one of the most impactful applications of the growing capabilities of Large Language Models (LLMs) has been their use in Retrieval-Augmented Generation (RAG) systems. RAG applications are inherently more robust against LLM hallucinations and provide source traceability, which holds critical importance in the scientific reading and writing process. However, validating such systems is essential due to the stringent systematic requirements of the scientific domain. Existing benchmark datasets are limited in the scope of research areas they cover, often focusing on the natural sciences, which restricts their applicability and validation across other scientific fields.

    To address this gap, we present a closed-question answering (QA) dataset for benchmarking scientific RAG applications. This dataset spans 34 research topics across 10 distinct areas of study. It includes 108 manually curated question-answer pairs, each annotated with answer type, difficulty level, and a gold reference along with a link to the source paper. Further details on each of these attributes can be found in the accompanying README.md file.
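
    Once the files are downloaded from Zenodo, the annotations described above can be inspected along the lines of the sketch below. The file name and the column names (question, answer, answer_type, difficulty, gold_reference, source_link) are assumptions based on this description; the actual schema is documented in the accompanying README.md.

    # Sketch only: file and column names are inferred from the description, not the actual schema.
    import pandas as pd

    qa = pd.read_csv("scirag_qa.csv")                        # hypothetical file name
    print(len(qa))                                           # expect 108 question-answer pairs
    print(qa.groupby(["difficulty", "answer_type"]).size())  # distribution of the annotations
    print(qa[["question", "gold_reference", "source_link"]].head())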

    Please cite the following publication when using the dataset: TBD

    The publication is available at: TBD

    A preprint version of the publication is available at: TBD

  4. Finance GraphRAG Benchmark Dataset

    • cubig.ai
    zip
    Updated Aug 25, 2025
    Cite
    CUBIG (2025). Finance GraphRAG Benchmark Dataset [Dataset]. https://cubig.ai/store/products/589/finance-graphrag-benchmark-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description
    1. Data Introduction: This dataset is designed to evaluate the reasoning ability of RAG-based models on complex financial regulatory questions.
    2. Utilization: Ideal for benchmarking Naive RAG vs. Graph RAG in compliance-heavy financial QA.
  5. open_ragbench

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Vectara (2025). open_ragbench [Dataset]. https://huggingface.co/datasets/vectara/open_ragbench
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Vectara
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Open RAG Benchmark

    The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes pure PDF content, meticulously extracting and generating queries on diverse modalities including text, tables, and images, even when they are intricately interwoven within a… See the full description on the dataset page: https://huggingface.co/datasets/vectara/open_ragbench.

  6. Healthcare GraphRAG Benchmark Dataset

    • cubig.ai
    zip
    Updated Aug 25, 2025
    Cite
    CUBIG (2025). Healthcare GraphRAG Benchmark Dataset [Dataset]. https://cubig.ai/store/products/590/healthcare-graphrag-benchmark-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description
    1. Data Introduction: This dataset is designed to evaluate RAG-based models on healthcare and public health questions spanning prevention, screening, vaccination, genetics, and environmental health.
    2. Utilization: Ideal for benchmarking Naive RAG vs. Graph RAG architectures in healthcare QA tasks, including medical reasoning, prevention strategies, and health education contexts.
  7. rag-benchmark-dataset

    • huggingface.co
    Cite
    Onepane.ai, rag-benchmark-dataset [Dataset]. https://huggingface.co/datasets/onepaneai/rag-benchmark-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Onepane.ai
    Description

    The onepaneai/rag-benchmark-dataset dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  8. Public sector GraphRAG Benchmark Dataset

    • cubig.ai
    zip
    Updated Sep 9, 2025
    Cite
    CUBIG (2025). Public sector GraphRAG Benchmark Dataset [Dataset]. https://cubig.ai/store/products/593/public-sector-graphrag-benchmark-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 9, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description
    1. Data Introduction: This dataset evaluates RAG-based models on questions derived from U.S. government reports, public laws, and assistive technology program documentation. It emphasizes reasoning over policy provisions, regulatory authority, program support, and statistical reporting.
    2. Utilization: Ideal for benchmarking Naive RAG vs. Graph RAG architectures in QA tasks. Supports evaluation of reasoning over legislative texts, federal program reports, and statistical summaries of government activities.
  9. RAG Evaluation Platform Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). RAG Evaluation Platform Market Research Report 2033 [Dataset]. https://dataintelo.com/report/rag-evaluation-platform-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    RAG Evaluation Platform Market Outlook



    According to our latest research, the global RAG Evaluation Platform market size reached USD 1.42 billion in 2024 and is expected to grow at a robust CAGR of 18.7% during the forecast period, reaching USD 6.86 billion by 2033. This significant growth is driven by the increasing adoption of Retrieval-Augmented Generation (RAG) systems across industries seeking to enhance the accuracy and reliability of generative AI outputs. The surge in demand for advanced evaluation platforms stems from the critical need to ensure trustworthy, explainable, and compliant AI-generated content in enterprise environments.



    A primary growth factor for the RAG Evaluation Platform market is the rapid proliferation of generative AI applications in sectors such as healthcare, finance, and retail. As organizations increasingly deploy RAG models to leverage external knowledge bases for more contextually accurate outputs, the demand for comprehensive evaluation platforms has soared. These platforms play a vital role in monitoring, benchmarking, and optimizing the performance of RAG systems, ensuring generated content meets stringent industry standards for accuracy, safety, and compliance. Furthermore, the integration of RAG evaluation tools with existing enterprise workflows is becoming a strategic imperative, driven by the need to manage risks associated with AI adoption and to maximize return on investment.



    Another significant driver is the evolving regulatory landscape around AI and data privacy. Governments and industry bodies worldwide are introducing new guidelines to ensure the responsible use of AI, particularly in sectors handling sensitive data such as healthcare and financial services. The RAG Evaluation Platform market is witnessing increased traction as organizations seek solutions that can automate compliance checks, provide audit trails, and deliver transparent evaluation metrics. This regulatory push is compelling enterprises to invest in robust evaluation platforms that not only validate model performance but also offer explainability and traceability, which are crucial for meeting legal and ethical obligations.



    The market is also being propelled by advancements in AI infrastructure and cloud computing. The scalability and flexibility offered by cloud-based RAG evaluation platforms are enabling organizations of all sizes to experiment with and deploy sophisticated AI models without heavy upfront investments in hardware. This democratization of AI technology is fostering innovation across industries, with small and medium enterprises (SMEs) now able to access the same high-quality evaluation tools as large enterprises. Additionally, the growing ecosystem of AI service providers and open-source evaluation frameworks is lowering barriers to entry, accelerating the adoption of RAG evaluation platforms globally.



    From a regional perspective, North America continues to dominate the RAG Evaluation Platform market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The presence of leading AI technology companies, a mature digital infrastructure, and a strong focus on research and development are key factors underpinning North America’s leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing region, driven by rapid digital transformation, expanding investments in AI research, and supportive government initiatives. Europe, on the other hand, is characterized by its emphasis on regulatory compliance and ethical AI, which is spurring the adoption of advanced evaluation platforms, particularly in the healthcare and BFSI sectors.



    Component Analysis



    The Component segment of the RAG Evaluation Platform market is bifurcated into Software and Services, each playing a crucial role in the market’s growth trajectory. The software component encompasses a wide array of tools and platforms designed to automate the evaluation of RAG models, including performance benchmarking, bias detection, and explainability analytics. These software solutions are increasingly leveraging advanced algorithms to deliver real-time insights into model behavior, enabling organizations to quickly identify and rectify issues. The growing complexity of generative AI models necessitates sophisticated software platforms that can handle diverse data types, support multi-modal evaluations, and integrate seamlessly with existing AI pipelines.



  10. REAL-MM-RAG_TechSlides

    • huggingface.co
    Updated Mar 13, 2025
    + more versions
    Cite
    IBM Research (2025). REAL-MM-RAG_TechSlides [Dataset]. https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechSlides
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    IBM (http://ibm.com/)
    IBM Research
    Authors
    IBM Research
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    REAL-MM-RAG-Bench: A Real-World Multi-Modal Retrieval Benchmark

    We introduced REAL-MM-RAG-Bench, a real-world multi-modal retrieval benchmark designed to evaluate retrieval models in reliable, challenging, and realistic settings. The benchmark was constructed using an automated pipeline, where queries were generated by a vision-language model (VLM), filtered by a large language model (LLM), and rephrased by an LLM to ensure high-quality retrieval evaluation. To simulate real-world… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechSlides.

  11. CORAL-Conversational RAG

    • kaggle.com
    zip
    Updated Nov 26, 2024
    + more versions
    Cite
    Wayne_127 (2024). CORAL-Conversational RAG [Dataset]. https://www.kaggle.com/datasets/wayne127/coral-conversational-rag/data
    Explore at:
    Available download formats: zip (218185197 bytes)
    Dataset updated
    Nov 26, 2024
    Authors
    Wayne_127
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation. CORAL is a large-scale multi-turn conversational RAG benchmark that fulfills the critical features mentioned in our paper, built to systematically evaluate and advance conversational RAG systems. In CORAL, we evaluate conversational RAG systems across three essential tasks: (1) Conversational Passage Retrieval: assessing the system’s ability to retrieve the relevant information from a large document set based on multi-turn context; (2) Response Generation: evaluating the system’s capacity to generate accurate, contextually rich answers; (3) Citation Labeling: ensuring that the generated responses are transparent and grounded by requiring correct attribution of sources.

    For more information, please view our GitHub repo and paper:

    GitHub repo: https://github.com/Ariya12138/CORAL

    Paper link: CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation.
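
    For the first of these tasks, retrieval runs are typically scored against relevance labels. The sketch below computes a simple Recall@k over hypothetical run and qrels dictionaries; it illustrates the kind of measurement involved and is not CORAL's official evaluation script (see the GitHub repo above for that).

    # Illustrative Recall@k for conversational passage retrieval (not the official CORAL scorer).
    def recall_at_k(run, qrels, k=10):
        """run: turn_id -> ranked list of passage ids; qrels: turn_id -> set of relevant passage ids."""
        scores = []
        for turn_id, relevant in qrels.items():
            if not relevant:
                continue
            retrieved = set(run.get(turn_id, [])[:k])
            scores.append(len(retrieved & relevant) / len(relevant))
        return sum(scores) / len(scores) if scores else 0.0

    # Hypothetical example: one turn with two relevant passages, one retrieved in the top 3.
    print(recall_at_k({"turn_1": ["p3", "p7", "p1"]}, {"turn_1": {"p1", "p9"}}, k=3))  # 0.5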

  12. RAG-RewardBench

    • huggingface.co
    Updated Dec 18, 2024
    Cite
    Zhuoran Jin (2024). RAG-RewardBench [Dataset]. https://huggingface.co/datasets/jinzhuoran/RAG-RewardBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2024
    Authors
    Zhuoran Jin
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This repository contains the data presented in RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. Code: https://github.com/jinzhuoran/RAG-RewardBench/

  13. Full BLUFF-1000 dataset and evaluation scripts

    • figshare.com
    txt
    Updated Oct 20, 2025
    Cite
    Ron Zharzhavsky (2025). Full BLUFF-1000 dataset and evaluation scripts [Dataset]. http://doi.org/10.6084/m9.figshare.30397369.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 20, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Ron Zharzhavsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Retrieval-augmented generation (RAG) systems often fail to adequately modulate their linguistic certainty when evidence deteriorates. This gap in how models respond to imperfect retrieval is critical for the safety and reliability of real-world RAG systems. To address it, we propose BLUFF-1000, a benchmark systematically designed to evaluate how large language models (LLMs) manage linguistic confidence under conflicting evidence that simulates poor retrieval. We created a novel dataset, introduced two novel metrics, computed a full set of metrics quantifying faithfulness, factuality, linguistic uncertainty, and calibration, and conducted experiments with 7 LLMs on the benchmark, measuring their uncertainty awareness and general performance. Our findings uncover a fundamental misalignment between linguistic expression of uncertainty and source quality across seven state-of-the-art RAG systems. We recommend that future RAG systems incorporate uncertainty-aware methods to transparently convey confidence throughout the system.
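
    The two novel metrics introduced by BLUFF-1000 are not specified in this summary. As a generic illustration of the calibration component, the sketch below computes expected calibration error (ECE) over hypothetical confidence/correctness pairs; it is not one of the benchmark's own metrics.

    # Generic expected calibration error (ECE) sketch, with hypothetical inputs.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap  # weight each bin by its share of samples
        return ece

    # Stated confidence vs. whether the answer was actually correct (placeholder values).
    print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))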

  14. Single-Topic RAG Evaluation Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2025
    Cite
    Samuel Matsuo Harris (2025). Single-Topic RAG Evaluation Dataset [Dataset]. https://www.kaggle.com/samuelmatsuoharris/single-topic-rag-evaluation-dataset
    Explore at:
    Available download formats: zip (268838 bytes)
    Dataset updated
    Apr 18, 2025
    Authors
    Samuel Matsuo Harris
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    What is this dataset?

    This dataset was designed to evaluate the performance of RAG AI systems when querying text documents about a single topic, with word counts ranging from a few thousand to a few tens of thousands, such as articles, blogs, and documentation. The sources were intentionally chosen to have been produced within the last few years (from the time of writing in July 2024) and to be relatively niche, to reduce the chance of evaluated LLMs having included this information in their training datasets.

    There are 120 question-answer pairs in this dataset.

    In this dataset, there are:
    - 40 questions that do not have an answer within the document.
    - 40 question-answer pairs whose answer must be generated from a single passage of the document.
    - 40 question-answer pairs whose answer must be generated from multiple passages of the document.

    The answers to the questions with no answer within the text are intended to be some variation of "I do not know". The exact expected answer can be decided by the user of this dataset.

    This dataset consists of 20 text documents with 6 question-answer pairs per document. For each document:
    - 2 questions do not have an answer within the text.
    - 2 questions have an answer that must be generated from a single passage of the document.
    - 2 questions have an answer that must be generated from multiple passages of the document.
    (A quick sanity check of this layout is sketched below.)
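
    A minimal sanity check of that layout, assuming the Kaggle download unpacks to a single CSV; the file name and the column names document and question_type used below are assumptions.

    # Sanity-check sketch; file and column names are assumed.
    import pandas as pd

    qa = pd.read_csv("single_topic_rag_eval.csv")            # hypothetical file name
    assert len(qa) == 120                                    # 20 documents x 6 pairs
    print(qa["document"].nunique())                          # expect 20 documents
    print(qa["question_type"].value_counts())                # expect 40 / 40 / 40
    print(qa.groupby("document")["question_type"].value_counts().unstack())  # 2 of each type per document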

    Why was this dataset created?

    This dataset was created for my STICI-note AI, which you can read about in my blog here; the code for it can be found here. I created this dataset because I could not find an existing dataset that could properly evaluate my RAG system. The RAG evaluation datasets I found would either evaluate a RAG system on text chunks spanning many varied topics (from marine biology to history), evaluate only the retriever in the RAG system, or use Wikipedia as the source. The variability in topics was an issue because my RAG system was intended to answer queries on text documents that are entirely about a single topic, such as documentation for a repo or notes made about a subject the user is learning. I wanted to evaluate my AI system as a whole rather than just the retriever, which made datasets that only test whether the correct chunk of text was fetched irrelevant to my use case. Wikipedia being the source was an issue because Wikipedia is used to train most LLMs, making data leakage a serious concern when using pre-trained models as I was.

  15. Legal GraphRAG Benchmark Dataset

    • cubig.ai
    zip
    Updated Sep 25, 2025
    Cite
    CUBIG (2025). Legal GraphRAG Benchmark Dataset [Dataset]. https://cubig.ai/store/products/597/legal-graphrag-benchmark-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description
    1. Data Introduction: This dataset evaluates RAG-based models on legal and regulatory QA pairs derived from case law, state statutes, medical-legal protocols, and federal agency responsibilities. It emphasizes reasoning across voting rights, criminal justice, forensic evidence, and healthcare-related legal processes.
    2. Utilization: Ideal for benchmarking Naive RAG vs. Graph RAG architectures in QA tasks. Supports evaluation of reasoning over legal precedents, regulatory frameworks, institutional responsibilities, and the integration of law with healthcare and forensic practice.
  16. RAG-Benchmark-in-LIB

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Xianghao Kong (2025). RAG-Benchmark-in-LIB [Dataset]. https://huggingface.co/datasets/Kong1020/RAG-Benchmark-in-LIB
    Explore at:
    Dataset updated
    Nov 21, 2025
    Authors
    Xianghao Kong
    Description

    Battery Thermal Safety RAG Benchmark

    This dataset is a domain-specific RAG benchmark for evaluating retrieval-augmented question answering systems in the lithium-ion battery thermal safety domain.

      Contents
    

    3,000+ curated QA pairs from battery safety literature, standards, and reports. Each query includes:
    - Question
    - Ground-truth answer
    - Positive context chunk(s)
    - Negative distractor chunks
    - Metadata (source, section, chunk_id)
    (A hypothetical example record is sketched below.)
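
    A hypothetical example record with field names derived from that list; the actual keys on the Hugging Face dataset may differ and should be checked against the dataset card.

    # Hypothetical record shape inferred from the field list above (actual keys may differ).
    example = {
        "question": "Example question about lithium-ion battery thermal safety.",
        "answer": "Ground-truth answer text drawn from the source document.",
        "positive_contexts": ["Relevant chunk supporting the answer."],
        "negative_contexts": ["Distractor chunk from an unrelated section."],
        "metadata": {"source": "standard-or-paper-id", "section": "4.2", "chunk_id": "doc12_c045"},
    }
    print(example["metadata"]["chunk_id"])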

      Dataset Structure
    

    The dataset… See the full description on the dataset page: https://huggingface.co/datasets/Kong1020/RAG-Benchmark-in-LIB.

  17. Manufacturing GraphRAG Benchmark Dataset

    • cubig.ai
    zip
    Updated Oct 26, 2025
    Cite
    CUBIG (2025). Manufacturing GraphRAG Benchmark Dataset [Dataset]. https://cubig.ai/store/products/598/manufacturing-graphrag-benchmark-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 26, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description
    1. Data Introduction: This dataset evaluates retrieval-augmented generation (RAG) models on complex QA tasks drawn from advanced manufacturing, sustainability, and AI integration domains. It incorporates references to NIST Advanced Manufacturing Series (AMS) reports, Circular Economy frameworks, additive manufacturing, investment analysis (NPV, IRR), and workforce development initiatives supported by U.S. federal agencies such as NIST and NSF.
    2. Utilization: Designed for benchmarking Naive RAG vs. Graph RAG models in industrial, sustainability, and policy-oriented question answering. Supports evaluation of reasoning across technical standards (ISO, ASTM, IEC), government initiatives (NIST AMS 100-47, AMS 500-1, AMS 100-63), and sustainability transitions (Circular Economy, Bioeconomy, Net Zero Manufacturing).
  18. ragbench

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    Galileo (2024). ragbench [Dataset]. https://huggingface.co/datasets/galileo-ai/ragbench
    Explore at:
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    Galileo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RAGBench

      Dataset Overview
    

    RAGBench is a large-scale RAG benchmark dataset of 100k RAG examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. RAGBench comprises 12 sub-component datasets, each split into train/validation/test splits.

      Usage
    

    from datasets import load_dataset

    load… See the full description on the dataset page: https://huggingface.co/datasets/galileo-ai/ragbench.
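
    The usage snippet on the dataset card is truncated above. A plausible completion is sketched below; the configuration name "hotpotqa" is an assumption standing in for one of the 12 sub-component datasets, whose exact names should be checked on the dataset card.

    # Sketch only: the config name is assumed; check the card for the 12 sub-component names.
    from datasets import load_dataset

    ragbench_subset = load_dataset("galileo-ai/ragbench", "hotpotqa")
    print(ragbench_subset)                     # expect train/validation/test splits
    print(ragbench_subset["train"][0].keys())  # inspect the example fields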

  19. Retrieval-Augmented Generation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Retrieval-Augmented Generation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/retrieval-augmented-generation-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Retrieval-Augmented Generation (RAG) Market Outlook



    According to our latest research, the global Retrieval-Augmented Generation (RAG) market size reached USD 1.47 billion in 2024. The market is witnessing robust momentum, driven by rapid enterprise adoption and technological advancements, and is projected to expand at a CAGR of 27.8% during the forecast period. By 2033, the RAG market is forecasted to attain a value of USD 13.2 billion, underlining its transformative impact across multiple industries. The key growth factor fueling this surge is the increasing demand for contextually accurate and explainable AI solutions, particularly in knowledge-intensive sectors.




    The exponential growth of the Retrieval-Augmented Generation market is primarily attributed to the mounting necessity for advanced AI models that can deliver more precise, context-aware, and reliable outputs. Unlike traditional generative AI, RAG systems integrate retrieval mechanisms that allow access to vast external databases or proprietary knowledge bases, thus enhancing the factual accuracy of generated content. This is especially crucial for enterprises in sectors such as healthcare, finance, and legal, where the veracity and traceability of AI-generated information are non-negotiable. Furthermore, the proliferation of unstructured data within organizations has accelerated the deployment of RAG models, as they offer a scalable solution for extracting actionable insights from disparate data sources.




    Another significant growth driver for the RAG market is the rapid evolution of AI infrastructure and the increasing sophistication of natural language processing (NLP) technologies. The integration of RAG architectures with large language models (LLMs) such as GPT-4 and beyond has enabled organizations to unlock new capabilities in content generation, question answering, and document summarization. These advancements are further supported by the rise of open-source RAG frameworks and the availability of pre-trained models, which lower the entry barriers for enterprises of varying sizes. The ongoing investments in AI research and the collaboration between technology providers and industry verticals are expected to further catalyze market growth over the next decade.




    The expanding role of RAG solutions in enhancing customer experiences and operational efficiencies across industries is another pivotal factor contributing to market expansion. In sectors like retail and e-commerce, RAG-powered chatbots and virtual assistants are revolutionizing customer support by providing accurate, up-to-date responses sourced from real-time databases. Similarly, in the media and entertainment industry, RAG technologies are being leveraged for content personalization, automated news generation, and fact-checking, thereby streamlining editorial workflows. As enterprises increasingly recognize the value of explainable AI, the adoption of RAG solutions is expected to witness sustained acceleration globally.



    To effectively harness the capabilities of RAG systems, organizations are increasingly turning to RAG Evaluation Tools. These tools are essential in assessing the performance and reliability of RAG models, ensuring that they meet the specific needs of various industries. By providing metrics and benchmarks, RAG Evaluation Tools enable enterprises to fine-tune their models for optimal accuracy and efficiency. This is particularly important in sectors like finance and healthcare, where precision and reliability are paramount. As the demand for explainable AI grows, these evaluation tools play a crucial role in validating the outputs of RAG systems, thereby enhancing trust and adoption across different verticals.




    Regionally, North America continues to dominate the Retrieval-Augmented Generation market, accounting for the largest revenue share in 2024, driven by the presence of leading AI innovators, robust digital infrastructure, and high enterprise readiness. However, the Asia Pacific region is emerging as a formidable growth engine, supported by rapid digital transformation initiatives, rising investments in AI, and the proliferation of data-centric business models. Europe also presents significant opportunities, particularly in regulated industries that demand transparent and auditable AI systems. Latin America and the Middle East & Africa are gradually c

  20. Large Language Models in Materials Science: Evaluating RAG Performance in Graphene Synthesis Using RAGAS

    • data.mendeley.com
    Updated Sep 2, 2025
    Cite
    Zen Han Cho (2025). Large Language Models in Materials Science: Evaluating RAG Performance in Graphene Synthesis Using RAGAS [Dataset]. http://doi.org/10.17632/ry7phxn4js.2
    Explore at:
    Dataset updated
    Sep 2, 2025
    Authors
    Zen Han Cho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Retrieval-Augmented Generation (RAG) systems increasingly support scientific research, yet evaluating their performance in specialized domains remains challenging due to the technical complexity and precision requirements of scientific knowledge. This study presents the first systematic analysis of automated evaluation frameworks for scientific RAG systems, focusing on the RAGAS framework applied to RAG-augmented large language models in materials science, with graphene synthesis as a representative case study. We develop a comprehensive evaluation protocol comparing four assessment approaches: RAGAS (an automated RAG evaluation framework), BERTScore, LLM-as-a-Judge, and expert human evaluation across 20 domain-specific questions. Our analysis reveals that while automated metrics can capture relative performance improvements from retrieval augmentation, they exhibit fundamental limitations in absolute score interpretation for scientific content. RAGAS successfully identified performance gains in RAG-augmented systems (0.52-point improvement for Gemini, 1.03-point for Qwen on a 10-point scale), demonstrating particular sensitivity as well as retrieval benefits for smaller, open-source models. These findings establish methodological guidelines for scientific RAG evaluation and highlight critical considerations for researchers deploying AI systems in specialized domains.
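
    Of the automated metrics compared in this study, BERTScore is the most straightforward to reproduce on a handful of answers. The sketch below shows the usual pattern with the bert-score package; the candidate and reference strings are placeholders, not items from this dataset.

    # BERTScore sketch with placeholder strings (not actual items from this dataset).
    from bert_score import score

    candidates = ["Generated answer about a graphene synthesis condition."]
    references = ["Expert reference answer about the same synthesis condition."]

    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print(f"BERTScore F1: {F1.mean().item():.3f}")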
