Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
MultiHopQA
This dataset contains the MultiHopQA data along with intermediate retrieval and generation steps, as well as final predictions generated in the paper Chain-of-Retrieval Augmented Generation.
Fields
The dataset includes the following fields for each data point:
query: The multi-hop question.
query_id: A unique identifier for the query.
answers: A list of correct answer(s) to the multi-hop question.
context_doc_ids: A list of document IDs retrieved by the…
See the full description on the dataset page: https://huggingface.co/datasets/corag/multihopqa.
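As a rough illustration of how the listed fields fit together, the following minimal sketch models one data point and checks a model prediction against the answers list. The field types (strings and lists of strings), the normalization rule, and the example record are all assumptions made for illustration; they are not taken from the dataset or the paper, and the fields elided in the description above are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiHopRecord:
    """One data point, limited to the fields named in the listing.

    The types are assumptions; the dataset page is the authority.
    """
    query: str
    query_id: str
    answers: List[str]
    context_doc_ids: List[str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic rule)."""
    return " ".join(text.lower().split())


def is_exact_match(prediction: str, record: MultiHopRecord) -> bool:
    """Return True if the prediction equals any listed answer after normalization."""
    return normalize(prediction) in {normalize(a) for a in record.answers}


# Invented example record; not taken from the dataset.
example = MultiHopRecord(
    query="Which country is the birthplace of the author of Example Book?",
    query_id="q-0001",
    answers=["France"],
    context_doc_ids=["doc-1", "doc-2"],
)
```

Because answers is a list, a prediction counts as correct if it matches any one entry, which is why the check uses set membership rather than a single string comparison.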
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
KILT Corpus
This dataset contains approximately 36 million Wikipedia passages from the "Multi-task retrieval for knowledge-intensive tasks" paper. It is also the retrieval corpus used in the paper Chain-of-Retrieval Augmented Generation.
Fields
id: A unique identifier for each passage.
title: The title of the Wikipedia page from which the passage originates.
contents: The textual content of the passage.
wikipedia_id: The unique identifier for the Wikipedia page, used for…
See the full description on the dataset page: https://huggingface.co/datasets/corag/kilt-corpus.
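Since this corpus is described as the retrieval corpus for the Chain-of-Retrieval Augmented Generation paper, a typical use is looking up passages by identifier. The sketch below builds an in-memory index over a handful of passages and resolves a list of document IDs to their text. It assumes that the document IDs in question correspond to the id field here and that all values are strings; both are assumptions, and the helper names build_index and resolve are hypothetical. The sample passages are invented.

```python
from typing import Dict, Iterable, List

# A passage is represented as a plain dict with the fields named in the listing.
Passage = Dict[str, str]


def build_index(passages: Iterable[Passage]) -> Dict[str, Passage]:
    """Map each passage's id to the passage itself (hypothetical helper)."""
    return {p["id"]: p for p in passages}


def resolve(doc_ids: List[str], index: Dict[str, Passage]) -> List[Passage]:
    """Return the passages for the given IDs, skipping IDs not in the index."""
    return [index[d] for d in doc_ids if d in index]


# Invented sample passages; not taken from the corpus.
sample_passages = [
    {"id": "1", "title": "Example Page A", "contents": "First passage text.", "wikipedia_id": "100"},
    {"id": "2", "title": "Example Page A", "contents": "Second passage text.", "wikipedia_id": "100"},
    {"id": "3", "title": "Example Page B", "contents": "Another passage.", "wikipedia_id": "200"},
]

index = build_index(sample_passages)
hits = resolve(["2", "9"], index)  # "9" is absent and is silently skipped
```

A dict is adequate only for small samples; with roughly 36 million passages, an on-disk store or a streaming approach would likely be needed instead.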
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset name:
AeroEngQA
Description:
AeroEngQA is a low-volume, high-quality benchmark aircraft design Question Answer (QA) dataset to support qualitative evaluation of Large Language Models (LLMs).
Dataset DOI:
10.5281/zenodo.14215677
Paper citation:
Silva, E.A., Marsh, R., Yong, H.K., Middleton, S.E., Sóbester, A. Retrieval-Augmented Generation and In-Context Prompted Large Language Models in Aircraft Engineering, AIAA-2025, AIAA, doi:10.2514/6.2025-0700
Abstract:
With the aerospace industry taking its first steps towards exploiting the rapidly evolving technology of Large Language Models (LLMs), this study explores the potential of the latest generation of LLMs to become an effective link in the aircraft design tool chain of the future. Our focus is on the task of Question Answering (QA) in engineering, which has the potential to augment future aircraft design team meetings with an intelligent LLM-based agent able to engage with the team via a chatbot interface. We compare three of the most effective and popular classes of LLM QA prompting today – LLM zero-shot prompting, LLM in-context prompting and LLM-based Retrieval-Augmented Generation (RAG). We describe a new, low volume, high quality benchmark aircraft design QA dataset (AeroEngQA) and use it to qualitatively evaluate each class of LLM, exploring properties including answer accuracy and answer simplicity. We provide domain-specific insights into the usefulness of today’s LLMs for engineering design tasks such as aircraft design, and a view on how this might evolve in the future as the next generation of LLMs emerges.
Acknowledgements:
The DAWS 2 (Development of Advanced Wing Solutions 2) project is supported by the ATI Programme, a joint Government and industry investment to maintain and grow the UK’s competitive position in civil aerospace design and manufacture. The programme, delivered through a partnership between the Aerospace Technology Institute (ATI), Department for Business, Energy & Industrial Strategy (BEIS) and Innovate UK, addresses technology, capability and supply chain challenges.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the original data used for processing in the manuscript "A Comparative Study on Retrieval-Augmented Generation and Chain-of-Thought Applications for LLM-Assisted Engineering Design Ideation".
Retrieval-Augmented Generation (RAG) Dataset 12000
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI and released under the Apache License 2.0.
Dataset Description
Dataset Summary
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach significantly… See the full description on the dataset page: https://huggingface.co/datasets/chloedh0228/rag-dataset-12000.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Datahub for Graph-KV
This directory contains processed datasets for retrieval-augmented generation (RAG) and Arxiv-QA tasks, used in the paper Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models. It is organized into two main folders, rag and arxiv, along with results.
📁 rag/
This folder includes preprocessed data for several commonly used RAG datasets. Each subdirectory corresponds to a different dataset split or benchmark:
2wiki_dev:… See the full description on the dataset page: https://huggingface.co/datasets/Graph-COM/GraphKV.
LongInter Dataset
Introduction
LongInter is the first large-scale dataset focused on long-term human-human interactions. We collect high-quality 3D motion sequences by retrieving and transitioning existing short motions using retrieval-augmented generation and transition inference strategies. We apply rigorous filtering criteria to ensure motion realism and consistency. Additionally, we provide rich, extended textual annotations by summarizing short-sequence captions using… See the full description on the dataset page: https://huggingface.co/datasets/LongInterDataset/LongInterSample.