Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
🌐 Project Page: https://longbench2.github.io 💻 Github Repo: https://github.com/THUDM/LongBench 📚 Arxiv Paper: https://arxiv.org/abs/2412.15204 LongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 has the following features: (1) Length: Context length ranging from 8k to… See the full description on the dataset page: https://huggingface.co/datasets/JamesBegin/LongBench-v2-Pause1.
Tongyi-Zhiwen/longbench dataset hosted on Hugging Face and contributed by the HF Datasets community
giulio98/LongBench-512 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
🌐 Project Page: https://longbench2.github.io 💻 Github Repo: https://github.com/THUDM/LongBench 📚 Arxiv Paper: https://arxiv.org/abs/2412.15204 LongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 has the following features: (1) Length: Context length ranging from 8k to… See the full description on the dataset page: https://huggingface.co/datasets/zai-org/LongBench-v2.
Although large language models (LLMs) demonstrate impressive performance on many language tasks, most can only handle texts a few thousand tokens long, limiting their application to longer inputs such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long-context capabilities through extended context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored to evaluating long-context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long-context understanding, enabling more rigorous evaluation. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas, including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. From a comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) the commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models, but still struggles on longer contexts; (2) scaled position embeddings and fine-tuning on longer sequences lead to substantial improvements in long-context understanding; (3) context compression techniques such as retrieval improve models that are weak on long contexts, but their performance still lags behind models with strong long-context understanding. The code and datasets are available at https://github.com/THUDM/LongBench.
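The unified record format and automatic QA evaluation mentioned above can be sketched as follows. The field names (`input`, `context`, `answers`, `dataset`, `language`) follow LongBench's documented schema; the token-level F1 scorer is a simplified illustration of its QA-style scoring, not the repository's actual evaluation code, and the sample record contents are invented.

```python
# Minimal sketch of a LongBench-style unified record and a token-overlap F1
# scorer for QA answers. Field names follow the unified format described
# above; the record contents and the scoring details are illustrative only.
from collections import Counter

# One hypothetical example in the unified format.
record = {
    "input": "Who founded the company described in the report?",
    "context": "...full long document text (thousands of words)...",
    "answers": ["Jane Doe"],
    "dataset": "hotpotqa",
    "language": "en",
}

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Score a prediction against the best-matching reference answer.
prediction = "The company was founded by Jane Doe"
score = max(token_f1(prediction, ans) for ans in record["answers"])
print(round(score, 3))  # prints 0.444
```

Because every dataset shares this schema, one scoring loop can iterate over all 21 tasks, dispatching on the `dataset` field to pick the appropriate metric.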
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
LongBench is the first benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. LongBench includes two languages (Chinese and English) to provide a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench comprises six major categories and twenty-one different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
LongBench is a comprehensive multilingual, multi-task benchmark, with the goal of fully measuring and evaluating the ability of pre-trained language models to understand long text. The dataset consists of twenty-one different tasks, covering key long-text application scenarios such as multi-document QA, single-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
LongBench-T2I
LongBench-T2I is a benchmark dataset introduced in the paper Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation. It is a standalone dataset designed specifically for evaluating text-to-image (T2I) generation models under long and compositionally rich prompts.
📦 Dataset Summary
This dataset contains 500 samples, each composed of:
A long-form instruction (complex natural language prompt). A… See the full description on the dataset page: https://huggingface.co/datasets/YCZhou/LongBench-T2I.
figuremout/LongBench-2k dataset hosted on Hugging Face and contributed by the HF Datasets community
minghuiliu/longbench dataset hosted on Hugging Face and contributed by the HF Datasets community
zwhe99/LongBench-v2-reformatted dataset hosted on Hugging Face and contributed by the HF Datasets community
xy21593/longbench-results dataset hosted on Hugging Face and contributed by the HF Datasets community
sheryc/LongBench-hotpotqa-with-evidence-label dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📚 Filtered Synthetic QA Dataset (L1 Questions, LongBench-v2)
This is a synthetic L1 QA dataset focused on simple, context-dependent questions.
Source documents are from LongBench-v2, covering Single/Multi-Document QA across Finance, Legal, and Government domains. Documents are split into 10k-token chunks. Each chunk is passed to DeepSeek-R1, which extracts/generates multiple L1-level QA pairs that are strictly grounded in that chunk. Questions target information retrieval: facts… See the full description on the dataset page: https://huggingface.co/datasets/mmilunovic/qna-l1-synthetic.
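The chunking step described above can be sketched as follows. This is a minimal illustration, not the dataset authors' pipeline: whitespace splitting stands in for a real tokenizer (a production pipeline would count tokens with the generator model's tokenizer), and the 10k limit is exposed as a parameter.

```python
# Sketch of splitting a long document into fixed-size token chunks, as in the
# pipeline described above. Whitespace tokens stand in for real tokenizer
# tokens for illustration purposes.
def chunk_document(text: str, max_tokens: int = 10_000) -> list[str]:
    """Split `text` into consecutive chunks of at most `max_tokens` tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

doc = " ".join(f"word{i}" for i in range(25_000))
chunks = chunk_document(doc)
print(len(chunks))  # 25,000 tokens at 10k per chunk -> prints 3
```

Each resulting chunk can then be sent to the QA-generation model independently, which is what keeps the generated questions strictly grounded in a single chunk.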
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Below is a structured, professional-tone description of the "QA Increasing Context Length" dataset. You can use this text as a README, a data card, or incorporate it directly into documentation.
QA Increasing Context Length Dataset
1. Overview
The QA Increasing Context Length dataset is designed to facilitate benchmarking and research on question-answering (QA) systems as the size of the input context grows. It compiles QA examples drawn from multiple LongBench subsets… See the full description on the dataset page: https://huggingface.co/datasets/slinusc/ContextStretchQA.
LongAlign-10k
🤗 [LongAlign Dataset] • 💻 [Github Repo] • 📃 [LongAlign Paper]
LongAlign is the first full recipe for LLM alignment on long contexts. We propose the LongAlign-10k dataset, containing 10,000 long instruction examples of 8k-64k tokens in length. We investigate training strategies, namely packing (with loss weighting) and sorted batching, all of which are implemented in our code. For real-world long-context evaluation, we introduce LongBench-Chat, which evaluates the… See the full description on the dataset page: https://huggingface.co/datasets/zai-org/LongAlign-10k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Marathon
Release
[2024/05/15] 🔥 Marathon is accepted by ACL 2024 Main Conference.
Dataset Summary
The Marathon benchmark is a new long-context multiple-choice benchmark, mainly based on LooGLE, with some original data from LongBench. The context length can reach 200K+ tokens. Marathon comprises six tasks: Comprehension and Reasoning, Multiple Information Retrieval, Timeline Reorder, Computation, Passage Retrieval, and Short… See the full description on the dataset page: https://huggingface.co/datasets/Lemoncoke/Marathon.