Dataset Card for "livebench/data_analysis"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be… See the full description on the dataset page: https://huggingface.co/datasets/livebench/data_analysis.
Dataset Card for "livebench/model_answer"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be… See the full description on the dataset page: https://huggingface.co/datasets/livebench/model_answer.
Dataset Card for "livebench/instruction_following"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions… See the full description on the dataset page: https://huggingface.co/datasets/livebench/instruction_following.
Dataset Card for "livebench/reasoning"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/reasoning.
Dataset Card for "livebench/coding"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/coding.
Dataset Card for "livebench/model_judgment"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be… See the full description on the dataset page: https://huggingface.co/datasets/livebench/model_judgment.
Dataset Card for "livebench/math"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/math.
Dataset Card for "livebench/language"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored… See the full description on the dataset page: https://huggingface.co/datasets/livebench/language.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Dataset Card for "livebench/data_analysis"
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. Each question has verifiable, objective ground-truth answers, allowing hard questions to be… See the full description on the dataset page: https://huggingface.co/datasets/livebench/data_analysis.