CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file includes our annotations of 238 dataset papers published at the NeurIPS Datasets and Benchmarks Track. A full report of our findings can be found at https://arxiv.org/abs/2411.00266
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subset of the benchmark dataset published in Luecken et al. (2021).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is associated with submission 1335 at the NeurIPS 2025 Datasets and Benchmarks Track. The benchmark is intended to be used with the proposed submission environments (see the source code). See the provided README for information about downloading the dataset and running the evaluations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RepLiQA - Repository of Likely Question-Answer for benchmarking
NeurIPS Datasets presentation
Dataset Summary
RepLiQA is an evaluation dataset that contains Context-Question-Answer triplets, where contexts are non-factual but natural-looking documents about made-up entities, such as people or places that do not exist in reality. RepLiQA is human-created and designed to test the ability of Large Language Models (LLMs) to find and use contextual information in provided… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/repliqa.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The gene expression portion of the NeurIPS 2021 challenge 10x multiome dataset (Luecken et al., NeurIPS datasets and benchmarks track 2021), originally obtained from GEO. Contains single-cell gene expression of 69,249 cells for 13,431 genes. The adata.X field contains normalized data and adata.layers['counts'] contains raw expression values. We computed a latent space using scANVI (Xu et al., MSB 2021), following their tutorial.
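For orientation, a minimal sketch of accessing the fields described above with the anndata library; the filename here is hypothetical (use the actual download):

import anndata as ad

# Load the AnnData object (filename is hypothetical)
adata = ad.read_h5ad("neurips2021_multiome_gex.h5ad")

print(adata.shape)               # expected: (69249, 13431), cells x genes
normalized = adata.X             # normalized expression values
counts = adata.layers["counts"]  # raw expression counts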
This is the official data repository of the Data-Centric Image Classification (DCIC) Benchmark. The goal of this benchmark is to measure the impact of tuning the dataset instead of the model for a variety of image classification datasets. Full details about the collection process, the structure, and automatic download are available at:
Paper: https://arxiv.org/abs/2207.06214
Source Code: https://github.com/Emprime/dcic
The license information is given below as part of the download.
Citation
Please cite as
@article{schmarje2022benchmark,
author = {Schmarje, Lars and Grossmann, Vasco and Zelenka, Claudius and Dippel, Sabine and Kiko, Rainer and Oszust, Mariusz and Pastell, Matti and Stracke, Jenny and Valros, Anna and Volkmann, Nina and Koch, Reinhard},
journal = {36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
title = {{Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation}},
year = {2022}
}
Please see the full details about the datasets used below; they should also be cited as part of the license.
@article{schoening2020Megafauna,
author = {Schoening, T and Purser, A and Langenk{\"{a}}mper, D and Suck, I and Taylor, J and Cuvelier, D and Lins, L and Simon-Lled{\'{o}}, E and Marcon, Y and Jones, D O B and Nattkemper, T and K{\"{o}}ser, K and Zurowietz, M and Greinert, J and Gomes-Pereira, J},
doi = {10.5194/bg-17-3115-2020},
journal = {Biogeosciences},
number = {12},
pages = {3115--3133},
title = {{Megafauna community assessment of polymetallic-nodule fields with cameras: platform and methodology comparison}},
volume = {17},
year = {2020}
}
@article{Langenkamper2020GearStudy,
author = {Langenk{\"{a}}mper, Daniel and van Kevelaer, Robin and Purser, Autun and Nattkemper, Tim W},
doi = {10.3389/fmars.2020.00506},
issn = {2296-7745},
journal = {Frontiers in Marine Science},
title = {{Gear-Induced Concept Drift in Marine Images and Its Effect on Deep Learning Classification}},
volume = {7},
year = {2020}
}
@article{peterson2019cifar10h,
author = {Peterson, Joshua and Battleday, Ruairidh and Griffiths, Thomas and Russakovsky, Olga},
doi = {10.1109/ICCV.2019.00971},
issn = {15505499},
journal = {Proceedings of the IEEE International Conference on Computer Vision},
pages = {9616--9625},
title = {{Human uncertainty makes classification more robust}},
volume = {2019-October},
year = {2019}
}
@article{schmarje2019,
author = {Schmarje, Lars and Zelenka, Claudius and Geisen, Ulf and Gl{\"{u}}er, Claus-C. and Koch, Reinhard},
doi = {10.1007/978-3-030-33676-9_26},
issn = {23318422},
journal = {DAGM German Conference on Pattern Recognition},
number = {November},
pages = {374--386},
publisher = {Springer},
title = {{2D and 3D Segmentation of uncertain local collagen fiber orientations in SHG microscopy}},
volume = {11824 LNCS},
year = {2019}
}
@article{schmarje2021foc,
author = {Schmarje, Lars and Br{\"{u}}nger, Johannes and Santarossa, Monty and Schr{\"{o}}der, Simon-Martin and Kiko, Rainer and Koch, Reinhard},
doi = {10.3390/s21196661},
issn = {1424-8220},
journal = {Sensors},
number = {19},
pages = {6661},
title = {{Fuzzy Overclustering: Semi-Supervised Classification of Fuzzy Labels with Overclustering and Inverse Cross-Entropy}},
volume = {21},
year = {2021}
}
@article{schmarje2022dc3,
author = {Schmarje, Lars and Santarossa, Monty and Schr{\"{o}}der, Simon-Martin and Zelenka, Claudius and Kiko, Rainer and Stracke, Jenny and Volkmann, Nina and Koch, Reinhard},
journal = {Proceedings of the European Conference on Computer Vision (ECCV)},
title = {{A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering}},
year = {2022}
}
@article{obuchowicz2020qualityMRI,
author = {Obuchowicz, Rafal and Oszust, Mariusz and Piorkowski, Adam},
doi = {10.1186/s12880-020-00505-z},
issn = {1471-2342},
journal = {BMC Medical Imaging},
number = {1},
pages = {109},
title = {{Interobserver variability in quality assessment of magnetic resonance images}},
volume = {20},
year = {2020}
}
@article{stepien2021cnnQuality,
author = {St{\c{e}}pie{\'{n}}, Igor and Obuchowicz, Rafa{\l} and Pi{\'{o}}rkowski, Adam and Oszust, Mariusz},
doi = {10.3390/s21041043},
issn = {1424-8220},
journal = {Sensors},
number = {4},
title = {{Fusion of Deep Convolutional Neural Networks for No-Reference Magnetic Resonance Image Quality Assessment}},
volume = {21},
year = {2021}
}
@article{volkmann2021turkeys,
author = {Volkmann, Nina and Br{\"{u}}nger, Johannes and Stracke, Jenny and Zelenka, Claudius and Koch, Reinhard and Kemper, Nicole and Spindler, Birgit},
doi = {10.3390/ani11092655},
journal = {Animals},
pages = {1--13},
title = {{Learn to train: Improving training data for a neural network to detect pecking injuries in turkeys}},
volume = {11},
year = {2021}
}
@article{volkmann2022keypoint,
author = {Volkmann, Nina and Zelenka, Claudius and Devaraju, Archana Malavalli and Br{\"{u}}nger, Johannes and Stracke, Jenny and Spindler, Birgit and Kemper, Nicole and Koch, Reinhard},
doi = {10.3390/s22145188},
issn = {1424-8220},
journal = {Sensors},
number = {14},
pages = {5188},
title = {{Keypoint Detection for Injury Identification during Turkey Husbandry Using Neural Networks}},
volume = {22},
year = {2022}
}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the benchmark associated with submission 1331 at the NeurIPS 2025 Datasets and Benchmarks Track. The benchmark is intended to be used with the proposed submission environments (see the source code). The .jsonl files do not contain proper image paths but rather image path templates, as each .jsonl entry is a sample, and each sample corresponds to a different environment with its own images. See the submitted code README for information about dataset downloading and preprocessing, and to… See the full description on the dataset page: https://huggingface.co/datasets/submission1331/ProactiveBench.
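As a rough illustration of resolving such path templates (the field names below are hypothetical; the real schema is documented in the submission README):

import json

# Field names are hypothetical, for illustration only
with open("samples.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        # Substitute the per-sample environment into the image path template
        image_path = sample["image_path_template"].format(env=sample["env_id"])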
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset for "A Benchmark for Antimicrobial Peptide Recognition Based on Structure and Sequence Representation" at the NeurIPS 2025 Datasets and Benchmarks Track.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NLID: A Large-Scale Neuromorphic Liquid Identification Dataset
NeurIPS 2025 Datasets and Benchmarks Track paper: Bubbles Talk: A Neuromorphic Dataset for Liquid Identification from Pouring Process.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark code is available at: https://github.com/Junjue-Wang/LoveDA
Reference:
@inproceedings{wang2021loveda,
title={Love{DA}: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation},
author={Junjue Wang and Zhuo Zheng and Ailong Ma and Xiaoyan Lu and Yanfei Zhong},
booktitle={Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
editor = {J. Vanschoren and S. Yeung},
year={2021},
volume = {1},
url={https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf}
}
License:
The owners of the data and of the copyright on the data are RSIDEA, Wuhan University. Use of the Google Earth images must respect the "Google Earth" terms of use. All images and their associated annotations in LoveDA can be used for academic purposes only, but any commercial use is prohibited. (CC BY-NC-SA 4.0)
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of existing methods on this new, challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf
In addition to the 600 labeled CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The links can be found at:
labeled data (500 CT + 100 MRI)
unlabeled data Part I (900 CT)
unlabeled data Part II (1100 CT) (currently 1000 CT; we will replenish to 1100 CT)
unlabeled data Part III (1200 MRI)
If you found this dataset useful for your research, please cite:
@inproceedings{NEURIPS2022_ee604e1b,
author = {Ji, Yuanfeng and Bai, Haotian and Ge, Chongjian and Yang, Jie and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and Luo, Ping},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {36722--36732},
publisher = {Curran Associates, Inc.},
title = {AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/ee604e1bedbd069d9fc9328b7b9584be-Paper-Datasets_and_Benchmarks.pdf},
volume = {35},
year = {2022}
}
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
This repository contains the code, data, and metadata for our NeurIPS 2025 Datasets and Benchmarks submission: TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs. The benchmark evaluates the strategic reasoning of large language models using 2x2 matrix games with narrative contexts and theory-of-mind variations.
Directory Structure… See the full description on the dataset page: https://huggingface.co/datasets/pinkex/TMGBench.
The OCW dataset is for evaluating creative problem-solving tasks; it curates the problems and human performance results from the popular British quiz show Only Connect.
The OCW dataset contains 618 connecting wall puzzles and solutions in total from 15 seasons of the show. Each show episode has two walls.
The dataset has two tasks: Task 1 (Groupings) and Task 2 (Connections), which are identical to the quiz show's human participant tasks.
Task 1 (Groupings) is evaluated via six metrics: number of solved walls, number of correct groups (max. four per wall), Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI), Fowlkes-Mallows Score (FMS), and Wasserstein Distance (WD), normalized to the (0, 1) range, between predicted and ground-truth labels.
Task 2 (Connections) is evaluated with three metrics: exact string matching, ROUGE-1 F1, and BERTScore F1.
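As a rough sketch of how the Task 1 metrics can be computed with standard libraries (this is not the official evaluation code, and the WD normalization used in the paper is not reproduced here):

from scipy.stats import wasserstein_distance
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             fowlkes_mallows_score)

def grouping_metrics(true_labels, pred_labels):
    # Clustering-style agreement between ground-truth and predicted groups
    return {
        "AMI": adjusted_mutual_info_score(true_labels, pred_labels),
        "ARI": adjusted_rand_score(true_labels, pred_labels),
        "FMS": fowlkes_mallows_score(true_labels, pred_labels),
        "WD": wasserstein_distance(true_labels, pred_labels),
    }

# Example: a wall of 16 clues in 4 groups of 4, with two clues swapped
truth = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
pred = [0] * 4 + [1] * 3 + [2] + [2] * 3 + [1] + [3] * 4
print(grouping_metrics(truth, pred))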
Baseline results with pre-trained language models and with few-shot In-context Learning (ICL) with LLMs such as GPT-4 are available here:
"Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset" Saeid Alavi Naeini, Raeid Saqur, Mozhgan Saeidi, John Giorgi, Babak Taati. 2023 https://neurips.cc/virtual/2023/poster/73547
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fake audio detection is a growing concern, and some relevant datasets have been designed for research. But there is no standard public Chinese dataset under additive-noise conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for noise adding at five different signal-to-noise ratios. The FAD dataset can be used not only for fake audio detection, but also for detecting the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods with generalization remain challenging. The FAD dataset is publicly available. The source code of the baselines is available on GitHub: https://github.com/ADDchallenge/FAD
The FAD dataset is designed to evaluate methods for fake audio detection, fake algorithm recognition, and other relevant studies. To better study the robustness of the methods under noisy conditions when applied in real life, we construct a corresponding noisy dataset. The full FAD dataset consists of two versions: a clean version and a noisy version. Both versions are divided into disjoint training, development, and test sets in the same way. There is no speaker overlap across these three subsets. Each test set is further divided into seen and unseen test sets. The unseen test sets can evaluate the generalization of the methods to unknown types. It is worth mentioning that both the real audio and the fake audio in the unseen test set are unknown to the model.
For the noisy speech part, we select three noise databases for simulation. Additive noises are added to each audio in the clean dataset at 5 different SNRs. The additive noises of the unseen test set and the remaining subsets come from different noise databases. In each version of the FAD dataset, there are 138,400 utterances in the training set, 14,400 utterances in the development set, 42,000 utterances in the seen test set, and 21,000 utterances in the unseen test set. More detailed statistics are given in Table 2 of the paper.
Clean Real Audios Collection
To eliminate the interference of irrelevant factors, we collect clean real audio from two sources: 5 open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.
Clean Fake Audios Generation
We select 11 representative speech synthesis methods to generate fully fake audio, plus one method that produces partially fake audio.
Noisy Audios Simulation
Noisy audio aims to quantify the robustness of the methods under noisy conditions. To simulate real-life scenarios, we artificially sample noise signals and add them to the clean audio at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Additive noises are selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.
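The exact mixing procedure is described in the paper; a common way to add noise at a target SNR is to scale the noise to the desired power ratio, sketched below under that assumption (function name and interface are ours):

import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    # Tile or trim the noise to match the clean signal length
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so the mixture reaches the target SNR (in dB)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise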
This dataset is licensed under a CC BY-NC-ND 4.0 license.
You can cite the data using the following BibTeX entry:
@inproceedings{ma2022fad,
title={FAD: A Chinese Dataset for Fake Audio Detection},
author={Haoxin Ma and Jiangyan Yi and Chenglong Wang and Xinrui Yan and Jianhua Tao and Tao Wang and Shiming Wang and Le Xu and Ruibo Fu},
booktitle={Submitted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
year={2022},
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains UAV footage of wild antelopes (blackbucks) in grassland habitats. It can mainly be used for two tasks: multi-object tracking (MOT) and re-identification (Re-ID). We provide annotations for the position of animals in each frame, allowing us to offer very long videos (up to 3 min), completely annotated, while maintaining the identity of each animal in the video. The Re-ID dataset offers two videos that capture the movement of some animals simultaneously from two different UAVs. The Re-ID task is to find the same individual in two videos taken simultaneously from slightly different perspectives. The relevant paper will be published in the NeurIPS 2024 Datasets and Benchmarks Track. https://nips.cc/virtual/2024/poster/97563 Resolution: 5.4K. MOT: 12 videos (MOT17 format). Re-ID: 6 sets, each with a pair of drones (custom format). Detection: 320 images (COCO, YOLO).
Repository for the 3D-ADAM Dataset. Submitted to NeurIPS 2025 - Datasets and Benchmarks Track.
license: cc-by-nc-sa-4.0
This dataset contains the Burgers' equation and Darcy flow benchmarks featured in several operator learning tasks involving parametric partial differential equations: for \(a\in \mathcal{A}\)
\[L_a (u) = f \quad \text{ in } \Omega \subset\mathbb{R}^m,\]
with appropriate boundary or initial conditions. The tasks include forward problems (predict the solution \(u\) from the input \(a\)) and inverse problems (recover \(a\) from the solution \(u\)).
The datasets are given in the form of MATLAB files. They can be loaded into torch.Tensor format using the following snippets. The first index is the sample index; the remaining dimensions contain the spatial discretizations of the data and the targets. data_path is specified by the user.
PDE for \(u \in H^1((0,2\pi)) \cap C^0(\mathbb{S}^1)\): \[ \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2} \quad \text{ for } (x,t)\in (0,2\pi) \times (0,1],\] where \(\nu\) is the viscosity, with initial condition \(u(x, 0) = u_0(x)\) for \(x\in (0,2\pi)\).
from scipy.io import loadmat

data = loadmat(data_path)  # data_path is specified by the user
x_data = data['a']  # input samples
y_data = data['u']  # target solutions
In a forward problem, a is the input and u is the solution (target). In an inverse problem, u is the input and a is the target.
PDE: for the coefficient \(a\in L^{\infty}(\Omega)\) and solution \(u \in H^1_0(\Omega)\), \[-\nabla \cdot (a(x) \nabla u) = 1 \quad \text{in }\; (0,1)^2,\] with a zero Dirichlet boundary condition.
from scipy.io import loadmat
data = loadmat(data_path)
a = data['coeff']  # diffusion coefficient (input for the forward problem)
u = data['sol']    # solution (target for the forward problem)
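Since the description says the data can be loaded into torch.Tensor format, a minimal sketch of the conversion (the float32 dtype choice is ours):

import torch
from scipy.io import loadmat

data = loadmat(data_path)  # data_path is specified by the user
# Dimension 0 indexes samples; the remaining dims are the spatial grid
a = torch.from_numpy(data['coeff']).float()
u = torch.from_numpy(data['sol']).float()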
The dataset is owned by Zongyi Li and the license is MIT: https://github.com/zongyi-li/fourier_neural_operator
@misc{li2020fourier,
title={Fourier Neural Operator for Parametric Partial Differential Equations},
author={Zongyi Li and Nikola Kovachki and Kamyar Azizzadenesheli and Burigede Liu and Kaushik Bhattacharya and Andrew Stuart and Anima Anandkumar},
year={2020},
eprint={2010.08895},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{li2020neural,
title={Neural Operator: Graph Kernel Network for Partial Differential Equations},
author={Zongyi Li and Nikola Kovachki and Kamyar Azizzadenesheli and Burigede Liu and Kaushik Bhattacharya and Andrew Stuart and Anima Anandkumar},
year={2020},
eprint={2003.03485},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{cao2021,
title={Choose a Transformer: Fourier or Galerkin},
author={Shuhao Cao},
year={2021},
eprint={2105.14995},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts
This is the official data repository to the NeurIPS 2025 Datasets & Benchmarks Track Submission.
Usage
We provide dataset loading utilities and full training and evaluation pipelines in the accompanying code repository that will be released upon publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🖼️ COinCO: Common Inpainted Objects In-N-Out of Context
Authors: Tianze Yang*, Tyson Jordan*, Ninghao Liu, Jin Sun (*Equal contribution)
Affiliation: University of Georgia
Status: Submitted to NeurIPS 2025 Datasets and Benchmarks Track (under review)
📦 1. Dataset Overview
The COinCO dataset is a large-scale benchmark constructed from the COCO dataset to study object-scene contextual relationships via inpainting. Each image in COinCO contains one inpainted object, and… See the full description on the dataset page: https://huggingface.co/datasets/ytz009/COinCO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first dataset of the NeurIPS 2025 submission: The SafePowerGraph Benchmark: Toward Reliable and Realistic Graph Learning in Power Grids.