Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A detailed description is available in "SynthRAD2025_dataset_description.pdf". A paper describing the dataset has been submitted to Medical Physics and is available as a pre-print at https://arxiv.org/abs/2502.17609. The dataset is divided into two tasks:

- Task 1: MRI-to-CT synthesis
- Task 2: CBCT-to-CT synthesis
After extraction, the dataset is organized as follows:
Within each task, cases are categorized into three anatomical regions: head and neck (HN), thorax (TH), and abdomen (AB).
Each anatomical region contains individual patient folders, named using a unique seven-character alphanumeric code: [Task Number][Anatomy][Center][PatientID]
Example: 1HNA001
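For scripting against the dataset, the case code can be split positionally. A minimal sketch (the helper name is hypothetical; the field widths follow the convention above):

```python
def parse_case_id(case_id: str) -> dict:
    """Split e.g. '1HNA001' into task, anatomy, center, and patient fields."""
    assert len(case_id) == 7, "expected a seven-character code"
    return {
        "task": case_id[0],       # task number: 1 or 2
        "anatomy": case_id[1:3],  # anatomical region: HN, TH, or AB
        "center": case_id[3],     # providing center: A-E
        "patient": case_id[4:7],  # three-digit patient number
    }

print(parse_case_id("1HNA001"))
# -> {'task': '1', 'anatomy': 'HN', 'center': 'A', 'patient': '001'}
```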
Each patient folder in the training dataset contains (for other sets see Table below):
- ct.mha: preprocessed CT image
- mr.mha or cbct.mha (depending on the task): preprocessed MR or CBCT image
- mask.mha: binary mask of the patient outline (dilated)

An overview folder within each anatomical region contains:

- [task]_[anatomy]_parameters.xlsx: imaging protocol details for each patient
- [task][anatomy][center][PatientID]_overview.png: a visualization of axial, coronal, and sagittal slices of CBCT/MR, CT, mask, and difference images

The SynthRAD2025 dataset is part of the second edition of the SynthRAD deep learning challenge (https://synthrad2025.grand-challenge.org/), which benchmarks synthetic CT generation for MRI- and CBCT-based radiotherapy workflows.
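The per-case .mha files can be read with any medical imaging library that supports MetaImage. A minimal loading sketch, assuming SimpleITK and a hypothetical extraction path:

```python
import numpy as np
import SimpleITK as sitk  # assumption: pip install SimpleITK

case_dir = "Task1/HN/1HNA001"  # hypothetical path after extraction

ct = sitk.ReadImage(f"{case_dir}/ct.mha")      # preprocessed CT
mr = sitk.ReadImage(f"{case_dir}/mr.mha")      # mr.mha in Task 1, cbct.mha in Task 2
mask = sitk.ReadImage(f"{case_dir}/mask.mha")  # dilated patient-outline mask

# Convert to (z, y, x) numpy arrays and summarize intensities inside the outline.
ct_arr = sitk.GetArrayFromImage(ct)
mask_arr = sitk.GetArrayFromImage(mask).astype(bool)
print(ct_arr.shape, ct_arr[mask_arr].mean())
```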
Imaging data was collected from five European university medical centers (labeled A–E in the tables below). All centers independently approved the study in accordance with the regulations of their institutional review boards or medical ethics committees.
Inclusion criteria:
The dataset is provided under two different licenses:
| Subset | Files | Release Date | Link |
|---|---|---|---|
| Training | Input, CT, Mask | 01-03-2025 | |
| Training Center D | Input, CT, Mask | 01-03-2025 | Check the download link at: |
| Validation Input | Input, Mask | 01-06-2025 | |
| Validation Input Center D | Input, Mask | 01-06-2025 | Check the download link at: |
| Validation Ground Truth | CT, Deformed CT | 01-03-2030 | |
| Test | Input, CT, Deformed CT, Mask | 01-03-2030 | |
The number of cases collected at each center for training, validation, and test sets.
| Task | Center | HN | TH | AB | Total |
|---|---|---|---|---|---|
| 1 | A | 91 | 91 | 65 | 247 |
| | B | 0 | 91 | 91 | 182 |
| | C | 65 | 0 | 19 | 84 |
| | D | 65 | 0 | 0 | 65 |
| | E | 0 | 0 | 0 | 0 |
| | Total | 221 | 182 | 175 | 578 |
| 2 | A | 65 | 65 | 64 | 195 |
| | B | 65 | 65 | 65 | 195 |
| | C | 65 | 63 | 62 | 190 |
| | D | 65 | 63 | 53 | 181 |
| | E | 65 | 65 | 65 | 195 |
| | Total | 325 | 321 | 309 | 955 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MSMD is a synthetic dataset of 497 pieces of (classical) music that contains both audio and score representations of the pieces aligned at a fine-grained level (344,742 pairs of noteheads aligned to their audio/MIDI counterpart). It can be used for training and evaluating multimodal models that enable crossing from one modality to the other, such as retrieving sheet music using recordings or following a performance in the score image.
Please find further information and a corresponding Python package on this GitHub page: https://github.com/CPJKU/msmd
If you use this dataset, please cite:
[1] Matthias Dorfer, Jan Hajič jr., Andreas Arzt, Harald Frostel, Gerhard Widmer.
Learning Audio-Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification.
Transactions of the International Society for Music Information Retrieval, issue 1, 2018.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Bank Account Fraud (BAF) suite of datasets was published at NeurIPS 2022 and comprises a total of six different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed for evaluating novel and existing methods in ML and fair ML, and the first of its kind!
This suite of datasets is:
- Realistic, based on a present-day, real-world dataset for fraud detection;
- Biased, with distinct controlled types of bias in each dataset;
- Imbalanced, with an extremely low prevalence of the positive class;
- Dynamic, with temporal data and observed distribution shifts;
- Privacy-preserving, protecting the identity of potential applicants through differential privacy techniques (noise addition), feature encoding, and a trained generative model (CTGAN).
Each dataset is composed of:

- 1 million instances;
- 30 realistic features used in the fraud detection use case;
- a "month" column, providing temporal information about the dataset;
- protected attributes (age group, employment status, and % income).
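A minimal loading sketch is shown below; the file name Base.csv and the column names month and fraud_bool are assumptions based on the datasheet linked below, so check the repository for the exact schema.

```python
import pandas as pd

# Assumptions: the base variant ships as "Base.csv", with a binary label
# column "fraud_bool" and a temporal column "month"; see the datasheet.
df = pd.read_csv("Base.csv")

# The suite is dynamic: split temporally so evaluation mirrors deployment,
# with earlier months for training and later months held out.
train = df[df["month"] < 6]
test = df[df["month"] >= 6]

# The positive class is extremely rare; always check prevalence first.
print(f"fraud prevalence (train): {train['fraud_bool'].mean():.4%}")
print(f"fraud prevalence (test):  {test['fraud_bool'].mean():.4%}")
```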
Detailed information (datasheet) on the suite: https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf
Check out the github repository for more resources and some example notebooks: https://github.com/feedzai/bank-account-fraud
Read the NeurIPS 2022 paper here: https://arxiv.org/abs/2211.13358
Learn more about Feedzai Research here: https://research.feedzai.com/
Please use the following citation for the BAF dataset suite:
@article{jesusTurningTablesBiased2022,
title={Turning the {{Tables}}: {{Biased}}, {{Imbalanced}}, {{Dynamic Tabular Datasets}} for {{ML Evaluation}}},
author={Jesus, S{\'e}rgio and Pombal, Jos{\'e} and Alves, Duarte and Cruz, Andr{\'e} and Saleiro, Pedro and Ribeiro, Rita P. and Gama, Jo{\~a}o and Bizarro, Pedro},
journal={Advances in Neural Information Processing Systems},
year={2022}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains hundreds of thousands (hopefully millions soon) of textures and PBR/SV-BRDF materials extracted from real-world natural images.
The repository is composed of texture images given as RGB images (each image is one uniform texture) and folders of PBR/SVBRDF materials given as sets of property maps (base color, roughness, metallic, etc.).
Visualizations of sampled PBRs and textures can be seen in PBR_examples.jpg and Textures_Examples.jpg.
Texture images are given in the Extracted_textures_*.zip files.
Each image in these zip files is a single texture; the textures were extracted and cropped from the Open Images dataset.
PBR materials are available in the PBR_*.zip files. These PBRs were generated from the texture images in an unsupervised way (with no human intervention). Each subfolder in these files contains the property maps of one PBR material (roughness, metallic, etc., suitable for Blender/Unreal Engine); a loading sketch follows below. A visualization of the rendered material appears in the Material_View.jpg file in each PBR folder.
PBR materials that were generated by mixing other PBR materials are available in files with the names PBR_mix*.zip
Samples for each case can be found in files named: Sample_*.zip
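As a sketch of how one of these materials can be consumed (the subfolder path is hypothetical, and the exact map file names vary, so this loader simply collects whatever image maps a material folder provides):

```python
import os
import numpy as np
from PIL import Image  # assumption: Pillow is installed

def load_pbr(folder: str) -> dict:
    """Load every property map (base color, roughness, metallic, ...) found
    in one PBR subfolder into a numpy array, keyed by file name."""
    maps = {}
    for fname in os.listdir(folder):
        name, ext = os.path.splitext(fname)
        if ext.lower() in (".png", ".jpg", ".jpeg"):
            maps[name.lower()] = np.asarray(Image.open(os.path.join(folder, fname)))
    return maps

pbr = load_pbr("PBR_1/material_0000")  # hypothetical extracted subfolder
for name, arr in pbr.items():
    print(name, arr.shape, arr.dtype)
```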
Documented code used to extract the textures and generate the PBRs is available at:
Texture_And_Material_ExtractionCode_And_Documentation.zip
The materials and textures were extracted from real-world images using an unsupervised extraction method (code supplied). As such, they are far more diverse and wide in scope than existing repositories, but at the same time noisier and with more outliers. This repository is most useful for applications that demand large-scale, highly diverse data yet can tolerate more noise and lower quality than professional repositories with manually made assets, like ambientCG, provide. It can be very useful for creating machine learning datasets or for large-scale procedural generation; it is less suitable for areas that demand precise, clean, and categorized PBRs, like CGI art and graphic design. For a preview, it is recommended to look at PBR_examples.jpg and Textures_Examples.jpg, or to download the Sample files and look at the Material_View.jpg files to assess the quality of the materials.
Currently, there are a few hundred thousand PBR materials and textures, but the goal is to grow this to over a million in the near future.
The Python scripts used to extract these assets are supplied at:
Texture_And_Material_ExtractionCode_And_Documentation.zip
The code can be run on any folder of random images; it extracts regions with uniform textures and turns them into PBR materials.
Alternative download sources:
https://sites.google.com/view/infinitexture/home
https://e.pcloud.link/publink/show?code=kZON5TZtxLfdvKrVCzn12NADBFRNuCKHm70
https://icedrive.net/s/jfY1xSDNkVwtYDYD4FN5wha2A8Pz
This work was done as part of the paper "Learning Zero-Shot Material States Segmentation, by Implanting Natural Image Patterns in Synthetic Data".
@article{eppel2024learning,
title={Learning Zero-Shot Material States Segmentation, by Implanting Natural Image Patterns in Synthetic Data},
author={Eppel, Sagi and Li, Jolina and Drehwald, Manuel and Aspuru-Guzik, Alan},
journal={arXiv preprint arXiv:2403.03309},
year={2024}
}
All the code and repositories are available under CC0 (free-to-use) licenses. Textures were extracted from the Open Images dataset, which is under an Apache license.