29 datasets found

f
Data from: Time-Split Cross-Validation as a Method for Estimating the...
acs.figshare.com
txt
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
Glaucoma Dataset: EyePACS-AIROGS-light-V2
kaggle.com
zip
Updated Mar 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
Explore at:
zip(549533071 bytes)Available download formats
Dataset updated
Mar 9, 2024
Authors
Riley Kiefer
Description
News: Now with a 10.0 Kaggle usability score: supplemental metadata.csv file added to dataset.

Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. This dataset is split into training, validation, and test folders which contain 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images in each class respectively. Each training set has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to easily benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab, I have created a template to make it even easier!

Please cite the dataset and at least the first of my related works if you found this dataset useful!

Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.

Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.

Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670

Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).

Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).

R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.

Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.

R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.

E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

Improvements from v1: - According to an ablation study on the image standardization methods applied to dataset v1 [3], images are standardized according to the CROP methodology (remove black background before resizing). This method yields more of the actual fundus foreground in the resultant image. - Increased the image resize dimensions from 256x256 pixels to 512x512 pixels - Reason: Provides greater model input flexibility, detail, and size. This also better supports the ONH-cropping models. - Added 3000 images from the Rotterdam EyePACS AIROGS dev set - Reason: More data samples can improve model generalizability - Readjusted train/val/test split - Reason: The validation and test split sizes were different - Improved sampling from source dataset - Reason: v1 NRG samples were not randomly selected

Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is the accessibility of the dataset. The dataset requires a long download, a large storage space, it spans several folders, and it is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these original images often contain a large amount of black background and the dimensions are too large for machine learning inputs. The proposed dataset addresses the aforementioned concerns by image sampling and image standardization to balance and reduce the dataset size respectively.

Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...
physioDL: A dataset for geomorphic deep learning representing a scene...
figshare.com
zip
Updated Jul 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aaron Maxwell (2024). physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict physiographic region in which a hillshade occurs) [Dataset]. http://doi.org/10.6084/m9.figshare.26363824.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26363824.v2
Dataset updated
Jul 24, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Aaron Maxwell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict physiographic region in which a hilshade occurs)Purpose: Datasets for geomorphic deep learning. Predict the physiographic region of an area based on a hillshade image. Terrain data were derived from the 30 m (1 arc-second) 3DEP product across the entirety of CONUS. Each chip has a spatial resolution of 30 m and 256 rows and columns of pixels. As a result, each chip measures 7,680 meters-by-7,680 meters. Two datasets are provided. Chips in the hs folder represent a multidirectional hillshade while chips in the ths folder represent a tinted multidirectional hillshade. Data are represented in 8-bit (0 to 255 scale, integer values). Data are projected to the Web Mercator projection relative to the WGS84 datum. Data were split into training, test, and validation partitions using stratified random sampling by region. 70% of the samples per region were selected for training, 15% for testing, and 15% for validation. There are a total of 16,325 chips. The following 22 physiographic regions are represented: "ADIRONDACK" , "APPALACHIAN PLATEAUS", "BASIN AND RANGE", "BLUE RIDGE", "CASCADE-SIERRA MOUNTAINS", "CENTRAL LOWLAND", "COASTAL PLAIN", "COLORADO PLATEAUS", "COLUMBIA PLATEAU", "GREAT PLAINS", "INTERIOR LOW PLATEAUS", "MIDDLE ROCKY MOUNTAINS", "NEW ENGLAND", "NORTHERN ROCKY MOUNTAINS", "OUACHITA", "OZARK PLATEAUS", "PACIFIC BORDER", and "PIEDMONT", "SOUTHERN ROCKY MOUNTAINS", "SUPERIOR UPLAND", "VALLEY AND RIDGE", "WYOMING BASIN". Input digital terrain models and hillshades are not provided due to the large file size (> 100GB). FilesphysioDL.csv: Table listing all image chips and associated physiographic region (id = unique ID for each chip; region = physiographic region; fnameHS = file name of associated chip in hs folder; fnameTHS = file name of associated chip in ths folder; set = data split (train, test, or validation).chipCounts.csv: Number of chips in each data partition per physiographic province. map.png: Map of data.makeChips.R: R script used to process the data into image chips and create CSV files.inputVectorschipBounds.shp = square extent of each chipchipCenters.shp = center coordinate of each chipprovinces.shp = physiographic provincesprovinces10km.shp = physiographic provinces with a 10 km negative buffer
Link-prediction on Biomedical Knowledge Graphs
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.12097377
Dataset updated
Jun 25, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Time period covered
Jun 25, 2021
Description
Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models has been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).

On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t) , after masking out scores of other (h,r,t') triples contained in the graph.

Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.

Inside experimental_data.zip, the following files are provided for each dataset:

{dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.

test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;

entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);

relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).

All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.
o
madelon
openml.org
Updated May 22, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2015). madelon [Dataset]. https://www.openml.org/d/1485
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 22, 2015
Description
Author: Isabelle Guyon
Source: UCI
Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

Abstract:

MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

Source:

Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 90708 isabelle '@' clopinet.com

Data Set Information:

MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). It was added a number of distractor feature called 'probes' having no predictive power. The order of the features and patterns were randomized.

This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test set. Target values are provided only for two first sets (not for the test set). So, this dataset version contains all the examples from training and validation partitions.

There is no attribute information provided to avoid biasing the feature selection process.

Relevant Papers:

The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lofti Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.
WD50K
data.europa.eu
data.niaid.nih.gov
unknown
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). WD50K [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4036498?locale=lv
Explore at:
unknown(7080916)Available download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WD50K dataset: An hyper-relational dataset derived from Wikidata statements. The dataset is constructed by the following procedure based on the Wikidata RDF dump of August 2019: - A set of seed nodes corresponding to entities from FB15K-237 having a direct mapping in Wikidata (P646 "Freebase ID") is extracted from the dump. - For each seed node, all statements whose main object and qualifier values corresponding to wikibase:Item are extracted from the dump. - All literals are filtered out from the qualifiers of the above obtained statements. - All the entities from the dataset which have less than two mentions are dropped. The statements corresponding to the dropped entities are also dropped. - The remaining statements are randomly split into the train, test, and validation sets. - All statements from train and validation sets are removed which share the same main triple (s,p,o) with test statements. - WD50k_33, WD50k_66, WD50k_100 are then sampled from the above statements. Here 33, 66, 100 represents the amount of hyper-relational facts (statements with qualifiers) in the dataset. The table below provides some basic statistics of our dataset and its three further variations: | Dataset | Statements | w/Quals (%) | Entities | Relations | E only in Quals | R only in Quals | Train | Valid | Test | |-------------|------------|----------------|----------|-----------|-----------------|-----------------|---------|--------|--------| | WD50K | 236,507 | 32,167 (13.6%) | 47,156 | 532 | 5460 | 45 | 166,435 | 23,913 | 46,159 | | WD50K (33) | 102,107 | 31,866 (31.2%) | 38,124 | 475 | 6463 | 47 | 73,406 | 10,668 | 18,133 | | WD50K (66) | 49,167 | 31,696 (64.5%) | 27,347 | 494 | 7167 | 53 | 35,968 | 5,154 | 8,045 | | WD50K (100) | 31,314 | 31,314 (100%) | 18,792 | 279 | 7862 | 75 | 22,738 | 3,279 | 5,297 | When using the dataset please cite: @inproceedings{StarE, title={Message Passing for Hyper-Relational Knowledge Graphs}, author={Galkin, Mikhail and Trivedi, Priyansh and Maheshwari, Gaurav and Usbeck, Ricardo and Lehmann, Jens}, booktitle={EMNLP}, year={2020} } For any further questions, please contact: mikhail.galkin@iais.fraunhofer.de
f
Probing Datasets for Noisy Texts
federation.figshare.com
researchdata.edu.au
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
Explore at:
Unique identifier
https://doi.org/10.25955/604c5307db043
Dataset updated
Mar 14, 2021
Dataset provided by
Federation University Australia
Authors
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
ContextProbing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model’s performance shall vary depending on the richness of the linguistic properties crammed into the representation.

This dataset contains five new probing datasets consist of noisy texts (Tweets) which can serve as a benchmark dataset for researchers to study the linguistic characteristics of unstructured and noisy texts.File StructureFormat: A tab-separated text file

Column 1: train/test/validation split (tr-train, te-test, va-validation)

Column 2: class label (refer to the content

section for the class labels of each task file)

Column 3: Tweet message (text) Column

4: a unique ID Contentsent_len.tsvIn this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on their lengths; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.word_content.tsvWe consider a 10-way classifications task with 10 words as targets considering the available manually annotated instances. The task is predicting which of the target words appears on the given sentence. We have considered only the words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as “WC” in the paper. bigram_shift.tsvThe purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper. tree_depth.tsvThe Tree Depth task evaluates the encoded sentence's ability to understand the hierarchical structure by allowing the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parser tree. The task is referred to as “TreeDepth” in the paper. odd_man_out.tsv

The Tweets are modified by replacing a random noun or a verb o with another noun or verb r. The task of the classifier is to identify whether the sentence gets modified due to this change. Class label O refers to the unmodified sentences while C refers to modified sentences. The task is called “SOMO” in the paper.
Rescaled Fashion-MNIST dataset
zenodo.org
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15187793
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Time period covered
Apr 10, 2025
Description
Motivation

The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled Fashion-MNIST dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

The h5 files containing the dataset

The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Rescaled CIFAR-10 dataset
zenodo.org
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15188748
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Description
Motivation

The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled CIFAR-10 dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

and is therefore significantly more challenging.

Access and rights

The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

The h5 files containing the dataset

The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Life Expectancy WHO
kaggle.com
zip
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
Explore at:
zip(121472 bytes)Available download formats
Dataset updated
Jun 19, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The objective behind attempting this dataset was to understand the predictors that contribute to the life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps Involved: - Read the csv file - Data Cleaning: - Variables Country and Status were showing as having character data types. These had to be converted to factor - 2563 missing values were encountered with Population variable having the most of the missing values i.e 652 - Missing rows were dropped before we could run the analysis. 3) Run Linear Regression - Before running linear regression, 3 variables were dropped as they were not found to be having that much of an effect on the dependent variable i.e Life Expectancy. These 3 variables were Country, Year & Status. This meant we are now working with 19 variables (1 dependent and 18 independent variables) - We run the linear regression. Multiple R squared is 83% which means that independent variables can explain 83% change or variance in the dependent variable. - OULTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%. - MULTICOLLINEARITY. We check for multicollinearity using the VIF model(Variance Inflation Factor). This is being done in case when two or more independent variables showing high correlation. The thumb rule is that absolute VIF values above 5 should be removed. We find 6 variables that have a VIF value higher than 5 namely Infant.deaths, percentage.expenditure,Under.five.deaths,GDP,thinness1.19,thinness5.9. Infant deaths and Under Five deaths have strong collinearity so we drop infant deaths(which has the higher VIF value). - When we run the linear regression model again, VIF value of Under.Five.Deaths goes down from 211.46 to 2.74 while the other variable's VIF values reduce very less. Variable thinness1.19 is now dropped and we run the regression once more. - Variable thinness5.9 whose absolute VIF value was 7.61 has now dropped to 1.95. GDP and Population are still having VIF value more than 5 but I decided against dropping these as I consider them to be important independent variables. - SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We run the train data and get multiple R squared of 86% and p value less than that of alpha which states that it is statistically significant. We use the train data to predict the test data to find out the RMSE and MAPE. We run the library(Metrics) for this purpose. - In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on an average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values. - MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037). - MAE (Mean Absolute Error) is 2.55. This indicates that on an average, the predicted values deviate by approximately 2.83 years from the actual values.

We use DECISION TREE MODEL for the analysis.

Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).

We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.

We use 5 fold cross validation method with CP (complexity parameter) being 0.01.

In Decision Tree , RMSE (Root Mean Squared Error) is 3.06. This indicates that on an average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).

MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

We use RANDOM FOREST for the analysis.

Run library(randomForest)

We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.

Predict Life expectancy through random forest model.

In Random Forest , RMSE (Root Mean Squared Error) is 1.73. This indicates that on an average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).

MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Data and script pipeline for: Common to rare transfer learning (CORAL)...
zenodo.org
bin, html
Updated Mar 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Otso Ovaskainen; Otso Ovaskainen (2025). Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods [Dataset]. http://doi.org/10.5281/zenodo.14962497
Explore at:
bin, htmlAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14962497
Dataset updated
Mar 3, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Otso Ovaskainen; Otso Ovaskainen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The scripts and the data provided in this depository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize how to use the software with a small, simulated dataset, with running time less than a minute in a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper for a small subset of the data, with running time of ca. one hour in a powerful laptop (Demo 2); how to reproduce the full analyses presented in the paper, with running time up to several days, depending on the computational resources (Demo 3). The Demos 1 and 2 are aimed to be user-friendly starting points for understanding and testing how to implement CORAL. The Demo 3 is included mainly for reproducibility.

System requirements

· The software can be used in any operating system where R can be installed.

· We have developed and tested the software in a windows environment with R version 4.3.1.

· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).

· The use of the software does not require any non-standard hardware.

Installation guide

· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies the is presented as a R-pipeline and thus it does not require any installation other than installation of R.

Demo 1: Software demo with simulated data

The software demonstration consists of two R-markdown files:

· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed to data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset as it is assumed to remain unknown for the user), Y (species occurrence data), phy (phylogenetic tree), studyDesign (list of sampling units). Additionally, true values used for data generation are save in the file trueValuesDemo.RData: LF (the second environmental predictor that will be estimated through a latent factor approach), and beta (species responses to environmental predictors).

· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.

Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from few seconds to at most one minute.

Demo 2: Software demo with a small subset of the data used in the paper

The software demonstration consists of one R-markdown file:

MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence at least 40 and less than 50, and common species as those with prevalence at least 200. This leaves 51 species to the backbone model and 460 rare species modelled through the CORAL approach. The script assess model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.

Scripts and data for reproducing the results presented in the paper (Demo 3)

The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:

· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.

· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.

· S03_import_posterior – imports the posterior distributions sampled by the initial model.

· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.

· S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.

· S06_construct_coral_priors – calculate CORAL prior parameters.

The remaining scripts evaluate the model:

· S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.

· S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.

· S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again with on the inverse split (i.e., training = ! training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.

· S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R² produced by cross-validation. Generates Fig. 4 of the paper.

· S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.

· S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.

· S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.

Pre-processing scripts:

· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.

· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.

· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).

Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...
zenodo.org
bin, csv, pdf
Updated Jan 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jake Lee; Jake Lee; Michael Kiper; Michael Kiper; David R. Thompson; David R. Thompson; Philip Brodrick; Philip Brodrick (2025). Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection" [Dataset]. http://doi.org/10.5281/zenodo.14614218
Explore at:
bin, pdf, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14614218
Dataset updated
Jan 10, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jake Lee; Jake Lee; Michael Kiper; Michael Kiper; David R. Thompson; David R. Thompson; Philip Brodrick; Philip Brodrick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

Summary

Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916

This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

spectf_cloud_labelbox.hdf5

1,841,641 Labeled spectra from 221 EMIT Scenes.

spectf_cloud_mmgis.hdf5

1,733,801 Labeled spectra from 313 EMIT Scenes.

These scenes were speciffically labeled to correct false detections by an earlier version of the model.

train_fids.csv

465 EMIT scenes comprising the training set.

test_fids.csv

69 EMIT scenes comprising the held-out validation set.

v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and Falsecolor, their existing baseline cloud masks, as well as their cloud masks produced by the ANN and GBT reference models and the SpecTf model.

Data Description

221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per-class per-scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, A total of 3,575,442 spectra were labeled and sampled.

The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a Pytorch dataloader.

Each hdf5 file contains the following arrays:

'spectra'

Top-of-Atmosphere reflectance calculated from the EMIT L1B Radiance product

Float64 of shape (n, 268)

'fids'

The FID from which each spectrum was sampled

Binary string of shape (n,)

'indices'

The (col, row) index from which each spectrum was sampled

Int64 of shape (n, 2)

'labels'

Annotation label of each spectrum

0 - "Clear"

1 - "Cloud"

2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)

label[label==2] = 0

Int64 of shape (n,2)

Each hdf5 file contains the following attribute:

'bands'

The band center wavelengths (nm) of the spectrum

Float64 of shape (268,)

Acknowledgements

The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

© 2024 California Institute of Technology. Government sponsorship acknowledged.
Glaucoma Dataset: EyePACS AIROGS - Light
kaggle.com
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Riley Kiefer (2023). Glaucoma Dataset: EyePACS AIROGS - Light [Dataset]. https://www.kaggle.com/deathtrooper/eyepacs-airogs-light
Explore at:
zip(316720067 bytes)Available download formats
Dataset updated
May 23, 2023
Authors
Riley Kiefer
Description
This is a machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] train set. This dataset is split into training, validation, and test folders which contain 2500, 270, and 500 fundus images in each class respectively. Each training set has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG).

This dataset has been updated with more training samples and general improvements: https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2

Three versions of the same dataset are available with different standardization strategies: 1. RAW - Resizing the source image to 256x256 pixels 2. PAD - Padding the source image to a square image and then resizing it to 256x256 pixels. This method preserves the aspect ratio but the resultant image contains less usable information. 3. CROP - Cropping black background in the fundus image, pad the resultant image to create a square image, and then resize to 256x256 pixels. This method preserves the aspect ratio and the resultant image contains the most usable information.

Please review the ablation study to review the impact of the standardization method on the model performance: https://www.kaggle.com/code/deathtrooper/glaucoma-standardization-ablation-study

Please see the code tab for glaucoma detection benchmark progress. The top-performing model has been made by KEREM KARABACAK with a test accuracy of 93.5%.

This work has been published in the IEEE-ICIVC 2023 Conference: Automated Fundus Image Standardization using a Dynamic Global Foreground Threshold Algorithm By Riley Kiefer, Muhammad Abid, Mahsa Raeisi Ardali, Jessica Steen, and Ehsan Amjadian. Learn more about how the algorithm created this dataset here: https://ieeexplore.ieee.org/abstract/document/10270429

[1] EyePACS-AIROGS; https://zenodo.org/record/5793241

Citation

Please cite at least the first work in academic publications: 1. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023. 2. R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429. 3. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023. 4. R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.

Please also see the following optometry abstract publications: 1. A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection; AAO 2022; https://aaopt.org/past-meeting-abstract-archives/?SortBy=ArticleYear&ArticleType=&ArticleYear=2022&Title=&Abstract=&Authors=&Affiliation=&PROGRAMNUMBER=225129 2. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2790420 3. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2791017

Please also see the DOI citations for this and related datasets: 1. SMDG; @dataset{smdg, title={SMDG, A Standardized Fundus Glaucoma Dataset}, url={https://www.kaggle.com/ds/2329670}, DOI={10.34740/KAGGLE/DS/2329670}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} } 2. EyePACS-light-v1 @dataset{eyepacs-light-v1, title={Glaucoma Dataset: EyePACS AIROGS - Light}, url={https://www.kaggle.com/ds/3222646}, DOI={10.34740/KAGGLE/DS/3222646}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} } 3. EyePACS-light-v2 @dataset{eyepacs-light-v2, title={Glaucoma Dataset: EyePACS-AIROGS-light-V2}, url={https://www.kaggle.com/dsv/7300206}, DOI={10.34740/KAGGLE/DSV/7300206}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA)
zenodo.org
bin, zip
Updated Dec 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth (2023). UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA) [Dataset]. http://doi.org/10.5281/zenodo.6476639
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6476639
Dataset updated
Dec 11, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth
Description
Introduction

Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics making them more expensive, less portable, and with a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.

We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.

In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".

Fundus Imaging

We include 45 degree macula centered fundus images that cover both macula and optic disc. All images were acquired using a OptoVue iVue fundus camera without pupil dilation.

The full images are available at the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/fundus/disc and cropped/fundus/macula.

Enface OCT-A

We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.

En-face OCTA images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images where only vessels are included. This has been performed automatically using the ROSE algorithm (Ma et al. 2021). These can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh, the former contains the probabilities of the ROSE algorithm the latter a binary map.

Synthetic OCT-A

We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.

The full images are available at the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we performed the same denoising ROSE algorithm (Ma et al. 2021) used for the original enface OCT-A images, the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, the former contains the probabilities of the ROSE algorithm the latter a binary map.

Other Fundus Vessel Segmentations Included

In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations. SA-Unet (Li et. al, 2020) and IterNet (Guo et. al, 2021).

SA-Unet. The full images are available at the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.

IterNet. The full images are available at the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/Iternet/disc and cropped/Iternet/macula.

Train/Validation/Test Replication

In order to replicate or compare your model to the results of our paper, we report below the data split used.

Training subjects IDs: 1 - 25

Validation subjects IDs: 26 - 30

Testing subjects IDs: 31 - 112

Data Acquisition

This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.

User Agreement

The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication the following paper needs to be cited

Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.

Funding

This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.

Research Team and Acknowledgements

Here are the people behind this data acquisition effort:

Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo

We would also like to acknowledge for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at University of Dundee, UK and Memorial Hermann Hospital System.

References

Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9. C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346. L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621. Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.

Data from: A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

zenodo.org
data-staging.niaid.nih.gov

zip

Updated Feb 1, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran (2024). A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (2/6) [Dataset]. http://doi.org/10.5281/zenodo.10591473

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.10591473

Dataset updated

Feb 1, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.

Due to data size restriction, the dataset has been divided into six subdatasets, each one published into a separate entry in Zenodo. This repository contains subdataset 2.

Structure

In Vivo Data

Number of Acquisitions: 20,000
Volunteers: Nine volunteers
File Structure: Each volunteer's data is compressed in a separate zip file.
- Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
Regions :
- Abdomen: 6599 acquisitions
- Neck: 3294 acquisitions
- Breast: 3291 acquisitions
- Lower limbs: 2616 acquisitions
- Upper limbs: 2110 acquisitions
- Back: 2090 acquisitions
File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.

In Vitro Data

Number of Acquisitions: 32 from CIRS model 054G phantom
File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.

CSV Files

Two CSV files are provided:

invivo_dataset.csv :
- Contains a list of all in vivo acquisitions.
- Columns: id, path, volunteer id, body region.
invitro_dataset.csv :
- Contains a list of all in vitro acquisitions.
- Columns: id, path

Zenodo dataset splits and files

The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 2nd split.

File name	Size	Zenodo subdataset number
invivo_dataset.csv	995.9 kB	1
invitro_dataset.csv	1.1 kB	1
cirs-phantom.zip	418.2 MB	1
volunteer-1-lowerLimbs.zip	29.7 GB	1
volunteer-1-carotids.zip	8.8 GB	1
volunteer-1-back.zip	7.1 GB	1
volunteer-1-abdomen.zip	34.0 GB	2
volunteer-1-breast.zip	15.7 GB	2
volunteer-1-upperLimbs.zip	25.0 GB	3
volunteer-2.zip	26.5 GB	4
volunteer-3.zip	20.3 GB	3
volunteer-4.zip	24.1 GB	5
volunteer-5.zip	6.5 GB	5
volunteer-6.zip	11.5 GB	5
volunteer-7.zip	11.1 GB	6
volunteer-8.zip	21.2 GB	6
volunteer-9.zip	23.2 GB	4

Normalized RF Images

Beamforming:
- Depth from 1 mm to 55 mm
- Width spanning the probe aperture
- Grid: 𝜆/8 × 𝜆/8
- Resulting images shape: 1483 × 1189
- Two beamformed RF images from each acquisition:
  - Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
  - Target image: coherently compounded image from 87 PWs acquisitions steered at different angles (acquisition-xxxx-87PWs)
Normalization:
- The two RF images have been normalized
To display the images:
- Perform the envelop detection (to obtain the IQ images)
- Log-compress (to obtain the B-mode images)
File Format: Saved in npy format, loadable using Python and numpy.load(file).

Training and Validation Split in the paper

For the volunteer-based split used in the paper:

Training set: volunteers 1, 2, 3, 6, 7, 9
Validation set: volunteer 4
Test set: volunteers 5, 8
Images analyzed in the paper
- Carotid acquisition (from volunteer 5): acquisition_12397
- Back acquisition (from volunteer 8): acquisition_19764
- In vitro acquisition: invitro-00030

License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Please cite the original paper when using this dataset :

Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256

Contact

For inquiries or issues related to this dataset, please contact:

Name: Roser Viñals
Email: roser.vinalsterres@epfl.ch

HaDR: Dataset for hands instance segmentation

kaggle.com

zip

Updated Mar 7, 2023

Facebook

Twitter

Click to copy link

Link copied

Cite

Ales Vysocky (2023). HaDR: Dataset for hands instance segmentation [Dataset]. https://www.kaggle.com/datasets/alevysock/hadr-dataset-for-hands-instance-segmentation

Explore at:

zip(10662295286 bytes)Available download formats

Dataset updated

Mar 7, 2023

Authors

Ales Vysocky

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

If you use this dataset for your work, please cite the related papers: A. Vysocky, S. Grushko, T. Spurny, R. Pastor and T. Kot, Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localisation, in IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3206948.

S. Grushko, A. Vysocký, J. Chlebek, P. Prokop, HaDR: Applying Domain Randomization for Generating Synthetic Multimodal Dataset for Hand Instance Segmentation in Cluttered Industrial Environments. preprint in arXiv, 2023, https://doi.org/10.48550/arXiv.2304.05826

The HaDR dataset is a multimodal dataset designed for human-robot gesture-based interaction research, consisting of RGB and Depth frames, with binary masks for each hand instance (i1, i2, single class data). The dataset is entirely synthetic, generated using Domain Randomization technique in CoppeliaSim 3D. The dataset can be used to train Deep Learning models to recognize hands using either a single modality (RGB or depth) or both simultaneously. The training-validation split comprises 95K and 22K samples, respectively, with annotations provided in COCO format. The instances are uniformly distributed across the image boundaries. The vision sensor captures depth and color images of the scene, with the depth pixel values scaled into a single channel 8-bit grayscale image in the range [0.2, 1.0] m. The following aspects of the scene were randomly varied during generation of dataset: • Number, colors, textures, scales and types of distractor objects selected from a set of 3D models of general tools and geometric primitives. A special type of distractor – an articulated dummy without hands (for instance-free samples) • Hand gestures (9 options). • Hand models’ positions and orientations. • Texture and surface properties (diffuse, specular and emissive properties) and number (from none to 2) of the object of interest, as well as its background. • Number and locations of directional lights sources (from 1 to 4), in addition to a planar light for ambient illumination. The sample resolution is set to 320×256, encoded in lossless PNG format, and contains only right hand meshes (we suggest using Flip augmentations during training), with a maximum of two instances per sample.

Test dataset (real camera images): Test dataset containing 706 images was captured using a real RGB-D camera (RealSense L515) in a cluttered and unstructured industrial environment. The dataset comprises various scenarios with diverse lighting conditions, backgrounds, obstacles, number of hands, and different types of work gloves (red, green, white, yellow, no gloves) with varying sleeve lengths. The dataset is assumed to have only one user, and the maximum number of hand instances per sample was limited to two. The dataset was manually labelled, and we provide hand instance segmentation COCO annotations in instances_hands_full.json (separately for train and val) and full arm instance annotations in instances_arms_full.json. The sample resolution was set to 640×480, and depth images were encoded in the same way as those of the synthetic dataset.

Channel-wise normalization and standardization parameters for datasets

Dataset	Mean (R, G, B, D)	STD (R, G, B, D)
Train	98.173, 95.456, 93.858, 55.872	67.539, 67.194, 67.796, 47.284
Validation	99.321, 97.284, 96.318, 58.189	67.814, 67.518, 67.576, 47.186
Test	123.675, 116.28, 103.53, 35.3792	58.395, 57.12, 57.375, 45.978

Z
Data from: SynDroneVision: A Synthetic Dataset for Image-Based Drone...
data-staging.niaid.nih.gov
Updated Nov 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lenhard, Tamara R.; Weinmann, Andreas; Franke, Kai; Koch, Tobias (2024). SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13360115
Explore at:
Dataset updated
Nov 13, 2024
Dataset provided by
Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
Darmstadt University of Applied Sciences
Authors
Lenhard, Tamara R.; Weinmann, Andreas; Franke, Kai; Koch, Tobias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Developing robust drone detection systems is often constrained by the limited availability of large-scale annotated training data and the high costs associated with real-world data collection. However, synthetic data presents a promising and cost-effective solution to overcome this issue. Therefore, we present SynDroneVision, a synthetic dataset specifically designed for RGB-based drone detection in surveillance applications. Featuring diverse backgrounds, lighting conditions, and drone models, SynDroneVision offers a comprehensive training foundation for deep learning algorithms. To evaluate the dataset's effectiveness, we perform a comparative analysis across a selection of recent YOLO detection models. Our findings demonstrated that SynDroneVision is a valuable resource for real-world data enrichment, achieving notable enhancements in model performance and robustness, while significantly reducing the time and costs of real-world data acquisition.

Paper

Accepted for publication at the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2025)!

SynDroneVision is presented in the upcoming paper SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection by Tamara R. Lenhard, Andreas Weinmann, Kai Franke, and Tobias Koch. This work is accepted and will be published in the Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2025).

For early access, the preprint is currently available on ArXiv: https://arxiv.org/abs/2411.05633v1

Dataset Details

SynDroneVision comprises a total of 140,038 annotaed RGB images (131,238 for training, 8,800 for validation, and 4,000 for testing), featuring a resolution of 2560x1489 pixels. All images are recorded in a sequential manner using Unreal Engine 5.0 in combination with Colosseum. Apart from drone images, SynDroneVision also includes ~7% of background images (i.e., imag frames without drone instances).

Annotation Format: Annotations (bounding boxes) are provided via text files according to the YOLO standard format

Here, and represent the normalized coordinates of the bounding box center, while and denote the normalized bounding box wisth and height. In SynDroneVision, is always 0, indicating the drone class.

Download

The SynDroneVision dataset offers around 900 GB of data dedicated to image-based drone detection. To facilitate the download process, we have partitioned the dataset into smaller sections. Specifically, we have divided the training data into 10 segments, organized by sequences.

Annotations are available below, with image data accessible via the following links:

Dataset Split Sequences File Name Link Size (GB)

Training Set Seq. 001 - 009 images_train_seq001-009.zip Training images PART 1 57

Seq. 010 - 018 images_train_seq010-018.zip Trainng images PART 2 95.4

Seq. 019 - 027 images_train_seq019-027.zip Training images PART 3 96.2

Seq. 028 - 035 images_train_seq028-035.zip Training images PART 4 83.9

Seq. 036 - 043 images_train_seq036-043.zip Training images PART 5 77.1

Seq. 044 - 050 images_train_seq044-050.zip Training images PART 6 84.7

Seq. 051 - 056 images_train_seq051-056.zip Training images PART 7 86.8

Seq. 057 - 065 images_train_seq057-065.zip Training images PART 8 86.2

Seq. 066 - 070 images_train_seq066-070.zip Training images PART 9 75.7

Seq. 071 - 073 images_train_seq071-073.zip Training images PART 10 38.5

Validation Set Seq. 001 - 073 images_val.zip Validation images 55.2

Test Set Seq. 001 - 073 images_test.zip Test images 26.5

Citation

If you find SynDroneVision helpful in your research, we kindly ask that you cite the associated preprint. Below is the citation in BibTeX format for your convenience:

BibTeX:

@inproceedings{Lenhard:2024, title={{SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection}}, author={Lenhard, Tamara R. and Weinmann, Andreas and Franke, Kai and Koch, Tobias}, year={2024}, url={https://arxiv.org/abs/2411.05633}}

SynDroneVision uses Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere.
Z
Data from: Tango Spacecraft Dataset for Region of Interest Estimation and...
data.niaid.nih.gov
zenodo.org
Updated May 23, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bechini Michele; Lunghi Paolo; Lavagna Michèle (2023). Tango Spacecraft Dataset for Region of Interest Estimation and Semantic Segmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6507863
Explore at:
Dataset updated
May 23, 2023
Dataset provided by
Politecnico di Milano
Authors
Bechini Michele; Lunghi Paolo; Lavagna Michèle
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Reference Paper:

M. Bechini, M. Lavagna, P. Lunghi, Dataset generation and validation for spacecraft pose estimation via monocular images processing, Acta Astronautica 204 (2023) 358–369

M. Bechini, P. Lunghi, M. Lavagna. "Spacecraft Pose Estimation via Monocular Image Processing: Dataset Generation and Validation". In 9th European Conference for Aeronautics and Aerospace Sciences (EUCASS)

General Description:

The "Tango Spacecraft Dataset for Region of Interest Estimation and Semantic Segmentation" dataset here published should be used for Region of Interest (ROI) and/or semantic segmentation tasks. It is split into 30002 train images and 3002 test images representing the Tango spacecraft from Prisma mission, being the largest publicly available dataset of synthetic space-borne noise-free images tailored to ROI extraction and Semantic Segmentation tasks (up to our knowledge). The label of each image gives, for the Bounding Box annotations, the filename of the image, the ROI top-left corner (minimum x, minimum y) in pixels, the ROI bottom-right corner (maximum x, maximum y) in pixels, and the center point of the ROI in pixels. The annotation are taken in image reference frame with the origin located at the top-left corner of the image, positive x rightward and positive y downward. Concerning the Semantic Segmentation, RGB masks are provided. Each RGB mask correspond to a single image in both train and test dataset. The RGB images are such that the R channel corresponds to the spacecraft, the G channel corresponds to the Earth (if present), and the B channel corresponds to the background (deep space). Per each channel the pixels have non-zero value only in correspondence of the object that they represent (Tango, Earth, Deep Space). More information on the dataset split and on the label format are reported below.

Images Information:

The dataset comprises 30002 synthetic grayscale images of Tango spacecraft from Prisma mission that serves as train set, while the test set is formed by 3002 synthetic grayscale images of Tango spacecraft from Prisma mission in PNG format. About 1/6 of the images both in the train and in the test set have a non-black background, obtained by rendering an Earth-like model in the raytracing process used to define the images reported. The images are noise-free to increase the flexibility of the dataset. The illumination direction of the spacecraft in the scene is uniformly distributed in the 3D space in agreement with the Sun position constraints.

Labels Information:

Labels for the bounding box extraction are here provided in separated JSON files. The files are formatted per each image as in the following example:

filename : tango_img_1 # name of the image to which the data are referred rol_tl : [x, y] # ROI top-left corner (minimum x, minimum y) in pixels roi_br : [x, y] # ROI bottom-right corner (maximum x, maximum y) in pixels roi_cc : [x, y] # center point of the ROI in pixels

Notice that the annotation are taken in image reference frame with the origin located at the top-left corner of the image, positive x rightward and positive y downward.To make the usage of the dataset easier, both the training set and the test set are split in two folders containing the images with earth as background and without background.

Concerning the Semantic Segmentation Labels, they are provided as RGB masks named as "filename_mask.png" where "filename" is the filename of the image of the training set or the test set to which a specific mask is referred. The RGB images are such that the R channel corresponds to the spacecraft, the G channel corresponds to the Earth (if present), and the B channel corresponds to the background (deep space). Per each channel the pixels have non-zero value only in correspondence of the object that they represent (Tango, Earth, Deep Space).

VERSION CONTROL

v1.0: This version contains the dataset (both train and test) of full scale images with ROI annotations and RGB masks for Semantic Segmentation tasks. These images have width=height=1024 pixels. The position of tango with respect to the camera is randomly selected from a uniform distribution, but it is ensured the full visibility in all the images.

Note: this dataset contains the same images of the "Tango Spacecraft Wireframe Dataset Model for Line Segments Detection" v2.0 full-scale (DOI: https://doi.org/10.5281/zenodo.6372848) and also "Tango Spacecraft Dataset for Monocular Pose Estimation" v1.0 (DOI: https://doi.org/10.5281/zenodo.6499007) and they can be used together by combining the annotations of the relative pose and the ones of the reprojected wireframe model of Tango, with also the ones of the ROI. These three datasets give the most comprehensive dataset of space borne synthetic images ever published (up to our knowledge).
Housing Price Prediction using DT and RF in R
kaggle.com
zip
Updated Aug 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
Explore at:
zip(629100 bytes)Available download formats
Dataset updated
Aug 31, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Objective: To predict the prices of houses in the City of Melbourne

Approach: Using Decision Tree and Random Forest https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Ffc6fb7d0bd8e854daf7a6f033937a397%2FPicture1.png?generation=1693489996707941&alt=media" alt="">

Data Cleaning:

Date column is shown as a character vector which is converted into a date vector using the library ‘lubridate’

We create a new column called age to understand the age of the house as it can be a factor in the pricing of the house. We extract the year from column ‘Date’ and subtract it from the column ‘Year Built’

We remove 11566 records which have missing values

We drop columns which are not significant such as ‘X’, ‘suburb’, ‘address’, (we have kept zipcode as it serves the purpose in place of suburb and address), ‘type’, ‘method’, ‘SellerG’, ‘date’, ‘Car’, ‘year built’, ‘Council Area’, ‘Region Name’

We split the data into ‘train’ and ‘test’ in 80/20 ratio using the sample function

Run libraries ‘rpart’, ‘rpart.plot’, ‘rattle’, ‘RcolorBrewer’

Run decision tree using the rpart function. ‘Price’ is the dependent variable https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6065322d19b1376c4a341a4f22933a51%2FPicture2.png?generation=1693490067579017&alt=media" alt="">

Average price for 5464 houses is $1084349

Where building area is less than 200.5, the average price for 4582 houses is $931445. Where building area is less than 200.5 & age of the building is less than 67.5 years, the avg price for 3385 houses is $799299.6.

$4801538 is the Highest average prices of 13 houses where distance is lower than 5.35 & building are is >280.5
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F136542b7afb6f03c1890bae9b07dc464%2FDecision%20Tree%20Plot.jpeg?generation=1693490124083168&alt=media" alt="">

We use the caret package for tuning the parameter and the optimal complexity parameter found is 0.01 with RMSE 445197.9 https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Feb1633df9dd61ba3a51574873b055fd0%2FPicture3.png?generation=1693490163033658&alt=media" alt="">

We use library (Metrics) to find out the RMSE ($392107), MAPE (0.297) which means an accuracy of 99.70% and MAE ($272015.4)

Variables ‘postcode’, longitude and building are the most important variables

Test$Price indicates the actual price and test$predicted indicates the predicted price for particular 6 houses. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F620b1aad968c9aee169d0e7371bf3818%2FPicture4.png?generation=1693490211728176&alt=media" alt="">

We use the default parameters of random forest on the train data https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fe9a3c3f8776ee055e4a1bb92d782e19c%2FPicture5.png?generation=1693490244695668&alt=media" alt="">

The below image indicates that ‘Building Area’, ‘Age of the house’ and ‘Distance’ are the most important variables that affect the price of the house. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fc14d6266184db8f30290c528d72b9f6b%2FRandom%20Forest%20Variables%20Importance.jpeg?generation=1693490284920037&alt=media" alt="">

Based on the default parameters, RMSE is $250426.2, MAPE is 0.147 (accuracy is 99.853%) and MAE is $151657.7

Error starts to remain constant between 100 to 200 trees and thereafter there is almost minimal reduction. We can choose N tree=200. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F365f9e8587d3a65805330889d22f9e60%2FNtree%20Plot.jpeg?generation=1693490308734539&alt=media" alt="">

We tune the model and find mtry = 3 has the lowest out of bag error

We use the caret package and use 5 fold cross validation technique

RMSE is $252216.10 , MAPE is 0.146 (accuracy is 99.854%) , MAE is $151669.4

We can conclude that Random Forest give us more accurate results as compared to Decision Tree

In Random Forest , the default parameters (N tree = 500) give us lower RMSE and MAPE as compared to N tree = 200. So we can proceed with those parameters.
Labelled dataset to classify direct deforestation drivers in Cameroon:...
zenodo.org
zip
Updated May 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amandine Debus; Amandine Debus; Emilie Beauchamp; Emilie Beauchamp; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé; Emily R. Lines; Emily R. Lines; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé (2025). Labelled dataset to classify direct deforestation drivers in Cameroon: NIR-R-G bands [Dataset]. http://doi.org/10.5281/zenodo.15538497
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15538497
Dataset updated
May 30, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Amandine Debus; Amandine Debus; Emilie Beauchamp; Emilie Beauchamp; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé; Emily R. Lines; Emily R. Lines; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Cameroon
Description
Overview

This dataset includes the images (NIR-R-G bands for Landsat-8 or NICFI PlanetScope), auxiliary data (infrared, NCEP, forest gain, OpenStreetMap, SRTM, GFW), and data about forest loss (Global Forest Change) used to train, validate and test a model to classify direct deforestation drivers in Cameroon. The creation of this dataset follows the same structure as: Labelled dataset to classify direct deforestation drivers in Cameroon but with a different set of bands.

For more details about how this dataset has been created and can be used, please refer to our paper and code: https://github.com/aedebus/Cam-ForestNet. The paper, describing the generation of RGB images, can be found here: https://www.nature.com/articles/s41597-024-03384-z.

Citation: Debus, A. et al. A labelled dataset to classify direct deforestation drivers from Earth Observation imagery in Cameroon. Sci Data 11, 564 (2024).

Here, the only difference compared with what is described in the paper is that we select NIR-R-G instead of R-G-B bands for our PNG images.

Description of the files and images

'my_examples_landsat_nir.zip': Landsat-8 images (courtesy of the U.S. Geological Survey), auxiliary data and forest loss data used to train, validate and test a model for a detailed classification of deforestation drivers in Cameroon. This dataset contains 332x 332 pixels NIR-R-G calibrated top-of-atmosphere (TOA) reflectance with a 30 m resolution (less than 20% cloud cover)

'my_examples_landsat_sr_nir.zip': Same as above, but with surface reflectance (SR) instead of TOA

'my_examples_planet_nir.zip': NICFI PlanetScope images (catalog owner: Planet), auxiliary data and forest loss data used to train, validate and test a model for a detailed classification of deforestation drivers in Cameroon. This dataset contains 332x 332 pixels monthly NIR-R-G composite with a 4.77 m resolution

'my_examples_planet_nir_biannual.zip': Same as above, but with biannual composites instead of monthly composites

For ‘labels_nir.zip’, we have subfolders for Landsat-8 (TOA, SR, groups TOA) and NICFI PlanetScope (monthly, biannual, groups monthly).

For each folder, subfolders named with the coordinates of the centre of the images contain each:
• A folder ‘images’, with a sub-folder ‘visible’ containing the PNG image; and a sub-folder ‘infrared’ containing the infrared bands in a NPY file.
• A folder ‘auxiliary’ with topographic and forest gain information in a NPY format, OpenStreetMap and peat data in a JSON format, and a sub-folder ‘ncep’ containing all data from NCEP in a NPY format.
• The forest loss pickle file delimiting the area of forest loss.

Note: The images provided have been filtered to enable a train/validation/test split that ensures a minimum distance of 100 meters between the edges of forest loss areas.

Details about the auxiliary data

Forest gain from GFC: 30-m resolution, yearly data for 2000-2021, downloaded via Google Earth Engine

Near infrared, shortwave infrared 1 and 2 bands from Landsat-8 TOA: 30-m resolution, data every 16 days for 2013-2023, downloaded via Google Earth Engine and selected using the same process as for Landsat-8 RGB images

From NCEP Climate Forecast System Version 2 (CFSv2) 6-hourly Products: surface level albedo and volumetric soil moisture content (depths: 0.1 m, 0.4 m, 1.0 m, 2.0m) in 0.01%; radiative fluxes (clear-sky longwave flux downward and upward, clear-sky solar flux downward and upward, direct evaporation from bare soil, longwave and shortwave radiation flux downward and upward, latent, ground and sensible heat net flux), potential evaporation rate, and sublimation in W/m²; humidity (specific, maximum specific, minimum specific) in 10-4 kg/kg; ground level precipitation in 0.1 mm; air pressure at surface level in 10 Pa; wind level (u and v component) in 0.01 m/s, water runoff at surface level in 232.01 kg/ m²; temperature in K: 22264-m resolution, available four times a day for 2011-2023, downloaded directly from the NOAA website and selected the mean of the monthly mean over 5 years before the forest loss event, the monthly maximum over 5 years before the forest loss event, and the monthly minimum over 5 years before the forest loss event for each parameter

Closest street and closest city from OpenStreetMap in km: directly downloaded with the Nominatim API

Altitude in m, slope and aspect in 0.01° from Shuttle Radar Topography Mission (SRTM): 30-m resolution, measured for 2000, downloaded via Google Earth Engine

Presence of peat from GFW: 232-m resolution, measured for 2017, directly downloaded on the GFW website

Details about Global Forest Change

For each image, there is a corresponding 'forest_loss_region' .pkl file delimiting a forest loss region polygon from Global Forest Change (GFC). GFC consists of annual maps of forest cover loss with a 30-m resolution.

License

The NICFI PlanetScope images fall under the same license as the NICFI data program license agreement (data in 'my_examples_planet_nir.zip', 'my_examples_planet_nir_biannual.zip': subfolders '[coordinates]'>'images'>'visible').

OpenStreetMap® is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF) (data in all 'my_examples' folders: subfolders '[coordinates]'>'auxiliary'>'closest_city.json'/'closest_street.json'). The documentation is licensed under the Creative Commons Attribution-ShareAlike 2.0 license (CC BY-SA 2.0).

The rest of the data is under a Creative Commons Attribution 4.0 International License. The data has been transformed following the code that can be found via this link: https://github.com/aedebus/Cam-ForestNet (in 'prepare_files').

Facebook

Twitter

Click to copy link

Link copied

Cite

Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001

Data from: Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction.

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.1021/ci400084k.s001

Dataset updated

Jun 2, 2023

Dataset provided by

ACS Publications

Authors

Robert P. Sheridan

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.

Clear search

Close search

Google apps

Main menu

Data from: Time-Split Cross-Validation as a Method for Estimating the...

Glaucoma Dataset: EyePACS-AIROGS-light-V2

physioDL: A dataset for geomorphic deep learning representing a scene...

Link-prediction on Biomedical Knowledge Graphs

madelon

Abstract:

Source:

Data Set Information:

Relevant Papers:

WD50K

Probing Datasets for Noisy Texts

Rescaled Fashion-MNIST dataset

Motivation

Access and rights

The dataset

The h5 files containing the dataset

Instructions for loading the data set

Rescaled CIFAR-10 dataset

Motivation

Access and rights

The dataset

The h5 files containing the dataset

Instructions for loading the data set

Life Expectancy WHO

We use DECISION TREE MODEL for the analysis.

We use RANDOM FOREST for the analysis.

Data and script pipeline for: Common to rare transfer learning (CORAL)...

Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...

SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

Summary

Data Description

Acknowledgements

Glaucoma Dataset: EyePACS AIROGS - Light

Citation

UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA)

Data from: A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

Structure

In Vivo Data

In Vitro Data

CSV Files

Zenodo dataset splits and files

Normalized RF Images

Training and Validation Split in the paper

License

Contact

HaDR: Dataset for hands instance segmentation

Data from: SynDroneVision: A Synthetic Dataset for Image-Based Drone...

Data from: Tango Spacecraft Dataset for Region of Interest Estimation and...

Housing Price Prediction using DT and RF in R

Labelled dataset to classify direct deforestation drivers in Cameroon:...

Data from: Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction.