29 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
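
    For illustration, a minimal sketch (not from the paper) contrasting random and time-split selection, assuming a pandas DataFrame with a hypothetical assay-date column:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def random_split(df, test_frac=0.25, seed=0):
        # Random selection: held-out compounds resemble the training set,
        # so the resulting R2 tends to be optimistic.
        return train_test_split(df, test_size=test_frac, random_state=seed)

    def time_split(df, date_col="assay_date", test_frac=0.25):
        # Time-split selection: the newest compounds form the test set,
        # mimicking true prospective prediction.
        df = df.sort_values(date_col)
        cutoff = int(len(df) * (1 - test_frac))
        return df.iloc[:cutoff], df.iloc[cutoff:]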

  2. Glaucoma Dataset: EyePACS-AIROGS-light-V2

    • kaggle.com
    zip
    Updated Mar 9, 2024
    Cite
    Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
    Explore at:
    zip (549533071 bytes); available download formats
    Dataset updated
    Mar 9, 2024
    Authors
    Riley Kiefer
    Description

    News: now with a 10.0 Kaggle usability score; a supplemental metadata.csv file has been added to the dataset.

    Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. The dataset is split into training, validation, and test folders, which contain 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to make it easy to benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab; I have created a template to make it even easier!
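
    For illustration only, a minimal PyTorch sketch for loading this folder layout (the folder names "train", "validation" and "test" with RG/NRG subfolders are assumptions based on the description above, not verified against the archive):

    import torch
    from torchvision import datasets, transforms

    # Resize to the model input size and convert images to tensors.
    tfm = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])

    train_ds = datasets.ImageFolder("train", transform=tfm)       # subfolders: NRG, RG
    val_ds = datasets.ImageFolder("validation", transform=tfm)
    test_ds = datasets.ImageFolder("test", transform=tfm)

    train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)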

    Please cite the dataset and at least the first of my related works if you found this dataset useful!

    • Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.
    • Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.
    • Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670
    • Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).
    • Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).
    • R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    • Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    • R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
    • E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

    Improvements from v1:
    • According to an ablation study on the image standardization methods applied to dataset v1 [3], images are standardized according to the CROP methodology (remove the black background before resizing). This method retains more of the actual fundus foreground in the resultant image.
    • Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size, and better supports the ONH-cropping models.
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
    • Readjusted the train/val/test split. Reason: the validation and test split sizes were different.
    • Improved sampling from the source dataset. Reason: v1 NRG samples were not randomly selected.

    Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is its accessibility. The dataset requires a long download and a large amount of storage space, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these original images often contain a large amount of black background, and the dimensions are too large for machine learning inputs. The proposed dataset addresses the aforementioned concerns through image sampling and image standardization, to balance and reduce the dataset size respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

    About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...

  3. physioDL: A dataset for geomorphic deep learning representing a scene...

    • figshare.com
    zip
    Updated Jul 24, 2024
    Cite
    Aaron Maxwell (2024). physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict physiographic region in which a hillshade occurs) [Dataset]. http://doi.org/10.6084/m9.figshare.26363824.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict the physiographic region in which a hillshade occurs)

    Purpose: Datasets for geomorphic deep learning. Predict the physiographic region of an area based on a hillshade image. Terrain data were derived from the 30 m (1 arc-second) 3DEP product across the entirety of CONUS. Each chip has a spatial resolution of 30 m and 256 rows and columns of pixels. As a result, each chip measures 7,680 meters by 7,680 meters. Two datasets are provided: chips in the hs folder represent a multidirectional hillshade, while chips in the ths folder represent a tinted multidirectional hillshade. Data are represented in 8-bit (0 to 255 scale, integer values) and are projected to the Web Mercator projection relative to the WGS84 datum. Data were split into training, test, and validation partitions using stratified random sampling by region: 70% of the samples per region were selected for training, 15% for testing, and 15% for validation. There are a total of 16,325 chips. The following 22 physiographic regions are represented: "ADIRONDACK", "APPALACHIAN PLATEAUS", "BASIN AND RANGE", "BLUE RIDGE", "CASCADE-SIERRA MOUNTAINS", "CENTRAL LOWLAND", "COASTAL PLAIN", "COLORADO PLATEAUS", "COLUMBIA PLATEAU", "GREAT PLAINS", "INTERIOR LOW PLATEAUS", "MIDDLE ROCKY MOUNTAINS", "NEW ENGLAND", "NORTHERN ROCKY MOUNTAINS", "OUACHITA", "OZARK PLATEAUS", "PACIFIC BORDER", "PIEDMONT", "SOUTHERN ROCKY MOUNTAINS", "SUPERIOR UPLAND", "VALLEY AND RIDGE", and "WYOMING BASIN". Input digital terrain models and hillshades are not provided due to the large file size (> 100 GB).

    Files
    physioDL.csv: Table listing all image chips and the associated physiographic region (id = unique ID for each chip; region = physiographic region; fnameHS = file name of the associated chip in the hs folder; fnameTHS = file name of the associated chip in the ths folder; set = data split (train, test, or validation)).
    chipCounts.csv: Number of chips in each data partition per physiographic province.
    map.png: Map of the data.
    makeChips.R: R script used to process the data into image chips and create the CSV files.
    inputVectors:
    chipBounds.shp = square extent of each chip
    chipCenters.shp = center coordinate of each chip
    provinces.shp = physiographic provinces
    provinces10km.shp = physiographic provinces with a 10 km negative buffer
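
    For illustration, a minimal pandas sketch (the file path and the exact strings in the set column are assumptions based on the description above) for recovering the data splits from physioDL.csv:

    import pandas as pd

    chips = pd.read_csv("physioDL.csv")             # columns: id, region, fnameHS, fnameTHS, set
    train = chips[chips["set"] == "train"]
    test = chips[chips["set"] == "test"]
    val = chips[chips["set"] == "validation"]

    # Number of chips per physiographic region in the training partition
    print(train["region"].value_counts())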

  4. Link-prediction on Biomedical Knowledge Graphs

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 25, 2024
    Cite
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jun 25, 2021
    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.
    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test sets (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split, and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out the scores of other (h,r,t') triples contained in the graph.
    Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
    Inside experimental_data.zip, the following files are provided for each dataset:
    • {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
    • test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
    • entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
    • relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
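
    As an illustration of how the released ranks can be used, a minimal sketch (assuming pandas and the test_ranks.csv schema described above; the file path is hypothetical) for computing MRR and Hits@10 per model:

    import pandas as pd

    ranks = pd.read_csv("test_ranks.csv")    # columns: h, r, t, DistMult, TransE, RotatE, TripleRE
    for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
        mrr = (1.0 / ranks[model]).mean()        # mean reciprocal rank of the ground-truth tail
        hits10 = (ranks[model] <= 10).mean()     # fraction of test queries with rank <= 10
        print(f"{model}: MRR={mrr:.3f}, Hits@10={hits10:.3f}")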

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

  5. madelon

    • openml.org
    Updated May 22, 2015
    Cite
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon, Clopinet, 955 Creston Road, Berkeley, CA 90708; isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features, one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features called 'probes', having no predictive power, were added. The order of the features and patterns was randomized.

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set), so this dataset version contains all the examples from the training and validation partitions.

    There is no attribute information provided to avoid biasing the feature selection process.
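
    For illustration, a minimal scikit-learn sketch (not the original generator) of producing a MADELON-like dataset; make_classification is adapted from this construction, and the sample and feature counts below are taken from the MADELON challenge description rather than this page:

    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=2600,              # roughly the training + validation examples
        n_features=500,              # 5 informative + 15 redundant + probe features
        n_informative=5,
        n_redundant=15,
        n_clusters_per_class=16,     # 2 classes x 16 clusters = 32 clusters
        shuffle=True,                # randomize the order of features and patterns
        random_state=0,
    )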

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

  6. WD50K

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). WD50K [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4036498?locale=lv
    Explore at:
    unknown (7080916 bytes); available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WD50K dataset: a hyper-relational dataset derived from Wikidata statements. The dataset is constructed by the following procedure, based on the Wikidata RDF dump of August 2019:
    • A set of seed nodes corresponding to entities from FB15K-237 having a direct mapping in Wikidata (P646 "Freebase ID") is extracted from the dump.
    • For each seed node, all statements whose main object and qualifier values correspond to wikibase:Item are extracted from the dump.
    • All literals are filtered out from the qualifiers of the statements obtained above.
    • All entities from the dataset which have fewer than two mentions are dropped. The statements corresponding to the dropped entities are also dropped.
    • The remaining statements are randomly split into the train, test, and validation sets.
    • All statements from the train and validation sets that share the same main triple (s,p,o) with test statements are removed.
    • WD50k_33, WD50k_66 and WD50k_100 are then sampled from the above statements. Here 33, 66 and 100 indicate the approximate percentage of hyper-relational facts (statements with qualifiers) in the dataset.

    The table below provides some basic statistics of the dataset and its three further variations:

    | Dataset | Statements | w/Quals (%) | Entities | Relations | E only in Quals | R only in Quals | Train | Valid | Test |
    |-------------|------------|----------------|----------|-----------|-----------------|-----------------|---------|--------|--------|
    | WD50K | 236,507 | 32,167 (13.6%) | 47,156 | 532 | 5460 | 45 | 166,435 | 23,913 | 46,159 |
    | WD50K (33) | 102,107 | 31,866 (31.2%) | 38,124 | 475 | 6463 | 47 | 73,406 | 10,668 | 18,133 |
    | WD50K (66) | 49,167 | 31,696 (64.5%) | 27,347 | 494 | 7167 | 53 | 35,968 | 5,154 | 8,045 |
    | WD50K (100) | 31,314 | 31,314 (100%) | 18,792 | 279 | 7862 | 75 | 22,738 | 3,279 | 5,297 |

    When using the dataset, please cite:

    @inproceedings{StarE,
      title={Message Passing for Hyper-Relational Knowledge Graphs},
      author={Galkin, Mikhail and Trivedi, Priyansh and Maheshwari, Gaurav and Usbeck, Ricardo and Lehmann, Jens},
      booktitle={EMNLP},
      year={2020}
    }

    For any further questions, please contact: mikhail.galkin@iais.fraunhofer.de

  7. Probing Datasets for Noisy Texts

    • federation.figshare.com
    • researchdata.edu.au
    Updated Mar 14, 2021
    Cite
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
    Explore at:
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Federation University Australia
    Authors
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Probing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model's performance will vary depending on the richness of the linguistic properties crammed into the representation.

    This dataset contains five new probing datasets consisting of noisy texts (Tweets), which can serve as a benchmark for researchers studying the linguistic characteristics of unstructured and noisy texts.

    File structure

    Format: a tab-separated text file
    Column 1: train/test/validation split (tr-train, te-test, va-validation)
    Column 2: class label (refer to the Contents section for the class labels of each task file)
    Column 3: Tweet message (text)
    Column 4: a unique ID

    Contents

    sent_len.tsv: In this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on length; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called "SentLen" in the paper.

    word_content.tsv: We consider a 10-way classification task with 10 words as targets, considering the available manually annotated instances. The task is to predict which of the target words appears in the given sentence. We have considered only words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary, ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as "WC" in the paper.

    bigram_shift.tsv: The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as "BShift" in the paper.

    tree_depth.tsv: The Tree Depth task evaluates the encoded sentence's ability to capture hierarchical structure by asking the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as "TreeDepth" in the paper.

    odd_man_out.tsv: The Tweets are modified by replacing a random noun or verb o with another noun or verb r. The task of the classifier is to identify whether the sentence was modified by this change. Class label O refers to unmodified sentences, while C refers to modified sentences. The task is called "SOMO" in the paper.
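
    For illustration, a minimal pandas sketch (the column names assigned here are hypothetical; the column order follows the file structure above) for loading one task file and recovering its splits:

    import pandas as pd

    cols = ["split", "label", "tweet", "id"]
    df = pd.read_csv("bigram_shift.tsv", sep="\t", names=cols, quoting=3)   # quoting=3: treat quotes literally

    train = df[df["split"] == "tr"]
    test = df[df["split"] == "te"]
    val = df[df["split"] == "va"]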

  8. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Using one of the test files listed above as an example
    with h5py.File('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/x_test');
    y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
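
    As a usage illustration on top of the loading and permutation code above (this wrapper is an assumption, not part of the dataset distribution), the arrays can be wrapped into PyTorch datasets:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # Scale intensities to [0, 1] and pair images with labels.
    train_ds = TensorDataset(torch.from_numpy(x_train / 255.0), torch.from_numpy(y_train).long())
    train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)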

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  9. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order for all test images to have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Using one of the test files listed above as an example
    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');
    y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  10. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Explore at:
    zip (121472 bytes); available download formats
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose.

    Steps involved:

    1) Read the csv file.

    2) Data cleaning:
    • Variables Country and Status had character data types and had to be converted to factor.
    • 2563 missing values were encountered, with the Population variable having the most missing values, i.e., 652.
    • Rows with missing values were dropped before running the analysis.

    3) Run linear regression:
    • Before running linear regression, 3 variables were dropped as they were not found to have much of an effect on the dependent variable, Life Expectancy. These 3 variables were Country, Year and Status. This means we are now working with 19 variables (1 dependent and 18 independent variables).
    • We run the linear regression. Multiple R squared is 83%, which means that the independent variables can explain 83% of the variance in the dependent variable.
    • OUTLIER DETECTION: We check for outliers using the IQR and find 54 outliers. These outliers are removed before we run the regression analysis once again. Multiple R squared increases from 83% to 86%.
    • MULTICOLLINEARITY: We check for multicollinearity using VIF (Variance Inflation Factor), since two or more independent variables may show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables with a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and under-five deaths have strong collinearity, so we drop Infant.deaths (which has the higher VIF value). A minimal sketch of the VIF computation is shown after this list.
    • When we run the linear regression model again, the VIF value of Under.five.deaths goes down from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. Variable thinness1.19 is then dropped and we run the regression once more.
    • Variable thinness5.9, whose absolute VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
    • SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA: We fit on the train data and get a multiple R squared of 86% and a p value less than alpha, which indicates statistical significance. We use the model trained on the train data to predict the test data and compute the RMSE and MAPE, using library(Metrics) for this purpose.
    • In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of 96.20% (1 - 0.037).
    • MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
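
    A minimal sketch of the VIF check described above, using Python/statsmodels rather than the original R workflow (the file path and exact column names are assumptions):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools import add_constant

    df = pd.read_csv("life_expectancy.csv").dropna()
    X = add_constant(df.drop(columns=["Country", "Year", "Status", "Life.expectancy"]))

    # VIF for each independent variable (index 0 is the constant term, so skip it)
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
        index=X.columns[1:],
    )
    print(vif[vif > 5])    # candidates for removal under the "VIF > 5" rule of thumb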

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5 fold cross validation method with CP (complexity parameter) being 0.01.
    • In the Decision Tree model, RMSE (Root Mean Squared Error) is 3.06. This indicates that, on average, the predicted values have an error of 3.06 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates a prediction accuracy of 96.45% (1 - 0.035).
    • MAE (Mean Absolute Error) is 2.35. This indicates that, on average, the predicted values deviate by approximately 2.35 years from the actual values.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most and least significant. Income composition is the most important, followed by adult mortality; the least relevant independent variable is Population.
    • Predict Life expectancy through random forest model.
    • In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that, on average, the predicted values have an error of 1.73 years compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01. This indicates a prediction accuracy of 98.27% (1 - 0.01).
    • MAE (Mean Absolute Error) is 1.14. This indicates that, on average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.

  11. Data and script pipeline for: Common to rare transfer learning (CORAL)...

    • zenodo.org
    bin, html
    Updated Mar 3, 2025
    Cite
    Otso Ovaskainen; Otso Ovaskainen (2025). Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods [Dataset]. http://doi.org/10.5281/zenodo.14962497
    Explore at:
    bin, html (available download formats)
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Otso Ovaskainen; Otso Ovaskainen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are intended as user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.

    System requirements

    · The software can be used in any operating system where R can be installed.

    · We have developed and tested the software in a windows environment with R version 4.3.1.

    · Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).

    · The use of the software does not require any non-standard hardware.

    Installation guide

    · The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than installing R.

    Demo 1: Software demo with simulated data

    The software demonstration consists of two R-markdown files:

    · D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for the data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to environmental predictors).

    · D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models compared to baseline models, and the ability of CORAL to estimate the true values used for data generation.

    Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.

    Demo 2: Software demo with a small subset of the data used in the paper

    The software demonstration consists of one R-markdown file:

    MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence of at least 40 and less than 50, and common species as those with prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.

    Scripts and data for reproducing the results presented in the paper (Demo 3)

    The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:

    · S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.

    · S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.

    · S03_import_posterior – imports the posterior distributions sampled by the initial model.

    · S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.

    · S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.

    · S06_construct_coral_priors – calculate CORAL prior parameters.

    The remaining scripts evaluate the model:

    · S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.

    · S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.

    · S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = ! training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.

    · S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.

    · S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.

    · S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.

    · S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.

    Pre-processing scripts:

    · P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.

    · P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.

    · P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).

    Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.

  12. Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...

    • zenodo.org
    bin, csv, pdf
    Updated Jan 10, 2025
    Cite
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick (2025). Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection" [Dataset]. http://doi.org/10.5281/zenodo.14614218
    Explore at:
    bin, pdf, csv (available download formats)
    Dataset updated
    Jan 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

    Summary

    Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916

    This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

    • spectf_cloud_labelbox.hdf5
      • 1,841,641 Labeled spectra from 221 EMIT Scenes.
    • spectf_cloud_mmgis.hdf5
      • 1,733,801 Labeled spectra from 313 EMIT Scenes.
      • These scenes were specifically labeled to correct false detections by an earlier version of the model.
    • train_fids.csv
      • 465 EMIT scenes comprising the training set.
    • test_fids.csv
      • 69 EMIT scenes comprising the held-out validation set.

    v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and Falsecolor, their existing baseline cloud masks, as well as their cloud masks produced by the ANN and GBT reference models and the SpecTf model.

    Data Description

    221 EMIT scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per class per scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.

    The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

    Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a Pytorch dataloader.

    Each hdf5 file contains the following arrays:

    • 'spectra'
    • 'fids'
      • The FID from which each spectrum was sampled
      • Binary string of shape (n,)
    • 'indices'
      • The (col, row) index from which each spectrum was sampled
      • Int64 of shape (n, 2)
    • 'labels'
      • Annotation label of each spectrum
        • 0 - "Clear"
        • 1 - "Cloud"
        • 2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)
          • label[label==2] = 0
      • Int64 of shape (n,2)

    Each hdf5 file contains the following attribute:

    • 'bands'
      • The band center wavelengths (nm) of the spectrum
      • Float64 of shape (268,)
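
    For illustration, a minimal h5py sketch (reading the arrays and attribute listed above; the class merge follows the note for the Labelbox file):

    import h5py
    import numpy as np

    with h5py.File("spectf_cloud_labelbox.hdf5", "r") as f:
        spectra = np.array(f["spectra"])      # labeled spectra
        labels = np.array(f["labels"])        # 0 = clear, 1 = cloud, 2 = cloud shadow
        fids = np.array(f["fids"])            # source EMIT scene FID per spectrum
        bands = np.array(f.attrs["bands"])    # band center wavelengths (nm)

    labels[labels == 2] = 0                   # merge cloud shadow into the clear class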

    Acknowledgements

    The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

    This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    © 2024 California Institute of Technology. Government sponsorship acknowledged.

  13. Glaucoma Dataset: EyePACS AIROGS - Light

    • kaggle.com
    zip
    Updated May 23, 2023
    Cite
    Riley Kiefer (2023). Glaucoma Dataset: EyePACS AIROGS - Light [Dataset]. https://www.kaggle.com/deathtrooper/eyepacs-airogs-light
    Explore at:
    zip (316720067 bytes); available download formats
    Dataset updated
    May 23, 2023
    Authors
    Riley Kiefer
    Description

    This is a machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] train set. The dataset is split into training, validation, and test folders, which contain 2500, 270, and 500 fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG).

    This dataset has been updated with more training samples and general improvements: https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2

    Three versions of the same dataset are available, with different standardization strategies:
    1. RAW - resizing the source image to 256x256 pixels.
    2. PAD - padding the source image to a square image and then resizing it to 256x256 pixels. This method preserves the aspect ratio, but the resultant image contains less usable information.
    3. CROP - cropping the black background in the fundus image, padding the resultant image to create a square image, and then resizing to 256x256 pixels. This method preserves the aspect ratio, and the resultant image contains the most usable information.
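
    For illustration only, a rough NumPy/PIL sketch of CROP-style preprocessing; a simple fixed intensity threshold is assumed here, which is not the published dynamic global foreground threshold algorithm:

    import numpy as np
    from PIL import Image

    def crop_pad_resize(path, size=256, thresh=10):
        img = np.array(Image.open(path).convert("RGB"))
        mask = img.max(axis=2) > thresh                   # foreground: pixels brighter than the threshold
        rows, cols = np.where(mask)
        img = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]   # crop away the black background
        h, w = img.shape[:2]
        side = max(h, w)
        canvas = np.zeros((side, side, 3), dtype=img.dtype)               # pad to a square
        top, left = (side - h) // 2, (side - w) // 2
        canvas[top:top + h, left:left + w] = img
        return Image.fromarray(canvas).resize((size, size))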

    Please review the ablation study to review the impact of the standardization method on the model performance: https://www.kaggle.com/code/deathtrooper/glaucoma-standardization-ablation-study

    Please see the code tab for glaucoma detection benchmark progress. The top-performing model has been made by KEREM KARABACAK with a test accuracy of 93.5%.

    This work has been published in the IEEE ICIVC 2023 conference: "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm" by Riley Kiefer, Muhammad Abid, Mahsa Raeisi Ardali, Jessica Steen, and Ehsan Amjadian. Learn more about how the algorithm created this dataset here: https://ieeexplore.ieee.org/abstract/document/10270429

    [1] EyePACS-AIROGS; https://zenodo.org/record/5793241

    Citation

    Please cite at least the first work in academic publications:
    1. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    2. R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    3. R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.

    Please also see the following optometry abstract publications:
    1. A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection; AAO 2022; https://aaopt.org/past-meeting-abstract-archives/?SortBy=ArticleYear&ArticleType=&ArticleYear=2022&Title=&Abstract=&Authors=&Affiliation=&PROGRAMNUMBER=225129
    2. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2790420
    3. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2791017

    Please also see the DOI citations for this and related datasets:
    1. SMDG: @dataset{smdg, title={SMDG, A Standardized Fundus Glaucoma Dataset}, url={https://www.kaggle.com/ds/2329670}, DOI={10.34740/KAGGLE/DS/2329670}, publisher={Kaggle}, author={Riley Kiefer}, year={2023}}
    2. EyePACS-light-v1: @dataset{eyepacs-light-v1, title={Glaucoma Dataset: EyePACS AIROGS - Light}, url={https://www.kaggle.com/ds/3222646}, DOI={10.34740/KAGGLE/DS/3222646}, publisher={Kaggle}, author={Riley Kiefer}, year={2023}}
    3. EyePACS-light-v2: @dataset{eyepacs-light-v2, title={Glaucoma Dataset: EyePACS-AIROGS-light-V2}, url={https://www.kaggle.com/dsv/7300206}, DOI={10.34740/KAGGLE/DSV/7300206}, publisher={Kaggle}, author={Riley Kiefer}, year={2023}}

  14. UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA)

    • zenodo.org
    bin, zip
    Updated Dec 11, 2023
    Cite
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth (2023). UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA) [Dataset]. http://doi.org/10.5281/zenodo.6476639
    Available download formats: zip, bin
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth
    Description

    Introduction

    Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics making them more expensive, less portable, and with a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.

    We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.

    In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".

    Fundus Imaging

    We include 45-degree macula-centered fundus images that cover both the macula and the optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.

    The full images are available at the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/fundus/disc and cropped/fundus/macula.

    Enface OCT-A

    We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.

    En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images in which only vessels are included; this denoising was performed automatically using the ROSE algorithm (Ma et al. 2021). The results can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh: the former contains the probabilities produced by the ROSE algorithm, the latter a binary map.

    Synthetic OCT-A

    We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.

    The full images are available at the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we applied the same ROSE denoising algorithm (Ma et al. 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh: the former contains the probabilities produced by the ROSE algorithm, the latter a binary map.

    Other Fundus Vessel Segmentations Included

    In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-UNet (Guo et al., 2021) and IterNet (Li et al., 2020).

    • SA-Unet. The full images are available at the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.

    • IterNet. The full images are available at the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/Iternet/disc and cropped/Iternet/macula.

    Train/Validation/Test Replication

    In order to replicate or compare your model to the results of our paper, we report below the data split used (a minimal Python sketch of this split follows the list).

    • Training subjects IDs: 1 - 25

    • Validation subjects IDs: 26 - 30

    • Testing subjects IDs: 31 - 112
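
    A minimal sketch of the subject-level split reported above; mapping each subject ID to its image files (e.g., under cropped/fundus/macula) is left as an assumption about the folder layout.

      # Subject IDs 1-112, grouped as reported in the paper.
      train_ids = set(range(1, 26))    # subjects 1-25
      val_ids = set(range(26, 31))     # subjects 26-30
      test_ids = set(range(31, 113))   # subjects 31-112

      def split_of(subject_id: int) -> str:
          """Return the split name for a given subject ID."""
          if subject_id in train_ids:
              return "train"
          if subject_id in val_ids:
              return "val"
          return "test"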

    Data Acquisition

    This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.

    User Agreement

    The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication, the following paper needs to be cited:

    
    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    

    Funding

    This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.

    Research Team and Acknowledgements

    Here are the people behind this data acquisition effort:

    Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo

    We would also like to acknowledge for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at University of Dundee, UK and Memorial Hermann Hospital System.

    References

    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    
    
    C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
    
    L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
    
    Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
    
  15. Data from: A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image...

    • zenodo.org
    • data-staging.niaid.nih.gov
    zip
    Updated Feb 1, 2024
    Cite
    Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran (2024). A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (2/6) [Dataset]. http://doi.org/10.5281/zenodo.10591473
    Available download formats: zip
    Dataset updated
    Feb 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roser Viñals; Roser Viñals; Jean-Philippe Thiran; Jean-Philippe Thiran
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.

    Due to data size restriction, the dataset has been divided into six subdatasets, each one published into a separate entry in Zenodo. This repository contains subdataset 2.

    Structure

    In Vivo Data

    • Number of Acquisitions: 20,000

    • Volunteers: Nine volunteers

    • File Structure: Each volunteer's data is compressed in a separate zip file.

      • Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
    • Regions :

      • Abdomen: 6599 acquisitions
      • Neck: 3294 acquisitions
      • Breast: 3291 acquisitions
      • Lower limbs: 2616 acquisitions
      • Upper limbs: 2110 acquisitions
      • Back: 2090 acquisitions
    • File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.

    In Vitro Data

    • Number of Acquisitions: 32 from CIRS model 054G phantom
    • File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
    • File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.

    CSV Files

    Two CSV files are provided:

    • invivo_dataset.csv :

      • Contains a list of all in vivo acquisitions.
      • Columns: id, path, volunteer id, body region.
    • invitro_dataset.csv :

      • Contains a list of all in vitro acquisitions.
      • Columns: id, path

    Zenodo dataset splits and files

    The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 2nd split.

    File name | Size | Zenodo subdataset number
    invivo_dataset.csv | 995.9 kB | 1
    invitro_dataset.csv | 1.1 kB | 1
    cirs-phantom.zip | 418.2 MB | 1
    volunteer-1-lowerLimbs.zip | 29.7 GB | 1
    volunteer-1-carotids.zip | 8.8 GB | 1
    volunteer-1-back.zip | 7.1 GB | 1
    volunteer-1-abdomen.zip | 34.0 GB | 2
    volunteer-1-breast.zip | 15.7 GB | 2
    volunteer-1-upperLimbs.zip | 25.0 GB | 3
    volunteer-2.zip | 26.5 GB | 4
    volunteer-3.zip | 20.3 GB | 3
    volunteer-4.zip | 24.1 GB | 5
    volunteer-5.zip | 6.5 GB | 5
    volunteer-6.zip | 11.5 GB | 5
    volunteer-7.zip | 11.1 GB | 6
    volunteer-8.zip | 21.2 GB | 6
    volunteer-9.zip | 23.2 GB | 4

    Normalized RF Images

    • Beamforming:

      • Depth from 1 mm to 55 mm

      • Width spanning the probe aperture

      • Grid: 𝜆/8 × 𝜆/8

      • Resulting images shape: 1483 × 1189

      • Two beamformed RF images from each acquisition:

        • Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
        • Target image: coherently compounded image from 87 PWs acquisitions steered at different angles (acquisition-xxxx-87PWs)
    • Normalization:

      • The two RF images have been normalized
    • To display the images (see the sketch after this list):

      • Perform the envelope detection (to obtain the IQ images)
      • Log-compress (to obtain the B-mode images)
    • File Format: Saved in npy format, loadable using Python and numpy.load(file).
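
    As a rough illustration of that display pipeline (not necessarily the authors' exact processing), the sketch below loads one normalized RF image and converts it to a B-mode image; the filename and the 60 dB dynamic range are placeholders.

      import numpy as np
      from scipy.signal import hilbert

      # Load one beamformed RF image (placeholder filename), shape (1483, 1189).
      rf = np.load("acquisition-12397-1PW.npy")

      # Envelope detection along the depth axis gives the envelope (IQ magnitude) image.
      envelope = np.abs(hilbert(rf, axis=0))

      # Log-compression to B-mode, normalized to the maximum and clipped to a 60 dB range.
      bmode = 20.0 * np.log10(envelope / envelope.max() + 1e-12)
      bmode = np.clip(bmode, -60.0, 0.0)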

    Training and Validation Split in the paper

    For the volunteer-based split used in the paper:

    • Training set: volunteers 1, 2, 3, 6, 7, 9
    • Validation set: volunteer 4
    • Test set: volunteers 5, 8
    • Images analyzed in the paper
      • Carotid acquisition (from volunteer 5): acquisition_12397
      • Back acquisition (from volunteer 8): acquisition_19764
      • In vitro acquisition: invitro-00030

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Please cite the original paper when using this dataset:

    Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256

    Contact

    For inquiries or issues related to this dataset, please contact:

    • Name: Roser Viñals
    • Email: roser.vinalsterres@epfl.ch
  16. HaDR: Dataset for hands instance segmentation

    • kaggle.com
    zip
    Updated Mar 7, 2023
    Cite
    Ales Vysocky (2023). HaDR: Dataset for hands instance segmentation [Dataset]. https://www.kaggle.com/datasets/alevysock/hadr-dataset-for-hands-instance-segmentation
    Available download formats: zip (10662295286 bytes)
    Dataset updated
    Mar 7, 2023
    Authors
    Ales Vysocky
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    If you use this dataset for your work, please cite the related papers: A. Vysocky, S. Grushko, T. Spurny, R. Pastor and T. Kot, Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localisation, in IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3206948.

    S. Grushko, A. Vysocký, J. Chlebek, P. Prokop, HaDR: Applying Domain Randomization for Generating Synthetic Multimodal Dataset for Hand Instance Segmentation in Cluttered Industrial Environments. preprint in arXiv, 2023, https://doi.org/10.48550/arXiv.2304.05826

    The HaDR dataset is a multimodal dataset designed for human-robot gesture-based interaction research, consisting of RGB and depth frames with binary masks for each hand instance (i1, i2, single-class data). The dataset is entirely synthetic, generated using the Domain Randomization technique in CoppeliaSim 3D. It can be used to train deep learning models to recognize hands using either a single modality (RGB or depth) or both simultaneously. The training and validation splits comprise 95K and 22K samples, respectively, with annotations provided in COCO format. The instances are uniformly distributed across the image boundaries. The vision sensor captures depth and color images of the scene, with the depth pixel values scaled into a single-channel 8-bit grayscale image over the range [0.2, 1.0] m.

    The following aspects of the scene were randomly varied during generation of the dataset:
    • Number, colors, textures, scales and types of distractor objects, selected from a set of 3D models of general tools and geometric primitives; a special type of distractor is an articulated dummy without hands (for instance-free samples).
    • Hand gestures (9 options).
    • Hand models' positions and orientations.
    • Texture and surface properties (diffuse, specular and emissive) and number (from none to 2) of the object of interest, as well as its background.
    • Number and locations of directional light sources (from 1 to 4), in addition to a planar light for ambient illumination.

    The sample resolution is 320×256, encoded in lossless PNG format, and samples contain only right-hand meshes (we suggest using flip augmentations during training), with a maximum of two instances per sample.

    Test dataset (real camera images): A test dataset containing 706 images was captured using a real RGB-D camera (RealSense L515) in a cluttered and unstructured industrial environment. It comprises various scenarios with diverse lighting conditions, backgrounds, obstacles, numbers of hands, and different types of work gloves (red, green, white, yellow, no gloves) with varying sleeve lengths. The dataset is assumed to have only one user, and the maximum number of hand instances per sample is limited to two. The data were manually labelled, and we provide hand instance segmentation COCO annotations in instances_hands_full.json (separately for train and val) and full arm instance annotations in instances_arms_full.json. The sample resolution is 640×480, and depth images are encoded in the same way as those of the synthetic dataset.

    Channel-wise normalization and standardization parameters for the datasets (a minimal usage sketch follows the table):

    Dataset | Mean (R, G, B, D) | STD (R, G, B, D)
    Train | 98.173, 95.456, 93.858, 55.872 | 67.539, 67.194, 67.796, 47.284
    Validation | 99.321, 97.284, 96.318, 58.189 | 67.814, 67.518, 67.576, 47.186
    Test | 123.675, 116.28, 103.53, 35.3792 | 58.395, 57.12, 57.375, 45.978
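
    As a usage illustration (not part of the dataset itself), the sketch below applies the training-split statistics from the table to a 4-channel RGB-D sample; the (H, W, 4) array layout is an assumption.

      import numpy as np

      # Per-channel statistics from the table above (R, G, B, D), training split.
      TRAIN_MEAN = np.array([98.173, 95.456, 93.858, 55.872], dtype=np.float32)
      TRAIN_STD = np.array([67.539, 67.194, 67.796, 47.284], dtype=np.float32)

      def normalize_rgbd(sample: np.ndarray) -> np.ndarray:
          """Channel-wise standardization of an (H, W, 4) RGB-D sample in [0, 255]."""
          return (sample.astype(np.float32) - TRAIN_MEAN) / TRAIN_STD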

  17. Data from: SynDroneVision: A Synthetic Dataset for Image-Based Drone...

    • data-staging.niaid.nih.gov
    Updated Nov 13, 2024
    Cite
    Lenhard, Tamara R.; Weinmann, Andreas; Franke, Kai; Koch, Tobias (2024). SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13360115
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
    Darmstadt University of Applied Sciences
    Authors
    Lenhard, Tamara R.; Weinmann, Andreas; Franke, Kai; Koch, Tobias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Developing robust drone detection systems is often constrained by the limited availability of large-scale annotated training data and the high costs associated with real-world data collection. However, synthetic data presents a promising and cost-effective solution to overcome this issue. Therefore, we present SynDroneVision, a synthetic dataset specifically designed for RGB-based drone detection in surveillance applications. Featuring diverse backgrounds, lighting conditions, and drone models, SynDroneVision offers a comprehensive training foundation for deep learning algorithms. To evaluate the dataset's effectiveness, we perform a comparative analysis across a selection of recent YOLO detection models. Our findings demonstrated that SynDroneVision is a valuable resource for real-world data enrichment, achieving notable enhancements in model performance and robustness, while significantly reducing the time and costs of real-world data acquisition.

    Paper

    Accepted for publication at the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2025)!

    SynDroneVision is presented in the upcoming paper SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection by Tamara R. Lenhard, Andreas Weinmann, Kai Franke, and Tobias Koch. This work is accepted and will be published in the Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2025).

    For early access, the preprint is currently available on ArXiv: https://arxiv.org/abs/2411.05633v1

    Dataset Details

    SynDroneVision comprises a total of 140,038 annotated RGB images (131,238 for training, 8,800 for validation, and 4,000 for testing) with a resolution of 2560x1489 pixels. All images are recorded sequentially using Unreal Engine 5.0 in combination with Colosseum. Apart from drone images, SynDroneVision also includes ~7% background images (i.e., image frames without drone instances).

    Annotation Format: Annotations (bounding boxes) are provided as text files in the standard YOLO format, with one line per object:

      class x_center y_center width height

    Here, x_center and y_center represent the normalized coordinates of the bounding box center, while width and height denote the normalized bounding box width and height. In SynDroneVision, class is always 0, indicating the drone class (a minimal parsing sketch follows).
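
    A minimal parsing sketch for these YOLO-format label files; the example path is a placeholder.

      from pathlib import Path

      def load_yolo_boxes(label_path: str):
          """Parse one YOLO label file into (class_id, xc, yc, w, h) tuples, all normalized to [0, 1]."""
          boxes = []
          for line in Path(label_path).read_text().splitlines():
              if not line.strip():
                  continue
              cls, xc, yc, w, h = line.split()
              boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
          return boxes

      # Example with a placeholder path:
      # boxes = load_yolo_boxes("labels/train/seq001_frame0001.txt")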

    Download

    The SynDroneVision dataset offers around 900 GB of data dedicated to image-based drone detection. To facilitate the download process, we have partitioned the dataset into smaller sections. Specifically, we have divided the training data into 10 segments, organized by sequences.

    Annotations are available below, with image data accessible via the following links:

    Dataset Split | Sequences | File Name | Link | Size (GB)
    Training Set | Seq. 001 - 009 | images_train_seq001-009.zip | Training images PART 1 | 57
    Training Set | Seq. 010 - 018 | images_train_seq010-018.zip | Training images PART 2 | 95.4
    Training Set | Seq. 019 - 027 | images_train_seq019-027.zip | Training images PART 3 | 96.2
    Training Set | Seq. 028 - 035 | images_train_seq028-035.zip | Training images PART 4 | 83.9
    Training Set | Seq. 036 - 043 | images_train_seq036-043.zip | Training images PART 5 | 77.1
    Training Set | Seq. 044 - 050 | images_train_seq044-050.zip | Training images PART 6 | 84.7
    Training Set | Seq. 051 - 056 | images_train_seq051-056.zip | Training images PART 7 | 86.8
    Training Set | Seq. 057 - 065 | images_train_seq057-065.zip | Training images PART 8 | 86.2
    Training Set | Seq. 066 - 070 | images_train_seq066-070.zip | Training images PART 9 | 75.7
    Training Set | Seq. 071 - 073 | images_train_seq071-073.zip | Training images PART 10 | 38.5
    Validation Set | Seq. 001 - 073 | images_val.zip | Validation images | 55.2
    Test Set | Seq. 001 - 073 | images_test.zip | Test images | 26.5

    Citation

    If you find SynDroneVision helpful in your research, we kindly ask that you cite the associated preprint. Below is the citation in BibTeX format for your convenience:

    BibTeX:

    @inproceedings{Lenhard:2024, title={{SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection}}, author={Lenhard, Tamara R. and Weinmann, Andreas and Franke, Kai and Koch, Tobias}, year={2024}, url={https://arxiv.org/abs/2411.05633}}

    SynDroneVision uses Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the United States of America and elsewhere.

  18. Data from: Tango Spacecraft Dataset for Region of Interest Estimation and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 23, 2023
    Cite
    Bechini Michele; Lunghi Paolo; Lavagna Michèle (2023). Tango Spacecraft Dataset for Region of Interest Estimation and Semantic Segmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6507863
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    Politecnico di Milano
    Authors
    Bechini Michele; Lunghi Paolo; Lavagna Michèle
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Reference Paper:

    M. Bechini, M. Lavagna, P. Lunghi, Dataset generation and validation for spacecraft pose estimation via monocular images processing, Acta Astronautica 204 (2023) 358–369

    M. Bechini, P. Lunghi, M. Lavagna. "Spacecraft Pose Estimation via Monocular Image Processing: Dataset Generation and Validation". In 9th European Conference for Aeronautics and Aerospace Sciences (EUCASS)

    General Description:

    The "Tango Spacecraft Dataset for Region of Interest Estimation and Semantic Segmentation" dataset published here should be used for Region of Interest (ROI) and/or semantic segmentation tasks. It is split into 30002 train images and 3002 test images representing the Tango spacecraft from the Prisma mission, and is, to our knowledge, the largest publicly available dataset of synthetic space-borne noise-free images tailored to ROI extraction and semantic segmentation tasks. For the bounding box annotations, the label of each image gives the filename of the image, the ROI top-left corner (minimum x, minimum y) in pixels, the ROI bottom-right corner (maximum x, maximum y) in pixels, and the center point of the ROI in pixels. The annotations are expressed in the image reference frame, with the origin located at the top-left corner of the image, positive x rightward and positive y downward. Concerning semantic segmentation, RGB masks are provided; each RGB mask corresponds to a single image in either the train or the test set. The RGB masks are such that the R channel corresponds to the spacecraft, the G channel corresponds to the Earth (if present), and the B channel corresponds to the background (deep space). In each channel, the pixels have non-zero values only where the corresponding object (Tango, Earth, deep space) appears. More information on the dataset split and the label format is reported below.

    Images Information:

    The dataset comprises 30002 synthetic grayscale images of the Tango spacecraft from the Prisma mission that serve as the train set, while the test set is formed by 3002 synthetic grayscale images of the Tango spacecraft in PNG format. About 1/6 of the images in both the train and the test set have a non-black background, obtained by rendering an Earth-like model in the ray-tracing process used to generate the images. The images are noise-free to increase the flexibility of the dataset. The illumination direction of the spacecraft in the scene is uniformly distributed in 3D space, in agreement with the Sun position constraints.

    Labels Information:

    Labels for bounding box extraction are provided in separate JSON files, formatted per image as in the following example:

    filename : tango_img_1   # name of the image to which the data refer
    roi_tl   : [x, y]        # ROI top-left corner (minimum x, minimum y) in pixels
    roi_br   : [x, y]        # ROI bottom-right corner (maximum x, maximum y) in pixels
    roi_cc   : [x, y]        # center point of the ROI in pixels

    Notice that the annotations are expressed in the image reference frame, with the origin located at the top-left corner of the image, positive x rightward and positive y downward. To make the dataset easier to use, both the training set and the test set are split into two folders, containing the images with the Earth as background and without background, respectively.

    Concerning the semantic segmentation labels, they are provided as RGB masks named "filename_mask.png", where "filename" is the filename of the training or test image to which a specific mask refers. The RGB masks are such that the R channel corresponds to the spacecraft, the G channel corresponds to the Earth (if present), and the B channel corresponds to the background (deep space). In each channel, the pixels have non-zero values only where the corresponding object (Tango, Earth, deep space) appears (see the sketch below).
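
    As an illustration of how these masks might be consumed (an assumption about downstream use, not part of the dataset), the sketch below converts an RGB mask into a single-channel class map with 0 = background, 1 = spacecraft, 2 = Earth; the filename is a placeholder.

      import numpy as np
      from PIL import Image

      def rgb_mask_to_classes(mask_path: str) -> np.ndarray:
          """Convert an RGB mask (R=spacecraft, G=Earth, B=deep space) to a class-index map."""
          mask = np.asarray(Image.open(mask_path).convert("RGB"))
          classes = np.zeros(mask.shape[:2], dtype=np.uint8)   # 0 = background (deep space)
          classes[mask[..., 0] > 0] = 1                        # spacecraft from the R channel
          classes[mask[..., 1] > 0] = 2                        # Earth from the G channel
          return classes

      # Example with a placeholder filename:
      # cls = rgb_mask_to_classes("tango_img_1_mask.png")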

    VERSION CONTROL

    v1.0: This version contains the dataset (both train and test) of full-scale images with ROI annotations and RGB masks for semantic segmentation tasks. These images have width = height = 1024 pixels. The position of Tango with respect to the camera is randomly selected from a uniform distribution, but full visibility of the spacecraft is ensured in all the images.

    Note: this dataset contains the same images as the "Tango Spacecraft Wireframe Dataset Model for Line Segments Detection" v2.0 full-scale (DOI: https://doi.org/10.5281/zenodo.6372848) and the "Tango Spacecraft Dataset for Monocular Pose Estimation" v1.0 (DOI: https://doi.org/10.5281/zenodo.6499007), and the three can be used together by combining the relative pose annotations, the reprojected Tango wireframe model annotations, and the ROI annotations. Together, these three datasets form, to our knowledge, the most comprehensive collection of space-borne synthetic images published to date.

  19. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Available download formats: zip (629100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: To predict the prices of houses in the City of Melbourne.
    • Approach: Decision Tree and Random Forest in R (a rough Python/scikit-learn analogue of this workflow is sketched after the list).
    • Data Cleaning:
    • The Date column is read as a character vector and is converted to a date vector using the 'lubridate' library.
    • We create a new column called 'age', since the age of a house can be a factor in its price: we extract the year from the 'Date' column and take its difference with the 'Year Built' column.
    • We remove 11566 records which have missing values.
    • We drop columns which are not significant, such as 'X', 'suburb', 'address' (we keep 'zipcode' as it serves in place of suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area', 'Region Name'.
    • We split the data into 'train' and 'test' in an 80/20 ratio using the sample function.
    • We load the libraries 'rpart', 'rpart.plot', 'rattle', and 'RColorBrewer'.
    • We run a decision tree using the rpart function, with 'Price' as the dependent variable.
    • The average price for 5464 houses is $1084349.
    • Where the building area is less than 200.5, the average price for 4582 houses is $931445. Where the building area is less than 200.5 and the age of the building is less than 67.5 years, the average price for 3385 houses is $799299.6.
    • The highest average price, $4801538, is for 13 houses where the distance is lower than 5.35 and the building area is greater than 280.5 (decision tree plot).
    • We use the caret package to tune the complexity parameter; the optimal value found is 0.01, with RMSE 445197.9.
    • Using the 'Metrics' library, the decision tree gives RMSE $392107, MAPE 0.297 (an average error of about 29.7%), and MAE $272015.4.
    • The variables 'postcode', longitude, and building area are the most important variables.
    • test$Price gives the actual price and test$predicted the predicted price for six particular houses.
    • We fit a random forest on the train data with the default parameters.
    • The variable importance plot indicates that 'Building Area', 'Age of the house', and 'Distance' are the most important variables affecting the price of a house.
    • With the default parameters, RMSE is $250426.2, MAPE is 0.147 (an average error of about 14.7%), and MAE is $151657.7.
    • The error plot shows that the error becomes roughly constant between 100 and 200 trees, with almost no further reduction afterwards, so we can choose ntree = 200.
    • We tune the model and find that mtry = 3 gives the lowest out-of-bag error.
    • We use the caret package with a 5-fold cross-validation technique: RMSE is $252216.10, MAPE is 0.146 (an average error of about 14.6%), and MAE is $151669.4.
    • We can conclude that Random Forest gives more accurate results than the Decision Tree.
    • In Random Forest, the default parameters (ntree = 500) give a lower RMSE and MAPE than ntree = 200, so we proceed with those parameters.
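
    A rough Python/scikit-learn analogue of the R workflow above (the original uses rpart and randomForest); the CSV path is a placeholder, and the predictors are assumed to have been reduced to numeric columns as in the cleaning steps.

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor

      df = pd.read_csv("melbourne_housing.csv").dropna()          # placeholder path
      X = df.drop(columns=["Price"]).select_dtypes("number")      # numeric predictors only
      y = df["Price"]
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      for name, model in [("decision tree", DecisionTreeRegressor(random_state=0)),
                          ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
          model.fit(X_train, y_train)
          pred = model.predict(X_test)
          print(name, "MAE:", mean_absolute_error(y_test, pred),
                "MAPE:", mean_absolute_percentage_error(y_test, pred))
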
  20. Labelled dataset to classify direct deforestation drivers in Cameroon:...

    • zenodo.org
    zip
    Updated May 30, 2025
    Cite
    Amandine Debus; Amandine Debus; Emilie Beauchamp; Emilie Beauchamp; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé; Emily R. Lines; Emily R. Lines; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé (2025). Labelled dataset to classify direct deforestation drivers in Cameroon: NIR-R-G bands [Dataset]. http://doi.org/10.5281/zenodo.15538497
    Available download formats: zip
    Dataset updated
    May 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amandine Debus; Amandine Debus; Emilie Beauchamp; Emilie Beauchamp; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé; Emily R. Lines; Emily R. Lines; James Acworth; Achille Ewolo; Justin Kamga; Astrid Verhegghen; Christiane Zébazé
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Cameroon
    Description

    Overview

    This dataset includes the images (NIR-R-G bands for Landsat-8 or NICFI PlanetScope), auxiliary data (infrared, NCEP, forest gain, OpenStreetMap, SRTM, GFW), and data about forest loss (Global Forest Change) used to train, validate and test a model to classify direct deforestation drivers in Cameroon. The creation of this dataset follows the same structure as: Labelled dataset to classify direct deforestation drivers in Cameroon but with a different set of bands.

    For more details about how this dataset has been created and can be used, please refer to our paper and code: https://github.com/aedebus/Cam-ForestNet. The paper, describing the generation of RGB images, can be found here: https://www.nature.com/articles/s41597-024-03384-z.

    Citation: Debus, A. et al. A labelled dataset to classify direct deforestation drivers from Earth Observation imagery in Cameroon. Sci Data 11, 564 (2024).

    Here, the only difference compared with what is described in the paper is that we select NIR-R-G instead of R-G-B bands for our PNG images.

    Description of the files and images

    • ‘my_examples_landsat_nir.zip’: Landsat-8 images (courtesy of the U.S. Geological Survey), auxiliary data and forest loss data used to train, validate and test a model for a detailed classification of deforestation drivers in Cameroon. This dataset contains 332 x 332 pixel NIR-R-G calibrated top-of-atmosphere (TOA) reflectance images with a 30 m resolution (less than 20% cloud cover).
    • ‘my_examples_landsat_sr_nir.zip’: Same as above, but with surface reflectance (SR) instead of TOA.
    • ‘my_examples_planet_nir.zip’: NICFI PlanetScope images (catalog owner: Planet), auxiliary data and forest loss data used to train, validate and test a model for a detailed classification of deforestation drivers in Cameroon. This dataset contains 332 x 332 pixel monthly NIR-R-G composites with a 4.77 m resolution.
    • ‘my_examples_planet_nir_biannual.zip’: Same as above, but with biannual composites instead of monthly composites.
    For ‘labels_nir.zip’, there are subfolders for Landsat-8 (TOA, SR, groups TOA) and NICFI PlanetScope (monthly, biannual, groups monthly).
    For each folder, the subfolders, named with the coordinates of the centre of the images, each contain (a minimal loading sketch follows this list):
    • A folder ‘images’, with a sub-folder ‘visible’ containing the PNG image and a sub-folder ‘infrared’ containing the infrared bands in a NPY file.
    • A folder ‘auxiliary’ with topographic and forest gain information in NPY format, OpenStreetMap and peat data in JSON format, and a sub-folder ‘ncep’ containing all data from NCEP in NPY format.
    • The forest loss pickle file delimiting the area of forest loss.
    Note: The images provided have been filtered to enable a train/validation/test split that ensures a minimum distance of 100 meters between the edges of forest loss areas.
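
    To make the folder layout concrete, here is a minimal loading sketch. The sample folder path is a placeholder, exact file names inside each sample folder should be checked on disk, and the pickle is loaded without assuming its object type.

      import json
      import pickle
      from pathlib import Path

      import numpy as np
      from PIL import Image

      # Placeholder path to one sample folder (named with the image centre coordinates).
      sample = Path("my_examples_landsat_nir/coordinates_folder")

      # NIR-R-G PNG image from the 'visible' sub-folder.
      png_path = next((sample / "images" / "visible").glob("*.png"))
      image = np.asarray(Image.open(png_path))

      # One of the documented auxiliary JSON files (OpenStreetMap distances).
      closest_city = json.loads((sample / "auxiliary" / "closest_city.json").read_text())

      # Forest loss pickle delimiting the area of forest loss.
      with open(next(sample.glob("*.pkl")), "rb") as f:
          forest_loss_region = pickle.load(f)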

    Details about the auxiliary data

    • Forest gain from GFC: 30-m resolution, yearly data for 2000-2021, downloaded via Google Earth Engine
    • Near infrared, shortwave infrared 1 and 2 bands from Landsat-8 TOA: 30-m resolution, data every 16 days for 2013-2023, downloaded via Google Earth Engine and selected using the same process as for Landsat-8 RGB images
    • From NCEP Climate Forecast System Version 2 (CFSv2) 6-hourly Products: surface level albedo and volumetric soil moisture content (depths: 0.1 m, 0.4 m, 1.0 m, 2.0m) in 0.01%; radiative fluxes (clear-sky longwave flux downward and upward, clear-sky solar flux downward and upward, direct evaporation from bare soil, longwave and shortwave radiation flux downward and upward, latent, ground and sensible heat net flux), potential evaporation rate, and sublimation in W/m²; humidity (specific, maximum specific, minimum specific) in 10-4 kg/kg; ground level precipitation in 0.1 mm; air pressure at surface level in 10 Pa; wind level (u and v component) in 0.01 m/s, water runoff at surface level in 232.01 kg/ m²; temperature in K: 22264-m resolution, available four times a day for 2011-2023, downloaded directly from the NOAA website and selected the mean of the monthly mean over 5 years before the forest loss event, the monthly maximum over 5 years before the forest loss event, and the monthly minimum over 5 years before the forest loss event for each parameter
    • Closest street and closest city from OpenStreetMap in km: directly downloaded with the Nominatim API
    • Altitude in m, slope and aspect in 0.01° from Shuttle Radar Topography Mission (SRTM): 30-m resolution, measured for 2000, downloaded via Google Earth Engine
    • Presence of peat from GFW: 232-m resolution, measured for 2017, directly downloaded on the GFW website

    Details about Global Forest Change

    For each image, there is a corresponding 'forest_loss_region' .pkl file delimiting a forest loss region polygon from Global Forest Change (GFC). GFC consists of annual maps of forest cover loss with a 30-m resolution.

    License

    The NICFI PlanetScope images fall under the same license as the NICFI data program license agreement (data in 'my_examples_planet_nir.zip', 'my_examples_planet_nir_biannual.zip': subfolders '[coordinates]'>'images'>'visible').

    OpenStreetMap® is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF) (data in all 'my_examples' folders: subfolders '[coordinates]'>'auxiliary'>'closest_city.json'/'closest_street.json'). The documentation is licensed under the Creative Commons Attribution-ShareAlike 2.0 license (CC BY-SA 2.0).

    The rest of the data is under a Creative Commons Attribution 4.0 International License. The data has been transformed following the code that can be found via this link: https://github.com/aedebus/Cam-ForestNet (in 'prepare_files').
