100+ datasets found
  1. Challenge Round 0 (Dry Run) Test Dataset

    • catalog.data.gov
    • data.nist.gov
    • +1more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
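    The following is a minimal sketch, not part of the TrojAI tooling, of how one might probe a model from such a collection for trigger-induced misclassification. The model path, image path, input size, and the synthetic trigger patch are all hypothetical placeholders.

    ```python
    # Minimal sketch: compare a model's prediction on a clean image vs. the same
    # image with a synthetic "trigger" patch pasted into one corner.
    import torch
    from torchvision import transforms
    from PIL import Image

    model = torch.load("model.pt", map_location="cpu")  # hypothetical model file
    model.eval()

    to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    img = to_tensor(Image.open("sign.png").convert("RGB"))  # hypothetical traffic-sign image

    # Paste a small yellow square as a stand-in trigger.
    triggered = img.clone()
    triggered[:, :32, :32] = torch.tensor([1.0, 1.0, 0.0]).view(3, 1, 1)

    with torch.no_grad():
        clean_pred = model(img.unsqueeze(0)).argmax(dim=1).item()
        trig_pred = model(triggered.unsqueeze(0)).argmax(dim=1).item()

    # A poisoned model will often flip its prediction when the trigger is present.
    print(clean_pred, trig_pred)
    ```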

  2. Data from: Cross-Validation With Confidence

    • tandf.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Jing Lei (2023). Cross-Validation With Confidence [Dataset]. http://doi.org/10.6084/m9.figshare.9976901.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jing Lei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cross-validation is one of the most popular model and tuning parameter selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to overfit, due to the ignorance of the uncertainty in the testing sample. We develop a novel statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This method outputs a set of highly competitive candidate models containing the optimal one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for tuning parameter selection, the method can provide an alternative trade-off between prediction accuracy and model interpretability than existing variants of cross-validation. We demonstrate the performance of the proposed method in several simulated and real data examples. Supplemental materials for this article can be found online.
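    Below is a minimal sketch of conventional K-fold cross-validation for tuning-parameter selection, i.e. the baseline that cross-validation with confidence refines by returning a *set* of statistically indistinguishable candidates rather than the single best scorer. It uses synthetic data and scikit-learn; it is not the author's implementation.

    ```python
    # Baseline: pick the tuning parameter with the best mean CV score.
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    alphas = [0.01, 0.05, 0.1, 0.5, 1.0]
    scores = {a: cross_val_score(Lasso(alpha=a), X, y, cv=5,
                                 scoring="neg_mean_squared_error").mean()
              for a in alphas}

    # Plain CV keeps only the single best alpha; CVC would instead keep every
    # alpha whose CV error is not significantly worse than the minimum.
    best = max(scores, key=scores.get)
    print(best, scores)
    ```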

  3. Training and Validation Datasets for Neural Network to Fill in Missing Data...

    • catalog.data.gov
    Updated Jul 9, 2025
    Cite
    National Institute of Standards and Technology (2025). Training and Validation Datasets for Neural Network to Fill in Missing Data in EBSD Maps [Dataset]. https://catalog.data.gov/dataset/training-and-validation-datasets-for-neural-network-to-fill-in-missing-data-in-ebsd-maps
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset consists of the synthetic electron backscatter diffraction (EBSD) maps generated for the paper titled "Hybrid Algorithm for Filling in Missing Data in Electron Backscatter Diffraction Maps" by Emmanuel Atindama, Conor Miller-Lynch, Huston Wilhite, Cody Mattice, Günay Doğan, and Prashant Athavale. The EBSD maps were used to train, test, and validate a neural network algorithm to fill in missing data points in a given EBSD map. The dataset includes 8000 maps for training, 1000 maps for testing, and 2000 maps for validation. It also includes noise-added versions of the maps, namely one additional noise-added map for each clean map.

  4. Performance of ML models on test data.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
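    A minimal sketch of the kind of pipeline described above follows: a Random Forest regressor tuned with 3-fold cross-validation, then evaluated with a bootstrapped confidence interval for RMSE. It runs on synthetic data; the feature and target names are placeholders, not the study's variables or code.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(42)
    X = rng.normal(size=(50, 6))                         # stand-ins for Hb, CRP, ESR, age, ...
    y = 20 + 3 * X[:, 0] + rng.normal(scale=2, size=50)  # stand-in for vitamin D level

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # 3-fold CV grid search to limit overfitting on a small dataset.
    grid = GridSearchCV(RandomForestRegressor(random_state=0),
                        {"n_estimators": [100, 300], "max_depth": [2, 4, None]},
                        cv=3, scoring="neg_root_mean_squared_error")
    grid.fit(X_tr, y_tr)
    pred = grid.predict(X_te)

    # Bootstrap the test-set RMSE to get a rough 95% confidence interval.
    rmses = []
    for _ in range(1000):
        idx = rng.integers(0, len(y_te), len(y_te))
        rmses.append(np.sqrt(mean_squared_error(y_te[idx], pred[idx])))
    print(np.percentile(rmses, [2.5, 97.5]))
    ```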

  5. MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset -...

    • service.tib.eu
    Updated Dec 17, 2024
    Cite
    (2024). MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/ms-training-set--ms-validation-set--and-uw-validation-test-set
    Explore at:
    Dataset updated
    Dec 17, 2024
    Description

    The MS Training Set, MS Validation Set, and UW Validation/Test Set are used for training, validating, and testing the proposed methods.

  6. DRIVE Train/Validation Split Dataset

    • kaggle.com
    Updated Feb 19, 2023
    Cite
    Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sovit Ranjan Rath
    Description

    This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

    The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

    This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

    Split sample numbers: training images and masks: 16; validation images and masks: 4; test images: 20.
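    As a minimal sketch of how such a 16/4 train/validation split of image/mask pairs could be created on disk, the snippet below copies files into split directories. The directory names and mask-naming convention are assumptions about the original DRIVE layout, not the layout of this Kaggle dataset.

    ```python
    import random
    import shutil
    from pathlib import Path

    src_images = sorted(Path("DRIVE/training/images").glob("*.tif"))  # assumed source layout
    random.seed(0)
    val_images = set(random.sample(src_images, 4))  # hold out 4 of the 20 training images

    for img in src_images:
        split = "valid" if img in val_images else "train"
        img_dst = Path(f"split/{split}/images")
        img_dst.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, img_dst / img.name)

        # Copy the corresponding vessel mask alongside the image (assumed naming).
        mask = Path("DRIVE/training/1st_manual") / img.name.replace("_training.tif", "_manual1.gif")
        mask_dst = Path(f"split/{split}/masks")
        mask_dst.mkdir(parents=True, exist_ok=True)
        if mask.exists():
            shutil.copy(mask, mask_dst / mask.name)
    ```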

  7. Data from: Robust Validation: Confident Predictions Even When Distributions...

    • tandf.figshare.com
    bin
    Updated Dec 26, 2023
    Cite
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
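    For context, here is a minimal sketch of standard split conformal prediction, the starting point that the paper makes robust by inflating the calibration quantile over an f-divergence ball around the training distribution. Synthetic data; not the authors' implementation.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=1.0, size=1000)

    # Split into a proper training set and a calibration set.
    X_fit, y_fit, X_cal, y_cal = X[:500], y[:500], X[500:], y[500:]
    model = LinearRegression().fit(X_fit, y_fit)

    # Conformity scores on the calibration set: absolute residuals.
    scores = np.abs(y_cal - model.predict(X_cal))
    alpha = 0.1
    q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

    # Prediction interval for a new point with ~90% marginal coverage, valid when
    # calibration and test data are exchangeable; the robust variant widens q so
    # coverage also holds for shifted test distributions.
    x_new = rng.normal(size=(1, 5))
    center = model.predict(x_new)[0]
    print(center - q, center + q)
    ```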

  8. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset instead.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract script to run text extraction on the detected text rows. It is included in our code archive (code.tar) as text_recognition_multipro.py.

    We used a Java evaluation tool provided by Falk Böschen, adapted to our file structure. We include it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
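    As an illustration of the recognition step, here is a minimal sketch of running Tesseract (via pytesseract) over text rows cropped from a figure using boxes produced by a detector such as EAST. The image path and box coordinates are placeholders; this is not the text_recognition_multipro.py script itself.

    ```python
    from PIL import Image
    import pytesseract

    figure = Image.open("figure.png")  # hypothetical scholarly figure
    detected_rows = [(10, 20, 200, 40), (15, 60, 220, 82)]  # (left, top, right, bottom) from a detector

    for box in detected_rows:
        crop = figure.crop(box)
        # --psm 7 treats the crop as a single text line, which suits detected rows.
        text = pytesseract.image_to_string(crop, config="--psm 7")
        print(box, text.strip())
    ```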

  9. Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...

    • cancerimagingarchive.net
    csv, dicom, n/a +1
    Updated May 2, 2025
    Cite
    The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
    Explore at:
    Available download formats: sqlite and zip, dicom, csv, n/a
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 2, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

    This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

    Introduction

    Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

    These resources complement our earlier work (Pseudo-PHI-DICOM-data) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group, who underscore the importance of transparency, documentation, and reproducibility in de-identification workflows, part of the themes at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

    This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

    Methods

    Subject Inclusion and Exclusion Criteria

    The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces was excluded, and no new human studies were performed for this project.

    Data Acquisition

    To build the synthetic dataset, image series were selected from TCIA's curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.

    Data Analysis

    Synthetic pools of PHI, such as subject and scanning-institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into the DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework gives users transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
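    A minimal sketch of this injection idea follows: generating synthetic PHI with Faker, writing it into DICOM header elements with pydicom, and logging the inserted values for an answer key. It is illustrative only, not the MIDI-B rule-based template system; file names and chosen elements are placeholders.

    ```python
    import csv
    from faker import Faker
    import pydicom

    fake = Faker()
    ds = pydicom.dcmread("input.dcm")  # hypothetical de-identified source file

    # Generate a small pool of synthetic PHI and write it into header elements.
    synthetic = {
        "PatientName": fake.name(),
        "PatientID": fake.bothify(text="??######"),
        "InstitutionName": fake.company(),
    }
    for keyword, value in synthetic.items():
        setattr(ds, keyword, value)

    ds.save_as("synthetic_phi.dcm")

    # Log what was inserted so a de-identification run can be scored against it.
    with open("answer_key.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["element", "inserted_value"])
        writer.writerows(synthetic.items())
    ```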

    Usage Notes

    This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.

    To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.

  10. Validation Data and Control Software for ATIC: Automated Testbed for...

    • catalog.data.gov
    • data.nist.gov
    Updated Dec 15, 2023
    Cite
    National Institute of Standards and Technology (2023). Validation Data and Control Software for ATIC: Automated Testbed for Interference Testing in Communication Systems [Dataset]. https://catalog.data.gov/dataset/validation-data-and-control-software-for-atic-automated-testbed-for-interference-testing-i-61d38
    Explore at:
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Validation data and software for the paper, "ATIC: Automated Testbed for Interference Testing in Communication Systems," to appear in Proceedings of 2023 IEEE Military Communications Conference. See the README file for descriptions of the data files. Software is available at https://github.com/usnistgov/atic.

  11. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
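    Below is a minimal, rough sketch of the sampling scheme described above: randomly selecting recordings per species/grid-cell combination up to the stated cap of 1250, then randomly splitting into training, validation, and holdout sets. The metadata file, its column names, and the 70/15/15 split fractions are assumptions for illustration, not the actual NABat schema or procedure.

    ```python
    import pandas as pd

    meta = pd.read_csv("nabat_files.csv")  # hypothetical columns: file, species, grid_cell

    # Cap the number of recordings drawn from each species/grid-cell combination.
    capped = (meta.groupby(["species", "grid_cell"], group_keys=False)
                  .apply(lambda g: g.sample(min(len(g), 1250), random_state=0)))

    # Random split into train / validation / holdout (fractions assumed here).
    shuffled = capped.sample(frac=1.0, random_state=0).reset_index(drop=True)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.7 * n)]
    val = shuffled.iloc[int(0.7 * n): int(0.85 * n)]
    holdout = shuffled.iloc[int(0.85 * n):]
    print(len(train), len(val), len(holdout))
    ```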

  12. CLASPP Training Testing Validation data (both initial prototype data and...

    • zenodo.org
    Updated Aug 6, 2025
    Cite
    Nathan Gravel; Nathan Gravel (2025). CLASPP Training Testing Validation data (both initial prototype data and finalized data) [Dataset]. http://doi.org/10.5281/zenodo.16739128
    Explore at:
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nathan Gravel; Nathan Gravel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 7, 2024
    Description

    There are 2 versions of the data sets included in this repo: the prototype (24_05_07) and the finalized (25_05_11).

    The major differences between the prototype and the finalized data sets are as follows:

    1. The prototype has more unsupervised clustering labels (60) than the finalized version (54)
    2. The prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
    3. The prototype has K-Succ; the finalized version does not
    4. The finalized version has PK-Hydr; the prototype does not
    5. Both are put through the same curation pipeline, but with different random seeds for sampling

    Most differences are an outcome of testing the stability of each PTM type (class) in a multi-classification setting. Most PTM types (all that were tested) are stable (i.e., converge) in a single binary classification setting, but when K-Succ was added to the multi-classification setting it tanked its own performance and that of other PTM types (anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance; in theory, one could fix this with different hyper-parameters. Difference 3 is the major reason for the swap; all the other differences just tell a better story and clean up the data to coincide with the benchmark performance from Fig2 and S_Fig2.

    Finalized (25_05_11)

    • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv
      • Used to train the final model and this was utilized in Fig4, Fig5, Fig6, S_Fig3, and S_fig4
    • val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
      • Used to help train the final model
    • test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
      • Used as the benchmark in Fig4 and S_Fig3
      • Fig4 and S_Fig3 used the positive labels and, as the negative class, only the negative labels that share the same residue type
      • Same positive labels as HUMAN_labs.txt

    All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20)


    Prototype (24_05_07)

    • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-24_05_07.csv
      • Used to train the initial model used in Fig2, Fig3, S_Fig1, and S_Fig2
      • Segmented to build single binary classification models for Fig2 and S_Fig2
    • val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-24_05_07.csv
      • Used to help train the initial model and used to benchmark Fig2, Fig3, S_Fig1, and S_Fig2
      • Fig3 and S_Fig1 used the positive labels and all negative labels as the negative class in this benchmark
      • Fig2 and S_Fig2 used the positive labels and all negative labels as the negative class in this benchmark (single binary classification)

    All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20)

    Out-of-distribution

    • HUMAN_labs.txt
    • MOUSE_labs.txt
    • DROME_labs.txt
    • CAEEL_labs.txt
    • YEAST_labs.txt
    • ECOLI_labs.txt

    All benchmarks here use final labels (20) rather than the unsupervised clustering labels (54)

    Negative labels were only used if they share the residue with the positive label

    All positive and negative labels were under-sampled to a maximum of 500 examples

    Each species-specific PTM type benchmark was only used if it has at least 100 positive examples and 50 negative examples
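    A minimal sketch of these benchmark-construction rules: keep only negatives that share the modified residue with the positive class, under-sample both classes to at most 500 examples, and keep a species/PTM benchmark only if it has at least 100 positives and 50 negatives. The file format and column names are hypothetical, not the actual *_labs.txt schema.

    ```python
    import pandas as pd

    labs = pd.read_csv("HUMAN_labs.txt", sep="\t")  # hypothetical columns: ptm, residue, label

    benchmarks = {}
    for ptm, group in labs.groupby("ptm"):
        pos = group[group["label"] == 1]
        pos_residues = set(pos["residue"])
        # Negatives are kept only if they share a residue type with the positives.
        neg = group[(group["label"] == 0) & group["residue"].isin(pos_residues)]
        if len(pos) < 100 or len(neg) < 50:
            continue  # too small to serve as a benchmark
        pos = pos.sample(min(len(pos), 500), random_state=0)
        neg = neg.sample(min(len(neg), 500), random_state=0)
        benchmarks[ptm] = pd.concat([pos, neg])
    ```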

    Different Res and Neg ratios

    (not good performance in practice but could work in theory)

    • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to70NegRaio-24_05_07.csv
      • Pos/Neg ratio is 1/70 (each class); medium/easy negative residue ratio is uniform at 1/20; total residue ratio is NOT uniform
    • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to100NegRaio-24_05_07.csv
      • Pos/Neg ratio is 1/100 (each class); medium/easy negative residue ratio is uniform at 1/20; total residue ratio is NOT uniform
    • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
      • Pos/Neg ratio is 1/1000 (each class); medium/easy negative residue ratio is NOT uniform; total residue ratio is uniform

    Other related repos

    | Repo | Link (will go live when submitted) | Description |
    | --- | --- | --- |
    | GitHub | github_version_Data_cur | This version contains code but no data. You need to run the code to generate all the helper files (this will take some time). |
    | GitHub | github_version_Forward | This version contains code but NOT any weights (the file is too big for GitHub). |
    | Huggingface | huggingface_version_Forward | This version contains code and training weights. |
    | Zenodo | zenodo_version_training_data | Zenodo version of the training/testing/validation data. |
    | webtool | webtool | Webtool hosted on a server. |
  13. 5 Validation Testing Brightness Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2023
    Cite
    University Science Malaysia (2023). 5 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/5-validation-testing-brightness
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2023
    Dataset authored and provided by
    University Science Malaysia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    5 Validation Testing Brightness Bounding Boxes
    Description

    5 Validation Testing Brightness

    ## Overview
    
    5 Validation Testing Brightness is a dataset for object detection tasks - it contains 5 Validation Testing Brightness annotations for 441 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. 25 Validation Testing Brightness Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2023
    Cite
    University Science Malaysia (2023). 25 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/25-validation-testing-brightness/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2023
    Dataset authored and provided by
    University Science Malaysia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    25 Validation Testing Brightness Bounding Boxes
    Description

    25 Validation Testing Brightness

    ## Overview
    
    25 Validation Testing Brightness is a dataset for object detection tasks - it contains 25 Validation Testing Brightness annotations for 465 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  15. Automated Cryptographic Validation Test System Generators and Validators

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Automated Cryptographic Validation Test System Generators and Validators [Dataset]. https://catalog.data.gov/dataset/automated-cryptographic-validation-test-system-generators-and-validators
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a program that takes in a description of a cryptographic algorithm implementation's capabilities, and generates test vectors to ensure the implementation conforms to the standard. After generating the test vectors, the program also validates the correctness of the responses from the user.
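    The generate-then-validate flow can be illustrated with a minimal sketch using AES-ECB as a stand-in algorithm. This is illustrative only; it is not the NIST test system's code and does not follow the ACVP message format.

    ```python
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def generate_vectors(n=5, key_bits=128):
        """Produce random key/plaintext test vectors for an AES-ECB capability."""
        return [{"key": os.urandom(key_bits // 8), "pt": os.urandom(16)} for _ in range(n)]

    def expected_ct(vector):
        """Compute the reference ciphertext for one test vector."""
        enc = Cipher(algorithms.AES(vector["key"]), modes.ECB()).encryptor()
        return enc.update(vector["pt"]) + enc.finalize()

    def validate(vectors, responses):
        """Compare the implementation-under-test's ciphertexts against the reference."""
        return [resp == expected_ct(vec) for vec, resp in zip(vectors, responses)]

    vectors = generate_vectors()
    # In practice the responses come from the implementation under test; here we
    # simulate a correct implementation so every check passes.
    responses = [expected_ct(v) for v in vectors]
    print(validate(vectors, responses))
    ```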

  16. Testing dataset for the "long-bone-diaphyseal-CSG-Toolkit"

    • data.niaid.nih.gov
    Updated Jan 21, 2020
    Cite
    Bertsatos Andreas (2020). Testing dataset for the "long-bone-diaphyseal-CSG-Toolkit" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1466961
    Explore at:
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Bertsatos Andreas
    Chovalopoulou Maria-Eleni
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The present dataset has been used for the validation study of correct operation for the "long-bone-diaphyseal-CSG-Toolkit". It consists of three 3D mesh bone models (a humerus, a femur and a tibia, which are part of the Athens modern reference skeletal collection) used for comparison to alternative methods for calculating CSG properties of long bones and one 3D mesh ground model (with known geometric properties) used as a gold standard reference.

    Additionally, the dataset includes all the results (stored in the respective csv files) from analyzing each of these models with the GNU Octave CSG Toolkit v1.0.1. The present dataset acts both as supplementary material to the validation study and as a sample dataset for user testing of the operation of the GNU Octave CSG Toolkit.

  17. Data from: ViF-GTAD: A new Automotive Data Set with Ground Truth for ADAS/AD...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Apr 8, 2023
    Cite
    Reckenzaun, Jakob (2023). ViF-GTAD: A new Automotive Data Set with Ground Truth for ADAS/AD Development, Testing and Validation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6624298
    Explore at:
    Dataset updated
    Apr 8, 2023
    Dataset provided by
    Genser, Simon
    Reckenzaun, Jakob
    Haas, Sarah
    Solmaz, Selim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A new dataset for automated driving, which is the subject matter of this paper, identifies and addresses a gap in existing similar perception data sets. While most state-of-the-art perception data sets primarily focus on providing various on-board sensor measurements along with semantic information under various driving conditions, the provided information is often insufficient, since the object lists and position data include unknown and time-varying errors. The current paper and the associated dataset describe the first publicly available perception measurement data that include not only the on-board sensor information from camera, lidar, and radar with semantically classified objects, but also the high-precision ground-truth position measurements enabled by the accurate RTK-assisted GPS localization systems available on both the ego vehicle and the dynamic target objects. This paper provides insight into the capture of the data, explicitly explaining the metadata structure and content, as well as potential application examples where it has been, and can potentially be, applied and implemented in relation to automated driving and environmental perception systems development, testing, and validation.

  18. Training, validation and test datasets and model files for larger US Health...

    • ufs.figshare.com
    txt
    Updated Dec 12, 2023
    Cite
    Jan Marthinus Blomerus (2023). Training, validation and test datasets and model files for larger US Health Insurance dataset [Dataset]. http://doi.org/10.38140/ufs.24598881.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 12, 2023
    Dataset provided by
    University of the Free State
    Authors
    Jan Marthinus Blomerus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Formats1.xlsx contains the descriptions of the columns of the following datasets. The training, validation, and test datasets in combination are all the records. sens1.csv and meansdX.csv are required for testing.

  19. Training and Testing Datasets for Machine Learning of Shortwave Radiative...

    • data.niaid.nih.gov
    Updated Mar 28, 2025
    Cite
    Schneiderman, Henry (2025). Training and Testing Datasets for Machine Learning of Shortwave Radiative Transfer [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15089912
    Explore at:
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    Schneiderman, Henry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for Machine Learning Shortwave Radiative Transfer

    Author: Henry Schneiderman, henry@pittdata.com. Please contact me with any questions or feedback.

    Input reanalysis data downloaded from ECMWF's Copernicus Atmospheric Monitoring Service. Each atmospheric column contains the following input variables:

    • mu - cosine of solar zenith angle
    • albedo - surface albedo
    • is_valid_zenith_angle - indicates if daylight is present
    • Vertical profiles (60 layers): temperature, pressure, change in pressure, H2O (vapor, liquid, solid), O3, CO2, O2, N2O, CH4

    The ecRad emulator (Hogan and Bozzo, 2018) generated the following output profiles at the layer interfaces for each input atmospheric column:

    flux_down_direct, flux_down_diffuse, flux_down_direct_clear_sky, flux_down_diffuse_clear_sky, flux_up_diffuse, flux_up_clear_sky

    All data is sampled at 5,120 global locations

    The training dataset uses input from 2008 sampled at three-hour intervals within every fourth day

    The validation dataset uses input from 2008 sampled at three-hour intervals within every 28th day offset two days from the training set to avoid duplication

    Testing datasets use input from 2009, 2015, and 2020. Each of these samples data at three-hour intervals within every 28th day.
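    The temporal sampling scheme described above (three-hour timestamps within every fourth day of 2008 for training, and within every 28th day offset by two days for validation) can be sketched as follows. This is purely illustrative; it is not the author's preprocessing code.

    ```python
    import pandas as pd

    # All three-hour timestamps in 2008.
    hours = pd.date_range("2008-01-01", "2008-12-31 23:00", freq="3h")

    # Training: every fourth day; validation: every 28th day, offset by two days.
    train_times = hours[(hours.dayofyear - 1) % 4 == 0]
    val_times = hours[(hours.dayofyear - 1) % 28 == 2]

    # The two-day offset guarantees no validation day coincides with a training day.
    assert not set(val_times.normalize()) & set(train_times.normalize())
    print(len(train_times), len(val_times))
    ```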

    For more information see: Henry Schneiderman, "An Open Box Physics-Based Neural Network for Shortwave Radiative Transfer." Submitted to Artificial Intelligence for the Earth Systems.

  20. 15 Validation Testing Brightness Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2023
    Cite
    University Science Malaysia (2023). 15 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/15-validation-testing-brightness/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2023
    Dataset authored and provided by
    University Science Malaysia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    15 Validation Testing Brightness Bounding Boxes
    Description

    15 Validation Testing Brightness

    ## Overview
    
    15 Validation Testing Brightness is a dataset for object detection tasks - it contains 15 Validation Testing Brightness annotations for 460 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    