100+ datasets found

Challenge Round 0 (Dry Run) Test Dataset
catalog.data.gov
data.nist.gov
+1more
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
f
Data from: Cross-Validation With Confidence
tandf.figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jing Lei (2023). Cross-Validation With Confidence [Dataset]. http://doi.org/10.6084/m9.figshare.9976901.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.9976901.v3
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Jing Lei
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cross-validation is one of the most popular model and tuning parameter selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to overfit, due to the ignorance of the uncertainty in the testing sample. We develop a novel statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This method outputs a set of highly competitive candidate models containing the optimal one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for tuning parameter selection, the method can provide an alternative trade-off between prediction accuracy and model interpretability than existing variants of cross-validation. We demonstrate the performance of the proposed method in several simulated and real data examples. Supplemental materials for this article can be found online.
Training and Validation Datasets for Neural Network to Fill in Missing Data...
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
Updated Jul 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2025). Training and Validation Datasets for Neural Network to Fill in Missing Data in EBSD Maps [Dataset]. https://catalog.data.gov/dataset/training-and-validation-datasets-for-neural-network-to-fill-in-missing-data-in-ebsd-maps
Explore at:
Dataset updated
Jul 9, 2025
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This dataset consists of the synthetic electron backscatter diffraction (EBSD) maps generated for the paper, titled "Hybrid Algorithm for Filling in Missing Data in Electron Backscatter Diffraction Maps" by Emmanuel Atindama, Conor Miller-Lynch, Huston Wilhite, Cody Mattice, Günay Doğan, and Prashant Athavale. The EBSD maps were used to train, test, and validate a neural network algorithm to fill in missing data points in a given EBSD map.The dataset includes 8000 maps for training, 1000 maps for testing, 2000 maps for validation. The dataset also includes noise-added versions of the maps, namely, one more map per each clean map.
f
Performance of ML models on test data.
plos.figshare.com
xls
Updated Oct 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pgph.0002475.t005
Dataset updated
Oct 31, 2023
Dataset provided by
PLOS Global Public Health
Authors
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
t
MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset -...
service.tib.eu
Updated Dec 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/ms-training-set--ms-validation-set--and-uw-validation-test-set
Explore at:
Dataset updated
Dec 17, 2024
Description
The MS Training Set, MS Validation Set, and UW Validation/Test Set are used for training, validation, and testing the proposed methods.
DRIVE Train/Validation Split Dataset
kaggle.com
Updated Feb 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 19, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sovit Ranjan Rath
Description
This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

Split sample numbers: Training images and masks: 16 Validation images and masks: 4 Test images: 20
f
Data from: Robust Validation: Confident Predictions Even When Distributions...
tandf.figshare.com
bin
Updated Dec 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24904721.v1
Dataset updated
Dec 26, 2023
Dataset provided by
Taylor & Francis
Authors
Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
F
Data from: A Neural Approach for Text Extraction from Scholarly Figures
data.uni-hannover.de
zip
Updated Jan 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
Explore at:
zipAvailable download formats
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

If you found this dataset useful, please consider citing our paper:

@inproceedings{DBLP:conf/icdar/MorrisTE19, author = {David Morris and Peichen Tang and Ralph Ewerth}, title = {A Neural Approach for Text Extraction from Scholarly Figures}, booktitle = {2019 International Conference on Document Analysis and Recognition, {ICDAR} 2019, Sydney, Australia, September 20-25, 2019}, pages = {1438--1443}, publisher = {{IEEE}}, year = {2019}, url = {https://doi.org/10.1109/ICDAR.2019.00231}, doi = {10.1109/ICDAR.2019.00231}, timestamp = {Tue, 04 Feb 2020 13:28:39 +0100}, biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

Datasets

We used different sources of data for testing, validation, and training. Our testing set was assembled by the work we cited by Böschen et al. We excluded the DeGruyter dataset, and use it as our validation dataset.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
c
Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...
cancerimagingarchive.net
csv, dicom, n/a +1
Updated May 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
Explore at:
sqlite and zip, dicom, csv, n/aAvailable download formats
Unique identifier
https://doi.org/10.7937/cf2p-aw56
Dataset updated
May 2, 2025
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 2, 2025
Dataset funded by
National Cancer Institutehttp://www.cancer.gov/
Description
Abstract
These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.
This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.
Introduction
Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).
These resources complement our earlier work (Pseudo-PHI-DICOM-data ) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group who underscore the importance of transparency, documentation, and reproducibility in de-identification workflows, part of the themes at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).
This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.
Methods
Subject Inclusion and Exclusion Criteria
The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces were excluded, and no new human studies were performed for his project.
Data Acquisition
To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US) , manufacturers including (GE, Siemens, Varian , Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, others) , scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.
Data Analysis
Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned-in to the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework enables users transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
Usage Notes
This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.
To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.
Validation Data and Control Software for ATIC: Automated Testbed for...
catalog.data.gov
data.nist.gov
Updated Dec 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). Validation Data and Control Software for ATIC: Automated Testbed for Interference Testing in Communication Systems [Dataset]. https://catalog.data.gov/dataset/validation-data-and-control-software-for-atic-automated-testbed-for-interference-testing-i-61d38
Explore at:
Dataset updated
Dec 15, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
Validation data and software for the paper, "ATIC: Automated Testbed for Interference Testing in Communication Systems," to appear in Proceedings of 2023 IEEE Military Communications Conference. See the README file for descriptions of the data files. Software is available at https://github.com/usnistgov/atic.
d
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

CLASPP Training Testing Validation data (both initial prototype data and...

zenodo.org

Updated Aug 6, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Nathan Gravel; Nathan Gravel (2025). CLASPP Training Testing Validation data (both initial prototype data and finalized data) [Dataset]. http://doi.org/10.5281/zenodo.16739128

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.16739128

Dataset updated

Aug 6, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Nathan Gravel; Nathan Gravel

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

May 7, 2024

Description

There are 2 versions of data sets included in this repo: the prototype (24_05_07) and finalized (25_05_11)

The major difference between Prototype and the finalized data sets are as following

Prototype has more unsupervised custering labels (60) then finalized (54)
Prototype has more clusters associated with Y-Phos, K-Sumo K-Malo
Prototype has K-Succ and finalized does not
Finalized has PK-Hydr and Prototype does not
They are put through the same curation pipeline but with different random seeds for sampling

Most difference are an outcome of testing the stability of each PTM type (class) in a multi-classification setting. Most PTM (All that was tested) are stable (will converge) in a single binary classification setting. K-Succ when added to the multi-classification tanked its performance and other PTM types (anecdotal behavoir testing). The swap to a different finalized training set was primarily due to this clash in performance. In theory with different hyper-parameters, one could fix this. Difference 3 being the major reason and all other differences are just to tell a better story and cleaning up the data to coincide with benchmarks performance from Fig2 and S_Fig2.

Finalized (25_05_11)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv
- Used to train the final model and this was utilized in Fig4, Fig5, Fig6, S_Fig3, and S_fig4
val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
- Used to help train the final model
test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
- Used as the benchmark in Fig4 and S_Fig3
- Fig 4 and S_Fig3 used the positive labels and just neg labels that share the same res type as the negative class in this benchmark
- Same positive labels as HUMAN_labs.txt

All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20)

Prototype (24_05_07)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-24_05_07.csv
- Used to train initial model that was used to in Fig2, Fig3, S_Fig1, S_Fig2
- segmented and made a Singel Binary Classification models for Fig2 and S_Fig2
val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-24_05_07.csv
- Used to help train initial model and used to benchmark Fig2, Fig3, S_Fig1, S_Fig2
- Fig3 and S_Fig1 used the positive labels and all negative labels negative class in this benchmark
- Fig2 and S_Fig2 used the positive labels and all negative labels negative class in this benchmark (Singel Binary Classification)

All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20)

Out-of-distribution

HUMAN_labs.txt
MOUSE_labs.txt
DROME_labs.txt
CAEEL_labs.txt
YEAST_labs.txt
ECOLI_labs.txt

All benchmarks here use final labels (20) rather than the unsupervised clustering labels (54)

Negative labels were only used if they share the residue to the positive label

All positive and negative labels were under-sampled to have a max of 500

Each species specific PTM type benchmark was only used if they have at least 100 positive examples and 50 negatives

Different Res and Neg ratios

(not good performance in practice but could work in theory)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to70NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/70 (each class) - Medium/Easy neg Res ration is uniform 1/20 - tot res ratio NOT uniform
train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to100NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/100 (each class) - Medium/Easy neg Res ration is uniform 1/20 - tot res ratio NOT uniform
train_hd3_CustBL62SeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
- Pos/Neg ratio is 1/1000 (each class) - Medium/Easy neg Res ration is NOT uniform - tot res ratio is uniform

Other related repos

Repo	Link (will go live when submitted)	Discription
GitHub	github_version_Data_cur	This verstion contains code but but no data. It needs you to run the code to generate all the helper-files (will take some time run this code)
GitHub	github_version_Forward	This verstion contains code but NOT any weights (file too big for github)
Huggingface	huggingface_version_Forward	This verstion contains code and training weights
Zenodo	zenodo_version_training_data	zenodo version of training/testing/validation data
webtool	webtool	webtool hosted on a server

R
5 Validation Testing Brightness Dataset
universe.roboflow.com
zip
Updated May 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University Science Malaysia (2023). 5 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/5-validation-testing-brightness
Explore at:
zipAvailable download formats
Dataset updated
May 24, 2023
Dataset authored and provided by
University Science Malaysia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
5 Validation Testing Brightness Bounding Boxes
Description
5 Validation Testing Brightness

## Overview 5 Validation Testing Brightness is a dataset for object detection tasks - it contains 5 Validation Testing Brightness annotations for 441 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
R
25 Validation Testing Brightness Dataset
universe.roboflow.com
zip
Updated May 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University Science Malaysia (2023). 25 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/25-validation-testing-brightness/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
May 24, 2023
Dataset authored and provided by
University Science Malaysia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
25 Validation Testing Brightness Bounding Boxes
Description
25 Validation Testing Brightness

## Overview 25 Validation Testing Brightness is a dataset for object detection tasks - it contains 25 Validation Testing Brightness annotations for 465 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Automated Cryptographic Validation Test System Generators and Validators
catalog.data.gov
data.nist.gov
+2more
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). Automated Cryptographic Validation Test System Generators and Validators [Dataset]. https://catalog.data.gov/dataset/automated-cryptographic-validation-test-system-generators-and-validators
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a program that takes in a description of a cryptographic algorithm implementation's capabilities, and generates test vectors to ensure the implementation conforms to the standard. After generating the test vectors, the program also validates the correctness of the responses from the user.
Z
Testing dataset for the "long-bone-diaphyseal-CSG-Toolkit"
data.niaid.nih.gov
Updated Jan 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bertsatos Andreas (2020). Testing dataset for the "long-bone-diaphyseal-CSG-Toolkit" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1466961
Explore at:
Dataset updated
Jan 21, 2020
Dataset provided by
Bertsatos Andreas
Chovalopoulou Maria-Eleni
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The present dataset has been used for the validation study of correct operation for the "long-bone-diaphyseal-CSG-Toolkit". It consists of three 3D mesh bone models (a humerus, a femur and a tibia, which are part of the Athens modern reference skeletal collection) used for comparison to alternative methods for calculating CSG properties of long bones and one 3D mesh ground model (with known geometric properties) used as a gold standard reference.

Additionally, the dataset includes all the results (stored in the respective csv files) from analyzing each of these models with the GNU Octave CSG Toolkit v1.0.1. The present dataset acts both as supplementary material to the validation study and as a sample dataset for user testing of the operation of the GNU Octave CSG Toolkit.
Z
Data from: ViF-GTAD: A new Automotive Data Set with Ground Truth for ADAS/AD...
data.niaid.nih.gov
zenodo.org
+1more
Updated Apr 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Reckenzaun, Jakob (2023). ViF-GTAD: A new Automotive Data Set with Ground Truth for ADAS/AD Development, Testing and Validation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6624298
Explore at:
Dataset updated
Apr 8, 2023
Dataset provided by
Genser, Simon
Reckenzaun, Jakob
Haas, Sarah
Solmaz, Selim
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A new dataset for automated driving, which is the subject matter of this paper, identifies and addresses a gap in existing similar perception data sets. While the most state-of-the-art perception data sets primarily focus on provision of various on-board sensor measurements along with the semantic information under various driving conditions, the provided information is often insufficient since the object list and position data provided include unknown and time-varying errors. The current paper and the associated data-set describes the first publicly available perception measurement data that include not only the on-board sensor information from camera, Lidar and radar with semantically classified objects, but also the high precision ground-truth position measurements enabled by the accurate RTK assisted GPS localization systems available on both the ego vehicle and the dynamic target objects. This paper provides insight on the capturing of the data, explicitly explaining the meta data structure and the content, as well as the potential application examples where it has been, and can potentially be, applied and implemented in relation to automated driving and environmental perception systems development, testing and validation.
f
Training, validation and test datasets and model files for larger US Health...
ufs.figshare.com
txt
Updated Dec 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Marthinus Blomerus (2023). Training, validation and test datasets and model files for larger US Health Insurance dataset [Dataset]. http://doi.org/10.38140/ufs.24598881.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.38140/ufs.24598881.v2
Dataset updated
Dec 12, 2023
Dataset provided by
University of the Free State
Authors
Jan Marthinus Blomerus
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Formats1.xlsx contains the descriptions of the columns of the following datasets: Training, validation and test datasets in combination are all the records.sens1.csv and and meansdX.csv are required for testing.
Z
Training and Testing Datasets for Machine Learning of Shortwave Radiative...
data.niaid.nih.gov
Updated Mar 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Schneiderman, Henry (2025). Training and Testing Datasets for Machine Learning of Shortwave Radiative Transfer [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15089912
Explore at:
Dataset updated
Mar 28, 2025
Dataset authored and provided by
Schneiderman, Henry
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for Machine Learning Shortwave Radiative Transfer

Author - Henry Schneiderman, henry@pittdata.comPlease contact me for any questions or feedback

Input reanalysis data downloaded from ECMWF's Copernicus Atmospheric Monitoring Service. Each atmospheric column contains the following input variables:

mu - Cosine of solar zenith anglealbedo - Surface albedois_valid_zenith_angle - Indicates if daylight is presentVertical profiles (60 layers): Temperature Pressure, Change in Pressure, H2O (vapor, liquid, solid), O3, CO2, O2, N2O, CH4

The ecRad emulator (Hogan and Bozzo, 2018) generated the following output profiles at the layer interfaces for input each atmospheric column:

flux_down_direct, flux_down_diffuse, flux_down_direct_clear_sky, flux_down_diffuse_clear_sky, flux_up_diffuse, flux_up_clear_sky

All data is sampled at 5,120 global locations

The training dataset uses input from 2008 sampled at three-hour intervals within every fourth day

The validation dataset uses input from 2008 sampled at three-hour intervals within every 28th day offset two days from the training set to avoid duplication

Testing datasets use input from 2009, 2015, and 2020. Each of these samples data at three-hour intervals within every 28th day.

For more information see:Henry Schneiderman. "An Open Box Physics-Based Neural Network for Shortwave Radiative Transfer." Submitted to Artificial Intelligence for the Earth Systems.
R
15 Validation Testing Brightness Dataset
universe.roboflow.com
zip
Updated May 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University Science Malaysia (2023). 15 Validation Testing Brightness Dataset [Dataset]. https://universe.roboflow.com/university-science-malaysia/15-validation-testing-brightness/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
May 24, 2023
Dataset authored and provided by
University Science Malaysia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
15 Validation Testing Brightness Bounding Boxes
Description
15 Validation Testing Brightness

## Overview 15 Validation Testing Brightness is a dataset for object detection tasks - it contains 15 Validation Testing Brightness annotations for 460 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).

Facebook

Twitter

Click to copy link

Link copied

Cite

National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885

Challenge Round 0 (Dry Run) Test Dataset

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Jul 29, 2022

Dataset provided by

National Institute of Standards and Technologyhttp://www.nist.gov/

Description

This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

Clear search

Close search

Google apps

Main menu

Challenge Round 0 (Dry Run) Test Dataset

Data from: Cross-Validation With Confidence

Training and Validation Datasets for Neural Network to Fill in Missing Data...

Performance of ML models on test data.

MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset -...

DRIVE Train/Validation Split Dataset

Data from: Robust Validation: Confident Predictions Even When Distributions...

Data from: A Neural Approach for Text Extraction from Scholarly Figures

A Neural Approach for Text Extraction from Scholarly Figures

Datasets

Testing

Validation

Training

Code

Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...

Abstract

Introduction

Methods

Subject Inclusion and Exclusion Criteria

Data Acquisition

Data Analysis

Usage Notes

Validation Data and Control Software for ATIC: Automated Testbed for...

Training dataset for NABat Machine Learning V1.0

CLASPP Training Testing Validation data (both initial prototype data and...

Finalized (25_05_11)

Prototype (24_05_07)

Out-of-distribution

Different Res and Neg ratios

Other related repos

5 Validation Testing Brightness Dataset

5 Validation Testing Brightness

25 Validation Testing Brightness Dataset

25 Validation Testing Brightness

Automated Cryptographic Validation Test System Generators and Validators

Testing dataset for the "long-bone-diaphyseal-CSG-Toolkit"

Data from: ViF-GTAD: A new Automotive Data Set with Ground Truth for ADAS/AD...

Training, validation and test datasets and model files for larger US Health...

Training and Testing Datasets for Machine Learning of Shortwave Radiative...

15 Validation Testing Brightness Dataset

15 Validation Testing Brightness

Challenge Round 0 (Dry Run) Test Dataset