Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Including the split of real and null reactions for training, validation and test
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
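A minimal sketch of the selection and split procedure described above (one reading of the cap rule; the column names, split proportions, and seed are illustrative assumptions, not the actual NABat pipeline):

```python
import pandas as pd

CAP = 1250  # cap on recordings per species/grid-cell combination (one reading of the text)

def select_files(files: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Randomly sample up to CAP files for each species/grid-cell combination."""
    return (
        files.groupby(["species", "grid_cell"], group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), CAP), random_state=seed))
    )

def random_split(files: pd.DataFrame, seed: int = 0):
    """Random train/validation/test split; the 80/10/10 proportions are an assumption."""
    shuffled = files.sample(frac=1.0, random_state=seed)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]
    val = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n) :]  # the holdout set excluded from this release
    return train, val, test
```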
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have just performed a train/test/validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
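The pre-split dataset can be loaded with the Hugging Face datasets library; a minimal sketch, assuming the splits use the conventional train/validation/test names:

```python
from datasets import load_dataset

ds = load_dataset("disham993/alpaca-train-validation-test-split")
print(ds)              # expected: a DatasetDict with train/validation/test splits
print(ds["train"][0])  # a single instruction/input/output record
```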
No license specified, https://academictorrents.com/nolicensespecified
Challenge 2 Image Sets. Training data is accompanied by interpolated steering values. Test data only has center image frames.
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed to provide large-scale datasets to TREC and to create a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

As in previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The uptake transporter OATP1B1 (SLCO1B1) is largely localized to the sinusoidal membrane of hepatocytes and is a known victim of unwanted drug–drug interactions. Computational models are useful for identifying potential substrates and/or inhibitors of clinically relevant transporters. Our goal was to generate OATP1B1 in vitro inhibition data for [3H] estrone-3-sulfate (E3S) transport in CHO cells and use it to build machine learning models, facilitating a comparison of eight different classification models (deep learning, AdaBoosted decision trees, Bernoulli naïve Bayes, k-nearest neighbors (knn), random forest, support vector classifier (SVC), logistic regression (lreg), and XGBoost (xgb)) using ECFP6 fingerprints to perform 5-fold, nested cross-validation. In addition, we compared models using 3D pharmacophores, simple chemical descriptors alone or plus ECFP6, as well as ECFP4 and ECFP8 fingerprints. Several machine learning algorithms (SVC, lreg, xgb, and knn) had excellent nested cross-validation statistics, particularly for accuracy, AUC, and specificity. An external test set containing 207 unique compounds not in the training set demonstrated that, at every threshold, SVC outperformed the other algorithms based on a rank-normalized score. A prospective validation test set was chosen using prediction scores from the SVC models with ECFP fingerprints and tested in vitro; 15 of 19 compounds (84% accuracy) predicted as active (≥20% inhibition) showed inhibition. Of these compounds, six (abamectin, asiaticoside, berbamine, doramectin, mobocertinib, and umbralisib) appear to be novel inhibitors of OATP1B1 not previously reported. These validated machine learning models can now be used to make predictions for drug–drug interactions for human OATP1B1 alongside other machine learning models for important drug transporters in our MegaTrans software.
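As a rough illustration (not the authors' code): ECFP6 fingerprints correspond to Morgan fingerprints of radius 3 in RDKit, and the 5-fold nested cross-validation can be set up with scikit-learn. The hyperparameter grid and scoring choice below are assumptions:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def ecfp6(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """ECFP6 = Morgan fingerprint with radius 3."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits))

# X = np.stack([ecfp6(s) for s in smiles_list]); y = binary inhibition labels
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
svc = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner, scoring="roc_auc")
# nested CV: the grid search runs inside each outer training fold
# scores = cross_val_score(svc, X, y, cv=outer, scoring="roc_auc")
```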
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training, validation, and test data for the model presented in the paper 'A Little Data Goes A Long Way: Automating Seismic Phase Arrival Picking at Nabro Volcano with Transfer Learning', submitted to Journal of Geophysical Research: Solid Earth.
Files:
- train_events_2498.h5 = training set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- train_events_2498.pkl = event training set metadata (UTC P-/S-wave phase arrival times)
- train_noise_2498.h5 = training set of seismic waveforms (noise sections only, i.e., no event waveforms)
- train_noise_2498.pkl = noise training set metadata (UTC time for training noise waveforms)
- val_events.h5 = validation set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- val_events.pkl = event validation set metadata (UTC P-/S-wave phase arrival times)
- val_noise.h5 = validation set of seismic waveforms (noise sections only, i.e., no event waveforms)
- val_noise.pkl = noise validation set metadata (UTC time for validation noise waveforms)
- test.h5 = test set of seismic waveforms (events and noise)
- test_events.pkl = event test set metadata (UTC P-/S-wave phase arrival times for test event waveforms)
- test_noise.pkl = noise test set metadata (UTC time for test noise waveforms)
- nabro_2011-247.mseed = 24 hours of seismic data from the Nabro Urgency Array (2011-09-04), saved in mseed format (e.g., can be read with obspy)
- nabro_2011-269.mseed = 24 hours of seismic data from the Nabro Urgency Array (2011-09-26), saved in mseed format (e.g., can be read with obspy)
Further details and code for reading and using these files can be found at the GitHub repo for this paper: https://github.com/sachalapins/U-GPD
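A minimal sketch of reading each file type (the internal layout of the .h5 files is not documented here; the repo above contains the authoritative loading code):

```python
import h5py
import pandas as pd
from obspy import read

stream = read("nabro_2011-247.mseed")            # 24 hours of Nabro Urgency Array data
meta = pd.read_pickle("train_events_2498.pkl")   # UTC P-/S-wave arrival time metadata

with h5py.File("train_events_2498.h5", "r") as f:
    f.visit(print)  # list groups/datasets to discover the waveform layout
```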
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Inspect Dataset: test_dataset
Dataset Information
This dataset was created using the create_inspect_dataset function from the deception_sprint package on 2025-05-02.
Model Information
Model: vllm/meta-llama/Llama-3.2-1B-Instruct
Task Information
Tasks: deception_sprint/wmdp_bio, deception_sprint/wmdp_chem, deception_sprint/wmdp_cyber, deception_sprint/cybermetric_2000, deception_sprint/sec_qa_v1, deception_sprint/sec_qa_v2… See the full description on the dataset page: https://huggingface.co/datasets/aisi-whitebox/training-set-mo-v1-test.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
After fast mean shift (FMS) clustering, the whole research area was divided into 10 subareas, and new samples characterizing the geographical features of each subarea were collected through field investigations. Because of our limited human and material resources, it was difficult to conduct mass sampling in each subarea, so we needed a reasonable field sampling strategy to make the most of the resources available. For the first two, large subareas, we collected 70 field samples each, labeled as the first and second sample sets; these were used to build their own GWR models for prediction at unobserved points within each area, i.e., local extension prediction. The remaining 8 small subareas received moderate numbers of samples according to their size: if a subarea contained more than 5,000 raster points, 16 samples were collected from it; otherwise, 12. In this way, a total of 112 samples were put together as the third sample set, and the third GWR model was constructed to achieve global extension prediction across the 8 subareas. In addition, the three sample sets were each divided into a training set and a test set. For the first two sample sets, the ratio of training to test samples was 5:2, i.e., each training set contains 50 samples and each test set 20 samples. Because the third sample set is composed of samples from 8 subareas, we divided the samples of each subarea into training and test sets at a ratio of 3:1. In other words, the training-set sample numbers for the third to tenth subareas are 12, 9, 9, 12, 9, 12, 12, and 9 respectively (84 training samples in total), and the test-set numbers for the eight subareas are 4, 3, 3, 4, 3, 4, 4, and 3 respectively (28 samples in total).
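A small sketch of the sampling and splitting rules just described (subarea sizes are hypothetical placeholders):

```python
import random

def samples_for_small_subarea(n_raster_points: int) -> int:
    """Subareas 3-10: 16 samples if the subarea has more than 5,000 raster points, else 12."""
    return 16 if n_raster_points > 5000 else 12

def split_3_to_1(samples: list) -> tuple[list, list]:
    """Per-subarea 3:1 train/test split used for the third sample set."""
    shuffled = random.sample(samples, len(samples))
    cut = len(samples) * 3 // 4
    return shuffled[:cut], shuffled[cut:]

# e.g. a subarea with 6,000 raster points gets 16 samples, split into 12 train / 4 test
```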
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, which requires that you cite the dataset in any outputs; as an additional condition, you must acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270,287
Fields:
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset is generated in two steps using the GPT-4o model.
- In the first step, topic descriptions relevant to the competition are generated using the prompt below. By running this prompt multiple times, over 3,000 descriptions were collected.
- In the second step, SVG code is generated for each collected description using the second prompt below.
prompt = f"""
I am participating in an SVG code generation competition. The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.

To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?

Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.

Example topics: a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.

Please return the 100 topics in csv format.
"""
prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.

Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`

Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. Focus on a clear and concise representation of the input description within the given limitations. Always give the complete SVG code with nothing omitted. Never use an ellipsis.

The code is scored based on similarity to the description, visual question answering, and aesthetic components. Please generate a detailed SVG code accordingly.

input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
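A hedged sketch of that filtering step, assuming the google/siglip-base-patch16-224 checkpoint and cairosvg for rasterization (the author's exact choices are not stated):

```python
import io
import torch
import cairosvg
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def siglip_score(description: str, svg_code: str) -> float:
    png = cairosvg.svg2png(bytestring=svg_code.encode())  # rasterize the SVG
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits_per_image
    return torch.sigmoid(logit).item()  # SigLIP similarities pass through a sigmoid

# keep the sample only if siglip_score(text, svg) > 0.5
```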
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1-948; validation set images were randomly sampled from sol range 949-1920; test set images were randomly sampled from sol range 1921-2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
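The sol-based split can be expressed as a simple lookup (sol ranges taken directly from the description above):

```python
def split_for_sol(sol: int) -> str:
    """Assign an image to a split by its sol (Martian day) of acquisition."""
    if 1 <= sol <= 948:
        return "train"
    if 949 <= sol <= 1920:
        return "validation"
    if 1921 <= sol <= 2224:
        return "test"
    raise ValueError(f"sol {sol} is outside the data set's sol range")
```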
Directory Contents
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, string-to-integer mappings, and distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images: 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. We employed five different methods to augment images; images taken by the Mastcam left and right eye cameras were augmented using horizontal flipping only, while images taken by the MAHLI camera were augmented using all five methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.
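As an illustration, horizontal flipping (the one method applied to images from all three instruments) could look like this; the remaining four methods are not enumerated in this description:

```python
from PIL import Image

def augment_horizontal_flip(src_path: str, dst_path: str) -> None:
    """Save a horizontally flipped copy of an image."""
    with Image.open(src_path) as img:
        img.transpose(Image.FLIP_LEFT_RIGHT).save(dst_path)
```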
Acknowledgment
The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for the continuous support of this work.
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
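The centering step can be sketched as follows (an illustration of the described procedure, not the original NIST code):

```python
import numpy as np
from scipy import ndimage

def center_digit(digit20: np.ndarray) -> np.ndarray:
    """Place a 20x20 digit on a 28x28 canvas with its center of mass at the center."""
    canvas = np.zeros((28, 28), dtype=float)
    canvas[4:24, 4:24] = digit20                           # paste the size-normalized digit
    cy, cx = ndimage.center_of_mass(canvas)                # intensity-weighted centroid
    return ndimage.shift(canvas, (13.5 - cy, 13.5 - cx))   # translate centroid to center
```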
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 600 sets of data in the training set and 200 sets of data in the test set.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Images scaled and modified to form a training set dataset. It can be used to detect and identify object type based on material type in the image. In this process, both a training data set and a test data set can be generated from these image files.
aisi-whitebox/test-collection-to-training-set dataset hosted on Hugging Face and contributed by the HF Datasets community
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of ids for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
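A hedged sketch of drawing a stratified validation split from a training set, in the spirit of the provided id sets (the file name and label column are assumptions about the corpus layout):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# file name is an assumption; the corpus ships category-specific pair files
pairs = pd.read_json("computers_train_small.json.gz", lines=True)
train_df, val_df = train_test_split(
    pairs, test_size=0.2, stratify=pairs["label"], random_state=42
)
```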
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Antiviral peptides (AVPs) are bioactive peptides that exhibit inhibitory activity against viruses through a range of mechanisms. Virus entry inhibitory peptides (VEIPs) make up a specific class of AVPs that can prevent enveloped viruses from entering cells. With the growing number of experimentally verified VEIPs, there is an opportunity to use machine learning to predict peptides that inhibit virus entry. In this paper, we have developed the first target-specific prediction model for the identification of new VEIPs using, along with the peptide sequence characteristics, the attributes of the envelope proteins of the target virus, which overcomes the problem of insufficient data for particular viral strains and improves the predictive ability. The model’s performance was evaluated through 10 repeats of 10-fold cross-validation on the training data set, and the results indicate that it can predict VEIPs with 87.33% accuracy and a Matthews correlation coefficient (MCC) of 0.76. The model also performs well on an independent test set, with 90.91% accuracy and an MCC of 0.81. We have also developed an automatic computational tool that predicts VEIPs, which is freely available at https://dbaasp.org/tools?page=linear-amp-prediction.
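The evaluation protocol (10 repeats of 10-fold cross-validation, scored with accuracy and MCC) can be sketched with scikit-learn; the classifier below is a placeholder, not the published model:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
mcc = make_scorer(matthews_corrcoef)
clf = RandomForestClassifier(random_state=0)  # placeholder classifier
# X: peptide + envelope-protein features, y: VEIP labels (assumed available)
# scores = cross_val_score(clf, X, y, cv=cv, scoring=mcc)
```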
Titanic Dataset Description

Overview

The data is divided into two groups:
- Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on "features" like gender and class. Feature engineering can also be applied to create new features.
- Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
Additionally, gender_submission.csv is provided as an example submission file, containing predictions based on the assumption that all and only female passengers survive.
Data Dictionary

| Variable | Definition | Key |
|----------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: Proxy for socio-economic status (SES):
1st = Upper
2nd = Middle
3rd = Lower
age:
Fractional if less than 1 year.
Estimated ages are represented in the form xx.5.
sibsp: Defines family relations as:
Sibling: Brother, sister, stepbrother, stepsister.
Spouse: Husband, wife (excluding mistresses and fiancés).
parch: Defines family relations as:
Parent: Mother, father.
Child: Daughter, son, stepdaughter, stepson.
Some children traveled only with a nanny, so parch = 0 for them.
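A minimal sketch reproducing the gender_submission baseline described above (PassengerId and Sex are the standard column names in test.csv):

```python
import pandas as pd

test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),  # all and only females survive
})
submission.to_csv("submission.csv", index=False)
```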