100+ datasets found
  1. MNIST dataset with train, validation and test sets

    • kaggle.com
    zip
    Updated Nov 10, 2024
    Cite
    David Guardamino Ojeda (2024). MNIST dataset with train, validation and test sets [Dataset]. https://www.kaggle.com/datasets/davidguardaminoojeda/nmist-dataset-with-train-validation-and-test-sets
    Explore at:
    Available download formats: zip (17535459 bytes)
    Dataset updated
    Nov 10, 2024
    Authors
    David Guardamino Ojeda
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    X_train, y_train = train_set[0], train_set[1]
    X_validation, y_validation = validation_set[0], validation_set[1]
    X_test, y_test = test_set[0], test_set[1]

    print("Shape of X_train: ", X_train.shape)
    print("Shape of y_train: ", y_train.shape)
    print("Shape of X_validation: ", X_validation.shape)
    print("Shape of y_validation: ", y_validation.shape)
    print("Shape of X_test: ", X_test.shape)
    print("Shape of y_test: ", y_test.shape)


    Create Pandas DataFrames from the datasets

    train_index = range(0,len(X_train))

    validation_index = range(len(X_train), len(X_train)+len(X_validation))

    test_index = range(len(X_train)+len(X_validation), len(X_train)+len(X_validation)+len(X_test))

    X_train = pd.DataFrame(data=X_train, index=train_index)
    y_train = pd.Series(data=y_train, index=train_index)

    X_validation = pd.DataFrame(data=X_validation, index=validation_index)
    y_validation = pd.Series(data=y_validation, index=validation_index)

    X_test = pd.DataFrame(data=X_test, index=test_index)
    y_test = pd.Series(data=y_test, index=test_index)
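
    The point of the index ranges above is that the three DataFrames get consecutive, non-overlapping row labels. A minimal sketch with synthetic stand-in arrays (the real train_set/validation_set/test_set come from the dataset download; the shapes here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real MNIST splits (shapes for illustration only)
X_train = np.random.rand(6, 4)
X_validation = np.random.rand(2, 4)
X_test = np.random.rand(2, 4)

# Consecutive, non-overlapping index ranges, as in the snippet above
train_index = range(0, len(X_train))
validation_index = range(len(X_train), len(X_train) + len(X_validation))
test_index = range(len(X_train) + len(X_validation),
                   len(X_train) + len(X_validation) + len(X_test))

X_train = pd.DataFrame(data=X_train, index=train_index)
X_validation = pd.DataFrame(data=X_validation, index=validation_index)
X_test = pd.DataFrame(data=X_test, index=test_index)

# Concatenating the three frames therefore yields a gap-free, duplicate-free index
full = pd.concat([X_train, X_validation, X_test])
print(full.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```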

  2. Prediction R-squared evaluated in testing data sets (average over 30...

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Gustavo de los Campos; Ana I. Vazquez; Rohan Fernando; Yann C. Klimentidis; Daniel Sorensen (2023). Prediction R-squared evaluated in testing data sets (average over 30 randomly drawn testing data sets, each having 500 individuals) by training and validation data sets and model. [Dataset]. http://doi.org/10.1371/journal.pgen.1003608.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Gustavo de los Campos; Ana I. Vazquez; Rohan Fernando; Yann C. Klimentidis; Daniel Sorensen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    N-FHS = Number of records from Framingham, N-GEN = Number of records from GENEVA. G-BLUP uses 400 K SNPs; wG-BLUP also uses 400 K SNPs, but the contribution of each SNP to the genomic relationship matrix was weighted, with the weight derived from the SNP-associated p-value reported by [5].

  3. Overall distribution of training, validation, and test data.

    • datasetcatalog.nlm.nih.gov
    Updated Apr 25, 2023
    Cite
    Dahal, Keshab Raj; Gupta, Ankrit; Pokhrel, Nawa Raj; Joshi, Jeorge; Gaire, Santosh; Mahatara, Sharad; Joshi, Rajendra P.; Banjade, Huta R. (2023). Overall distribution of training, validation, and test data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000993481
    Explore at:
    Dataset updated
    Apr 25, 2023
    Authors
    Dahal, Keshab Raj; Gupta, Ankrit; Pokhrel, Nawa Raj; Joshi, Jeorge; Gaire, Santosh; Mahatara, Sharad; Joshi, Rajendra P.; Banjade, Huta R.
    Description

    Overall distribution of training, validation, and test data.

  4. Atmospheric Machine Learning Emulation Challenge, 1st Ed. (AMLEC-1)

    • zenodo.org
    zip
    Updated Nov 21, 2025
    Cite
    Jorge Vicent Servera; Jorge Vicent Servera (2025). Atmopheric Machine Learning Emulation Challenge, 1st Ed. (AMLEC-1) [Dataset]. http://doi.org/10.5281/zenodo.17670939
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Jorge Vicent Servera; Jorge Vicent Servera
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: Unzip files to have access to the original data files (.h5, .csv, .xml)

    This dataset contains the train and test data used for the first edition of the Atmospheric Machine Learning Emulation Challenge (AMLEC-1), presented at ECMLPKDD 2025 (https://ecmlpkdd.org/2025/discovery-challenges/) and carried out within the EU ELIAS project (https://elias-ai.eu/opportunities/amlec/).

    The dataset contains a series of .h5 files storing MODTRAN6 spectral simulations (transmittances, spherical albedo, path radiance), computed under various atmospheric and geometric conditions for two scenarios, atmospheric correction of hyperspectral data (A) and CO2 retrieval (B), each associated with its own spectral configuration.

    The training data (i.e., inputs and outputs of RTM simulations) is stored in HDF5 format with the following structure:

    Dimensions
    • n_wvl: Number of wavelengths for which spectral data is provided
    • n_funcs: Number of atmospheric transfer functions
    • n_comb: Number of data points at which spectral data is provided
    • n_param: Dimensionality of the input variable space

    Data Components
    • LUTdata: Atmospheric transfer functions (i.e., the outputs). Dimensions: n_funcs*n_wvl x n_comb. Datatype: single
    • LUTHeader: Matrix of input variable values for each combination (i.e., the inputs). Dimensions: n_param x n_comb. Datatype: double
    • wvl: Wavelength values associated with the atmospheric transfer functions (i.e., the spectral grid). Dimensions: n_wvl. Datatype: double

    Note: Participants may choose to predict the spectral data either as a single vector of length n_funcs*n_wvl or as n_funcs separate vectors of length n_wvl.
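
    The two output layouts mentioned in the note differ only by a reshape. A sketch with NumPy, assuming the stacked vector stores the n_wvl values of each transfer function contiguously (the actual ordering should be verified against the .h5 file; sizes here are made up for illustration):

```python
import numpy as np

# Illustrative sizes; the real values come from the .h5 dimensions
n_funcs, n_wvl, n_comb = 3, 5, 4

# LUTdata laid out as a (n_funcs*n_wvl) x n_comb matrix, assuming the
# n_wvl values of each transfer function are stored contiguously
LUTdata = np.arange(n_funcs * n_wvl * n_comb, dtype=float).reshape(n_funcs * n_wvl, n_comb)

# The stacked spectral vector for one data point...
point = LUTdata[:, 0]

# ...split into n_funcs separate vectors of length n_wvl
per_func = point.reshape(n_funcs, n_wvl)
print(per_func.shape)  # (3, 5)
```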

    Testing input datasets (i.e., inputs for predictions) are stored in a tabulated .csv format with dimensions n_param x n_comb. During the challenge, participants only had access to this .csv data, while here we also provide the reference spectral simulations used for evaluation.

    The training and testing data are organized into scenario-specific folders: scenarioA (Atmospheric Correction) and scenarioB (CO2 Column Retrieval). Each folder contains:

    • A train subfolder with multiple .h5 files corresponding to different training sample sizes (e.g., train2000.h5 contains 2000 samples).
    • A reference subfolder containing two test files (refInterp and refExtrap) referring to the two aforementioned tracks (i.e., interpolation and extrapolation).

    Here is an example of how to load each dataset in Python:

    import h5py
    import pandas as pd

    # Replace with the actual path to your training and testing data
    trainFile = 'train2000.h5'
    testFile = 'refInterp.csv'

    # Open the H5 file and read the training arrays
    with h5py.File(trainFile, 'r') as h5_file:
        Ytrain = h5_file['LUTdata'][:]
        Xtrain = h5_file['LUTHeader'][:]
        wvl = h5_file['wvl'][:]

    # Read testing data
    df = pd.read_csv(testFile)
    Xtest = df.to_numpy()
    

    in Matlab:

    % Replace with the actual path to your training and testing data
    trainFile = 'train2000.h5';
    testFile = 'refInterp.csv';

    % Open the H5 file and read the training arrays
    Ytrain = h5read(trainFile,'/LUTdata');
    Xtrain = h5read(trainFile,'/LUTHeader');
    wvl = h5read(trainFile,'/wvl');

    % Read testing data
    Xtest = importdata(testFile);
    

    and in R language:

    library(rhdf5)

    # Replace with the actual path to your training and testing data
    trainFile <- "train2000.h5"
    testFile <- "refInterp.csv"

    # Open the H5 file and read the training arrays
    lut_data <- h5read(trainFile, "LUTdata")
    lut_header <- h5read(trainFile, "LUTHeader")
    wavelengths <- h5read(trainFile, "wvl")

    # Read testing data
    Xtest <- as.matrix(read.table(testFile, sep = ",", header = TRUE))
    
  5. Marmoset - train and test data - Vdataset - LDM

    • service.tib.eu
    Updated Nov 14, 2025
    + more versions
    Cite
    (2025). Marmoset - train and test data - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/goe-doi-10-25625-dyg3kv
    Explore at:
    Dataset updated
    Nov 14, 2025
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Contains recordings and manual annotations of calls from pairs of male and female marmosets. Manual annotations were created by the original authors and manually corrected for training and testing DAS. Original data source for the recordings and the annotations: https://osf.io/q4bm3/ Original reference: Landman R, Sharma J, Hyman JB, Fanucci-Kiss A, Meisner O, Parmar S, Feng G, Desimone R. 2020. Close-range vocal interaction in the common marmoset (Callithrix jacchus). PLOS ONE 15:e0227392. doi:10.1371/journal.pone.0227392

  6. R-squared in longitudinal performance evaluation of the models with 20% of...

    • datasetcatalog.nlm.nih.gov
    Updated May 15, 2025
    Cite
    Nasir,; Gholami, Shahrzad; Marfin, Anthony; Dodhia, Rahul; Weeks, William B.; Bhat, Niranjan; Alderson, Mark; Ferres, Juan Lavista; Taliesin, Brian; Leader, Troy (2025). R-squared in longitudinal performance evaluation of the models with 20% of Pers-007 as test data for different training datasets. Subjects under the test set were excluded from each training set. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002083413
    Explore at:
    Dataset updated
    May 15, 2025
    Authors
    Nasir,; Gholami, Shahrzad; Marfin, Anthony; Dodhia, Rahul; Weeks, William B.; Bhat, Niranjan; Alderson, Mark; Ferres, Juan Lavista; Taliesin, Brian; Leader, Troy
    Description

    R-squared in longitudinal performance evaluation of the models with 20% of Pers-007 as test data for different training datasets. Subjects under the test set were excluded from each training set.

  7. Galaxy, star, quasar dataset

    • scidb.cn
    Updated Feb 3, 2023
    Cite
    Li Xin (2023). Galaxy, star, quasar dataset [Dataset]. http://doi.org/10.57760/sciencedb.07177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Li Xin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data used in this paper come from the 16th data release of SDSS (SDSS-DR16), which contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data were downloaded from the official SDSS website; specifically, they were obtained through the SkyServer API by running SQL query statements in the CasJobs sub-site. Because the current SDSS photometric table PhotoObj can only classify observed sources as point sources or extended sources, target sources are better classified as galaxies, stars, and quasars through their spectra. We therefore obtained calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj catalog, together with target position information (right ascension and declination). Calibrated sources can be distinguished precisely and quickly: each is labeled with the parameter "Class" as "galaxy", "star", or "quasar". Observation areas 3462, 3478, 3530, and four other areas in SDSS-DR16 were selected as experimental data, because a large number of sources can be obtained in these areas, providing rich sample data for the experiment. For example, there are 9891 sources in area 3462, including 2790 galaxy sources, 2378 stellar sources, and 4723 quasar sources; there are 3862 sources in area 3478, including 1759 galaxy sources, 577 stellar sources, and 1526 quasar sources. FITS files are a commonly used data format in the astronomical community. By cross-matching the catalog and FITS files in the local sky region, we obtained images in the five bands u, g, r, i, and z for 12499 galaxy sources, 16914 quasar sources, and 16908 star sources as training and testing data.

    1.1 Image Synthesis

    SDSS photometric data include images in the five bands u, g, r, i, and z, each packaged in single-band FITS files. Images in different bands contain different information. Since the g, r, and i bands contain more feature information and less noise, astronomical researchers typically map them to the R, G, and B channels to synthesize color photometric images. In general, different bands cannot be synthesized directly: if three bands are combined naively, the images may not be aligned. This paper therefore adopts the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images from the g, r, and i bands, which effectively avoids the alignment problem. Each photometric image is 2048×1489 pixels.

    1.2 Data tailoring

    The target images were first cropped; this can be done with image segmentation tools, and here the process was implemented in Python. During cropping, the right ascension and declination of each source in the catalog are converted to pixel coordinates on the photometric image through a coordinate transformation, which determines the source's exact position. Cropping is then carried out as a rectangular box centered on those coordinates. We found that the input image size affects the experimental results, so three cutout sizes were tested according to the target size of the sources: 40×40, 60×60, and 80×80. Through experiment and analysis, we found that the convolutional neural network learns better and achieves higher accuracy on data with small image sizes. In the end, we chose 40×40 cutouts for the extended-source galaxies and the point-source quasars and stars.

    1.3 Division of training and test data

    To give the algorithm accurate recognition performance, enough image samples are needed, and the selection of training, validation, and test sets is an important factor affecting the final recognition accuracy. In this paper, the training, validation, and test sets are split in the ratio 8:1:1. The validation set is used to tune the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning. The total sample size is 34,000 source images, including 11543 galaxy sources, 11967 star sources, and 10490 quasar sources.

    1.4 Data preprocessing

    In this experiment, the training and test sets are used as algorithm input after preprocessing; data quantity and quality largely determine the recognition performance of the algorithm. Preprocessing differs between the two sets. For the training set, we apply vertical flips, horizontal flips, and scaling to the cropped images to enrich the samples and improve the algorithm's generalization. Since the features of celestial sources are flip-invariant, the labels of galaxies, stars, and quasars do not change after these transformations. For the test set, preprocessing is simpler: we apply only scaling to the input images before feeding them to the network.
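
    The cropping and 8:1:1 split described above can be sketched in Python; the crop_box helper and the synthetic image are hypothetical stand-ins (only the image size, cutout size, and sample counts come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic photometric image (the paper's images are 2048 x 1489 pixels)
image = rng.random((2048, 1489))

def crop_box(img, row, col, size=40):
    """Cut a size x size box centred on pixel (row, col), as described above."""
    half = size // 2
    return img[row - half:row + half, col - half:col + half]

cutout = crop_box(image, 1000, 700, size=40)
print(cutout.shape)  # (40, 40)

# 8:1:1 split of the 34,000 source images
n = 34000
idx = rng.permutation(n)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))  # 27200 3400 3400
```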

  8. Data from: Data for Machine Learning Predictions of Nitrate in Shallow...

    • gimi9.com
    • data.usgs.gov
    Updated Oct 26, 2021
    + more versions
    Cite
    (2021). Data for Machine Learning Predictions of Nitrate in Shallow Groundwater in the Conterminous United States [Dataset]. https://gimi9.com/dataset/data-gov_data-for-machine-learning-predictions-of-nitrate-in-shallow-groundwater-in-the-conterminou/
    Explore at:
    Dataset updated
    Oct 26, 2021
    Area covered
    Contiguous United States, United States
    Description

    An extreme gradient boosting (XGB) machine learning model was developed to predict the distribution of nitrate in shallow groundwater across the conterminous United States (CONUS). Nitrate was predicted at a 1-square-kilometer (km) resolution at a depth of 10 m below the water table. The model builds on a previous XGB machine learning model developed to predict nitrate at domestic and public supply groundwater zones (Ransom and others, 2022) by incorporating additional monitoring well samples and by modifying and adding predictor variables. The shallow zone model included variables representing well characteristics, hydrologic conditions, soil type, geology, climate, oxidation/reduction, and nitrogen inputs. Predictor variables derived from empirical or numerical process-based models were also included to integrate information on controlling processes and conditions. This data release documents the model and provides the model results. Included in this data release are: 1) a model archive of the R project: source code and input files (including model training and testing data, rasters of all final predictor variables, and an output raster representing predicted nitrate concentration in the shallow zone), 2) a read_me.txt file describing the model archive and explaining its use and the modeling details, and 3) a table describing the model variables.

  9. Codes in R for spatial statistics analysis, ecological response models and...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin
    Updated Apr 24, 2025
    Cite
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya (2025). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. http://doi.org/10.5281/zenodo.7603557
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in Rstudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario, we selected the Generalized Linear Model (GLM) and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of its adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization for the codes immersed in the GLM and DOMAIN running:

    In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered important some factors such as the extent of the area and the type of study species for the correct selection of the number of points (Pers. Obs.). Then, we extracted the values of predictor variables (e.g., bioclimatic, topographic, demographic, habitat) in function of presence and background points (e.g., Hijmans and Elith, 2017).

    Subsequently, we subdivided both the presence and background point groups into 75% training and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and cross-iteration of 5,000 repetitions (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The obtained plots were the partial dependence blocks as a function of each predictor variable.

    Subsequently, the correlation of the variables is run by Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to consider a bivariate correlation ± 0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
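
    A minimal sketch of such a Pearson-based multicollinearity filter (the original codes are in R; the synthetic predictor matrix and variable handling here are hypothetical, only the ±0.70 threshold comes from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictor matrix: column 1 is nearly a copy of column 0,
# the remaining columns are independent
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)

# Pairwise Pearson correlations between columns
r = np.corrcoef(X, rowvar=False)

# Flag the later member of any pair whose |r| exceeds the 0.70 threshold
to_drop = set()
for i in range(r.shape[0]):
    for j in range(i + 1, r.shape[1]):
        if abs(r[i, j]) > 0.70:
            to_drop.add(j)
print(sorted(to_drop))  # [1]
```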

    Once the above codes were run, we uploaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-significance value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree to obtain linear and quadratic response (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots included the probability of occurrence and values for continuous variables or categories for discrete variables. The points of the presence and background training group are also included.

    On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix, including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we select an arbitrary boundary of 0.5 to obtain better modeling performance and avoid high percentage of bias in type I (omission) or II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                  Validation set
    Model         True        False
    Presence      A           B
    Background    C           D

    We then calculated the Overall and True Skill Statistics (TSS) metrics. The first is used to assess the proportion of correctly predicted cases, while the second metric assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). This metric also gives equal importance to the prevalence of presence prediction as to the random performance correction (Fielding and Bell, 1997; Allouche et al., 2006).
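
    With the Table 1 notation (A true presences, B false presences, C true backgrounds, D false backgrounds), the two metrics can be sketched as follows; the counts in the example are hypothetical:

```python
def overall_accuracy(a, b, c, d):
    """Proportion of correctly predicted cases (A and C) among all cases."""
    return (a + c) / (a + b + c + d)

def true_skill_statistic(a, b, c, d):
    """TSS = sensitivity + specificity - 1 (Allouche et al., 2006)."""
    sensitivity = a / (a + d)  # true presences among all actual presences
    specificity = c / (c + b)  # true backgrounds among all actual backgrounds
    return sensitivity + specificity - 1

# Hypothetical counts for illustration
print(overall_accuracy(80, 10, 90, 20))      # 0.85
print(true_skill_statistic(80, 10, 90, 20))  # ~0.70
```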

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model predicts the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).

  10. urban forest

    • data.mendeley.com
    Updated Jul 28, 2022
    Cite
    Fatwa Ramdani (2022). urban forest [Dataset]. http://doi.org/10.17632/j739yc6cgc.1
    Explore at:
    Dataset updated
    Jul 28, 2022
    Authors
    Fatwa Ramdani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains: 1. PlanetScope satellite imagery of the University of Brawijaya with 3 m spatial resolution. 2. Training and testing data in CSV format. 3. An R script of four different algorithms (XGBoost, Random Forest, Support Vector Machine, and Neural Networks). The manuscript using this dataset has been submitted to F1000Research (https://f1000research.com/).

  11. Handful of Pixels - machine learning data

    • resodate.org
    • zenodo.org
    Updated Aug 30, 2023
    Cite
    Koen Hufkens (2023). Handful of Pixels - machine learning data [Dataset]. https://resodate.org/resources/aHR0cHM6Ly96ZW5vZG8ub3JnL3JlY29yZHMvODI5ODQ5MQ==
    Explore at:
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Koen Hufkens
    Description

    Training and testing data within the context of the Handful of Pixels course, and in particular the Land-Use and Land-Cover mapping chapter. The data include samples from 300 random locations across 10 land cover classes, drawn from the data described in Fritz et al. 2017 and downloaded through the AppEEARS API using the {appeears} R package (Hufkens et al. 2023). An 80% split is executed on this larger dataset, with 240 locations retained for training while the remainder is used for testing. For the testing data the input data are shared but the labels are withheld (stored in a closed release of this archive, accessible on reasonable request). This data can be used within the context of small demonstration machine learning exercises or competitions.
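
    The 80% split described above can be reproduced schematically; the location IDs and seed are hypothetical placeholders, only the 300-location total and 240/60 split come from the text:

```python
import random

random.seed(42)

# Hypothetical IDs standing in for the 300 sampled locations
locations = list(range(300))
random.shuffle(locations)

# 80% split: 240 locations retained for training, 60 for testing
train = locations[:240]
test = locations[240:]
print(len(train), len(test))  # 240 60
```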

    Data structure

    The data contains all seven (7) bands of the MODIS MCD43A4 data product for the year 2012. Band names are indicated in full. In addition MODIS MOD11A2 daytime land surface temperature (LST) data is provided, where band names only contain the date (YYYY-MM-DD) of acquisition. Additional indices can be calculated from these band combinations if so desired.

    Data is provided in compressed serialized R rds files, and can be read into R as follows:

    df <- readRDS("training_data.rds")

    References

    Fritz, Steffen, Linda See, Christoph Perger, Ian McCallum, Christian Schill, Dmitry Schepaschenko, Martina Duerauer, et al. A Global Dataset of Crowdsourced Land Cover and Land Use Reference Data. Scientific Data 4, no. 1 (June 13, 2017): 170075. https://doi.org/10.1038/sdata.2017.75.

    Koen Hufkens. (2023). bluegreen-labs/appeears: appeears: an interface to the NASA AppEEARS API (v1.0). Zenodo. https://doi.org/10.5281/zenodo.7958270

  12. Random forest regression model and prediction rasters of fluoride in...

    • data.usgs.gov
    + more versions
    Cite
    Celia Rosecrans; Katherine Ransom; Olga Rodriguez, Random forest regression model and prediction rasters of fluoride in groundwater in basin-fill aquifers of western United States [Dataset]. http://doi.org/10.5066/P991L1ZR
    Explore at:
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Celia Rosecrans; Katherine Ransom; Olga Rodriguez
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Oct 1, 2000 - Jul 1, 2018
    Area covered
    Western United States, United States
    Description

    A random forest regression (RFR) model was developed to predict groundwater fluoride concentrations in four principal aquifers of the western United States: the California Coastal basin-fill aquifers, the Central Valley aquifer system, the Basin and Range basin-fill aquifers, and the Rio Grande aquifer system. The selected basin-fill aquifers are a vital resource for drinking-water supplies. The RFR model was developed with a dataset of over 12,000 wells sampled for fluoride between 2000 and 2018. This data release provides rasters of predicted fluoride concentrations at depths typical of domestic and public supply wells in the selected basin-fill aquifers, and includes the final RFR model, which documents the prediction modeling process and verifies and reproduces the model fit metrics and mapped predictions in the accompanying publication. Included in this data release are 1) a model archive of the R project including source code, input files (model training and testing data and rasters of predictor ...

  13. Testing and training data for machine learning models to detect, classify...

    • agdatacommons.nal.usda.gov
    bin
    Updated Feb 22, 2026
    Cite
    Jessica L. Duttenhefner; AbdElRahman A. ElSaid; Page E. Klug (2026). Testing and training data for machine learning models to detect, classify and count blackbirds damaging agriculture using drone-based imagery [Dataset]. http://doi.org/10.2737/NWRC-RDS-2025-002
    Explore at:
    binAvailable download formats
    Dataset updated
    Feb 22, 2026
    Dataset provided by
    USDA, APHIS, WS National Wildlife Research Center
    Authors
    Jessica L. Duttenhefner; AbdElRahman A. ElSaid; Page E. Klug
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We used drones to capture images of mixed-species blackbird (Icteridae) flocks damaging sunflower (Helianthus annuus) in North Dakota. Images included several blackbirds (Icteridae) that breed in North Dakota and are considered agricultural pests, including red-winged blackbirds (RWBL) (Agelaius phoeniceus), common grackles (Quiscalus quiscula), yellow-headed blackbirds (Xanthocephalus xanthocephalus), brown-headed cowbirds (Molothrus ater), and European starlings (Sturnidae: Sturnus vulgaris). This study was implemented between September 2021 and October 2022 in multiple counties in North Dakota, USA, where blackbird damage to sunflowers is prevalent. We simultaneously hazed and captured video and photographs of flocks with the drone, so the images consist of airborne flocks against sky, green vegetation, or tan vegetation backgrounds. Images were used to train and test two models: 1) a ResNet-18 convolutional neural network (CNN) model to detect flocks of varying size and distance from the camera and 2) Faster Region-based Convolutional Neural Network (Faster-RCNN) models to detect individual blackbirds, classify individual blackbirds by species and for RWBL sex and age class, and count blackbirds. The Faster-RCNN model required individual birds in the images to be manually annotated by trained biologists for model training.

    This data publication contains the data and R code used to analyze these data, as well as the 400 images used in the models to detect blackbird flocks, the 131 images used in the models to detect and classify individual blackbirds, and the 131 image annotation files. We designed this study to assess the efficacy of drone-based aerial imagery combined with deep learning algorithms to accurately detect mixed-species blackbird flocks, as well as detect, classify, and count individual birds on varying backgrounds. For more information about this study and these data, see Duttenhefner et al. (2025).

  14. SP500_data

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Franco Dicosola (2023). SP500_data [Dataset]. https://www.kaggle.com/datasets/francod/s-and-p-500-data
    Explore at:
    zip(39005 bytes)Available download formats
    Dataset updated
    May 28, 2023
    Authors
    Franco Dicosola
    Description

    Project Documentation: Predicting S&P 500 Price

    Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to help investors and financial professionals make informed decisions and manage their portfolios effectively.

    Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

    Steps Taken:

    1. Data Preparation and Exploration:
      • Loaded the dataset and performed initial exploration.
      • Checked for missing values and handled them where present.
      • Explored the statistical summary and distributions of the variables.
      • Conducted correlation analysis to identify potential features for prediction.
    2. Data Visualization and Analysis:
      • Plotted time series graphs to visualize the S&P 500 index and other variables over time.
      • Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques.
      • Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.
    3. Feature Engineering and Selection:
      • Selected relevant features based on correlation analysis and domain knowledge.
      • Explored feature importance using tree-based models and selected informative features.
      • Prepared the final feature set for model training.
    4. Model Training and Evaluation:
      • Split the dataset into training and testing sets.
      • Selected a regression model (Linear Regression) for price prediction.
      • Trained the model using the training set.
      • Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both the training and testing sets.
    5. Prediction and Interpretation:
      • Obtained predictions for future S&P 500 prices using the trained model.
      • Interpreted the predicted prices in the context of current market conditions and the percentage change from the current price.

    Limitations and Future Improvements:
      • The predictive performance of the model is based on the available features and historical data, and may not capture all the complexities and factors influencing the S&P 500 index.
      • The model's accuracy and reliability are subject to the quality and representativeness of the training data.
      • The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
      • Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
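The model training and evaluation step above (train/test split, Linear Regression, MSE and R^2 on both sets) can be sketched with scikit-learn. The feature columns below are illustrative stand-ins for the dataset's dividends/earnings/CPI fields, fitted on synthetic data rather than the actual S&P 500 file:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-ins for the dataset's fundamental features.
df = pd.DataFrame({
    "dividends": rng.uniform(1, 5, n),
    "earnings": rng.uniform(10, 100, n),
    "cpi": rng.uniform(150, 300, n),
})
# Synthetic target with a linear dependence on earnings and CPI plus noise.
price = 20 * df["earnings"] + 5 * df["cpi"] + rng.normal(0, 50, n)

# Split, fit, and report MSE / R^2 on both training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    df, price, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X)
    print(f"{name}: MSE={mean_squared_error(y, pred):.1f}  "
          f"R^2={r2_score(y, pred):.3f}")
```

For real price data, a time-ordered split (training on earlier dates, testing on later ones) is usually preferable to a random split, since shuffled splits leak future information into training.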

  15. A Consensus of In-silico Sequence-based Modeling Techniques for...

    • narcis.nl
    • data.mendeley.com
    Updated Nov 1, 2020
    + more versions
    Cite
    Mall, R (via Mendeley Data) (2020). A Consensus of In-silico Sequence-based Modeling Techniques for Compound-Viral Protein Activity Prediction for SARS-COV-2 [Dataset]. http://doi.org/10.17632/8rrwnbcgmx.1
    Explore at:
    Dataset updated
    Nov 1, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Mall, R (via Mendeley Data)
    Description

    Here we provide the datasets used for training and testing of the end-to-end supervised deep learning models as well as the datasets used with vector representations of compounds and proteins and passed to supervised state-of-the-art machine learning models (XGBoost, RF, SVM). We also provide the full list of viral proteins with their sequences used for the protein autoencoder along with the list of SMILES representations of compounds used for the compound autoencoder.

  16. UNSW Training and Testing Dataset

    • kaggle.com
    zip
    Updated Feb 8, 2024
    Cite
    Subhasri R A (2024). UNSW Training and Testing Dataset [Dataset]. https://www.kaggle.com/datasets/subhasrira/unsw-training-and-testing-dataset
    Explore at:
    zip(12484064 bytes)Available download formats
    Dataset updated
    Feb 8, 2024
    Authors
    Subhasri R A
    Description

    Dataset

    This dataset was created by Subhasri R A

    Contents

  17. R-CAUSTIC: Rippling CAUSTICs underwater Image dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 24, 2025
    Cite
    Panagiotis Agrafiotis (2025). R-CAUSTIC: Rippling CAUSTICs underwater Image dataset [Dataset]. http://doi.org/10.5281/zenodo.6467283
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Panagiotis Agrafiotis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Version 2 available! Please make sure to download the latest version of the dataset!

    Description

    Rippling caustics seem to be the main factor degrading underwater RGB image quality and affecting the image-based 3D reconstruction process in very shallow waters. These effects throw off most image-matching algorithms, leading to less accurate matches and causing issues in the Simultaneous Localization and Mapping (SLAM) based navigation of Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) in shallow waters. They are also the main cause of dissimilarities in the generated textures and orthoimages. To fill the gap in the literature regarding underwater rippling caustics imagery with real ground truth and reference images, the first real-world underwater caustics benchmark dataset, containing 1465 underwater images, is presented. Together with the RGB imagery, the corresponding generated ground truth images are delivered to facilitate the training and testing of machine learning and deep learning methods for image classification. The R-CAUSTIC dataset also provides the data needed to evaluate, at least to some extent, the performance of 3D reconstruction approaches. Data were acquired with a tripod-mounted GoPro Hero 4 Black action camera with image dimensions of 4000 x 3000 pixels, a focal length of 2.77 mm and a pixel size of 1.55 μm. Action cameras are widely used for underwater image acquisition. The dataset was captured at near-shore underwater sites at depths varying from 0.5 to 2 m. No artificial light sources were used. Due to the wind, the turbulent surface of the water created dynamic rippling caustics on the seabed. In total, 1465 RGB images were collected, separated into 7 different datasets: five containing stereo images, one tri-stereo images, and one multi-stereo imagery acquired in 7 different camera poses.

    Publication

    The paper is available in Open Access here: https://ieeexplore.ieee.org/document/10172291

    If you use this dataset please cite it as R-CAUSTIC [Reference].
    [Reference]: P. Agrafiotis, K. Karantzalos and A. Georgopoulos, "Seafloor-Invariant Caustics Removal From Underwater Imagery," in IEEE Journal of Oceanic Engineering, vol. 48, no. 4, pp. 1300-1321, Oct. 2023, doi: 10.1109/JOE.2023.3277168.

    BibTeX:

    @ARTICLE{10172291,
      author={Agrafiotis, Panagiotis and Karantzalos, Konstantinos and Georgopoulos, Andreas},
      journal={IEEE Journal of Oceanic Engineering},
      title={Seafloor-Invariant Caustics Removal From Underwater Imagery},
      year={2023},
      volume={48},
      number={4},
      pages={1300-1321},
      doi={10.1109/JOE.2023.3277168}}

  18. Downsized camera trap images for automated classification

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Dec 1, 2022
    + more versions
    Cite
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers (2022). Downsized camera trap images for automated classification [Dataset]. http://doi.org/10.5281/zenodo.6627707
    Explore at:
    bin, zipAvailable download formats
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Description:

    Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions

    Funding: These data were collected as part of research funded by:

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx

    This file contains dataset metadata and 1 data table:

    1. Dataset Images (described in worksheet Dataset_images)

      Description: This worksheet details the composition of each dataset used in the analyses

      Number of fields: 69

      Number of data rows: 270287

      Fields:

      • filename: Root ID (Field type: id)
      • camera_trap_site: Site ID for the camera trap location (Field type: location)
      • taxon: Taxon recorded by camera trap (Field type: taxa)
      • dist_level: Level of disturbance at site (Field type: ordered categorical)
      • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type:

  19. The known/novel training and testing dataset composition when testing RDP on...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Yemin Lan; Qiong Wang; James R. Cole; Gail L. Rosen (2023). The known/novel training and testing dataset composition when testing RDP on all taxonomic levels. [Dataset]. http://doi.org/10.1371/journal.pone.0032491.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yemin Lan; Qiong Wang; James R. Cole; Gail L. Rosen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The known/novel training and testing dataset composition when testing RDP on all taxonomic levels.

  20. Rescaled CIFAR-10 dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. So that all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2k/4, with k being integers in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
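The nine test scales follow the pattern 2^(k/4) for integer k in [-4, 4]; a short check reproduces the scale suffixes encoded in the file names above:

```python
# Compute the nine image scaling factors 2**(k/4), k in [-4, 4], and format
# them as the "scteXpYYY" suffixes used in the dataset file names.
factors = [2 ** (k / 4) for k in range(-4, 5)]
names = [f"scte{f:.3f}".replace(".", "p") for f in factors]
print(names)
# -> ['scte0p500', 'scte0p595', 'scte0p707', 'scte0p841', 'scte1p000',
#     'scte1p189', 'scte1p414', 'scte1p682', 'scte2p000']
```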

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
      x_train = np.array(f["/x_train"], dtype=np.float32)
      x_val = np.array(f["/x_val"], dtype=np.float32)
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_train = np.array(f["/y_train"], dtype=np.int32)
      y_val = np.array(f["/y_val"], dtype=np.int32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as (shown here for the scale 0.5 file):

    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', 'r') as f:
      x_test = np.array(f["/x_test"], dtype=np.float32)
      y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
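Since the intensities are in the [0, 255] range, a typical preprocessing step after loading (an assumption on the reader's part, not something prescribed by the dataset) is to scale them to [0, 1]:

```python
import numpy as np

# Stand-in for loaded image data in the [0, 255] range.
x = np.array([[0.0, 127.5, 255.0]], dtype=np.float32)

# Map [0, 255] -> [0, 1] before feeding the data to a network.
x_normalised = x / 255.0
print(x_normalised)  # -> [[0.  0.5 1. ]]
```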
