MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
X_train, y_train = train_set[0], train_set[1]
X_validation, y_validation = validation_set[0], validation_set[1]
X_test, y_test = test_set[0], test_set[1]

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_validation: ", X_validation.shape)
print("Shape of y_validation: ", y_validation.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
train_index = range(0,len(X_train))
validation_index = range(len(X_train), len(X_train)+len(X_validation))
test_index = range(len(X_train)+len(X_validation), len(X_train)+len(X_validation)+len(X_test))
X_train = pd.DataFrame(data=X_train, index=train_index)
y_train = pd.Series(data=y_train, index=train_index)
X_validation = pd.DataFrame(data=X_validation, index=validation_index)
y_validation = pd.Series(data=y_validation, index=validation_index)
X_test = pd.DataFrame(data=X_test, index=test_index)
y_test = pd.Series(data=y_test, index=test_index)
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
N-FHS = Number of records from Framingham, N-GEN = Number of records from GENEVA. G-BLUP uses 400 K SNPs; wG-BLUP also uses 400 K SNPs, but the contribution of each SNP to the genomic relationship matrix was weighted as a function of the SNP-associated p-value reported by [5].
Overall distribution of training, validation, and test data.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Note: Unzip files to have access to the original data files (.h5, .csv, .xml)
This dataset contains the train and test data used for the first edition of the Atmospheric Machine Learning Emulation Challenge (AMLEC-1), presented at ECMLPKDD 2025 (https://ecmlpkdd.org/2025/discovery-challenges/) and carried out within the EU ELIAS project (https://elias-ai.eu/opportunities/amlec/).
The dataset contains a series of .h5 files storing MODTRAN6 spectral simulations (transmittances, spherical albedo, path radiance), computed for various atmospheric and geometric conditions and two scenarios -- atmospheric correction of hyperspectral data (A) and CO2 retrieval (B) -- each associated with its own spectral configuration.
The training data (i.e., inputs and outputs of RTM simulations) is stored in HDF5 format with the following structure:
Dimensions

| Name | Description |
|---|---|
| n_wvl | Number of wavelengths for which spectral data is provided |
| n_funcs | Number of atmospheric transfer functions |
| n_comb | Number of data points at which spectral data is provided |
| n_param | Dimensionality of the input variable space |
Data Components

| Name | Description | Dimensions | Datatype |
|---|---|---|---|
| LUTdata | Atmospheric transfer functions (i.e., outputs) | n_funcs*n_wvl x n_comb | single |
| LUTHeader | Matrix of input variable values for each combination (i.e., inputs) | n_param x n_comb | double |
| wvl | Wavelength values associated with the atmospheric transfer functions (i.e., spectral grid) | n_wvl | double |
Note: Participants may choose to predict the spectral data either as a single vector of length n_funcs*n_wvl or as n_funcs separate vectors of length n_wvl.
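Concretely, the two target layouts differ only by a reshape of LUTdata. A minimal NumPy sketch (the sizes below are illustrative placeholders, not the real challenge dimensions):

```python
import numpy as np

# Illustrative sizes only -- not the actual challenge dimensions
n_funcs, n_wvl, n_comb = 4, 10, 3

# Stacked layout: one vector of length n_funcs*n_wvl per data point
Ytrain = np.arange(n_funcs * n_wvl * n_comb, dtype=float).reshape(n_funcs * n_wvl, n_comb)

# Split layout: n_funcs separate vectors of length n_wvl per data point
per_func = Ytrain.reshape(n_funcs, n_wvl, n_comb)

# Function f's spectrum for data point j is then per_func[f, :, j]
```

This assumes the stacked vector concatenates the n_funcs functions as contiguous blocks of n_wvl values, as the n_funcs*n_wvl x n_comb layout implies.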
Testing input datasets (i.e., inputs for prediction) are stored in tabulated .csv format with dimensions n_param x n_comb. During the challenge, participants only had access to this .csv data; here we also provide the reference spectral simulations used for evaluation.
The training and testing datasets are organized into scenario-specific folders: scenarioA (Atmospheric Correction) and scenarioB (CO2 Column Retrieval). Each folder contains:
train, with multiple .h5 files corresponding to different training sample sizes (e.g., train2000.h5 contains 2000 samples), and a reference subfolder containing two test files (refInterp and refExtrap) referring to the two aforementioned tracks (i.e., interpolation and extrapolation).
Here is an example of how to load each dataset in Python:
import h5py
import pandas as pd
import numpy as np
# Replace with the actual path to your training and testing data
trainFile = 'train2000.h5'
testFile = 'refInterp.csv'
# Open the H5 file
with h5py.File(trainFile, 'r') as h5_file:
    Ytrain = h5_file['LUTdata'][:]
    Xtrain = h5_file['LUTHeader'][:]
    wvl = h5_file['wvl'][:]
# Read testing data
df = pd.read_csv(testFile)
Xtest = df.to_numpy()
in Matlab:
% Replace with the actual path to your training and testing data
trainFile = 'train2000.h5';
testFile = 'refInterp.csv';
% Open the H5 file
Ytrain = h5read(trainFile,'/LUTdata');
Xtrain = h5read(trainFile,'/LUTHeader');
wvl = h5read(trainFile,'/wvl');
% Read testing data
Xtest = importdata(testFile);
and in R language:
library(rhdf5)
# Replace with the actual path to your training and testing data
trainFile <- "train2000.h5"
testFile <- "refInterp.csv"
# Open the H5 file
lut_data <- h5read(trainFile, "LUTdata")
lut_header <- h5read(trainFile, "LUTHeader")
wavelengths <- h5read(trainFile, "wvl")
# Read testing data
Xtest <- as.matrix(read.table(testFile, sep = ",", header = TRUE))
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Contains recordings and manual annotations of calls from pairs of male and female marmosets. Manual annotations were created by the original authors and manually corrected for training and testing DAS. Original data source for the recordings and the annotations: https://osf.io/q4bm3/ Original reference: Landman R, Sharma J, Hyman JB, Fanucci-Kiss A, Meisner O, Parmar S, Feng G, Desimone R. 2020. Close-range vocal interaction in the common marmoset (Callithrix jacchus). PLOS ONE 15:e0227392. doi:10.1371/journal.pone.0227392
R-squared in longitudinal performance evaluation of the models with 20% of Pers-007 as test data for different training datasets. Subjects in the test set were excluded from each training set.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The data used in this paper is from the 16th data release of SDSS. SDSS-DR16 contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data were downloaded from the official SDSS website, specifically through the SkyServer API using SQL queries in the CasJobs sub-site. Because the current SDSS photometric table PhotoObj can only classify observed sources as point sources or extended sources, spectra are needed to classify target sources as galaxies, stars, and quasars. We therefore obtained calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj star list, along with target positions (right ascension and declination); calibrated sources allow the three classes to be distinguished precisely and quickly. Each calibrated source is labeled with the parameter "Class" as "galaxy", "star", or "quasar". Observation areas 3462, 3478, 3530, and four other areas in SDSS-DR16 were selected as experimental data because they contain a large number of sources, providing rich sample data for the experiment. For example, area 3462 contains 9,891 sources (2,790 galaxies, 2,378 stars, and 4,723 quasars), and area 3478 contains 3,862 sources (1,759 galaxies, 577 stars, and 1,526 quasars). FITS files are a commonly used data format in the astronomical community. By cross-matching the star list with FITS files in the local celestial region, we obtained images in the five bands u, g, r, i, and z for 12,499 galaxy sources, 16,914 quasar sources, and 16,908 star sources as training and testing data.

1.1 Image Synthesis
SDSS photometric data comprise images in the five bands u, g, r, i, and z, packaged as single-band FITS files, and images in different bands contain different information. Since the g, r, and i bands contain more feature information and less noise, astronomical researchers typically map them to the R, G, and B channels to synthesize color photometric images. Images from different bands generally cannot be synthesized directly, because they may not be aligned. Therefore, this paper adopts the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images in the g, r, and i bands, which effectively avoids the alignment problem. Each photometric image is 2048×1489 pixels.

1.2 Data tailoring
We first cropped the target images, implementing the process in Python with image segmentation tools. During cropping, we convert the right ascension and declination of each source in the star list into pixel coordinates on the photometric image through a coordinate conversion formula, which determines the specific position of the source. Taking these coordinates as the center point, we crop a rectangular box. We found that the input image size affects the experimental results, so, according to the target size of the sources, we tested three crop sizes: 40×40, 60×60, and 80×80. Through experiment and analysis, we found that the convolutional neural network learns better and achieves higher accuracy on data with small image sizes. In the end, we cropped the extended-source galaxies and the point-source quasars and stars to 40×40.

1.3 Division of training and test data
To give the algorithm accurate recognition performance, we need enough image samples, and the selection of the training, validation, and test sets is an important factor affecting the final recognition accuracy.
In this paper, the training, validation, and test sets are split in the ratio 8:1:1. The validation set is used to tune the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning. The total sample is 34,000 source images: 11,543 galaxy sources, 11,967 star sources, and 10,490 quasar sources.

1.4 Data preprocessing
After preprocessing, the training and test sets can be used as training and test inputs for the algorithm; data quantity and quality largely determine its recognition performance. Preprocessing differs between the training and test sets. For the training set, we apply vertical flips, horizontal flips, and scaling to the cropped images to enrich the data samples and enhance the generalization ability of the algorithm; since features of celestial sources are flip-invariant, the labels of galaxies, stars, and quasars do not change after these transformations. For the test set, preprocessing is simpler: we apply only scaling to the input images before testing.
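As a rough illustration of the cropping and flip augmentation described above, a minimal NumPy sketch (the function names and fixed crop center are ours for illustration, not from the paper's code):

```python
import numpy as np

def crop_centered(image, x, y, size=40):
    """Crop a size x size box centered on pixel coordinates (x, y)."""
    half = size // 2
    return image[y - half:y + half, x - half:x + half]

def augment_flips(image):
    """Return the image plus its vertical and horizontal flips.
    Labels are unchanged because source morphology is flip-invariant."""
    return [image, np.flipud(image), np.fliplr(image)]

# Example: crop a 40x40 patch from a synthetic 100x100 image, then augment it
img = np.arange(100 * 100).reshape(100, 100)
patch = crop_centered(img, 50, 50, size=40)
augmented = augment_flips(patch)
```

In the real pipeline the (x, y) center would come from converting each source's right ascension and declination to pixel coordinates.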
An extreme gradient boosting (XGB) machine learning model was developed to predict the distribution of nitrate in shallow groundwater across the conterminous United States (CONUS). Nitrate was predicted at a 1-square-kilometer resolution at a depth of 10 m below the water table. The model builds on a previous XGB machine learning model developed to predict nitrate at domestic and public supply groundwater zones (Ransom and others, 2022) by incorporating additional monitoring well samples and modifying and adding predictor variables. The shallow zone model included variables representing well characteristics, hydrologic conditions, soil type, geology, climate, oxidation/reduction, and nitrogen inputs. Predictor variables derived from empirical or numerical process-based models were also included to integrate information on controlling processes and conditions. This data release documents the model and provides the model results. Included in this data release are: 1) a model archive of the R project: source code, input files (including model training and testing data, rasters of all final predictor variables, and an output raster representing predicted nitrate concentration in the shallow zone), 2) a read_me.txt file describing the model archive, its use, and the modeling details, and 3) a table describing the model variables.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in Rstudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario, we selected the Generalized Linear Model (GLM) and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of its adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization for the codes immersed in the GLM and DOMAIN running:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points with regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of predictor variables (e.g., bioclimatic, topographic, demographic, habitat) as a function of the presence and background points (e.g., Hijmans and Elith, 2017).
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method is selected, with the response variable (presence) assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
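The 75/25 subdivision can be sketched as follows. This is an illustrative Python version of the step only; the archive's actual implementation is in the R codes listed above, and the function name and seed are ours:

```python
import numpy as np

def split_75_25(points, seed=42):
    """Randomly split an array of points into 75% training and 25% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(points))  # shuffle indices reproducibly
    cut = int(round(0.75 * len(points)))
    return points[idx[:cut]], points[idx[cut:]]

# Applied separately to the presence group and the background group
presences = np.arange(100)  # placeholder for real presence records
train, test = split_75_25(presences)
```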
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), obtaining the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots show partial dependence as a function of each predictor variable.
Subsequently, the correlation between variables is computed by Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
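The ±0.70 screen can be sketched as follows; this is an illustrative Python/pandas version (the archive's actual implementation, Code5_Pearson_Correlation.R, is in R, and the function name is ours):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.70):
    """Drop one variable from each pair whose |Pearson r| exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: y is perfectly correlated with x, z is weakly correlated
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 4, 6, 8, 10],
                   "z": [5, 1, 4, 2, 3]})
screened = drop_correlated(df)
```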
Once the above codes were run, we uploaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-significance value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree to obtain linear and quadratic response (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots included the probability of occurrence and values for continuous variables or categories for discrete variables. The points of the presence and background training group are also included.
On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix, including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we select an arbitrary boundary of 0.5 to obtain better modeling performance and avoid high percentage of bias in type I (omission) or II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
| Model | Validation set: True | Validation set: False |
|---|---|---|
| Presence | A | B |
| Background | C | D |
We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). TSS also gives equal importance to the prevalence of presence prediction and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
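Using the Table 1 cells (A true presences, B commission errors, C true backgrounds, D omission errors), both metrics reduce to simple arithmetic; a minimal Python sketch:

```python
def overall_accuracy(a, b, c, d):
    """Proportion of all validated cases that were predicted correctly."""
    return (a + c) / (a + b + c + d)

def tss(a, b, c, d):
    """True Skill Statistic: sensitivity + specificity - 1."""
    sensitivity = a / (a + d)  # true presences over all validated presences
    specificity = c / (c + b)  # true backgrounds over all validated backgrounds
    return sensitivity + specificity - 1
```

For example, with A = 40, B = 10, C = 35, D = 15, the overall accuracy is 0.75.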
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
Regarding the model evaluation and estimation, we selected the following estimators:
1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction of the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This dataset contains: 1. PlanetScope satellite imagery of the University of Brawijaya with 3 m spatial resolution. 2. Training and testing data in CSV format. 3. R scripts for four different algorithms (XGBoost, Random Forest, Support Vector Machine, and Neural Networks). The manuscript using this dataset has been submitted to F1000Research (https://f1000research.com/).
Training and testing data within the context of the Handful of Pixels course, in particular the Land-Use and Land-Cover mapping chapter. The data includes samples from 300 random locations across 10 land cover classes from the data described in Fritz et al. 2017, downloaded through the AppEEARS API using the {appeears} R package (Hufkens et al. 2023). An 80% split was executed on this larger dataset, with 240 locations retained for training while the remainder is used for testing. For the testing data the input data is shared but the labels are withheld (stored in a closed release of this archive, accessible on reasonable request). This data can be used for small demonstration machine learning exercises or competitions.
Data structure
The data contains all seven (7) bands of the MODIS MCD43A4 data product for the year 2012. Band names are indicated in full. In addition MODIS MOD11A2 daytime land surface temperature (LST) data is provided, where band names only contain the date (YYYY-MM-DD) of acquisition. Additional indices can be calculated from these band combinations if so desired.
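For example, NDVI can be derived from the red and near-infrared reflectance bands. A minimal Python/pandas sketch for illustration only (the archive's own tooling is R, and the column names below are assumptions based on MCD43A4's band 1 being red and band 2 near-infrared; check the actual names in the files):

```python
import pandas as pd

def add_ndvi(df, red="Nadir_Reflectance_Band1", nir="Nadir_Reflectance_Band2"):
    """Append an NDVI column computed from red and NIR reflectance columns.
    Column names are illustrative defaults, not guaranteed to match the data."""
    df = df.copy()
    df["ndvi"] = (df[nir] - df[red]) / (df[nir] + df[red])
    return df

# Toy example with one sample location
df = pd.DataFrame({"Nadir_Reflectance_Band1": [0.1],
                   "Nadir_Reflectance_Band2": [0.5]})
out = add_ndvi(df)
```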
Data is provided in compressed serialized R rds files, and can be read into R as follows:
df <- readRDS("training_data.rds")
References
Fritz, Steffen, Linda See, Christoph Perger, Ian McCallum, Christian Schill, Dmitry Schepaschenko, Martina Duerauer, et al. A Global Dataset of Crowdsourced Land Cover and Land Use Reference Data. Scientific Data 4, no. 1 (June 13, 2017): 170075. https://doi.org/10.1038/sdata.2017.75.
Koen Hufkens. (2023). bluegreen-labs/appeears: appeears: an interface to the NASA AppEEARS API (v1.0). Zenodo. https://doi.org/10.5281/zenodo.7958270
U.S. Government Works (https://www.usa.gov/government-works)
A random forest regression (RFR) model was developed to predict groundwater fluoride concentrations in four western United States principal aquifers: the California Coastal basin-fill aquifers, Central Valley aquifer system, Basin and Range basin-fill aquifers, and the Rio Grande aquifer system. The selected basin-fill aquifers are a vital resource for drinking-water supplies. The RFR model was developed with a dataset of over 12,000 wells sampled for fluoride between 2000 and 2018. This data release provides rasters of predicted fluoride concentrations at depths typical of domestic and public supply wells in the selected basin-fill aquifers and includes the final RFR model, which documents the prediction modeling process and verifies and reproduces the model fit metrics and mapped predictions in the accompanying publication. Included in this data release are 1) a model archive of the R project including source code, input files (model training and testing data and rasters of predictor ...
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
We used drones to capture images of mixed-species blackbird (Icteridae) flocks damaging sunflower (Helianthus annuus) in North Dakota. Images included several blackbirds (Icteridae) that breed in North Dakota and are considered agricultural pests, including red-winged blackbirds (RWBL) (Agelaius phoeniceus), common grackles (Quiscalus quiscula), yellow-headed blackbirds (Xanthocephalus xanthocephalus), brown-headed cowbirds (Molothrus ater), and European starlings (Sturnidae: Sturnus vulgaris). This study was implemented between September 2021 and October 2022 in multiple counties in North Dakota, USA, where blackbird damage to sunflowers is prevalent. We simultaneously hazed and captured video and photographs of flocks with the drone, so images consist of airborne flocks against sky, green vegetation, or tan vegetation backgrounds. Images were used to train and test two models: 1) a ResNet-18 convolutional neural network (CNN) model to detect flocks of varying size and distance from the camera, and 2) Faster Region-based Convolutional Neural Network (Faster-RCNN) models to detect individual blackbirds, classify individual blackbirds by species and, for RWBL, by sex and age class, and count blackbirds. The Faster-RCNN model required individual birds in the images to be manually annotated by trained biologists for model training. This data publication contains the data and R code used to analyze these data, as well as the 400 images used in the models to detect blackbird flocks, the 131 images used in the models to detect and classify individual blackbirds, and the 131 image annotation files. We designed this study to assess the efficacy of drone-based aerial imagery combined with deep learning algorithms to accurately detect mixed-species blackbird flocks, as well as detect, classify, and count individual birds on varying backgrounds. For more information about this study and these data, see Duttenhefner et al. (2025).
Project Documentation: Predicting S&P 500 Price

Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

Steps Taken:
1. Data Preparation and Exploration:
• Loaded the dataset and performed initial exploration.
• Checked for missing values and handled them if any.
• Explored the statistical summary and distributions of the variables.
• Conducted correlation analysis to identify potential features for prediction.
2. Data Visualization and Analysis:
• Plotted time series graphs to visualize the S&P 500 index and other variables over time.
• Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques.
• Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.
3. Feature Engineering and Selection:
• Selected relevant features based on correlation analysis and domain knowledge.
• Explored feature importance using tree-based models and selected informative features.
• Prepared the final feature set for model training.
4. Model Training and Evaluation:
• Split the dataset into training and testing sets.
• Selected a regression model (Linear Regression) for price prediction.
• Trained the model using the training set.
• Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.
5. Prediction and Interpretation:
• Obtained predictions for future S&P 500 prices using the trained model.
• Interpreted the predicted prices in the context of the current market conditions and the percentage change from the current price.

Limitations and Future Improvements:
• The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
• The model's accuracy and reliability are subject to the quality and representativeness of the training data.
• The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
• Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
Here we provide the datasets used for training and testing of the end-to-end supervised deep learning models, as well as the datasets with vector representations of compounds and proteins passed to supervised state-of-the-art machine learning models (XGBoost, RF, SVM). We also provide the full list of viral proteins with their sequences used for the protein autoencoder, along with the list of SMILES representations of compounds used for the compound autoencoder.
This dataset was created by Subhasri R A
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Description
Rippling caustics seem to be the main factor degrading underwater RGB image quality and affecting image-based 3D reconstruction in very shallow waters. They adversely affect image-matching algorithms, throwing most of them off, leading to less accurate matches and causing issues in the Simultaneous Localization and Mapping (SLAM) based navigation of Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) in shallow waters. They are also the main cause of dissimilarities in the generated textures and orthoimages. To fill the gap in the literature regarding underwater rippling-caustics imagery with real ground truth and reference images, we present the first real-world underwater caustics benchmark dataset, containing 1465 underwater images. Together with the RGB imagery, the corresponding ground truth images are delivered to facilitate the training and testing of machine learning and deep learning methods for image classification. The R-CAUSTIC dataset also provides the data needed to evaluate, at least to some extent, the performance of 3D reconstruction approaches. Data were acquired using a tripod-mounted GoPro Hero 4 Black action camera with image dimensions of 4000 x 3000 pixels, a focal length of 2.77 mm, and a pixel size of 1.55 μm; action cameras are widely used for underwater image acquisition. The dataset was captured in near-shore underwater sites at depths varying from 0.5 to 2 m, with no artificial light sources. Due to the wind, the turbulent water surface created dynamic rippling caustics on the seabed. In total, 1465 RGB images were collected, separated into 7 different datasets: five containing stereo images, one tri-stereo images, and one multi-stereo imagery acquired from 7 different camera poses.
Publication
The paper is available in Open Access here: https://ieeexplore.ieee.org/document/10172291
If you use this dataset please cite it as R-CAUSTIC [Reference].
[Reference]: P. Agrafiotis, K. Karantzalos and A. Georgopoulos, "Seafloor-Invariant Caustics Removal From Underwater Imagery," in IEEE Journal of Oceanic Engineering, vol. 48, no. 4, pp. 1300-1321, Oct. 2023, doi: 10.1109/JOE.2023.3277168.
BibTeX:
@ARTICLE{10172291, author={Agrafiotis, Panagiotis and Karantzalos, Konstantinos and Georgopoulos, Andreas}, journal={IEEE Journal of Oceanic Engineering}, title={Seafloor-Invariant Caustics Removal From Underwater Imagery}, year={2023}, volume={48}, number={4}, pages={1300-1321}, doi={10.1109/JOE.2023.3277168}}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The known/novel training and testing dataset composition when testing RDP on all taxonomic levels.
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), in order to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] A. Perzanowski and T. Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. So that all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
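The mirror-extension step can be sketched as follows. This is only an illustration of the padding-and-clipping idea using NumPy; it does not reproduce the exact Matlab imresize() behaviour (anti-aliasing, bicubic interpolation) described above, for which the reader should consult [1].

```python
import numpy as np

def mirror_extend_to_64(img):
    """Pad a rescaled image to 64x64 by mirror extension (a sketch of the
    procedure described in the text; the Matlab imresize() rescaling itself
    is not reproduced here)."""
    h, w = img.shape[:2]
    pad_top = (64 - h) // 2
    pad_bottom = 64 - h - pad_top
    pad_left = (64 - w) // 2
    pad_right = 64 - w - pad_left
    padded = np.pad(img,
                    ((pad_top, pad_bottom), (pad_left, pad_right), (0, 0)),
                    mode="symmetric")
    # Clip to the valid intensity range, as done after bicubic interpolation
    return np.clip(padded, 0, 255)

# Toy 32x32 RGB image standing in for a rescaled CIFAR-10 sample
img = np.random.randint(0, 256, size=(32, 32, 3))
out = mirror_extend_to_64(img)
print(out.shape)  # (64, 64, 3)
```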
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
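Assuming the 50 000 original CIFAR-10 training images were available as one array in their original order, the train/validation split described above would amount to simple slicing. This is a sketch with a hypothetical stand-in array; the released .h5 files already contain the finished splits.

```python
import numpy as np

# Hypothetical stand-in for the 50 000 original CIFAR-10 training samples
all_train = np.arange(50_000)

train_part = all_train[:40_000]   # initial 40 000 samples -> training set
val_part = all_train[40_000:]     # final 10 000 samples -> validation set

print(len(train_part), len(val_part))  # 40000 10000
```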
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k an integer in the range [-4, 4]:
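The nine scale factors 2^(k/4) map directly onto the "scte..." tokens in the file names listed below; a small sketch (the scale_token helper is hypothetical, written here only to show the correspondence):

```python
def scale_token(factor):
    # Format a scale factor the way the file names encode it:
    # three decimals with '.' replaced by 'p', e.g. 0.5 -> "0p500"
    return f"{factor:.3f}".replace(".", "p")

# k = -4 .. 4 gives scale factors 2**(k/4) from 0.5 up to 2.0
factors = [2 ** (k / 4) for k in range(-4, 5)]
tokens = [scale_token(f) for f in factors]
print(tokens)
```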
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as (shown here for the scale factor 1 file):
import h5py
import numpy as np

with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', 'r') as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as (shown here for the scale factor 1 file):
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', '/x_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
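Since the intensities are stored unnormalised in the [0, 255] range, a typical preprocessing step before training (hypothetical here, not prescribed by the dataset) is to rescale them to [0, 1]:

```python
import numpy as np

# Toy array standing in for loaded image data in the [0, 255] range
x = np.array([0.0, 63.75, 127.5, 255.0], dtype=np.float32)

# Rescale unnormalised intensities to [0, 1]
x_scaled = x / 255.0
print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
```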
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
X_train, y_train = train_set[0], train_set[1]
X_validation, y_validation = validation_set[0], validation_set[1]
X_test, y_test = test_set[0], test_set[1]

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_validation: ", X_validation.shape)
print("Shape of y_validation: ", y_validation.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
train_index = range(0,len(X_train))
validation_index = range(len(X_train), len(X_train)+len(X_validation))
test_index = range(len(X_train)+len(X_validation), len(X_train)+len(X_validation)+len(X_test))
X_train = pd.DataFrame(data=X_train, index=train_index)
y_train = pd.Series(data=y_train, index=train_index)

X_validation = pd.DataFrame(data=X_validation, index=validation_index)
y_validation = pd.Series(data=y_validation, index=validation_index)

X_test = pd.DataFrame(data=X_test, index=test_index)
y_test = pd.Series(data=y_test, index=test_index)
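Because the three index ranges above are non-overlapping and contiguous, the splits can later be concatenated without index collisions. A small sanity check, using hypothetical toy arrays in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real splits (shapes are illustrative only)
X_train = np.zeros((5, 3))
X_validation = np.ones((2, 3))
X_test = np.full((3, 3), 2.0)

train_index = range(0, len(X_train))
validation_index = range(len(X_train), len(X_train) + len(X_validation))
test_index = range(len(X_train) + len(X_validation),
                   len(X_train) + len(X_validation) + len(X_test))

X_all = pd.concat([
    pd.DataFrame(X_train, index=train_index),
    pd.DataFrame(X_validation, index=validation_index),
    pd.DataFrame(X_test, index=test_index),
])
# The combined index is the contiguous range 0..9 with no duplicates
print(X_all.index.is_unique, len(X_all))
```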