Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Imports:
# All Imports
import os
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.calibration import LabelEncoder
import seaborn as sns
import matplotlib.image as mpimg
import cv2
import numpy as np
import pickle
# Tensflor and Keras Layer and Model and Optimize and Loss
import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import *
#Kernel Intilizer
from keras.optimizers import Adamax
# PreTrained Model
from keras.applications import *
#Early Stopping
from keras.callbacks import EarlyStopping
import warnings
Warnings Suppression | Configuration
# Warnings Remove
warnings.filterwarnings("ignore")
# Define the base path for the training folder
base_path = 'jaguar_cheetah/train'
# Weights file
weights_file = 'Model_train_weights.weights.h5'
# Path to the saved or to save the model:
model_file = 'Model-cheetah_jaguar_Treined.keras'
# Model history
history_path = 'training_history_cheetah_jaguar.pkl'
# Initialize lists to store file paths and labels
filepaths = []
labels = []
# Iterate over folders and files within the training directory
for folder in ['Cheetah', 'Jaguar']:
folder_path = os.path.join(base_path, folder)
for filename in os.listdir(folder_path):
file_path = os.path.join(folder_path, filename)
filepaths.append(file_path)
labels.append(folder)
# Create the TRAINING dataframe
file_path_series = pd.Series(filepaths , name= 'filepath')
Label_path_series = pd.Series(labels , name = 'label')
df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
# Define the base path for the test folder
directory = "jaguar_cheetah/test"
filepath =[]
label = []
folds = os.listdir(directory)
for fold in folds:
f_path = os.path.join(directory , fold)
imgs = os.listdir(f_path)
for img in imgs:
img_path = os.path.join(f_path , img)
filepath.append(img_path)
label.append(fold)
# Create the TEST dataframe
file_path_series = pd.Series(filepath , name= 'filepath')
Label_path_series = pd.Series(label , name = 'label')
df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
# Display the first rows of the dataframe for verification
#print(df_train)
# Folders with Training and Test files
data_dir = 'jaguar_cheetah/train'
test_dir = 'jaguar_cheetah/test'
# Image size 256x256
IMAGE_SIZE = (256,256)
Tain | Test
#print('Training Images:')
# Create the TRAIN dataframe
train_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.1,
subset='training',
seed=123,
image_size=IMAGE_SIZE,
batch_size=32)
#Testing Data
#print('Validation Images:')
validation_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.1,
subset='validation',
seed=123,
image_size=IMAGE_SIZE,
batch_size=32)
print('Testing Images:')
test_ds = tf.keras.utils.image_dataset_from_directory(
test_dir,
seed=123,
image_size=IMAGE_SIZE,
batch_size=32)
# Extract labels
train_labels = train_ds.class_names
test_labels = test_ds.class_names
validation_labels = validation_ds.class_names
# Encode labels
# Defining the class labels
class_labels = ['CHEETAH', 'JAGUAR']
# Instantiate (encoder) LabelEncoder
label_encoder = LabelEncoder()
# Fit the label encoder on the class labels
label_encoder.fit(class_labels)
# Transform the labels for the training dataset
train_labels_encoded = label_encoder.transform(train_labels)
# Transform the labels for the validation dataset
validation_labels_encoded = label_encoder.transform(validation_labels)
# Transform the labels for the testing dataset
test_labels_encoded = label_encoder.transform(test_labels)
# Normalize the pixel values
# Train files
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
# Validate files
validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
# Test files
test_ds = test_ds.map(lambda x, y: (x / 255.0, y))
#TRAINING VISUALIZATION
#Count the occurrences of each category in the column
count = df_train['label'].value_counts()
# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
# Plot a pie chart on the first subplot
palette = sns.color_palette("viridis")
sns.set_palette(palette)
axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
axs[0].set_title('Distribution of Training Categories')
# Plot a bar chart on the second subplot
sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
axs[1].set_title('Count of Training Categories')
# Adjust the layout
plt.tight_layout()
# Visualize
plt.show()
# TEST VISUALIZATION
count = df_test['label'].value_counts()
# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facec...
Facebook
TwitterExtracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious process involving manual human effort. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training, validation, and evaluation data from the map feature extraction challenge are provided here, as well as competition details and a baseline solution. The data were derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories, computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides a unique and comprehensive corpus for natural language processing tasks, specifically text summarization tools for validating reward models from OpenAI. It contains columns that provide summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with additional information including choices made by workers when summarizing the text, batch information provided to differentiate different summaries created by workers, and dataset attribute splits. All of this data allows users to train state-of-the-art natural language processing systems with real-world data in order to create reliable concise summaries from long form text. This remarkable collection enables developers to explore the possibilities of cutting-edge summarization research while directly holding themselves accountable compared against human generated results
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.
To use this dataset for summarization tasks: - Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation). - Choose which summary you want from the choice column of either .csv file based on your preference for worker or batch type summarization. - Review entries in the selected summary's corresponding summaries columns for alternative options with similar content but different word choices/styles that you prefer over the original choice worker or batch entry..
- Look through split, worker, batch information for more information regarding each choice before selecting one to use as your desired summary according to its accuracy or clarity with regards to its content
- Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
- Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
- Analyzing the worker and batch information, in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: comparisons_validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------------------------| | info | Text to be summarized. (String) | | summaries | Summaries generated by workers. (String) | | choice | The chosen summary. (String) | | batch | Batch for which it was created. (Integer) | | split | Split of the dataset between training and validation sets. (String) | | extra | Additional information about the given source material available. (String) |
File: comparisons_train.csv | Column name | Description | |:--------------|:---------------------------------------------------------------------------| | info | Text to be summarized. (String) | | summaries | Summaries generated by workers. (String) | | choice | The chosen summary. (String) | | batch | Batch for which it was created. (Integer) | | split ...
Facebook
TwitterThese are CARLA Simulation Datasets of the project "Out-Of-Domain Data Detection using Uncertainty Quantification in End-to-End Driving Algorithms". The simulations are generated in CARLA Town 02 for different sun angles (in degrees). You will find image frames, command labels, and steering control values in the respective 'xxxx_files_data' folder. You will find videos of each simulation run in the 'xxxx_files_visualizations' folder.
The 8 simulation runs for Training Data, are with the Sun Angles : 90, 80, 70, 60, 50, 40, 30, 20
The 8 simulation runs for Training Data were seeded at 0000, 1000, 2000, 3000, 4000, 5000, 6000, 7000 respectively
The 4 simulation runs for Validation Data, are with the Sun Angles : 87, 67, 47, 23
The 4 simulation runs for Validation Data were seeded at 0000, 2000, 4000, 7000 respectively
The 29 simulation runs for Testing Data, are with the Sun Angles : 85, 75, 65, 55, 45, 35, 25, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 09, 08, 07, 06, 05, 04, 03, 02, 01, 00, -1, -10
The 29 simulation runs for Testing Data were all seeded at 5000 respectively
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The package contains files for two modules designed to improve the accuracy of the indoor positioning system, namely the following:
door detection
videos_test - videos used to demonstrate the application of door detector
videos_res - videos from videos_test directory with detected doors marked
parts detection
frames_train_val - images generated from videos used for training and validation of VGG16 neural network model
frames_test - images generated from videos used for testing of the trained model
videos_test - videos used to demonstrate the application of parts detector
videos_res - videos from videos_test directory with detected parts marked
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Training and validation data for each material studied in the manuscript.
Facebook
TwitterThis dataset consists of the synthetic electron backscatter diffraction (EBSD) maps generated for the paper, titled "Hybrid Algorithm for Filling in Missing Data in Electron Backscatter Diffraction Maps" by Emmanuel Atindama, Conor Miller-Lynch, Huston Wilhite, Cody Mattice, Günay Doğan, and Prashant Athavale. The EBSD maps were used to train, test, and validate a neural network algorithm to fill in missing data points in a given EBSD map.The dataset includes 8000 maps for training, 1000 maps for testing, 2000 maps for validation. The dataset also includes noise-added versions of the maps, namely, one more map per each clean map.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset describes the training, validation and test dataset used for the development of a hybrid ANN coagulation model.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes the date and time, latitude (“lat”), longitude (“lon”), sun angle (“sun_angle”, in degrees [o]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover (“cloud_cover”, proportion), and liquid precipitation (“liquid_precip”, kg m-2 s-1) for each record used to train and/or validate the models.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Facebook
TwitterTrain a machine learning model based on the runtime data provided to you in the training dataset and further predict the runtime of graphs and configurations in the test dataset.
https://www.kaggle.com/code/rishabh15virgo/first-impression-understand-data-eda-baseline-15
https://www.kaggle.com/datasets/rishabh15virgo/google-fast-or-slowtile-xla-train-data-csv-format
https://www.kaggle.com/datasets/rishabh15virgo/google-fast-or-slowtile-xla-test-data-csv-format
Tile .npz files Suppose a .npz file stores a graph (representing a kernel) with n nodes and m edges. In addition, suppose we compile the graph with c different configurations, and run each on a TPU. Crucially, the configuration is at the graph-level. Then, the .npz file stores the following dictionary
Key "node_feat": contains float32 matrix with shape (n, 140). The uth row contains the feature vector for node u < n . Nodes are ordered topologically. Key "node_opcode" contains int32 vector with shape (n, ). The uth entry stores the op-code for node u. Key **"edge_index" **contains int32 matrix with shape (m, 2). If entry i is = u, v, then there is a directed edge from node u to node v, where u consumes the output of v. Key "config_feat" contains float32 matrix with shape (c, 24) with row j containing the (graph-level) configuration feature vector. Keys "config_runtime" and "config_runtime_normalizers": both are int64 vectors of length c. Entry j stores the runtime (in nanoseconds) of the given graph compiled with configuration j and a default configuration, respectively. Samples from the same graph may have slightly different "config_runtime_normalizers" because they are measured from different runs on multiple machines. Finally, for the tile collection, your job is to predict the indices of the best configurations (i.e., ones leading to the smallest d["config_runtime"] / d["config_runtime_normalizers"]).
Facebook
TwitterThis short, 8-minute technical training video is created by The Children's Bureau Data Analytics and Reporting Team and gives a brief demonstration which shows agencies how to validate their XML files against the XSD. For more information, or to access the XSD, please see AFCARS Technical Bulletin 21.
Audio Descriptive Version
Metadata-only record linking to the original dataset. Open original dataset below.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Post-marketing reports of suspected adverse drug reactions are important for establishing the safety profile of a medicinal product. However, a high influx of reports poses a challenge for regulatory authorities as a delay in identification of previously unknown adverse drug reactions can potentially be harmful to patients. In this study, we use natural language processing (NLP) to predict whether a report is of serious nature based solely on the free-text fields and adverse event terms in the report, potentially allowing reports mislabelled at time of reporting to be detected and prioritized for assessment. We consider four different NLP models at various levels of complexity, bootstrap their train-validation data split to eliminate random effects in the performance estimates and conduct prospective testing to avoid the risk of data leakage. Using a Swedish BERT based language model, continued language pre-training and final classification training, we achieve close to human-level performance in this task. Model architectures based on less complex technical foundation such as bag-of-words approaches and LSTM neural networks trained with random initiation of weights appear to perform less well, likely due to the lack of robustness that a base of general language training provides.
Facebook
TwitterThis dataset stores separate files of training and validation data for Riiid!
These files are made by following notebook. https://www.kaggle.com/its7171/cv-strategy
You can read these files like:
train1 = pd.read_pickle('../input/riiid-cross-validation-files/cv1_train.pickle')
valid1 = pd.read_pickle('../input/riiid-cross-validation-files/cv1_valid.pickle')
Usage example: https://www.kaggle.com/its7171/riiid-cross-validation-files
Facebook
TwitterAttribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Abstract: Ziel des Datensatzes ist das Training sowie die Validierung von Modellen zur Prognose von Zeitreihen für Fräsprozesse. Hierfür wurden an einer DMG CMX 600 V durch eine Siemens Industrial Edge Prozesse mit einer Abtastrate von 500 Hz aufgenommen. Es wurde ein Prozess für das Modelltraining und ein Prozess für die Validierung aufgenommen, welche sowohl für die Bearbeitung von Stahl sowie von Aluminium verwendet wurden. Es wurden mehrere Aufnahmen mit und ohne Werkstück (Aircut) aufgenommen, um möglichst viele Fälle abdecken zu können. Abstract: The aim of the data set is the training as well as the validation of models for the prediction of time series for milling processes. For this purpose, processes with a sampling rate of 500 Hz were recorded on a DMG CMX 600 V by a Siemens Industrial Edge. A process for model training and a process for validation were recorded, which were used for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. TechnicalRemarks: Documents: -Design of Experiments: Information on the paths such as the technological values of the experiments -Recording information: Information about the recordings with comments -Data: All recorded datasets. The first level contains the folders for training and validation both with and without the workpiece. In the next level, the individual test executions are located. The individual recordings are stored in the form of a JSON file. This consists of a header with all relevant information such as the signal sources. This is followed by the entries of the recorded time series. -NC-Code: NC programs executed on the machine -Workpiece: Pictures of the raw parts as well as the machined workpieces. The pictures show the unfinished part on the left, the training part in the middle and a part with two validation runs on the right. Experimental data: -Machine: DMG CMX 600 V -Material: S235JR, 2007 T4 -Tools: -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 5mm -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 10mm -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 20mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 10mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5mm -Workpiece blank dimensions: 150x75x50mm License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training and validation datasets for the second subtask of the shared task "Field of Research Classification" to be held at the Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024) workshop (https://nfdi4ds.github.io/nslp2024/). **The first two authors contributed equally to the development of this dataset.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.
Methods Creating a training and validation image set
We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered “research-grade if the observation is verifiable (includes image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset are heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and we randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images to approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).
One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there was one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).
For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants with clearly visible reproductive parts were clear and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of the images from the four-stage leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).
Training a two-stage and four-stage CNN
We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained using a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the Pytorch machine learning library (Psake et al. 2019) within Python. We chose the ResNet-18 neural network because it had fewer convolutional layers and thus was less computationally intensive than pre-trained neural networks with more layers. In early testing we reached desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 images for training (Deng et al. 2009). We utilized default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.
Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with fully-connected layers containing an output size of 2 for the 2-class problem and 4 for the 4-class problem. We resized and cropped the images to fit ResNet’s input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss parameter that quantifies the model’s inaccuracy. The slope of the loss in relation to model parameters is found and then the model parameters are updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so that the validation data remains ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically for the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
We evaluated model accuracy and created confusion matrices using the model’s predictions on the labeled validation data. This allowed us to evaluate the model’s accuracy and which specific categories are the most difficult for the model to distinguish. For using the model to make phenology predictions on the full, 40,761 image dataset, we created a custom dataloader function in Pytorch using the Custom Dataset function, which would allow for loading images listed in a csv and passing them through the model associated with unique image IDs.
Hardware information
Model training was conducted using a personal laptop (Ryzen 5 3500U cpu and 8 GB of memory) and a desktop computer (Ryzen 5 3600 cpu, NVIDIA RTX 3070 GPU and 16 GB of memory).
Comparing CNN accuracy to human annotation accuracy
We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761 image dataset. An expert annotator (R. Reeb, who has over a year’s experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of different levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using students
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.