License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Advances in neuroimaging, genomics, motion tracking, eye tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies according to which validation method was used.
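The leakage mechanism described above is easy to reproduce. A minimal sketch (assuming scikit-learn; data and variable names are illustrative, not from the study) contrasting K-fold CV with feature selection on the pooled data against nested CV with selection confined to each training fold:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))  # small-n, high-dimensional pure noise
y = rng.integers(0, 2, size=40)  # random labels: true accuracy is 50%

# Biased: selecting features on the pooled data leaks label information into every fold
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
print("leaky K-fold CV:", cross_val_score(SVC(), X_leaky, y, cv=5).mean())

# Unbiased: selection and tuning happen inside each training fold (nested CV)
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
print("nested CV:", cross_val_score(inner, X, y, cv=5).mean())

On pure noise, the leaky estimate lands well above chance while the nested estimate stays near 50%.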
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
val_half: contains 1/4 of the ids; 50% of each id's pictures are in this validation set and 50% are in the training set
val_all: contains 1/4 of the ids, whose pictures are entirely excluded from the training set
train: training set
test: test set
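A minimal sketch of constructing such id-based splits (assuming a pandas DataFrame df with one row per picture and an id column; all names are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = df["id"].unique()
rng.shuffle(ids)
q = len(ids) // 4
half_ids, all_ids = ids[:q], ids[q:2 * q]  # two disjoint quarters of the ids

val_all = df[df["id"].isin(all_ids)]  # these ids never appear in training
half = df[df["id"].isin(half_ids)]
val_half = half.groupby("id").sample(frac=0.5, random_state=0)  # 50% of each id's pictures
train = df.drop(val_all.index.union(val_half.index))  # everything else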
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids for each training set are available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching, Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
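As an illustration of a simple learning-based matcher on such binary pairs, a sketch assuming the pairs are loaded into DataFrames train and test with title_left, title_right and a binary label column (the actual WDC schema may differ):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

vec = TfidfVectorizer().fit(pd.concat([train["title_left"], train["title_right"]]))

def sim(df):
    a = vec.transform(df["title_left"])
    b = vec.transform(df["title_right"])
    # TF-IDF vectors are L2-normalised, so the row-wise dot product is cosine similarity
    return np.asarray(a.multiply(b).sum(axis=1))

clf = LogisticRegression().fit(sim(train), train["label"])
print("test F1:", f1_score(test["label"], clf.predict(sim(test))))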
Five models are trained using various input masking probabilities (IMP). Each resulting model is validated on the heavily masked validation dataset of 13,596 samples (5,668 positive) to evaluate its performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
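Training-time input masking of this kind is straightforward to implement; a generic sketch (PyTorch, illustrative only and not the authors' code) of masking features independently with probability p:

import torch

def mask_inputs(x: torch.Tensor, p: float) -> torch.Tensor:
    # zero each input feature independently with probability p
    keep = (torch.rand_like(x) >= p).float()
    return x * keep

# during training: model(mask_inputs(batch, imp)); at validation time the
# already-masked validation set is fed through unchanged and AUC is computed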
File description:
1. train.Mutation_Meth_CNV_data.xls: The feature matrix file used in training the model; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
2. train.sample_label.xls: Pathological information for the training set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
3. validation.Mutation_Meth_CNV_data.xls: The feature matrix file used in the validation set; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
4. validation.sample_label.xls: Pathological information for the validation set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
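A sketch of loading these files (assuming they are genuine Excel .xls files, for which pandas also needs the xlrd engine; adjust if they turn out to be tab-separated text):

import pandas as pd

# the first column is the sample name in both files, per the description above
X = pd.read_excel("train.Mutation_Meth_CNV_data.xls", index_col=0)
y = pd.read_excel("train.sample_label.xls", index_col=0).squeeze()
y = y.loc[X.index]  # align labels (1 = prostate cancer, 0 = non-prostate cancer)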
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes the date and time, latitude ("lat"), longitude ("lon"), sun angle ("sun_angle", in degrees [°]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover ("cloud_cover", proportion), and liquid precipitation ("liquid_precip", kg m⁻² s⁻¹) for each record used to train and/or validate the models.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.
Funding: These data were collected as part of research funded by:
NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A)
This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here.
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx: This file contains dataset metadata and one data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses.
Number of fields: 69
Number of data rows: 270287
Fields:
filename: Root ID (Field type: id)
camera_trap_site: Site ID for the camera trap location (Field type: location)
taxon: Taxon recorded by camera trap (Field type: taxa)
dist_level: Level of disturbance at site (Field type: ordered categorical)
baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance
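A sketch of pulling one analysis split out of this metadata table (assuming the worksheet name Dataset_images from above; the exact values used in the split columns should be checked in the file, since only "training, validation (val) or test" is stated):

import pandas as pd

meta = pd.read_excel("CT_image_data_info2.xlsx", sheet_name="Dataset_images")
print(meta["baseline"].value_counts(dropna=False))  # inspect the split labels actually used
baseline_train = meta.loc[meta["baseline"] == "train", "filename"]  # label value assumed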
Background: About 15% of lung cancers in men and 53% in women worldwide are not attributable to smoking. The aim was to develop and validate a simple, non-invasive model to assess and stratify lung cancer risk in non-smokers in China.
Methods: A large, population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Data from the lung cancer screening in Henan province, China, from October 2013 to October 2019 were used and randomly divided into training and validation sets. Related risk factors were identified through multivariable Cox regression analysis, followed by establishment of a risk prediction nomogram. Discrimination [area under the curve (AUC)] and calibration were assessed in the training set and then confirmed in the validation set.
Results: A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (n = 107,382) and validation (n = 107,382) sets. Older age, male sex, a low education level, family history of lung cancer, history of tuberculosis, and absence of a history of hyperlipidemia were independent risk factors for lung cancer. Using these six variables, we plotted 1-year, 3-year, and 5-year lung cancer risk prediction nomograms. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk.
Conclusions: We developed and validated a simple and non-invasive lung cancer risk model in non-smokers. This model can be applied to identify and triage non-smokers at high risk of developing lung cancer.
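As an illustration of the modelling approach described (Cox regression followed by a discrimination check), a sketch using the lifelines library with hypothetical column names; this is not the authors' code, and the AUC computed here is a crude check that ignores censoring:

from lifelines import CoxPHFitter
from sklearn.metrics import roc_auc_score

# train/valid: DataFrames holding the six predictors plus follow-up time (years)
# and an event indicator column (names hypothetical)
cph = CoxPHFitter().fit(train, duration_col="time", event_col="lung_cancer")
risk = cph.predict_partial_hazard(valid)

# crude 3-year discrimination check (ignores subjects censored before 3 years)
y3 = (valid["lung_cancer"] == 1) & (valid["time"] <= 3)
print("3-year AUC:", roc_auc_score(y3, risk))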
EARS-Reverb_v2 Dataset Card
Overview
EARS-Reverb_v2 is a large-scale dataset designed for speech enhancement and dereverberation research. It contains reverberant speech data generated as the output of the code from the ears_benchmark repository. The dataset is intended for training and validation purposes and does not include a test set.
Dataset Structure
validation/: Contains the validation data.
validation.csv: Metadata for the validation set.
There is no test split. See the full description on the dataset page: https://huggingface.co/datasets/Amayas/ears-reverb-dataset-validation.
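A sketch of loading the dataset with the Hugging Face datasets library (split names assumed from the structure above):

from datasets import load_dataset

ds = load_dataset("Amayas/ears-reverb-dataset-validation")
print(ds)  # expect train and validation splits only; there is no test split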
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a validation dataset for Google Landmark Recognition 2021 (GLRec2021). It may also be usable as validation data for Google Landmark Retrieval 2021.
This dataset is imported from the Google Landmarks Dataset v2 (GLDv2). The images are the test images of GLDv2, and the label file is a simplified version of recognition_solution_v2.1.csv. To make this dataset usable in GLRec2021, the label file is formatted in the same manner as train.csv of GLRec2021, but the labels of non-landmark images are marked as -1. In addition, records not related to any landmark in train.csv are removed.
The details of the imported dataset (GLDv2) are described in the following paper:
"Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval"
T. Weyand*, A. Araujo*, B. Cao, J. Sim
Proc. CVPR'20
The license complies with that of GLDv2; check the GLDv2 repository for details.
This dataset also contains the model files trained on the GLRec2021 training dataset. The model has a ResNet-34 backbone CNN and a head module for extracting image features. The model is included for use in the code of GLRec2021, but the model file can also be loaded directly as follows.
import torch
model = torch.jit.load(path_to_the_model_file)  # loads the TorchScript module
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains food and non-food images.
It is divided into 3 sets: train, validation and evaluation.
Each set contains 2 categories, food and non_food, each with 500 images.
The dataset was taken from the official source; the only difference is that I divided the images by category within each set (train, validation and evaluation) to make the model training process more convenient.
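With this split/category folder layout, the sets can be loaded directly; a sketch using torchvision (the root path food5k is hypothetical):

from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("food5k/train", transform=tfm)  # classes: food, non_food
val_ds = datasets.ImageFolder("food5k/validation", transform=tfm)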
The field data and WorldView imagery were leveraged to generate an extensive set of segments labeled with land cover class. These segment interpretations provided the training and validation data for the mapping. Analysts reviewed each aerial and ground plot from the 2019 field survey, examining the plot center and training polygon over the WorldView mosaic, and reviewing field photos, cover estimates, and notes. For each plot, one image segment was identified as the primary example of the vegetation type of the plot (unless there was no suitable example segment, as in cases when a ground plot was targeting a small but distinct vegetation patch that was not captured in the image segmentation). Usually, the primary segment included or was close to the nominal plot center, but this was not always the case, since the target area for the aerial plots could encompass several segments. After identifying a primary segment, the analyst also identified a set of 0-15 secondary segments that were good examples of the same vegetation type. This assessment was informed by field experience, review of field photos of the landscape setting, and photo-interpretation of the WorldView mosaic. An additional set of auxiliary segments was identified and assigned to a land cover class. The first set of auxiliary segments was assigned to non-vegetated classes such as lakes, ponds, ocean, barrens, and snowfields or aufeis. While a limited effort was expended to sample such classes during field work, we knew that these would be readily identifiable with high confidence from the WorldView imagery and so focused the field sampling on vegetated classes. Later, after reviewing preliminary models and receiving feedback from Janet Jorgenson (retired plant ecologist for the Arctic Refuge), we added additional auxiliary segments for vegetated classes based on expert photo interpretation. These were designed to provide the model with additional training data to define the breakpoints between similar classes.
Land cover classes were assigned to all of the primary, secondary, and auxiliary segments. 20% of the segments were randomly selected to be withheld from model training. The final model was validated using the reserved validation segment interpretation points (20% of the full set); these segments were not used to develop the model. The map class was extracted from the final land cover map for each validation point. A confusion matrix, overall accuracy metrics, and per-class performance metrics were calculated from the validation data.
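The validation computation described at the end is standard; a sketch with scikit-learn, where y_true holds the interpreted class and y_pred the mapped class at each withheld point (names hypothetical):

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(confusion_matrix(y_true, y_pred))
print("overall accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1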
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of model training, validation, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. Missing values and outliers were handled with appropriate techniques (e.g., imputation or removal).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
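A sketch of loading the three files named above (the label column name is an assumption):

import pandas as pd

train = pd.read_csv("train_data.csv")
valid = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")
X_train, y_train = train.drop(columns=["label"]), train["label"]  # "label" assumed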
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets, used for training a convolutional neural network for small reservoir mapping.
- manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights (see the loading sketch after this list).
- annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.
- images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:
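A sketch of inspecting the manet_sentinel.ckpt checkpoint (assuming a PyTorch Lightning-style file, which keeps the weights under a state_dict key; adjust if the layout differs):

import torch

# newer PyTorch versions may additionally need weights_only=False here
ckpt = torch.load("manet_sentinel.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # fall back to a bare weights dict
print(sorted(state)[:5])  # peek at the first few parameter names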
A diverse data set of 1,667 chemicals with experimental AR activity was provided by the U.S. EPA from the Toxicity Forecaster (ToxCast) program, which generates data using in vitro high-throughput screening (HTS) assays measuring the activity of chemicals at multiple points along the androgen receptor (AR) activity pathway. The Endocrine Disruptor Knowledgebase (EDKB) androgen receptor (AR) binding data set (Fang et al., 2003) was downloaded from the FDA website; it was produced expressly as a training set for developing predictive models and is based on a validated assay using recombinant AR. The dataset contains 146 AR binders and 56 non-AR binders. These training set chemicals were selected for both chemical structure diversity and range of activity, both of which are essential to developing robust QSAR and other models (Perkins, 2003). This dataset is associated with the following publication: Manganelli, S., A. Roncaglioni, K. Mansouri, R. Judson, E. Benfenati, A. Manganaro, and P. Ruiz. Development, validation and integration of in silico models to identify androgen active chemicals. CHEMOSPHERE. Elsevier Science Ltd, New York, NY, USA, 220: 204-215, (2019).
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until the files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
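The capped random selection per species/grid cell combination can be expressed compactly; a sketch (pandas; column names hypothetical, and the 80/10/10 fractions are illustrative since the actual split proportions are not stated):

import pandas as pd

# files: one row per recording, with species and grid_cell columns
shuffled = files.sample(frac=1, random_state=0)  # random order
capped = shuffled.groupby(["species", "grid_cell"]).head(1250)  # cap per combination

n = len(capped)
train = capped.iloc[: int(0.8 * n)]
valid = capped.iloc[int(0.8 * n) : int(0.9 * n)]
test = capped.iloc[int(0.9 * n) :]  # the released data excludes this holdout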
Background: Major adverse cardiovascular events (MACEs) are a significant cause of morbidity and mortality during the perioperative period of non-cardiac surgery. The prevention of perioperative MACEs has long been a research hotspot. Existing models have not been validated in Chinese populations and have become increasingly unable to meet current clinical needs.
Objectives: To establish and validate simple bedside tools for predicting MACEs during the perioperative period of non-cardiac surgery in Chinese hospitalized patients.
Design: We used a nested case-control study to establish our prediction models. A nomogram along with a risk score were developed using logistic regression analysis. An internal cohort was used to evaluate the discrimination and calibration of these predictive models, including the revised cardiac risk index (RCRI) score recommended by current guidelines.
Setting: Peking University Third Hospital between January 2010 and December 2020.
Patients: Two hundred and fifty-three patients with MACEs and 1,012 patients without were included in the training set from January 2010 to December 2019, while 38,897 patients were included in the validation set from January 2020 to December 2020, of whom 112 had MACEs.
Main Outcome Measures: The MACEs comprised the composite outcomes of cardiac death, non-fatal myocardial infarction, non-fatal congestive cardiac failure or hemodynamically significant ventricular arrhythmia, and Takotsubo cardiomyopathy.
Results: Seven predictors, comprising Hemoglobin, cardiac diseases, Aspartate aminotransferase (AST), high Blood pressure, Leukocyte count, general Anesthesia, and Diabetes mellitus (HASBLAD), were selected in the final model. The nomogram and HASBLAD score both achieved satisfactory prediction performance in the training set (C statistic, 0.781 vs. 0.768) and the validation set (C statistic, 0.865 vs. 0.843). Good calibration was observed for the probability of MACEs in both the training and validation sets. Both predictive models showed excellent discrimination and performed better than the RCRI in the validation set (C statistic, 0.660; P < 0.05 vs. nomogram and HASBLAD score).
Conclusion: The nomogram and HASBLAD score could be useful bedside tools for predicting perioperative MACEs in non-cardiac surgery in Chinese hospitalized patients.
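For a binary outcome such as MACE, the C statistic reported above equals the ROC AUC, so the discrimination check is short; a sketch with hypothetical feature matrices holding the seven HASBLAD predictors (not the authors' code):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
c_stat = roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])
print("C statistic:", c_stat)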
Rangeland ecosystems provide critical wildlife habitat (e.g., greater sage grouse, pronghorn, black-footed ferret), forage for livestock, carbon sequestration, provision of water resources, and recreational opportunities. At the same time, rangelands are vulnerable to climate change, fire, and anthropogenic disturbances. The arid-semiarid climate in most rangelands fluctuates widely, impacting livestock forage availability, wildlife habitat, and water resources. Many of these changes can be subtle or evolve over long time periods, responding to climatic, anthropogenic, and disturbance driving forces. To understand vegetation change, scientists from the USGS and Bureau of Land Management (BLM) developed the Rangeland Condition Monitoring Assessment and Projection (RCMAP) project. RCMAP provides robust, long-term, and floristically detailed maps of vegetation cover at yearly time-steps, a critical reference for advancing science in the BLM and assessing Landscape Health standards. RCMAP quantifies the percent cover of ten rangeland components (annual herbaceous, bare ground, herbaceous, litter, non-sagebrush shrub, perennial herbaceous, sagebrush, shrub, and tree cover, plus shrub height) at yearly time-steps across the western U.S. using field training data, Landsat imagery, and machine learning. We utilize an ecologically comprehensive series of field-trained, high-resolution predictions of component cover and BLM Analysis Inventory and Monitoring (AIM) data to train machine learning models predicting component cover over the Landsat time series. This dataset enables retrospective analysis of vegetation condition, of the impacts of weather variation and longer-term climatic change, and of the effectiveness of vegetation treatments and altered management practices. RCMAP data can be used to answer critical questions regarding the influence of climate change and the suitability of management practices. Component products can be downloaded at https://www.mrlc.gov/data.
Independent validation was our primary validation approach, consisting of field measurements of component cover at stratified-random locations. Independent validation point placement used a stratified random design, with two levels of stratification to simplify the logistics of field sampling (Rigge et al. 2020, Xian et al. 2015). The first level of stratification randomly selected 15 sites, each 8 km in diameter, across each mapping region. First-level sites excluded areas less than 30 km away from training sites and other validation sites. The second-level stratification randomly placed 6-10 points within each 8 km diameter validation site (total n = 2,014 points at n = 229 sites). Only sites on public land, between 100 and 1,000 m from the nearest road, and in rangeland vegetation cover within each site were considered. The random points within a site were evenly allocated to three NDVI thresholds from a leaf-on Landsat image (low, medium, and high). Sites with relatively high spatial variance within a 90 m by 90 m patch (3 × 3 Landsat pixels) were excluded to minimize plot-pixel locational error. Using NDVI as a stratum ensured plot locations were distributed across the range of validation site productivity. At each validation point, we measured component cover using the line-point intercept method along two 30 m transects. Data were collected from the first-hit perspective.