Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the format required by the Raven hydrological modeling framework. The GR4J and HMETS model building results in the paper, i.e., reference KGE and KGE metrics in the calibration, validation and testing periods, are provided so that the split-sample assessment performed in the paper can be replicated.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information of each catchment, and 463 subfolders, each having four files for a catchment, including:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in day) from Jan 1st 1980 to Dec 31 2014 in the format required by the Raven hydrological modeling framework.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st 1980 to Dec 31 2014 in the format required by the Raven hydrological modeling framework.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics in calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics in calibration, validation and testing periods.
Data collection and processing methods
Data source
Forcing data processing
Streamflow data processing
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of the reference KGE and the KGE in the model calibration, validation, and testing periods, which were derived in the large split-sample test experiment performed in the paper.
More details of the split-sample test experiment and the analysis of the modeling results can be found in the paper by Shen et al. (2022).
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J. (2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
Original CAMELS dataset:
A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). Development of a large-sample watershed-scale hydrometeorological dataset for the contiguous USA: dataset characteristics and assessment of regional variability in hydrologic model performance. Hydrol. Earth Syst. Sci., 19, 209-223, http://doi.org/10.5194/hess-19-209-2015
Data Publication
This study:
H. Shen, B.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Data used for the publication "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". We investigate the impact of 13 Gaussian Process (GP) kernels, consisting of five single kernels and eight composite kernels, on the prediction accuracy and computational efficiency of the Low-fidelity, Spatial analysis, and Gaussian process learning (LSG) modelling approach. The GP kernels are compared for three distinct case studies, namely Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia). The high- and low-fidelity model simulation results are obtained from the data repository: Fraehr, N. (2024, January 19). Surrogate flood model comparison - Datasets and python code (Version 1). The University of Melbourne. https://doi.org/10.26188/24312658.v1.
Dataset structure
The dataset is structured in five folders:
(1) Carlisle
(2) Chowilla
(3) BurnettRV
(4) Comparison_results
(5) Python_data
The first three folders contain simulation data and analysis codes. The "Comparison_results" folder contains plotting codes, figures and tables for the comparison results. The "Python_data" folder contains the LSG model functions and the Python environment requirements.
Carlisle, Chowilla, and BurnettRV
These folders contain high- and low-fidelity hydrodynamic modelling data for training and validation for each individual case study, as well as case-specific Python scripts for training and running the LSG model with different GP kernels. There are only small differences between the folders, reflecting the hydrodynamic model simulation results and EOF analysis results of each case study.
Each case study folder has the following contents:
(1) Geometry_data: DEM files; .npz files containing the high-fidelity model grid (XYZ coordinates) and cell areas (the same data are available for the low-fidelity model used in the LSG model); .shp files indicating the location of boundaries and main flow paths.
(2) XXX_modeldata: one folder per XXX kernel, storing the trained LSG model data for that kernel. For example, EXP_modeldata stores the trained LSG model using the exponential GP kernel. ME3LIN means ME3 + LIN; ME3mLIN means ME3 x LIN. EXPLow, EXPMid, EXPHigh and EXPFULL mean that the inducing-point percentage for the sparse GP is 5%, 15%, 35% and 100%, respectively.
(3) HD_model_data: high-fidelity simulation results for all flood events of the case study, low-fidelity simulation results for all flood events, and all boundary input conditions.
(4) HF_EOF_analysis: data used in the EOF analysis for the LSG model.
(5) Results_data: results from evaluating the LSG models with the different GP kernel candidates.
(6) Train_test_split_data: the train-test-validation split is the same for all LSG models with different GP kernel candidates; the specific split for each cross-validation fold is stored in this folder.
(7) YYY_event_summary.csv, YYY_Extrap_event_summary.csv: overviews of all events and of which events are connected between the low- and high-fidelity models for each YYY case study.
(8) EOF_analysis_HFdata_preprocessing.py, EOF_analysis_HFdata.py: preprocessing before the EOF analysis and the EOF analysis of the high-fidelity data.
(9) Evaluation.py, Evaluation_extrap.py: scripts for evaluating the LSG model for that case study and saving the results for each cross-validation fold.
(10) train_test_split.py: script for splitting the flood datasets for each cross-validation fold, so that all LSG models with different GP kernel candidates train on the same data.
(11) XXX_training.py: scripts for training each LSG model using the XXX GP kernel (same kernel naming as above).
(12) XXX_training.bat: batch scripts for training all LSG models using the different GP kernel candidates.
Comparison_results
Files used for comparing the LSG models with different GP kernel candidates and for generating the figures in the paper "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". The figures are also included.
Python_data
Folder containing Python scripts with utility functions for setting up, training and running the LSG models, as well as for evaluating them.
Python environment
This folder also contains two Python environment files with all Python package versions and dependencies. You can install the CPU or the GPU version of the environment. The GPU environment can use a GPU to speed up GPflow training; it installs the CUDA and cuDNN packages. You can install the environment online or offline. Offline installation reduces dependency issues, but it requires the same Windows 10 operating system used by the authors.
Online installation
LSG_CPU_environment.yml: Python environment for running LSG models on the CPU.
LSG_GPU_environment.yml: Python environment for running LSG models on the GPU, mainly to speed up GPflow training; it requires the CUDA and cuDNN packages.
In the directory where the .yml file is located, run one of the following commands in the console:
conda env create -f LSG_CPU_environment.yml -n myenv_name
or
conda env create -f LSG_GPU_environment.yml -n myenv_name
Offline installation
If you also use Windows 10, you can directly unzip the environments packed with conda-pack.
LSG_CPU.tar.gz: zip file containing all packages in the virtual environment for CPU only.
LSG_GPU.tar.gz: zip file containing all packages in the virtual environment for GPU acceleration.
On Windows, create a new LSG_CPU or LSG_GPU folder in the Anaconda environments folder and extract the packaged LSG_CPU.tar.gz or LSG_GPU.tar.gz file into that folder:
tar -xzvf LSG_CPU.tar.gz -C ./LSG_CPU
or
tar -xzvf LSG_GPU.tar.gz -C ./LSG_GPU
Go to the environment path: cd ./LSG_GPU
Activate the environment: .\Scripts\activate.bat
Remove prefixes from the activation environment: .\Scripts\conda-unpack.exe
Exit the environment: .\Scripts\deactivate.bat
LSG_mods_and_func
Python scripts for using the LSG model. Evaluation_metrics.py contains the metrics used to evaluate the prediction accuracy and computational efficiency of the LSG models.
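For orientation only, the sketch below shows how composite kernels such as ME3 + LIN or ME3 x LIN and a sparse GP with a chosen inducing-point percentage could be assembled in GPflow 2.x (which the environment files above install); the toy data and variable names are illustrative assumptions and are not taken from the dataset's own training scripts.

```python
# Minimal sketch (not the dataset's actual training code): composite GP kernels
# and a sparse GP regression model with GPflow 2.x.
import numpy as np
import gpflow

# Toy training data standing in for the EOF coefficients used by an LSG model.
X = np.random.rand(200, 3)
Y = np.sin(6.0 * X[:, :1]) + 0.05 * np.random.randn(200, 1)

# Single kernels (e.g. exponential, Matern 3/2, linear).
me3 = gpflow.kernels.Matern32()
lin = gpflow.kernels.Linear()

# Composite kernels following the naming used in this dataset:
# ME3LIN = ME3 + LIN, ME3mLIN = ME3 x LIN.
me3_plus_lin = me3 + lin
me3_times_lin = gpflow.kernels.Matern32() * gpflow.kernels.Linear()

# Sparse GP: inducing points as a percentage of the training set
# (5% ~ "Low", 15% ~ "Mid", 35% ~ "High", 100% ~ "FULL" in the folder names).
pct = 0.15
n_inducing = max(1, int(pct * X.shape[0]))
Z = X[np.random.choice(X.shape[0], n_inducing, replace=False)].copy()

model = gpflow.models.SGPR(data=(X, Y), kernel=me3_plus_lin, inducing_variable=Z)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

mean, var = model.predict_y(X[:5])
print(mean.numpy().ravel())
```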
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the human genome reference sequence, version GRCh38.p13, in order to have a reliable source of data on which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough: the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instance processing, negative instance generation and data splitting by chromosomes.
First, we need an interface to download the raw data, which consists of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to retrieve large amounts of data easily. It also enables us to select a wide variety of relevant fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks forms our raw dataset. Once the sequences are available, we locate the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated, giving the final sequence of 1003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021) and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot obtain this kind of data in a straightforward manner, so we need to generate it synthetically. To obtain negative instances, i.e. sequences that do not represent a transcription start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once the specific position is selected, we take the 700 bases before and 300 bases after it, as we did with the positive instances.
Regarding the positive to negative ratio, in a similar problem studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found to be optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this split by chromosomes, as is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 for validation because it is a good example of a chromosome with average characteristics. We then selected samples from chromosomes 1, 3, 13, 19 and 21 for the test set and used the rest to train our models. Every step of this process can be replicated using the scripts available at https://github.com/JoseBarbero/EnsemblTSSPrediction.
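As a rough illustration of the windowing and negative-sampling scheme described above, the sketch below extracts a 1003-nucleotide window (700 bases upstream of the TSS codon, the 3-base codon, 300 bases downstream) and draws 10 random non-TSS positions per transcript as negatives. The sequence and coordinates are made up for the example; the real pipeline is in the linked GitHub repository.

```python
# Sketch of the windowing and negative-sampling scheme described above.
# Coordinates and sequences are illustrative placeholders.
import random

WINDOW_UP, CODON, WINDOW_DOWN = 700, 3, 300  # 700 + 3 + 300 = 1003 nt


def extract_window(chrom_seq, tss_pos):
    """Return the 1003-nt window around the TSS codon, or None near sequence edges."""
    start = tss_pos - WINDOW_UP
    end = tss_pos + CODON + WINDOW_DOWN
    if start < 0 or end > len(chrom_seq):
        return None
    return chrom_seq[start:end]


def negative_windows(chrom_seq, transcript_start, transcript_end, tss_pos,
                     n_neg=10, seed=0):
    """Sample n_neg windows at random non-TSS positions inside the transcript."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n_neg:
        pos = rng.randint(transcript_start, transcript_end - 1)
        if pos == tss_pos:
            continue
        window = extract_window(chrom_seq, pos)
        if window is not None:
            negatives.append(window)
    return negatives


# Toy usage with a fake chromosome sequence.
seq = "ACGT" * 2000
positive = extract_window(seq, tss_pos=4000)
negatives = negative_windows(seq, transcript_start=3000, transcript_end=6000, tss_pos=4000)
print(len(positive), len(negatives))  # 1003 10
```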
Using Machine Learning techniques in general, and Deep Learning techniques in particular, needs a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test models on. The dataset contains 1104 channel 3 images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types. One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or a soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type have been taken continuously over time, the degree of soiling evolves. Also, the dataset contains, as mentioned above, 27 pitting development sequences with 69 images each.
Instruction: dataset split
The authors of this dataset provide 3 types of dataset splits. To get a data split you have to run the python script split_dataset.py.
Script inputs: split-type (mandatory), output directory (mandatory).
Different split-types:
train_test_split: splits the dataset into train and test data (80%/20%)
wear_dev_split: splits the dataset into 27 wear developments
type_split: splits the dataset into the different BSD types
Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
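For readers who want to see what an 80%/20% image split along the lines of the train_test_split option looks like, here is a hedged sketch; the folder names ("data", "BSD_split_folder") follow the description above, but only the authors' split_dataset.py reproduces the exact split used for the baseline model.

```python
# Hedged sketch of an 80/20 train/test split over the JPEG images, in the
# spirit of split_dataset.py --split_type=train_test_split. Not the authors'
# script; it will not reproduce their exact baseline split.
import random
import shutil
from pathlib import Path


def split_80_20(data_dir="data", out_dir="BSD_split_folder", seed=42):
    images = sorted(Path(data_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    cut = int(0.8 * len(images))
    for subset, files in (("train", images[:cut]), ("test", images[cut:])):
        target = Path(out_dir) / subset
        target.mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, target / img.name)  # keep originals untouched


if __name__ == "__main__":
    split_80_20()
```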
I do a lot of work with image data sets. Often it is necessary to partition the images into male and female data sets. Doing this by hand can be a long and tedious task particularly on large data sets. So I decided to create a classifier that could do the task for me.
I used the CELEBA aligned data set to provide the images. I went through and separated the images visually into 1747 female and 1747 male training images. I also created 100 male and 100 female test images and 100 male and 100 female validation images. I wanted only the face to be in each image, so I developed an image cropping function using MTCNN to crop all the images. That function is included as one of the notebooks, should anyone have a need for a good face cropping function. I also created an image duplicate detector to try to eliminate any of the training images from appearing in the test or validation images. I have developed a general purpose image classification function that works very well for most image classification tasks. It contains the option to select 1 of 7 models for use. For this application I used the MobileNet model because it is less computationally expensive and gives excellent results. On the test set, accuracy is near 100%.
The CELEBA aligned data set was used. This data set is very large and of good quality. To crop the images to include only the face, I developed a face cropping function using MTCNN. MTCNN is a very accurate program and is reasonably fast; however, it is not flawless, so after cropping the images you should always visually inspect the results.
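A minimal sketch of such an MTCNN-based crop is shown below, assuming the `mtcnn` and `opencv-python` packages; it is not the notebook shipped with this data set, and, as noted above, cropped results should still be inspected visually.

```python
# Minimal MTCNN face-cropping sketch (assumes the mtcnn and opencv-python
# packages are installed); not the dataset's own cropping notebook.
import cv2
from mtcnn import MTCNN

detector = MTCNN()


def crop_largest_face(image_path, output_path, margin=10):
    img = cv2.imread(image_path)
    if img is None:
        return False  # unreadable file
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB input
    faces = detector.detect_faces(rgb)
    if not faces:
        return False  # no face found; inspect such images manually
    # Keep the largest detection.
    x, y, w, h = max(faces, key=lambda f: f["box"][2] * f["box"][3])["box"]
    x0, y0 = max(0, x - margin), max(0, y - margin)
    crop = img[y0:y + h + margin, x0:x + w + margin]
    cv2.imwrite(output_path, crop)
    return True


crop_largest_face("example.jpg", "example_cropped.jpg")  # placeholder file names
```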
I developed this data set to train a classifier able to distinguish the gender shown in an image. Why bother, you may ask, when I can just look at the image and tell? True, but let's say you have a data set of 50,000 images that you want to separate into male and female data sets. Doing that by hand would take forever. With a trained classifier at near 100% accuracy, you can use model.predict to do the job for you.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes bibliographic information for 501 papers that were published from 2010 to April 2017 (the time of the search) and use online biodiversity databases for research purposes. Our overarching goal in this study is to determine how research uses of biodiversity data developed during a time of unprecedented growth of online data resources. We also determine uses with the highest number of citations, how online occurrence data are linked to other data types, and if/how data quality is addressed. Specifically, we address the following questions:
1.) What primary biodiversity databases have been cited in published research, and which databases have been cited most often?
2.) Is the biodiversity research community citing databases appropriately, and are the cited databases currently accessible online?
3.) What are the most common uses, general taxa addressed, and data linkages, and how have they changed over time?
4.) What uses have the highest impact, as measured through the mean number of citations per year?
5.) Are certain uses applied more often for plants/invertebrates/vertebrates?
6.) Are links to specific data types associated more often with particular uses?
7.) How often are major data quality issues addressed?
8.) What data quality issues tend to be addressed for the top uses?
Relevant papers for this analysis include those that use online and openly accessible primary occurrence records, or those that add data to an online database. Google Scholar (GS) provides full-text indexing, which was important to identify data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS. All authors discussed and agreed upon representative search terms, which were relatively broad to capture a variety of databases hosting primary occurrence records. The terms included: “species occurrence” database (8,800 results), “natural history collection” database (634 results), herbarium database (16,500 results), “biodiversity database” (3,350 results), “primary biodiversity data” database (483 results), “museum collection” database (4,480 results), “digital accessible information” database (10 results), and “digital accessible knowledge” database (52 results); note that quotation marks are used as part of the search terms where specific phrases are needed in whole. We downloaded all records returned by each search (or the first 500 if there were more) into a Zotero reference management database. About one third of the approximately 2,500 downloaded papers were relevant. Three of the authors with specialized knowledge of the field characterized relevant papers using a standardized tagging protocol based on a series of key topics of interest. We developed a list of potential tags and descriptions for each topic, including: database(s) used, database accessibility, scale of study, region of study, taxa addressed, research use of data, other data types linked to species occurrence data, data quality issues addressed, authors, institutions, and funding sources. Each tagged paper was thoroughly checked by a second tagger.
The final dataset of tagged papers allows us to quantify general areas of research made possible by the expansion of online species occurrence databases, and trends over time. Analyses of these data will be published in a separate quantitative review.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Digital self-help interventions for reducing the use of alcohol, tobacco and other drugs (ATOD) have generally shown positive but small effects in controlling substance use and improving the quality of life of participants. Nonetheless, low adherence rates remain a major drawback of these digital interventions, with mixed results in (prolonged) participation and outcome. To prevent non-adherence, we developed models to predict success in the early stages of an ATOD digital self-help intervention and explored the predictors associated with participants' goal achievement.
Methods
We included previous and current participants from a widely used, evidence-based ATOD intervention from the Netherlands (Jellinek Digital Self-help). Participants were considered successful if they completed all intervention modules and reached their substance use goals (i.e., stop/reduce). Early dropout was defined as finishing only the first module. During model development, participants were split per substance (alcohol, tobacco, cannabis) and features were computed from the log data of the first 3 days of intervention participation. Machine learning models were trained, validated and tested using a nested k-fold cross-validation strategy.
Results
Of the 32,398 participants enrolled in the study, 80% did not complete the first module of the intervention and were excluded from further analysis. Among the remaining participants, the percentage of success for each substance was 30% for alcohol, 22% for cannabis and 24% for tobacco. The area under the Receiver Operating Characteristic curve was highest for the Random Forest models trained on data from the alcohol and tobacco programs (0.71, 95% CI 0.69-0.73 and 0.71, 95% CI 0.67-0.76, respectively), followed by cannabis (0.67, 95% CI 0.59-0.75). Quitting substance use (rather than moderation) as an intervention goal, initial daily consumption, no substance use on the weekends as a target goal, and intervention engagement were strong predictors of success.
Discussion
Using log data from the first 3 days of intervention use, machine learning models showed positive results in identifying successful participants. Our results suggest the models were especially able to identify participants at risk of early dropout. Multiple variables were found to have high predictive value, which can be used to further improve the intervention.
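For illustration, a hedged sketch of the kind of nested k-fold cross-validation with a Random Forest and ROC-AUC scoring described above is given below, using scikit-learn; the feature matrix, fold counts, and hyperparameter grid are placeholders, not the study's actual configuration.

```python
# Hedged sketch of nested k-fold cross-validation with a Random Forest and
# ROC-AUC scoring, in the spirit of the model development described above.
# Feature matrix, fold counts and the hyperparameter grid are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # generalisation estimate

param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=inner_cv)

auc_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"AUC {auc_scores.mean():.2f} +/- {auc_scores.std():.2f}")
```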
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Version updates
v1.1 Updated on May 19, 2022. We added hydrographs for each catchment.
The v1.1 attachment is split into eight zipped parts. You should download all eight parts and unzip them together.
In this update, we added two zipped files in each gauge subfolder:
(1) GR4J_Hydrographs.zip and
(2) HMETS_Hydrographs.zip
Each of the zip files contains 50 CSV files. These CSV files are named with keywords of model name, gauge ID, and the calibration sub-period (CSP) identifier.
Each hydrograph CSV file contains four key columns:
(1) Date time (note that the hour column is less significant since this is daily data);
(2) Precipitation in mm, which is the aggregated basin-mean precipitation;
(3) Simulated streamflow in m3/s, in a column named "subXXX", where XXX is the ID of the catchment as specified in the CAMELS_463_gauge_info.txt file; and
(4) Observed streamflow in m3/s, in a column named "subXXX(observed)".
Note that these hydrograph CSV files report period-ending, time-averaged flows. They were produced directly by the Raven hydrological modeling framework. More information about the format of the hydrograph CSV files can be found on the Raven webpage.
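As a convenience, the sketch below shows one way to read such a hydrograph CSV with pandas and compute a KGE value from the simulated and observed columns. The file name and catchment ID are hypothetical placeholders, and the KGE shown is the standard Gupta et al. (2009) formulation, which may differ in detail from the variant reported in the metrics files.

```python
# Sketch: read a Raven hydrograph CSV and compute KGE between simulated and
# observed flow. File name and catchment ID ("sub01013500") are placeholders;
# the KGE below is the standard Gupta et al. (2009) formulation.
import numpy as np
import pandas as pd


def kge(sim, obs):
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    mask = ~np.isnan(sim) & ~np.isnan(obs)
    sim, obs = sim[mask], obs[mask]
    r = np.corrcoef(sim, obs)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


df = pd.read_csv("GR4J_01013500_CSP_example.csv")  # hypothetical file name
score = kge(df["sub01013500"], df["sub01013500(observed)"])
print(f"KGE = {score:.3f}")
```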
v1.0 First version published on Jan 29, 2022.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the format required by the Raven hydrological modeling framework. The GR4J and HMETS model building results in the paper, i.e., reference KGE and KGE metrics in the calibration, validation and testing periods, are provided so that the split-sample assessment performed in the paper can be replicated.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information for each catchment, and 463 subfolders, each having four files for a catchment, including:
(1) **Raven_Daymet_forcing.rvt**, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in day) from Jan 1st 1980 to Dec 31 2014 in the format required by the Raven hydrological modeling framework.
(2) **Raven_USGS_streamflow.rvt**, which contains daily discharge data (in m3/s) from Jan 1st 1980 to Dec 31 2014 in the format required by the Raven hydrological modeling framework.
(3) **GR4J_metrics.txt**, which contains reference KGE and GR4J-based KGE metrics in the calibration, validation and testing periods.
(4) **HMETS_metrics.txt**, which contains reference KGE and HMETS-based KGE metrics in the calibration, validation and testing periods.
Data collection and processing methods
**Data source**
Forcing data processing
Streamflow data processing
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of the reference KGE and the KGE in the model calibration, validation, and testing periods, which were derived in the large split-sample test experiment performed in the paper.
More details of the split-sample test experiment and the analysis of the modeling results can be found in the paper by Shen et al. (2022).
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J. (2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
This dataset consists of growth and yield data for each season when winter wheat (Triticum aestivum L.) was grown for grain at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). In each season, winter wheat was grown for grain on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. The square fields are themselves arranged in a larger square with the fields in four adjacent quadrants of the larger square. Fields and lysimeters within each field are thus designated northeast (NE), southeast (SE), northwest (NW), and southwest (SW). Irrigation was by linear move sprinkler system. Irrigation protocols described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation protocols described as deficit typically involved irrigations to establish the crop early in the season, followed by reduced or absent irrigations later in the season (typically in the later winter and spring). The growth and yield data include plant population density, height (except in 1989-1990), plant row width, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head mass (when present), kernel number, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have focused on winter wheat ET, crop coefficients, and crop water productivity. Crop coefficients have been used by ET networks. The data have utility for testing simulation models of crop ET, growth, and yield and have been used by the Agricultural Model Intercomparison and Improvement Project (AgMIP) and by many others for testing and calibrating models of ET that use satellite and/or weather data.
Resources in this dataset:
Resource Title: 1989-1990 Bushland, TX, west winter wheat growth and yield data. File Name: 1989-1990_West_Wheat_Growth_and_Yield.xlsx
Resource Description: This dataset consists of growth and yield data for the 1989-1990 winter wheat (Triticum aestivum L.) season at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). Winter wheat was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. The two square fields were themselves arranged with one directly north of and contiguous with the other. Fields and lysimeters within each field were designated northwest (NW) and southwest (SW). Irrigation was by linear move sprinkler system.
Irrigations described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation described as deficit typically involved irrigation to establish the crop in the autumn followed by reduced or no irrigation later in the late winter or spring. The growth and yield data include plant height (except in 1989-1990), leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head biomass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. There is a single spreadsheet for the west (NW and SW) lysimeters and fields. The spreadsheets contain tabs for data and corresponding tabs for data dictionaries. Typically, there are separate data tabs and corresponding dictionaries for plant growth during the season, crop growth stage, plant population, manual harvest from replicate plots in each field and from lysimeter surfaces, and machine (combine) harvest. An Introduction tab explains the tab names and contents, lists the authors, explains conventions, and lists some relevant references.
Resource Title: 1991-1992 Bushland, TX, east winter wheat growth and yield data. File Name: 1991-1992_East_Wheat_Growth_and_Yield.xlsx
Resource Description: This dataset consists of growth and yield data for the 1991-1992 winter wheat (Triticum aestivum L.) season at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). Winter wheat was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. The two square fields were themselves arranged with one directly north of and contiguous with the other. Fields and lysimeters within each field were designated northeast (NE) and southeast (SE). Irrigation was by linear move sprinkler system. Irrigations described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation described as deficit typically involved irrigation to establish the crop in the autumn followed by reduced or no irrigation later in the late winter or spring. The growth and yield data include plant height, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head biomass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. There is a single spreadsheet for the east (NE and SE) lysimeters and fields. The spreadsheets contain tabs for data and corresponding tabs for data dictionaries. Typically, there are separate data tabs and corresponding dictionaries for plant growth during the season, crop growth stage, plant population, manual harvest from replicate plots in each field and from lysimeter surfaces, and machine (combine) harvest. An Introduction tab explains the tab names and contents, lists the authors, explains conventions, and lists some relevant references.
Resource Title: 1992-1993 Bushland, TX, west winter wheat growth and yield data. File Name: 1992-1993_W_Wheat_Growth_and_Yield.xlsx
Resource Description: This dataset consists of growth and yield data for the 1992-1993 winter wheat (Triticum aestivum L.) season at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU) research weather station, Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). Winter wheat was grown on two large, precision weighing lysimeters, each in the center of a 4.44 ha square field. The two square fields were themselves arranged with one directly north of and contiguous with the other. Fields and lysimeters within each field were designated northwest (NW) and southwest (SW). Irrigation was by linear move sprinkler system. Irrigations described as full were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a neutron probe to 2.4-m depth in the field. Irrigation described as deficit typically involved irrigation to establish the crop in the autumn followed by reduced or no irrigation later in the late winter or spring. The growth and yield data include plant height, leaf area index, growth stage, total above-ground biomass, leaf and stem biomass, head biomass, and final yield. Data are from replicate samples in the field and non-destructive (except for final harvest) measurements on the weighing lysimeters. In most cases yield data are available from both manual sampling on replicate plots in each field and from machine harvest. There is a single spreadsheet for the west (NW and SW) lysimeters and fields. The spreadsheets contain tabs for data and corresponding tabs for data dictionaries. Typically, there are separate data tabs and corresponding dictionaries for plant growth during the season, crop growth stage, plant population, manual harvest from replicate plots in each field and from lysimeter surfaces, and machine (combine) harvest. An Introduction tab explains the tab names and contents, lists the authors, explains conventions, and lists some relevant references.
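A minimal pandas sketch for loading one of these workbooks is shown below; the file name is one of those listed above, and since the sheet names are not specified in this description, `sheet_name=None` is used to load every tab (data tabs and their data-dictionary tabs alike).

```python
# Sketch: load all tabs from one of the workbooks listed above; sheet names
# vary by season, so load everything and inspect the tab names.
import pandas as pd

workbook = "1989-1990_West_Wheat_Growth_and_Yield.xlsx"
sheets = pd.read_excel(workbook, sheet_name=None)  # dict of DataFrames keyed by tab name

for name, frame in sheets.items():
    print(name, frame.shape)
```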
Idaho's landscape-scale wetland condition assessment tool: methods and applications in conservation and restoration planning
Landscape-scale wetland threat and impairment assessment has been widely applied, both at the national level (NatureServe 2009) and in various states, including Colorado (Lemly et al. 2011), Delaware and Maryland (Tiner 2002 and 2005; Weller et al. 2007), Minnesota (Sands 2002), Montana (Daumiller 2003, Vance 2009), North Dakota (Mita et al. 2007), Ohio (Fennessy et al. 2007), Pennsylvania (Brooks et al. 2002 and 2004; Hychka et al. 2007; Wardrop et al. 2007), and South Dakota (Troelstrup and Stueven 2007). Most of these landscape-scale analyses use a relatively similar list of spatial layer inputs to calculate metrics for condition analyses. This is a cost-effective, objective way to obtain this information for all wetlands in a broad geographic area. Similar landscape-scale assessment projects in Idaho (Murphy and Schmidt 2010) used spatial analysis to estimate the relative condition of wetland habitats throughout Idaho.
Spatial data sources: Murphy and Schmidt (2010) reviewed the literature and the availability of spatial data to choose which spatial layers to include in their model of landscape integrity. Spatial layers preferably had statewide coverage for inclusion in the analysis. Nearly all spatial layers were downloaded from the statewide geospatial data clearinghouse, the Interactive Numeric and Spatial Information Data Engine for Idaho (INSIDE Idaho; http://inside.uidaho.edu/index.html). A complete list of layers used in the landscape integrity model is in Table 1. Statewide spatial layers were lacking for some important potential condition indicators, such as mine tailings, beaver presence, herbicide or pesticide use, non-native species abundance, nutrient loading, off-highway vehicle use, recreational and boating impacts, and sediment accumulation. Statewide spatial layers were also lacking for two presumably important potential indicators of wetland/riparian condition, recent timber harvest and livestock grazing. To rectify this, GIS models of potential recent timber harvest and livestock grazing were created using National Land Cover Data, grazing allotment maps, and NW ReGAP land cover maps.
Calculation of landscape and disturbance metrics: We used a landscape integrity model approach similar to that used by Lemly et al. (2011), Vance (2009), and Faber-Langendoen et al. (2006). Spatial analysis in GIS was used to calculate human land use, or disturbance, metrics for every 30 m pixel across Idaho. A single raster layer that indicated threats and impairments for each pixel was produced. This was accomplished by first calculating the distance from each human land use category, development type, or disturbance for each pixel. This inverse weighted distance model is based on the assumption that ecological condition will be poorer in areas of the landscape with the most cumulative human activities and disturbances. Condition improves as you move toward the least developed areas (Faber-Langendoen et al. 2006, Vance 2009, Lemly et al. 2011). Land uses or disturbances within 50 m were considered to have twice the impact of those 50 - 100 m away. For this model, land uses and disturbances > 100 m away were assumed to have zero or negligible impact. Because not all land uses impact wetlands the same way, weights for each land use or disturbance type were then determined using published literature (Hauer et al. 2002, Brown and Vivas 2005, Fennessy et al. 2007, Durkalec et al. 2009). A list of weights applied to each land use or disturbance type is in Table 2. A condition value for each pixel was then calculated. For example, the value for a pixel with a 2-lane highway and a railroad within 50 m and a home and an urban park between 50 and 100 m would be:
2-lane highway: weight 7.81 x distance factor 2 = 15.62
railroad: weight 7.81 x distance factor 2 = 15.62
single family home - low density: weight 6.91 x distance factor 1 = 6.91
recreation / open space - medium intensity: weight 4.38 x distance factor 1 = 4.38
Total disturbance value = 42.53
The integrity of each pixel was then ranked relative to all others in Idaho using methods analogous to Stoddard et al. (2005), Fennessy et al. (2007), Mita et al. (2007), and Troelstrup and Stueven (2007). Five condition categories based on the sum of weighted impacts present in each pixel were used:
1 = minimally disturbed (top 1% of wetlands); wetland present in the absence or near absence of human disturbances; zero to few stressors are present; land use is almost completely not human-created; equivalent to reference condition; conservation priority;
2 = lightly disturbed (2 - 5%); wetland deviates the least from that in the minimally disturbed class based on existing landscape impacts; few stressors are present; the majority of land use is not human-created; these are the best wetlands in areas where human influences are present; ecosystem processes and functions are within the natural ranges of variation found in the reference condition, but threats exist; conservation and/or restoration priority;
3 = moderately disturbed (6 - 15%); several stressors are present; land use is roughly split between human-created and non-human land use; ecosystem processes and functions are impaired and somewhat outside the range of variation found in the reference condition, but are still present; ecosystem processes are restorable;
4 = severely disturbed (16 - 40%); numerous stressors are present; land use is majority human-created; ecosystem processes and functions are severely altered or disrupted and outside the range of variation found in the reference condition; ecosystem processes are restorable, but may require large investments of energy and money for successful restoration;
5 = completely disturbed (bottom 41 - 100%); many stressors are present; land use is nearly completely human-created; ecosystem processes and functions are disrupted and outside the range of variation in the reference condition; ecosystem processes are very difficult to restore.
The resulting layer was then filtered using the map of potential wetland occurrence to show only those pixels potentially supporting wetlands. Results of the GIS landscape-scale assessment were verified by comparing them with the condition of wetlands determined in the field using rapid assessment methods. The landscape assessment matched the rapidly assessed field condition 61% of the time (Murphy et al. 2012). Thirty-one percent of the sites were misclassified by one condition class and 8% were misclassified by two condition classes. These results were similar to an accuracy assessment of a landscape-scale assessment performed by Mita et al. (2007) in North Dakota. When sites classified correctly and those off by only one condition class were combined (92% of the samples), results were similar to Vance (2009) in Montana (85%). The model of landscape integrity performed much better than the initial prototype model produced for Idaho by Murphy and Schmidt (2010).
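A hedged sketch of the per-pixel scoring rule described above (land uses within 50 m count at twice their weight, those at 50-100 m at their weight, and anything farther is ignored) is given below; the weights reproduce the worked example, while the full set of weights is in Table 2 of the source report.

```python
# Sketch of the inverse-weighted-distance disturbance score described above:
# a land use within 50 m counts at twice its weight, within 50-100 m at its
# weight, and beyond 100 m not at all. Weights follow the worked example.
def distance_factor(distance_m):
    if distance_m <= 50:
        return 2
    if distance_m <= 100:
        return 1
    return 0


def disturbance_value(features):
    """features: list of (weight, distance_m) pairs for land uses near a pixel."""
    return sum(w * distance_factor(d) for w, d in features)


# Worked example from the text: highway and railroad within 50 m,
# low-density home and medium-intensity recreation at 50-100 m.
pixel = [(7.81, 30), (7.81, 40), (6.91, 75), (4.38, 90)]
print(round(disturbance_value(pixel), 2))  # 42.53, matching the example
```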
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source code and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1. Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples for 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data). A sketch of this hybrid split is given after the folder description below. We develop three distinct models to evaluate the impact of oversampling magnetogram sequences in the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we apply oversampling only on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.
GitHub version
The .zip hosted here contains all files from the project, including the checkpoint and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.
Folders structure
In the root directory of the project, we have two folders:
magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
Seq_Magnetogram: contains the references to the source images with the corresponding labels in the next 24 h and 48 h, in the M24 and M48 sub-folders, respectively.
M24/M48: both present the following sub-folder structure:
Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test.
There are also two files in the root:
inst_packages.sh: installs the packages and dependencies needed to run the models.
download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which is responsible for creating the sequences, and "test_pandas.py", which verifies the header info and checks the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.
Naming pattern for the files:
magnetogram_jpg: follows the format "hmi.sharp_720s.
hmi: the instrument that captured the image.
sharp_720s: the database source of SDO/HMI.
Model training codes: "SF_MViT_
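For clarity, the hybrid split described in this entry (5-fold cross-validation on the first 50,000 chronologically ordered samples, with the final 9,834 samples held out as a chronological test set) can be sketched as follows; the index array is a placeholder, not the dataset's own loader, and the exact fold assignments are defined by the files in this repository.

```python
# Sketch of the hybrid split described above: 5-fold CV on the first 50,000
# chronologically ordered samples (40,000 train / 10,000 validation per fold)
# and the last 9,834 samples kept, in order, as the test set.
import numpy as np
from sklearn.model_selection import KFold

n_total, n_known = 59_834, 50_000
indices = np.arange(n_total)                         # chronological order assumed

known, test = indices[:n_known], indices[n_known:]   # test = last 9,834 samples

# shuffle=False keeps folds contiguous; the study's actual fold assignments
# are given by the reference files shipped with this dataset.
kfold = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kfold.split(known)):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, test={len(test)}")
```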
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The data have been used in an investigation for a PhD thesis in English Linguistics on similarities and differences in the use of the progressive aspect in two different language systems, English and Persian, both of which have the grammaticalised progressive. It is an application of the Heidelberg-Paris model of investigation into the impact of the progressive aspect on event conceptualisation. It builds on an analysis of single event descriptions at sentence level and re-narrations of a film clip at discourse level, as presented in von Stutterheim and Lambert (2005) DOI: 10.1515/9783110909593.203; Carroll and Lambert (2006: 54–73) http://libris.kb.se/bib/10266700; and von Stutterheim, Andermann, Carroll, Flecken & Schmiedtová (2012) DOI: 10.1515/ling-2012-0026. However, there are system-based typological differences between these two language systems due to the absence/presence of the imperfective-perfective categories, respectively. Thus, in addition to the description of the status of the progressive aspect in English and Persian and its impact on event conceptualisation, an important part of the investigation is the analysis of the L2 English speakers’ language production as the progressives in the first languages, L1s, exhibit differences in their principles of use due to the typological differences. The question of importance in the L2 context concerns the way they conceptualise ongoing events when the language systems are different, i.e. whether their language production is conceptually driven by their first language Persian.
The data consist of two data sets, as the study includes two linguistic experiments, Experiment 1 and Experiment 2. The data for both experiments were collected by email. Separate forms of instructions and language background questions were prepared for the six different informant groups, i.e. three speaker groups across two experimental tasks. In addition, a Nelson English test https://www.worldcat.org/isbn/9780175551972 of English proficiency was selected and modified for the L2 English speaker group in Experiment 2. Nelson English tests are published in Fowler, W.S. & Coe, N. (1976). Nelson English tests. Middlesex: Nelson and Sons. The test battery provides tests for all levels of proficiency. The graded tests are compiled in ten sets from elementary to very advanced level. Each set includes four graded tests, i.e. A, B, C, and D, resulting in 40 separate tests, each with 50 multiple-choice questions. The test entitled 250C was selected for this project. It occupies slot 19 of the 40 slots in the total battery. The multiple-choice questions were checked with a native English professional and 5 inadequate questions relating to pronunciation were omitted. In addition, a few modifications of the grammar questions were made, aiming at including questions that involve a contrast for the Persian L2 English learner with respect to the grammars of the two languages. The omissions and modifications provide an appropriate grammar test for very advanced Iranian learners of L2 English who have learnt the language in a classroom setting. The data sets collected from the informants are characterised as follows: The data from Experiment 1 function as the basis for the description of the progressive aspect in English, Persian and L2 English, while the data from Experiment 2 are the basis for the analysis of its use in a long stretch of discourse/language production for the three speaker groups. The parameters selected for the investigation comprised, first, phasal decomposition, which involves the use of the progressive in unrelated single motion events and narratives, and uses of begin/start in narratives. Second, granularity in narratives, which relates to the overall amount of language production in narratives. Third, event boundedness (encoded in the use of 2-state verbs and 1-state verbs with an endpoint adjunct), partly in single motion events and partly in temporal shift in narratives. Temporal shift is defined as follows: events in the narrative which are bounded shift the time line via a right boundary; events with a left boundary also shift the time line, even if they are unbounded. Fourth, left boundary, comprising the use of begin/start and try in narratives. Finally, temporal structuring, which involves the use of bounded versus unbounded events preceding the temporal adverbial then in narratives (the tests are described in the documentation files aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.docx and aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.rtf). In both experiments the participants watched a video, one relevant for single event descriptions, the other relevant for re-narration of a series of events. Thus, two different videos with stimuli for the different kinds of experimental tasks were used. For Experiment 1, a video of 63 short film clips presenting unrelated single events was provided by Professor Christiane von Stutterheim, Heidelberg University Language & Cognition (HULC) Lab, at Heidelberg University, Germany, https://www.hulclab.eu/.
For Experiment 2, an animation called Quest, produced by Thomas Stellmach in 1996, was used. It is available online at http://www.youtube.com/watch?v=uTyev6OaThg. Both stimuli have been used in previous investigations of different languages by the research groups associated with the HULC Lab. The informants were asked to describe the events seen in the stimulus videos, to record their language production and to send the recording to the researcher. For Experiment 2, most of the L1 English data were provided by Prof. von Stutterheim, Heidelberg University, who made available 34 re-narrations of the film Quest in English; 24 of them were selected for the present investigation. The project used six different informant groups, i.e. fully separate groups for the two experiments. The data from the single event descriptions in Experiment 1 were analysed quantitatively in Excel. The re-narrations of Experiment 2 were coded in NVivo 10, providing frequencies of various parametrical features (QSR International Pty Ltd. (2014). NVivo, Version 10. Doncaster, Australia: QSR International). The frequency counts from NVivo 10 were analysed statistically in Excel and SPSS (2017). These tools are appropriate for this research: Excel suits the smaller data load in Experiment 1, while NVivo 10 is practical for the large amount of data and parameters in Experiment 2. Notably, NVivo 10 enabled the three data sets to be analysed in the same manner once the categories of analysis and parameters had been defined under different nodes. As the results were to be extracted in the same fashion from each data set, the L1 English data received from Heidelberg for Experiment 2 were re-analysed according to the criteria employed in this project; the analysis nevertheless conforms to the criteria used earlier in the Heidelberg-Paris model.
Our dataset provides detailed and precise insights into the business, commercial, and industrial aspects of any given area in the USA (including Point of Interest (POI) data and foot traffic). The dataset is divided into 150 m x 150 m areas (geohash 7) and has over 50 variables.
- Use it for different applications: our combined dataset, which includes POI and foot traffic data, can be employed for various purposes. Different data teams use it to guide retailers and FMCG brands in site selection, fuel marketing intelligence, analyze trade areas, and assess company risk. Our dataset has also proven useful for real estate investment.
- Get reliable data: our datasets have been processed, enriched, and tested so your data team can use them more quickly and accurately.
- Ideal for training ML models: the high quality of our geographic information layers results from more than seven years of work dedicated to the deep understanding and modeling of geospatial Big Data. Among the features that distinguish this dataset is the use of anonymized and user-compliant mobile device GPS locations, enriched with other alternative and public data.
- Easy to use: our dataset is user-friendly and can be easily integrated into your current models. We can also deliver your data in different formats, such as .csv, according to your analysis requirements.
- Get personalized guidance: in addition to providing reliable datasets, we advise your analysts on their correct implementation. Our data scientists can guide your internal team on the optimal algorithms and models to get the most out of the information we provide (without compromising the security of your internal data).
Answer questions like:
- What places does my target user visit in a particular area? Which are the best areas to place a new POS?
- What is the average yearly income of users in a particular area?
- What is the influx of visits that my competition receives?
- What is the volume of traffic surrounding my current POS?
This dataset is useful for getting insights from industries like: Retail & FMCG; Banking, Finance, and Investment; Car Dealerships; Real Estate; Convenience Stores; Pharma and Medical Laboratories; Restaurant Chains and Franchises; Clothing Chains and Franchises.
Our dataset includes more than 50 variables, such as:
- Number of pedestrians seen in the area.
- Number of vehicles seen in the area.
- Average speed of movement of the vehicles seen in the area.
- Points of Interest (POIs), in number and type, seen in the area (supermarkets, pharmacies, recreational locations, restaurants, offices, hotels, parking lots, wholesalers, financial services, pet services, shopping malls, among others).
- Average yearly income range (anonymized and aggregated) of the devices seen in the area.
Notes to better understand this dataset:
- POI confidence means the average confidence of POIs in the area. In this case, POIs are any kind of location, such as a restaurant, a hotel, or a library.
- Category confidences, for example "food_drinks_tobacco_retail_confidence", indicate how confident we are in the existence of food/drink/tobacco retail locations in the area.
- We added predictions for The Home Depot and Lowe's Home Improvement stores in the dataset sample. These predictions were the result of a machine-learning model that was trained with the data.
Knowing where the current stores are, we can find the most similar areas for new stores to open. How efficient is a geohash? A geohash is a faster, cost-effective geofencing option that reduces input data load and provides actionable information. Its benefits include faster querying, reduced cost, minimal configuration, and ease of use. Geohashes range from 1 to 12 characters. The dataset can be split into variable-size geohashes, with the default being geohash 7 (150 m x 150 m).
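For readers unfamiliar with geohashes, the short Python sketch below shows how a latitude/longitude pair can be encoded into a 7-character geohash, i.e., roughly the 150 m x 150 m cells used in this dataset. It is a generic, self-contained illustration (the coordinates are made up), not part of the data product or its delivery pipeline.

# Minimal geohash encoder, for illustration only.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, even = "", True          # geohash interleaves bits, longitude first
    bit_count, char_index = 0, 0
    while len(result) < precision:
        rng = lon_range if even else lat_range
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            char_index = (char_index << 1) | 1
            rng[0] = mid
        else:
            char_index = char_index << 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:           # every 5 bits become one base32 character
            result += BASE32[char_index]
            bit_count, char_index = 0, 0
    return result

print(geohash_encode(40.7580, -73.9855))  # -> 7-character cell id covering roughly 150 m x 150 m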
This shapefile represents habitat suitability categories (High, Moderate, Low, and Non-Habitat) derived from a composite, continuous surface of sage-grouse habitat suitability index (HSI) values for Nevada and northeastern California during spring, which is a surrogate for habitat conditions during the sage-grouse breeding and nesting period. Summary of steps to create Habitat Categories: HABITAT SUITABILITY INDEX: The HSI was derived from a generalized linear mixed model (specified by a binomial distribution) that contrasted data from multiple environmental factors at used sites (telemetry locations) and available sites (random locations). Predictor variables for the model represented vegetation communities at multiple spatial scales, water resources, habitat configuration, urbanization, roads, elevation, ruggedness, and slope. Vegetation data were derived from various mapping products, which included NV SynthMap (Peterson 2008), SageStitch (Comer et al. 2002), LANDFIRE (LANDFIRE 2010), and the CA Fire and Resource Assessment Program (CFRAP 2006). The analysis was updated to include high-resolution percent cover within 30 x 30 m pixels for sagebrush, non-sagebrush, herbaceous vegetation, and bare ground (C. Homer, unpublished; based on the methods of Homer et al. 2014 and Xian et al. 2015) and conifer (primarily pinyon-juniper; P. Coates, unpublished). The pool of telemetry data included the same data from 1998 - 2013 used by Coates et al. (2014); additional telemetry location data from field sites in 2014 were added to the dataset. The dataset was then split according to calendar date into three seasons (spring, summer, winter). Spring included telemetry locations (n = 14,058) from mid-March to June, and is a surrogate for habitat conditions during the sage-grouse breeding and nesting period. All age and sex classes of marked grouse were used in the analysis. Sufficient data for modeling (i.e., a minimum of 100 locations from at least 20 marked sage-grouse) existed in 10 subregions for spring and summer, and in seven subregions for winter. It is important to note that although this map is composed of HSI values derived from the seasonal data, it does not explicitly represent habitat suitability for reproductive females (i.e., nesting). Insufficient data were available to allow for estimation of this habitat type for all seasons throughout the study area extent. A Resource Selection Function (RSF) was calculated for each subregion using generalized linear models to derive model-averaged parameter estimates for each covariate across a set of additive models. Subregional RSFs were transformed into Habitat Suitability Indices and averaged together to produce an overall statewide HSI, whereby a relative probability of occurrence was calculated for each raster cell during the spring season. In order to account for discrepancies in HSI values caused by varying ecoregions within Nevada, the HSI was divided into north and south extents using a slightly modified flood region boundary (Mason 1999) that was designed to represent the respective mesic and xeric regions of the state. North and south HSI rasters were each relativized according to their maximum value to rescale between zero and one, then mosaicked once more into a statewide extent.
HABITAT CATEGORIZATION: Using the same ecoregion boundaries described above, the habitat classification dataset (an independent data set comprising 10% of the total telemetry location sample) was split into locations falling within the respective north and south regions. HSI values from the composite and relativized statewide HSI surface were then extracted to each classification dataset location within the north and south regions. The distribution of these values was used to identify class break values corresponding to 0.5 (high), 1.0 (moderate), and 1.5 (low) standard deviations (SD) from the mean HSI. These class breaks were used to classify the HSI surface into four discrete categories of habitat suitability: High, Moderate, Low, and Non-Habitat. In terms of percentiles, High habitat comprised HSI values above the 30.9th percentile, Moderate the 15th to 30.9th percentile, Low the 6.7th to 15th percentile, and Non-Habitat values below the 6.7th percentile. The classified north and south regions were then clipped by the boundary layer and mosaicked to create a statewide categorical surface for habitat selection. Each habitat suitability category was converted to a vector output in which gaps within polygons smaller than 1.2 million square meters were eliminated, polygons within 500 meters of each other were connected to create corridors, and polygons smaller than 1.2 million square meters in one category were incorporated into the adjacent category. The final step was to mask major roads buffered by 50 m (Census 2014), lakes (Peterson 2008) and urban areas, and place those masked areas into the Non-Habitat category. The existing urban layer (Census 2010) was not sufficient for our needs because it excluded towns with populations lower than 1,500. Hence, we masked smaller towns (populations of 100 to 1,500) and development with Census block polygons (Census 2015) that had at least 50% urban development within their boundaries when viewed with reference imagery (ArcGIS World Imagery Service Layer). REFERENCES: California Forest and Resource Assessment Program (CFRAP). 2006. Statewide Land Use / Land Cover Mosaic. [Geospatial data.] California Department of Forestry and Fire Protection, http://frap.cdf.ca.gov/data/frapgisdata-sw-rangeland-assessment_data.php Census. 2010. TIGER/Line Shapefiles: Urban Areas. [Geospatial data.] U.S. Census Bureau, Washington, D.C., https://www.census.gov/geo/maps-data/data/tiger-line.html Census. 2014. TIGER/Line Shapefiles: Roads. [Geospatial data.] U.S. Census Bureau, Washington, D.C., https://www.census.gov/geo/maps-data/data/tiger-line.html Census. 2015. TIGER/Line Shapefiles: Blocks. [Geospatial data.] U.S. Census Bureau, Washington, D.C., https://www.census.gov/geo/maps-data/data/tiger-line.html Coates, P.S., Casazza, M.L., Brussee, B.E., Ricca, M.A., Gustafson, K.B., Overton, C.T., Sanchez-Chopitea, E., Kroger, T., Mauch, K., Niell, L., Howe, K., Gardner, S., Espinosa, S., and Delehanty, D.J. 2014. Spatially explicit modeling of greater sage-grouse (Centrocercus urophasianus) habitat in Nevada and northeastern California—A decision-support tool for management. U.S. Geological Survey Open-File Report 2014-1163, 83 p., http://dx.doi.org/10.3133/ofr20141163. ISSN 2331-1258 (online) Comer, P., Kagen, J., Heiner, M., and Tobalske, C. 2002. Current distribution of sagebrush and associated vegetation in the western United States (excluding NM). [Geospatial data.] Interagency Sagebrush Working Group, http://sagemap.wr.usgs.gov Homer, C.G., Aldridge, C.L., Meyer, D.K., and Schell, S.J. 2014.
Multi-Scale Remote Sensing Sagebrush Characterization with Regression Trees over Wyoming, USA; Laying a Foundation for Monitoring. International Journal of Applied Earth Observation and Geoinformation 14, Elsevier, US. LANDFIRE. 2010. 1.2.0 Existing Vegetation Type Layer. [Geospatial data.] U.S. Department of the Interior, Geological Survey, http://landfire.cr.usgs.gov/viewer/ Mason, R.R. 1999. The National Flood-Frequency Program—Methods for Estimating Flood Magnitude and Frequency in Rural Areas in Nevada. U.S. Geological Survey Fact Sheet 123-98, September 1999. Prepared by Robert R. Mason, Jr. and Kernell G. Ries III of the U.S. Geological Survey, and Jeffrey N. King and Wilbert O. Thomas, Jr. of Michael Baker, Jr., Inc., http://pubs.usgs.gov/fs/fs-123-98/ Peterson, E.B. 2008. A Synthesis of Vegetation Maps for Nevada (Initiating a 'Living' Vegetation Map). Documentation and geospatial data, Nevada Natural Heritage Program, Carson City, Nevada, http://www.heritage.nv.gov/gis Xian, G., Homer, C., Rigge, M., Shi, H., and Meyer, D. 2015. Characterization of shrubland ecosystem components as continuous fields in the northwest United States. Remote Sensing of Environment 168:286-300. NOTE: This file does not include habitat areas for the Bi-State management area, and the spatial extent is modified in comparison to Coates et al. (2014).
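The SD-based habitat categorization described above can be illustrated with a short Python sketch. The arrays below are synthetic and the function is only a schematic reading of the published procedure (breaks at 0.5, 1.0 and 1.5 SD below the mean HSI of the independent classification sample, consistent with the percentile cut-offs quoted above); it is not the authors' ArcGIS workflow.

import numpy as np

def categorize_hsi(hsi_raster, classification_values):
    # Class breaks sit 0.5, 1.0 and 1.5 SD below the mean HSI of the classification sample.
    mu, sd = classification_values.mean(), classification_values.std()
    breaks = [mu - 1.5 * sd, mu - 1.0 * sd, mu - 0.5 * sd]
    labels = np.array(["Non-Habitat", "Low", "Moderate", "High"])
    return labels[np.digitize(hsi_raster, breaks)]   # 0..3 -> Non-Habitat..High

# Synthetic stand-ins (not the actual HSI surface or telemetry sample).
rng = np.random.default_rng(0)
sample = rng.beta(2, 5, 1000)        # HSI values at classification locations
raster = rng.beta(2, 5, (4, 4))      # tiny HSI raster
print(categorize_hsi(raster, sample))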
This shapefile represents proposed management categories (Core, Priority, General, and Non-Habitat) derived from the intersection of habitat suitability categories and lek space use. Habitat suitability categories were derived from a composite, continuous surface of sage-grouse habitat suitability index (HSI) values for Nevada and northeastern California, formed from the multiplicative product of the spring, summer, and winter HSI surfaces. Summary of steps to create Management Categories: HABITAT SUITABILITY INDEX: The HSI was derived from a generalized linear mixed model (specified by a binomial distribution and created using ArcGIS 10.2.2) that contrasted data from multiple environmental factors at used sites (telemetry locations) and available sites (random locations). Predictor variables for the model represented vegetation communities at multiple spatial scales, water resources, habitat configuration, urbanization, roads, elevation, ruggedness, and slope. Vegetation data were derived from various mapping products, which included NV SynthMap (Peterson 2008), SageStitch (Comer et al. 2002), LANDFIRE (LANDFIRE 2010), and the CA Fire and Resource Assessment Program (CFRAP 2006). The analysis was updated to include high-resolution percent cover within 30 x 30 m pixels for sagebrush, non-sagebrush, herbaceous vegetation, and bare ground (C. Homer, unpublished; based on the methods of Homer et al. 2014 and Xian et al. 2015) and conifer (primarily pinyon-juniper; P. Coates, unpublished). The pool of telemetry data included the same data from 1998 - 2013 used by Coates et al. (2014) as well as additional telemetry location data from field sites in 2014. The dataset was then split according to calendar date into three seasons. Spring included telemetry locations (n = 14,058) from mid-March to June; summer included locations (n = 11,743) from July to mid-October; winter included locations (n = 4,862) from November to March. All age and sex classes of marked grouse were used in the analysis. Sufficient data for modeling (i.e., a minimum of 100 locations from at least 20 marked sage-grouse) existed in 10 subregions for spring and summer, and in seven subregions for winter. It is important to note that although this map is composed of HSI values derived from the seasonal data, it does not explicitly represent habitat suitability for reproductive females (i.e., nesting and with broods). Insufficient data were available to allow for estimation of this habitat type for all seasons throughout the study area extent. A Resource Selection Function (RSF) was calculated for each subregion and season using R software (v 3.13) and generalized linear models to derive model-averaged parameter estimates for each covariate across a set of additive models. For each season, subregional RSFs were transformed into Habitat Suitability Indices and averaged together to produce an overall statewide HSI, whereby a relative probability of occurrence was calculated for each raster cell. The three seasonal HSI rasters were then multiplied to create a composite annual HSI. In order to account for discrepancies in HSI values caused by varying ecoregions within Nevada, the HSI was divided into north and south extents using a slightly modified flood region boundary (Mason 1999) that was designed to represent the respective mesic and xeric regions of the state.
North and south HSI rasters were each relativized according to their maximum value to rescale between zero and one, then mosaicked once more into a statewide extent. HABITAT CATEGORIZATION: Using the same ecoregion boundaries described above, the habitat classification dataset (an independent data set comprising 10% of the total telemetry location sample) was split into locations falling within the respective north and south regions. HSI values from the composite and relativized statewide HSI surface were then extracted to each classification dataset location within the north and south regions. The distribution of these values was used to identify class break values corresponding to 0.5 (high), 1.0 (moderate), and 1.5 (low) standard deviations (SD) from the mean HSI. These class breaks were used to classify the HSI surface into four discrete categories of habitat suitability: High, Moderate, Low, and Non-Habitat. In terms of percentiles, High habitat comprised HSI values above the 30.9th percentile, Moderate the 15th to 30.9th percentile, Low the 6.7th to 15th percentile, and Non-Habitat values below the 6.7th percentile. The classified north and south regions were then clipped by the boundary layer and mosaicked to create a statewide categorical surface for habitat selection. Each habitat suitability category was converted to a vector output in which gaps within polygons smaller than 1.2 million square meters were eliminated, polygons within 500 meters of each other were connected to create corridors, and polygons smaller than 1.2 million square meters in one category were incorporated into the adjacent category. The final step was to mask major roads buffered by 50 m (Census 2014), lakes (Peterson 2008) and urban areas, and place those masked areas into the Non-Habitat category. The existing urban layer (Census 2010) was not sufficient for our needs because it excluded towns with populations lower than 1,500. Hence, we masked smaller towns (populations of 100 to 1,500) and development with Census block polygons (Census 2015) that had at least 50% urban development within their boundaries when viewed with reference imagery (ArcGIS World Imagery Service Layer). SPACE USE INDEX CALCULATION: Updated lek coordinates and associated trend count data were obtained from the 2015 Nevada Sage-grouse Lek Database compiled by the Nevada Department of Wildlife (NDOW, S. Espinosa, 9/20/2015). Lek count data from the California side of the Buffalo-Skedaddle and Modoc PMUs that contributed to the overall space-use model were obtained from the Western Association of Fish and Wildlife Agencies (WAFWA) and included count data up to 2014. We used NDOW data for border leks (n = 12) and WAFWA data for leks fully in California that were not consistently surveyed by NDOW. We queried the database for leks with a 'LEKSTATUS' field classified as 'Active' or 'Pending'. Active leks comprised leks with breeding males observed within the last 5 years (through the 2014 breeding season). Pending leks comprised leks without consistent breeding activity during the prior 3 - 5 surveys or that had not been surveyed during the past 5 years (these leks typically trended towards 'inactive'), as well as newly discovered leks with at least 2 males. A sage-grouse management area (SGMA) was calculated by buffering Population Management Units developed by NDOW by 10 km. This included leks from the Buffalo-Skedaddle PMU that straddles the northeastern California - Nevada border, but excluded leks for the Bi-State Distinct Population Segment.
The 5-year average (2011 - 2015) number of male grouse (or NDOW-classified 'pseudo-males' if males were not clearly identified but likely) attending each lek was calculated. Compared to the 2014 input lek dataset, 36 leks switched from pending to inactive, and 74 new leks were added for 2015 (which included pending 'new' leks with one year of counts). A total of 917 leks were used for space use index calculation in 2015, compared to 878 leks in 2014. Utilization distributions (UDs) describing the probability of lek occurrence were calculated using fixed kernel density estimators (Silverman 1986) with bandwidths estimated from likelihood-based cross-validation (CVh) (Horne and Garton 2006). UDs were weighted by the 5-year average (2011 - 2015) number of male grouse (or grouse of unknown gender if males were not identified) attending leks. UDs and bandwidths were calculated using Geospatial Modelling Environment (Beyer 2012) and the 'ks' package (Duong 2012) in Program R. Grid cell size was 30 m. The resulting raster was re-scaled between zero and one by dividing by the maximum pixel value. The non-linear effect of distance to lek on the probability of grouse spatial use was estimated using the inverse of the utilization distribution curves described by Coates et al. (2013), whereby the highest probability of grouse spatial use occurs near leks and then declines precipitously as a non-linear function. Euclidean distance was first calculated in ArcGIS, reclassified into 30-m distance bins (ranging from 0 to 30,000 m), and the bins were reclassified according to the non-linear curve in Coates et al. (2013). The resulting raster was re-scaled between zero and one by dividing by the maximum cell value. A Space Use Index (SUI) was calculated by taking the average of the lek utilization distribution and non-linear distance-to-lek rasters in ArcGIS, and re-scaled between zero and one by dividing by the maximum cell value. The volume of the SUI at specific cumulative isopleths was extracted in Geospatial Modelling Environment (Beyer 2012) with the command 'isopleth'. Interior polygons (i.e., 'donuts' > 1.2 km2) representing no probability of use within a larger polygon of use were erased from each isopleth. The 85% isopleth, which provided greater spatial connectivity and consistency with previously used agency standards (e.g., Doherty et al. 2010), was ultimately recommended by the Sagebrush Ecosystem Technical Team. The 85% SUI isopleth was clipped by the Nevada state boundary. MANAGEMENT CATEGORIES: The process for category determination was directed by the Nevada Sagebrush Ecosystem Technical Team. Sage-grouse habitat was categorized into 4 classes, High, Moderate, Low, and Non-Habitat, as described above, and intersected with the space use index to form the following management categories: 1) Core habitat: defined as the intersection between all suitable habitat (High, Moderate, and Low) and the 85% Space Use Index (SUI). 2) Priority habitat: defined as all high quality habitat
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The High Resolution Digital Elevation Model (HRDEM) product is derived from airborne LiDAR data (mainly in the south) and satellite images in the north. Complete coverage of the Canadian territory is gradually being established. The product includes a Digital Terrain Model (DTM), a Digital Surface Model (DSM) and other derived data. For DTM datasets, the derived data available are slope, aspect, shaded relief, color relief and color shaded relief maps; for DSM datasets, the derived data available are shaded relief, color relief and color shaded relief maps. The productive forest line is used to separate the northern and southern parts of the country. This line is approximate and may change based on requirements. In the southern part of the country (south of the productive forest line), DTM and DSM datasets are generated from airborne LiDAR data. They are offered at a 1 m or 2 m resolution and projected to the UTM NAD83 (CSRS) coordinate system and the corresponding zones. The datasets at a 1 m resolution cover an area of 10 km by 10 km, while datasets at a 2 m resolution cover an area of 20 km by 20 km. In the northern part of the country (north of the productive forest line), due to the low density of vegetation and infrastructure, only DSM datasets are generally generated. Most of these datasets have optical digital images as their source data. They are generated at a 2 m resolution using the Polar Stereographic North coordinate system referenced to the WGS84 horizontal datum or the UTM NAD83 (CSRS) coordinate system. Each dataset covers an area of 50 km by 50 km. For some locations in the north, DSM and DTM datasets can also be generated from airborne LiDAR data; in this case, these products are generated with the same specifications as those generated from airborne LiDAR in the southern part of the country. The HRDEM product is referenced to the Canadian Geodetic Vertical Datum of 2013 (CGVD2013), which is now the reference standard for heights across Canada. Source data for HRDEM datasets are acquired through multiple projects with different partners. Since data are acquired by project, there is no integration or edgematching done between projects; the tiles are aligned within each project. The High Resolution Digital Elevation Model (HRDEM) product is part of the CanElevation Series created in support of the National Elevation Data Strategy implemented by NRCan. Collaboration is a key factor in the success of the National Elevation Data Strategy. Refer to the "Supporting Document" section to access the list of the different partners, including links to their respective data.
https://spdx.org/licenses/CC0-1.0.html
A small brain and short life allegedly limit cognitive abilities. Our view of invertebrate cognition may also be biased by the choice of experimental stimuli. Here, the stimulus (color) pairs used in match-to-sample (MTS) tasks affected the performance of buff-tailed bumblebees (Bombus terrestris). We trained the bees to roll a tool, a ball, to a goal that matched its color. Color-matching performance was slower with yellow-and-orange/red than with blue-and-yellow stimuli. When assessing the bees' concept learning in a transfer test with a novel color, the bees trained with blue-and-yellow (novel color: orange/red) were highly successful, the bees trained with blue-and-orange/red (novel color: yellow) did not differ from random, and those trained with yellow-and-orange/red (novel color: blue) failed the test. These results highlight that stimulus salience can affect conclusions about test subjects' cognitive ability. We therefore encourage paying attention to stimulus salience (among other factors) when assessing invertebrate cognition. Methods Study system The experiments were conducted in 2018 in bumblebee facilities at the Botanical Garden of the University of Oulu, Finland. We obtained bumblebees from a continuous rearing program (Koppert B.V., The Netherlands). Each of the bumblebee hives (N = 7) used in the study was housed in a wooden box (31 cm × 13.5 cm × 11.5 cm) that had holes for air exchange and separate entrance and main hive chambers, with a 3 cm layer of cat litter at the bottom of the entrance chamber. Each hive had a queen and ~30 workers. We provided each hive with ~7 g of commercial pollen (Koppert B.V., The Netherlands) every second day and, when the bees were not being trained or tested (see below), they had continuous opportunity to forage on a 30% sucrose solution from a feeder. We used one hive at a time. Its entrance chamber was connected to a transparent plexiglass corridor (25 cm × 4 cm × 4.5 cm), which allowed bumblebees to access an arena (60 cm × 25 cm × 43 cm). Three transparent plastic sliding doors along the corridor provided means to control the bees' access to the arena (for testing purposes). This setup was used during pretraining, training and testing (see below). Pretraining The aim of the pretraining was to allow the bees to learn the location where to access the reward. In the pretraining, the bumblebees had unrestricted access to the arena, where they could access 30% sucrose solution from the middle of a circular white platform (Ø 150 mm) placed in the central part of the arena. During the pretraining, the most active foragers were identified by an observer (OJL and VN) and each of these bees was marked with a small number tag. These tagged individuals were used in the training and test. Training The purpose of the training was to assess our hypothesis while training the bees to match to a sample. In the training, the center of the arena had a white circular plastic platform (Ø 150 mm with a bordering wall 12 mm high). This platform had a hole in its center (Ø 12 mm) and a colored circular zone encompassing the center hole (Ø 35 mm). Three lanes (20 mm wide at the center section, outlined by 1 mm high and 10 mm wide plastic strips) ran from the rim of the platform and converged at the central zone at 120° angles relative to the adjacent lanes (Figure 1A).
The platform also held two wooden balls (Ø 8.5 mm) of different colors (blue and yellow, blue and orange/red, or yellow and orange/red), painted using Uni POSCA PC-5M paint (Mitsubishi Pencil Co., Ltd., Japan) (Figure 1). The bordering wall, the three lanes and the circular zone around the center hole (collectively referred to as the 'platform') shared one color, which matched one of the two balls, while the other ball was of a different color. During the training, only one tagged bee (N = 28 over the experiment) was allowed to access the arena at a time. Each bee was randomly assigned to a treatment group (blue and yellow, blue and orange/red, or yellow and orange/red) and was thus exposed to only two of the three colors used in the experiment. In each treatment group, a bee was exposed to the platform in two different colors. There were two balls, one always matching the color of the platform and the other having the other color. Each bee was challenged with a color-matching task in the context of token use. The bee was given 5 minutes to complete a training bout. During a training bout, the 'correct' (rewarded) action required the bumblebee to roll the ball that matched the color of the platform from the rim of the platform to its center hole. Rolling the ball onto the central zone surrounding the hole, but not all the way into the hole, was also considered successful. If the bee was successful, the experimenter used a syringe to immediately place a reward of 30% sucrose solution ad libitum (>200 µl) in the central hole for the bee to drink. Failing to accomplish the task within 5 minutes, or rolling the non-matching ball onto the central zone, was deemed 'incorrect' (i.e., the bumblebee did not accomplish the task). After each training bout (successful or failed), the bumblebee was allowed to use the connecting corridor to visit the hive and later return to the arena to try again (i.e., the start of another training bout). We cleaned both balls and the platform with ethanol to neutralize any odor cues after each training bout. We also switched the platform's color between the two options used for that particular bumblebee after every 1-3 training bouts. The behavior of the bee was video recorded for later analysis using a Sony Xperia XZ Premium smartphone. As soon as an individual reached the training criterion (matching the ball that had the same color as the platform for 5 or more consecutive bouts), she was used in the transfer test. At the end of each day, all the bees were allowed to freely access the arena to forage from a white platform, as during the pretraining phase. The training progressed in a stepwise fashion that included four steps. In the first step, the ball that matched the color of the platform was already in the central hole, and the bumblebee was rewarded as soon as it touched that ball. Once this had happened, the task progressed to the second step, in which the 'correct' ball was placed next to the central zone. After this step was successfully completed, the third step involved the balls being placed midway between the central zone and the rim of the platform. Once the focal bumblebee completed this step, the final step involved both balls being placed at the rim of the platform, from where the bumblebee needed to roll the matching ball to the center. Most, if not all, of the bees failed one or more steps during the training.
When a bumblebee did not perform the task correctly within the 5-minute training bout, the experimenter (OJL and VN) used a plastic model bumblebee (which mimicked the color patterns of a B. terrestris worker), attached to a thin transparent stick, to demonstrate how to solve the task. The experimenter then used a syringe to give the sucrose solution directly to the bumblebee. A model, rather than living, bumblebee demonstrator ensured a desired and standardized demonstration. Transfer test The purpose of the transfer test was to assess whether the bees exhibit concept learning by applying a learned rule in a novel context. The test was conducted once a bee reached the training criterion (5 or more successful training bouts in a row). The test consisted of a single bout that was similar to the last phase of training, with the following exceptions: the platform was of the 'third' color that the bumblebee had not encountered during the training, and the platform had 3 balls of different colors: blue, yellow and orange/red. One ball was placed at the end of each lane, next to the rim of the platform. The test ended if the bee rolled the correct ball to the central hole. If the bumblebee rolled a ball of a color that did not match that of the platform, it was considered to have failed the test; the test nevertheless continued, with the 'incorrect' ball returned to the rim of the platform, until 10 minutes had passed. QUANTIFICATION AND STATISTICAL ANALYSIS All statistical analyses were conducted using R version 3.6.2 and SPSS v25 (IBM Corp). Generalised linear mixed models (GLMM) with a Poisson distribution (link = log) in the package 'glmmTMB' were used to examine whether the color pair (three levels: blue-and-yellow, blue-and-orange/red, and yellow-and-orange/red) affected the number of bouts taken to reach the training criterion. We included bee ID nested within colony ID as the random variable. To assess whether bumblebees learned to solve the generalization task, we used 1/3 as the baseline expectation and compared it to the bumblebees' performance (in terms of the number of bumblebees that solved vs. did not solve the task) for each treatment group using a binomial test. To examine the effect of the color pairs on success in the test, we conducted a GLMM with a binomial distribution (link = logit). However, due to convergence issues (likely related to the zeros, as the bees in the yellow-and-orange/red treatment group completely failed the test), the model could not be run. Accordingly, we compared the bumblebees' test performance between the three treatment groups using Fisher's exact test. We also used Fisher's exact tests with Bonferroni corrections for post hoc analyses when comparing the performance between any two treatment groups (adjusted significance level: P ≤ 0.017). Another GLMM with a Poisson distribution (link = log) was conducted to examine whether the number of training bouts differed between the bees that successfully completed the transfer test and those that failed it. For this analysis, we included colony, bee
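The core significance tests described above can be sketched in Python with SciPy. The authors used R (glmmTMB) and SPSS; this is only an illustrative re-implementation of the binomial test against the 1/3 chance level and of pairwise Fisher's exact tests with a Bonferroni-adjusted threshold, and the counts below are made up rather than the study's data.

from scipy.stats import binomtest, fisher_exact

# Made-up counts: (solved, failed) in the transfer test per training group.
groups = {"blue-yellow": (9, 1), "blue-red": (5, 5), "yellow-red": (0, 10)}

# Binomial test of each group's success rate against the 1/3 chance level.
for name, (solved, failed) in groups.items():
    res = binomtest(solved, n=solved + failed, p=1/3)
    print(f"{name}: p = {res.pvalue:.3f}")

# Pairwise Fisher's exact tests at a Bonferroni-adjusted alpha of 0.05/3 ~ 0.017.
pairs = [("blue-yellow", "blue-red"), ("blue-yellow", "yellow-red"), ("blue-red", "yellow-red")]
for a, b in pairs:
    _, p = fisher_exact([list(groups[a]), list(groups[b])])
    print(f"{a} vs {b}: p = {p:.3f}, significant at adjusted level: {p <= 0.017}")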
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hybrid LCA database generated using ecoinvent and EXIOBASE: each process of the original ecoinvent database receives new direct inputs (taken from EXIOBASE) that were deemed missing (e.g., services). Each process of the resulting hybrid database is thus not (or at least less) truncated, and the calculated life cycle emissions/impacts should therefore be closer to reality.
For license reasons, only the added inputs for each process of ecoinvent are provided (and not all the inputs).
Why are there two versions for hybrid-ecoinvent3.5?
One of the versions corresponds to ecoinvent hybridized with the normal version of EXIOBASE; the other is hybridized with a capital-endogenized version of EXIOBASE.
What does capital endogenization do?
It matches capital goods formation to the value chains of the products where those capital goods are required. In more LCA terms, EXIOBASE in its normal version does not allocate capital use to value chains; it is as if ecoinvent processes had no inputs of buildings, etc. in their unit process inventories. For more detail on this, refer to (Södersten et al., 2019) or (Miller et al., 2019).
So which version do I use?
Using the version "with capitals" gives a more comprehensive coverage. Using the "without capitals" version means that if a process of ecoinvent misses inputs of capital goods (e.g., a process does not include the company laptops of the employees), it won't be added. It comes with its fair share of assumptions and uncertainties however.
Why is it only available for hybrid-ecoinvent3.5?
The work used for capital endogenization is not available for exiobase3.8.1.
How do I use the dataset?
First, to use it, you will need both the corresponding ecoinvent [cut-off] and EXIOBASE [product x product] versions. For the reference year of the EXIOBASE version to be used, take 2011 if using hybrid-ecoinvent3.5 and 2019 for hybrid-ecoinvent3.6 and 3.7.1.
In the four datasets of this package, only added inputs are given (i.e. inputs from EXIOBASE added to ecoinvent processes). Ecoinvent and EXIOBASE processes/sectors are not included, for copyright issues. You thus need both ecoinvent and EXIOBASE to calculate life cycle emissions/impacts.
Module to get ecoinvent in a Python format: https://github.com/majeau-bettez/ecospold2matrix (make sure to take the most up-to-date branch)
Module to get EXIOBASE in a Python format: https://github.com/konstantinstadler/pymrio (can also be installed with pip)
If you want to use the "with capitals" version of the hybrid database, you also need to use the capital endogenized version of EXIOBASE, available here: https://zenodo.org/record/3874309. Choose the pxp version of the year you plan to study (which should match with the year of the EXIOBASE version). You then need to normalize the capital matrix (i.e., divide by the total output x of EXIOBASE). Then, you simply add the normalized capital matrix (K) to the technology matrix (A) of EXIOBASE (see equation below).
Once you have all the data needed, you just need to apply a slightly modified version of the Leontief equation:
\begin{equation}
\textbf{q}^{hyb} = \begin{bmatrix} \textbf{C}^{lca}\cdot\textbf{S}^{lca} & \textbf{C}^{io}\cdot\textbf{S}^{io} \end{bmatrix} \cdot \left( \textbf{I} - \begin{bmatrix} \textbf{A}^{lca} & \textbf{C}^{d} \\ \textbf{C}^{u} & \textbf{A}^{io}+\textbf{K}^{io} \end{bmatrix} \right)^{-1} \cdot \begin{bmatrix} \textbf{y}^{lca} \\ 0 \end{bmatrix}
\end{equation}
qhyb gives the hybridized impact, i.e., the impacts of each process including the impacts generated by their new inputs.
Clca and Cio are the respective characterization matrices for ecoinvent and EXIOBASE.
Slca and Sio are the respective environmental extension matrices (or elementary flows in LCA terms) for ecoinvent and EXIOBASE.
I is the identity matrix.
Alca and Aio are the respective technology matrices for ecoinvent and EXIOBASE (the ones loaded with ecospold2matrix and pymrio).
Kio is the capital matrix. If you do not use the endogenized version, do not include this matrix in the calculation.
Cu (or upstream cut-offs) is the matrix that you get in this dataset.
Cd (or downstream cut-offs) is simply a matrix of zeros in the case of this application.
Finally you define your final demand (or functional unit/set of functional units for LCA) as ylca.
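As a purely numerical illustration of the equation and symbol definitions above, the sketch below assembles the block system with small random stand-in matrices in NumPy. The dimensions and values are arbitrary; a real application would use the sparse ecoinvent and EXIOBASE matrices loaded with ecospold2matrix and pymrio, with Cu taken from this dataset.

import numpy as np

rng = np.random.default_rng(42)
n_lca, n_io, n_flows, n_imp = 5, 4, 6, 3       # toy dimensions (arbitrary)

# Technology matrices in IO convention (inputs only, positive coefficients).
A_lca = rng.random((n_lca, n_lca)) * 0.1       # stand-in for ecoinvent
A_io  = rng.random((n_io,  n_io))  * 0.1       # stand-in for EXIOBASE
K_io  = rng.random((n_io,  n_io))  * 0.05      # endogenized capitals (omit if not used)
C_u   = rng.random((n_io,  n_lca)) * 0.05      # upstream cut-offs (this dataset)
C_d   = np.zeros((n_lca, n_io))                # downstream cut-offs (zeros here)

# Environmental extensions and characterization matrices (random stand-ins).
S_lca = rng.random((n_flows, n_lca))
S_io  = rng.random((n_flows, n_io))
C_lca = rng.random((n_imp, n_flows))
C_io  = rng.random((n_imp, n_flows))

# Block technology matrix of the hybrid system.
A_hyb = np.block([[A_lca, C_d],
                  [C_u,   A_io + K_io]])

# Final demand: one unit of the first ecoinvent process, nothing from EXIOBASE.
y = np.zeros(n_lca + n_io)
y[0] = 1.0

# q_hyb = [C_lca S_lca, C_io S_io] (I - A_hyb)^-1 y
CS = np.hstack([C_lca @ S_lca, C_io @ S_io])
q_hyb = CS @ np.linalg.solve(np.eye(n_lca + n_io) - A_hyb, y)
print(q_hyb)   # one impact score per characterized category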
Can I use it with different versions/reference years of EXIOBASE?
Technically speaking, yes it will work, because the temporal aspect does not intervene in the determination of the hybrid database presented here. However, keep in mind that there might be some inconsistencies. For example, you would need to multiply each of the inputs of the datasets by a factor to account for inflation. Prices of ecoinvent (which were used to compile the hybrid databases, for all versions presented here) are defined in €2005.
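As a toy example of such an inflation correction (the index values below are invented; a real EUR price index or deflator should be used):

import numpy as np

price_index = {2005: 100.0, 2011: 112.0, 2019: 124.0}     # made-up index values
factor = price_index[2019] / price_index[2005]            # ~1.24
C_u_2005 = np.array([[0.020, 0.000],
                     [0.005, 0.030]])                     # toy cut-off coefficients in EUR2005
C_u_target = C_u_2005 * factor                            # expressed at the target year's price level
print(factor, C_u_target)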
What is the weird suite of numbers in the columns?
Ecoinvent processes are identified through unique identifiers (uuids) to which metadata (i.e., name, location, price, etc.) can be retraced with the appropriate metadata files in each dataset package.
Why is the equation (I - A)^-1 and not A^-1 like in LCA?
IO and LCA have the same computational background. In LCA, however, the convention is to represent both outputs and inputs in the technology matrix. That is why there is a diagonal of 1s (the outputs, i.e. the functional units) and negative values elsewhere (the inputs). In IO, the technology matrix does not include outputs and only registers inputs, as positive values. In the end, it is just a difference of convention. If we call T the technology matrix of LCA and A the technology matrix of IO, we have T = I - A. When you load ecoinvent using ecospold2matrix, the resulting version of ecoinvent will already be in the IO convention, so you won't have to bother with this.
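A quick NumPy check of the T = I - A statement, with a toy 2 x 2 technology matrix:

import numpy as np

A = np.array([[0.0, 0.2],
              [0.1, 0.0]])            # IO convention: positive input coefficients
T = np.eye(2) - A                     # LCA convention: 1s on the diagonal, negative inputs
y = np.array([1.0, 0.0])              # final demand / functional unit
print(np.allclose(np.linalg.solve(np.eye(2) - A, y), np.linalg.solve(T, y)))  # True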
Pymrio does not provide a characterization matrix for EXIOBASE, what do I do?
You can find an up-to-date characterization matrix (with Impact World+) for environmental extensions of EXIOBASE here: https://zenodo.org/record/3890339
If you want to match characterization across both EXIOBASE and ecoinvent (which you should do), here you can find a characterization matrix with Impact World+ for ecoinvent: https://zenodo.org/record/3890367
It's too complicated...
The custom software that was used to develop these datasets already deals with some of the steps described. Go check it out: https://github.com/MaximeAgez/pylcaio. You can also generate your own hybrid version of ecoinvent using this software (you can play with some parameters like correction for double counting, inflation rate, change price data to be used, etc.). As of pylcaio v2.1, the resulting hybrid database (generated directly by pylcaio) can be exported to and manipulated in brightway2.
Where can I get more information?
The whole methodology is detailed in (Agez et al., 2021).
https://spdx.org/licenses/CC0-1.0.html
Fitness consequences of early-life environmental conditions are often sex-specific, but corresponding evidence for invertebrates remains inconclusive. Here we use meta-analysis to evaluate sex-specific sensitivity to early-life nutritional conditions in insects. Using literature-derived data for 85 species with broad phylogenetic and ecological coverage, we show that females are generally more sensitive to food stress than males. Stressful nutritional conditions during development typically lead to female-biased mortality and thus increasingly male-biased sex ratios of emerging adults. We further demonstrate that the general trend of higher sensitivity to food stress in females can primarily be attributed to their typically larger body size in insects and hence higher energy needs during development. By contrast, there is no consistent evidence of sex-biased sensitivity in sexually size-monomorphic species. Drawing conclusions regarding sex-biased sensitivity in species with male-biased size dimorphism must await the accumulation of relevant data. Our results suggest that environmental conditions leading to elevated juvenile mortality may further affect the performance of insect populations by reducing the proportion of females among individuals reaching reproductive age. Accounting for sex-biased mortality is therefore essential to understanding the dynamics and demography of insect populations, not least in the context of ongoing insect declines. Methods Data collection These data were collected for a meta-analysis to assess sex-specific sensitivity to early-life nutritional conditions in insects. We made use of experimental case studies reporting sex ratios at adult emergence in conspecifics reared under two or more diet treatments (food quality or availability). We collated primary studies in two complementary ways. The majority of primary data sets for this synthesis were collected systematically by the lead author (T. Teder) from an extensive list of journals in the fields of entomology, ecology and evolutionary biology, partly as a result of one-time retrospective screening (articles published before 2003) and partly as a result of continuous screening (articles published between 2004 and 2021) of journals' tables of contents. Our systematic screening meant that the journals' tables of contents were routinely examined, and all papers identified as potentially containing relevant data on the basis of article titles were subjected to full-text review. As data of this type are typically reported in tables and figures, their identification within articles was straightforward. To increase the amount of primary data, additional studies were identified by a thorough search in major literature databases (Google Scholar, Web of Science, Scopus; published until 2021). These complementary searches in the literature databases were undertaken to find relevant data in journals that remained uncovered by our main data collection method. Accordingly, while exploring the search results, we primarily focused on studies published in journals that were not subjected to systematic screening. The procedure for identifying relevant primary papers among the search results was essentially identical to that used when screening journals' tables of contents: papers identified as potentially containing relevant data based on article titles were retrieved for full-text review.
To minimize any search-related biases, we used only search queries that were strictly neutral concerning the focal questions of our study (i.e. sex-specific sensitivity to nutritional stress). Accordingly, our search queries included only combinations of very generic search terms: one of several synonyms of sex ratio ('sex ratio', 'proportion/percentage/fraction of males/females'), 'mortality' and one of particular insect order names ('Diptera', 'Hemiptera', 'Lepidoptera', 'Coleoptera', 'Orthoptera', etc., or 'insect*'). No restriction was set on the language or publication year of primary studies. As a major exception, we systematically ignored studies focusing on Hymenoptera and Thysanoptera during the process of data collection. These groups of insects have haplodiploid sex determination (males develop from unfertilized and females from fertilized eggs) which provides mothers with an efficient mechanism for manipulating offspring sex ratio. We also did not consider taxa regularly exhibiting asexual reproduction, such as aphids. Data extraction and criteria for eligibility For a study to be considered, it had to provide two types of information: i) sex ratios at adult emergence for multiple (two or more) diet treatments together with sample sizes, and ii) corresponding juvenile mortality rates. Typically, sex ratios in primary studies were reported as the proportion/percentage of males/females or the ratio of the two sexes at adult emergence (or, in a few cases, at the pupal stage). As sample sizes for sex ratio estimates were not always explicitly indicated, we applied various indirect approaches to derive them, most often combining information on sample sizes at the start of the experiment with data on mortality throughout juvenile stages. The combined juvenile mortality rate of the two sexes was used as a proxy for nutritional stress. Accordingly, our research relies on the premise that, within each primary study, food stress was most severe in treatments with the highest mortality rates and least pronounced in treatments with the lowest mortality rates. Both egg-to-adult and larval mortality rates (often reported as survival rates) were considered equally acceptable measures of juvenile mortality. In a few cases, we also accepted mortality rates estimated over a particular fixed part of the larval stage (two primary studies) or the pupal stage (three studies). We limited our inclusion criteria to studies where major external mortality agents – predators and parasitoids – were explicitly excluded. In all studies included, experimental treatments were applied to the F1 generation only, whereas their parents were maintained under identical conditions, excluding in this way any parental effect on sex ratios. Among-treatment differences in nutritional stress were solely due to variations in food quality (e.g., different host plants, different prey species, also different artificial diets) or food availability. Otherwise, the conditions were uniform within the experiments. Data from multifactorial experiments (e.g. those manipulating both diet and temperature) were divided into different data sets so that the environmental factor of our interest was allowed to vary while other factors were held constant. In some primary studies, food quality and amount were manipulated indistinguishably within the same experimental setup. Data extracted from different studies were always treated as different data sets. 
However, data from a single study could also be split into multiple primary data sets if obtained from different experiments or using different species/populations/genotypes. We deliberately did not consider studies in which the diet treatments contained pesticides or their residues. WebPlotDigitizer 4.3 (A. Rohatgi; https://automeris.io/WebPlotDigitizer) was used to extract graphically presented data. One should note that the overwhelming majority of primary studies were conducted in contexts other than the focus of our synthesis: sex differences in stress responses per se were rarely addressed in these papers. Therefore, in a considerable share of the primary studies found, between-treatment differences in juvenile mortality were relatively small, indicating low variation in environmental stress levels. Naturally, in order to meaningfully evaluate sex-specific responses to food stress, there must be some variation in food stress across treatments. We therefore arbitrarily limited our main database to a subset of primary studies in which mortality rates across treatments differed by at least 10% (calculated as the difference between the maximum and minimum mortality rates across treatments). This way we ensured that growth conditions within studies were not "too similar" across treatments. Applying this threshold retained altogether 125 primary data sets, which formed the backbone of our analyses.
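The 10% mortality-spread inclusion criterion can be expressed as a simple filter. The sketch below uses pandas with invented toy records; it is not the actual compilation of 125 data sets.

import pandas as pd

# Toy records: one row per diet treatment within a primary data set (values invented).
df = pd.DataFrame({
    "dataset_id": [1, 1, 1, 2, 2],
    "treatment":  ["control", "low food", "poor host", "control", "low food"],
    "juvenile_mortality_pct": [12.0, 35.0, 48.0, 20.0, 26.0],
})

# Keep data sets whose max-min mortality spread is at least 10 percentage points.
spread = df.groupby("dataset_id")["juvenile_mortality_pct"].agg(lambda m: m.max() - m.min())
kept = df[df["dataset_id"].isin(spread[spread >= 10].index)]
print(kept)   # data set 1 (36-point spread) is retained, data set 2 (6-point spread) is dropped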
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Aboriginal Sites Decision Support Tool (ASDST) extends the Aboriginal Heritage Information Management System (AHIMS) by illustrating the potential distribution of site features recorded in AHIMS. ASDST was first developed in 2012 by the Office of Environment and Heritage (OEH) to support landscape planning of Aboriginal heritage. The tool produces a suite of modelled raster GIS outputs held in Esri GRID format. The first suite was published in 2016 as Version 7 at 100 m resolution in the Lambert Conformal Conic (LCC) projection. The current Version 7.5 was produced by what is now the Department of Planning, Industry and Environment (DPIE) in 2020 at 50 m resolution in a Geographic Coordinate System (GCS). Each layer covers the extent of NSW.
The suite of layers includes separate predictive layers for different Aboriginal site feature types. The feature codes used in layer naming conventions are:
The feature models have been derived in two forms:
The first form (“p1750XXX” where XXX denotes three letter feature code) predicts likelihood of feature distribution prior to European colonisation of NSW.
The second form (“curr_XXX” where XXX denotes three letter feature code) predicts feature likelihood in the current landscape.
For both sets of feature likelihood layers, cell values range from 0 – 1000, where 0 indicates low likelihood and 1000 is high likelihood.
Please note the scale is likelihood and NOT probability. Likelihood is defined as a relative measure indicating the likelihood that a grid cell may contain the feature of interest relative to all other cells in the layer.
Additionally, there are other derived products as part of the suite. These are:
drvd_imp = a model of accumulated impacts, derived by summing the differences between the pre-colonisation and current versions of all feature models (a schematic sketch of this calculation is given below). Cell values range from 0 to 1000, where 1000 indicates high accumulated impact.
drvd_rel = a model of the reliability of the predictions, based on an environmental distance algorithm that considers recorded site density across the variables used in the models.
drvd_srv = a survey priority map that considers model reliability (data gaps), current likelihood and accumulated impact. Cell values range from 0 to 1000, where 1000 indicates the highest survey priority relative to the rest of the layer.
For more details see the technical reference on the ASDST website.
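As a schematic of the accumulated-impact product (drvd_imp) referenced above, the NumPy sketch below sums per-feature differences between pre-colonisation and current likelihood grids and rescales the result to 0-1000. The grids are tiny synthetic arrays and the max-based rescaling is an assumption; the exact procedure is documented in the ASDST technical reference.

import numpy as np

def accumulated_impact(pre_layers, curr_layers):
    # Sum per-feature losses (pre-colonisation minus current likelihood).
    diff = sum(pre - curr for pre, curr in zip(pre_layers, curr_layers))
    diff = np.clip(diff, 0, None)            # ignore cells where likelihood increased
    m = diff.max()
    if m == 0:                               # no loss anywhere
        return np.zeros_like(diff)
    return np.round(1000 * diff / m).astype(int)   # rescale to 0-1000 (assumed)

# Tiny synthetic grids standing in for two feature models (likelihood values 0-1000).
rng = np.random.default_rng(1)
pre  = [rng.integers(0, 1001, (3, 3)) for _ in range(2)]
curr = [np.minimum(p, rng.integers(0, 1001, (3, 3))) for p in pre]
print(accumulated_impact(pre, curr))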
NB. Old layers with a suffix of “_v7” indicate they are part of ASDST Version 7 produced in 2016. The current models (Version 7.5) do not contain a version number in their name and will continue to be named generically in future versions for seamless access.
Updates applied to ASDST version 7.5
For all ASDST 7.5 data sets, the resolution was increased from a 100 m cell to a 50 m cell. All data sets were clipped and cleaned to a refined coastal mask. Cell gaps in the mask were filled using a Nibble algorithm. The pre-settlement data sets were derived by resampling the version 7 pre-settlement data sets to a 50 m cell size. The present-day data sets were derived from the version 7.5 pre-settlement layers and 2017-18 land-use mapping, applying the same version 7 parameters for estimating the preservation of each feature type on each land use. For version 7.5, the model reliability data set was derived by resampling the version 7 data set to a 50 m cell size. The accumulated impact and survey priority version 7.5 data sets were derived by applying the version 7 processing algorithm but substituting the version 7.5 pre-settlement and present-day ASDST models.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the Raven hydrological modeling required format. The GR4J and HMETS model building results in the paper, i.e., reference KGE and KGE metrics in calibration, validation and testing periods, are provided for replication of the split-sample assessment performed in the paper.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information of each catchment, and 463 subfolders, each having four files for a catchment, including:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in day) from Jan 1st 1980 to Dec 31 2014 in a Raven hydrological modeling required format.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st 1980 to Dec 31 2014 in a Raven hydrological modeling required format.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics in calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics in calibration, validation and testing periods.
Data collection and processing methods
Data source
Forcing data processing
Streamflow data processing
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of reference KGE and KGE values in the model calibration, validation, and testing periods, which were derived in the massive split-sample test experiment performed in the paper.
More details of the split-sample test experiment and the analysis of the modeling results can be found in the paper by Shen et al. (2022).
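For readers replicating the metrics, the Kling-Gupta efficiency (KGE) can be computed from simulated and observed flows as in the Python sketch below. This is the standard original KGE formulation with made-up toy arrays; the exact variant and the reference KGE definition used in the paper should be taken from Shen et al. (2022).

import numpy as np

def kge(sim, obs):
    # KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), with r = correlation,
    # alpha = std(sim)/std(obs), beta = mean(sim)/mean(obs).
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Toy daily flows in m3/s (made up, not from the dataset).
obs = np.array([10.0, 12.0, 9.0, 15.0, 20.0, 18.0])
sim = np.array([11.0, 11.5, 10.0, 14.0, 19.0, 16.5])
print(round(kge(sim, obs), 3))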
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J.(2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
Original CAMELS dataset:
A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). Development of a large-sample watershed-scale hydrometeorological dataset for the contiguous USA: dataset characteristics and assessment of regional variability in hydrologic model performance. Hydrol. Earth Syst. Sci., 19, 209-223, http://doi.org/10.5194/hess-19-209-2015
Data Publication
This study:
H. Shen, B.