Dataset Card for "RLCD-generated-preference-data-split"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Data Split is a dataset for object detection tasks - it contains Objects annotations for 1,392 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
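As an illustration of the splitting strategies compared above, the sketch below contrasts a random split with a time-split on a hypothetical compound table; the file name, the `date_registered` column, and the 75:25 ratio are assumptions for the example, not details from the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical compound table with a registration date and measured activity.
df = pd.read_csv("compounds.csv", parse_dates=["date_registered"])

# Random split: tends to give an optimistic estimate of prospective R2.
train_rand, test_rand = train_test_split(df, test_size=0.25, random_state=0)

# Time-split: train on the oldest 75% of compounds, test on the newest 25%,
# which mimics true prospective prediction more closely.
df_sorted = df.sort_values("date_registered")
cutoff = int(len(df_sorted) * 0.75)
train_time, test_time = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
```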
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This resource contains the code used in the study "Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins," published in Water Resources Research (doi: 10.1029/2023WR034464).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian-distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
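For intuition, here is a minimal numerical sketch of the additive-Gaussian-noise decomposition referenced above, assuming a univariate Gaussian X with known noise scale; the tuning parameter tau and the sample size are illustrative choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Gaussian case: split X into two pieces using external noise Z.
sigma, tau = 1.0, 1.0
X = rng.normal(loc=2.0, scale=sigma, size=500)   # observed data
Z = rng.normal(loc=0.0, scale=sigma, size=500)   # auxiliary Gaussian noise

f_X = X + tau * Z      # one piece, e.g. used for selection
g_X = X - Z / tau      # the other piece, held out for inference
# For Gaussian X, f(X) and g(X) are independent, yet together they recover X:
X_reconstructed = (f_X + tau**2 * g_X) / (1 + tau**2)
assert np.allclose(X, X_reconstructed)
```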
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Split Data Patch is a dataset for object detection tasks - it contains Patch annotations for 636 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The UUIDs from the 80% split of https://huggingface.co/datasets/liu-nlp/fineweb-data-80-20-split-indices that were split into two parts due to resource limits.
```yaml
dataset_info:
  features:
    - name: id
      dtype: string
  splits:
    - name: train
      num_bytes: 3879427302
      num_examples: 76067202
    - name: test
      num_bytes: 3879427302
      num_examples: 76067202
  download_size: 5861889789
  dataset_size: 7758854604
configs:
  - config_name: default
    data_files:
      - split: train…
```
See the full description on the dataset page: https://huggingface.co/datasets/liu-nlp/fineweb-data-80-split-in-two-parts.
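A minimal loading sketch for the configuration above, assuming the `datasets` library is installed and the dataset is accessible on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the default config described above; each split holds the `id` string column.
ds = load_dataset("liu-nlp/fineweb-data-80-split-in-two-parts")
print(ds["train"].num_rows, ds["test"].num_rows)  # 76,067,202 rows per split, per the card
```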
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
yongjoongkim/X-ALMA-Parallel-Data-Split dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Phanh Vũ
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Thermal Detection Split 3 is a dataset for object detection tasks - it contains People annotations for 340 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data file contains records from research participants in an empirical investigation of dysfunctional individuation and its association with ego splitting and with differentiation of self. College adjustment problems were also measured. The Dysfunctional Individuation Scale, the Splitting Index, and the Differentiation of Self battery were administered, along with measures of college adjustment. The general aim of the project was to provide further evidence for the construct validity of dysfunctional individuation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The temporal split-sample approach is the most common method to allocate observed data into calibration and validation groups for hydrologic model calibration. Often, calibration and validation data are split 50:50, where a hydrologic model is calibrated using the first half of the observed data and the second half is used for model validation. However, there is no standard strategy for how to split the data. This may result in different distributions in the observed hydrologic variable (e.g., wetter conditions in one half compared to the other) that could affect simulation results. We investigated this uncertainty by calibrating Soil and Water Assessment Tool hydrologic models with observed streamflow for three watersheds within the United States. We used six temporal data calibration/validation splitting strategies for each watershed (33:67, 50:50, and 67:33 with the calibration period occurring first, then the same three with the validation period occurring first). We found that the choice of split could have a large enough impact to alter conclusions about model performance. Through different calibrations of parameter sets, the choice of data splitting strategy also led to different simulations of streamflow, snowmelt, evapotranspiration, soil water storage, surface runoff, and groundwater flow. The impact of this research is an improved understanding of uncertainties caused by the temporal split-sample approach and the need to carefully consider calibration and validation periods for hydrologic modeling to minimize uncertainties during its use. The file "Research_Data_for_Myers_et_al.zip" includes the water balances and observed data from the study.
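For concreteness, the sketch below (not the study's code) applies the six temporal splitting strategies to a daily streamflow series; the file name and column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily streamflow series indexed by date.
flow = pd.read_csv("observed_streamflow.csv", index_col="date",
                   parse_dates=True)["discharge_cms"]

def temporal_split(series, cal_fraction, calibration_first=True):
    """Split a time series into (calibration, validation) blocks by time order."""
    n_cal = int(round(len(series) * cal_fraction))
    if calibration_first:
        return series.iloc[:n_cal], series.iloc[n_cal:]
    return series.iloc[-n_cal:], series.iloc[:-n_cal]

# The six strategies described above: 33:67, 50:50, and 67:33,
# with the calibration period occurring either first or last.
splits = {
    (f"{round(frac * 100)}:{100 - round(frac * 100)}",
     "cal-first" if first else "cal-last"): temporal_split(flow, frac, first)
    for frac in (1/3, 0.5, 2/3)
    for first in (True, False)
}
```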
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
SPLIT 3 is a dataset for object detection tasks - it contains SPLIT3 annotations for 7,306 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of models in predicting unseen analysis windows [cross-validation (CV) approach] for the Swedish, Swiss, and Swedish+Swiss testing data sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The shear-wave splitting (SWS) databases data product provides the geosciences community with easy access to two published databases:
The SWS files are also available from the EarthScope Data Archive: https://data.earthscope.org/archive/seismology/products/swsdb/README.html
A query for the Splitlab shear-wave splitting database is available on SPUD: https://ds.iris.edu/spud/swsmeasurement
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An Open Context "predicates" dataset item. Open Context publishes structured data as granular, URL identified Web resources. This "Variables" record is part of the "Çatalhöyük Area TP Zooarchaeology" data publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This SPSS file was used to examine the ongoing construct validation of the Dysfunctional Individuation Scale (DIS). In addition to the DIS, the file includes assessments of Differentiation of Self and Splitting, along with measures of college adjustment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Version updates
v1.1 Updated on May 19, 2022. We added hydrographs for each catchment.
The v1.1 data are attached as eight zipped parts; download all eight parts and unzip them together.
In this update, we added two zipped files in each gauge subfolder:
(1) GR4J_Hydrographs.zip and
(2) HMETS_Hydrographs.zip
Each of the zip files contains 50 CSV files, named using the model name, gauge ID, and calibration sub-period (CSP) identifier.
Each hydrograph CSV file contains four key columns:
(1) Date time (note that the hour column is less significant since this is daily data);
(2) Precipitation in mm that is the aggregated basin mean precipitation;
(3) Simulated streamflow in m3/s and the column is named as "subXXX", where XXX is the ID of the catchment, specified in the CAMELS_463_gauge_info.txt file; and
(4) Observed streamflow in m3/s and the column is named as "subXXX(observed)".
Note that these hydrograph CSV files report period-ending, time-averaged flows. They were produced directly by the Raven hydrological modeling framework. More information about the format of the hydrograph CSV files can be found on the Raven webpage.
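A short reading sketch for one of these hydrograph CSVs; the file name below is illustrative only, and the exact column headers should be checked against the files, since only the "subXXX" / "subXXX(observed)" naming pattern is described here.

```python
import pandas as pd

# Illustrative file name following the model / gauge ID / CSP identifier pattern.
path = "GR4J_01013500_CSP-3A_1990_Hydrographs.csv"  # hypothetical example name

df = pd.read_csv(path)

# Pick out the simulated and observed streamflow columns by the naming pattern above.
sim_col = next(c for c in df.columns if c.startswith("sub") and "observed" not in c)
obs_col = next(c for c in df.columns if "observed" in c)
simulated, observed = df[sim_col], df[obs_col]
```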
v1.0 First version published on Jan 29, 2022.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing, and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the format required by the Raven hydrological modeling framework. The GR4J and HMETS model building results, i.e., the reference KGE and KGE metrics for the calibration, validation, and testing periods, are provided so that the split-sample assessment performed in the paper can be replicated.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information for each catchment, and 463 subfolders, one per catchment, each containing four files:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in days) from Jan 1st, 1980 to Dec 31st, 2014 in the format required by the Raven hydrological modeling framework.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st, 1980 to Dec 31st, 2014 in the same Raven-required format.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics in calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics in calibration, validation and testing periods.
Data collection and processing methods
Data source
Catchment information and the Daymet meteorological forcing are retrieved from the CAMELS data set.
The USGS streamflow data are collected from the U.S. Geological Survey's (USGS) National Water Information System (NWIS).
The GR4J and HMETS performance metrics (i.e., reference KGE and KGE) are produced in the study by Shen et al. (2022).
Forcing data processing
A quality assessment procedure was performed. For example, the daily maximum air temperature should be greater than the daily minimum air temperature; where it was not, the two values were swapped.
Units are converted to those required by Raven. Precipitation: mm/day, unchanged; daily minimum/maximum air temperature: deg_C, unchanged; shortwave: W/m2 to MJ/m2/day; day length: seconds to days.
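The shortwave and day-length conversions above can be expressed as follows (a sketch assuming Daymet's convention that shortwave is a daylight-average flux in W/m2 and day length is given in seconds):

```python
SECONDS_PER_DAY = 86400.0

def shortwave_to_mj_per_m2_day(srad_w_m2, daylength_s):
    # Daily energy = daylight-average flux (W/m2 = J/s/m2) * daylight duration (s),
    # converted from J/m2 to MJ/m2.
    return srad_w_m2 * daylength_s * 1e-6

def daylength_to_days(daylength_s):
    return daylength_s / SECONDS_PER_DAY
```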
Data for a catchment are archived in an RVT (ASCII-based) file, in which the second line specifies the start time of the forcing series, the time step (= 1 day), and the total number of time steps in the series (= 12784); the third and fourth lines specify the forcing variables and their corresponding units, respectively.
More details on Raven-formatted forcing files can be found in the Raven manual.
Streamflow data processing
Units are converted to those required by Raven. Daily discharge originally in cfs is converted to m3/s.
Missing data are replaced with -1.2345, as Raven requires. Missing time steps are not counted in the performance metric calculations.
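The discharge conversion and missing-value convention above amount to the following (a sketch; the helper name is illustrative):

```python
import math

CFS_TO_CMS = 0.0283168   # 1 cubic foot per second in cubic metres per second
RAVEN_MISSING = -1.2345  # Raven's flag for missing observations

def usgs_cfs_to_raven_cms(q_cfs):
    """Convert a USGS daily discharge value from cfs to m3/s for the RVT file."""
    if q_cfs is None or (isinstance(q_cfs, float) and math.isnan(q_cfs)):
        return RAVEN_MISSING
    return q_cfs * CFS_TO_CMS
```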
The streamflow series is archived in an RVT (ASCII-based) file, which opens with eight comment lines specifying relevant gauge and streamflow data information, such as the gauge name, gauge ID, USGS-reported catchment area, calculated catchment area (based on the catchment shapefiles in the CAMELS dataset), streamflow data range, data time step, and missing data periods. The first line after the comment lines specifies the data type (default is HYDROGRAPH), subbasin ID (i.e., SubID), and discharge unit (m3/s), respectively. The next line specifies the start of the streamflow data, the time step (= 1 day), and the total number of time steps in the series (= 12784), respectively.
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of the reference KGE and KGE in the model calibration, validation, and testing periods, derived from the massive split-sample test experiment performed in the paper.
Columns in these metrics files are gauge ID, calibration sub-period (CSP) identifier, KGE in calibration, validation, testing1, testing2, and testing3, respectively.
We proposed 50 different CSPs in the experiment. The CSP identifier is a unique name for each CSP. For example, the identifier "CSP-3A_1990" means the model is built on Jan 1st, 1990, calibrated on the first 3-year sample (1981-1983), and validated on the remaining years of the 1980 to 1989 period. Note that 1980 is always used for spin-up.
We defined three testing periods (independent of the calibration and validation periods) for each CSP: the first 3 years from the model build year inclusive, the first 5 years from the model build year inclusive, and all years from the model build year inclusive. For example, "testing1", "testing2", and "testing3" for CSP-3A_1990 are 1990-1992, 1990-1994, and 1990-2014, respectively.
Reference flow is the interannual mean daily flow over a specific period; it is derived as a one-year series and then repeated for each year of the calculation period.
For calibration, its reference flow is based on spin-up + calibration periods.
For validation, its reference flow is based on spin-up + calibration periods.
For testing, its reference flow is based on spin-up + calibration + validation periods.
Reference KGE is calculated from the reference flow and the observed streamflow in a specific calculation period (e.g., calibration). It is computed using the KGE equation, substituting the reference flow for the simulated flow over the period in question. Note that although the reference KGEs for the three testing periods are built from the same historical period, they differ, because each testing period spans a different time window and therefore covers a different series of observed flow.
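The calculation described above can be sketched as follows, using the standard 2009 formulation of KGE and a day-of-year interannual mean as the reference flow; the function names and the pandas-based layout are illustrative, not the study's code.

```python
import numpy as np
import pandas as pd

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def reference_kge(obs_eval, obs_base):
    """Score the repeated interannual mean daily flow against the observations.

    obs_eval -- observed daily flow in the evaluation period (pandas Series, datetime index)
    obs_base -- observed daily flow in the base period used to build the reference flow
    """
    doy_mean = obs_base.groupby(obs_base.index.dayofyear).mean()
    reference = obs_eval.index.dayofyear.map(doy_mean).to_numpy(dtype=float)
    return kge(reference, obs_eval.to_numpy(dtype=float))
```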
More details of the split-sample test experiment and the analysis of the modeling results can be found in the paper by Shen et al. (2022).
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J. (2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
Original CAMELS dataset:
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., & Duan, Q. (2015). Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: Data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences, 19, 209-223. https://doi.org/10.5194/hess-19-209-2015
This dataset was created by Trisha Tomy