Dataset Card for "RLCD-generated-preference-data-split"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Data Split is a dataset for object detection tasks - it contains Objects annotations for 1,392 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
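As an illustration of the splitting strategies compared above, the sketch below contrasts a random split with a time-split on a hypothetical compound table; the file name, the `date_registered` column, and the 75:25 ratio are assumptions for the example, not details from the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical compound table with a registration date and measured activity.
df = pd.read_csv("compounds.csv", parse_dates=["date_registered"])

# Random split: tends to give an optimistic estimate of prospective R2.
train_rand, test_rand = train_test_split(df, test_size=0.25, random_state=0)

# Time-split: train on the oldest 75% of compounds, test on the newest 25%,
# which mimics true prospective prediction more closely.
df_sorted = df.sort_values("date_registered")
cutoff = int(len(df_sorted) * 0.75)
train_time, test_time = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
```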
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This resource contains the code used in the study "Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins," published in Water Resources Research (doi: 10.1029/2023WR034464).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian-distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
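For intuition, here is a minimal numerical sketch of the additive-Gaussian-noise decomposition referenced above, assuming a univariate Gaussian X with known noise scale; the tuning parameter tau and the sample size are illustrative choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Gaussian case: split X into two pieces using external noise Z.
sigma, tau = 1.0, 1.0
X = rng.normal(loc=2.0, scale=sigma, size=500)   # observed data
Z = rng.normal(loc=0.0, scale=sigma, size=500)   # auxiliary Gaussian noise

f_X = X + tau * Z      # one piece, e.g. used for selection
g_X = X - Z / tau      # the other piece, held out for inference
# For Gaussian X, f(X) and g(X) are independent, yet together they recover X:
X_reconstructed = (f_X + tau**2 * g_X) / (1 + tau**2)
assert np.allclose(X, X_reconstructed)
```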
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Split Data Patch is a dataset for object detection tasks - it contains Patch annotations for 636 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The UUIDs from the 80% split of https://huggingface.co/datasets/liu-nlp/fineweb-data-80-20-split-indices that were split into two parts due to resource limits.
```yaml
dataset_info:
  features:
    - name: id
      dtype: string
  splits:
    - name: train
      num_bytes: 3879427302
      num_examples: 76067202
    - name: test
      num_bytes: 3879427302
      num_examples: 76067202
  download_size: 5861889789
  dataset_size: 7758854604
configs:
  - config_name: default
    data_files:
      - split: train…
```
See the full description on the dataset page: https://huggingface.co/datasets/liu-nlp/fineweb-data-80-split-in-two-parts.
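A minimal loading sketch for the configuration above, assuming the `datasets` library is installed and the dataset is accessible on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the default config described above; each split holds the `id` string column.
ds = load_dataset("liu-nlp/fineweb-data-80-split-in-two-parts")
print(ds["train"].num_rows, ds["test"].num_rows)  # 76,067,202 rows per split, per the card
```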
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
yongjoongkim/X-ALMA-Parallel-Data-Split dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Phanh Vũ
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Thermal Detection Split 3 is a dataset for object detection tasks - it contains People annotations for 340 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data file contains records from research participants in an empirical investigation of dysfunctional individuation and its association with ego splitting and with differentiation of self. College adjustment problems were also measured. The Dysfunctional Individuation Scale, the Splitting Index, and the Differentiation of Self battery were administered, along with measures of college adjustment. The general aim of the project was to provide further evidence for the construct validity of dysfunctional individuation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The temporal split-sample approach is the most common method to allocate observed data into calibration and validation groups for hydrologic model calibration. Often, calibration and validation data are split 50:50, where a hydrologic model is calibrated using the first half of the observed data and the second half is used for model validation. However, there is no standard strategy for how to split the data. This may result in different distributions in the observed hydrologic variable (e.g., wetter conditions in one half compared to the other) that could affect simulation results. We investigated this uncertainty by calibrating Soil and Water Assessment Tool hydrologic models with observed streamflow for three watersheds within the United States. We used six temporal data calibration/validation splitting strategies for each watershed (33:67, 50:50, and 67:33 with the calibration period occurring first, then the same three with the validation period occurring first). We found that the choice of split could have a large enough impact to alter conclusions about model performance. Through different calibrations of parameter sets, the choice of data splitting strategy also led to different simulations of streamflow, snowmelt, evapotranspiration, soil water storage, surface runoff, and groundwater flow. The impact of this research is an improved understanding of uncertainties caused by the temporal split-sample approach and the need to carefully consider calibration and validation periods for hydrologic modeling to minimize uncertainties during its use. The file "Research_Data_for_Myers_et_al.zip" includes the water balances and observed data from the study.
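For concreteness, the sketch below (not the study's code) applies the six temporal splitting strategies to a daily streamflow series; the file name and column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily streamflow series indexed by date.
flow = pd.read_csv("observed_streamflow.csv", index_col="date",
                   parse_dates=True)["discharge_cms"]

def temporal_split(series, cal_fraction, calibration_first=True):
    """Split a time series into (calibration, validation) blocks by time order."""
    n_cal = int(round(len(series) * cal_fraction))
    if calibration_first:
        return series.iloc[:n_cal], series.iloc[n_cal:]
    return series.iloc[-n_cal:], series.iloc[:-n_cal]

# The six strategies described above: 33:67, 50:50, and 67:33,
# with the calibration period occurring either first or last.
splits = {
    (f"{round(frac * 100)}:{100 - round(frac * 100)}",
     "cal-first" if first else "cal-last"): temporal_split(flow, frac, first)
    for frac in (1/3, 0.5, 2/3)
    for first in (True, False)
}
```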
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
SPLIT 3 is a dataset for object detection tasks - it contains SPLIT3 annotations for 7,306 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of models in predicting unseen analysis windows [cross-validation (CV) approach] for the Swedish, Swiss, and Swedish+Swiss testing data sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The shear-wave splitting (SWS) databases data product provides the geosciences community with easy access to two published databases:
The SWS files are also available from the EarthScope Data Archive: https://data.earthscope.org/archive/seismology/products/swsdb/README.html
A query for the Splitlab shear-wave splitting database is available on SPUD: https://ds.iris.edu/spud/swsmeasurement
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An Open Context "predicates" dataset item. Open Context publishes structured data as granular, URL identified Web resources. This "Variables" record is part of the "Çatalhöyük Area TP Zooarchaeology" data publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This SPSS file was used to examine the ongoing construct validation of the Dysfunctional Individuation Scale (DIS). In addition to the DIS, the file includes assessments of Differentiation of Self and Splitting, along with measures of college adjustment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Version updates
v1.1 Updated on May 19, 2022. We added hydrographs for each catchment.
The v1.1 data are attached as eight zipped parts; download all eight parts and unzip them together.
In this update, we added two zipped files in each gauge subfolder:
(1) GR4J_Hydrographs.zip and
(2) HMETS_Hydrographs.zip
Each of the zip files contains 50 CSV files, named using the model name, gauge ID, and calibration sub-period (CSP) identifier.
Each hydrograph CSV file contains four key columns:
(1) Date time (note that the hour column is less significant since this is daily data);
(2) Precipitation in mm that is the aggregated basin mean precipitation;
(3) Simulated streamflow in m3/s and the column is named as "subXXX", where XXX is the ID of the catchment, specified in the CAMELS_463_gauge_info.txt file; and
(4) Observed streamflow in m3/s and the column is named as "subXXX(observed)".
Note that these hydrograph CSV files report period-ending, time-averaged flows. They were produced directly by the Raven hydrological modeling framework. More information about the format of the hydrograph CSV files can be found on the Raven webpage.
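A short reading sketch for one of these hydrograph CSVs; the file name below is illustrative only, and the exact column headers should be checked against the files, since only the "subXXX" / "subXXX(observed)" naming pattern is described here.

```python
import pandas as pd

# Illustrative file name following the model / gauge ID / CSP identifier pattern.
path = "GR4J_01013500_CSP-3A_1990_Hydrographs.csv"  # hypothetical example name

df = pd.read_csv(path)

# Pick out the simulated and observed streamflow columns by the naming pattern above.
sim_col = next(c for c in df.columns if c.startswith("sub") and "observed" not in c)
obs_col = next(c for c in df.columns if "observed" in c)
simulated, observed = df[sim_col], df[obs_col]
```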
v1.0 First version published on Jan 29, 2022.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing, and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the format required by the Raven hydrological modeling framework. The GR4J and HMETS model building results, i.e., the reference KGE and KGE metrics for the calibration, validation, and testing periods, are provided so that the split-sample assessment performed in the paper can be replicated.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information for each catchment, and 463 subfolders, one per catchment, each containing four files:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in days) from Jan 1st, 1980 to Dec 31st, 2014 in the format required by the Raven hydrological modeling framework.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st, 1980 to Dec 31st, 2014 in the same Raven-required format.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics in calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics in calibration, validation and testing periods.
Data collection and processing methods
Data source
Catchment information and the Daymet meteorological forcing are retrieved from the CAMELS data set.
The USGS streamflow data are collected from the U.S. Geological Survey's (USGS) National Water Information System (NWIS).
The GR4J and HMETS performance metrics (i.e., reference KGE and KGE) are produced in the study by Shen et al. (2022).
Forcing data processing
A quality assessment procedure was performed. For example, the daily maximum air temperature should be greater than the daily minimum air temperature; where it was not, the two values were swapped.
Units are converted to those required by Raven. Precipitation: mm/day, unchanged; daily minimum/maximum air temperature: deg_C, unchanged; shortwave: W/m2 to MJ/m2/day; day length: seconds to days.
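The shortwave and day-length conversions above can be expressed as follows (a sketch assuming Daymet's convention that shortwave is a daylight-average flux in W/m2 and day length is given in seconds):

```python
SECONDS_PER_DAY = 86400.0

def shortwave_to_mj_per_m2_day(srad_w_m2, daylength_s):
    # Daily energy = daylight-average flux (W/m2 = J/s/m2) * daylight duration (s),
    # converted from J/m2 to MJ/m2.
    return srad_w_m2 * daylength_s * 1e-6

def daylength_to_days(daylength_s):
    return daylength_s / SECONDS_PER_DAY
```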
Data for a catchment are archived in an RVT (ASCII-based) file, in which the second line specifies the start time of the forcing series, the time step (= 1 day), and the total number of time steps in the series (= 12784); the third and fourth lines specify the forcing variables and their corresponding units, respectively.
More details on Raven-formatted forcing files can be found in the Raven manual.
Streamflow data processing
Units are converted to those required by Raven. Daily discharge originally in cfs is converted to m3/s.
Missing data are replaced with -1.2345, as Raven requires. Missing time steps are not counted in the performance metric calculations.
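The discharge conversion and missing-value convention above amount to the following (a sketch; the helper name is illustrative):

```python
import math

CFS_TO_CMS = 0.0283168   # 1 cubic foot per second in cubic metres per second
RAVEN_MISSING = -1.2345  # Raven's flag for missing observations

def usgs_cfs_to_raven_cms(q_cfs):
    """Convert a USGS daily discharge value from cfs to m3/s for the RVT file."""
    if q_cfs is None or (isinstance(q_cfs, float) and math.isnan(q_cfs)):
        return RAVEN_MISSING
    return q_cfs * CFS_TO_CMS
```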
The streamflow series is archived in an RVT (ASCII-based) file, which opens with eight comment lines specifying relevant gauge and streamflow data information, such as the gauge name, gauge ID, USGS-reported catchment area, calculated catchment area (based on the catchment shapefiles in the CAMELS dataset), streamflow data range, data time step, and missing data periods. The first line after the comment lines specifies the data type (default is HYDROGRAPH), subbasin ID (i.e., SubID), and discharge unit (m3/s), respectively. The next line specifies the start of the streamflow data, the time step (= 1 day), and the total number of time steps in the series (= 12784), respectively.
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of the reference KGE and KGE in the model calibration, validation, and testing periods, derived from the massive split-sample test experiment performed in the paper.
Columns in these metrics files are gauge ID, calibration sub-period (CSP) identifier, KGE in calibration, validation, testing1, testing2, and testing3, respectively.
We proposed 50 different CSPs in the experiment. The CSP identifier is a unique name for each CSP. For example, the identifier "CSP-3A_1990" means the model is built on Jan 1st, 1990, calibrated on the first 3-year sample (1981-1983), and validated on the remaining years of the 1980 to 1989 period. Note that 1980 is always used for spin-up.
We defined three testing periods (independent of the calibration and validation periods) for each CSP: the first 3 years from the model build year inclusive, the first 5 years from the model build year inclusive, and all years from the model build year inclusive. For example, "testing1", "testing2", and "testing3" for CSP-3A_1990 are 1990-1992, 1990-1994, and 1990-2014, respectively.
Reference flow is the interannual mean daily flow over a specific period; it is derived as a one-year series and then repeated for each year of the calculation period.
For calibration, its reference flow is based on spin-up + calibration periods.
For validation, its reference flow is based on spin-up + calibration periods.
For testing, its reference flow is based on spin-up + calibration + validation periods.
Reference KGE is calculated from the reference flow and the observed streamflow in a specific calculation period (e.g., calibration). It is computed using the KGE equation, substituting the reference flow for the simulated flow over the period in question. Note that although the reference KGEs for the three testing periods are built from the same historical period, they differ, because each testing period spans a different time window and therefore covers a different series of observed flow.
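The calculation described above can be sketched as follows, using the standard 2009 formulation of KGE and a day-of-year interannual mean as the reference flow; the function names and the pandas-based layout are illustrative, not the study's code.

```python
import numpy as np
import pandas as pd

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def reference_kge(obs_eval, obs_base):
    """Score the repeated interannual mean daily flow against the observations.

    obs_eval -- observed daily flow in the evaluation period (pandas Series, datetime index)
    obs_base -- observed daily flow in the base period used to build the reference flow
    """
    doy_mean = obs_base.groupby(obs_base.index.dayofyear).mean()
    reference = obs_eval.index.dayofyear.map(doy_mean).to_numpy(dtype=float)
    return kge(reference, obs_eval.to_numpy(dtype=float))
```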
More details of the split-sample test experiment and the analysis of the modeling results can be found in the paper by Shen et al. (2022).
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J. (2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
Original CAMELS dataset:
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., & Duan, Q. (2015). Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: Data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences, 19, 209-223. https://doi.org/10.5194/hess-19-209-2015
This dataset was created by Trisha Tomy