100+ datasets found
  1. RLCD-generated-preference-data-split

    • huggingface.co
    Updated Sep 13, 2023
    Cite
    Taylor (2023). RLCD-generated-preference-data-split [Dataset]. https://huggingface.co/datasets/TaylorAI/RLCD-generated-preference-data-split
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Taylor
    Description

    Dataset Card for "RLCD-generated-preference-data-split"

    More Information needed

  2. Data Split Dataset

    • universe.roboflow.com
    zip
    Updated Mar 12, 2025
    + more versions
    Cite
    Basketball Tracking (2025). Data Split Dataset [Dataset]. https://universe.roboflow.com/basketball-tracking-5halg/data-split-lyhid/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Basketball Tracking
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    Data Split

    ## Overview
    
    Data Split is a dataset for object detection tasks - it contains Objects annotations for 1,392 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
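The contrast the abstract draws between random and time-split selection can be sketched in a few lines. This is an illustrative sketch, not the paper's code; the function names and toy assay dates are invented.

```python
import numpy as np

def random_split(n, test_frac=0.2, seed=0):
    """Random hold-out: test compounds drawn uniformly at random
    (tends to overestimate prospective predictivity)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]  # train, test

def time_split(dates, test_frac=0.2):
    """Time-split hold-out: the most recently assayed compounds form the
    test set, mimicking true prospective prediction."""
    order = np.argsort(dates)                 # oldest -> newest
    n_test = int(len(dates) * test_frac)
    return order[:-n_test], order[-n_test:]   # train = older, test = newest

dates = np.array([2001, 2005, 2003, 2010, 2008, 2012])
train_idx, test_idx = time_split(dates, test_frac=0.34)
# test_idx picks the two most recent compounds (2010 and 2012)
```

The only difference between the two schemes is whether the hold-out indices come from a shuffled permutation or from the tail of the chronological ordering.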

  4. Codes for Analyzing the Effect of Data Splitting and Covariate Shift on...

    • purr.purdue.edu
    Updated Jan 23, 2023
    + more versions
    Cite
    Pin-ching Li; Sayan Dey; Venkatesh Merwade (2023). Codes for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins [Dataset]. http://doi.org/10.4231/B783-2C47
    Explore at:
    Dataset updated
    Jan 23, 2023
    Dataset provided by
    PURR
    Authors
    Pin-ching Li; Sayan Dey; Venkatesh Merwade
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This resource contains codes used in the study "Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins", published in Water Resources Research (doi: 10.1029/2023WR034464).

  5. Data from: Data Fission: Splitting a Single Data Point

    • tandf.figshare.com
    txt
    Updated Dec 14, 2023
    Cite
    James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas (2023). Data Fission: Splitting a Single Data Point [Dataset]. http://doi.org/10.6084/m9.figshare.24328745.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise—this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
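For the Gaussian case, the fission construction the abstract alludes to is short enough to sketch. This is our own illustration of the standard Gaussian construction, not code from the paper; the function and parameter names are invented.

```python
import numpy as np

def gaussian_fission(x, sigma, tau=1.0, seed=0):
    """Fission a Gaussian observation X ~ N(mu, sigma^2) into two pieces
    using external noise Z ~ N(0, sigma^2):
        f(X) = X + tau*Z    ~ N(mu, (1 + tau^2) * sigma^2)
        g(X) = X - Z/tau    ~ N(mu, (1 + 1/tau^2) * sigma^2)
    Cov(f, g) = Var(X) - Var(Z) = 0, so f and g are independent:
    f(X) can drive model selection while g(X) is reserved for inference."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=np.shape(x))
    return x + tau * z, x - z / tau

# Neither piece alone recovers X, but together they do:
#   X = (f + tau^2 * g) / (1 + tau^2)
```

The reconstruction identity holds exactly (the noise cancels algebraically), which is the sense in which "both together can recover X fully".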

  6. Split Data Patch Dataset

    • universe.roboflow.com
    zip
    Updated Oct 25, 2023
    Cite
    Universitas Islam Indonesia (2023). Split Data Patch Dataset [Dataset]. https://universe.roboflow.com/universitas-islam-indonesia-fgk9e/split-data-patch/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 25, 2023
    Dataset authored and provided by
    Universitas Islam Indonesia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Patch Bounding Boxes
    Description

    Split Data Patch

    ## Overview
    
    Split Data Patch is a dataset for object detection tasks - it contains Patch annotations for 636 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. fineweb-data-80-split-in-two-parts

    • huggingface.co
    Updated Apr 11, 2025
    Cite
    fineweb-data-80-split-in-two-parts [Dataset]. https://huggingface.co/datasets/liu-nlp/fineweb-data-80-split-in-two-parts
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset authored and provided by
    Linköping University NLP Group
    Description

    The UUIDs from the 80% split of https://huggingface.co/datasets/liu-nlp/fineweb-data-80-20-split-indices that were split into two parts due to resource limits.

    dataset_info:
      features:
        - name: id
          dtype: string
      splits:
        - name: train
          num_bytes: 3879427302
          num_examples: 76067202
        - name: test
          num_bytes: 3879427302
          num_examples: 76067202
      download_size: 5861889789
      dataset_size: 7758854604
      configs:
        - config_name: default
          data_files:
            - split: train…

    See the full description on the dataset page: https://huggingface.co/datasets/liu-nlp/fineweb-data-80-split-in-two-parts.

  8. X-ALMA-Parallel-Data-Split

    • huggingface.co
    Updated Jun 1, 2025
    + more versions
    Cite
    Yong-Joong Kim (2025). X-ALMA-Parallel-Data-Split [Dataset]. https://huggingface.co/datasets/yongjoongkim/X-ALMA-Parallel-Data-Split
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Yong-Joong Kim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The yongjoongkim/X-ALMA-Parallel-Data-Split dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  9. Split data

    • kaggle.com
    Updated Mar 14, 2023
    Cite
    Phanh Vũ (2023). Split data [Dataset]. https://www.kaggle.com/datasets/phanhv/split-data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Phanh Vũ
    Description

    Dataset

    This dataset was created by Phanh Vũ

    Contents

  10. Thermal Detection Split 3 Dataset

    • universe.roboflow.com
    zip
    Updated Feb 10, 2025
    Cite
    Eli MDT Data Splits (2025). Thermal Detection Split 3 Dataset [Dataset]. https://universe.roboflow.com/eli-mdt-data-splits/thermal-detection-split-3
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    Eli MDT Data Splits
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    People Bounding Boxes
    Description

    Thermal Detection Split 3

    ## Overview
    
    Thermal Detection Split 3 is a dataset for object detection tasks - it contains People annotations for 340 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. Dysfunctional Individuation and Splitting Deidentified Data File

    • curate.nd.edu
    bin
    Updated Aug 15, 2024
    Cite
    Daniel Lapsley (2024). Dysfunctional Individuation and Splitting Deidentified Data File [Dataset]. http://doi.org/10.7274/26739532.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Daniel Lapsley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data file contains records from research participants in an empirical investigation of dysfunctional individuation and its association with ego splitting and with differentiation of self. College adjustment problems were also measured. The Dysfunctional Individuation Scale, the Splitting Index, the Differentiation of Self battery, along with measures of college adjustment were assessed. The general aim of the project was to provide further evidence for the construct validity of dysfunctional individuation.

  12. Calibration/validation time-period selection in hydrologic models leads to...

    • data.mendeley.com
    Updated Nov 9, 2020
    + more versions
    Cite
    Daniel Myers (2020). Calibration/validation time-period selection in hydrologic models leads to uncertainty in water balance simulations [Dataset]. http://doi.org/10.17632/hrdycfbm4m.4
    Explore at:
    Dataset updated
    Nov 9, 2020
    Authors
    Daniel Myers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The temporal split-sample approach is the most common method to allocate observed data into calibration and validation groups for hydrologic model calibration. Often, calibration and validation data are split 50:50, where a hydrologic model is calibrated using the first half of the observed data and the second half is used for model validation. However, there is no standard strategy for how to split the data. This may result in different distributions in the observed hydrologic variable (e.g., wetter conditions in one half compared to the other) that could affect simulation results. We investigated this uncertainty by calibrating Soil and Water Assessment Tool hydrologic models with observed streamflow for three watersheds within the United States. We used six temporal data calibration/validation splitting strategies for each watershed (33:67, 50:50, and 67:33 with the calibration period occurring first, then the same three with the validation period occurring first). We found that the choice of split could have a large enough impact to alter conclusions about model performance. Through different calibrations of parameter sets, the choice of data splitting strategy also led to different simulations of streamflow, snowmelt, evapotranspiration, soil water storage, surface runoff, and groundwater flow. The impact of this research is an improved understanding of uncertainties caused by the temporal split-sample approach and the need to carefully consider calibration and validation periods for hydrologic modeling to minimize uncertainties during its use. The file "Research_Data_for_Myers_et_al.zip" includes the water balances and observed data from the study.
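The six splitting strategies described above amount to choosing a calibration fraction and an ordering. A minimal sketch (our own illustration under those assumptions, not the study's code):

```python
def temporal_split(series, cal_frac, cal_first=True):
    """Contiguous calibration/validation split of an observed record.
    cal_frac=0.5 gives a 50:50 split, 0.33 gives 33:67, 0.67 gives 67:33;
    cal_first=False places the validation period first instead."""
    n_cal = round(len(series) * cal_frac)
    if cal_first:
        return series[:n_cal], series[n_cal:]   # calibration, validation
    return series[-n_cal:], series[:-n_cal]     # calibration, validation

years = list(range(2000, 2010))                 # ten years of record
cal, val = temporal_split(years, 0.67, cal_first=False)
# calibration uses the last 7 years, validation the first 3
```

Because the split is contiguous in time, wetter or drier conditions can end up concentrated in one block, which is exactly the source of uncertainty the study investigates.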

  13. Data from: Split 3 Dataset

    • universe.roboflow.com
    zip
    Updated Jun 16, 2024
    Cite
    SPLIT 3 (2024). Split 3 Dataset [Dataset]. https://universe.roboflow.com/split-3/split-3/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 16, 2024
    Dataset authored and provided by
    SPLIT 3
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    SPLIT3 Bounding Boxes
    Description

    SPLIT 3

    ## Overview
    
    SPLIT 3 is a dataset for object detection tasks - it contains SPLIT3 annotations for 7,306 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. Performances of models to predict unseen analysis windows [cross validation...

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Cite
    Catherine Ollagnier; Claudia Kasper; Anna Wallenbeck; Linda Keeling; Giuseppe Bee; Siavash A. Bigdeli (2023). Performances of models to predict unseen analysis windows [cross validation (CV) approach] of the Swedish, Swiss and Swedish+Swiss testing data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0252002.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Catherine Ollagnier; Claudia Kasper; Anna Wallenbeck; Linda Keeling; Giuseppe Bee; Siavash A. Bigdeli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performances of models to predict unseen analysis windows [cross validation (CV) approach] of the Swedish, Swiss and Swedish+Swiss testing data sets.

  15. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
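A splitting script like the splitting.py listed above typically shuffles once and carves three disjoint partitions. A hypothetical sketch (the actual script is part of the Zenodo record; the function name, column, and sizes here are invented):

```python
import numpy as np
import pandas as pd

def split_dataset(df, n_test, n_val, seed=42):
    """Shuffle row positions once, then carve off disjoint test and
    validation blocks; the remainder becomes the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    test = df.iloc[idx[:n_test]]
    val = df.iloc[idx[n_test:n_test + n_val]]
    train = df.iloc[idx[n_test + n_val:]]
    return train, test, val

docs = pd.DataFrame({"text": [f"doc {i}" for i in range(100)]})
train, test, val = split_dataset(docs, n_test=10, n_val=10)
# 80 / 10 / 10 rows, with no document appearing in two partitions
```

Shuffling once and slicing the same permutation guarantees the three partitions are disjoint and jointly cover the dataset.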
  16. Shear-wave splitting databases

    • ds.iris.edu
    Updated May 12, 2025
    Cite
    Data Help (2025). Shear-wave splitting databases [Dataset]. https://ds.iris.edu/ds/products/sws-dbs/
    Explore at:
    Dataset updated
    May 12, 2025
    Authors
    Data Help
    Description

    The shear-wave splitting (SWS) databases data product provides the geosciences community with easy access to two published databases.

    The SWS files are also available from the EarthScope Data Archive (https://data.earthscope.org/archive/seismology/products/swsdb/README.html).

    A query for the SplitLab shear-wave splitting database is available on SPUD (https://ds.iris.edu/spud/swsmeasurement).

  17. Splitting

    • opencontext.org
    Updated Oct 1, 2022
    + more versions
    Cite
    Arek Marciniak (2022). Splitting [Dataset]. https://opencontext.org/predicates/3db0ac19-9c9b-4792-7e43-f59dd287195e
    Explore at:
    Dataset updated
    Oct 1, 2022
    Dataset provided by
    Open Context
    Authors
    Arek Marciniak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An Open Context "predicates" dataset item. Open Context publishes structured data as granular, URL identified Web resources. This "Variables" record is part of the "Çatalhöyük Area TP Zooarchaeology" data publication.

  18. Dysfunctional Individuation, Splitting and Differentiation of Self Data File...

    • curate.nd.edu
    bin
    Updated Jul 17, 2024
    Cite
    Daniel Lapsley (2024). Dysfunctional Individuation, Splitting and Differentiation of Self Data File [Dataset]. http://doi.org/10.7274/26312326.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Daniel Lapsley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This SPSS file was utilized to examine the ongoing construct validation of the Dysfunctional Individuation Scale (DIS). In addition to the DIS, the file includes assessments of Differentiation of Self, Splitting and measures of college adjustment.

  19. Time to Update the Split-Sample Approach in Hydrological Model Calibration...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 31, 2022
    Cite
    Hongren Shen (2022). Time to Update the Split-Sample Approach in Hydrological Model Calibration v1.1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5915373
    Explore at:
    Dataset updated
    May 31, 2022
    Dataset provided by
    Bryan A. Tolson
    Juliane Mai
    Hongren Shen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Time to Update the Split-Sample Approach in Hydrological Model Calibration

    Hongren Shen¹, Bryan A. Tolson¹, Juliane Mai¹

    ¹Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada

    Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)

    Abstract

    Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes are established regarding the data availability, length and data recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways including how model building decisions are framed as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.

    Version updates

    v1.1 Updated on May 19, 2022. We added hydrographs for each catchment.

    The v1.1 attachment is split into eight zipped parts; download all eight parts and unzip them together.

    In this update, we added two zipped files in each gauge subfolder:

    (1) GR4J_Hydrographs.zip, and
    (2) HMETS_Hydrographs.zip

    Each of the zip files contains 50 CSV files. These CSV files are named with keywords of model name, gauge ID, and the calibration sub-period (CSP) identifier.

    Each hydrograph CSV file contains four key columns:

    (1) Date time (note that the hour column is less significant, since this is daily data);
    (2) Precipitation in mm, the aggregated basin mean precipitation;
    (3) Simulated streamflow in m3/s, in a column named "subXXX", where XXX is the catchment ID specified in the CAMELS_463_gauge_info.txt file; and
    (4) Observed streamflow in m3/s, in a column named "subXXX(observed)".

    Note that these hydrograph CSV files report period-ending, time-averaged flows. They were produced directly by the Raven hydrological modeling framework. More information about the format of the hydrograph CSV files can be found on the Raven webpage.

    v1.0 First version published on Jan 29, 2022.

    Data description

    This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).

    Catchment, meteorological forcing and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the Raven hydrological modeling required format. The GR4J and HMETS model building results in the paper, i.e., reference KGE and KGE metrics in calibration, validation and testing periods, are provided for replication of the split-sample assessment performed in the paper.

    Data content

    The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information of each catchment, and 463 subfolders, each having four files for a catchment, including:

    (1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave radiation in MJ/m2/day, and day length in days) from Jan 1st, 1980 to Dec 31st, 2014 in the format required by Raven.
    (2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st, 1980 to Dec 31st, 2014 in the format required by Raven.
    (3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics for the calibration, validation and testing periods.
    (4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics for the calibration, validation and testing periods.

    Data collection and processing methods

      Data source
    

    Catchment information and the Daymet meteorological forcing are retrieved from the CAMELS data set, which can be found here.

    The USGS streamflow data are collected from the U.S. Geological Survey's (USGS) National Water Information System (NWIS), which can be found here.

    The GR4J and HMETS performance metrics (i.e., reference KGE and KGE) are produced in the study by Shen et al. (2022).

      Forcing data processing
    

    A quality assessment procedure was performed. For example, the daily maximum air temperature should be larger than the daily minimum air temperature; otherwise, the two values were swapped.

    Units are converted to Raven-required ones. Precipitation: mm/day, unchanged; daily minimum/maximum air temperature: deg_C, unchanged; shortwave: W/m2 to MJ/m2/day; day length: seconds to days.

    Data for a catchment is archived in an RVT (ASCII-based) file, in which the second line specifies the start time of the forcing series, the time step (= 1 day), and the total time steps in the series (= 12784), respectively; the third and the fourth lines specify the forcing variables and their corresponding units, respectively.

    More details of Raven formatted forcing files can be found in the Raven manual (here).

      Streamflow data processing
    

    Units are converted to Raven-required ones. Daily discharge originally in cfs is converted to m3/s.

    Missing data are replaced with -1.2345 as Raven requires. Those missing time steps will not be counted in performance metrics calculation.

    The streamflow series is archived in an RVT (ASCII-based) file, which opens with eight comment lines specifying relevant gauge and streamflow data information, such as gauge name, gauge ID, USGS-reported catchment area, calculated catchment area (based on the catchment shapefiles in the CAMELS dataset), streamflow data range, data time step, and missing data periods. The first line after the comment lines specifies the data type (default is HYDROGRAPH), subbasin ID (i.e., SubID), and discharge unit (m3/s), respectively. The next line specifies the start of the streamflow data, the time step (= 1 day), and the total number of time steps in the series (= 12784), respectively.

    GR4J and HMETS metrics

    The GR4J and HMETS metrics files consist of reference KGE and KGE values for the model calibration, validation, and testing periods, derived in the massive split-sample test experiment performed in the paper.

    Columns in these metrics files are gauge ID, calibration sub-period (CSP) identifier, KGE in calibration, validation, testing1, testing2, and testing3, respectively.

    We proposed 50 different CSPs in the experiment. The "CSP identifier" is a unique name for each CSP; e.g., the identifier "CSP-3A_1990" means the model is built on Jan 1st, 1990, calibrated on the first 3-year sample (1981-1983), and validated on the remaining years of the 1980-1989 period. Note that 1980 is always used for spin-up.

    We defined three testing periods (independent of the calibration and validation periods) for each CSP: the first 3 years from the model build year inclusive, the first 5 years from the model build year inclusive, and all years from the model build year inclusive. e.g., "testing1", "testing2", and "testing3" for CSP-3A_1990 are 1990-1992, 1990-1994, and 1990-2014, respectively.

    Reference flow is the interannual mean daily flow based on a specific period, which is derived for a one-year period and then repeated in each year in the calculation period.

    For calibration, its reference flow is based on spin-up + calibration periods.

    For validation, its reference flow is based on spin-up + calibration periods.

    For testing, its reference flow is based on spin-up + calibration + validation periods.

    Reference KGE is calculated from the reference flow and the observed streamflow over a specific calculation period (e.g., calibration). It is computed with the KGE equation, substituting the reference flow for the simulated flow over that period. Note that the reference KGEs for the three testing periods correspond to the same historical record but differ, because each testing period spans a different time period and covers a different series of observed flow.
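The KGE values described here follow the standard Kling-Gupta formula; reference KGE evaluates the same formula with the reference flow in place of the simulation. A sketch, with variable names of our choosing:

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009):
    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2),
    where r is the Pearson correlation, alpha the ratio of standard
    deviations, and beta the ratio of means (sim vs. obs)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Reference KGE: evaluate kge(reference_flow, obs), where reference_flow is
# the interannual mean daily flow tiled over the calculation period.
```

A perfect simulation gives KGE = 1; a model is usually considered skillful over a period only when its KGE beats the reference KGE for that same period.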

    More details of the split-sample test experiment and the modeling results analysis can be found in the paper by Shen et al. (2022).

    Citation

    Journal Publication

    This study:

    Shen, H., Tolson, B. A., & Mai, J.(2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523

    Original CAMELS dataset:

    A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). Development of a large-sample

  20. Preprocessed Data (Split Train) AII2022 Human Pose

    • kaggle.com
    Updated Jul 30, 2023
    Cite
    Trisha Tomy (2023). Preprocessed Data (Split Train) AII2022 Human Pose [Dataset]. https://www.kaggle.com/datasets/trishatomy/preprocessed-data-split-train-aii2022-human-pose/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 30, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Trisha Tomy
    Description

    Dataset

    This dataset was created by Trisha Tomy

    Contents
