100+ datasets found
  1. RLCD-generated-preference-data-split

    • huggingface.co
    Updated Sep 13, 2023
    Cite
    Taylor (2023). RLCD-generated-preference-data-split [Dataset]. https://huggingface.co/datasets/TaylorAI/RLCD-generated-preference-data-split
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Taylor
    Description

    Dataset Card for "RLCD-generated-preference-data-split"

    More Information needed

  2. All Data Splitting

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    Maryam Khan Afridi 2024 (2024). All Data Splitting [Dataset]. https://www.kaggle.com/datasets/maryamkhanafridi2024/all-data-splitting
    Explore at:
    Available download formats: zip (698479608 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    Maryam Khan Afridi 2024
    Description

    Dataset

    This dataset was created by Maryam Khan Afridi 2024

    Contents

  3. tae-data-split-paragraphs

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Nicky (2025). tae-data-split-paragraphs [Dataset]. https://huggingface.co/datasets/nickypro/tae-data-split-paragraphs
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Nicky
    Description

    Split Paragraphs Dataset

    Split paragraphs data with configs 000-099.

  4. Data Split Dataset

    • universe.roboflow.com
    zip
    Updated Sep 2, 2022
    Cite
    yolov5 (2022). Data Split Dataset [Dataset]. https://universe.roboflow.com/yolov5-vgpfy/data-split-atsuf/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 2, 2022
    Dataset authored and provided by
    yolov5
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    1
    Description

    Data Split

    ## Overview
    
    Data Split is a dataset for classification tasks - it contains 1 annotations for 639 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Data from: Projection Test for Mean Vector in High Dimensions

    • tandf.figshare.com
    zip
    Updated Feb 5, 2024
    Cite
    Wanjun Liu; Xiufan Yu; Wei Zhong; Runze Li (2024). Projection Test for Mean Vector in High Dimensions [Dataset]. http://doi.org/10.6084/m9.figshare.21505061.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Wanjun Liu; Xiufan Yu; Wei Zhong; Runze Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article studies the projection test for high-dimensional mean vectors via optimal projection. The idea of projection test is to project high-dimensional data onto a space of low dimension such that traditional methods can be applied. We first propose a new estimation for the optimal projection direction by solving a constrained and regularized quadratic programming. Then two tests are constructed using the estimated optimal projection direction. The first one is based on a data-splitting procedure, which achieves an exact t-test under normality assumption. To mitigate the power loss due to data-splitting, we further propose an online framework, which iteratively updates the estimation of projection direction when new observations arrive. We show that this online-style projection test asymptotically converges to the standard normal distribution. Various simulation studies as well as a real data example show that the proposed online-style projection test retains the Type I error rate well and is more powerful than other existing tests. Supplementary materials for this article are available online.
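    The data-splitting idea in the abstract above can be sketched as follows. This is only a minimal illustration, not the article's method: the function name `split_projection_t_stat` is hypothetical, and the projection direction is assumed to be the plain sample mean of the first half, a simple stand-in for the regularized optimal direction the article estimates by quadratic programming. Because the direction is computed from one half and the test from the other, the projected values are independent of the direction, which is what makes a one-sample t-test applicable under normality.

```python
import math
import random
import statistics


def split_projection_t_stat(rows, seed=0):
    """Data-splitting projection statistic for H0: mean vector = 0.

    ASSUMPTION: the direction is the sample mean of the first half,
    a stand-in for the article's regularized optimal direction.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    first, second = rows[:half], rows[half:]
    dim = len(rows[0])
    # Estimate the projection direction from the first half only.
    direction = [statistics.fmean(r[j] for r in first) for j in range(dim)]
    # Project the held-out half onto that direction: one scalar per row.
    projected = [sum(w * x for w, x in zip(direction, r)) for r in second]
    n = len(projected)
    t_stat = math.sqrt(n) * statistics.fmean(projected) / statistics.stdev(projected)
    return t_stat, n - 1  # statistic and its degrees of freedom


# Synthetic example: 40 observations in 50 dimensions with a nonzero mean.
rng = random.Random(1)
data = [[rng.gauss(0.5, 1.0) for _ in range(50)] for _ in range(40)]
t_stat, dof = split_projection_t_stat(data)
```

    The statistic would be compared against a t distribution with `dof` degrees of freedom. Only half of the sample enters the test itself, which is the power loss the abstract says the online framework is meant to mitigate.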

  6. split data set

    • kaggle.com
    Updated Jan 17, 2025
    Cite
    Ali Gold Medalist (2025). split data set [Dataset]. https://www.kaggle.com/datasets/salman2024/split-data-set
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Gold Medalist
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ali Gold Medalist

    Released under Apache 2.0

    Contents

  7. Image Enhancement Google Earth Data Splitting

    • kaggle.com
    zip
    Updated Dec 30, 2024
    Cite
    Dicka taksa (2024). Image Enhancement Google Earth Data Splitting [Dataset]. https://www.kaggle.com/datasets/dickataksa/image-enhancement-google-earth-data-splitting
    Explore at:
    Available download formats: zip (584061668 bytes)
    Dataset updated
    Dec 30, 2024
    Authors
    Dicka taksa
    Description

    Dataset

    This dataset was created by Dicka taksa

    Contents

  8. cleaned-data-split-0

    • huggingface.co
    Updated Mar 18, 2019
    Cite
    Indonesia AI (2019). cleaned-data-split-0 [Dataset]. https://huggingface.co/datasets/IndonesiaAI/cleaned-data-split-0
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 18, 2019
    Dataset authored and provided by
    Indonesia AI
    Description

    Dataset Card for "cleaned-data-split-0"

    More Information needed

  9. Materials Project Time Split Data

    • figshare.com
    json
    Updated May 30, 2023
    Cite
    Sterling G. Baird; Taylor Sparks (2023). Materials Project Time Split Data [Dataset]. http://doi.org/10.6084/m9.figshare.19991516.v4
    Explore at:
    Available download formats: json
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sterling G. Baird; Taylor Sparks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Full and dummy snapshots (2022-06-04) of data for mp-time-split, encoded via matminer convenience functions and retrieved via the new Materials Project API. The dataset is restricted to experimentally verified compounds with no more than 52 sites. No other filtering criteria were applied. The snapshots were developed for sparks-baird/mp-time-split as a benchmark dataset for materials generative modeling. Compressed versions of the files (.gz) are also available.

    dtypes (Python):

        from pprint import pprint
        from matminer.utils.io import load_dataframe_from_json
        filepath = "insert/path/to/file/here.json"
        expt_df = load_dataframe_from_json(filepath)
        pprint(expt_df.iloc[0].apply(type).to_dict())
        {'discovery': , 'energy_above_hull': , 'formation_energy_per_atom': , 'material_id': , 'references': , 'structure': , 'theoretical': , 'year': }

    index/mpids (just the number for the index). Note that material_id values that begin with "mvc-" have the "mvc" dropped, and the hyphen (minus sign) is kept to distinguish between "mp-" and "mvc-" types while still allowing for sorting. E.g. mvc-001 -> -1.

    {146: MPID(mp-146), 925: MPID(mp-925), 1282: MPID(mp-1282), 1335: MPID(mp-1335), 12778: MPID(mp-12778), 2540: MPID(mp-2540), 316: MPID(mp-316), 1395: MPID(mp-1395), 2678: MPID(mp-2678), 1281: MPID(mp-1281), 1251: MPID(mp-1251)}
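    The index convention described above (mp-146 becomes 146, mvc-001 becomes -1) can be illustrated with a short sketch. The helper name `mpid_to_index` is hypothetical and is not part of mp-time-split or matminer; it only restates the rule given in the description.

```python
def mpid_to_index(material_id: str) -> int:
    """Map 'mp-146' -> 146 and 'mvc-001' -> -1 (the hyphen acts as a minus sign)."""
    prefix, _, number = material_id.partition("-")
    if prefix == "mp":
        return int(number)
    if prefix == "mvc":
        # "mvc" is dropped; the kept hyphen makes the index negative.
        return -int(number)
    raise ValueError(f"unexpected material_id prefix: {material_id!r}")


ids = ["mp-146", "mvc-001", "mp-925"]
indices = sorted(mpid_to_index(i) for i in ids)  # mvc entries sort before mp entries
```

    Keeping the sign means a single integer column stays sortable while the two id families remain distinguishable.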

  10. Data for Analyzing the Effect of Data Splitting and Covariate Shift on...

    • purr.purdue.edu
    Updated Jan 23, 2023
    + more versions
    Cite
    Pin-ching Li; Sayan Dey; Venkatesh Merwade (2023). Data for Analyzing the Effect of Data Splitting and Covariate Shift on Machine Leaning Based Streamflow Prediction in Ungauged Basins [Dataset]. http://doi.org/10.4231/0PG5-KC30
    Explore at:
    Dataset updated
    Jan 23, 2023
    Dataset provided by
    PURR
    Authors
    Pin-ching Li; Sayan Dey; Venkatesh Merwade
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This resource contains the data used in the study "Analyzing the Effect of Data Splitting and Covariate Shift on Machine Learning Based Streamflow Prediction in Ungauged Basins" published in Water Resources Research (doi: 10.1029/2023WR034464).

  11. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
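    The nested cross-validation layout mentioned in the abstract above can be sketched with index bookkeeping alone. This is only an illustration under stated assumptions, not the authors' code: the function names, fold counts, and seeds are hypothetical. The point it shows is that feature selection and parameter tuning see only the outer training indices, so the outer test fold never leaks into model selection.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle the indices and deal them into k disjoint folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    Feature selection and tuning should use inner_splits only;
    outer_test is used once per outer fold, for the final score.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for v, inner_val in enumerate(inner_folds):
            inner_train = [j for f, fold in enumerate(inner_folds) if f != v for j in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(n_samples=20))
```

    Running feature selection on the pooled data before splitting would break this separation, which is the source of bias the abstract singles out.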

  12. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Explore at:
    Dataset updated
    Aug 8, 2022
    Authors
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The python code for splitting a dataset into train, test and validation set.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The python code for translating the cleaned dataset.

  13. frugal-maths-data-split-v1

    • huggingface.co
    Updated Nov 5, 2025
    + more versions
    Cite
    MBZUAI-IFM Paris Lab (2025). frugal-maths-data-split-v1 [Dataset]. https://huggingface.co/datasets/MBZUAI-Paris/frugal-maths-data-split-v1
    Explore at:
    Dataset updated
    Nov 5, 2025
    Dataset authored and provided by
    MBZUAI-IFM Paris Lab
    Description

    FrugalMath Dataset: Easy Samples as Length Regularizers in Math RLVR

    Paper: Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
    Base Model: Qwen/Qwen3-4B-Thinking-2507
    Authors: Abdelaziz Bounhar et al.
    License: Apache 2.0

    Overview

    The FrugalMath dataset was designed to study implicit length regularization in Reinforcement Learning with Verifiable Rewards (RLVR). Unlike standard pipelines that discard easy problems, this dataset… See the full description on the dataset page: https://huggingface.co/datasets/MBZUAI-Paris/frugal-maths-data-split-v1.

  14. Data split for each class of each dataset for training and test.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 6, 2024
    Cite
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui (2024). Data split for each class of each dataset for training and test. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001424294
    Explore at:
    Dataset updated
    Nov 6, 2024
    Authors
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui
    Description

    Data split for each class of each dataset for training and test.

  15. X-ALMA-Parallel-Data-Split

    • huggingface.co
    Updated Jun 1, 2025
    + more versions
    Cite
    Yong-Joong Kim (2025). X-ALMA-Parallel-Data-Split [Dataset]. https://huggingface.co/datasets/yongjoongkim/X-ALMA-Parallel-Data-Split
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Yong-Joong Kim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    yongjoongkim/X-ALMA-Parallel-Data-Split dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Citation Trends for "Rank-transformed subsampling: inference for multiple...

    • shibatadb.com
    Updated Sep 18, 2024
    Cite
    Yubetsu (2024). Citation Trends for "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values" [Dataset]. https://www.shibatadb.com/article/QFpMKWaN
    Explore at:
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2024 - 2025
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values".

  17. Data Split

    • kaggle.com
    zip
    Updated Dec 20, 2023
    + more versions
    Cite
    DanielJamesdj08 (2023). Data Split [Dataset]. https://www.kaggle.com/datasets/danieljamesdj08/data-split
    Explore at:
    Available download formats: zip (7553 bytes)
    Dataset updated
    Dec 20, 2023
    Authors
    DanielJamesdj08
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by DanielJamesdj08

    Released under MIT

    Contents

  18. Historical Stock Splits API by Finnworlds

    • datarade.ai
    Updated Oct 27, 2022
    Cite
    Finnworlds (2022). Historical Stock Splits API by Finnworlds [Dataset]. https://datarade.ai/data-products/historical-stock-splits-api-by-finnworlds-finnworlds
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    Finnworlds
    Area covered
    Bosnia and Herzegovina, Finland, Ukraine, Denmark, Malta, United States of America, Canada, Montenegro, Belgium, Italy
    Description

    Historical Stock Splits API provides financial data users with rapid access to historical stock split data. Executive boards of public companies often aim for a stock split when circumstances are favourable. A stock split leads to an increased number of shares sold at lower prices. In this way, prospective investors or company shareholders can purchase more shares at attractive prices. If you need historical stock split data for your financial project, try out the Finnworlds Historical Stock Splits API. To learn more, visit the website: https://finnworlds.com/historical-stock-splits-api/

  19. Data from: Split 3 Dataset

    • universe.roboflow.com
    zip
    Updated Jun 16, 2024
    Cite
    SPLIT 3 (2024). Split 3 Dataset [Dataset]. https://universe.roboflow.com/split-3/split-3/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 16, 2024
    Dataset authored and provided by
    SPLIT 3
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    SPLIT3 Bounding Boxes
    Description

    SPLIT 3

    ## Overview
    
    SPLIT 3 is a dataset for object detection tasks - it contains SPLIT3 annotations for 7,306 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. Data from: Split Phase Inverter Data

    • catalog.data.gov
    • data.openei.org
    • +3more
    Updated Jan 20, 2025
    Cite
    National Renewable Energy Laboratory (2025). Split Phase Inverter Data [Dataset]. https://catalog.data.gov/dataset/split-phase-inverter-data-b286c
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    The increase in power-electronics-based generation sources requires accurate modeling of inverters. Accurate modeling requires experimental data over a wider operating range. We used an 8.35 kW off-the-shelf grid-following split-phase PV inverter in the experiments, with a controllable AC supply and a controllable DC supply to emulate AC-side and DC-side characteristics. The experiments were performed at NREL's Energy Systems Integration Facility. The inverter was tested under 100%, 75%, 50%, and 25% load conditions. In the first dataset, for each operating condition, the controllable AC source voltage is varied from 0.9 to 1.1 per unit (p.u.) with a step of 0.025 p.u. while keeping the frequency at 60 Hz. In the second dataset, under similar load conditions (100%, 75%, 50%, 25%), the frequency of the controllable AC source voltage is varied from 59 Hz to 61 Hz with a step of 0.2 Hz. The voltage and frequency ranges were chosen based on inverter protection. Voltages and currents on the DC and AC sides are included in the dataset.
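    The two test sweeps described above can be enumerated with a short sketch. It assumes that both endpoints of each range are included, which the description does not state explicitly, and the names `sweep`, `voltage_runs`, and `frequency_runs` are hypothetical rather than taken from the dataset.

```python
LOAD_LEVELS_PCT = [100, 75, 50, 25]


def sweep(start, stop, step):
    """Inclusive sweep built from an integer step count to avoid float drift."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 3) for i in range(n_steps + 1)]


# First dataset: AC voltage from 0.9 to 1.1 p.u. in 0.025 p.u. steps at 60 Hz.
voltage_pu = sweep(0.9, 1.1, 0.025)
voltage_runs = [(load, v, 60.0) for load in LOAD_LEVELS_PCT for v in voltage_pu]

# Second dataset: AC frequency from 59 Hz to 61 Hz in 0.2 Hz steps.
frequency_hz = sweep(59.0, 61.0, 0.2)
frequency_runs = [(load, f) for load in LOAD_LEVELS_PCT for f in frequency_hz]
```

    Under the inclusive-endpoint assumption this gives 9 voltage points and 11 frequency points per load level, that is, 36 and 44 operating points across the four load conditions.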
