16 datasets found
  1. RANZR Clip-600x600 Stratified k fold TFrecords

    • kaggle.com
    Updated Feb 28, 2021
    Cite
    Deepak Bhat (2021). RANZR Clip-600x600 Stratified k fold TFrecords [Dataset]. https://www.kaggle.com/datasets/deepakbhatp/ranzr-clip600x600-stratified-k-fold-tfrecords/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Deepak Bhat
    Description

    Dataset

    This dataset was created by Deepak Bhat

    Contents

  2. DataSheet2_SKCV: Stratified K-fold cross-validation on ML classifiers for...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Sashikanta Prusty; Srikanta Patnaik; Sujit Kumar Dash (2023). DataSheet2_SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer.docx [Dataset]. http://doi.org/10.3389/fnano.2022.972421.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Sashikanta Prusty; Srikanta Patnaik; Sujit Kumar Dash
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cancer is the unregulated development of abnormal cells in the human body. Cervical cancer, also known as cervix cancer, develops on the surface of the cervix, where an overabundance of cells builds up and eventually forms a lump or tumour. Early detection is therefore essential for choosing an effective treatment. Machine Learning (ML) techniques can help by predicting cervical cancer before it becomes too serious. Four common diagnostic tests, namely Hinselmann, Schiller, Cytology, and Biopsy, have been compared and predicted with four common ML models: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (K-NN), and Extreme Gradient Boosting (XGB). To enhance the performance of the ML models, the Stratified k-fold cross-validation (SKCV) method has been applied. The findings of the experiments demonstrate that an RF classifier could be a good alternative for assisting clinical specialists in classifying cervical cancer risk in advance.
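The stratified k-fold idea mentioned in this description can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `stratified_k_fold` and its parameters are hypothetical, and labels are assumed to be sortable values such as integers or strings. Each class is shuffled and dealt round-robin across the folds, so every test fold keeps roughly the class proportions of the full label list.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with class-balanced test folds.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class round-robin, continuing where the previous class
        # stopped so that overall fold sizes stay balanced.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


if __name__ == "__main__":
    toy_labels = [0] * 10 + [1] * 5
    for train_idx, test_idx in stratified_k_fold(toy_labels, k=5):
        print(len(train_idx), [toy_labels[i] for i in test_idx])
```

In practice a maintained implementation such as scikit-learn's `StratifiedKFold` is usually preferable; the sketch only shows the mechanism.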

  3. Iterative-stratification

    • kaggle.com
    Updated Nov 15, 2020
    Cite
    Raj Gandhi (2020). Iterative-stratification [Dataset]. https://www.kaggle.com/datasets/rajgandhi/iterativestratification
    Explore at:
    Croissant
    Dataset updated
    Nov 15, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Raj Gandhi
    Description

    Dataset

    This dataset was created by Raj Gandhi

    Contents

  4. Poisonous Mushroom Stratified Kfold (5)

    • kaggle.com
    Updated Sep 3, 2024
    Cite
    Gagandeep Singh Bajwa (2024). Poisonous Mushroom Stratified Kfold (5) [Dataset]. https://www.kaggle.com/gaganbajwaa/poisonous-mushroom-stratified-kfold-5/discussion
    Explore at:
    Croissant
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Gagandeep Singh Bajwa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Gagan Bajwa

    Released under Apache 2.0

    Contents

  5. PP2021 - KFold TFRecords

    • kaggle.com
    zip
    Updated Mar 24, 2021
    Cite
    Nick Kuzmenkov (2021). PP2021 - KFold TFRecords [Dataset]. https://www.kaggle.com/nickuzmenkov/pp2021-kfold-tfrecords-0
    Explore at:
    Available download formats: zip (1883845807 bytes)
    Dataset updated
    Mar 24, 2021
    Authors
    Nick Kuzmenkov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Nick Kuzmenkov

    Released under CC0: Public Domain

    Contents

    It contains the following files:

  6. Random Forest classification results for the whole dataset with stratified...

    • plos.figshare.com
    txt
    Updated Aug 30, 2023
    Cite
    Salvador Chulián; Bernadette J. Stolz; Álvaro Martínez-Rubio; Cristina Blázquez Goñi; Juan F. Rodríguez Gutiérrez; Teresa Caballero Velázquez; Águeda Molinos Quintana; Manuel Ramírez Orellana; Ana Castillo Robleda; José Luis Fuster Soler; Alfredo Minguela Puras; María V. Martínez Sánchez; María Rosa; Víctor M. Pérez-García; Helen M. Byrne (2023). Random Forest classification results for the whole dataset with stratified k-fold and oversampling. [Dataset]. http://doi.org/10.1371/journal.pcbi.1011329.s003
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Salvador Chulián; Bernadette J. Stolz; Álvaro Martínez-Rubio; Cristina Blázquez Goñi; Juan F. Rodríguez Gutiérrez; Teresa Caballero Velázquez; Águeda Molinos Quintana; Manuel Ramírez Orellana; Ana Castillo Robleda; José Luis Fuster Soler; Alfredo Minguela Puras; María V. Martínez Sánchez; María Rosa; Víctor M. Pérez-García; Helen M. Byrne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Random Forest classification results for the whole dataset with stratified k-fold and oversampling.

  7. PP2021 - Augmented KFold TFRecords (1/4)

    • kaggle.com
    Updated May 8, 2021
    Cite
    Nick Kuzmenkov (2021). PP2021 - Augmented KFold TFRecords (1/4) [Dataset]. https://www.kaggle.com/datasets/nickuzmenkov/pp2021-kfold-tfrecords/discussion
    Explore at:
    Croissant
    Dataset updated
    May 8, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Nick Kuzmenkov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:

    • labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
    • images were scaled to 600x600
    • 77 duplicate images having different labels were removed (see the context in this notebook)
    • samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
    • images were heavily augmented with albumentations library (for raw images see this dataset)
    • each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)
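The label binarization step described above could look roughly like the following. This is a sketch, not the dataset author's code: it assumes the original labels column holds space-separated class names in a single string, and the helper name `binarize_labels` is hypothetical. The six column names are the ones listed in the description.

```python
# Column names as listed in the dataset description.
LABEL_COLUMNS = [
    "complex",
    "frog_eye_leaf_spot",
    "healthy",
    "powdery_mildew",
    "rust",
    "scab",
]


def binarize_labels(label_string, columns=LABEL_COLUMNS):
    """Turn one label string into a dict of 0/1 indicator columns.

    Assumption: class names in `label_string` are separated by whitespace.
    """
    present = set(label_string.split())
    unknown = present - set(columns)
    if unknown:
        raise ValueError(f"unexpected label(s): {sorted(unknown)}")
    return {column: int(column in present) for column in columns}


if __name__ == "__main__":
    print(binarize_labels("scab frog_eye_leaf_spot"))
```

Each returned dict maps directly onto the six integer label features stored alongside every image.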

    I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.

    For a complete example see my TPU Training Notebook

    Contents:

    • preprocessed DataFrame train.csv
    • fold indexes DataFrame folds.csv
    • fold_0:fold_4 folders containing 64 .tfrec files, respectively, with feature map shown below:

        feature_map = {
            'image': tf.io.FixedLenFeature([], tf.string),
            'name': tf.io.FixedLenFeature([], tf.string),
            'complex': tf.io.FixedLenFeature([], tf.int64),
            'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
            'healthy': tf.io.FixedLenFeature([], tf.int64),
            'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
            'rust': tf.io.FixedLenFeature([], tf.int64),
            'scab': tf.io.FixedLenFeature([], tf.int64)}

    Acknowledgements

    • photo from Unsplash here
  8. Pairwise Multi-Class Document Classification for Semantic Relations between...

    • live.european-language-grid.eu
    csv
    Updated Apr 15, 2024
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 15, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    • Datasets (articles, class labels, cross-validation splits)
    • Pretrained models (Transformers, GloVe, Doc2vec)
    • Model output (prediction) for the best performing models

    This package consists of the Dataset part.

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as XML dump, and the corresponding articles were extracted as plain-text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.

    The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.

    ├── 1
    │  ├── test.csv
    │  └── train.csv
    ├── 2
    │  ├── test.csv
    │  └── train.csv
    ├── 3
    │  ├── test.csv
    │  └── train.csv
    └── 4
       ├── test.csv
       └── train.csv

    4 directories, 8 files
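Given the directory layout above (folders 1 to 4, each holding train.csv and test.csv), the four splits could be iterated as sketched below. The helper names `read_rows` and `iter_folds`, the `root` argument, and the assumption that each CSV starts with a header row are illustrative guesses, not details confirmed by the dataset description.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of dicts (assumes a header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def iter_folds(root, k=4):
    """Yield (fold_number, train_rows, test_rows) for folders 1..k under root.

    `root` is wherever train_testdata_4folds.tar.gz was extracted.
    """
    for fold in range(1, k + 1):
        fold_dir = Path(root) / str(fold)
        yield (
            fold,
            read_rows(fold_dir / "train.csv"),
            read_rows(fold_dir / "test.csv"),
        )


if __name__ == "__main__":
    # Hypothetical extraction directory; adjust to the local path.
    for number, train_rows, test_rows in iter_folds("train_testdata_4folds"):
        print(number, len(train_rows), len(test_rows))
```

Keeping the published splits fixed, rather than re-splitting locally, is what makes results comparable with those reported for the dataset.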

  9. Supplementary Table 5: Unsupervised learning of cross-modal mappings in...

    • tandf.figshare.com
    pdf
    Updated May 15, 2024
    Cite
    Jianmin Xu; Binghua Xu; Yipeng Li; Zhijian Su (2024). Supplementary Table 5: Unsupervised learning of cross-modal mappings in multi-omics data for survival stratification of gastric cancer [Dataset]. http://doi.org/10.25402/FON.17113550.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 15, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Jianmin Xu; Binghua Xu; Yipeng Li; Zhijian Su
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Supplementary Table 5. Unsupervised learning of cross-modal mappings in multi-omics data for survival stratification of gastric cancer. Significant pathways for up-regulated and down-regulated genes.

    Abstract

    Purpose: This study presents a survival-stratification model based on multi-omics integration using BiDNNs in GC.
    Methods: Based on the survival-related representation features yielded by BiDNNs through integrating transcriptomics and epigenomics data, K-means clustering analysis was performed to cluster tumor samples into different survival subgroups. The BiDNNs-based model was validated using 10-fold cross-validation and in two independent confirmation cohorts.
    Results: Using the BiDNNs-based survival stratification model, patients were grouped into two survival subgroups with log-rank P value=9.05E-05. The subgroups classification was robustly validated in 10-fold cross-validation (C-index=0.65±0.02) and in two confirmation cohorts (E-GEOD-26253, C-index=0.609; E-GEOD-62254, C-index=0.706).
    Conclusion: We propose and validate a robust and stable BiDNNs-based survival stratification model in GC.

  10. Iterative-stratification-kfold

    • kaggle.com
    Updated Nov 28, 2020
    Cite
    AAKRITI ADHIKARI (2020). Iterative-stratification-kfold [Dataset]. https://www.kaggle.com/datasets/aadhika3/iterativestratificationkfold
    Explore at:
    Croissant
    Dataset updated
    Nov 28, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    AAKRITI ADHIKARI
    Description

    Dataset

    This dataset was created by AAKRITI ADHIKARI

    Contents

  11. Daily Gridded North American Snowfall

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +2 more
    netcdf
    Updated Aug 4, 2024
    Cite
    Chan, Weihan; Henderson, Gina R.; Kluver, Daria; Leathers, Daniel; Mote, Tom; Robinson, David A. (2024). Daily Gridded North American Snowfall [Dataset]. http://doi.org/10.5065/5BJC-W635
    Explore at:
    Available download formats: netcdf
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory
    Authors
    Chan, Weihan; Henderson, Gina R.; Kluver, Daria; Leathers, Daniel; Mote, Tom; Robinson, David A.
    Time period covered
    Jan 1, 1900 - Dec 31, 2009
    Description

    This dataset archives daily gridded North American snowfall data, with a focus on the quality of the interpolated product. Daily snowfall amounts from National Weather Service Cooperative Observer Program stations and Meteorological Service of Canada surface stations are interpolated to 1 degree by 1 degree grids and examined for data record length and quality. The interpolation is validated spatially and temporally through the use of stratified sampling and k-fold cross-validation analyses. Interpolation errors average around 0.5 cm and range from less than 0.01 to greater than 2.5 cm. For most locations, this is within the measurement sensitivity. Grid cells with large variations in elevation experience higher errors and should be used with caution.

  12. COTS-YOLO-StratifiedKFold

    • kaggle.com
    Updated Jan 12, 2022
    Cite
    Yuzuki (2022). COTS-YOLO-StratifiedKFold [Dataset]. https://www.kaggle.com/datasets/myintzu/cotsyolostratifiedkfold/code
    Explore at:
    Croissant
    Dataset updated
    Jan 12, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Yuzuki
    Description

    Dataset

    This dataset was created by Yuzuki

    Contents

  13. Additional file 2 of Machine learning pipeline for blood culture outcome...

    • springernature.figshare.com
    • researchdata.edu.au
    application/csv
    Updated Aug 13, 2024
    Cite
    Benjamin R. McFadden; Timothy J. J. Inglis; Mark Reynolds (2024). Additional file 2 of Machine learning pipeline for blood culture outcome prediction using Sysmex XN-2000 blood sample results in Western Australia [Dataset]. http://doi.org/10.6084/m9.figshare.26612528.v1
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    figshare
    Authors
    Benjamin R. McFadden; Timothy J. J. Inglis; Mark Reynolds
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Western Australia, Australia
    Description

    Additional file 2. Contains results for all models evaluated during the model training and stratified 10-fold cross validation stage.

  14. AUC metric for the corresponding groups and classifiers based on the...

    • plos.figshare.com
    xls
    Updated Oct 19, 2023
    Cite
    Ivan Koychev; Evgeniy Marinov; Simon Young; Sophia Lazarova; Denitsa Grigorova; Dean Palejev (2023). AUC metric for the corresponding groups and classifiers based on the original 7 features, the selected 10 optimal minimal subset of features and all 13 selected significant features. [Dataset]. http://doi.org/10.1371/journal.pone.0288039.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ivan Koychev; Evgeniy Marinov; Simon Young; Sophia Lazarova; Denitsa Grigorova; Dean Palejev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AUC metric for the corresponding groups and classifiers based on the original 7 features, the selected 10 optimal minimal subset of features and all 13 selected significant features.

  15. EMNIST StratifiedKFold_models

    • kaggle.com
    Updated Oct 22, 2023
    Cite
    Олексій Чорний (2023). EMNIST StratifiedKFold_models [Dataset]. https://www.kaggle.com/datasets/oleksiichornyi/emnist-stratifiedkfold-models/suggestions
    Explore at:
    Croissant
    Dataset updated
    Oct 22, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Олексій Чорний
    Description

    Dataset

    This dataset was created by Олексій Чорний

    Contents

  16. Performance of TS predictors created by MuLT and SMLA on 10-fold CV...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Lucas Venezian Povoa; Carlos Henrique Costa Ribeiro; Israel Tojal da Silva (2023). Performance of TS predictors created by MuLT and SMLA on 10-fold CV experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0254596.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lucas Venezian Povoa; Carlos Henrique Costa Ribeiro; Israel Tojal da Silva
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sensitivity is the rate of correct prediction for patients identified as sensitive to first-line treatment, and specificity is the rate of correct prediction for patients identified as non-sensitive to first-line treatment.
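The two metrics defined in this description can be computed from true and predicted labels as sketched below. The function name `sensitivity_specificity` and the label encoding (1 for treatment-sensitive, 0 for non-sensitive) are assumptions made for illustration; the sketch is not taken from the cited study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN): share of truly sensitive patients
                  that were predicted as sensitive.
    specificity = TN / (TN + FP): share of truly non-sensitive patients
                  that were predicted as non-sensitive.
    """
    tp = fn = tn = fp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1

    # Return NaN when a class is absent, since the rate is undefined.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0]
    guess = [1, 1, 0, 0, 1]
    print(sensitivity_specificity(truth, guess))
```

In a 10-fold setting such as the one named in the title, these values would typically be computed per fold and then averaged.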
