100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, we explored the contribution to bias of data dimensionality, hyper-parameter space and the number of CV folds, and compared validation methods on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies depending on which validation method was used.
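
    As a rough illustration of the abstract's recommendation, the sketch below contrasts a non-nested K-fold estimate (hyper-parameters tuned and scored on the same folds) with a nested CV estimate on a small, high-dimensional synthetic problem. The data, model, and parameter grid are placeholders, not the authors' setup.

      # Sketch only: nested CV keeps tuning inside the training folds,
      # avoiding the optimistic bias described above.
      from sklearn.datasets import make_classification
      from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=60, n_features=500, random_state=0)
      grid = {"C": [0.1, 1, 10]}
      inner = KFold(n_splits=5, shuffle=True, random_state=0)
      outer = KFold(n_splits=5, shuffle=True, random_state=1)

      # Non-nested: the score comes from the same folds used to pick C.
      non_nested = GridSearchCV(SVC(), grid, cv=inner).fit(X, y).best_score_

      # Nested: the outer folds never see the data used for tuning.
      nested = cross_val_score(GridSearchCV(SVC(), grid, cv=inner), X, y, cv=outer)

      print(f"non-nested: {non_nested:.2f}, nested: {nested.mean():.2f}")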

  2. Results of applying optimized machine learning approach for multi-tasks...

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Cite
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari (2023). Results of applying optimized machine learning approach for multi-tasks classification. [Dataset]. http://doi.org/10.1371/journal.pone.0270757.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of applying optimized machine learning approach for multi-tasks classification.

  3. Recommended test method data set

    • data.gov.tw
    csv, json, xml
    Cite
    Food and Drug Administration, Recommended test method data set [Dataset]. https://data.gov.tw/en/datasets/11513
    Explore at:
    Available download formats: json, csv, xml
    Dataset authored and provided by
    Food and Drug Administration
    License

    https://data.gov.tw/license

    Description

    This dataset provides suggested test method data to be used by academic institutions, laboratories, businesses, and the public.

  4. Complete Blood Count (CBC) Dataset

    • kaggle.com
    zip
    Updated May 16, 2025
    + more versions
    Cite
    Orvile (2025). Complete Blood Count (CBC) Dataset [Dataset]. https://www.kaggle.com/datasets/orvile/complete-blood-count-cbc-dataset/versions/1
    Explore at:
    Available download formats: zip (9067150 bytes)
    Dataset updated
    May 16, 2025
    Authors
    Orvile
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Complete Blood Count (CBC) dataset contains 360 blood smear images along with their annotation files, split into training, testing, and validation sets. The training folder contains 300 images with annotations; the testing and validation folders each contain 60 images with annotations. We modified the original dataset to prepare this CBC dataset: some of the image annotation files listed far fewer red blood cells (RBCs) than are actually present, and one annotation file included no RBCs at all even though the smear image contains them. We therefore removed all the erroneous files and split the dataset into three parts. Of the 360 smear images, 300 blood cell images with annotations are used as the training set, and the remaining 60 images with annotations are used as the testing set. Owing to the shortage of data, a subset of the training set is used to prepare the validation set, which contains 60 images with annotations.

  5. Data from: Cross-validation in association mapping and its relevance for the...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +2 more
    zip
    Updated Nov 5, 2013
    Cite
    Tobias Würschum; Thomas Kraft (2013). Cross-validation in association mapping and its relevance for the estimation of QTL parameters of complex traits [Dataset]. http://doi.org/10.5061/dryad.db521
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 5, 2013
    Dataset provided by
    University of Hohenheim
    Syngenta Seeds AB, Landskrona, Sweden
    Authors
    Tobias Würschum; Thomas Kraft
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Association mapping has become a widely applied genomic approach to identify quantitative trait loci (QTL) and dissect the genetic architecture of complex traits. However, approaches to assess the quality of the obtained QTL results are lacking. We therefore evaluated the potential of cross-validation in association mapping based on a large sugar beet data set. Our results show that the proportion of the population that should be used as estimation and validation sets, respectively, depends on the size of the mapping population. Generally, a fivefold cross-validation, that is, 20% of the lines as independent validation set, appears appropriate for commonly used population sizes. The predictive power for the proportion of genotypic variance explained by QTL was overestimated by on average 38% indicating a strong bias in the estimated QTL effects. The cross-validated predictive power ranged between 4 and 50%, which are more realistic estimates of this parameter for complex traits. In addition, QTL frequency distributions can be used to assess the precision of QTL position estimates and the robustness of the detected QTL. In summary, cross-validation can be a valuable tool to assess the quality of QTL parameters in association mapping.
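
    For intuition, a minimal sketch of the fivefold scheme: fit on 80% of the lines (estimation set), evaluate on the held-out 20% (validation set), and compare the in-sample variance explained with the cross-validated predictive power. The marker matrix, trait values, and linear model are stand-ins, not the study's data or QTL model.

      # Sketch: fivefold CV contrasting in-sample and cross-validated R2.
      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import r2_score
      from sklearn.model_selection import KFold

      rng = np.random.default_rng(0)
      X = rng.integers(0, 3, size=(300, 40)).astype(float)  # stand-in marker matrix
      y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=2.0, size=300)

      in_sample, cross_validated = [], []
      for est_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
          model = LinearRegression().fit(X[est_idx], y[est_idx])  # estimation set (80%)
          in_sample.append(r2_score(y[est_idx], model.predict(X[est_idx])))
          cross_validated.append(r2_score(y[val_idx], model.predict(X[val_idx])))

      print(f"in-sample R2: {np.mean(in_sample):.2f}, "
            f"cross-validated R2: {np.mean(cross_validated):.2f}")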

  6. Data from: Robust Validation: Confident Predictions Even When Distributions...

    • tandf.figshare.com
    bin
    Updated Dec 26, 2023
    Cite
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
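
    For orientation, the sketch below shows plain split-conformal prediction intervals, the non-robust building block that the paper extends to f-divergence balls; the data, model, and coverage level are illustrative.

      # Sketch: split-conformal prediction intervals (assumes exchangeability).
      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 5))
      y = X @ rng.normal(size=5) + rng.normal(size=1000)

      X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
      model = Ridge().fit(X_fit, y_fit)

      alpha = 0.1                                    # target 90% coverage
      scores = np.abs(y_cal - model.predict(X_cal))  # conformity scores on calibration set
      q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

      x_new = rng.normal(size=(1, 5))
      pred = model.predict(x_new)[0]
      print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")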

  7. Complete Blood Count (CBC)

    • kaggle.com
    zip
    Updated Aug 1, 2024
    Cite
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    Available download formats: zip (12859168 bytes)
    Dataset updated
    Aug 1, 2024
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include:

    • Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights.
    • Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders.
    • Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance (a short example follows this list).
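
    A minimal sketch of those metrics with scikit-learn; the label vectors below are placeholders for a model's real outputs.

      # Sketch: the four suggested evaluation metrics on dummy predictions.
      from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

      y_true = [0, 1, 1, 0, 1, 0, 1, 1]  # placeholder ground-truth labels
      y_pred = [0, 1, 0, 0, 1, 1, 1, 1]  # placeholder model predictions

      print("accuracy :", accuracy_score(y_true, y_pred))
      print("precision:", precision_score(y_true, y_pred))
      print("recall   :", recall_score(y_true, y_pred))
      print("F1-score :", f1_score(y_true, y_pred))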

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.

  8. Results of determination of hydroxyzine hydrochloride, ephedrine...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Oct 7, 2024
    Cite
    Ali, Nourudin W.; Magdy, Maimana A.; Abdelkawy, Mohamed; Algethami, Faisal K.; AlSalem, Huda Salem; Zaazaa, Hala E.; Abdelrahman, Maha M.; Gamal, Mohammed (2024). Results of determination of hydroxyzine hydrochloride, ephedrine hydrochloride and theophylline in laboratory prepared mixtures in the validation set using the proposed multivariate method. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001284147
    Explore at:
    Dataset updated
    Oct 7, 2024
    Authors
    Ali, Nourudin W.; Magdy, Maimana A.; Abdelkawy, Mohamed; Algethami, Faisal K.; AlSalem, Huda Salem; Zaazaa, Hala E.; Abdelrahman, Maha M.; Gamal, Mohammed
    Description

    Results of determination of hydroxyzine hydrochloride, ephedrine hydrochloride and theophylline in laboratory prepared mixtures in the validation set using the proposed multivariate method.

  9. Performances of models to predict unseen pens [leave one out...

    • datasetcatalog.nlm.nih.gov
    Updated Jan 5, 2023
    Cite
    Bee, Giuseppe; Kasper, Claudia; Bigdeli, Siavash A.; Ollagnier, Catherine; Keeling, Linda; Wallenbeck, Anna (2023). Performances of models to predict unseen pens [leave one out cross-validation (LOOP) approach] of the Swedish, Swiss and Swedish+Swiss data sets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000962102
    Explore at:
    Dataset updated
    Jan 5, 2023
    Authors
    Bee, Giuseppe; Kasper, Claudia; Bigdeli, Siavash A.; Ollagnier, Catherine; Keeling, Linda; Wallenbeck, Anna
    Description

    Performances of models to predict unseen pens [leave one out cross-validation (LOOP) approach] of the Swedish, Swiss and Swedish+Swiss data sets.
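
    A minimal sketch of the leave-one-pen-out idea using scikit-learn's LeaveOneGroupOut; the features, outcomes, and pen identities are synthetic stand-ins for the study's data.

      # Sketch: every CV fold holds out all observations from one pen.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 10))        # behavioural features per observation
      y = rng.integers(0, 2, size=200)      # outcome to predict
      pens = rng.integers(0, 20, size=200)  # pen identity of each observation

      scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                               groups=pens, cv=LeaveOneGroupOut())
      print(f"mean accuracy on unseen pens: {scores.mean():.2f}")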

  10. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
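
    A hedged sketch of time-split selection: order the compounds by a stand-in assay date, train on the oldest 80%, and test on the newest 20%. The data and model are synthetic placeholders, not the paper's QSAR setup.

      # Sketch: time-split hold-out instead of random selection.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import r2_score

      rng = np.random.default_rng(0)
      n = 500
      dates = np.sort(rng.integers(0, 1000, size=n))  # stand-in assay dates (sorted)
      X = rng.normal(size=(n, 20))                    # stand-in descriptors
      y = X[:, 0] + dates / 1000 + rng.normal(scale=0.5, size=n)  # mild temporal drift

      cutoff = int(n * 0.8)  # oldest 80% train, newest 20% test
      model = RandomForestRegressor(random_state=0).fit(X[:cutoff], y[:cutoff])
      print("time-split R2:", round(r2_score(y[cutoff:], model.predict(X[cutoff:])), 3))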

  11. ZEW Data Purchasing Challenge 2022

    • kaggle.com
    zip
    Updated Feb 8, 2022
    Cite
    Manish Tripathi (2022). ZEW Data Purchasing Challenge 2022 [Dataset]. https://www.kaggle.com/datasets/manishtripathi86/zew-data-purchasing-challenge-2022
    Explore at:
    Available download formats: zip (1162256319 bytes)
    Dataset updated
    Feb 8, 2022
    Authors
    Manish Tripathi
    Description

    Dataset Source: https://www.aicrowd.com/challenges/data-purchasing-challenge-2022

    🕵️ Introduction

    Data for machine learning tasks usually does not come for free but has to be purchased. The costs and benefits of data have to be weighed against each other. This is challenging. First, data usually has combinatorial value. For instance, different observations might complement or substitute each other for a given machine learning task. In such cases, the decision to purchase one group of observations has to be made conditional on the decision to purchase another group of observations. If these relationships are high-dimensional, finding the optimal bundle becomes computationally hard. Second, data comes at different quality levels, for instance, with different levels of noise. Third, data has to be acquired under the assumption of being valuable out-of-sample. Distribution shifts have to be anticipated.

    In this competition, you face these data purchasing challenges in the context of a multi-label image classification task in a quality control setting.

    📑 Problem Statement

    In short: You have to classify images. Some images in your training set are labelled but most of them aren't. How do you decide which images to label if you have a limited budget to do so?

    In more detail: You face a multi-label image classification task. The dataset consists of synthetically generated images of painted metal sheets. A classifier is meant to predict whether the sheets have production damages and, if so, which ones. You have access to a set of images, a subset of which are labelled with respect to production damages. Because labeling is costly and your budget is limited, you have to decide for which of the unlabelled images labels should be purchased in order to maximize prediction accuracy.

    Each of the images has a 4-dimensional label representing the presence or absence of ['scratch_small', 'scratch_large', 'dent_small', 'dent_large'] in the image.

    You are required to submit code, which can be run in three different phases:

    Pre-Training Phase

    In the Pre-Training Phase, your code will have access to 5,000 labelled images for a multi-label image classification task with 4 classes. It is up to you how you wish to use this data; for instance, you might want to pre-train a classification model.

    Purchase Phase

    In the Purchase Phase, your code, after going through the Pre-Training Phase, will have access to an unlabelled dataset of 10,000 images. You will have a budget of 3,000 label purchases, which you can freely use across any of the images in the unlabelled dataset to obtain their labels. You are tasked with designing your own approach for selecting the optimal subset of 3,000 images in the unlabelled dataset to optimize your model's performance on the prediction task. You can then continue training your model (pre-trained in the Pre-Training Phase) using the newly purchased labels.

    Prediction Phase

    In the Prediction Phase, your code will have access to a test set of 3,000 unlabelled images, for which you have to generate and submit predictions. Your submission will be evaluated based on the performance of your predictions on this test set. Your code will have access to a node with 4 CPUs, 16 GB RAM, 1 NVIDIA T4 GPU and 3 hours of runtime per submission. In the final round of this challenge, your code will be evaluated across multiple budget-runtime constraints.
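
    One plausible purchase-phase heuristic (not the official starter kit) is uncertainty sampling: spend the 3,000-label budget on the images whose predicted per-label probabilities sit closest to 0.5. The probability array below is a stand-in for a pre-trained model's outputs, and the call shape of the provided purchase_label function is assumed.

      # Sketch: pick the 3,000 most uncertain images to purchase labels for.
      import numpy as np

      BUDGET = 3000
      rng = np.random.default_rng(0)
      probs = rng.uniform(size=(10_000, 4))  # stand-in per-label probabilities

      # Uncertainty is 1 at p = 0.5 and 0 at p = 0 or 1, averaged over the 4 labels.
      uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5).mean(axis=1)
      to_purchase = np.argsort(uncertainty)[-BUDGET:]

      # labels = [purchase_label(idx) for idx in to_purchase]  # assumed call shape
      print(to_purchase[:10])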

    💾 Dataset

    The datasets for this challenge can be accessed in the Resources Section.

    • training.tar.gz: The training set containing 5,000 images with their associated labels. During your local experiments you are allowed to use the data as you please.
    • unlabelled.tar.gz: The unlabelled set containing 10,000 images and their associated labels. During your local experiments you are only allowed to access the labels through the provided purchase_label function.
    • validation.tar.gz: The validation set containing 3,000 images and their associated labels. During your local experiments you are only allowed to use the labels of the validation set to measure the performance of your models and experiments.
    • debug.tar.gz: A small set of 100 images with their associated labels, which you can use for integration testing and for trying out the provided starter kit.

    NOTE: While you run your local experiments on this dataset, your submissions will be evaluated on a dataset which might be sampled from a different distribution and is not the same as this publicly released version.

    👥 Participation

    🖊 Evaluation Criteria

    The challenge will use the Accuracy Score, Hamming Loss and the Exact Match Ratio during evaluation. The primary score will be the Accuracy Score.

    📅 Timeline

    This challenge has two Rounds.

    Round 1 : Feb 4th – Feb 28th, 2022

    The first round submissions will be evaluated based on one budget-compute constraint pair (max. of 3,00...

  12. Templates Recommendation in the Open Research Knowledge Graph - Vdataset -...

    • service.tib.eu
    Updated Jun 3, 2022
    + more versions
    Cite
    (2022). Templates Recommendation in the Open Research Knowledge Graph - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-templates-recommendation-in-the-open-research-knowledge-graph
    Explore at:
    Dataset updated
    Jun 3, 2022
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG templates semantically relevant to the given paper. Two approaches have been trained on this dataset in the context of this master's thesis (https://doi.org/10.15488/11834): a Natural Language Inference (NLI) approach based on SciBERT embeddings and an unsupervised approach based on ElasticSearch. This publication therefore consists of one general dataset, two training sets (one for each approach), a validation set for the supervised approach, and a test set for both approaches.

  13. Data from: A machine learning approach to triaging patients with chronic...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 22, 2017
    Cite
    Gerber, Anthony N.; Qirko, Klajdi; Swaminathan, Sumanth; Bazaz, Gaurav; Corcoran, Ethan; Kappel, George; Wysham, Nicholas G.; Smith, Ted (2017). A machine learning approach to triaging patients with chronic obstructive pulmonary disease [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001753196
    Explore at:
    Dataset updated
    Nov 22, 2017
    Authors
    Gerber, Anthony N.; Qirko, Klajdi; Swaminathan, Sumanth; Bazaz, Gaurav; Corcoran, Ethan; Kappel, George; Wysham, Nicholas G.; Smith, Ted
    Description

    COPD patients are burdened with a daily risk of acute exacerbation and loss of control, which could be mitigated by effective, on-demand decision support tools. In this study, we present a machine learning-based strategy for early detection of exacerbations and subsequent triage. Our application uses physician opinion on a statistically and clinically comprehensive set of patient cases to train a supervised prediction algorithm. The accuracy of the model is assessed against a panel of physicians, each triaging identical cases in a representative patient validation set. Our results show that the algorithm's accuracy and safety indicators surpass all individual pulmonologists in both identifying exacerbations and predicting the consensus triage in a 101-case validation set. The algorithm is also the top performer in sensitivity, specificity, and positive predictive value (PPV) when predicting a patient's need for emergency care.

  14. Data for: A new method for predicting formation lithology while drilling at...

    • data.mendeley.com
    • narcis.nl
    Updated Oct 6, 2020
    Cite
    Jian Sun (2020). Data for: A new method for predicting formation lithology while drilling at horizontal well bit [Dataset]. http://doi.org/10.17632/t8fvs9fjvw.1
    Explore at:
    Dataset updated
    Oct 6, 2020
    Authors
    Jian Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training set data, validation set data and test set data.

  15. AUC for the combined engines on training set and test set with the ambiguous...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Feb 21, 2013
    Cite
    Sönnerborg, Anders; Lengauer, Thomas; Zazzi, Maurizio; Peres, Yardena; Struck, Daniel; Kaiser, Rolf; Neuvirth, Hani; Büch, Joachim; Schülter, Eugen; Incardona, Francesca; Altmann, André; Prosperi, Mattia; Rosen-Zvi, Michal; Aharoni, Ehud (2013). AUC for the combined engines on training set and test set with the ambiguous cases removed from test set and training set or test set only. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001631417
    Explore at:
    Dataset updated
    Feb 21, 2013
    Authors
    Sönnerborg, Anders; Lengauer, Thomas; Zazzi, Maurizio; Peres, Yardena; Struck, Daniel; Kaiser, Rolf; Neuvirth, Hani; Büch, Joachim; Schülter, Eugen; Incardona, Francesca; Altmann, André; Prosperi, Mattia; Rosen-Zvi, Michal; Aharoni, Ehud
    Description

    The table displays the results, measured in AUC, on training set (10-fold cross validation; standard deviation in brackets) and test set for a selection of combination approaches when trained on the (un)cleaned training set. For computation of the AUC the ambiguous cases were always removed.
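
    A sketch of the measurement being reported (10-fold cross-validated AUC on the training set plus a held-out test-set AUC); the data and classifier are placeholders, not the study's engines.

      # Sketch: 10-fold CV AUC on the training set, then AUC on a test set.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import cross_val_score, train_test_split

      X, y = make_classification(n_samples=500, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

      clf = LogisticRegression(max_iter=1000)
      cv_auc = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="roc_auc")
      test_auc = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
      print(f"CV AUC {cv_auc.mean():.2f} (sd {cv_auc.std():.2f}), test AUC {test_auc:.2f}")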

  16. Templates Recommendation in the Open Research Knowledge Graph

    • data.uni-hannover.de
    json
    Updated Jun 3, 2022
    Cite
    TIB (2022). Templates Recommendation in the Open Research Knowledge Graph [Dataset]. https://data.uni-hannover.de/dataset/templates-recommendation-in-the-open-research-knowledge-graph
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 3, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG templates semantically relevant to the given paper.

    Two approaches have been trained on this dataset in the context of this master's thesis (https://doi.org/10.15488/11834): a Natural Language Inference (NLI) approach based on SciBERT embeddings and an unsupervised approach based on ElasticSearch.

    This publication therefore consists of one general dataset, two training sets (one for each approach), a validation set for the supervised approach, and a test set for both approaches.
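
    As a hedged sketch of the embedding side of such a recommender (not the thesis code): encode a paper's title and abstract with SciBERT and rank template descriptions by cosine similarity of mean-pooled embeddings. The input texts are placeholders.

      # Sketch: SciBERT mean-pooled embeddings + cosine-similarity ranking.
      import torch
      from transformers import AutoModel, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
      model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

      def embed(text: str) -> torch.Tensor:
          inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
          with torch.no_grad():
              hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
          return hidden.mean(dim=1).squeeze(0)            # mean-pooled embedding

      paper = embed("Paper title. Paper abstract ...")     # placeholder input
      templates = {"template_A": embed("description of template A"),
                   "template_B": embed("description of template B")}
      ranked = sorted(templates,
                      key=lambda t: -torch.cosine_similarity(paper, templates[t], dim=0).item())
      print(ranked)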

  17. Dataset for: Probably Pleasant? A Neural-Probabilistic Approach to Automatic...

    • researchdata.ntu.edu.sg
    bin
    Updated Jun 9, 2022
    Cite
    Kenneth Ooi; Kenneth Ooi; Karn N. Watcharasupat; Karn N. Watcharasupat; Bhan Lam; Bhan Lam; Zhen-Ting Ong; Zhen-Ting Ong; Woon-Seng Gan; Woon-Seng Gan (2022). Dataset for: Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation [Dataset]. http://doi.org/10.21979/N9/YSJQKD
    Explore at:
    Available download formats: bin(512), bin(830914688), bin(20288), bin(15827072)
    Dataset updated
    Jun 9, 2022
    Dataset provided by
    DR-NTU (Data)
    Authors
    Kenneth Ooi; Kenneth Ooi; Karn N. Watcharasupat; Karn N. Watcharasupat; Bhan Lam; Bhan Lam; Zhen-Ting Ong; Zhen-Ting Ong; Woon-Seng Gan; Woon-Seng Gan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Dataset funded by
    National Research Foundation (NRF)
    Ministry of National Development (MND)
    Description

    This dataset contains the log-mel spectrograms for the augmented soundscapes described in our ICASSP 2022 submission "Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation", in .npy format. The data can be accessed using the numpy package of Python, using the command numpy.load. The dataset is available as a 5-fold cross-validation dataset, with the log-mel spectrograms for each fold having filenames of the format fold_#_features.npy and the subjective ratings for the augmented soundscapes having filenames of the format fold_#_labels.npy, where # is the number of the fold in the set {1,2,3,4,5}. The independent test set has fold index 0.

    Generation of augmented soundscapes: Each augmented soundscape was created by adding 30-second excerpts of recordings of sounds known as maskers to binaural recordings of urban soundscapes (element-wise addition in the time domain). Each masker recording has only one class ("construction", "traffic", "water", or "wind") active for the entire duration of the recording, whereas each binaural recording of an urban soundscape may have multiple sound sources active at any point in the recording, including sound sources outside of the four masker classes.

    Cross-validation set: The masker samples were obtained from Freesound by searching the names of the masker classes (i.e. "construction", "traffic", "water", and "wind") and randomly picking a selection of tracks containing 30-second sections of sound corresponding only to that particular masker class. The soundscape samples were obtained from the Urban Soundscapes of the World (USotW) dataset and consisted of all binaural recordings available in the public dataset, minus those with audible electrical noise, measured in-situ LA,eq values below 52 dB, or measured in-situ LA,eq values above 77 dB, in order to reflect only accurately captured real-life soundscapes, ensure that reproduction levels were significantly above the noise floor of the location with the highest noise floor (~36 dB) where the subjective responses were obtained, and ensure safe listening levels for our participants. In total, 120 of the 127 publicly available recordings in the USotW dataset were used for the cross-validation set.

    Test set: The masker samples were obtained from Freesound in the same manner as for the cross-validation set, but ensuring that no recordings overlapped between the test-set and cross-validation-set maskers. The soundscape samples were taken from binaural recordings of locations in Singapore (which was not represented in any of the soundscapes in the USotW dataset and hence the cross-validation set). They were recorded under the similar Soundscape Indices Protocol and were taken in urban contexts similar to the USotW dataset. Specifically, they were from a road facing a construction site, a gazebo in a park, a walkway facing a lake, a walkway facing a crowded canteen, a path facing a lake, and a path facing a lake with an aircraft flying overhead.

    Participant information: The participants of the listening test were a sample of people who were able to physically come to our laboratory (in Nanyang Technological University, Singapore) to listen to the stimuli and provide their responses. Their mean age was 28.4 ± 11.8 years, and there were a total of 151 female and 149 male participants. All participants were tested to have normal hearing (mean hearing threshold <20 dB for participants below 30 years of age, and <30 dB for participants equal to or above 30 years of age, at 0.5, 1, 2, 4, and 6 kHz).
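
    Following the naming scheme above, a single fold can be loaded with numpy.load (fold index 0 is the independent test set):

      # Load the features and labels for one cross-validation fold.
      import numpy as np

      fold = 1  # folds 1-5 form the cross-validation set; fold 0 is the test set
      features = np.load(f"fold_{fold}_features.npy")  # log-mel spectrograms
      labels = np.load(f"fold_{fold}_labels.npy")      # subjective ratings
      print(features.shape, labels.shape)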

  18. Data from: A Symbolic Regression Model for the Prediction of Drug Binding to...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Mar 31, 2023
    Cite
    Morrison, Denise; Wegner, Joerg Kurt; Van Den Bergh, An; Van Rompaey, Dries (2023). A Symbolic Regression Model for the Prediction of Drug Binding to Human Liver Microsomes [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001092910
    Explore at:
    Dataset updated
    Mar 31, 2023
    Authors
    Morrison, Denise; Wegner, Joerg Kurt; Van Den Bergh, An; Van Rompaey, Dries
    Description

    It is common practice in the early drug discovery process to conduct in vitro screening experiments using liver microsomes in order to obtain an initial assessment of test compound metabolic stability. Compounds which bind to liver microsomes are unavailable for interaction with the drug metabolizing enzymes. As such, assessment of the unbound fraction of compound available for biotransformation is an important factor for interpretation of in vitro experimental results and to improve prediction of the in vivo metabolic clearance. Various in silico methods have been proposed for the prediction of test compound binding to microsomes, from various simple lipophilicity-based models with moderate performance to sophisticated machine learning models which demonstrate superior performance at the cost of increased complexity and higher data requirements. In this work, we attempt to strike a middle ground by developing easily implementable equations with improved predictive performance. We employ a symbolic regression approach based on a medium-size in-house data set of fraction unbound in human liver microsomes measurements allowing the identification of novel equations with improved performance. We validate the model performance on an in-house held-out test set and an external validation set.

  19. Iris Dataset - Logistic Regression

    • kaggle.com
    zip
    Updated Mar 8, 2019
    Cite
    Tanya Ganesan (2019). Iris Dataset - Logistic Regression [Dataset]. https://www.kaggle.com/tanyaganesan/iris-dataset-logistic-regression
    Explore at:
    Available download formats: zip (996 bytes)
    Dataset updated
    Mar 8, 2019
    Authors
    Tanya Ganesan
    Description

    Visualization of Iris Species Dataset:

    https://i.imgur.com/XqkskaX.png

    • The data has four features.
    • Each subplot considers two features.
    • From the figure it can be observed that the data points for the species Iris-setosa are clustered together, while those of the other two species overlap.

    Classification using Logistic Regression:

    • There are 50 samples for each species. The data for each species is split into three sets: training, validation and test.
      • The training data is prepared separately for the three species. For instance, if the species is Iris-setosa, the corresponding outputs are set to 1, and for the other two species they are set to 0.
      • The three training data sets are modeled separately, yielding three sets of model parameters (theta). A sigmoid function is used to predict the output.
      • Gradient descent is used to converge on theta using a cost function.

    https://i.imgur.com/USfd26D.png https://i.imgur.com/AAxz3Ma.png https://i.imgur.com/kLNQPu1.png

    Choosing best model:

    • Polynomial features are included to train the model better. Including more polynomial features fits the training set better, but may not give good results on the validation set. The cost on the training data decreases as more polynomial features are included.
      • So, to find the best fit, the training data set is first used to find the model parameters, which are then evaluated on the validation set. Whichever model gives the least cost on the validation set is chosen as the better fit to the data (see the sketch at the end of this section).
      • A regularization term is included to keep overfitting in check as more polynomial features are added.

    Observations: For Iris-setosa, the inclusion of polynomial features did not do well on the cross-validation set. For Iris-versicolor, it seems more polynomial features need to be included to be conclusive; however, polynomial features up to the third degree were already in use, so the idea of adding more features was dropped.

    https://i.imgur.com/RT0rsHU.png https://i.imgur.com/wsOFfi0.png https://i.imgur.com/tQkla35.png

    https://i.imgur.com/GzPuAsT.png https://i.imgur.com/CBnjTki.png https://i.imgur.com/tF103lm.png
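
    A compact sketch of the selection loop described in this section, written with scikit-learn rather than the notebook's hand-rolled gradient descent: fit regularized logistic-regression models with increasing polynomial degree and keep the degree with the lowest validation cost. The split ratio and regularization strength are illustrative.

      # Sketch: choose the polynomial degree by validation cost (log-loss).
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import log_loss
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import PolynomialFeatures, StandardScaler

      X, y = load_iris(return_X_y=True)
      X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

      for degree in (1, 2, 3):
          model = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                                LogisticRegression(C=1.0, max_iter=1000))  # C sets regularization
          model.fit(X_tr, y_tr)
          print(f"degree {degree}: validation cost "
                f"{log_loss(y_val, model.predict_proba(X_val)):.3f}")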

    Bias-Variance trade off:

    • A check is done to see whether the model will perform better if more samples are included. The number of samples is increased in steps, and the corresponding model parameters and cost are calculated. The model parameters obtained are then used to compute the cost on the validation set.
    • If the costs for both sets converge, it is an indication that the fit is good.

    https://i.imgur.com/UNh0Veo.png https://i.imgur.com/Ae9ObBR.png https://i.imgur.com/oHrjRLF.png

    Training error:

    • The hypothesis function should ideally be 1 for positive outputs and 0 for negative outputs.
    • It is acceptable if the hypothesis function is >= 0.5 for positive outputs and < 0.5 for negative outputs.
    • The training error is calculated for all the sets. Observations: the model performs very well for Iris-setosa and Iris-virginica. Except for the Iris-versicolor validation set, all sets are modeled well.

    https://i.imgur.com/WwB6B55.png https://i.imgur.com/Pj0c0NJ.png https://i.imgur.com/i3Wpzt8.png

    https://i.imgur.com/62HanTn.png https://i.imgur.com/jj5sATL.png https://i.imgur.com/yVJvpkW.png

    https://i.imgur.com/HyCRIb7.png https://i.imgur.com/MblLr1C.png https://i.imgur.com/zcDHt58.png

    Accuracy: The species with the highest probability (from the hypothesis function) is taken as the prediction. The accuracy came out to be 93.33% for the validation data and, surprisingly, 100% for the test data.

    Improvements that can be made: A more sophisticated algorithm for finding the model parameters could be used instead of gradient descent. The training, validation and test data could be chosen randomly to get the best performance.

  20. Data from: MARIDA

    • huggingface.co
    Updated Mar 13, 2025
    Cite
    GFM-Bench (2025). MARIDA [Dataset]. https://huggingface.co/datasets/GFM-Bench/MARIDA
    Explore at:
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    GFM-Bench
    Description

    MARIDA

    MARIDA is a dataset for sparsely labeled marine debris which consists of 11 MSI bands. This dataset contains a training set of 694 samples, a validation set of 328 samples, and a test set of 350 samples. All image samples are originally 256 x 256 pixels. We combine both the original validation set and test set into one single test set (678 samples). We employ the same approach as DFC2020's, where we divide each 256 x 256 image into 9 smaller patches of 96 x 96 pixels. Thus… See the full description on the dataset page: https://huggingface.co/datasets/GFM-Bench/MARIDA.
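
    A sketch of that tiling: a 3 x 3 grid of 96 x 96 crops over a 256 x 256 image implies a stride of 80 pixels with overlap (the stride is our inference; the description states only the patch size and count).

      # Sketch: tile a 256x256 multispectral image into nine 96x96 patches.
      import numpy as np

      def tile_patches(image, patch=96, stride=80):
          """Split an (H, W, C) array into overlapping patches, row-major."""
          patches = []
          for top in range(0, image.shape[0] - patch + 1, stride):
              for left in range(0, image.shape[1] - patch + 1, stride):
                  patches.append(image[top:top + patch, left:left + patch])
          return np.stack(patches)

      img = np.zeros((256, 256, 11))  # 11 MSI bands, as in MARIDA
      print(tile_patches(img).shape)  # -> (9, 96, 96, 11)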
