100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, but it can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and the number of CV folds were explored, and validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies according to the validation method they used.
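    The pooled-selection bias this dataset documents is easy to reproduce. In the stdlib-only sketch below, the data are pure noise, so any honest accuracy estimate should sit near 0.5: selecting features on the pooled data before K-fold CV still yields above-chance accuracy, while selecting inside each training fold does not. All sizes, the nearest-centroid classifier and the mean-difference feature score are invented for illustration; this is the mechanism, not the authors' simulation code.

```python
import random

random.seed(42)
n, p, k_sel, n_folds = 40, 500, 10, 5

# Pure-noise data: features carry no information about the labels.
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]                      # balanced binary labels

def top_features(rows, k):
    """Rank features by |class-mean difference| computed on `rows` only."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(p), key=score, reverse=True)[:k]

def centroid_predict(train, test, feats):
    """Toy nearest-class-centroid classifier restricted to `feats`."""
    cent = {}
    for c in (0, 1):
        rows = [i for i in train if y[i] == c]
        cent[c] = [sum(X[i][j] for i in rows) / len(rows) for j in feats]
    preds = []
    for i in test:
        dist = {c: sum((X[i][j] - cent[c][t]) ** 2
                       for t, j in enumerate(feats)) for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def kfold_accuracy(select_inside_fold):
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    pooled = top_features(list(range(n)), k_sel)   # selection sees test data
    hits = 0
    for f in range(n_folds):
        test = folds[f]
        train = [i for i in range(n) if i not in test]
        feats = top_features(train, k_sel) if select_inside_fold else pooled
        hits += sum(pred == y[i] for pred, i
                    in zip(centroid_predict(train, test, feats), test))
    return hits / n

biased = kfold_accuracy(select_inside_fold=False)   # pooled selection
honest = kfold_accuracy(select_inside_fold=True)    # selection per fold
print(f"pooled-selection K-fold CV accuracy: {biased:.2f}")
print(f"fold-internal selection accuracy:    {honest:.2f}")
```

    The pooled variant is optimistic because the held-out fold already influenced which 10 of the 500 noise features look "discriminative"; nesting the selection inside each training fold removes that leakage.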

  2. Data from: Development and validation of HBV surveillance models using big...

    • tandf.figshare.com
    docx
    Updated Dec 3, 2024
    Cite
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong (2024). Development and validation of HBV surveillance models using big data and machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.25201473.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Hospital Information System of the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital were structured using state-of-the-art Natural Language Processing (NLP) techniques. Several ML models were used to develop HBV risk assessment models. Model performance was then interpreted using Shapley values (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 within a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV based on patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness-of-fit test p-value > 0.05). Suspected case detection models for HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in the primary care setting in China.
    This study developed a suspected case detection model for HBV that can facilitate early identification and treatment of HBV in the primary care setting in China, contributing towards the achievement of the WHO’s HBV elimination goals. We utilized state-of-the-art NLP techniques to structure the data records, leading to a robust healthcare information system that enhances the surveillance and control of HBV in China.

  3. Text Function, Date, Data Validation

    • kaggle.com
    zip
    Updated Mar 15, 2024
    Cite
    Sanjana Murthy (2024). Text Function, Date, Data Validation [Dataset]. https://www.kaggle.com/sanjanamurthy392/text-function-date-data-validation
    Explore at:
    Available download formats: zip (25,270 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains examples of text functions, dates, and data validation.

  4. FDA Drug Product Labels Validation Method Data Package

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). FDA Drug Product Labels Validation Method Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/fda-drug-product-labels-validation-method-data-package/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Description

    This data package contains information on Structured Product Labeling (SPL) Terminology for SPL validation procedures and information on performing SPL validations.

  5. PEN-Method: Predictor model and Validation Data

    • data.mendeley.com
    • narcis.nl
    Updated Sep 3, 2021
    Cite
    Alex Halle (2021). PEN-Method: Predictor model and Validation Data [Dataset]. http://doi.org/10.17632/459f33wxf6.4
    Explore at:
    Dataset updated
    Sep 3, 2021
    Authors
    Alex Halle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the PEN-Predictor-Keras-Model as well as the 100 validation data sets.

  6. Data from: Selection of optimal validation methods for quantitative...

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    K. Héberger (2023). Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain [Dataset]. http://doi.org/10.6084/m9.figshare.23185916.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    K. Héberger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This brief literature survey groups the (numerical) validation methods and highlights the contradictions and confusion surrounding bias, variance and predictive performance. A multicriteria decision-making analysis was carried out using the sum of absolute ranking differences (SRD), illustrated with five case studies (seven examples). SRD was applied to compare external and cross-validation techniques and indicators of predictive performance, and to select optimal methods to determine the applicability domain (AD). The ordering of model validation methods accorded with the claims of the original authors, but these claims are contradictory with each other, suggesting that any variant of cross-validation can be superior or inferior to other variants depending on the algorithm, data structure and circumstances. A simple fivefold cross-validation proved superior to the Bayesian Information Criterion in the vast majority of situations. It is simply not sufficient to test a numerical validation method in one situation only, even a well-defined one. SRD, as a preferable multicriteria decision-making algorithm, is suitable for tailoring validation techniques and for optimally determining the applicability domain according to the dataset in question.
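    The SRD procedure itself is simple to sketch: rank the objects by each method and by a reference ranking (often the row-wise average), then sum the absolute rank differences per method. The stdlib-only illustration below uses made-up scores, not Héberger's data or implementation:

```python
def ranks(values):
    """Rank values ascending (1 = smallest); ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def srd(column, reference):
    """Sum of absolute ranking differences between a method and the reference."""
    return sum(abs(a - b) for a, b in zip(ranks(column), ranks(reference)))

# Made-up scores: rows = 4 test cases, columns = 3 validation methods.
methods = {
    "5-fold CV": [0.71, 0.64, 0.80, 0.55],
    "LOO CV":    [0.75, 0.60, 0.83, 0.52],
    "hold-out":  [0.66, 0.70, 0.74, 0.61],
}
# Reference ranking: the row-wise average (consensus) of all methods.
reference = [sum(col[i] for col in methods.values()) / len(methods)
             for i in range(4)]
for name, col in methods.items():
    print(f"{name}: SRD = {srd(col, reference)}")
```

    An SRD of 0 means a method orders the cases exactly as the consensus does; larger values flag methods whose ranking deviates from the reference.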

  7. Data from: Development of a Mobile Robot Test Platform and Methods for...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Development of a Mobile Robot Test Platform and Methods for Validation of Prognostics-Enabled Decision Making Algorithms [Dataset]. https://catalog.data.gov/dataset/development-of-a-mobile-robot-test-platform-and-methods-for-validation-of-prognostics-enab
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    As fault diagnosis and prognosis systems in aerospace applications become more capable, the ability to utilize information supplied by them becomes increasingly important. While certain types of vehicle health data can be effectively processed and acted upon by crew or support personnel, others, due to their complexity or time constraints, require either automated or semi-automated reasoning. Prognostics-enabled Decision Making (PDM) is an emerging research area that aims to integrate prognostic health information and knowledge about the future operating conditions into the process of selecting subsequent actions for the system. The newly developed PDM algorithms require suitable software and hardware platforms for testing under realistic fault scenarios. The paper describes the development of such a platform, based on the K11 planetary rover prototype. A variety of injectable fault modes are being investigated for electrical, mechanical, and power subsystems of the testbed, along with methods for data collection and processing. In addition to the hardware platform, a software simulator with matching capabilities has been developed. The simulator allows for prototyping and initial validation of the algorithms prior to their deployment on the K11. The simulator is also available to the PDM algorithms to assist with the reasoning process. A reference set of diagnostic, prognostic, and decision making algorithms is also described, followed by an overview of the current test scenarios and the results of their execution on the simulator.

  8. Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sensor-validation-using-bayesian-networks
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    One of NASA’s key mission requirements is robust state estimation. Sensing, using a wide range of sensors and sensor fusion approaches, plays a central role in robust state estimation, and there is a need to diagnose sensor failure as well as component failure. Sensor validation techniques address this problem: given a vector of sensor readings, decide whether sensors have failed, therefore producing bad data. We take in this paper a probabilistic approach, using Bayesian networks, to diagnosis and sensor validation, and investigate several relevant but slightly different Bayesian network queries. We emphasize that on-board inference can be performed on a compiled model, giving fast and predictable execution times. Our results are illustrated using an electrical power system, and we show that a Bayesian network with over 400 nodes can be compiled into an arithmetic circuit that can correctly answer queries in less than 500 microseconds on average. Reference: O. J. Mengshoel, A. Darwiche, and S. Uckun, "Sensor Validation using Bayesian Networks." In Proc. of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08), Los Angeles, CA, 2008. BibTex Reference: @inproceedings{mengshoel08sensor, author = {Mengshoel, O. J. and Darwiche, A. and Uckun, S.}, title = {Sensor Validation using {Bayesian} Networks}, booktitle = {Proceedings of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08)}, year = {2008} }
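    The core sensor-validation query, stripped of the compilation machinery, can be shown on a two-variable toy network evaluated by exact enumeration: given a reading, what is the posterior probability that the sensor (rather than the monitored system) has failed? All probabilities below are invented for illustration; the paper's 400-node model and arithmetic-circuit compilation are far beyond this sketch.

```python
from itertools import product

# Hypothetical priors and sensor model (all numbers invented).
P_SENSOR_FAULTY = 0.01     # prior that the sensor itself has failed
P_VOLTAGE_BAD = 0.05       # prior that the monitored voltage is off-nominal
P_READ_BAD = {             # P(reading = bad | voltage_bad, sensor_faulty)
    (False, False): 0.02,
    (True,  False): 0.95,
    (False, True):  0.90,
    (True,  True):  0.95,
}

def posterior_sensor_fault(reading_bad):
    """P(sensor faulty | reading) by exact enumeration over hidden states."""
    num = den = 0.0
    for v_bad, s_fault in product([False, True], repeat=2):
        prior = ((P_VOLTAGE_BAD if v_bad else 1 - P_VOLTAGE_BAD) *
                 (P_SENSOR_FAULTY if s_fault else 1 - P_SENSOR_FAULTY))
        like = (P_READ_BAD[(v_bad, s_fault)] if reading_bad
                else 1 - P_READ_BAD[(v_bad, s_fault)])
        p = prior * like
        den += p
        if s_fault:
            num += p
    return num / den

print(f"P(sensor fault | bad reading):  {posterior_sensor_fault(True):.4f}")
print(f"P(sensor fault | good reading): {posterior_sensor_fault(False):.4f}")
```

    A bad reading raises the fault belief well above the 1% prior but does not confirm a sensor failure, since an off-nominal voltage explains the same evidence; compiled arithmetic circuits answer exactly this kind of query, only over hundreds of variables and in microseconds.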

  9. Python functions -- cross-validation methods from a data-driven perspective

    • phys-techsciences.datastations.nl
    docx, png +4
    Updated Aug 16, 2024
    Cite
    Y. Wang; Y. Wang (2024). Python functions -- cross-validation methods from a data-driven perspective [Dataset]. http://doi.org/10.17026/PT/TXAU9W
    Explore at:
    Available download formats: tiff, tsv, txt, docx, png, text/x-python
    Dataset updated
    Aug 16, 2024
    Dataset provided by
    DANS Data Station Physical and Technical Sciences
    Authors
    Y. Wang; Y. Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These are the organized Python functions of the methods proposed in Yanwen Wang's PhD research. Researchers can directly use these functions to conduct spatial+ cross-validation, dissimilarity quantification, and dissimilarity-adaptive cross-validation.
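    For context, the idea behind spatially aware cross-validation can be sketched in a few lines: hold out whole spatial blocks rather than random points, so that test locations are not interleaved with nearby (and therefore correlated) training locations. This stdlib-only toy shows generic block CV on invented coordinates; it is not the spatial+ or dissimilarity-adaptive methods in this dataset.

```python
import random

random.seed(1)
# Hypothetical point data: (x, y) coordinates in the unit square.
points = [(random.random(), random.random()) for _ in range(200)]

def spatial_block_folds(pts, blocks_per_side=3):
    """Assign points to a grid of spatial blocks; each block is one fold.
    Holding out whole blocks keeps test points spatially separated from
    training points, unlike random K-fold splits."""
    folds = {}
    for idx, (x, y) in enumerate(pts):
        cell = (min(int(x * blocks_per_side), blocks_per_side - 1),
                min(int(y * blocks_per_side), blocks_per_side - 1))
        folds.setdefault(cell, []).append(idx)
    return list(folds.values())

folds = spatial_block_folds(points)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(points)) if i not in held_out]
    # ...fit a model on train_idx and evaluate on test_idx here...
print(f"{len(folds)} spatial folds covering {sum(map(len, folds))} points")
```

    With spatially autocorrelated data, random K-fold CV tends to overestimate predictive skill; block-style holdouts are the usual first remedy.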

  10. Dataset Methods for stratification and validation cohorts: a scoping review

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Apr 5, 2022
    Cite
    Teresa Torres Moral; Albert Sanchez-Niubo; Anna Monistrol Mula; Chiara Gerardi; Josep Maria Haro Abad; Judit Subirana-Mirete (2022). Dataset Methods for stratification and validation cohorts: a scoping review [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6414853
    Explore at:
    Dataset updated
    Apr 5, 2022
    Dataset provided by
    Research and Developmental Unit. Parc Sanitari Sant Joan de Déu (Barcelona, Spain) | Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM)
    Center for Health Regulatory Policies. Istituto di Ricerche Farmacologiche Mario Negri (Milan, Italy)
    Research and Developmental Unit. Parc Sanitari Sant Joan de Déu (Barcelona, Spain)
    Authors
    Teresa Torres Moral; Albert Sanchez-Niubo; Anna Monistrol Mula; Chiara Gerardi; Josep Maria Haro Abad; Judit Subirana-Mirete
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We searched PubMed, EMBASE and the Cochrane Library for reviews that described the tools and methods applied to define cohorts used for patient stratification or validation of patient clustering. We focused on cancer, stroke, and Alzheimer’s disease (AD) and limited the searches to reports in English, French, German, Italian and Spanish, published from 2005 to April 2020. Two authors screened the records, and one extracted the key information from each included review. The result of the screening process was reported through a PRISMA flowchart.

  11. Billing-grade Interval Data Validation Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Billing-grade Interval Data Validation Market Research Report 2033 [Dataset]. https://dataintelo.com/report/billing-grade-interval-data-validation-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Billing-grade Interval Data Validation Market Outlook

    According to our latest research, the global billing-grade interval data validation market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the increasing demand for accurate and reliable data in utility billing and energy management systems. The market is expected to grow at a CAGR of 13.4% from 2025 to 2033, culminating in a projected market size of USD 4.54 billion by 2033. This substantial growth is primarily fueled by the proliferation of smart grids, the rising adoption of advanced metering infrastructure, and the necessity for regulatory compliance in billing operations across utilities and energy sectors. As per our research, the market’s momentum is underpinned by the convergence of digital transformation initiatives and the critical need for high-integrity interval data validation to support accurate billing and operational efficiency.

    The growth trajectory of the billing-grade interval data validation market is significantly influenced by the rapid digitalization of utility infrastructure worldwide. With the deployment of smart meters and IoT-enabled devices, utilities are generating an unprecedented volume of interval data that must be validated for billing and operational purposes. The integration of advanced data analytics and machine learning algorithms into validation processes is enhancing the accuracy and reliability of interval data, minimizing errors, and enabling near real-time validation. This technological advancement is not only reducing manual intervention but also ensuring compliance with increasingly stringent regulatory standards. As utilities and energy providers transition toward more automated and data-centric operations, the demand for robust billing-grade data validation solutions is set to surge, driving market expansion.

    Another critical growth factor for the billing-grade interval data validation market is the intensifying focus on energy efficiency and demand-side management. Governments and regulatory bodies across the globe are implementing policies to promote energy conservation, necessitating accurate measurement and validation of consumption data. Billing-grade interval data validation plays a pivotal role in ensuring that billings are precise and reflective of actual usage, thereby fostering trust between utilities and end-users. Moreover, the shift toward dynamic pricing models and time-of-use tariffs is making interval data validation indispensable for utilities aiming to optimize revenue streams and offer personalized billing solutions. As a result, both established utilities and emerging energy management firms are investing heavily in advanced validation platforms to stay competitive and meet evolving customer expectations.

    The market is also witnessing growth due to the increasing complexity of utility billing systems and the diversification of energy sources, including renewables. The integration of distributed energy resources such as solar and wind into the grid is generating multifaceted data streams that require sophisticated validation to ensure billing accuracy and grid stability. Additionally, the rise of prosumers—consumers who also produce energy—has introduced new challenges in data validation, further amplifying the need for billing-grade solutions. Vendors are responding by developing scalable, interoperable platforms capable of handling diverse data types and validation scenarios. This trend is expected to drive innovation and shape the competitive landscape of the billing-grade interval data validation market over the forecast period.

    From a regional perspective, North America continues to dominate the billing-grade interval data validation market, owing to its advanced utility infrastructure, widespread adoption of smart grids, and strong regulatory framework. However, Asia Pacific is emerging as the fastest-growing region, propelled by massive investments in smart grid projects, urbanization, and government initiatives to modernize energy distribution systems. Europe, with its emphasis on sustainability and energy efficiency, is also contributing significantly to market growth. The Middle East & Africa and Latin America, though currently smaller in market share, are expected to witness accelerated adoption as utilities in these regions embark on digital transformation journeys. Overall, the global market is set for dynamic growth, shaped by regional developments and technological advancements.

    Component Analysis

  12. Data Repository for 'Bootstrap aggregation and cross-validation methods to...

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Jun 24, 2020
    Cite
    Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman (2020). Data Repository for 'Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search' [Dataset]. http://doi.org/10.4211/hs.b8f87a7b680d44cebfb4b3f4f4a6a447
    Explore at:
    Available download formats: zip (8.3 MB)
    Dataset updated
    Jun 24, 2020
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 1, 1922 - Sep 30, 2016
    Area covered
    Description

    Policy search methods provide a heuristic mapping between observations and decisions and have been widely used in reservoir control studies. However, recent studies have observed a tendency for policy search methods to overfit to the hydrologic data used in training, particularly the sequence of flood and drought events. This technical note develops an extension of bootstrap aggregation (bagging) and cross-validation techniques, inspired by the machine learning literature, to improve control policy performance on out-of-sample hydrology. We explore these methods using a case study of Folsom Reservoir, California using control policies structured as binary trees and daily streamflow resampling based on the paleo-inflow record. Results show that calibration-validation strategies for policy selection and certain ensemble aggregation methods can improve out-of-sample tradeoffs between water supply and flood risk objectives over baseline performance given fixed computational costs. These results highlight the potential to improve policy search methodologies by leveraging well-established model training strategies from machine learning.
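    The bagging idea in the note can be caricatured in a few lines of stdlib Python: refit a simple decision rule on bootstrap resamples of the hydrologic record and aggregate the ensemble, which damps the influence of any single wet or dry sequence in training. The "policy" here is a deliberately trivial release threshold on invented inflows, not the binary-tree policies, paleo-inflow resampling, or Folsom data of the study.

```python
import random
import statistics

random.seed(7)
# Toy hydrologic record: 50 annual peak inflows (arbitrary units, invented).
inflows = [random.gauss(100.0, 30.0) for _ in range(50)]

def fit_threshold(sample):
    """Toy 'policy search': set the flood-release threshold at the
    empirical 90th percentile of the training inflows."""
    return sorted(sample)[int(0.9 * len(sample))]

def bagged_threshold(record, n_boot=200):
    """Bootstrap aggregation: refit on resampled records, average the fits."""
    fits = []
    for _ in range(n_boot):
        boot = [random.choice(record) for _ in record]  # resample w/ replacement
        fits.append(fit_threshold(boot))
    return statistics.mean(fits)

single = fit_threshold(inflows)
bagged = bagged_threshold(inflows)
print(f"single fit: {single:.1f}  bagged fit: {bagged:.1f}")
```

    A single fit is hostage to the particular flood and drought sequence in its training record; averaging fits across bootstrap replicates reduces that variance, which is the same motivation the authors pursue for control policies.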

  13. How to use frailtypack for validating failure-time surrogate endpoints using...

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Cite
    Casimir Ledoux Sofeu; Virginie Rondeau (2023). How to use frailtypack for validating failure-time surrogate endpoints using individual patient data from meta-analyses of randomized controlled trials [Dataset]. http://doi.org/10.1371/journal.pone.0228098
    Explore at:
    Available download formats: tiff
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Casimir Ledoux Sofeu; Virginie Rondeau
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and Objective: The use of valid surrogate endpoints can accelerate the development of phase III trials. Numerous validation methods have been proposed, the most popular being used in the context of meta-analyses, based on a two-step analysis strategy. For two failure-time endpoints, two association measures are usually considered: Kendall’s τ at the individual level and the adjusted R² at the trial level. However, the adjusted R² is not always available, mainly due to model estimation constraints. More recently, we proposed a one-step validation method based on a joint frailty model, with the aim of reducing estimation issues and estimation bias in the surrogacy evaluation criteria. The model was quite robust, with satisfactory results obtained in simulation studies. This study seeks to popularize this new surrogate endpoint validation approach by making the method available in a user-friendly R package. Methods: We provide numerous tools in the frailtypack R package, including more flexible functions, for the validation of candidate surrogate endpoints using data from multiple randomized clinical trials. Results: We implemented the surrogate threshold effect, which is used in combination with the adjusted R² to make decisions concerning the validity of surrogate endpoints. frailtypack also makes it possible to predict the treatment effect on the true endpoint in a new trial using the treatment effect observed on the surrogate endpoint. Leave-one-out cross-validation is available for assessing the accuracy of this prediction using the joint surrogate model. Other tools include data generation, simulation studies and graphical representations. We illustrate the use of the new functions with both real and simulated data. Conclusion: This article proposes new, attractive and well-developed tools for validating failure-time surrogate endpoints.
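    Of the two association measures mentioned, Kendall's τ at the individual level is the easier one to show concretely. Below is a plain-Python version of the classic concordant-minus-discordant pair count (the τ-a variant, ignoring ties and censoring); note that frailtypack estimates τ from the joint frailty model rather than empirically like this.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

# Invented surrogate and true endpoint times for four patients: every pair
# is ordered the same way on both endpoints, so tau = 1.0.
surrogate_times = [2.0, 3.5, 5.0, 7.5]
true_times      = [1.0, 2.0, 4.0, 6.0]
print(kendall_tau(surrogate_times, true_times))   # 1.0
```

    Values near 1 indicate strong individual-level agreement between surrogate and true endpoints; real survival data additionally require handling censoring, which the joint frailty model does.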

  14. Forage Fish Aerial Validation Data from Prince William Sound, Alaska

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Oct 1, 2025
    Cite
    U.S. Geological Survey (2025). Forage Fish Aerial Validation Data from Prince William Sound, Alaska [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/forage-fish-aerial-validation-data-from-prince-william-sound-alaska
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Prince William Sound, Alaska
    Description

    One table with data used to validate aerial fish surveys in Prince William Sound, Alaska. Data include: date, location, latitude, longitude, aerial ID, validation ID, total length and validation method. Various catch methods were used to obtain fish samples for aerial validation, including cast net, GoPro, hydroacoustics, jig, dip net, gillnet, purse seine, photo and visual identification.

  15. Validation of Methods to Assess the Immunoglobulin Gene Repertoire in...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://data.nasa.gov/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro-e1070
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared the effects of sequencing techniques and a bioinformatic workflow on the data output when assessing IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq and MiSeq datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.

  16. Summary of model validation analyses per country.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 9, 2025
    Cite
    Khaki, Jessie Jane; Minnery, Mark; Giorgi, Emanuele (2025). Summary of model validation analyses per country. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001304764
    Explore at:
    Dataset updated
    Jan 9, 2025
    Authors
    Khaki, Jessie Jane; Minnery, Mark; Giorgi, Emanuele
    Description

    Background: The Expanded Special Project for the Elimination of Neglected Tropical Diseases (ESPEN) was launched in 2019 by the World Health Organization and African nations to combat Neglected Tropical Diseases (NTDs), including soil-transmitted helminths (STH), which still affect over 1.5 billion people globally. In this study, we present a comprehensive geostatistical analysis of publicly available STH survey data from ESPEN to delineate inter-country disparities in STH prevalence and its environmental drivers, while highlighting the strengths and limitations that arise from the use of the ESPEN data. To achieve this, we also propose the use of calibration validation methods to assess the suitability of geostatistical models for disease mapping at the national scale.

    Methods: We analysed the most recent survey data with at least 50 geo-referenced observations, and modelled each STH species (hookworm, roundworm, whipworm) separately. Binomial geostatistical models were developed for each country, exploring associations between STH and environmental covariates, and were validated using the non-randomized probability integral transform. We produced pixel-, subnational-, and country-level prevalence maps for successfully calibrated countries. All results were made publicly available through an R Shiny application.

    Results: Among 35 countries with STH data that met our inclusion criteria, the reported data years ranged from 2004 to 2018. Models from 25 countries were found to be well calibrated. Spatial patterns exhibited significant variation in STH species distribution, with heterogeneity in spatial correlation scale (1.14 km to 3,027.44 km) and residual spatial variance across countries.

    Conclusion: This study highlights the utility of ESPEN data in assessing spatial variations in STH prevalence across countries using model-based geostatistics. Despite the challenges posed by data sparsity, which limit the application of geostatistical models, the insights gained remain crucial for directing focused interventions and shaping future STH assessment strategies within national control programs.
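    The non-randomized probability integral transform (PIT) used here as a calibration check has a compact form for binomial data: for each observation one interpolates between the predictive CDF at y-1 and at y, and a well-calibrated model yields an approximately uniform aggregate. The sketch below is our own illustration with simulated survey data (all names, the perfect-model setup, and the grid size are hypothetical), not the authors' code:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(42)

# Simulated survey data: n_i children examined, y_i positive, true prevalence p_i.
n = rng.integers(50, 200, size=500)
p_true = rng.beta(2, 8, size=500)
y = rng.binomial(n, p_true)

def nonrandomized_pit(y, n, p_hat, n_grid=100):
    """Aggregate non-randomized PIT, P(PIT <= u), on a grid u in [0, 1].

    For discrete data F(y) is not uniform even under a perfect model, so the
    non-randomized PIT linearly interpolates between F(y-1) and F(y)."""
    F_y = binom.cdf(y, n, p_hat)
    F_ym1 = np.where(y > 0, binom.cdf(y - 1, n, p_hat), 0.0)
    u = np.linspace(0.0, 1.0, n_grid + 1)
    # Per-observation transform, clipped to [0, 1], then averaged over the data
    per_obs = np.clip((u[None, :] - F_ym1[:, None])
                      / np.maximum(F_y - F_ym1, 1e-12)[:, None], 0.0, 1.0)
    return u, per_obs.mean(axis=0)

# Under the true model the aggregate PIT should track the diagonal (uniformity)
u, pit = nonrandomized_pit(y, n, p_true)
```

    In practice `p_hat` would come from the fitted geostatistical model at each survey location; large departures of `pit` from `u` flag a poorly calibrated model.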

  17. Data from: cross-validation matters in species distribution models: a case...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Sep 5, 2024
    Cite
    Hongwei Huang; Zhixin Zhang; Ákos Bede-Fazekas; Stefano Mammola; Jiqi Gu; Jinxin Zhou; Junmei Qu; Qiang Lin (2024). cross-validation matters in species distribution models: a case study with goatfish species [Dataset]. http://doi.org/10.5061/dryad.rr4xgxdhf
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Dryad
    Authors
    Hongwei Huang; Zhixin Zhang; Ákos Bede-Fazekas; Stefano Mammola; Jiqi Gu; Jinxin Zhou; Junmei Qu; Qiang Lin
    Time period covered
    Aug 22, 2024
    Description

    Data from: cross-validation matters in species distribution models: a case study with goatfish species

    R scripts were used to explore random and spatial cross-validation methods with goatfish species

    R scripts used to generate background data for Maxent:
    R1_maxent_background_1000km.R
    R1_maxent_background_2000km.R

    R scripts of random and spatial cross-validation methods used to tune Maxent parameters:
    R2.1_SDM_CV_random_1000km.R
    R2.1_SDM_CV_random_2000km.R
    R2.2_SDM_CV_spatial_1000km_5x5.R
    R2.2_SDM_CV_spatial_2000km_5x5.R
    R2.2_SDM_CV_spatial_1000km_10x10.R
    R2.2_SDM_CV_spatial_2000km_10x10.R

    R scripts of random and spatial cross-validation methods used to predict species distribution:
    R3.1_SDM_prediction_random_1000km.R
    R3.1_SDM_prediction_random_2000km.R
    R3.2_SDM_predict...
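    The distinction the scripts draw between random and spatial cross-validation can be sketched in a few lines. This is a generic illustration in Python rather than the authors' R, with made-up coordinates and a hypothetical 5x5-degree blocking that mirrors the `5x5`/`10x10` naming:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical presence records for a goatfish species (illustrative only)
lon = rng.uniform(100.0, 140.0, size=300)
lat = rng.uniform(-10.0, 30.0, size=300)

def random_folds(n, n_folds=5):
    """Random CV: each record gets a fold independently of its location."""
    return rng.integers(0, n_folds, size=n)

def spatial_block_folds(lon, lat, cell_deg=5.0, n_folds=5):
    """Spatial CV: records are grouped by 5x5-degree grid cell, so nearby
    (spatially autocorrelated) points always land in the same fold."""
    gx = np.floor(lon / cell_deg).astype(np.int64)
    gy = np.floor(lat / cell_deg).astype(np.int64)
    _, cell = np.unique(gx * 100_000 + gy, return_inverse=True)
    return cell % n_folds

folds = spatial_block_folds(lon, lat)
```

    With random folds, test points sit next to training points and performance is optimistically biased under spatial autocorrelation; block folds force the model to predict into unseen regions.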

  18. Data from: Content validation in concepts of management and managerial...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated May 27, 2022
    Cite
    de Meneses, Abel Silva; Cunha, Isabel Cristina Kowal Olm (2022). Content validation in concepts of management and managerial practices in Nursing [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000411385
    Explore at:
    Dataset updated
    May 27, 2022
    Authors
    de Meneses, Abel Silva; Cunha, Isabel Cristina Kowal Olm
    Description

    ABSTRACT Objectives: to define and validate 37 concepts emanating from the epistemology of knowledge about Nursing Administration. Methods: theoretical-methodological study applying the Delphi technique to 37 concepts and definitions built on more than half a century of research on Nursing Administration. The concepts were submitted to the judgment of a panel of 21 judges, and validation was measured by the content validity index (> 0.78) and the Kappa coefficient (> 0.61). Results: enunciation of 37 concepts and definitions capable of reflecting the knowledge about Nursing Administration. The 37 concepts were validated by the judges, resulting in content validity indices that ranged from 0.81 to 1.00, with reliability higher than 0.79. Conclusions: the epistemological solution presented was validated by the judges, with indices above 0.80 and high reliability of universal agreement, constituting a new object of ontological understanding for the scientific nursing community.
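    The content validity index used as the acceptance criterion is simple to compute. A minimal sketch follows; the rating matrix is simulated rather than the study's data, and the "3 or 4 on a 4-point scale counts as relevant" rule is the standard I-CVI convention, assumed here rather than stated in the abstract:

```python
import numpy as np

# Hypothetical ratings: 21 judges score each of 37 concepts for relevance
# on a 4-point scale (the study's actual rating data are not reproduced here).
rng = np.random.default_rng(7)
ratings = rng.integers(1, 5, size=(37, 21))

def item_cvi(ratings):
    """Item-level Content Validity Index (I-CVI): the proportion of judges
    rating an item 3 or 4 on a 4-point relevance scale."""
    return (ratings >= 3).mean(axis=1)

cvi = item_cvi(ratings)
validated = cvi > 0.78  # acceptance threshold reported in the abstract
```

    An item rated relevant by 17 of 21 judges, for example, scores 17/21 ≈ 0.81, just clearing the 0.78 threshold.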

  19. Email Validation Tools Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jul 25, 2025
    Cite
    Market Research Forecast (2025). Email Validation Tools Report [Dataset]. https://www.marketresearchforecast.com/reports/email-validation-tools-549597
    Explore at:
    ppt, pdf, docAvailable download formats
    Dataset updated
    Jul 25, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The email validation tools market is experiencing robust growth, driven by the increasing need for businesses to maintain clean and accurate email lists for effective marketing campaigns. The rising adoption of email marketing as a primary communication channel, coupled with stricter data privacy regulations like GDPR and CCPA, necessitates the use of tools that ensure email deliverability and prevent bounces. This market, estimated at $500 million in 2025, is projected to grow at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $1.5 billion by 2033. This expansion is fueled by the growing sophistication of email validation techniques, including real-time verification, syntax checks, and mailbox monitoring, offering businesses more robust solutions to improve their email marketing ROI. Key market segments include small and medium-sized businesses (SMBs), large enterprises, and email marketing agencies, each exhibiting varying levels of adoption and spending based on their specific needs and email marketing strategies. The competitive landscape is characterized by a mix of established players and emerging startups, offering a range of features and pricing models to cater to diverse customer requirements. The market's growth is, however, subject to factors like increasing costs associated with maintaining data accuracy and the potential for false positives in email verification.

    The key players in this dynamic market, such as Mailgun, BriteVerify, and similar companies, are continuously innovating to improve accuracy, speed, and integration with other marketing automation platforms. The market's geographical distribution is diverse, with North America and Europe currently holding significant market share due to higher email marketing adoption rates and a robust technological infrastructure. However, Asia-Pacific and other emerging markets are poised for considerable growth in the coming years due to increasing internet penetration and rising adoption of digital marketing techniques. The ongoing evolution of email marketing strategies, the increasing emphasis on data hygiene, and the rise of artificial intelligence in email verification are likely to further shape the trajectory of this market in the years to come, leading to further innovation and growth.
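    The report's headline figures are internally consistent; a quick check of the compound-growth arithmetic (the function name is ours, not the report's):

```python
def cagr_projection(start_value, cagr, years):
    """Compound annual growth: V_t = V_0 * (1 + r) ** t."""
    return start_value * (1.0 + cagr) ** years

# Report figures: $500M base in 2025, 15% CAGR over 2025-2033 (8 years)
projected = cagr_projection(500.0, 0.15, 2033 - 2025)
# ~$1,529.5M, matching the "approximately $1.5 billion by 2033" claim
```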

  20. Dataset for Validation of a Qualification Procedure Applied to the...

    • data.niaid.nih.gov
    Updated Oct 13, 2023
    Cite
    Khamlichi, Abderrahim (2023). Dataset for Validation of a Qualification Procedure Applied to the Verification of Partial Discharge Analysers Used for HVDC or HVAC Networks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10000388
    Explore at:
    Dataset updated
    Oct 13, 2023
    Authors
    Khamlichi, Abderrahim
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set for the publication named: "Validation of a Qualification Procedure Applied to the Verification of Partial Discharge Analysers Used for HVDC or HVAC Networks"
