100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias from data dimensionality, hyper-parameter space and the number of CV folds were explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
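    As a rough illustration of the comparison the abstract describes, and not the authors' published scripts, the sketch below contrasts plain K-fold CV with feature selection performed on pooled data against nested CV on a small, high-dimensional random dataset; the scikit-learn estimators, the SVM classifier and all parameter values are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): contrasting a biased K-fold CV,
# where feature selection sees the pooled data, with nested CV on a small,
# high-dimensional random dataset. Requires numpy and scikit-learn.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))          # 40 samples, 1000 features, no signal
y = rng.integers(0, 2, size=40)          # random labels -> true accuracy ~0.5

# Biased protocol: select features on ALL data, then run plain K-fold CV.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_sel, y,
                         cv=StratifiedKFold(5)).mean()

# Nested CV: feature selection and tuning happen inside each training fold only.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", SVC(kernel="linear"))])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=StratifiedKFold(3))
unbiased = cross_val_score(inner, X, y, cv=StratifiedKFold(5)).mean()

print(f"pooled-selection K-fold accuracy: {biased:.2f}")   # optimistically high
print(f"nested CV accuracy:               {unbiased:.2f}") # near chance (~0.5)
```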

  2. Method Validation Data.xlsx

    • figshare.com
    xlsx
    Updated Jan 28, 2020
    Cite
    Norberto Gonzalez; Alanah Fitch (2020). Method Validation Data.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.11741703.v1
    Available download formats: xlsx
    Dataset updated
    Jan 28, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Norberto Gonzalez; Alanah Fitch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data for validation of a method for detecting PMP-glucose by HPLC.

  3. Data from: Development and validation of HBV surveillance models using big...

    • tandf.figshare.com
    docx
    Updated Dec 3, 2024
    Cite
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong (2024). Development and validation of HBV surveillance models using big data and machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.25201473.v1
    Available download formats: docx
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Hospital Information System of the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital were structured using state-of-the-art Natural Language Processing (NLP) techniques. Several ML models were used to develop HBV risk assessment models. Model performance was then interpreted using Shapley values (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 within a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV based on patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness-of-fit test p-value > 0.05). Suspected case detection models for HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in primary care settings in China; they can facilitate early identification and treatment of HBV, contributing towards the achievement of the WHO’s HBV elimination goals. We utilized state-of-the-art natural language processing techniques to structure the data records, leading to the development of a robust healthcare information system that enhances the surveillance and control of HBV in China.
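    As a minimal, hedged sketch of the validation framework described above (five-fold cross-validation with AUC as the discrimination metric), and not the study's actual pipeline, the following uses synthetic stand-in features; the model choice and all names are assumptions.

```python
# Minimal sketch (assumptions, not the study's code): evaluating a suspected-case
# classifier with five-fold cross-validation and AUC. Features are synthetic
# stand-ins for the structured complaint variables described above.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 12))                                    # stand-in features
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)   # synthetic label

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv, scoring="roc_auc")
print(f"five-fold AUC: {auc.mean():.2f} ± {auc.std():.2f}")
```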

  4. Text Function, Date, Data Validation

    • kaggle.com
    zip
    Updated Mar 15, 2024
    Cite
    Sanjana Murthy (2024). Text Function, Date, Data Validation [Dataset]. https://www.kaggle.com/sanjanamurthy392/text-function-date-data-validation
    Available download formats: zip (25270 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset covers text functions, dates, and data validation.

  5. PEN-Method: Predictor model and Validation Data

    • data.mendeley.com
    • narcis.nl
    Updated Sep 3, 2021
    Cite
    Alex Halle (2021). PEN-Method: Predictor model and Validation Data [Dataset]. http://doi.org/10.17632/459f33wxf6.4
    Dataset updated
    Sep 3, 2021
    Authors
    Alex Halle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the PEN-Predictor Keras model as well as the 100 validation data sets.
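    A hypothetical loading sketch, not taken from the dataset's documentation: the file names, formats and shapes below are assumptions and would need to be adjusted to the files actually downloaded.

```python
# Hypothetical sketch (file names and formats are assumptions, not taken from the
# dataset): load the bundled Keras predictor and score it on one validation set.
# Requires tensorflow; adjust paths to the downloaded files.
import numpy as np
from tensorflow import keras

model = keras.models.load_model("pen_predictor.h5")   # hypothetical file name
sample = np.load("validation_set_001.npy")            # hypothetical file name
predictions = model.predict(sample)
print(predictions.shape)
```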

  6. Data from: Development of a Mobile Robot Test Platform and Methods for...

    • data.nasa.gov
    • s.cnmilf.com
    • +1 more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Development of a Mobile Robot Test Platform and Methods for Validation of Prognostics-Enabled Decision Making Algorithms [Dataset]. https://data.nasa.gov/dataset/development-of-a-mobile-robot-test-platform-and-methods-for-validation-of-prognostics-enab
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    As fault diagnosis and prognosis systems in aerospace applications become more capable, the ability to utilize information supplied by them becomes increasingly important. While certain types of vehicle health data can be effectively processed and acted upon by crew or support personnel, others, due to their complexity or time constraints, require either automated or semi-automated reasoning. Prognostics-enabled Decision Making (PDM) is an emerging research area that aims to integrate prognostic health information and knowledge about the future operating conditions into the process of selecting subsequent actions for the system. The newly developed PDM algorithms require suitable software and hardware platforms for testing under realistic fault scenarios. The paper describes the development of such a platform, based on the K11 planetary rover prototype. A variety of injectable fault modes are being investigated for electrical, mechanical, and power subsystems of the testbed, along with methods for data collection and processing. In addition to the hardware platform, a software simulator with matching capabilities has been developed. The simulator allows for prototyping and initial validation of the algorithms prior to their deployment on the K11. The simulator is also available to the PDM algorithms to assist with the reasoning process. A reference set of diagnostic, prognostic, and decision making algorithms is also described, followed by an overview of the current test scenarios and the results of their execution on the simulator.

  7. FDA Drug Product Labels Validation Method Data Package

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). FDA Drug Product Labels Validation Method Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/fda-drug-product-labels-validation-method-data-package/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Description

    This data package contains information on Structured Product Labeling (SPL) terminology for SPL validation procedures, and on performing SPL validations.

  8. Data from: Selection of optimal validation methods for quantitative...

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    K. Héberger (2023). Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain [Dataset]. http://doi.org/10.6084/m9.figshare.23185916.v1
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    K. Héberger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This brief literature survey groups the (numerical) validation methods and emphasizes the contradictions and confusion concerning bias, variance and predictive performance. A multicriteria decision-making analysis was carried out using the sum of absolute ranking differences (SRD), illustrated with five case studies (seven examples). SRD was applied to compare external and cross-validation techniques and indicators of predictive performance, and to select optimal methods to determine the applicability domain (AD). The ordering of model validation methods was in accordance with the claims of the original authors, but these claims contradict one another, suggesting that any variant of cross-validation can be superior or inferior to other variants depending on the algorithm, data structure and circumstances involved. A simple fivefold cross-validation proved to be superior to the Bayesian Information Criterion in the vast majority of situations. It is simply not sufficient to test a numerical validation method in one situation only, even if it is a well-defined one. SRD, as a preferable multicriteria decision-making algorithm, is suitable for tailoring validation techniques and for the optimal determination of the applicability domain according to the dataset in question.
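    A small illustration of the SRD idea referenced above, with made-up numbers rather than any of the paper's case studies: each validation method ranks the candidate models, and its ranks are compared with those of a consensus reference column (here the row-wise mean, one common choice).

```python
# Illustrative sketch of the sum of absolute ranking differences (SRD) idea:
# each validation method (column) ranks the models (rows), and its ranks are
# compared with the ranks of a reference column (row-wise mean here).
# Lower SRD = closer to the reference. Numbers are made up for illustration.
import numpy as np
from scipy.stats import rankdata

# rows = models, columns = performance estimates from different validation methods
scores = np.array([[0.71, 0.78, 0.69],
                   [0.80, 0.82, 0.79],
                   [0.65, 0.74, 0.66],
                   [0.77, 0.76, 0.75]])

reference = scores.mean(axis=1)            # consensus reference column
ref_ranks = rankdata(reference)
srd = {f"method_{j}": float(np.abs(rankdata(scores[:, j]) - ref_ranks).sum())
       for j in range(scores.shape[1])}
print(srd)   # smaller values indicate a ranking closer to the consensus
```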

  9. Validation of Methods to Assess the Immunoglobulin Gene Repertoire in...

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://data.nasa.gov/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro-e1070
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq and MiSeq datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.

  10. Billing-grade Interval Data Validation Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Billing-grade Interval Data Validation Market Research Report 2033 [Dataset]. https://dataintelo.com/report/billing-grade-interval-data-validation-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Billing-grade Interval Data Validation Market Outlook

    According to our latest research, the global billing-grade interval data validation market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the increasing demand for accurate and reliable data in utility billing and energy management systems. The market is expected to grow at a CAGR of 13.4% from 2025 to 2033, culminating in a projected market size of USD 4.54 billion by 2033. This substantial growth is primarily fueled by the proliferation of smart grids, the rising adoption of advanced metering infrastructure, and the necessity for regulatory compliance in billing operations across utilities and energy sectors. As per our research, the market’s momentum is underpinned by the convergence of digital transformation initiatives and the critical need for high-integrity interval data validation to support accurate billing and operational efficiency.

    The growth trajectory of the billing-grade interval data validation market is significantly influenced by the rapid digitalization of utility infrastructure worldwide. With the deployment of smart meters and IoT-enabled devices, utilities are generating an unprecedented volume of interval data that must be validated for billing and operational purposes. The integration of advanced data analytics and machine learning algorithms into validation processes is enhancing the accuracy and reliability of interval data, minimizing errors, and enabling near real-time validation. This technological advancement is not only reducing manual intervention but also ensuring compliance with increasingly stringent regulatory standards. As utilities and energy providers transition toward more automated and data-centric operations, the demand for robust billing-grade data validation solutions is set to surge, driving market expansion.

    Another critical growth factor for the billing-grade interval data validation market is the intensifying focus on energy efficiency and demand-side management. Governments and regulatory bodies across the globe are implementing policies to promote energy conservation, necessitating accurate measurement and validation of consumption data. Billing-grade interval data validation plays a pivotal role in ensuring that billings are precise and reflective of actual usage, thereby fostering trust between utilities and end-users. Moreover, the shift toward dynamic pricing models and time-of-use tariffs is making interval data validation indispensable for utilities aiming to optimize revenue streams and offer personalized billing solutions. As a result, both established utilities and emerging energy management firms are investing heavily in advanced validation platforms to stay competitive and meet evolving customer expectations.

    The market is also witnessing growth due to the increasing complexity of utility billing systems and the diversification of energy sources, including renewables. The integration of distributed energy resources such as solar and wind into the grid is generating multifaceted data streams that require sophisticated validation to ensure billing accuracy and grid stability. Additionally, the rise of prosumers—consumers who also produce energy—has introduced new challenges in data validation, further amplifying the need for billing-grade solutions. Vendors are responding by developing scalable, interoperable platforms capable of handling diverse data types and validation scenarios. This trend is expected to drive innovation and shape the competitive landscape of the billing-grade interval data validation market over the forecast period.

    From a regional perspective, North America continues to dominate the billing-grade interval data validation market, owing to its advanced utility infrastructure, widespread adoption of smart grids, and strong regulatory framework. However, Asia Pacific is emerging as the fastest-growing region, propelled by massive investments in smart grid projects, urbanization, and government initiatives to modernize energy distribution systems. Europe, with its emphasis on sustainability and energy efficiency, is also contributing significantly to market growth. The Middle East & Africa and Latin America, though currently smaller in market share, are expected to witness accelerated adoption as utilities in these regions embark on digital transformation journeys. Overall, the global market is set for dynamic growth, shaped by regional developments and technological advancements.

  11. Forage Fish Aerial Validation Data from Prince William Sound, Alaska

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Forage Fish Aerial Validation Data from Prince William Sound, Alaska [Dataset]. https://catalog.data.gov/dataset/forage-fish-aerial-validation-data-from-prince-william-sound-alaska
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Prince William Sound, Alaska
    Description

    One table with data used to validate aerial fish surveys in Prince William Sound, Alaska. Data includes: date, location, latitude, longitude, aerial ID, validation ID, total length and validation method. Various catch methods were used to obtain fish samples for aerial validations, including: cast net, GoPro, hydroacoustics, jig, dip net, gillnet, purse seine, photo and visual identification.

  12. Type definitions of the specialist group review.

    • plos.figshare.com
    xls
    Updated Nov 20, 2023
    Cite
    Ki-Hoon Kim; Seol Whan Oh; Soo Jeong Ko; Kang Hyuck Lee; Wona Choi; In Young Choi (2023). Type definitions of the specialist group review. [Dataset]. http://doi.org/10.1371/journal.pone.0294554.t002
    Available download formats: xls
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ki-Hoon Kim; Seol Whan Oh; Soo Jeong Ko; Kang Hyuck Lee; Wona Choi; In Young Choi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numerous studies make extensive use of healthcare data, including human materials and clinical information, and acknowledge its significance. However, limitations in data collection methods can impact the quality of healthcare data obtained from multiple institutions. In order to secure high-quality data related to human materials, research focused on data quality is necessary. This study validated the quality of data collected in 2020 from the 16 institutions constituting the Korea Biobank Network using 104 validation rules. The validation rules were developed based on the DQ4HEALTH model and were divided into four dimensions: completeness, validity, accuracy, and uniqueness. The Korea Biobank Network collects and manages human materials and clinical information from multiple biobanks, and is in the process of developing a common data model for data integration. The data quality verification revealed an error rate of 0.74%. Furthermore, the data from each institution were analysed to examine the relationship between an institution’s characteristics and its error count. A chi-square test indicated a correlation between each institution and its error count. To confirm this correlation, a correlation analysis was conducted; the results, shown in a graph, revealed the relationship between the factors with high correlation coefficients and the error count. The findings suggest that data quality was impacted by biases in the evaluation system, including each institution’s IT environment, infrastructure, and number of collected samples. These results highlight the need to consider the scalability of research quality when evaluating clinical epidemiological information linked to human materials in future validation studies of data quality.
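    For orientation only, a minimal sketch of rule-based quality checking in the spirit of the dimensions listed above (completeness, validity, uniqueness); the field names and rules are hypothetical and are not the study's 104 DQ4HEALTH-based rules.

```python
# Minimal sketch (hypothetical fields and rules, not the study's rule set):
# applying completeness, validity and uniqueness checks to a table of records
# and reporting an overall error rate.
import pandas as pd

records = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S4"],
    "sex":       ["F", "M", "X", "F"],
    "age":       [34, None, 51, 29],
})

errors = 0
errors += records["age"].isna().sum()                  # completeness: missing age
errors += (~records["sex"].isin(["F", "M"])).sum()     # validity: allowed codes only
errors += records["sample_id"].duplicated().sum()      # uniqueness: duplicate IDs

checked = len(records) * 3                             # records x rules applied
print(f"error rate: {errors / checked:.2%}")
```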

  13. Data for paper: Effect of Green Supply Chain Management Practices on...

    • data.mendeley.com
    Updated Oct 12, 2021
    Cite
    Jorge Luis García-Alcaraz (2021). data for paper: Effect of Green Supply Chain Management Practices on Environmental Performance: Case of Mexican Manufacturing Companies [Dataset]. http://doi.org/10.17632/nj3pms8twc.1
    Dataset updated
    Oct 12, 2021
    Authors
    Jorge Luis García-Alcaraz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains a set of evidence files for the article initially entitled "Effect of Green Supply Chain Management Practices on Environmental Performance: Case of Mexican Manufacturing Companies".

  14. Data from: Summary report of the 4th IAEA Technical Meeting on Fusion Data...

    • dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege (2024). Summary report of the 4th IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis (FDPVA) [Dataset]. http://doi.org/10.7910/DVN/ZZ9UKO
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege
    Description

    The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform during which a set of topics relevant to fusion data processing, validation and analysis were discussed with a view to extrapolating needs to next-step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic-diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlights trends observed in current major fusion confinement devices. A special focus is placed on data analysis requirements for ITER and DEMO, with particular attention paid to Artificial Intelligence for automation and for improving the reliability of control processes.

  15. Validation of the calculation method

    • ieee-dataport.org
    Updated Jul 8, 2024
    Cite
    Luciano Santos (2024). Validation of the calculation method [Dataset]. https://ieee-dataport.org/documents/validation-calculation-method
    Dataset updated
    Jul 8, 2024
    Authors
    Luciano Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Validation of the calculation method. Two zip-compressed *.CSV files: 1) data collected through the Energy Analyzer over 24 hours, together with the results of the Cigré method; 2) data collected from the Telemanagement Relays over 24 hours, together with the results of the calculation method and the Cigré method.

  16. Data for training, validation and testing of methods in the thesis:...

    • data.niaid.nih.gov
    Updated May 1, 2021
    Cite
    Lucia Hajduková (2021). Data for training, validation and testing of methods in the thesis: Camera-based Accuracy Improvement of Indoor Localization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4730337
    Dataset updated
    May 1, 2021
    Authors
    Lucia Hajduková
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The package contains files for two modules designed to improve the accuracy of the indoor positioning system:

    door detection

    • videos_test - videos used to demonstrate the application of the door detector

    • videos_res - videos from the videos_test directory with detected doors marked

    parts detection (see the sketch after this list)

    • frames_train_val - images generated from videos, used for training and validation of the VGG16 neural network model

    • frames_test - images generated from videos, used for testing the trained model

    • videos_test - videos used to demonstrate the application of the parts detector

    • videos_res - videos from the videos_test directory with detected parts marked
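    A rough sketch of how the frames_train_val images could feed VGG16 fine-tuning, as the parts-detection module above implies; the directory layout, image size and training settings are assumptions, not the thesis' actual configuration.

```python
# Rough sketch (directory layout, image size and class count are assumptions):
# loading the frames_train_val images and fine-tuning a frozen VGG16 backbone.
# Requires tensorflow.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

train_ds = tf.keras.utils.image_dataset_from_directory(
    "frames_train_val", validation_split=0.2, subset="training",
    seed=1, image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "frames_train_val", validation_split=0.2, subset="validation",
    seed=1, image_size=(224, 224), batch_size=32)

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # keep pretrained features frozen
num_classes = len(train_ds.class_names)      # inferred from sub-directories

model = models.Sequential([
    layers.Rescaling(1.0 / 255),             # simple scaling for this sketch
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```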

  17. GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1

    • data.nasa.gov
    • datasets.ai
    • +5 more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1 [Dataset]. https://data.nasa.gov/dataset/gpm-ground-validation-noaa-cpc-morphing-technique-cmorph-ifloods-v1
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The GPM Ground Validation NOAA CPC Morphing Technique (CMORPH) IFloodS dataset consists of global precipitation analyses data produced by the NOAA Climate Prediction Center (CPC). The Iowa Flood Studies (IFloodS) campaign was a ground measurement campaign that took place in eastern Iowa from May 1 to June 15, 2013. The goals of the campaign were to collect detailed measurements of precipitation at the Earth's surface using ground instruments and advanced weather radars and, simultaneously, collect data from satellites passing overhead. The CPC morphing technique uses precipitation estimates from low orbiter satellite microwave observations to produce global precipitation analyses at a high temporal and spatial resolution. Data has been selected for the Iowa Flood Studies (IFloodS) field campaign which took place from April 1, 2013 to June 30, 2013. The dataset includes both the near real-time raw data and bias corrected data from NOAA in binary and netCDF format.
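    As a hedged example of working with the netCDF portion of such a product, the data can be inspected with xarray; the file name below is a placeholder, and the presence of a "time" coordinate is an assumption about the files.

```python
# Hedged sketch: inspect a netCDF file from the dataset with xarray and subset
# the IFloodS ground-campaign window. The file name is a placeholder, and the
# "time" coordinate name is an assumption. Requires xarray and netCDF4.
import xarray as xr

ds = xr.open_dataset("cmorph_ifloods_example.nc")        # placeholder file name
print(ds)                                                # variables, dims, attributes
subset = ds.sel(time=slice("2013-05-01", "2013-06-15"))  # campaign window
print(float(subset[list(subset.data_vars)[0]].mean()))   # quick sanity check
```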

  18. Python functions -- cross-validation methods from a data-driven perspective

    • phys-techsciences.datastations.nl
    docx, png +4
    Updated Aug 16, 2024
    Cite
    Y. Wang; Y. Wang (2024). Python functions -- cross-validation methods from a data-driven perspective [Dataset]. http://doi.org/10.17026/PT/TXAU9W
    Available download formats: tiff, tsv, txt, text/x-python, docx, png
    Dataset updated
    Aug 16, 2024
    Dataset provided by
    DANS Data Station Physical and Technical Sciences
    Authors
    Y. Wang; Y. Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These are the organized Python functions for the methods proposed in Yanwen Wang's PhD research. Researchers can directly use these functions to conduct spatial+ cross-validation, the dissimilarity quantification method, and dissimilarity-adaptive cross-validation.
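    The sketch below is a generic spatial block cross-validation, shown only to illustrate the family of techniques; it is not the thesis' spatial+ or dissimilarity-adaptive methods, and all data and parameters are synthetic assumptions.

```python
# Generic illustration only (not the thesis' spatial+ or dissimilarity-adaptive
# methods): a plain spatial block cross-validation, grouping samples by coarse
# grid cell so that nearby points never straddle train and test folds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, size=(500, 2))              # synthetic x/y locations
X = np.c_[coords, rng.normal(size=(500, 3))]             # coordinates + covariates
y = coords[:, 0] * 0.05 + rng.normal(size=500)           # spatially structured target

# Assign each point to a 25x25 grid cell; cells act as CV groups (blocks).
blocks = (coords[:, 0] // 25).astype(int) * 4 + (coords[:, 1] // 25).astype(int)
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, groups=blocks, cv=GroupKFold(n_splits=5),
                         scoring="r2")
print(f"blocked spatial CV R²: {scores.mean():.2f}")
```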

  19. Dataset Methods for stratification and validation cohorts: a scoping review

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 5, 2022
    Cite
    Teresa Torres Moral; Albert Sanchez-Niubo; Anna Monistrol Mula; Chiara Gerardi; Josep Maria Haro Abad; Judit Subirana-Mirete (2022). Dataset Methods for stratification and validation cohorts: a scoping review [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6414853
    Dataset updated
    Apr 5, 2022
    Dataset provided by
    Center for Health Regulatory Policies. Istituto di Ricerche Farmacologiche Mario Negri (Milan, Italy)
    Research and Developmental Unit. Parc Sanitari Sant Joan de Déu (Barcelona, Spain) | Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM)
    Research and Developmental Unit. Parc Sanitari Sant Joan de Déu (Barcelona, Spain)
    Authors
    Teresa Torres Moral; Albert Sanchez-Niubo; Anna Monistrol Mula; Chiara Gerardi; Josep Maria Haro Abad; Judit Subirana-Mirete
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We searched PubMed, EMBASE and the Cochrane Library for reviews that described the tools and methods applied to define cohorts used for patient stratification or validation of patient clustering. We focused on cancer, stroke, and Alzheimer’s disease (AD) and limited the searches to reports in English, French, German, Italian and Spanish, published from 2005 to April 2020. Two authors screened the records, and one extracted the key information from each included review. The result of the screening process was reported through a PRISMA flowchart.

  20. Data Quality Management Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 16, 2025
    Cite
    Archive Market Research (2025). Data Quality Management Report [Dataset]. https://www.archivemarketresearch.com/reports/data-quality-management-558466
    Available download formats: ppt, pdf, doc
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Quality Management (DQM) market is experiencing robust growth, driven by the increasing volume and velocity of data generated across various industries. Businesses are increasingly recognizing the critical need for accurate, reliable, and consistent data to support critical decision-making, improve operational efficiency, and comply with stringent data regulations. The market is estimated to be valued at $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This growth is fueled by several key factors, including the rising adoption of cloud-based DQM solutions, the expanding use of advanced analytics and AI in data quality processes, and the growing demand for data governance and compliance solutions. The market is segmented by deployment (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (BFSI, healthcare, retail, etc.), with the cloud segment exhibiting the fastest growth. Major players in the DQM market include Informatica, Talend, IBM, Microsoft, Oracle, SAP, SAS Institute, Pitney Bowes, Syncsort, and Experian, each offering a range of solutions catering to diverse business needs. These companies are constantly innovating to provide more sophisticated and integrated DQM solutions incorporating machine learning, automation, and self-service capabilities. However, the market also faces some challenges, including the complexity of implementing DQM solutions, the lack of skilled professionals, and the high cost associated with some advanced technologies. Despite these restraints, the long-term outlook for the DQM market remains positive, with continued expansion driven by the expanding digital transformation initiatives across industries and the growing awareness of the significant return on investment associated with improved data quality.
