100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and the number of CV folds were explored, and the validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
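    The gap between these validation schemes is easy to reproduce. Below is a minimal sketch (not the authors' code; scikit-learn and pure-noise synthetic data are assumed) that scores a classifier on random labels, where the true accuracy is 50%: naive K-fold CV, which tunes and scores on the same folds, tends to read above chance, while nested CV stays close to it.

      import numpy as np
      from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      X = rng.normal(size=(40, 500))       # small sample, high dimensionality
      y = rng.integers(0, 2, size=40)      # random labels: no true signal

      param_grid = {"C": [0.01, 0.1, 1, 10]}
      inner = KFold(n_splits=5, shuffle=True, random_state=0)
      outer = KFold(n_splits=5, shuffle=True, random_state=1)

      # Naive K-fold CV: the same folds are used both to tune and to report accuracy.
      naive = GridSearchCV(SVC(), param_grid, cv=inner).fit(X, y)
      print("K-fold CV estimate:", naive.best_score_)    # optimistically biased

      # Nested CV: outer folds score a model that was tuned only on inner folds.
      nested = cross_val_score(GridSearchCV(SVC(), param_grid, cv=inner), X, y, cv=outer)
      print("nested CV estimate:", nested.mean())        # stays near chance (0.5)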

  2. Text Function, Date, Data Validation

    • kaggle.com
    zip
    Updated Mar 15, 2024
    Cite
    Sanjana Murthy (2024). Text Function, Date, Data Validation [Dataset]. https://www.kaggle.com/sanjanamurthy392/text-function-date-data-validation
    Explore at:
    Available download formats: zip (25270 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains examples of text functions, date functions, and data validation.

  3. Data from: Development and validation of HBV surveillance models using big data and machine learning

    • tandf.figshare.com
    docx
    Updated Dec 3, 2024
    Cite
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong (2024). Development and validation of HBV surveillance models using big data and machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.25201473.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital’s Hospital Information System were structured using state-of-the-art Natural Language Processing techniques. Several ML models were used to develop HBV risk assessment models. The performance of the ML models was then interpreted using the Shapley value (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 under a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness-of-fit test p-value > 0.05). Suspected case detection models of HBV, showing potential for clinical deployment, have thus been developed to improve HBV surveillance in the primary care setting in China: they can facilitate early identification and treatment of HBV, contributing towards the achievement of the WHO’s elimination goals, while the NLP-based structuring of clinical records supports the robust healthcare information system needed to sustain HBV surveillance and control.
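    For illustration only, the validation scheme described here (a 2:1 derivation/validation split with five-fold cross-validation and AUC as the discrimination metric) might be sketched as follows; scikit-learn and synthetic stand-in data are assumed, and this is not the authors' pipeline.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

      # Synthetic, imbalanced stand-in for the structured clinic records.
      X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9], random_state=0)

      # Cohort data randomly divided at a ratio of 2:1 (derivation : validation).
      X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)

      model = GradientBoostingClassifier(random_state=0)
      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      print("five-fold CV AUC:", cross_val_score(model, X_dev, y_dev, cv=cv, scoring="roc_auc").mean())

      model.fit(X_dev, y_dev)
      print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))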

  4. PEN-Method: Predictor model and Validation Data

    • data.mendeley.com
    • narcis.nl
    Updated Sep 3, 2021
    Cite
    Alex Halle (2021). PEN-Method: Predictor model and Validation Data [Dataset]. http://doi.org/10.17632/459f33wxf6.4
    Explore at:
    Dataset updated
    Sep 3, 2021
    Authors
    Alex Halle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the PEN-Predictor Keras model as well as the 100 validation data sets.
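    A minimal sketch of loading and querying the bundled model with Keras follows; the file names are hypothetical, since the listing does not give them.

      import numpy as np
      from tensorflow import keras

      model = keras.models.load_model("pen_predictor.h5")   # hypothetical file name
      model.summary()

      sample = np.load("validation_case_001.npy")           # hypothetical validation-set file
      print(model.predict(sample[np.newaxis, ...]))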

  5. Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sensor-validation-using-bayesian-networks
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    One of NASA’s key mission requirements is robust state estimation. Sensing, using a wide range of sensors and sensor fusion approaches, plays a central role in robust state estimation, and there is a need to diagnose sensor failure as well as component failure. Sensor validation techniques address this problem: given a vector of sensor readings, decide whether sensors have failed, and are therefore producing bad data. We take in this paper a probabilistic approach, using Bayesian networks, to diagnosis and sensor validation, and investigate several relevant but slightly different Bayesian network queries. We emphasize that on-board inference can be performed on a compiled model, giving fast and predictable execution times. Our results are illustrated using an electrical power system, and we show that a Bayesian network with over 400 nodes can be compiled into an arithmetic circuit that can correctly answer queries in less than 500 microseconds on average.

    Reference: O. J. Mengshoel, A. Darwiche, and S. Uckun, "Sensor Validation using Bayesian Networks." In Proc. of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08), Los Angeles, CA, 2008.

    BibTeX reference:

      @inproceedings{mengshoel08sensor,
        author    = {Mengshoel, O. J. and Darwiche, A. and Uckun, S.},
        title     = {Sensor Validation using {Bayesian} Networks},
        booktitle = {Proceedings of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08)},
        year      = {2008}
      }
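    The kind of query the paper describes can be prototyped in a few lines. The toy network below uses pgmpy (an assumption on our part, not the paper's compiled arithmetic-circuit implementation) to ask whether a sensor is healthy given the reading it produced.

      from pgmpy.models import BayesianNetwork
      from pgmpy.factors.discrete import TabularCPD
      from pgmpy.inference import VariableElimination

      # A healthy sensor tracks the true state; a failed one reports noise.
      model = BayesianNetwork([("SensorOK", "Reading"), ("State", "Reading")])
      model.add_cpds(
          TabularCPD("SensorOK", 2, [[0.02], [0.98]]),   # P(failed) = 0.02
          TabularCPD("State", 2, [[0.5], [0.5]]),
          TabularCPD(
              "Reading", 2,
              # columns: (SensorOK, State) = (0,0), (0,1), (1,0), (1,1)
              values=[[0.5, 0.5, 0.95, 0.05],
                      [0.5, 0.5, 0.05, 0.95]],
              evidence=["SensorOK", "State"], evidence_card=[2, 2],
          ),
      )
      assert model.check_model()

      # Sensor validation query: is the sensor healthy, given what it reported?
      posterior = VariableElimination(model).query(["SensorOK"], evidence={"Reading": 1})
      print(posterior)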

  6. Forage Fish Aerial Validation Data from Prince William Sound, Alaska

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Forage Fish Aerial Validation Data from Prince William Sound, Alaska [Dataset]. https://catalog.data.gov/dataset/forage-fish-aerial-validation-data-from-prince-william-sound-alaska
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Alaska, Prince William Sound
    Description

    One table with data used to validate aerial fish surveys in Prince William Sound, Alaska. Data includes: date, location, latitude, longitude, aerial ID, validation ID, total length and validation method. Various catch methods were used to obtain fish samples for aerial validations, including: cast net, GoPro, hydroacoustics, jig, dip net, gillnet, purse seine, photo and visual identification.

  7. FDA Drug Product Labels Validation Method Data Package

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). FDA Drug Product Labels Validation Method Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/fda-drug-product-labels-validation-method-data-package/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Description

    This data package contains information on Structured Product Labeling (SPL) Terminology for SPL validation procedures and information on performing SPL validations.

  8. Data from: Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    K. Héberger (2023). Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain [Dataset]. http://doi.org/10.6084/m9.figshare.23185916.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    K. Héberger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This brief literature survey groups the (numerical) validation methods and emphasizes the contradictions and confusion surrounding bias, variance and predictive performance. A multicriteria decision-making analysis was made using the sum of absolute ranking differences (SRD), illustrated with five case studies (seven examples). SRD was applied to compare external and cross-validation techniques and indicators of predictive performance, and to select optimal methods for determining the applicability domain (AD). The ordering of model validation methods was in accordance with the claims of the original authors, but these claims contradict one another, suggesting that any variant of cross-validation can be superior or inferior to other variants depending on the algorithm, data structure and circumstances applied. A simple fivefold cross-validation proved to be superior to the Bayesian Information Criterion in the vast majority of situations. It is simply not sufficient to test a numerical validation method in one situation only, even if it is a well-defined one. SRD, as a preferable multicriteria decision-making algorithm, is suitable for tailoring the techniques for validation and for the optimal determination of the applicability domain according to the dataset in question.
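    As a concrete illustration of the SRD calculation itself, here is a bare-bones sketch (not the author's implementation; the row-average reference used below is one common choice of golden standard):

      import numpy as np
      from scipy.stats import rankdata

      scores = np.array([   # rows: test cases, columns: validation methods A-D
          [0.71, 0.74, 0.69, 0.80],
          [0.65, 0.68, 0.66, 0.72],
          [0.80, 0.81, 0.77, 0.85],
          [0.58, 0.61, 0.60, 0.65],
      ])

      reference = rankdata(scores.mean(axis=1))   # rank the cases by the consensus (row average)
      srd = np.abs(rankdata(scores, axis=0) - reference[:, None]).sum(axis=0)
      print(dict(zip("ABCD", srd)))               # smaller SRD = closer to the reference ranking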

  9. Data from: Development of a Mobile Robot Test Platform and Methods for Validation of Prognostics-Enabled Decision Making Algorithms

    • data.nasa.gov
    • s.cnmilf.com
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Development of a Mobile Robot Test Platform and Methods for Validation of Prognostics-Enabled Decision Making Algorithms [Dataset]. https://data.nasa.gov/dataset/development-of-a-mobile-robot-test-platform-and-methods-for-validation-of-prognostics-enab
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    As fault diagnosis and prognosis systems in aerospace applications become more capable, the ability to utilize information supplied by them becomes increasingly important. While certain types of vehicle health data can be effectively processed and acted upon by crew or support personnel, others, due to their complexity or time constraints, require either automated or semi-automated reasoning. Prognostics-enabled Decision Making (PDM) is an emerging research area that aims to integrate prognostic health information and knowledge about the future operating conditions into the process of selecting subsequent actions for the system. The newly developed PDM algorithms require suitable software and hardware platforms for testing under realistic fault scenarios. The paper describes the development of such a platform, based on the K11 planetary rover prototype. A variety of injectable fault modes are being investigated for electrical, mechanical, and power subsystems of the testbed, along with methods for data collection and processing. In addition to the hardware platform, a software simulator with matching capabilities has been developed. The simulator allows for prototyping and initial validation of the algorithms prior to their deployment on the K11. The simulator is also available to the PDM algorithms to assist with the reasoning process. A reference set of diagnostic, prognostic, and decision making algorithms is also described, followed by an overview of the current test scenarios and the results of their execution on the simulator.

  10. Email Validation Tools Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jul 25, 2025
    Cite
    Market Research Forecast (2025). Email Validation Tools Report [Dataset]. https://www.marketresearchforecast.com/reports/email-validation-tools-549597
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Jul 25, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The email validation tools market is experiencing robust growth, driven by the increasing need for businesses to maintain clean and accurate email lists for effective marketing campaigns. The rising adoption of email marketing as a primary communication channel, coupled with stricter data privacy regulations like GDPR and CCPA, necessitates the use of tools that ensure email deliverability and prevent bounces. This market, estimated at $500 million in 2025, is projected to grow at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $1.5 billion by 2033. This expansion is fueled by the growing sophistication of email validation techniques, including real-time verification, syntax checks, and mailbox monitoring, offering businesses more robust solutions to improve their email marketing ROI. Key market segments include small and medium-sized businesses (SMBs), large enterprises, and email marketing agencies, each exhibiting varying levels of adoption and spending based on their specific needs and email marketing strategies. The competitive landscape is characterized by a mix of established players and emerging startups, offering a range of features and pricing models to cater to diverse customer requirements. The market's growth is, however, subject to factors like increasing costs associated with maintaining data accuracy and the potential for false positives in email verification. The key players in this dynamic market, such as Mailgun, BriteVerify, and similar companies, are continuously innovating to improve accuracy, speed, and integration with other marketing automation platforms. The market's geographical distribution is diverse, with North America and Europe currently holding significant market share due to higher email marketing adoption rates and a robust technological infrastructure. However, Asia-Pacific and other emerging markets are poised for considerable growth in the coming years due to increasing internet penetration and rising adoption of digital marketing techniques. The ongoing evolution of email marketing strategies, the increasing emphasis on data hygiene, and the rise of artificial intelligence in email verification are likely to further shape the trajectory of this market in the years to come, leading to further innovation and growth.
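    The headline figures quoted above are internally consistent, as a one-line compound-growth check shows (all values taken from the paragraph above):

      base, cagr, years = 500e6, 0.15, 2033 - 2025
      print(f"${base * (1 + cagr) ** years / 1e9:.2f}B")   # ~$1.53B, i.e. "approximately $1.5 billion by 2033"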

  11. Icing Validation Database

    • dataverse.no
    • dataverse.azure.uit.no
    tsv, txt, xlsx
    Updated Sep 28, 2023
    Cite
    Richard Hann; Richard Hann; Nicolas Müller; Nicolas Müller (2023). Icing Validation Database [Dataset]. http://doi.org/10.18710/5XYALW
    Explore at:
    Available download formats: txt(2378), xlsx(35849), tsv(35777)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Richard Hann; Richard Hann; Nicolas Müller; Nicolas Müller
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    1983 - 2019
    Description

    This database contains an overview of experimental datasets that can be used for the validation of ice prediction simulation methods. It was generated for the 1st AIAA Ice Prediction Workshop, scheduled for 2021. The database contains entries on 71 experimental datasets from the literature. For each entry, a series of parameters has been identified, including the investigated geometries, Reynolds numbers, Mach numbers, and icing envelopes.

  12. Data from: Summary report of the 4th IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis (FDPVA)

    • dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege (2024). Summary report of the 4th IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis (FDPVA) [Dataset]. http://doi.org/10.7910/DVN/ZZ9UKO
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege
    Description

    The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform for discussing a set of topics relevant to fusion data processing, validation and analysis, with a view to extrapolating needs to next-step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic-diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlights trends observed in current major fusion confinement devices. A special focus is dedicated to data analysis requirements for ITER and DEMO, with particular attention paid to Artificial Intelligence for automation and for improving the reliability of control processes.

  13. GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1

    • data.nasa.gov
    • datasets.ai
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1 [Dataset]. https://data.nasa.gov/dataset/gpm-ground-validation-noaa-cpc-morphing-technique-cmorph-ifloods-v1
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The GPM Ground Validation NOAA CPC Morphing Technique (CMORPH) IFloodS dataset consists of global precipitation analyses data produced by the NOAA Climate Prediction Center (CPC). The Iowa Flood Studies (IFloodS) campaign was a ground measurement campaign that took place in eastern Iowa from May 1 to June 15, 2013. The goals of the campaign were to collect detailed measurements of precipitation at the Earth's surface using ground instruments and advanced weather radars and, simultaneously, collect data from satellites passing overhead. The CPC morphing technique uses precipitation estimates from low orbiter satellite microwave observations to produce global precipitation analyses at a high temporal and spatial resolution. Data has been selected for the Iowa Flood Studies (IFloodS) field campaign which took place from April 1, 2013 to June 30, 2013. The dataset includes both the near real-time raw data and bias corrected data from NOAA in binary and netCDF format.
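    A short sketch of opening the netCDF variant with xarray follows; the file name and precipitation variable name are assumptions, since the listing does not specify them.

      import xarray as xr

      ds = xr.open_dataset("cmorph_ifloods_20130501.nc")   # hypothetical file name
      print(ds)                                            # inspect the actual variable names

      precip = ds["precip"]                                # assumed variable name
      print(precip.sel(time="2013-05-01").mean().values)   # domain-mean precipitation rate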

  14. VQA Validation Dataset

    • kaggle.com
    zip
    Updated Sep 5, 2021
    Cite
    Mario Dias (2021). VQA Validation Dataset [Dataset]. https://www.kaggle.com/itsmariodias/vqa-validation-dataset
    Explore at:
    Available download formats: zip (6615743858 bytes)
    Dataset updated
    Sep 5, 2021
    Authors
    Mario Dias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

    Validation data from the VQA v2 dataset (https://visualqa.org/index.html):
    • Validation images: 40,504 images
    • Validation questions (2017 v2.0): 214,354 questions
    • Validation annotations (2017 v2.0): 2,143,540 answers
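    A minimal sketch for loading the question and answer files follows; the JSON names below follow the standard VQA v2 release and may differ in this Kaggle mirror.

      import json

      with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
          questions = json.load(f)["questions"]
      with open("v2_mscoco_val2014_annotations.json") as f:
          annotations = json.load(f)["annotations"]

      print(len(questions), "questions,", len(annotations), "annotations")
      q = questions[0]
      print(q["image_id"], q["question"])   # each question is keyed to an image_id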

    This dataset is provided only for ease of use in Kaggle environments. I do not take any credit for the creation of the dataset. The annotations in this dataset belong to the VQA Consortium and are licensed under a Creative Commons Attribution 4.0 International License.

    Copyright © 2015, VQA Consortium. All rights reserved. Redistribution and use software in source and binary form, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the VQA Consortium nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE AND ANNOTATIONS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

  15. Restaurant_management_system

    • kaggle.com
    zip
    Updated Nov 26, 2025
    Cite
    Vaibhav2702 (2025). Restaurant_management_system [Dataset]. https://www.kaggle.com/datasets/vaibhav2702/restaurant-management-system
    Explore at:
    Available download formats: zip (64116 bytes)
    Dataset updated
    Nov 26, 2025
    Authors
    Vaibhav2702
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This project contains a Restaurant Orders Dataset with 150+ rows. It includes Excel-cleaned data; Python OOP classes for the starter menu, the main menu, registration, login, and billing; and SQL scripts with triggers, stored procedures, and views for data validation. It is suitable for learning data analysis, database management, and Python programming.

  16. Bookshelf data

    • orda.shef.ac.uk
    zip
    Updated Sep 8, 2017
    Cite
    Malcolm Hepburne Scott; Robert Barthorpe; David Wagg; Keith Worden (2017). Bookshelf data [Dataset]. http://doi.org/10.15131/shef.data.5384275.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    The University of Sheffield
    Authors
    Malcolm Hepburne Scott; Robert Barthorpe; David Wagg; Keith Worden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data and models associated with the paper "A comparison of validation techniques for a nonlinear bifurcating system". A guide to the data is contained in the Word document "data guide".

  17. Thermal Validation System Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jul 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Thermal Validation System Report [Dataset]. https://www.datainsightsmarket.com/reports/thermal-validation-system-1521853
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global thermal validation system market is experiencing robust growth, driven by increasing regulatory scrutiny across pharmaceutical, biotechnology, and food processing industries. Stringent quality control standards and the need for accurate temperature monitoring throughout the manufacturing and storage processes are key factors fueling market expansion. The market is segmented by system type (e.g., autoclaves, ovens, incubators), application (pharmaceutical, food & beverage, etc.), and end-user (contract research organizations, pharmaceutical manufacturers, etc.). Technological advancements, such as the integration of IoT sensors and cloud-based data analysis, are enhancing the capabilities of thermal validation systems, leading to improved efficiency and data management. Furthermore, the rising demand for sophisticated validation techniques to comply with international regulations like GMP and FDA guidelines is further bolstering market growth. We estimate the 2025 market size to be approximately $850 million, growing at a Compound Annual Growth Rate (CAGR) of 7% from 2025 to 2033. This growth reflects the increasing adoption of advanced technologies and the expanding regulatory landscape in key regions like North America and Europe. Competition in the thermal validation system market is intense, with several established players and emerging companies vying for market share. Key players like Kaye, Ellab, and Thermo Fisher Scientific are leveraging their strong brand reputation and technological expertise to maintain market leadership. However, smaller, specialized firms are also gaining traction by offering niche solutions and innovative technologies. The market is expected to witness further consolidation in the coming years, with strategic acquisitions and partnerships playing a crucial role in shaping the competitive landscape. Geographic expansion, particularly in emerging markets in Asia-Pacific and Latin America, represents a significant growth opportunity for market participants. The restraints to growth include the high initial investment cost associated with implementing thermal validation systems and the need for skilled personnel to operate and maintain these systems.

  18. Credit Card Behaviour Score

    • kaggle.com
    zip
    Updated Jan 10, 2025
    Cite
    Suvradeep (2025). Credit Card Behaviour Score [Dataset]. https://www.kaggle.com/datasets/suvroo/credit-card-behaviour-score/code
    Explore at:
    Available download formats: zip (72966946 bytes)
    Dataset updated
    Jan 10, 2025
    Authors
    Suvradeep
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Bank A issues Credit Cards to eligible customers. The Bank deploys advanced ML models and frameworks to decide on eligibility, limit, and interest rate assignment. The models and frameworks are optimized to manage early risk and ensure profitability. The Bank has now decided to build a robust risk management framework for its existing Credit Card customers, irrespective of when they were acquired. To enable this, the Bank has decided to create a “Behaviour Score”. A Behaviour Score is a predictive model. It is developed on a base of customers whose Credit Cards are open and are not past due. The model predicts the probability of customers defaulting on the Credit Cards going forward. This model will then be used for several portfolio risk management activities.

    Problem statement

    Your objective is to develop the Behaviour Score for Bank A.

    Datasets

    You have been provided with a random sample of 96,806 Credit Card details in "Dev_data_to_be_shared.zip", along with a flag (bad_flag) – henceforth known as "development data". This is a historical snapshot of the Credit Card portfolio of Bank A. Credit Cards that have actually defaulted have bad_flag = 1. You have also been provided with several independent variables. These include:
    • On-us attributes like credit limit (variables with names starting with onus_attributes)
    • Transaction-level attributes like the number of transactions or rupee value of transactions with various kinds of merchants (variables with names starting with transaction_attribute)
    • Bureau tradeline-level attributes, like product holdings and historical delinquencies (variables starting with bureau)
    • Bureau enquiry-level attributes, like PL enquiries in the last 3 months (variables starting with bureau_enquiry)
    You have also been provided with another random sample of 41,792 Credit Card details in "validation_data_to_be_shared.zip" with the same set of input variables, but without bad_flag. This will be referred to going forward as "validation data".

    Requirements

    Using the data provided, you will have to come up with a way to predict the probability that a given Credit Card customer will default. You can use the development data for this purpose. You are then required to use the same logic to predict the probability for each of the Credit Cards in the validation data. Your submission should contain two columns – the primary key from the validation data (account_number) and the predicted probability for that account. You are also required to submit detailed documentation of this exercise. A good document should describe your approach, including a write-up on any algorithms you use. It should then cover each of the steps you followed in as much detail as you can, move on to any key insights or observations you came across in the data provided, and finally explain what metrics you used to measure the effectiveness of your approach.

    Evaluation

    As detailed in the previous section, you are required to submit the primary key and predicted probabilities of all the accounts provided to you in the validation data, along with documentation. We will only evaluate submissions that are complete and pass sanity checks (for example, probability values should be between 0 and 1). Submissions will be evaluated on how close the predicted probabilities are to the actual outcomes. We will also evaluate the documentation on its completeness and accuracy. Extra points will be granted to submissions that include interesting insights or observations on the data provided.
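    An end-to-end sketch of the required deliverable is shown below. The extracted CSV names are assumptions; account_number and bad_flag come from the problem statement, and the model choice is illustrative rather than prescribed.

      import pandas as pd
      from sklearn.ensemble import HistGradientBoostingClassifier

      dev = pd.read_csv("Dev_data_to_be_shared.csv")          # assumed name after unzipping
      val = pd.read_csv("validation_data_to_be_shared.csv")   # assumed name after unzipping

      features = [c for c in dev.columns if c not in ("account_number", "bad_flag")]
      model = HistGradientBoostingClassifier()                # tolerates missing values natively
      model.fit(dev[features], dev["bad_flag"])

      submission = pd.DataFrame({
          "account_number": val["account_number"],
          "predicted_probability": model.predict_proba(val[features])[:, 1],
      })
      assert submission["predicted_probability"].between(0, 1).all()   # the stated sanity check
      submission.to_csv("submission.csv", index=False)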

  19. LBD Data used in validation experiments for the contrast method (8 discoveries related to NDs)

    • data-staging.niaid.nih.gov
    Updated Jan 21, 2022
    Cite
    Moreau, Erwan (2022). LBD Data used in validation experiments for the contrast method (8 discoveries related to NDs) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4895189
    Explore at:
    Dataset updated
    Jan 21, 2022
    Dataset provided by
    Trinity College Dublin
    Authors
    Moreau, Erwan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data used in the validation experiments for the "contrast method", a novel Literature-Based Discovery (LBD) approach.

    Paper: https://doi.org/10.1101/2021.09.22.461375

    Code: https://github.com/erwanm/lbd-contrast

    How to generate a similar dataset

    How to use this dataset to reproduce the LBD experiments

    Important: the raw data from which this dataset is derived was downloaded from Medline, PubMedCentral and PubTatorCentral, provided courtesy of the U.S. National Library of Medicine (NLM). The data was extracted in January 2021 and does not reflect the most current/accurate data available from NLM. See the instructions above in order to generate a similar dataset from up-to-date data.

  20. Data Repository for 'Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search'

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jun 24, 2020
    Cite
    Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman (2020). Data Repository for 'Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search' [Dataset]. http://doi.org/10.4211/hs.b8f87a7b680d44cebfb4b3f4f4a6a447
    Explore at:
    Available download formats: zip (8.3 MB)
    Dataset updated
    Jun 24, 2020
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 1, 1922 - Sep 30, 2016
    Area covered
    Description

    Policy search methods provide a heuristic mapping between observations and decisions and have been widely used in reservoir control studies. However, recent studies have observed a tendency for policy search methods to overfit to the hydrologic data used in training, particularly the sequence of flood and drought events. This technical note develops an extension of bootstrap aggregation (bagging) and cross-validation techniques, inspired by the machine learning literature, to improve control policy performance on out-of-sample hydrology. We explore these methods using a case study of Folsom Reservoir, California using control policies structured as binary trees and daily streamflow resampling based on the paleo-inflow record. Results show that calibration-validation strategies for policy selection and certain ensemble aggregation methods can improve out-of-sample tradeoffs between water supply and flood risk objectives over baseline performance given fixed computational costs. These results highlight the potential to improve policy search methodologies by leveraging well-established model training strategies from machine learning.
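    Schematically, the bagging idea described above might look like the sketch below; this is illustrative only (not the repository code), and the policy trainer is a stand-in for the binary-tree policy search.

      import numpy as np

      rng = np.random.default_rng(0)
      inflow_years = np.arange(1922, 2017)   # the period covered by the record

      def train_policy(resampled_years):
          """Stand-in for policy search on one bootstrap resample of the hydrology."""
          threshold = rng.uniform(0.4, 0.6)  # hypothetical learned parameter
          return lambda storage: 1.0 if storage > threshold else 0.2

      # Bagging: train one policy per bootstrap resample of the inflow record.
      policies = [train_policy(rng.choice(inflow_years, size=inflow_years.size, replace=True))
                  for _ in range(25)]

      def bagged_release(storage):
          # Ensemble aggregation: average the member policies' release decisions.
          return float(np.mean([p(storage) for p in policies]))

      print(bagged_release(0.55))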
