100+ datasets found

Machine learning algorithm validation with a limited sample size
plos.figshare.com
text/x-python
Updated May 30, 2023
Cite
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
Explore at:
Available download formats: text/x-python
Unique identifier
https://doi.org/10.1371/journal.pone.0224365
Dataset updated
May 30, 2023
Dataset provided by
PLOS, http://plos.org/
Authors
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
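The difference between plain K-fold CV and nested CV described above comes down to which samples feature selection and tuning are allowed to see. The following is a minimal sketch of the index bookkeeping only, not the authors' simulation code; the function names `k_fold_indices` and `nested_cv_splits` and the fold counts are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only
    from outer_train, so selection and tuning never see outer_test.
    """
    outer_folds = k_fold_indices(n_samples, outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        # Inner folds are positions into outer_train, mapped back to sample indices.
        inner_folds = [[outer_train[p] for p in fold]
                       for fold in k_fold_indices(len(outer_train), inner_k, seed + i + 1)]
        inner_splits = []
        for j, inner_val in enumerate(inner_folds):
            inner_train = [idx for m, fold in enumerate(inner_folds)
                           if m != j for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner_splits in nested_cv_splits(20):
    # Feature selection and tuning would use inner_splits only;
    # the score for this outer fold is computed once on outer_test.
    print(len(outer_train), len(outer_test), len(inner_splits))
```

Selecting features on all 20 samples before splitting would correspond to the pooled case that the description identifies as the larger source of bias.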
Text Function, Date, Data Validation
kaggle.com
zip
Updated Mar 15, 2024
Cite
Sanjana Murthy (2024). Text Function, Date, Data Validation [Dataset]. https://www.kaggle.com/sanjanamurthy392/text-function-date-data-validation
Explore at:
Available download formats: zip (25270 bytes)
Dataset updated
Mar 15, 2024
Authors
Sanjana Murthy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This data contains Text Function, Date, Data Validation.
Data from: Development and validation of HBV surveillance models using big...
tandf.figshare.com
docx
Updated Dec 3, 2024
Cite
Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong (2024). Development and validation of HBV surveillance models using big data and machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.25201473.v1
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.25201473.v1
Dataset updated
Dec 3, 2024
Dataset provided by
Taylor & Francis, https://taylorandfrancis.com/
Authors
Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital’s Hospital Information System were structuralized using state-of-the-art Natural Language Processing techniques. Several ML models have been used to develop HBV risk assessment models. The performance of the ML model was then interpreted using the Shapley value (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 using a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness of fit test p-value >0.05). Suspected case detection models of HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in primary care setting in China. 
This study has developed a suspected case detection model for HBV, which can facilitate early identification and treatment of HBV in the primary care setting in China, contributing towards the achievement of WHO’s elimination goals of HBV infections. We utilized state-of-the-art natural language processing techniques to structure the data records, leading to the development of a robust healthcare information system which enhances the surveillance and control of HBV in China.
PEN-Method: Predictor model and Validation Data
data.mendeley.com
narcis.nl
Updated Sep 3, 2021
Cite
Alex Halle (2021). PEN-Method: Predictor model and Validation Data [Dataset]. http://doi.org/10.17632/459f33wxf6.4
Explore at:
Unique identifier
https://doi.org/10.17632/459f33wxf6.4
Dataset updated
Sep 3, 2021
Authors
Alex Halle
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data contains the PEN-Predictor-Keras-Model as well as the 100 validation data sets.
Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Mar 31, 2025
Cite
nasa.gov (2025). Sensor Validation using Bayesian Networks - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sensor-validation-using-bayesian-networks
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA, http://nasa.gov/
Description
One of NASA’s key mission requirements is robust state estimation. Sensing, using a wide range of sensors and sensor fusion approaches, plays a central role in robust state estimation, and there is a need to diagnose sensor failure as well as component failure. Sensor validation techniques address this problem: given a vector of sensor readings, decide whether sensors have failed, therefore producing bad data. We take in this paper a probabilistic approach, using Bayesian networks, to diagnosis and sensor validation, and investigate several relevant but slightly different Bayesian network queries. We emphasize that on-board inference can be performed on a compiled model, giving fast and predictable execution times. Our results are illustrated using an electrical power system, and we show that a Bayesian network with over 400 nodes can be compiled into an arithmetic circuit that can correctly answer queries in less than 500 microseconds on average. Reference: O. J. Mengshoel, A. Darwiche, and S. Uckun, "Sensor Validation using Bayesian Networks." In Proc. of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08), Los Angeles, CA, 2008. BibTex Reference: @inproceedings{mengshoel08sensor, author = {Mengshoel, O. J. and Darwiche, A. and Uckun, S.}, title = {Sensor Validation using {Bayesian} Networks}, booktitle = {Proceedings of the 9th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (iSAIRAS-08)}, year = {2008} }
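The core query described above (given sensor readings, decide whether a sensor has failed) can be illustrated with the smallest possible Bayesian network: one hidden "sensor failed" node and one observed "reading is anomalous" node. This is only a toy sketch of Bayes' rule, not the compiled arithmetic-circuit model from the paper, and the probabilities below are hypothetical values chosen for illustration.

```python
def posterior_sensor_failed(prior_failed, p_anomaly_given_failed, p_anomaly_given_ok):
    """Return P(sensor failed | anomalous reading) via Bayes' rule."""
    # Total probability of seeing an anomalous reading at all.
    evidence = (p_anomaly_given_failed * prior_failed
                + p_anomaly_given_ok * (1.0 - prior_failed))
    return p_anomaly_given_failed * prior_failed / evidence


# Hypothetical numbers: 1% prior failure rate, failed sensors read
# anomalously 90% of the time, healthy sensors 5% of the time.
posterior = posterior_sensor_failed(0.01, 0.90, 0.05)
print(f"P(failed | anomaly) = {posterior:.3f}")
```

With a low prior failure rate, a single anomalous reading leaves the posterior well below one half, which is one reason a real system would combine evidence from many sensors and components.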
Forage Fish Aerial Validation Data from Prince William Sound, Alaska
catalog.data.gov
data.usgs.gov
Updated Nov 27, 2025
Cite
U.S. Geological Survey (2025). Forage Fish Aerial Validation Data from Prince William Sound, Alaska [Dataset]. https://catalog.data.gov/dataset/forage-fish-aerial-validation-data-from-prince-william-sound-alaska
Explore at:
Dataset updated
Nov 27, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Alaska, Prince William Sound
Description
One table with data used to validate aerial fish surveys in Prince William Sound, Alaska. Data includes: date, location, latitude, longitude, aerial ID, validation ID, total length and validation method. Various catch methods were used to obtain fish samples for aerial validations, including: cast net, GoPro, hydroacoustics, jig, dip net, gillnet, purse seine, photo and visual identification.
FDA Drug Product Labels Validation Method Data Package
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). FDA Drug Product Labels Validation Method Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/fda-drug-product-labels-validation-method-data-package/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Description
This data package contains information on Structured Product Labeling (SPL) Terminology for SPL validation procedures and information on performing SPL validations.
Data from: Selection of optimal validation methods for quantitative...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Cite
K. Héberger (2023). Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain [Dataset]. http://doi.org/10.6084/m9.figshare.23185916.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.23185916.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis, https://taylorandfrancis.com/
Authors
K. Héberger
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This brief literature survey groups the (numerical) validation methods and emphasizes the contradictions and confusion concerning bias, variance and predictive performance. A multicriteria decision-making analysis has been made using the sum of absolute ranking differences (SRD), illustrated with five case studies (seven examples). SRD was applied to compare external and cross-validation techniques and indicators of predictive performance, and to select optimal methods to determine the applicability domain (AD). The ordering of model validation methods was in accordance with the statements of the original authors, but these statements contradict one another, suggesting that any variant of cross-validation can be superior or inferior to other variants depending on the algorithm, data structure and circumstances applied. A simple fivefold cross-validation proved to be superior to the Bayesian Information Criterion in the vast majority of situations. It is simply not sufficient to test a numerical validation method in one situation only, even if it is a well-defined one. SRD as a preferable multicriteria decision-making algorithm is suitable for tailoring the techniques for validation, and for the optimal determination of the applicability domain according to the dataset in question.
Data from: Development of a Mobile Robot Test Platform and Methods for...
data.nasa.gov
s.cnmilf.com
Updated Mar 31, 2025
Cite
nasa.gov (2025). Development of a Mobile Robot Test Platform and Methods for Validation of Prognostics-Enabled Decision Making Algorithms [Dataset]. https://data.nasa.gov/dataset/development-of-a-mobile-robot-test-platform-and-methods-for-validation-of-prognostics-enab
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA, http://nasa.gov/
Description
As fault diagnosis and prognosis systems in aerospace applications become more capable, the ability to utilize information supplied by them becomes increasingly important. While certain types of vehicle health data can be effectively processed and acted upon by crew or support personnel, others, due to their complexity or time constraints, require either automated or semi-automated reasoning. Prognostics-enabled Decision Making (PDM) is an emerging research area that aims to integrate prognostic health information and knowledge about the future operating conditions into the process of selecting subsequent actions for the system. The newly developed PDM algorithms require suitable software and hardware platforms for testing under realistic fault scenarios. The paper describes the development of such a platform, based on the K11 planetary rover prototype. A variety of injectable fault modes are being investigated for electrical, mechanical, and power subsystems of the testbed, along with methods for data collection and processing. In addition to the hardware platform, a software simulator with matching capabilities has been developed. The simulator allows for prototyping and initial validation of the algorithms prior to their deployment on the K11. The simulator is also available to the PDM algorithms to assist with the reasoning process. A reference set of diagnostic, prognostic, and decision making algorithms is also described, followed by an overview of the current test scenarios and the results of their execution on the simulator.
Email Validation Tools Report
marketresearchforecast.com
doc, pdf, ppt
Updated Jul 25, 2025
Cite
Market Research Forecast (2025). Email Validation Tools Report [Dataset]. https://www.marketresearchforecast.com/reports/email-validation-tools-549597
Explore at:
Available download formats: ppt, pdf, doc
Dataset updated
Jul 25, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The email validation tools market is experiencing robust growth, driven by the increasing need for businesses to maintain clean and accurate email lists for effective marketing campaigns. The rising adoption of email marketing as a primary communication channel, coupled with stricter data privacy regulations like GDPR and CCPA, necessitates the use of tools that ensure email deliverability and prevent bounces. This market, estimated at $500 million in 2025, is projected to grow at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $1.5 billion by 2033. This expansion is fueled by the growing sophistication of email validation techniques, including real-time verification, syntax checks, and mailbox monitoring, offering businesses more robust solutions to improve their email marketing ROI. Key market segments include small and medium-sized businesses (SMBs), large enterprises, and email marketing agencies, each exhibiting varying levels of adoption and spending based on their specific needs and email marketing strategies. The competitive landscape is characterized by a mix of established players and emerging startups, offering a range of features and pricing models to cater to diverse customer requirements. The market's growth is, however, subject to factors like increasing costs associated with maintaining data accuracy and the potential for false positives in email verification. The key players in this dynamic market, such as Mailgun, BriteVerify, and similar companies, are continuously innovating to improve accuracy, speed, and integration with other marketing automation platforms. The market's geographical distribution is diverse, with North America and Europe currently holding significant market share due to higher email marketing adoption rates and a robust technological infrastructure. 
However, Asia-Pacific and other emerging markets are poised for considerable growth in the coming years due to increasing internet penetration and rising adoption of digital marketing techniques. The ongoing evolution of email marketing strategies, the increasing emphasis on data hygiene, and the rise of artificial intelligence in email verification are likely to further shape the trajectory of this market in the years to come, leading to further innovation and growth.
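The market-size projection quoted above can be checked with the standard compound-growth formula, value = start × (1 + rate)^years. The sketch below uses only the figures stated in the description ($500 million in 2025, 15% CAGR, through 2033); the function name `project_value` is an illustrative assumption.

```python
def project_value(start_value, annual_rate, years):
    """Compound start_value at annual_rate for the given number of years."""
    return start_value * (1.0 + annual_rate) ** years


# Figures from the report description, in millions of US dollars.
projected = project_value(500.0, 0.15, 2033 - 2025)
print(f"Projected 2033 market size: ${projected:,.1f} million")
```

Eight years of 15% growth roughly triples the starting value, which is consistent with the description's figure of approximately $1.5 billion.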
Icing Validation Database
dataverse.no
dataverse.azure.uit.no
tsv, txt, xlsx
Updated Sep 28, 2023
Cite
Richard Hann; Nicolas Müller (2023). Icing Validation Database [Dataset]. http://doi.org/10.18710/5XYALW
Explore at:
Available download formats: txt (2378), xlsx (35849), tsv (35777)
Unique identifier
https://doi.org/10.18710/5XYALW
Dataset updated
Sep 28, 2023
Dataset provided by
DataverseNO
Authors
Richard Hann; Nicolas Müller
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
1983 - 2019
Description
This database contains an overview of experimental datasets that can be used for the validation of ice prediction simulation methods. This database was generated for the 1st AIAA Ice Prediction Workshop, scheduled for 2021. The database contains entries on 71 experimental datasets in the literature. For each entry, a series of parameters have been identified, including the investigated geometries, Reynolds numbers, Mach numbers, and icing envelopes.
Data from: Summary report of the 4th IAEA Technical Meeting on Fusion Data...
dataone.org
dataverse.harvard.edu
Updated Sep 24, 2024
Cite
S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege (2024). Summary report of the 4th IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis (FDPVA) [Dataset]. http://doi.org/10.7910/DVN/ZZ9UKO
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ZZ9UKO
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege
Description
The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform on which a set of topics relevant to fusion data processing, validation and analysis could be discussed with a view to extrapolating needs to next-step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlights trends observed in current major fusion confinement devices. A special focus is placed on data analysis requirements for ITER and DEMO, with particular attention paid to Artificial Intelligence for automating control processes and improving their reliability.
GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1
data.nasa.gov
datasets.ai
Updated Mar 31, 2025
Cite
nasa.gov (2025). GPM GROUND VALIDATION NOAA CPC MORPHING TECHNIQUE (CMORPH) IFLOODS V1 [Dataset]. https://data.nasa.gov/dataset/gpm-ground-validation-noaa-cpc-morphing-technique-cmorph-ifloods-v1
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA, http://nasa.gov/
Description
The GPM Ground Validation NOAA CPC Morphing Technique (CMORPH) IFloodS dataset consists of global precipitation analyses data produced by the NOAA Climate Prediction Center (CPC). The Iowa Flood Studies (IFloodS) campaign was a ground measurement campaign that took place in eastern Iowa from May 1 to June 15, 2013. The goals of the campaign were to collect detailed measurements of precipitation at the Earth's surface using ground instruments and advanced weather radars and, simultaneously, collect data from satellites passing overhead. The CPC morphing technique uses precipitation estimates from low orbiter satellite microwave observations to produce global precipitation analyses at a high temporal and spatial resolution. Data has been selected for the Iowa Flood Studies (IFloodS) field campaign which took place from April 1, 2013 to June 30, 2013. The dataset includes both the near real-time raw data and bias corrected data from NOAA in binary and netCDF format.
VQA Validation Dataset
kaggle.com
zip
Updated Sep 5, 2021
Cite
Mario Dias (2021). VQA Validation Dataset [Dataset]. https://www.kaggle.com/itsmariodias/vqa-validation-dataset
Explore at:
Available download formats: zip (6615743858 bytes)
Dataset updated
Sep 5, 2021
Authors
Mario Dias
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

Validation Data from VQA v2 Dataset - https://visualqa.org/index.html Validation images - 40,504 images Validation questions 2017 v2.0 - 214,354 questions Validation annotations 2017 v2.0 - 2,143,540 answers

This dataset is provided only for ease of use in Kaggle environments. I do not take any credit for the creation of the dataset. The annotations in this dataset belong to the VQA Consortium and are licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2015, VQA Consortium. All rights reserved. Redistribution and use software in source and binary form, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the VQA Consortium nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE AND ANNOTATIONS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Restaurant_management_system
kaggle.com
zip
Updated Nov 26, 2025
Cite
Vaibhav2702 (2025). Restaurant_management_system [Dataset]. https://www.kaggle.com/datasets/vaibhav2702/restaurant-management-system
Explore at:
Available download formats: zip (64116 bytes)
Dataset updated
Nov 26, 2025
Authors
Vaibhav2702
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This project contains a Restaurant Orders Dataset with 150+ rows and includes Excel-cleaned data, Python OOP classes for starter, real menu, registration, login, and billing, and SQL scripts with triggers, stored procedures, and views for data validation. It is suitable for learning data analysis, database management, and Python programming.
Bookshelf data
orda.shef.ac.uk
zip
Updated Sep 8, 2017
Cite
Malcolm Hepburne Scott; Robert Barthorpe; David Wagg; Keith Worden (2017). Bookshelf data [Dataset]. http://doi.org/10.15131/shef.data.5384275.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.15131/shef.data.5384275.v1
Dataset updated
Sep 8, 2017
Dataset provided by
The University of Sheffield
Authors
Malcolm Hepburne Scott; Robert Barthorpe; David Wagg; Keith Worden
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the data and models associated with the paper "A comparison of validation techniques for a nonlinear bifurcating system". A guide to the data is contained in the Word document "data guide".
Thermal Validation System Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jul 9, 2025
Cite
Data Insights Market (2025). Thermal Validation System Report [Dataset]. https://www.datainsightsmarket.com/reports/thermal-validation-system-1521853
Explore at:
Available download formats: ppt, doc, pdf
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global thermal validation system market is experiencing robust growth, driven by increasing regulatory scrutiny across pharmaceutical, biotechnology, and food processing industries. Stringent quality control standards and the need for accurate temperature monitoring throughout the manufacturing and storage processes are key factors fueling market expansion. The market is segmented by system type (e.g., autoclaves, ovens, incubators), application (pharmaceutical, food & beverage, etc.), and end-user (contract research organizations, pharmaceutical manufacturers, etc.). Technological advancements, such as the integration of IoT sensors and cloud-based data analysis, are enhancing the capabilities of thermal validation systems, leading to improved efficiency and data management. Furthermore, the rising demand for sophisticated validation techniques to comply with international regulations like GMP and FDA guidelines is further bolstering market growth. We estimate the 2025 market size to be approximately $850 million, growing at a Compound Annual Growth Rate (CAGR) of 7% from 2025 to 2033. This growth reflects the increasing adoption of advanced technologies and the expanding regulatory landscape in key regions like North America and Europe. Competition in the thermal validation system market is intense, with several established players and emerging companies vying for market share. Key players like Kaye, Ellab, and Thermo Fisher Scientific are leveraging their strong brand reputation and technological expertise to maintain market leadership. However, smaller, specialized firms are also gaining traction by offering niche solutions and innovative technologies. The market is expected to witness further consolidation in the coming years, with strategic acquisitions and partnerships playing a crucial role in shaping the competitive landscape. 
Geographic expansion, particularly in emerging markets in Asia-Pacific and Latin America, represents a significant growth opportunity for market participants. The restraints to growth include the high initial investment cost associated with implementing thermal validation systems and the need for skilled personnel to operate and maintain these systems.
Credit Card Behaviour Score
kaggle.com
zip
Updated Jan 10, 2025
Cite
Suvradeep (2025). Credit Card Behaviour Score [Dataset]. https://www.kaggle.com/datasets/suvroo/credit-card-behaviour-score/code
Explore at:
Available download formats: zip (72966946 bytes)
Dataset updated
Jan 10, 2025
Authors
Suvradeep
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
Bank A issues Credit Cards to eligible customers. The Bank deploys advanced ML models and frameworks to decide on eligibility, limit, and interest rate assignment. The models and frameworks are optimized to manage early risk and ensure profitability. The Bank has now decided to build a robust risk management framework for its existing Credit Card customers, irrespective of when they were acquired. To enable this, the Bank has decided to create a “Behaviour Score”. A Behaviour Score is a predictive model. It is developed on a base of customers whose Credit Cards are open and are not past due. The model predicts the probability of customers defaulting on the Credit Cards going forward. This model will then be used for several portfolio risk management activities.

Problem statement

Your objective is to develop the Behaviour Score for Bank A.

Datasets

You have been provided with a random sample of 96,806 Credit Card details in “Dev_data_to_be_shared.zip”, along with a flag (bad_flag) – henceforth known as “development data”. This is a historical snapshot of the Credit Card portfolio of Bank A. Credit Cards that have actually defaulted have bad_flag = 1. You have also been provided with several independent variables. These include:
• On us attributes like credit limit (variables with names starting with onus_attributes)
• Transaction level attributes like number of transactions / rupee value transactions on various kinds of merchants (variables with names starting with transaction_attribute)
• Bureau tradeline level attributes (like product holdings, historical delinquencies) – variables starting with bureau
• Bureau enquiry level attributes (like PL enquiries in the last 3 months etc) – variables starting with bureau_enquiry
You have also been provided with another random sample of 41,792 Credit Card details in “validation_data_to_be_shared.zip” with the same set of input variables, but without “bad_flag”. This will be referred to going forward as “validation data”.

Requirements

Using the data provided, you will have to come up with a way to predict the probability that a given Credit Card customer will default. You can use the development data for this purpose. You are then required to use the same logic to predict the probability of default for all the Credit Cards which are a part of the validation data. Your submission should contain two columns – the Primary key from the validation data (account_number), and the predicted probability against that account. You are also required to submit detailed documentation of this exercise. A good document should contain details about your approach, including a write-up on any algorithms that you use. You should then cover each of the steps that you have followed in as much detail as you can. You should then move on to any key insights or observations that you have come across in the data provided to you. Finally, you should write about what metrics you have used to measure the effectiveness of the approach that you have followed.
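The two-column submission described above could be assembled along the following lines. This is only a minimal sketch: the column name predicted_probability, the helper build_submission, and the sample account numbers and scores are assumptions made for illustration. Only account_number is named in the brief.

```python
import csv
import io


def build_submission(account_numbers, probabilities):
    """Return CSV text with one row per account: primary key, predicted probability.

    The header name "predicted_probability" is an assumption; the brief only
    names the primary key column (account_number).
    """
    if len(account_numbers) != len(probabilities):
        raise ValueError("each account needs exactly one predicted probability")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["account_number", "predicted_probability"])
    for account, prob in zip(account_numbers, probabilities):
        # Fixed-point formatting avoids scientific notation for very small values.
        writer.writerow([account, f"{prob:.6f}"])
    return buffer.getvalue()


# Hypothetical accounts and scores, standing in for real model output.
submission_csv = build_submission([101, 102, 103], [0.02, 0.5, 0.013])
print(submission_csv)
```

In practice the probabilities would come from whatever model was fitted on the development data; the sketch covers only the file layout.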

Evaluation

As detailed in the previous section, you are required to submit the Primary key and predicted probabilities of all the accounts provided to you in the validation data, as well as the documentation. We will only evaluate submissions that are complete and pass sanity checks (for example, probability values should be between 0 and 1). Submissions will be evaluated on the basis of how close the predicted probabilities are to the actual outcome. We will also evaluate the documentation on the basis of its completeness and accuracy. Extra points will be granted to submissions that include interesting insights / observations on the data provided.
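The brief does not say which metric measures how close the predicted probabilities are to the actual outcome. As an illustration only, the sketch below pairs the stated sanity check (values between 0 and 1) with two common choices, the Brier score and log loss. The function names and the sample values are assumptions, not part of the brief.

```python
import math


def check_probabilities(probabilities):
    """Return True when every value is a finite number in the range [0, 1]."""
    return all(
        isinstance(p, (int, float)) and math.isfinite(p) and 0.0 <= p <= 1.0
        for p in probabilities
    )


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


def log_loss(probabilities, outcomes, eps=1e-15):
    """Average negative log-likelihood; probabilities are clipped to avoid log(0)."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(outcomes)


# Hypothetical hold-out predictions and their bad_flag values.
predicted = [0.05, 0.80, 0.10, 0.40]
actual = [0, 1, 0, 1]
print(check_probabilities(predicted))
print(brier_score(predicted, actual))
print(log_loss(predicted, actual))
```

Lower is better for both metrics. Because the validation data carries no bad_flag, such scores can only be computed on a hold-out portion of the development data.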
LBD Data used in validation experiments for the contrast method (8...
data-staging.niaid.nih.gov
Updated Jan 21, 2022
Cite
Moreau, Erwan (2022). LBD Data used in validation experiments for the contrast method (8 discoveries related to NDs) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4895189
Dataset updated
Jan 21, 2022
Dataset provided by
Trinity College Dublin
Authors
Moreau, Erwan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the data used in the validation experiments for the "contrast method", a novel Literature-Based Discovery (LBD) approach.

Paper: https://doi.org/10.1101/2021.09.22.461375

Code: https://github.com/erwanm/lbd-contrast

How to generate a similar dataset

How to use this dataset to reproduce the LBD experiments

Important: the raw data from which this data is derived was downloaded from Medline, PubMedCentral and PubTatorCentral, provided courtesy of the U.S. National Library of Medicine (NLM). The data was extracted in January 2021 and does not reflect the most current/accurate data available from NLM. See the instructions above in order to generate a similar dataset from up-to-date data.
Data Repository for 'Bootstrap aggregation and cross-validation methods to...
hydroshare.org
beta.hydroshare.org
zip
Updated Jun 24, 2020
Cite
Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman (2020). Data Repository for 'Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search' [Dataset]. http://doi.org/10.4211/hs.b8f87a7b680d44cebfb4b3f4f4a6a447
Available download formats: zip (8.3 MB)
Unique identifier
https://doi.org/10.4211/hs.b8f87a7b680d44cebfb4b3f4f4a6a447
Dataset updated
Jun 24, 2020
Dataset provided by
HydroShare
Authors
Zachary Paul Brodeur; Scott S. Steinschneider; Jonathan D. Herman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 1, 1922 - Sep 30, 2016
Description
Policy search methods provide a heuristic mapping between observations and decisions and have been widely used in reservoir control studies. However, recent studies have observed a tendency for policy search methods to overfit to the hydrologic data used in training, particularly the sequence of flood and drought events. This technical note develops an extension of bootstrap aggregation (bagging) and cross-validation techniques, inspired by the machine learning literature, to improve control policy performance on out-of-sample hydrology. We explore these methods using a case study of Folsom Reservoir, California using control policies structured as binary trees and daily streamflow resampling based on the paleo-inflow record. Results show that calibration-validation strategies for policy selection and certain ensemble aggregation methods can improve out-of-sample tradeoffs between water supply and flood risk objectives over baseline performance given fixed computational costs. These results highlight the potential to improve policy search methodologies by leveraging well-established model training strategies from machine learning.
