100+ datasets found

f
Performance of ML models on test data.
plos.figshare.com
xls
Updated Oct 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pgph.0002475.t005
Dataset updated
Oct 31, 2023
Dataset provided by
PLOS Global Public Health
Authors
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
g
TEAMER: Experimental Validation and Analysis of Deep Reinforcement Learning...
gimi9.com
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). TEAMER: Experimental Validation and Analysis of Deep Reinforcement Learning Control for Wave Energy Converters | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_teamer-experimental-validation-and-analysis-of-deep-reinforcement-learning-control-for-wav/
Explore at:
Dataset updated
Jul 1, 2025
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Through this TEAMER project, Michigan Technological University (MTU) collaborated with Oregon State University (OSU) to test the performance of a Deep Reinforcement Learning (DRL) control in the wave tank. Unlike model-based controls, DRL control is model-free and can directly maximize the performance of the Wave Energy Converter (WEC) in terms of power production, regardless of system complexity. While DRL control has demonstrated promising performance in previous studies, this project aimed to (1) evaluate the practical performance of DRL control and (2) identify the challenges and limitations associated with its practical implementation. To investigate the real-world performance of DRL-based control, the controller was trained with the LUPA numerical model using MATLAB/Simulink Deep Learning Toolbox and implemented on the Laboratory Upgrade Point Absorber (LUPA) device developed by the facility at OSU. A series of regular and irregular wave tests were conducted to evaluate the power harvested by the DRL control across different wave conditions, using various observation state selections, and incorporating a reward function that includes a penalty on the PTO force. The dataset consists of six main parts: (1) the Post Access Report (2) the test log containing the test ID, description, test data filename, wave data filename, wave condition, test notes for all conducted LUPA Testing Data (3) the tank testing results as described in the DRL Test Log (4) the model used for retraining the DRL control and associated results (5) the model used for pre-training the DRL control and associated results (6) the scripts used for processing the data (7) A readme file to indicate the folder contents and structure within the resources "LUPA Pretraining Data.zip", "LUPA Retraining Data.zip", and "ScriptsForPostProcessing.zip" This testing was funded by TEAMER RFTS 10 (request for technical support) program.
f
Data from: Analysis, Modeling, and Target-Specific Predictions of Linear...
acs.figshare.com
xlsx
Updated Nov 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Boris Vishnepolsky; Maya Grigolava; Andrei Gabrielian; Alex Rosenthal; Darrell Hurt; Michael Tartakovsky; Malak Pirtskhalava (2023). Analysis, Modeling, and Target-Specific Predictions of Linear Peptides Inhibiting Virus Entry [Dataset]. http://doi.org/10.1021/acsomega.3c07521.s001
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acsomega.3c07521.s001
Dataset updated
Nov 24, 2023
Dataset provided by
ACS Publications
Authors
Boris Vishnepolsky; Maya Grigolava; Andrei Gabrielian; Alex Rosenthal; Darrell Hurt; Michael Tartakovsky; Malak Pirtskhalava
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Antiviral peptides (AVPs) are bioactive peptides that exhibit the inhibitory activity against viruses through a range of mechanisms. Virus entry inhibitory peptides (VEIPs) make up a specific class of AVPs that can prevent envelope viruses from entering cells. With the growing number of experimentally verified VEIPs, there is an opportunity to use machine learning to predict peptides that inhibit the virus entry. In this paper, we have developed the first target-specific prediction model for the identification of new VEIPs using, along with the peptide sequence characteristics, the attributes of the envelope proteins of the target virus, which overcomes the problem of insufficient data for particular viral strains and improves the predictive ability. The model’s performance was evaluated through 10 repeats of 10-fold cross-validation on the training data set, and the results indicate that it can predict VEIPs with 87.33% accuracy and Matthews correlation coefficient (MCC) value of 0.76. The model also performs well on an independent test set with 90.91% accuracy and MCC of 0.81. We have also developed an automatic computational tool that predicts VEIPs, which is freely available at https://dbaasp.org/tools?page=linear-amp-prediction.
d
Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...
datarade.ai
.json, .csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
Explore at:
.json, .csvAvailable download formats
Dataset provided by
Xverum LLC
Authors
Xverum
Area covered
Sint Maarten (Dutch part), Cook Islands, Western Sahara, Barbados, Jordan, Oman, Norway, India, United Kingdom, Dominican Republic
Description
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

What Makes Our Data Unique?

Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

Primary Use Cases and Verticals

Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

Contact us for sample datasets or to discuss your specific needs.
D
Data Annotation and Model Validation Platform Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). Data Annotation and Model Validation Platform Report [Dataset]. https://www.archivemarketresearch.com/reports/data-annotation-and-model-validation-platform-36064
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
Feb 18, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Market Analysis The global Data Annotation and Model Validation Platform market is projected to grow from a staggering XXX million dollars in 2023 to an estimated XXX million dollars by 2033, exhibiting a robust CAGR of XX% during the forecast period of 2025-2033. This remarkable growth is primarily attributed to the increasing adoption of artificial intelligence (AI), machine learning (ML), and computer vision across various industries, such as healthcare, automotive, and manufacturing. The need for high-quality data annotation and efficient model validation tools is driving the demand for these platforms, ensuring the accuracy and reliability of AI and ML models. Key Trends and Drivers The Data Annotation and Model Validation Platform market is witnessing several key trends that are shaping its growth trajectory. The increasing need for labeled and annotated datasets for AI training is a major driver, as it enables models to learn complex patterns and make accurate predictions. Furthermore, the advancements in AI algorithms and the rising adoption of cloud computing are expanding the capabilities of these platforms, allowing for faster and more efficient data annotation and model validation. Additionally, the proliferation of IoT devices and the growing volume of data generated are driving the need for scalable and automated data annotation solutions. The market is also benefiting from government initiatives and funding for AI research and development, as well as the increasing collaboration between industry players and academia.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
d
Data from: Summary report of the 4th IAEA Technical Meeting on Fusion Data...
dataone.org
search.dataone.org
Updated Sep 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege (2024). Summary report of the 4th IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis (FDPVA) [Dataset]. http://doi.org/10.7910/DVN/ZZ9UKO
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ZZ9UKO
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
S.M. Gonzalez de Vicente, D. Mazon, M. Xu, S. Pinches, M. Churchill, A. Dinklage, R. Fischer, A. Murari, P. Rodriguez-Fernandez, J. Stillerman, J. Vega, G. Verdoolaege
Description
The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform during which a set of topics relevant to fusion data processing, validation and analysis are discussed with the view of extrapolating needs to next step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlight trends observed in current major fusion confinement devices. A special focus is dedicated on data analysis requirements for ITER and DEMO with a particular attention paid to Artificial Intelligence for automatization and improving reliability of control processes.
D
Data Annotation Tool Software Report
archivemarketresearch.com
doc, pdf, ppt
Updated Jun 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). Data Annotation Tool Software Report [Dataset]. https://www.archivemarketresearch.com/reports/data-annotation-tool-software-564768
Explore at:
doc, pdf, pptAvailable download formats
Dataset updated
Jun 3, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Data Annotation Tool Software market is experiencing robust growth, driven by the increasing demand for high-quality training data in the burgeoning fields of artificial intelligence (AI) and machine learning (ML). The market, estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The rising adoption of AI and ML across diverse industries, including automotive, healthcare, and finance, necessitates large volumes of accurately annotated data for model training and validation. Furthermore, advancements in automation and the emergence of sophisticated annotation tools are streamlining the data annotation process, reducing costs and improving efficiency. The market is also witnessing a shift towards cloud-based solutions, offering scalability and accessibility to a wider range of users. However, challenges remain, such as the need for skilled annotators and the complexities associated with handling diverse data formats and annotation requirements. The competitive landscape is dynamic, with a mix of established players and emerging startups vying for market share, leading to continuous innovation and improvements in data annotation technologies. The segmentation of the Data Annotation Tool Software market is primarily based on functionality (image, text, video, audio annotation), deployment model (cloud-based, on-premise), and industry vertical (automotive, healthcare, etc.). The prominent players, including Appen Limited, CloudApp, Cogito Tech LLC, and others mentioned, are actively investing in research and development to enhance their offerings and expand their market reach. Regional variations exist, with North America and Europe currently holding a significant market share, but growth is expected in Asia-Pacific and other emerging regions as AI adoption accelerates. The ongoing evolution of deep learning techniques and the increasing complexity of AI models will further stimulate the demand for sophisticated data annotation tools, thus perpetuating the market's upward trajectory throughout the forecast period.
On the spatial distance between training and validation data in model...
figshare.com
application/x-rar
Updated Jul 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
authors a (2025). On the spatial distance between training and validation data in model evaluation: data and codes [Dataset]. http://doi.org/10.6084/m9.figshare.29555306.v1
Explore at:
application/x-rarAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29555306.v1
Dataset updated
Jul 14, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
authors a
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and codes in the paper "On the spatial distance between training and validation data in model evaluation".The "Data" folder contains response variables, explanatory variables, and simulated data.The code folder contains:"sim case.R": The code for simulation analysis."spatial random sim.R": The code for generating the simulated data."compute_sliding_window_metrics.R": The code for the moving distance validation model."fun spatially weighted validation.R": The code for calculating the spatially weighted indicators."Modelling.R": The code for the case study.
D
Data Annotation and Model Validation Platform Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Data Annotation and Model Validation Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/data-annotation-and-model-validation-platform-1945496
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
May 13, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Data Annotation and Model Validation Platform market is experiencing robust growth, driven by the escalating demand for high-quality AI models across diverse sectors. The increasing complexity of AI algorithms necessitates rigorous validation and testing, fueling the adoption of specialized platforms that streamline these processes. The market, estimated at $2 billion in 2025, is projected to grow at a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an impressive $10 billion by 2033. This growth is fueled by several key factors: the expanding adoption of AI and machine learning across industries, the rising need for accurate data annotation to train effective AI models, and the increasing focus on ensuring the reliability and trustworthiness of AI systems. Key application segments include Computer Vision, Artificial Intelligence, and Machine Learning, with Quality Assurance for AI models and AI Model Validation and Performance Analysis Software dominating the types segment. North America currently holds the largest market share, driven by early adoption of AI technologies and the presence of major technology companies. However, the Asia-Pacific region is poised for significant growth, fueled by rapid technological advancements and a burgeoning AI ecosystem. The competitive landscape is dynamic, featuring a mix of established players and emerging startups. Established companies like iMerit and CloudFactory offer comprehensive data annotation services, while others such as Labelbox and Explosion AI focus on specific aspects of model validation. The market is characterized by ongoing innovation, with companies constantly developing new tools and techniques to improve the accuracy, efficiency, and scalability of data annotation and model validation. Future growth will be influenced by advancements in automation, the integration of cloud-based platforms, and the increasing demand for explainable AI, which requires robust validation processes. The adoption of ethical AI practices and regulations will also play a crucial role in shaping the market trajectory. Strategic partnerships and acquisitions are anticipated to further consolidate the market and accelerate innovation.
Sample, test, and validation data for findmycells
zenodo.org
zip
Updated Feb 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dennis Segebarth; Dennis Segebarth (2023). Sample, test, and validation data for findmycells [Dataset]. http://doi.org/10.5281/zenodo.7655292
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7655292
Dataset updated
Feb 20, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dennis Segebarth; Dennis Segebarth
License
Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
findmycells is an open source python package, developed to foster the use of deep-learning based python tools for bioimage analysis, specifically for researchers with limited python coding experience. It is developed and maintained in the following GitHub repository: https://github.com/Defense-Circuits-Lab/findmycells

Disclaimer: All data (including the model ensemble) uploaded here serve solely as a test dataset for findmycells and are not intended for any other purposes.

For instance, the group, subgroup, or subject IDs don´t refer to the actual experimental conditions. Likewise, also the included ROI-files were only created to allow the testing of findmycells and may not live up to scientific standards. Furthermore, the image data represents a subset of a dataset that is already published here:

Segebarth, Dennis et al. (2020), Data from: On the objectivity, reliability, and validity of deep learning enabled bioimage analyses, Dryad, Dataset, https://doi.org/10.5061/dryad.4b8gtht9d

The model ensemble (cfos_ensemble.zip) was trained using deepflash2 (v 0.1.7)

Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.

The training was performed on a subset of the "lab-wue1" training dataset, using only the 27 images with IDs 0000 - 0099 (cfos_training_images.zip) and the corresponding est. GT masks (cfos_training_masks.zip). The images used in "cfos_fmc_test_project.zip" for the actual testing of findmycells are the images with the IDs 0100, 0106, 0149, and 0152 of the aforementioned "lab-wue1" training dataset. They were randomly distributed to the made-up subject folders and renamed to "dentate_gyrus_01" or "dentate_gyrus_02".
Replication Package for 'Data-Driven Analysis and Optimization of Machine...
zenodo.org
zip
Updated Jun 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15643706
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Joel Castaño; Joel Castaño
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.

The framework considers the trade-offs between three key objectives:

1. Performance (maximizing throughput)

2. Energy Efficiency (minimizing estimated energy per unit)

3. Cost (minimizing estimated hardware cost)

Repository Structure

This repository is organized as follows:

Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.

Dataset_Extension.ipynb : A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces the Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.

Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.

Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.

Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.

eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.

requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.

eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.

optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.

pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.

shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

Requirements and Installation

To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.

1. Clone the repository:

bash

git clone

cd

2. **Create and activate a virtual environment (optional but recommended):

bash

python -m venv venv

source venv/bin/activate # On Windows, use `venv\Scripts\activate`

3. Install the required packages:

All dependencies are listed in the `requirements.txt` file. Install them using pip:

bash

pip install -r requirements.txt

Step-by-Step Reproduction Workflow

The notebooks are designed to be run in a logical sequence.

Step 1: Data Enrichment (Optional)

The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

Step 2: Exploratory Data Analysis (Optional)

All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

Step 3: Main Model Training, Validation, and Recommendation

This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:

It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.

It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.

It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.

It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
m
Ransomware and user samples for training and validating ML models
data.mendeley.com
Updated Sep 17, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
Explore at:
Unique identifier
https://doi.org/10.17632/yhg5wk39kf.2
Dataset updated
Sep 17, 2021
Authors
Eduardo Berrueta
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ransomware is considered as a significant threat for most enterprises since past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.

The folders are named NxSy where x is the number of 1-second interval per sample and y the sliding step in seconds.

Each folder (for example N10S10/) contains: - tree.py -> Python script with the Tree model. - ensemble.json -> JSON file with the information about the Ensemble model. - NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3). - N10S10.csv -> All samples used for training each model in this folder. It is in csv format for using in bigML application. - zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for using in bigML application. - userSamples_test -> All samples used for validating each model in this folder. It is in csv format for using in bigML application. - userSamples_train -> User samples used for training the models. - ransomware_train -> Ransomware samples used for training the models - scaler.scaler -> Standard Scaler from python library used for scale the samples. - zeroDays_notFiltered -> Folder with the zeroDay samples.

In the case of N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but it is because some of them are not "unseen" binaries (the families are present in the training set).

The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows: - Each line is one sample. - Each sample has 3*T features and the label (1 if it is 'infected' sample and 0 if it is not). - The features are separated by ',' because it is a csv file. - The last column is the label of the sample.

Additionally we have placed two pcap files in root directory. There are the traces used for compare both versions of SMB.
d
Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event...
datarade.ai
.csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Factori, Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-web-data-machine-learning-d-factori
Explore at:
.csvAvailable download formats
Dataset authored and provided by
Factori
Area covered
Sweden, Palestine, Taiwan, Japan, Egypt, Faroe Islands, Austria, Cameroon, Turks and Caicos Islands, Uzbekistan
Description
Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.

Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.

Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.

Enhanced Data Quality: We have rigorous data validation processes and also conduct quality assurance checks to guarantee the integrity and reliability of the training data for you to develop the AI & ML models.

Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.

We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.

Web Data Reach: Our reach data represents the total number of data counts available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.

Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).

Data Attributes: Anonymous_id IDType Timestamp Estid Ip userAgent browserFamily deviceType Os Url_metadata_canonical_url Url_metadata_raw_query_params refDomain mappedEvent Channel searchQuery Ttd_id Adnxs_id Keywords Categories Entities Concepts
d
Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent...
catalog.data.gov
data.usgs.gov
+2more
Updated Dec 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent Validation Data [Dataset]. https://catalog.data.gov/dataset/rangeland-condition-monitoring-assessment-and-projection-rcmap-independent-validation-data
Explore at:
Dataset updated
Dec 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
Rangeland ecosystems provide critical wildlife habitat (e.g., greater sage grouse, pronghorn, black-footed ferret), forage for livestock, carbon sequestration, provision of water resources, and recreational opportunities. At the same time, rangelands are vulnerable to climate change, fire, and anthropogenic disturbances. The arid-semiarid climate in most rangelands fluctuates widely, impacting livestock forage availability, wildlife habitat, and water resources. Many of these changes can be subtle or evolve over long time periods, responding to climate, anthropogenic, and disturbance driving forces. To understand vegetation change, scientists from the USGS and Bureau of Land Management (BLM) developed the Rangeland Condition Monitoring Assessment and Projection (RCMAP) project. RCMAP provides robust, long-term, and floristically detailed maps of vegetation cover at yearly time-steps, a critical reference to advancing science in the BLM and assessing Landscape Health standards. RCMAP quantifies the percent cover of ten rangeland components (annual herbaceous, bare ground, herbaceous, litter, non-sagebrush shrub, perennial herbaceous, sagebrush, shrub, and tree cover and shrub height) at yearly time-steps across the western U.S. using field training data, Landsat imagery, and machine learning. We utilize an ecologically comprehensive series of field-trained, high-resolution predictions of component cover and BLM Analysis Inventory and Monitoring (AIM) data to train machine learning models predicting component cover over the Landsat time-series. This dataset enables retrospective analysis of vegetation condition, impacts of weather variation and longer-term climatic change, and understanding of vegetation treatment and altered management practice effectiveness. RCMAP data can be used to answer critical questions regarding the influence of climate change and the suitability of management practices. Component products can be downloaded https://www.mrlc.gov/data. Independent validation was our primary validation approach, consisting of field measurements of component cover at stratified-random locations. Independent validation point placement used a stratified random design, with two levels of stratified restrictions to simplify logistics of field sampling (Rigge et al. 2020, Xian et al. 2015). The first level of stratification randomly selected 15, 8 km in diameter, sites across each mapping region. First level sites excluded areas less than 30 km away from training sites and other validation sites. The second level stratification randomly placed 6–10 points within each 8 km diameter validation site (total n = 2,014 points at n = 229 sites). Only sites on public land, between 100 and 1000 m from the nearest road, and in rangeland vegetation cover within each site were considered. The random points within a site were evenly allocated to three NDVI thresholds from a leaf-on Landsat image (low, medium, and high). Sites with relatively high spatial variance within a 90 m by 90 m patch (3 × 3 Landsat pixels) were excluded to minimize plot-pixel locational error. Using NDVI as a stratum ensured plot locations were distributed across the range of validation site productivity. At each validation point, we measured component cover using the line point intercept method along 2, 30 m transects. Data were collected from the first hit perspective.
D
AI Training Dataset Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI Training Dataset Market Outlook

The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.

One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.

Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.

The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.

As the demand for AI applications continues to grow, the role of Ai Data Resource Service becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging Ai Data Resource Service, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.

Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.

Data Type Analysis

The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.

Image data is critical for computer vision application
Q
Quality Analysis Tool Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Quality Analysis Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/quality-analysis-tool-1455522
Explore at:
pdf, ppt, docAvailable download formats
Dataset updated
May 19, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Quality Analysis Tool market is experiencing robust growth, driven by the increasing need for data quality assurance across various industries. The market's expansion is fueled by the rising adoption of cloud-based solutions, offering scalability and accessibility to both SMEs and large enterprises. The shift towards digital transformation and the burgeoning volume of data generated necessitate robust quality analysis tools to ensure data accuracy, reliability, and compliance. A compound annual growth rate (CAGR) of 15% is projected from 2025 to 2033, indicating a significant market expansion. This growth is further propelled by trends like the increasing adoption of AI and machine learning in quality analysis, enabling automation and improved efficiency. However, factors like high implementation costs and the need for specialized expertise could act as restraints on market growth. Segmentation reveals that the cloud-based segment holds a larger market share due to its flexibility and cost-effectiveness compared to on-premises solutions. North America is expected to dominate the market due to early adoption and the presence of major technology players. However, the Asia-Pacific region is anticipated to witness rapid growth fueled by increasing digitalization and data generation in emerging economies. The competitive landscape is characterized by a mix of established players like TIBCO and Google, alongside innovative startups offering niche solutions. The market is expected to reach approximately $15 billion by 2033, based on current growth projections and market dynamics. The competitive intensity in the Quality Analysis Tool market is expected to remain high, as both established vendors and new entrants strive to capture market share. Strategic alliances, mergers, and acquisitions are anticipated to shape the market landscape. Furthermore, the focus on integrating AI and machine learning capabilities into existing tools will be crucial for vendors to stay competitive. The development of user-friendly interfaces and improved data visualization capabilities will be paramount to cater to the growing demand for accessible and effective quality analysis solutions across different technical skill sets. The ongoing evolution of data privacy regulations will necessitate the development of tools compliant with global standards, impacting the market's trajectory. Finally, the market will need to address the skill gap in data quality management by providing robust training and support to users, ensuring widespread adoption and optimal utilization of the tools.
r
Seaview Survey: GBR Training and Validation Dataset
researchdata.edu.au
zip
Updated Jan 1, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Professor Ove Hoegh-Guldberg; Emeritus Professor Ove Hoegh-Guldberg; Dr Erwin Alberto Rodriguez-Ramirez; Dr Erwin Alberto Rodriguez-Ramirez (2018). Seaview Survey: GBR Training and Validation Dataset [Dataset]. http://doi.org/10.14264/UQL.2018.518
Explore at:
zip(4411733733)Available download formats
Unique identifier
https://doi.org/10.14264/UQL.2018.518
Dataset updated
Jan 1, 2018
Dataset provided by
The University of Queensland
Authors
Professor Ove Hoegh-Guldberg; Emeritus Professor Ove Hoegh-Guldberg; Dr Erwin Alberto Rodriguez-Ramirez; Dr Erwin Alberto Rodriguez-Ramirez
License
https://creativecommons.org/licenses/by_sa/3.0/deed.enhttps://creativecommons.org/licenses/by_sa/3.0/deed.en
Time period covered
Jan 1, 2012 - Dec 31, 2017
Description
This dataset consists of: (1) a series of images from a transect on Australia'a Great Barrier Reef ("images_training" and "images_validation" folders); (2) a description of the features we would like to identify within the images ("label_key.csv"); (3) a training dataset that can be used to train an interpretation algorithm to automatically identify those features in the images ("training_dataset.csv"); (4) a validation dataset that can be used to assess the performance of any interpretation algorithm ("validation_dataset.csv").

Each of the images has been processed to correct for lens distortion and cropped to standardise the area covered to approximately 1 square meter 'quadrats'. These training and validation datasets were developed by expert human interpretation of images.

The images were collected on several different coral reefs within the Great Barrier Reef between 2012-2014, so represent a variety of communities and conditions.

The training and validation datasets consists primarily of randomly selected points on images, but this random sample has been manually augmented to ensure that the full range of features has adequate representation. The row and column data in these datasets are in reference to the top left of the image with an origin coordinate of (1, 1).

The challenge is to develop novel ways of processing these images to extract information that can help us monitor and manage coral reefs.

Tabular dataset field descriptions:

label_key.csv:

id: label ID number label_code: the short code representing the features of interest functional_group: a more general categorisation of the features label_description: a brief description of what the labels represent benchmark: the classification accuracy (%) of the current best performing classification algorithm

training_dataset.csv & validation_dataset.csv:

qid: quadrat ID number row, col: the row and column of the pixel associated with the training/validation record (based on a top-left image origin) label: the names of the features of interest label_code: the short code representing the features of interest functional_group: a more general categorisation of the features filename: the name of the image associated with the training/validation record method: random (the point was generated randomly), target (the point was human- generated in order to ensure every feature has adequate representation

Citation for this dataset:

González-Rivero M, Beijbom O, Rodriguez-Ramirez A, Holtrop T, González-Marrero Y, Ganase A, Roelfsema C, Phinn S, Hoegh-Guldberg O (2016) Scaling up ecological measurements of coral reefs using semi-automated field image collection and analysis. Remote Sensing 8:30.

License:

These data are shared under a Creative Commons Attribution Share Alike License: https://creativecommons.org/licenses/by-sa/2.5/au/
Data supporting "A comprehensive analysis of air-sea CO2 flux uncertainties...
zenodo.org
text/x-python, zip
Updated Nov 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel J. Ford; Josh Blannin; Jennifer Watts; Andrew J. Watson; Peter Landschutzer; Annika Jersild; Jamie D. Shutler; Daniel J. Ford; Josh Blannin; Jennifer Watts; Andrew J. Watson; Peter Landschutzer; Annika Jersild; Jamie D. Shutler (2024). Data supporting "A comprehensive analysis of air-sea CO2 flux uncertainties constructed from surface ocean data products" [Dataset]. http://doi.org/10.5281/zenodo.13911533
Explore at:
text/x-python, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13911533
Dataset updated
Nov 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Daniel J. Ford; Josh Blannin; Jennifer Watts; Andrew J. Watson; Peter Landschutzer; Annika Jersild; Jamie D. Shutler; Daniel J. Ford; Josh Blannin; Jennifer Watts; Andrew J. Watson; Peter Landschutzer; Annika Jersild; Jamie D. Shutler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data included in this repository supports the manuscript "A comprehensive analysis of air-sea CO₂ flux uncertainties constructed from surface ocean data products".

Two files are present:

A Python config file used to run the software developed for the analysis (Ford et al., 2024)

A ZIP file containing the input, neural network, and output files for the analysis.

Within the ZIP file, multiple folders are present:

Decorrelation contains .csv files that contain the annual estimates of the decorrelation lengths for the parameters requiring these (SST, sea ice, wind, fCO₂ and fCO₂ network).

Flux contains the individual FluxEngine output files that provide all the flux calculations, and auxillary data to the flux calculations.

Fluxengine_input contains the input files to FluxEngine, which specifies the fCO_{2 (sw),}xCO_{2 (atm)} and the temperature, salinities for the skin and subskin layers.

Inputs contains all the monthly 1 degree input data used. Many of the data used are not native monthly 1 deg, and so these are generated from the higher resolution data. These are all combined into the neural_network_input.nc file, so a single file can be distributed with all the inputs used.

Networks contains the TensorFlow neural network (FNN) files, where each province has 10 folders (one for each ensemble).

Plots contains output plots for debugging and final plots of uncertainties

Scalars contains the scalars used to normalise the data before input into the neural network. These are saved as Python pickle files, as they are needed if the neural network is used on other data.

Unc_lut contains the look up tables to generate the parameter uncertainty as described in the manuscript. These are Python pickle files.

Validation contains a csv file with the independent test RMSD, along with Python Pickle files of the validation data.

In the main folder, three files are present:

Annual_flux.csv contains the annual air-sea CO₂ flux (or ocean sink estimate) estimated from the fCO_{2 (sw)} fields. This also contains the annual integrated uncertainties for each component in the uncertainty flow chart in the manuscript.

Output.nc contrains the gridded global fields of the fCO_{2 (sw)}, the air-sea CO₂ flux, and the uncertainties for all the individual components. Metadata within the file should provide all the information required.

Training.tsv contains the training/validation data alongside the input parameters for neural network training

Please contact Daniel J. Ford (d.ford@exeter.ac.uk) if you have any questions.

Acknowledgements

This work was funded by the Convex Seascape Survey (https://convexseascapesurvey.com/) and the European Union under grant agreement no. 101083922 (OceanICU; https://ocean-icu.eu/) and UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10054454, 10063673, 10064020, 10059241, 10079684, 10059012, 10048179]. The views, opinions and practices used to produce this dataset/software are however those of the author(s) only and do not necessarily reflect those of the European Union or European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

The Surface Ocean CO₂ Atlas (SOCAT) is an international effort, endorsed by the International Ocean Carbon Coordination Project (IOCCP), the Surface Ocean Lower Atmosphere Study (SOLAS) and the Integrated Marine Biosphere Research (IMBeR) program, to deliver a uniformly quality-controlled surface ocean CO₂ database. The many researchers and funding agencies responsible for the collection of data and quality control are thanked for their contributions to SOCAT.

References

Ford, D. J., Blannin, J., Watts, J., Watson, A. J., Landschutzer, P., Jersild, A., & Shutler, J. D. (2024, June 30). OceanICU Neural Network Framework with per pixel uncertainty propagation (v1.1) (Version v1.1). Zenodo. https://doi.org/10.5281/ZENODO.12597803
Data from: Prediction and Analysis of Tumor Infiltrating Lymphocytes across...
zenodo.org
zip
Updated Jan 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Huibo Zhang; Huibo Zhang (2025). Prediction and Analysis of Tumor Infiltrating Lymphocytes across 28 Cancers by TILScout Using Deep Learning [Dataset]. http://doi.org/10.5281/zenodo.14628242
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14628242
Dataset updated
Jan 10, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Huibo Zhang; Huibo Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Training, validation and independent test datasets related to model training and evaluation.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Performance of ML models on test data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t005

Performance of ML models on test data.

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

xlsAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pgph.0002475.t005

Dataset updated

Oct 31, 2023

Dataset provided by

PLOS Global Public Health

Authors

Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.

Clear search

Close search

Google apps

Main menu

Performance of ML models on test data.

TEAMER: Experimental Validation and Analysis of Deep Reinforcement Learning...

Data from: Analysis, Modeling, and Target-Specific Predictions of Linear...

Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

Data Annotation and Model Validation Platform Report

Machine Learning Dataset

Data from: Summary report of the 4th IAEA Technical Meeting on Fusion Data...

Data Annotation Tool Software Report

On the spatial distance between training and validation data in model...

Data Annotation and Model Validation Platform Report

Sample, test, and validation data for findmycells

Replication Package for 'Data-Driven Analysis and Optimization of Machine...

Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

Repository Structure

Requirements and Installation

Step-by-Step Reproduction Workflow

Step 1: Data Enrichment (Optional)

Step 2: Exploratory Data Analysis (Optional)

Step 3: Main Model Training, Validation, and Recommendation

Ransomware and user samples for training and validating ML models

Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event...

Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent...

AI Training Dataset Market Report | Global Forecast From 2025 To 2033

AI Training Dataset Market Outlook

Data Type Analysis

Quality Analysis Tool Report

Seaview Survey: GBR Training and Validation Dataset

Data supporting "A comprehensive analysis of air-sea CO2 flux uncertainties...

Data from: Prediction and Analysis of Tumor Infiltrating Lymphocytes across...

Performance of ML models on test data.