Extracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious manual process. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training and validation data from the competition are provided here, as well as competition details and baseline solutions. The data are derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Convolutional neural network (CNN) models and their respective training, validation and test datasets used in the manuscript:
Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes the date and time, latitude (“lat”), longitude (“lon”), sun angle (“sun_angle”, in degrees [°]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover (“cloud_cover”, proportion), and liquid precipitation (“liquid_precip”, kg m⁻² s⁻¹) for each record used to train and/or validate the models.
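A minimal sketch of how such a table might be inspected with pandas; the file name is a placeholder and only the column names quoted above are assumed to exist:

```python
import pandas as pd

# Hypothetical file name; the release's actual file layout may differ.
records = pd.read_csv("rainbow_records.csv")

# Columns quoted in the description: lat, lon, sun_angle (degrees),
# cloud_cover (proportion), liquid_precip (kg m^-2 s^-1), a date/time field,
# and a TRUE/FALSE rainbow-presence flag used as the prediction target.
print(records.dtypes)
print(records[["lat", "lon", "sun_angle"]].describe())
```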
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The package contains files for two modules designed to improve the accuracy of the indoor positioning system, namely the following:
door detection:
videos_test - videos used to demonstrate the application of the door detector
videos_res - videos from the videos_test directory with detected doors marked
parts detection:
frames_train_val - images generated from videos, used for training and validation of the VGG16 neural network model
frames_test - images generated from videos, used for testing the trained model
videos_test - videos used to demonstrate the application of the parts detector
videos_res - videos from the videos_test directory with detected parts marked
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training and test data, and model parameters. The last three columns show the MinORG, LT and HT parameters used to create the pathogenicity families and build each of the 10 models. Zthr is a threshold value, calculated for each model during the cross-validation phase, that is applied to the final prediction score to decide whether an input organism is predicted as pathogenic or non-pathogenic. The parameters for each model were chosen after 5-fold cross-validation tests.
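A minimal sketch of how such a per-model threshold might be applied; the values and the tie-breaking convention (score equal to Zthr counted as pathogenic) are illustrative assumptions, not taken from the release:

```python
def classify(prediction_score: float, z_thr: float) -> str:
    """Apply a model-specific threshold Zthr to the final prediction score."""
    # Scores at or above the threshold are treated here as pathogenic
    # (the exact tie-breaking convention is an assumption).
    return "pathogenic" if prediction_score >= z_thr else "non-pathogenic"

# Illustrative values only; each of the 10 models carries its own Zthr.
print(classify(prediction_score=0.73, z_thr=0.61))
```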
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository includes the training, test and validation data used in the paper "On the accuracy of posterior recovery with neural network emulators". Note that, due to the convention employed by the emulator framework in the paper, the test data are used for early stopping and the validation data are used to measure the accuracy of the emulator after training; this is the opposite of the convention used in most of the machine learning literature.
The corresponding code used in the paper is found at: https://github.com/htjb/validating_posteriors.
`_data.txt` corresponds to the ARES parameters used to generate the signals in `_labels.txt`.
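A minimal sketch of how a pair of these files might be loaded; the file prefix is not specified above, so the names below are placeholders:

```python
import numpy as np

# Placeholder prefix; actual files follow the `<prefix>_data.txt` / `<prefix>_labels.txt` pattern.
params = np.loadtxt("train_data.txt")     # ARES parameters, one row per signal
signals = np.loadtxt("train_labels.txt")  # corresponding signals generated from those parameters

assert params.shape[0] == signals.shape[0], "each parameter set should pair with one signal"
```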
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy for categorizing the classifications produced by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every stratum of the 2D map and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensemble (ISE) approach applies recursive decorrelated variable selection and considers the cardinal ratio of the classes to balance the training sets, thus avoiding bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. The statistics obtained with this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance in classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88 when only its highest-confidence isometric stratum is retained.
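A schematic sketch of the 2D stratification idea described above, assuming an ensemble of binary classifiers and a simple in/out applicability-domain flag; all names, bin counts, and the consensus measure are illustrative assumptions, not the published method:

```python
import numpy as np

def stratify(votes: np.ndarray, in_domain: np.ndarray, n_consensus_bins: int = 3):
    """Assign each compound to a (consensus degree, applicability domain) stratum.

    votes     : array of shape (n_models, n_samples) with 0/1 predictions
    in_domain : boolean array of shape (n_samples,), True if inside the AD
    """
    consensus = votes.mean(axis=0)  # fraction of models voting "active"
    # Degree of consensus, e.g. distance from the 50/50 split, binned into levels.
    consensus_level = np.digitize(np.abs(consensus - 0.5) * 2,
                                  np.linspace(0, 1, n_consensus_bins + 1)[1:-1])
    return list(zip(consensus_level, in_domain))

# Per-stratum statistics (e.g. AUC) can then be computed on an internal
# cross-validation set to delimit local performance regions.
```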
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Collocated data between AHI at 2 km resolution (nadir) and the CALIOP 1 km cloud product v4.20, used for training and validating cloud-identification neural networks. The main training and validation data from 2019 are stored in monthly directories, whilst the collocated dataset used to compare the NN, JMA and BoM cloud mask performances is the file "superdf.h5". All collocated data are stored as .h5 files and were built using the Python pandas package. In this archive, each monthly directory is packed with tar and compressed with bzip2, while "superdf.h5" is provided as a single bzip2-compressed file.
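A minimal sketch of how these archives might be unpacked and read back with pandas; the archive file names and the keys inside the HDF5 stores are not documented above, so they are treated as placeholders or discovered at run time:

```python
import bz2
import shutil
import tarfile
import pandas as pd

# Unpack one monthly archive (archive name is a placeholder).
with tarfile.open("201901.tar.bz2", mode="r:bz2") as archive:
    archive.extractall("201901")

# "superdf.h5" ships bzip2-compressed; decompress it first (compressed name assumed).
with bz2.open("superdf.h5.bz2", "rb") as src, open("superdf.h5", "wb") as dst:
    shutil.copyfileobj(src, dst)

# The .h5 files were built with pandas, so HDFStore can list and load their keys.
with pd.HDFStore("superdf.h5", mode="r") as store:
    print(store.keys())
    df = store[store.keys()[0]]
print(df.head())
```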
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H. The machine was upgraded in terms of control technology. Processes for model training and validation were recorded, suitable for both steel and aluminum machining. Several recordings were made with and without a workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" (DOI 10.5445/IR/1000157789) and is intended to enable an investigation of the transferability of models between different machines.
Technical remarks:
Documents:
- Design of Experiments: information on the paths as well as the technological values of the experiments
- Recording information: information about the recordings, with comments
- Data: all recorded datasets. The first level contains the folders for training and validation, both with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file consisting of a header with all relevant information, such as the signal sources, followed by the entries of the recorded time series.
- NC code: NC programs executed on the machine
Experimental data:
- Machine: retrofitted DMC 60H
- Material: S235JR, 2007 T4
- Tools:
  - Solid carbide end mill (VHM-Fräser) HPC, TiSi, ⌀ f8, DC: 5 mm
  - Solid carbide end mill (VHM-Fräser) HPC, TiSi, ⌀ f8, DC: 10 mm
  - Solid carbide end mill (VHM-Fräser) HPC, TiSi, ⌀ f8, DC: 20 mm
  - End mill (Schaftfräser) HSS-Co8, TiAlN, ⌀ k10, DC: 5 mm
  - End mill (Schaftfräser) HSS-Co8, TiAlN, ⌀ k10, DC: 10 mm
  - End mill (Schaftfräser) HSS-Co8, TiAlN, ⌀ k10, DC: 5 mm
- Workpiece blank dimensions: 150 x 75 x 50 mm
License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
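A minimal sketch of how one of the JSON recordings might be inspected; the file path and the exact key names inside the header and time-series entries are not specified above, so the code only reports whatever top-level structure it finds:

```python
import json

# Placeholder path; recordings sit below the training/validation folders
# described above, one JSON file per recording.
with open("recording.json", "r", encoding="utf-8") as fh:
    recording = json.load(fh)

# The file consists of a header (e.g. signal sources) followed by the
# recorded time-series entries; print the top-level structure to see both.
if isinstance(recording, dict):
    for key, value in recording.items():
        size = len(value) if isinstance(value, (list, dict)) else 1
        print(f"{key}: {type(value).__name__} ({size} entries)")
```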
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research study aims to understand the application of Artificial Neural Networks (ANNs) to forecast the compressive strength of Self-Compacting Recycled Coarse Aggregate Concrete (SCRCAC). From the literature, 602 available data sets of SCRCAC mix designs were collected, and the data were rearranged, reconstructed, trained and tested for ANN model development. The models were established using seven input variables: the mass of cementitious content, water, natural coarse aggregate content, natural fine aggregate content, recycled coarse aggregate content, chemical admixture and mineral admixture used in the SCRCAC mix designs. Two normalization techniques were used for data normalization to visualize the data distribution. For each normalization technique, three transfer functions were used for modelling. In total, six different types of models were run in MATLAB and used to estimate the 28th-day SCRCAC compressive strength. Normalization technique 2 performs better than technique 1, and TANSIG is the best transfer function. The best k-fold cross-validation fold is k = 7. The coefficient of determination for predicted versus actual compressive strength is 0.78 for training and 0.86 for testing. The impact of the number of neurons and layers on the model was also assessed. Inputs from standards were used to forecast the 28th-day compressive strength. Apart from ANN, Machine Learning (ML) techniques like random forest, extra trees, extreme boosting and light gradient boosting were adopted to predict the 28th-day compressive strength of SCRCAC. Compared to ML, the ANN prediction shows better results in terms of sensitivity analysis. The study was also extended to determine the 28th-day compressive strength from experimental work and compare it with that predicted by the best ANN model. Standard and ANN mix designs have similar fresh and hardened properties. The average compressive strengths from the ANN model and the experimental results are 39.067 and 38.36 MPa, respectively, with a correlation coefficient of 1. It appears that the ANN can validly predict the compressive strength of concrete.
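The original models were built in MATLAB; as a rough, hedged illustration of the described setup (seven inputs, a tanh-type transfer function standing in for TANSIG, and 7-fold cross-validation), an equivalent sketch in Python with scikit-learn might look as follows. The data, layer size, and scaler choice are placeholders, not the study's actual configuration:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# X: 7 input variables (cementitious content, water, natural coarse/fine aggregate,
# recycled coarse aggregate, chemical admixture, mineral admixture); y: 28-day strength.
rng = np.random.default_rng(0)
X = rng.random((602, 7))      # placeholder for the 602 collected mix designs
y = rng.random(602) * 60      # placeholder compressive strengths in MPa

# MinMax scaling stands in for a normalization technique; tanh for the TANSIG transfer function.
model = make_pipeline(MinMaxScaler(),
                      MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                                   max_iter=5000, random_state=0))

scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=7, shuffle=True, random_state=0), scoring="r2")
print(scores.mean())
```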
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the training and validation datasets used in the manuscript "Three-Dimensional Implicit Structural Modeling Using Convolutional Neural Network". In this manuscript, we propose an efficient deep learning method using a Convolutional Neural Network (CNN) to predict a scalar field from sparse structural data associated with multiple distinct stratigraphic layers and faults. The CNN architecture is beneficial for the flexible incorporation of empirical geological knowledge when trained with numerous and realistic structural models that are automatically generated from a data simulation workflow. It also offers an expressive way of integrating various types of structural constraints by minimizing a hybrid loss function that compares predicted and reference structural models, opening new opportunities for further improving geological modeling.
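The manuscript's actual loss terms are not listed here; as a generic, hedged sketch, a "hybrid" loss of this kind is typically a weighted sum of per-constraint misfits between predicted and reference scalar fields, for example:

```python
import numpy as np

def hybrid_loss(pred: np.ndarray, ref: np.ndarray, mask: np.ndarray,
                w_field: float = 1.0, w_grad: float = 0.1) -> float:
    """Illustrative weighted combination of two misfit terms (not the paper's definition).

    pred, ref : predicted and reference 3D scalar fields
    mask      : 1 where sparse structural observations exist, 0 elsewhere
    """
    # Term 1: misfit of the scalar field at the sparse structural data points.
    field_term = np.mean(mask * (pred - ref) ** 2)
    # Term 2: misfit of the field gradients, a stand-in for orientation-type constraints.
    grad_term = sum(np.mean((gp - gr) ** 2)
                    for gp, gr in zip(np.gradient(pred), np.gradient(ref)))
    return w_field * field_term + w_grad * grad_term
```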
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents the training and validation data that were used to produce the pre-trained model for the TomoTwin paper. Please see 10.5281/zenodo.6637357 for the raw tomograms.
This dataset was created by Hamed Etezadi
https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training and validation datasets for the first subtask of the shared task "Field of Research Classification" to be held at the Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024) workshop (https://nfdi4ds.github.io/nslp2024/).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and training data sets. Images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters.
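As a brief, hedged illustration of how performance parameters are typically derived from such a confusion matrix (the counts below are placeholders, not values from this dataset):

```python
# 2x2 confusion matrix counts: true/false positives and negatives (placeholder values).
tp, fp, fn, tn = 90, 10, 5, 95

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```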
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
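A minimal sketch of how the released folder structure might be indexed, assuming only what is stated above (WAV files grouped into folders named by four-letter species code); the root directory name is a placeholder:

```python
from pathlib import Path

# Placeholder for wherever the release has been downloaded and unpacked.
root = Path("nabat_training_validation")

# Each folder is named with a four-letter species code (or the noise class);
# build a simple (file, label) index from the WAV files it contains.
index = [(wav, folder.name)
         for folder in sorted(root.iterdir()) if folder.is_dir()
         for wav in sorted(folder.glob("*.wav"))]

print(f"{len(index)} recordings across {len({label for _, label in index})} classes")
```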
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset necessary for the DocTOR utility.
DocTOR (Direct fOreCast Target On Reaction) is a utility written in Python 3.9 (using the conda framework) that allows the user to upload a list of UniProt IDs and adverse reactions (from the available models) in order to study the relationship between the two.
As output, the program assigns a positive or negative class to each protein, assessing its possible involvement in the onset of the selected ADRs.
DocTOR exploits the data coming from T-ARDIS [https://doi.org/10.1093/database/baab068] to train different Machine Learning approaches (SVM, RF, NN) using network topological measurements as features.
The predictions coming from the single trained models are combined in a meta-predictor exploiting three different voting systems, as sketched below.
The results of the meta-predictor, together with the ones from the single ML methods, will be available in the output log file (named "predictions_community" or "predictions_curated" based on the database type).
The DocTOR utility is available at https://github.com/cristian931/DocTOR
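DocTOR's own voting systems are not detailed here; as a hedged sketch of the general idea of combining SVM, RF and NN outputs in a meta-predictor, a simple majority vote could look like this (illustrative only, not DocTOR's implementation):

```python
from collections import Counter

def majority_vote(predictions: dict[str, int]) -> int:
    """Combine per-model class calls (1 = involved in the ADR, 0 = not) by simple majority."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]

# Illustrative per-protein calls from the three trained approaches.
print(majority_vote({"SVM": 1, "RF": 1, "NN": 0}))  # -> 1 (positive class)
```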
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*Once again, one sample is left out as the testing sample and the remaining 28 samples form the training dataset. The four features (columns “1”, “2”, “3”, and “4”) of each miRNA are calculated from the genomic coordinates of the miRNA, the miRNA-hosting intron, and the host gene. ER represents the experimental results and PR represents the prediction results. The symbol “+” means high co-expression and the symbol “−” means low co-expression.
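A brief, hedged sketch of the leave-one-out scheme described above (29 samples, four features); the data and classifier choice are placeholders, not the study's actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.random((29, 4))            # placeholder: 4 features per miRNA
y = rng.integers(0, 2, size=29)    # placeholder: "+" (high) vs "−" (low) co-expression

# Each round leaves one sample out for testing; the remaining 28 form the training set.
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
print(f"leave-one-out accuracy: {correct / len(X):.2f}")
```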
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital’s Hospital Information System were structured using state-of-the-art Natural Language Processing techniques. Several ML models were used to develop HBV risk assessment models. The performance of the ML models was then interpreted using Shapley values (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 within a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness-of-fit test p-value >0.05). Suspected case detection models for HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in the primary care setting in China. This study has developed a suspected case detection model for HBV, which can facilitate early identification and treatment of HBV in the primary care setting in China, contributing towards the achievement of WHO’s elimination goals for HBV infections. We utilized state-of-the-art natural language processing techniques to structure the data records, leading to the development of a robust healthcare information system which enhances the surveillance and control of HBV in China.
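As a rough, hedged illustration of the validation-and-interpretation workflow described above (a 2:1 cohort split, five-fold cross-validation, and SHAP-based interpretation), using placeholder data and a generic tree model rather than the study's actual pipeline:

```python
import numpy as np
import shap  # third-party SHAP package
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.random((3000, 20))            # placeholder clinical features
y = rng.integers(0, 2, size=3000)     # placeholder HBV labels

# Randomly divide the cohort at a 2:1 ratio (derivation vs validation).
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean())  # five-fold CV AUC

# Interpret the fitted model with SHAP values on the validation cohort.
model.fit(X_dev, y_dev)
shap_values = shap.TreeExplainer(model).shap_values(X_val)
```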