This dataset was created by Robbie Manolache.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel partitioning algorithm that utilizes competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model's strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. We validate our concept with datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. Our partitioning algorithm produces valuable insights into the datasets' structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems with single models learning all partitions simultaneously. Our results show significant improvements, with up to 56% loss reduction, confirming our algorithm's utility.
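As an illustrative sketch (not the authors' implementation), the competition loop described above can be reproduced on a toy dataset with two linear regimes; the choice of two linear models, absolute-error scoring, and a least-squares refit on each model's won points are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with two functional regimes: y = 2x on the left, y = -x + 6 on the right.
x = np.linspace(0, 4, 200)
y = np.where(x < 2, 2 * x, -x + 6)

# Two competing linear models, randomly initialised as (slope, intercept).
models = [rng.normal(size=2) for _ in range(2)]

for _ in range(20):
    # Each model submits predictions for every data point.
    preds = np.stack([m[0] * x + m[1] for m in models])
    # The best prediction per point wins that point ...
    winner = np.argmin(np.abs(preds - y), axis=0)
    # ... and the winning model is rewarded with training on the points it won.
    for i in range(len(models)):
        mask = winner == i
        if mask.sum() >= 2:
            models[i] = np.polyfit(x[mask], y[mask], 1)

# The final winner assignment translates into a partitioning scheme.
final_preds = np.stack([m[0] * x + m[1] for m in models])
partition = np.argmin(np.abs(final_preds - y), axis=0)
```

In the typical outcome each model specializes in one regime, so `partition` recovers the two functional patterns.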
License: MIT License, https://opensource.org/licenses/MIT
Context
The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content
The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
Data Split
The dataset has been partitioned into a standard 80% training and 20% testing split. The split is stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion across the two sets.
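A quick way to verify the stratified 80/20 proportions is to compare per-category counts from the two manifest files; `split_proportions` is a hypothetical helper name, and only the image_path/label columns documented above are assumed:

```python
import pandas as pd

def split_proportions(train_csv: str, test_csv: str) -> pd.DataFrame:
    """Per-category train/test counts for the manifest files.

    Returns a frame with one row per label and a test_fraction column,
    which should sit near 0.20 for every category if the split is stratified.
    """
    train = pd.read_csv(train_csv)  # columns: image_path, label
    test = pd.read_csv(test_csv)
    counts = pd.concat(
        {"train": train["label"].value_counts(),
         "test": test["label"].value_counts()},
        axis=1,
    ).fillna(0)
    counts["test_fraction"] = counts["test"] / (counts["train"] + counts["test"])
    return counts
```

The same frame can also drive a sanity check that every category appears in both splits.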
Acknowledgements & Original Source
This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage‑lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol–water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage‑lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed “accurate” by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) “similar” chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.
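The surrogate-metric idea above can be sketched as a simple filter: chemicals whose predicted logP is close to the known logP are flagged as having "accurate" estimated pp-LFER descriptors. The 0.5 log-unit threshold and the function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def accuracy_flags(logp_pred, logp_true, threshold=0.5):
    """Flag chemicals whose estimated descriptors are deemed 'accurate'
    because their predicted logP lies within `threshold` log units of
    the known logP (threshold is a hypothetical choice)."""
    logp_pred = np.asarray(logp_pred, dtype=float)
    logp_true = np.asarray(logp_true, dtype=float)
    return np.abs(logp_pred - logp_true) <= threshold
```

Downstream property models would then report errors separately for flagged and unflagged chemicals, mirroring the roughly 0.1 log unit improvement described above.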
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The raw data come from Ba Nguyen et al. (2022), who hosted their data here. This dataset was used in an independent study by Rijal et al. (2025), who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the resulting data ourselves as part of this data resource.
This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd
DDI_Ben
The DDI_Ben dataset is divided into five parts:
Random_drugbank: DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the DrugBank dataset into training, validation, and test subsets.
Random_twosides: DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the TWOSIDES dataset into training, validation, and test subsets.
… See the full description on the dataset page: https://huggingface.co/datasets/juejueziok/DDI_Ben.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
Overview: This dataset has 17 classes. The data are divided into three partitions: train, val, and test.
Dataset Characteristics: Image. Feature Type: Categorical. Associated Tasks: Classification, Other.
Class Labels:
0: Beet Armyworm
1: Black Hairy
2: Cutworm
3: Field Cricket
4: Jute Aphid
5: Jute Hairy
6: Jute Red Mite
7: Jute Semilooper
8: Jute Stem Girdler
9: Jute Stem Weevil
10: Leaf Beetle
11: Mealybug
12: Pod Borer
13: Scopula Emissaria
14: Termite
15: Termite odontotermes (Rambur)
16: Yellow Mite
Has Missing Values?: No
Ablation studies of length-scaling cosine distance, the dynamic training data partition strategy, and the GNN-based encoder on SCOPe v2.07 and ind_PDB.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
For convenience during training, the file train includes:
Training set.
Validation set.
In-domain test set.
Data partitioning rules are defined in dataset.py.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the attributes with the ability to predict new situations. The MNB model achieved 97% accuracy; it was, however, outperformed by the GNB classifier and the RF classifier, each of which achieved 100% accuracy.
Methods
Prior to data collection, the researcher was guided by ethical training certification on data collection and the rights to confidentiality and privacy, as overseen by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: data1, containing the data for phase 1 of the classification, and data2, containing the data for phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing was done to remove noise and outliers.
Transformation:
The data were transformed from analog to electronic records.
Data Partitioning
The collected data were divided into two portions: one portion was extracted as a training set, while the other was used for testing. One training portion was taken from a table stored in the database and called training set 1, while the other was taken from a second table in the database and called training set 2.
The dataset was split into two parts: 70% for training and the remaining 30% for testing. Using the MNB classification algorithm implemented in Python, models were trained on the training sample, tested on the remaining 30%, and the results were compared with those of the other machine learning models using standard metrics.
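A minimal scikit-learn sketch of the 70/30 split and MNB training described above; the function name, the random seed, and the use of stratified splitting are illustrative assumptions, and X would hold the 15 attribute values per patient with y the malaria labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def train_mnb(X, y, seed=42):
    """Split 70/30, fit a Multinomial Naive Bayes classifier on the
    training portion, and report accuracy on the held-out 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    model = MultinomialNB().fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```

The same split could be reused to compare MNB against the GNB and RF classifiers mentioned above.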
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data collection and preprocessing are performed.
ii. Preprocessed data are stored in training set 1 and training set 2, which are used during classification.
iii. The test dataset is stored in the database as the test dataset.
iv. Part of the test dataset is classified using classifier 1 and the remaining part using classifier 2, as follows:
Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: classifies only the records labeled positive by classifier 1, further assigning them to the complicated or uncomplicated class label. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, must be supplied with values.
Background and purpose: External drainage represents a well-established treatment option for acute intracerebral hemorrhage. The current standard of practice includes post-operative computed tomography imaging, which is evaluated subjectively. The implementation of an objective, automated evaluation of postoperative studies may enhance diagnostic accuracy and facilitate the scaling of research projects. The objective is to develop and validate a fully automated pipeline for intracerebral hemorrhage and drain detection, quantification of intracerebral hemorrhage coverage, and detection of malpositioned drains.
Materials and methods: In this retrospective study, we selected patients (n = 68) suffering from supratentorial intracerebral hemorrhage treated by minimally invasive surgery from the years 2010 to 2018. These were divided into training (n = 21), validation (n = 3), and testing (n = 44) datasets. Mean age (SD) was 70 (±13.56) years; 32 were female. Intracerebral hemorrhage and drains were automatically segmented using a previously published artificial-intelligence-based approach. From this, we calculated coverage profiles of the correctly detected drains to quantify the drains' coverage by the intracerebral hemorrhage and classify malpositioning. We used accuracy measures to assess detection and classification results and the intraclass correlation coefficient to assess the quantification of drain coverage by the intracerebral hemorrhage.
Results: In the test dataset, the pipeline showed a drain detection accuracy of 0.97 (95% CI: 0.92 to 0.99), an agreement between predicted and ground-truth coverage profiles of 0.86 (95% CI: 0.85 to 0.87), and a drain position classification accuracy of 0.88 (95% CI: 0.77 to 0.95), resulting in an area under the receiver operating characteristic curve of 0.92 (95% CI: 0.85 to 0.99).
Conclusion: We developed and statistically validated an automated pipeline for evaluating computed tomography scans after minimally invasive surgery for intracerebral hemorrhage. The algorithm reliably detects drains, quantifies drain coverage by the hemorrhage, and uses machine learning to detect malpositioned drains. This pipeline has the potential to impact the daily clinical workload, as well as to facilitate the scaling of data collection for future research into intracerebral hemorrhage and other diseases.
License: custom license, https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.34810/data314
This package contains a partition of the Iula Spanish LSP Treebank into train and test sets for Machine Learning experiments, so that the same partitions can be used by different researchers and their results can be directly compared. In this package we also deliver the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the Ancora corpus annotated in the same way as the Iula Treebank. The Tibidabo Treebank is a very good test set for models trained with the Iula Spanish LSP Treebank, since the sentences that form it come from a very different domain than those of the Iula Spanish LSP Treebank.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This archive includes the scripts and related input data to produce the results for the paper entitled "Structural constraints in current stomatal conductance models preclude accurate estimation of evapotranspiration and its partitions". The files/folders are described below:
1. Input_Data: This folder contains all the required input data including FluxNet data, soil properties, quality controlled training-validation data, and metadata & other supporting information of the sites.
2. Model_EMP: This folder contains all the scripts for the empirical model of stomatal conductance. (Note: scripts are written in MATLAB.) Nothing needs to be changed except the MATLAB executable path in the two files "run_all_tasks_to_optimize_params.sh" and "prediction.sh". Read the "ReadMe.txt" file in the folder "Model_EMP" for more instructions on running the model.
3. Model_ML: This folder contains all the scripts for the pure machine learning model of stomatal conductance. It contains four sub-folders: 1. Model_Config_1 (model with configuration 1); 2. Model_Config_2_TEA (model with configuration 2 and TEA-based T estimates); 3. Model_Config_2_uWUE (model with configuration 2 and uWUE-based T estimates); 4. Model_Config_2_Yu22 (model with configuration 2 and Yu22-based T estimates). Further instructions are given in each Jupyter notebook. Briefly, in the folder "Model_Config_1", the notebook "train_ML_config_1.ipynb" trains the model parameters and the notebook "Predictions_ML_config_1" is used to make predictions. Similar instructions apply to the other subfolders. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
4. Model_PH_exp: This folder contains all the scripts for the plant hydraulics model with explicit representation. All the scripts are self-explanatory and further instructions are provided in the scripts as needed. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
5. Model_PN_imp: This folder contains all the scripts for the plant hydraulics model with implicit representation. The instructions given for "Model_ML" apply here as well. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
Versions: TensorFlow 2.11.0, MATLAB R2022a, Python 3.10.9
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data set partitioning into training, validation and test data sets, considering the quantity of cells and the quantity of images.
This dataset was created by Teddy_55.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Results for feature selection, model selection, and validation, using the two selection criteria and the four data partitioning schemes. The outlier, 2OZA, was omitted from these runs. The number of features for the respective models is shown (#), alongside their leave-one-out cross-validation correlations and RMSE. The RMSE and correlation of the values used for selecting these models are also shown, as are those obtained when the model is applied to the validation set, along with the significance of the correlation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Swahili dataset was developed specifically for the language modeling task. It contains 28,000 unique words, with 6.84M, 970k, and 2M words in the train, valid, and test partitions respectively, which represents the ratio 80:10:10. The entire dataset is lowercased and has no punctuation marks, and start- and end-of-sentence markers have been incorporated to facilitate easy tokenization during language modeling. The train partition is the largest in order to support unsupervised learning of word representations, while the hyper-parameters are adjusted based on performance on the valid partition before evaluating the language model on the test partition.
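The preprocessing described above (lowercasing, punctuation removal, sentence markers) might be sketched as follows; the `<s>`/`</s>` marker strings and the regex are assumptions, since the dataset card does not specify them:

```python
import re

def preprocess(sentence: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, and wrap
    the sentence in (hypothetical) start/end-of-sentence markers."""
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return ["<s>"] + cleaned.split() + ["</s>"]
```

Applied per sentence, this yields token streams that a language model can consume directly.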
Background: To address the limitations of commonly used cross-validation methods, the linear regression (LR) method was proposed to estimate the population accuracy of predictions, based on the implicit assumption that the fitted model is correct. This method also provides two statistics to determine the adequacy of the fitted model. The validity and behavior of the LR method have been provided and studied for linear predictions but not for nonlinear predictions. The objectives of this study were to: 1) provide a mathematical proof of the validity of the LR method when predictions are based on conditional means, regardless of whether the predictions are linear or non-linear; 2) investigate the ability of the LR method to detect whether the fitted model is adequate or inadequate; and 3) provide guidelines on how to appropriately partition the data into training and validation sets such that the LR method can identify an inadequate model.
Results: We present a mathematical proof of the validity of the LR method to estimate population accuracy and to determine whether the fitted model is adequate or inadequate when the predictor is the conditional mean, which may be a non-linear function of the phenotype. Using three partitioning scenarios of simulated data, we show that one of the LR statistics can detect an inadequate model only when the data are partitioned such that the values of relevant predictor variables differ between the training and validation sets. In contrast, we observed that the other LR statistic was able to detect an inadequate model in all three scenarios.
Conclusion: The LR method has been proposed to address some limitations of the traditional approach of cross-validation in genetic evaluation. In this paper, we showed that the LR method is valid when the model is adequate and the conditional mean is the predictor, even when it is a non-linear function of the phenotype. We found that one of the two LR statistics is superior because it was able to detect an inadequate model in all three partitioning scenarios that were studied (i.e., between animals, by age within animals, and between animals and by age).
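A hedged sketch of the kind of quantities the LR method compares: predictions obtained from the partial (training) data versus predictions from the whole data. The exact estimators vary in the literature, so treat these formulas as a common textbook form rather than this paper's definitions:

```python
import numpy as np

def lr_statistics(u_partial, u_whole):
    """Compare predictions from partial (training) data with predictions
    from the whole data. Returns (bias, dispersion slope, accuracy ratio):
    bias is expected to be 0 and the slope 1 for an adequate model."""
    u_p = np.asarray(u_partial, dtype=float)
    u_w = np.asarray(u_whole, dtype=float)
    cov = np.cov(u_w, u_p)[0, 1]
    bias = np.mean(u_w - u_p)              # expected 0
    slope = cov / np.var(u_p, ddof=1)      # expected 1
    ratio = cov / np.var(u_w, ddof=1)      # ratio-of-accuracies proxy
    return bias, slope, ratio
```

Deviations of the bias from 0 or the slope from 1 are the kind of signals the statistics above use to flag an inadequate model.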
Training data of the model, detokenized in the exact order seen by the model. The training data are partitioned into 8 chunks (chunk-0 through chunk-7) based on the GPU rank that generated them. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
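Reading the chunks back in rank order might look like the following sketch; the per-record "text" field and the exact file layout inside each chunk directory are assumptions about the schema:

```python
import json
from pathlib import Path

def iter_training_text(root):
    """Yield detokenized training text in rank order, assuming a
    chunk-0 ... chunk-7 layout of .jsonl files under `root`."""
    for rank in range(8):
        chunk = Path(root) / f"chunk-{rank}"
        if not chunk.is_dir():
            continue  # tolerate missing ranks
        for path in sorted(chunk.glob("*.jsonl")):
            with open(path) as f:
                for line in f:
                    yield json.loads(line)["text"]
```

Because the generator streams line by line, it handles arbitrarily large chunks without loading them into memory.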
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is a collection of the datasets and model checkpoints used in iPXRDnet.
Model checkpoint files:
hmof-130T_Hydrogen: H2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_CarbonDioxide: CO2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_Nitrogen: N2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_Methane: CH4 adsorption prediction model trained on the hMOF-130T database
hmof-300T: adsorption prediction model trained on the hMOF-300T database
Gas_Se: separation selectivity prediction model obtained by training
Gas_SD: self-diffusion coefficient prediction model obtained by training
MOD: bulk modulus and shear modulus prediction model obtained by training
exAPMOF-1bar-ALM+PXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with PXRD and material ligands
exAPMOF-1bar-ALM: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with material ligands only
exAPMOF-1bar-PXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with PXRD only
exAPMOF-ISO: experimental adsorption isotherm model of anion-pillared MOFs obtained by training
exAPMOF-1bar-NOacvPXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with pre-activation PXRD data only
exAPMOF-1bar-acvPXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with post-activation PXRD data only
Dataset files:
hmof-xrd+str+ad: PXRD, gas adsorption, and structural feature data of the hMOF-300T database
hMOF-130T_ad_list_mof: gas adsorption data of the hMOF-130T database
hMOF-130T_GAS_DICT: gas descriptor data of the hMOF-130T database
hMOF-130T_STR_DICT: structural feature data of the hMOF-130T database
hMOF-130T_PXRD_DICT: PXRD data of the hMOF-130T database
MOD_data: bulk modulus and shear modulus data of Moghadam's MOFs
MOD_PXRD_dict: PXRD data of Moghadam's MOFs
GAS_SD-data: self-diffusion coefficient data in the CoREMOF database
SE-CO2,N2_data: separation selectivity, PXRD, and structural feature data of the CO2/N2 selectivity database
Sa_sp: dataset partitioning results of the CO2/N2 selectivity database
gas_dict: gas descriptor data used in the self-diffusion coefficient database
PXRD_DICT: post-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
xrd_noacv: pre-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
Smiles_ads: SMILES data of gases in the anion-pillared MOFs' experimental database
all_exAPMOF-1bar: anion-pillared MOFs' experimental adsorption data at 298 K and 1 bar
all_exAPMOF-1bar-NOacv: experimental adsorption data for anion-pillared MOFs with pre-activation PXRD at 298 K and 1 bar
exAPMOF_DICT: SMILES data of MOF ligands and descriptors of metal centers in the anion-pillared MOFs' experimental database
all_exAPMOF-iso: key library of MOF and gas combinations in the anion-pillared MOFs' experimental isotherm database
exAPMOF_ISOdata: anion-pillared MOFs' experimental adsorption isotherm data at 298 K