66 datasets found
  1. AMEX Training Data - Parquet Partitions

    • kaggle.com
    zip
    Updated Jul 24, 2022
    Cite
    Robbie Manolache (2022). AMEX Training Data - Parquet Partitions [Dataset]. https://www.kaggle.com/datasets/slashie/amex-train-data-pq
    Available download formats: zip (7813557437 bytes)
    Dataset updated
    Jul 24, 2022
    Authors
    Robbie Manolache
    Description

    This dataset was created by Robbie Manolache.

  2. Data Sheet 1_Functional partitioning through competitive learning.pdf

    • figshare.com
    pdf
    Updated Nov 5, 2025
    Cite
    Marius Tacke; Matthias Busch; Kevin Linka; Christian Cyron; Roland Aydin (2025). Data Sheet 1_Functional partitioning through competitive learning.pdf [Dataset]. http://doi.org/10.3389/frai.2025.1661444.s001
    Available download formats: pdf
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Marius Tacke; Matthias Busch; Kevin Linka; Christian Cyron; Roland Aydin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel partitioning algorithm that utilizes competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model's strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. We validate our concept with datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. Our partitioning algorithm produces valuable insights into the datasets' structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems with single models learning all partitions simultaneously. Our results show significant improvements, with up to 56% loss reduction, confirming our algorithm's utility.
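    The reward loop described in this abstract can be sketched as follows; the two linear expert models, the toy piecewise data, and the iteration count are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with two distinct functional regimes (assumed for illustration):
# y = 2x on the left half, y = -3x + 5 on the right half.
x = np.linspace(-1.0, 1.0, 200)
y = np.where(x < 0, 2 * x, -3 * x + 5)

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    A = np.column_stack([xs, np.ones_like(xs)])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

# Two randomly initialized linear "expert" models compete.
models = [rng.normal(size=2) for _ in range(2)]

for _ in range(20):
    # Every model submits a prediction for every data point ...
    preds = np.stack([a * x + b for a, b in models])
    errors = (preds - y) ** 2
    # ... and the best prediction per point is rewarded: the winning
    # model gets to train on (here: refit to) the points it won.
    winners = errors.argmin(axis=0)
    for k in range(len(models)):
        mask = winners == k
        if mask.sum() >= 2:
            models[k] = fit_line(x[mask], y[mask])

# The final win assignment induces a partitioning of the dataset.
partition = winners
```

    The winner index per point is the partition label, and each winner's refits amplify its specialization on one regime, which is the competition mechanism the abstract describes.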

  3. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content: The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
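    A minimal sketch of reading one of these manifests into (path, label) pairs; only the image_path and label column names come from the description above, while the sample rows and category names are hypothetical:

```python
import csv
import io

def load_manifest(fp):
    """Parse a manifest CSV with image_path and label columns
    into a list of (path, label) pairs."""
    return [(row["image_path"], row["label"]) for row in csv.DictReader(fp)]

# Hypothetical two-row stand-in for the real train.csv:
sample = io.StringIO(
    "image_path,label\n"
    "train/001_ak47/img_0001.jpg,ak47\n"
    "train/002_american-flag/img_0002.jpg,american-flag\n"
)
pairs = load_manifest(sample)
```

    The same pairs can then feed a custom PyTorch Dataset or a tf.data pipeline, which is the data-generator workflow the description mentions.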

    Data Split: The dataset has been partitioned to create a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.

    Acknowledgements & Original Source: This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  4. Data from: Predicting Solute Descriptors for Organic Chemicals by a Deep...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xlsx
    Updated Jun 4, 2023
    Cite
    Kai Zhang; Huichun Zhang (2023). Predicting Solute Descriptors for Organic Chemicals by a Deep Neural Network (DNN) Using Basic Chemical Structures and a Surrogate Metric [Dataset]. http://doi.org/10.1021/acs.est.1c05398.s002
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kai Zhang; Huichun Zhang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage‑lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol–water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage‑lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed “accurate” by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) “similar” chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.
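    The surrogate-metric idea reduces to a simple filter on a widely measured property; the function name, the tolerance value, and the numbers below are illustrative assumptions, not values from the paper:

```python
def flag_accurate(pred_logp, true_logp, tol=0.5):
    """Flag chemicals whose estimated logP is within `tol` log units of
    the measured value; their estimated pp-LFER descriptors are then
    treated as 'accurate' (tol is an assumed threshold)."""
    return [abs(p - t) <= tol for p, t in zip(pred_logp, true_logp)]

# Illustrative predicted vs. measured logP values for three chemicals:
flags = flag_accurate([1.9, 3.4, -0.2], [2.0, 2.1, -0.1])
```

    Chemicals flagged False would be excluded (or down-weighted) when the estimated descriptors feed downstream pp-LFER models, which is how the surrogate metric lowered errors in the study.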

  5. Yeast genotype/phenotype data partitioned into train/validation/test splits...

    • zenodo.org
    zip
    Updated Apr 30, 2025
    Cite
    Zenodo (2025). Yeast genotype/phenotype data partitioned into train/validation/test splits from Ba Nguyen et al, 2022 [Dataset]. http://doi.org/10.5281/zenodo.15313069
    Available download formats: zip
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.

    This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd

  6. DDI_Ben

    • huggingface.co
    Updated May 28, 2025
    Cite
    Yang Cao (2025). DDI_Ben [Dataset]. https://huggingface.co/datasets/juejueziok/DDI_Ben
    Dataset updated
    May 28, 2025
    Authors
    Yang Cao
    Description

    DDI_Ben

    The DDI_Ben dataset is divided into five parts:

    Random_drugbank contains DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the DrugBank dataset into training, validation, and test subsets.

    Random_twosides contains DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the TWOSIDES dataset into training, validation, and test subsets.

    … See the full description on the dataset page: https://huggingface.co/datasets/juejueziok/DDI_Ben.

  7. Jute Pest

    • kaggle.com
    zip
    Updated May 16, 2024
    + more versions
    Cite
    ABUCHI ONWUEGBUSI (2024). Jute Pest [Dataset]. https://www.kaggle.com/datasets/abuchionwuegbusi/jute-pest
    Available download formats: zip (163188696 bytes)
    Dataset updated
    May 16, 2024
    Authors
    ABUCHI ONWUEGBUSI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview: This dataset has 17 classes. The data are divided into three partitions: train, val, and test.

    Dataset Characteristics: Image. Feature Type: Categorical. Associated Tasks: Classification, Other.

    Class Labels:
    0: Beet Armyworm
    1: Black Hairy
    2: Cutworm
    3: Field Cricket
    4: Jute Aphid
    5: Jute Hairy
    6: Jute Red Mite
    7: Jute Semilooper
    8: Jute Stem Girdler
    9: Jute Stem Weevil
    10: Leaf Beetle
    11: Mealybug
    12: Pod Borer
    13: Scopula Emissaria
    14: Termite
    15: Termite odontotermes (Rambur)
    16: Yellow Mite

    Has Missing Values?: No

  8. Ablation studies of length-scaling cosine distance, the dynamic training...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 24, 2022
    Cite
    Xia, Chunqiu; Shen, Hong-Bin; Xia, Ying; Feng, Shi-Hao; Pan, Xiaoyong (2022). Ablation studies of length-scaling cosine distance, the dynamic training data partition strategy and the GNN-based encoder on SCOPe v2.07 and ind_PDB. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000279177
    Dataset updated
    Mar 24, 2022
    Authors
    Xia, Chunqiu; Shen, Hong-Bin; Xia, Ying; Feng, Shi-Hao; Pan, Xiaoyong
    Description

    Ablation studies of length-scaling cosine distance, the dynamic training data partition strategy and the GNN-based encoder on SCOPe v2.07 and ind_PDB.

  9. Global-Scale

    • huggingface.co
    Cite
    yin pan, Global-Scale [Dataset]. https://huggingface.co/datasets/YINPAN/Global-Scale
    Authors
    yin pan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Notice

    For convenience during training, the file train includes the training set, the validation set, and the in-domain test set.
    Data partitioning rules are defined in dataset.py.

  10. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient outcomes faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes' theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets from public hospitals, but modeling with the multinomial Naive Bayes algorithm still has limitations. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how dependency between attributes affects classifier performance. MNB creates a transparent and reliable graphical representation of the attributes with the ability to predict new cases. The MNB model achieved 97% accuracy and was compared with the GNB classifier (100% accuracy) and the RF classifier (also 100% accuracy).

    Methods. Prior to data collection, the researcher completed ethical training and certification covering data collection and the right to confidentiality and privacy, under Institutional Review Board (IRB) guidelines. Data were collected from the manual archives of hospitals selected purposively using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory-confirmed diagnosis. The data were divided into two tables: data1, containing data for phase 1 of the classification, and data2, containing data for phase 2.

    Data Source Collection. The malaria incidence dataset was obtained from public hospitals covering 2017 to 2021, taking into account the geographical location and socio-economic factors of the patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

    Data Preprocessing and Partitioning. Preprocessing was done to remove noise and outliers, and the records were transformed from analog to electronic form. The data were divided into a training portion (training set 1 and training set 2, each taken from a table in the database) and a testing portion: 70% of the data for training and the remaining 30% for testing. The MNB classification models, implemented in Python, were trained on the 70% sample, tested on the remaining 30%, and compared with the other machine learning models using standard metrics.

    Classification and Prediction. Based on the nature of the variables in the dataset, this study uses two Naïve Bayes (Multinomial) classification phases. The framework operates as follows: (i) data are collected and preprocessed; (ii) preprocessed data are stored in training set 1 and training set 2, which are used during classification; (iii) the test data are stored in a test database; (iv) part of the test set is classified with classifier 1 and the remainder with classifier 2. Classifier phase 1 classifies each patient as positive (P) if the patient has malaria and negative (N) otherwise. Classifier phase 2 takes only the records classified as positive by phase 1 and further classifies them into complicated and uncomplicated class labels. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables, with the core parameters serving as determining factors.
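    The two-phase cascade described above can be sketched as follows; the rule-based stand-ins and the record field names are hypothetical placeholders for the trained Multinomial Naive Bayes classifiers:

```python
def classify_phase1(record):
    """Phase 1 stand-in: malaria positive (P) or negative (N)."""
    return "P" if record["parasite_count"] > 0 else "N"

def classify_phase2(record):
    """Phase 2 stand-in: grade positives as complicated or uncomplicated."""
    return "complicated" if record["severe_symptoms"] else "uncomplicated"

def grade(record):
    """Cascade: only records classified positive by phase 1 reach phase 2."""
    if classify_phase1(record) == "N":
        return "N"
    return classify_phase2(record)

result = grade({"parasite_count": 1200, "severe_symptoms": False})
```

    In the study both phases would be MNB models trained on data1 and data2 respectively; the sketch only shows how the two classifiers are chained.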

  11. Results of each step in all data partitions.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 26, 2024
    Cite
    Reisert, Marco; Rau, Alexander; Urbach, Horst; Watzlawick, Ralf; Elsheikh, Samer; Kellner, Elias; Demerath, Theo; Würtemberger, Urs; Elbaz, Ahmed (2024). Results of each step in all data partitions. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001464570
    Dataset updated
    Dec 26, 2024
    Authors
    Reisert, Marco; Rau, Alexander; Urbach, Horst; Watzlawick, Ralf; Elsheikh, Samer; Kellner, Elias; Demerath, Theo; Würtemberger, Urs; Elbaz, Ahmed
    Description

    Background and purpose: External drainage represents a well-established treatment option for acute intracerebral hemorrhage. The current standard of practice includes post-operative computed tomography imaging, which is subjectively evaluated. The implementation of an objective, automated evaluation of postoperative studies may enhance diagnostic accuracy and facilitate the scaling of research projects. The objective is to develop and validate a fully automated pipeline for intracerebral hemorrhage and drain detection, quantification of intracerebral hemorrhage coverage, and detection of malpositioned drains. Materials and methods: In this retrospective study, we selected patients (n = 68) suffering from supratentorial intracerebral hemorrhage treated by minimally invasive surgery, from years 2010–2018. These were divided into training (n = 21), validation (n = 3) and testing (n = 44) datasets. Mean age (SD) was 70 (±13.56) years, 32 female. Intracerebral hemorrhage and drains were automatically segmented using a previously published artificial intelligence-based approach. From this, we calculated coverage profiles of the correctly detected drains to quantify the drains' coverage by the intracerebral hemorrhage and classify malpositioning. We used accuracy measures to assess detection and classification results and the intraclass correlation coefficient to assess the quantification of drain coverage by the intracerebral hemorrhage. Results: In the test dataset, the pipeline showed a drain detection accuracy of 0.97 (95% CI: 0.92 to 0.99), an agreement between predicted and ground truth coverage profiles of 0.86 (95% CI: 0.85 to 0.87) and a drain position classification accuracy of 0.88 (95% CI: 0.77 to 0.95), resulting in an area under the receiver operating characteristic curve of 0.92 (95% CI: 0.85 to 0.99). Conclusion: We developed and statistically validated an automated pipeline for evaluating computed tomography scans after minimally invasive surgery for intracerebral hemorrhage. The algorithm reliably detects drains, quantifies drain coverage by the hemorrhage, and uses machine learning to detect malpositioned drains. This pipeline has the potential to impact the daily clinical workload, as well as to facilitate the scaling of data collection for future research into intracerebral hemorrhage and other diseases.

  12. Data from: Tibidabo Treebank and IULA Spanish LSP Treebank Train and Test...

    • dataverse.csuc.cat
    txt, zip
    Updated Jul 16, 2024
    Cite
    Institut Universitari de Lingüística Aplicada; Montserrat Marimon (2024). Tibidabo Treebank and IULA Spanish LSP Treebank Train and Test Partitions [Dataset]. http://doi.org/10.34810/data314
    Available download formats: txt (1731), zip (5519589)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Institut Universitari de Lingüística Aplicada; Montserrat Marimon
    License

    https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.34810/data314

    Description

    This package contains a partition of the IULA Spanish LSP Treebank into train and test sets for Machine Learning experiments, so that the same partitions can be used by different researchers and their results can be directly compared. The package also delivers the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the AnCora corpus annotated in the same way as the IULA Treebank. The Tibidabo Treebank is a very good test set for models trained on the IULA Spanish LSP Treebank, since the sentences that form it come from a very different domain than those of the IULA Spanish LSP Treebank.

  13. Data from: Structural constraints in current stomatal conductance models...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    txt, zip
    Updated Mar 25, 2023
    Cite
    Pushpendra Raghav; Mukesh Kumar; Yanlan Liu (2023). Structural constraints in current stomatal conductance models preclude accurate estimation of evapotranspiration and its partitions [Dataset]. http://doi.org/10.5281/zenodo.7768479
    Available download formats: txt, zip
    Dataset updated
    Mar 25, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Pushpendra Raghav; Mukesh Kumar; Yanlan Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive includes the scripts and related input data to produce results for the paper entitled - "Structural constraints in current stomatal conductance models preclude accurate estimation of evapotranspiration and its partitions". Following is the description of files/folders:

    1. Input_Data: This folder contains all the required input data including FluxNet data, soil properties, quality controlled training-validation data, and metadata & other supporting information of the sites.

    2. Model_EMP: This folder contains all the scripts for the empirical model of stomatal conductance. (Note: scripts are written in MATLAB.) Nothing needs to be changed except the MATLAB executable path in the two files "run_all_tasks_to_optimize_params.sh" and "prediction.sh". Read the "ReadMe.txt" file in the folder "Model_EMP" for more instructions on running the model.

    3. Model_ML: This folder contains all the scripts for the pure machine learning model of stomatal conductance. It contains four sub-folders: 1. Model_Config_1 (model with configuration 1); 2. Model_Config_2_TEA (model with configuration 2 and TEA-based T estimates); 3. Model_Config_2_uWUE (model with configuration 2 and uWUE-based T estimates); 4. Model_Config_2_Yu22 (model with configuration 2 and Yu22-based T estimates). Further instructions are given in each Jupyter notebook. Briefly, in the folder "Model_Config_1", the notebook "train_ML_config_1.ipynb" trains the model parameters and the notebook "Predictions_ML_config_1" is used to make predictions. Similar instructions apply to the other subfolders. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.

    4. Model_PH_exp: This folder contains all the scripts for the plant hydraulics model with explicit representation. The scripts are self-explanatory and further instructions are provided in the scripts as needed. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.

    5. Model_PN_imp: This folder contains all the scripts for the plant hydraulics model with implicit representation. The instructions given for "Model_ML" apply here as well. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.

    Versions: Tensorflow 2.11.0, MATLAB_R2022a, Python 3.10.9

  14. Data set partitioning into training, validation and test data sets,...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Carina Albuquerque; Leonardo Vanneschi; Roberto Henriques; Mauro Castelli; Vanda Póvoa; Rita Fior; Nickolas Papanikolaou (2023). Data set partitioning into training, validation and test data sets, considering the quantity of cells and the quantity of images. [Dataset]. http://doi.org/10.1371/journal.pone.0260609.t003
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Carina Albuquerque; Leonardo Vanneschi; Roberto Henriques; Mauro Castelli; Vanda Póvoa; Rita Fior; Nickolas Papanikolaou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set partitioning into training, validation and test data sets, considering the quantity of cells and the quantity of images.

  15. ISIC2018-Task3-preprocessed data

    • kaggle.com
    zip
    Updated Jun 8, 2021
    Cite
    Teddy_55 (2021). ISIC2018-Task3-preprocessed data [Dataset]. https://www.kaggle.com/datasets/teddyziyyyu/isic2018task3preprocessed-data/data
    Available download formats: zip (2771717873 bytes)
    Dataset updated
    Jun 8, 2021
    Authors
    Teddy_55
    Description

    This dataset was created by Teddy_55.

  16. Results for training, model selection and validation (2OZA omitted).

    • figshare.com
    xls
    Updated Dec 2, 2015
    Cite
    Iain H. Moal; Paul A. Bates (2015). Results for training, model selection and validation (2OZA omitted). [Dataset]. http://doi.org/10.1371/journal.pcbi.1002351.t005
    Available download formats: xls
    Dataset updated
    Dec 2, 2015
    Dataset provided by
    PLOS Computational Biology
    Authors
    Iain H. Moal; Paul A. Bates
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results for feature selection, model selection and validation, using the two selection criteria and the four data partitioning schemes. The outlier, 2OZA, was omitted from these runs. The number of features for the and models is shown (#), alongside their leave-one-out cross-validation correlations and RMSE. The RMSE and correlation of the values used for selecting these models is also shown, as are those when the model is applied to the validation set, along with the significance of correlation.

  17. Language modeling data for Swahili

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    + more versions
    Cite
    Shivachi Casper Shikali; Mokhosi Refuoe (2020). Language modeling data for Swahili [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3553422
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University of Electronic Science and Technology of China
    Authors
    Shivachi Casper Shikali; Mokhosi Refuoe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Swahili dataset was developed specifically for the language modeling task. The dataset contains 28,000 unique words, with 6.84M, 970k, and 2M words for the train, valid and test partitions respectively, which represent the ratio 80:10:10. The entire dataset is lowercased, has no punctuation marks, and the start and end of sentence markers have been incorporated to facilitate easy tokenization during language modeling. The train partition is the largest in order to support unsupervised learning of word representations, while the hyper-parameters are adjusted based on performance on the valid partition before evaluating the language model on the test partition.
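    The preprocessing described above (lowercasing, punctuation removal, sentence markers) can be sketched as follows; the marker tokens <s> and </s> and the sample sentence are assumptions, not taken from the dataset:

```python
import string

def preprocess(sentence, bos="<s>", eos="</s>"):
    """Lowercase, strip punctuation marks, and add start/end-of-sentence
    markers, mirroring the preprocessing described for this corpus."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return f"{bos} {cleaned.strip()} {eos}"

line = preprocess("Habari ya asubuhi!")
```

    The explicit markers make whitespace tokenization sufficient for building language-model training sequences.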

  18. DataSheet1_Use of the linear regression method to evaluate population...

    • datasetcatalog.nlm.nih.gov
    Updated Jun 4, 2024
    + more versions
    Cite
    Fernando, Rohan L.; Yu, Haipeng; Dekkers, Jack C. M. (2024). DataSheet1_Use of the linear regression method to evaluate population accuracy of predictions from non-linear models.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001361346
    Dataset updated
    Jun 4, 2024
    Authors
    Fernando, Rohan L.; Yu, Haipeng; Dekkers, Jack C. M.
    Description

    Background: To address the limitations of commonly used cross-validation methods, the linear regression (LR) method was proposed to estimate the population accuracy of predictions under the implicit assumption that the fitted model is correct. The method also provides two statistics for assessing the adequacy of the fitted model. The validity and behavior of the LR method have been established and studied for linear predictions, but not for nonlinear predictions. The objectives of this study were to 1) provide a mathematical proof of the validity of the LR method when predictions are based on conditional means, regardless of whether those predictions are linear or non-linear, 2) investigate the ability of the LR method to detect whether the fitted model is adequate or inadequate, and 3) provide guidelines on how to partition the data into training and validation sets such that the LR method can identify an inadequate model.

    Results: We present a mathematical proof of the validity of the LR method for estimating population accuracy and for determining whether the fitted model is adequate when the predictor is the conditional mean, which may be a non-linear function of the phenotype. Using three partitioning scenarios of simulated data, we show that one of the LR statistics can detect an inadequate model only when the data are partitioned such that the values of relevant predictor variables differ between the training and validation sets. In contrast, the other LR statistic was able to detect an inadequate model in all three scenarios.

    Conclusion: The LR method was proposed to address some limitations of the traditional cross-validation approach in genetic evaluation. In this paper, we showed that the LR method is valid when the model is adequate and the conditional mean is the predictor, even when it is a non-linear function of the phenotype. We found that one of the two LR statistics is superior because it detected an inadequate model in all three partitioning scenarios studied (between animals, by age within animals, and between animals and by age).
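    The abstract above does not define the two LR statistics. As an illustrative sketch only, one common formulation of the LR method compares predictions from a partial (training-only) analysis with predictions from the whole data, yielding a bias statistic (expected near 0 for an adequate model) and a dispersion slope (expected near 1). The names and formulas below are assumptions for illustration, not taken from this abstract.

    ```python
    import numpy as np

    def lr_statistics(u_partial, u_whole):
        """Illustrative LR-method statistics.

        u_partial: predictions from the partial (training-only) analysis
        u_whole:   predictions for the same individuals from the whole data

        Returns (bias, slope): bias ~ 0 and slope ~ 1 are expected when
        the fitted model is adequate (formulas assumed for illustration).
        """
        u_p = np.asarray(u_partial, dtype=float)
        u_w = np.asarray(u_whole, dtype=float)
        # Bias: mean change in predictions when the validation data are added
        bias = np.mean(u_w - u_p)
        # Dispersion: regression slope of whole-data on partial-data predictions
        slope = np.cov(u_w, u_p, ddof=1)[0, 1] / np.var(u_p, ddof=1)
        return bias, slope
    ```

    In this sketch, partitioning the data so that relevant predictor variables differ between training and validation (as the study recommends) is what gives these statistics the power to expose an inadequate model.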

  19. comma-v0.1-toksuite-detokenized

    • huggingface.co
    Updated Nov 18, 2025
    Cite
    TokSuite (2025). comma-v0.1-toksuite-detokenized [Dataset]. https://huggingface.co/datasets/toksuite/comma-v0.1-toksuite-detokenized
    Explore at:
    Dataset updated
    Nov 18, 2025
    Dataset authored and provided by
    TokSuite
    Description

    The training data of the model, detokenized in the exact order in which the model saw it. The data is partitioned into 8 chunks (chunk-0 through chunk-7) based on the GPU rank that generated them. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
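    The chunk layout described above (chunk-0 through chunk-7, each holding .jsonl files) can be iterated in training order with a short helper. The directory names come from the dataset description; the per-file naming and the record fields are assumptions, so adjust them to the actual files.

    ```python
    import json
    from pathlib import Path

    def iter_chunk_records(root, chunk_ids=range(8)):
        """Yield JSON records from each chunk-<k> directory in order.

        The chunk-0 .. chunk-7 layout is from the dataset card; sorting
        .jsonl files lexicographically and the record schema are assumptions.
        """
        for k in chunk_ids:
            chunk_dir = Path(root) / f"chunk-{k}"
            for path in sorted(chunk_dir.glob("*.jsonl")):
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        if line.strip():  # skip blank lines
                            yield json.loads(line)
    ```

    Streaming line by line keeps memory use flat even when individual chunk files are large.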

  20. Data and trained model for iPXRDnet

    • zenodo.org
    zip
    Updated Nov 20, 2024
    Cite
    Yang Zhenglu; Yang Zhenglu (2024). Data and trained model for iPXRDnet [Dataset]. http://doi.org/10.5281/zenodo.14129317
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yang Zhenglu; Yang Zhenglu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a collection of the datasets and model checkpoints used in iPXRDnet.


    Model checkpoint files:
    hmof-130T_Hydrogen: adsorption prediction model for H2, trained on the hMOF-130T database
    hmof-130T_CarbonDioxide: adsorption prediction model for CO2, trained on the hMOF-130T database
    hmof-130T_Nitrogen: adsorption prediction model for N2, trained on the hMOF-130T database
    hmof-130T_Methane: adsorption prediction model for CH4, trained on the hMOF-130T database
    hmof-300T: adsorption prediction model trained on the hMOF-300T database
    Gas_Se: trained separation-selectivity prediction model
    Gas_SD: trained self-diffusion-coefficient prediction model
    MOD: trained bulk-modulus and shear-modulus prediction model
    exAPMOF-1bar-ALM+PXRD: experimental adsorption at 1 bar for anion-pillared MOFs, trained with PXRD and material ligands
    exAPMOF-1bar-ALM: experimental adsorption at 1 bar for anion-pillared MOFs, trained with material ligands only
    exAPMOF-1bar-PXRD: experimental adsorption at 1 bar for anion-pillared MOFs, trained with PXRD only
    exAPMOF-ISO: experimental adsorption-isotherm model for anion-pillared MOFs
    exAPMOF-1bar-NOacvPXRD: experimental adsorption at 1 bar for anion-pillared MOFs, trained with pre-activation PXRD data only
    exAPMOF-1bar-acvPXRD: experimental adsorption at 1 bar for anion-pillared MOFs, trained with post-activation PXRD data only

    Dataset files:
    hmof-xrd+str+ad: PXRD, gas adsorption, and structural features of the hMOF-300T database
    hMOF-130T_ad_list_mof: gas adsorption data of the hMOF-130T database
    hMOF-130T_GAS_DICT: gas descriptor data of the hMOF-130T database
    hMOF-130T_STR_DICT: structural feature data of the hMOF-130T database
    hMOF-130T_PXRD_DICT: PXRD data of the hMOF-130T database
    MOD_data: bulk modulus and shear modulus data of Moghadam's MOFs
    MOD_PXRD_dict: PXRD data of Moghadam's MOFs
    GAS_SD-data: self-diffusion coefficient data in the CoREMOF database
    SE-CO2,N2_data: separation selectivity, PXRD, and structural features of the CO2/N2 selectivity database
    Sa_sp: dataset partitioning results of the CO2/N2 selectivity database
    gas_dict: gas descriptor data used in the self-diffusion coefficient database
    PXRD_DICT: post-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
    xrd_noacv: pre-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
    Smiles_ads: SMILES data of gases in the anion-pillared MOFs' experimental database
    all_exAPMOF-1bar: anion-pillared MOFs' experimental adsorption data at 298 K and 1 bar
    all_exAPMOF-1bar-NOacv: experimental adsorption data for anion-pillared MOFs with pre-activation PXRD, at 298 K and 1 bar
    exAPMOF_DICT: SMILES data of MOF ligands and descriptors of metal centers in the anion-pillared MOFs' experimental database
    all_exAPMOF-iso: key library of MOF and gas combinations in the anion-pillared MOFs' experimental isotherm database
    exAPMOF_ISOdata: anion-pillared MOFs' experimental adsorption isotherm data at 298 K
