7 datasets found
  1. Input data, model output, and R scripts for a machine learning streamflow...

    • data.usgs.gov
    • datasets.ai
    • +1 more
    Updated Nov 19, 2021
    Cite
    Ryan McShane; Cheryl Miller (2021). Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17 [Dataset]. http://doi.org/10.5066/P9XCP1AE
    Dataset updated
    Nov 19, 2021
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Ryan McShane; Cheryl Miller
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Jan 1, 2012 - Dec 31, 2017
    Area covered
    Wyoming, Wyoming Range
    Description

    A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 p ...
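
    For orientation, a minimal Python/pandas sketch of loading the two tabular files described above. The modeling code itself is in R (the Rscripts folder); this snippet only inspects the CSVs and assumes the release has been unzipped into the working directory.

    ```python
    # Minimal sketch, not part of the release: the MLFLOW model lives in the
    # Rscripts folder. File names come from the dataset description; paths
    # assume the release has been unzipped into the current directory.
    import pandas as pd

    # Calibration/validation data: 971 discrete observations and predictions
    fitting = pd.read_csv("Model_Fitting_Site_Data.csv")
    print(fitting.shape)

    # Continuous predictions: 17,518 stream grid cells x 72 months = 1,261,296 rows
    streams = pd.read_csv("Model_Prediction_Stream_Data.csv")
    print(streams.shape)
    ```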

  2. The resulting ranking based on the expected relative feature contribution...

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Pål Vegard Johnsen; Inga Strümke; Mette Langaas; Andrew Thomas DeWan; Signe Riemer-Sørensen (2023). The resulting ranking based on the expected relative feature contribution (ERFC) for the particular XGBoost model investigated based on training data consisting of 64 000 individuals from UK Biobank. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010963.t003
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Pål Vegard Johnsen; Inga Strümke; Mette Langaas; Andrew Thomas DeWan; Signe Riemer-Sørensen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The resulting ranking based on the expected relative feature contribution (ERFC) for the particular XGBoost model investigated based on training data consisting of 64 000 individuals from UK Biobank.
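
    For context only, a small Python sketch of one common way to derive such a ranking: averaging absolute SHAP contributions from a fitted XGBoost model and normalising them to relative shares. The exact ERFC definition is given in the cited article, so this is a generic proxy on synthetic data, not the authors' computation or the UK Biobank data.

    ```python
    # Generic illustration on synthetic data; the paper's ERFC and the UK Biobank
    # training data are described in the cited article, not reproduced here.
    import numpy as np
    import shap
    import xgboost

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                           # synthetic features
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)

    mean_abs = np.abs(shap_values).mean(axis=0)
    relative = mean_abs / mean_abs.sum()                    # relative contribution per feature
    for idx in np.argsort(relative)[::-1]:
        print(f"feature {idx}: {relative[idx]:.3f}")
    ```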

  3. Data from: Evaluation of QSAR models for predicting mutagenicity: outcome of...

    • tandf.figshare.com
    xlsx
    Updated Dec 4, 2023
    Cite
    A. Furuhama; A. Kitazawa; J. Yao; C.E. Matos dos Santos; J. Rathman; C. Yang; J.V. Ribeiro; K. Cross; G. Myatt; G. Raitano; E. Benfenati; N. Jeliazkova; R. Saiakhov; S. Chakravarti; R.S. Foster; C. Bossa; C. Laura Battistelli; R. Benigni; T. Sawada; H. Wasada; T. Hashimoto; M. Wu; R. Barzilay; P.R. Daga; R.D. Clark; J. Mestres; A. Montero; E. Gregori-Puigjané; P. Petkov; H. Ivanova; O. Mekenyan; S. Matthews; D. Guan; J. Spicer; R. Lui; Y. Uesawa; K. Kurosaki; Y. Matsuzaka; S. Sasaki; M.T.D. Cronin; S.J. Belfield; J.W. Firman; N. Spînu; M. Qiu; J.M. Keca; G. Gini; T. Li; W. Tong; H. Hong; Z. Liu; Y. Igarashi; H. Yamada; K.-I. Sugiyama; M. Honma (2023). Evaluation of QSAR models for predicting mutagenicity: outcome of the Second Ames/QSAR international challenge project [Dataset]. http://doi.org/10.6084/m9.figshare.24720632.v1
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    A. Furuhama; A. Kitazawa; J. Yao; C.E. Matos dos Santos; J. Rathman; C. Yang; J.V. Ribeiro; K. Cross; G. Myatt; G. Raitano; E. Benfenati; N. Jeliazkova; R. Saiakhov; S. Chakravarti; R.S. Foster; C. Bossa; C. Laura Battistelli; R. Benigni; T. Sawada; H. Wasada; T. Hashimoto; M. Wu; R. Barzilay; P.R. Daga; R.D. Clark; J. Mestres; A. Montero; E. Gregori-Puigjané; P. Petkov; H. Ivanova; O. Mekenyan; S. Matthews; D. Guan; J. Spicer; R. Lui; Y. Uesawa; K. Kurosaki; Y. Matsuzaka; S. Sasaki; M.T.D. Cronin; S.J. Belfield; J.W. Firman; N. Spînu; M. Qiu; J.M. Keca; G. Gini; T. Li; W. Tong; H. Hong; Z. Liu; Y. Igarashi; H. Yamada; K.-I. Sugiyama; M. Honma
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Quantitative structure−activity relationship (QSAR) models are powerful in silico tools for predicting the mutagenicity of unstable compounds, impurities and metabolites that are difficult to examine using the Ames test. Ideally, Ames/QSAR models for regulatory use should demonstrate high sensitivity, low false-negative rate and wide coverage of chemical space. To promote superior model development, the Division of Genetics and Mutagenesis, National Institute of Health Sciences, Japan (DGM/NIHS), conducted the Second Ames/QSAR International Challenge Project (2020–2022) as a successor to the First Project (2014–2017), with 21 teams from 11 countries participating. The DGM/NIHS provided a curated training dataset of approximately 12,000 chemicals and a trial dataset of approximately 1,600 chemicals, and each participating team predicted the Ames mutagenicity of each trial chemical using various Ames/QSAR models. The DGM/NIHS then provided the Ames test results for trial chemicals to assist in model improvement. Although overall model performance on the Second Project was not superior to that on the First, models from the eight teams participating in both projects achieved higher sensitivity than models from teams participating in only the Second Project. Thus, these evaluations have facilitated the development of QSAR models.
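
    As a reminder of the headline metrics named above, a short Python sketch computing sensitivity and false-negative rate for binary mutagenicity calls; the labels and predictions below are made-up placeholders, not challenge data.

    ```python
    # Toy example of the metrics emphasised above; values are placeholders,
    # not results from the Ames/QSAR challenge datasets.
    import numpy as np

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = Ames-positive (mutagenic)
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model calls

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # true positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # false negatives

    sensitivity = tp / (tp + fn)            # fraction of true mutagens detected
    false_negative_rate = fn / (tp + fn)    # equals 1 - sensitivity
    print(f"sensitivity = {sensitivity:.2f}, FNR = {false_negative_rate:.2f}")
    ```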

  4. Scaling laws in antibody language models reveal data-constrained optima

    • zenodo.org
    Updated May 17, 2025
    Cite
    Mahdi Shafiei Neyestanak; Bryan Briney (2025). Scaling laws in antibody language models reveal data-constrained optima [Dataset]. http://doi.org/10.5281/zenodo.15447079
    Dataset updated
    May 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahdi Shafiei Neyestanak; Bryan Briney
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.

    Results: This study pretrained ESM-2 architecture models across five distinct parameterizations (8 million to 650 million weights) and three training data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid identity prediction, antibody specificity classification, and native heavy-light chain pairing recognition. Findings reveal that increasing model size does not monotonically improve performance; for instance, with the full dataset, loss began to increase beyond ~163M parameters. The 350M parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.

    Conclusion: These results underscore that in data-constrained domains like antibody sequences, strategically balancing model capacity with dataset size is crucial, as simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization.
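
    To make the cross-entropy evaluation above concrete, a minimal sketch using the Hugging Face transformers API; a public ESM-2 checkpoint stands in for the pretrained AbLMs in this record, and the sequence is a placeholder. The study's actual masking scheme and test sets are in the linked code.

    ```python
    # Illustrative only: a public ESM-2 checkpoint stands in for the AbLMs in
    # this record, and the sequence is a placeholder heavy-chain fragment.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    checkpoint = "facebook/esm2_t6_8M_UR50D"       # public 8M-parameter ESM-2 model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

    sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"          # placeholder antibody fragment
    inputs = tokenizer(sequence, return_tensors="pt")

    with torch.no_grad():
        # Passing the input ids as labels yields the mean cross-entropy loss over
        # positions; a faithful evaluation would mask positions before scoring.
        outputs = model(**inputs, labels=inputs["input_ids"])

    print(f"cross-entropy loss: {outputs.loss.item():.3f}")
    ```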

    Files. The following files are included in this repository:

    • model_weights.zip: Model weights for all pre-trained AbLMs in the study. The models can also be downloaded from HuggingFace.
    • train-eval-test.zip: The datasets used for training all models, with sequences obtained from Jaffe et al. and Hurtado et al., are provided in a compressed folder (see the loading sketch at the end of this entry). This folder contains three subfolders (Full_data, Half_data, and Quarter_data), each containing the training data used for the models. Specifically, the Full_data subfolder is further organized into training, eval, and test subdirectories, which respectively contain the train_dataset.csv, validation_dataset.csv, and test_dataset.csv files.
    • HD_vs_COV.csv.zip: The paired antibody sequences that were used for the antibody specificity binary classification task. The Coronavirus (CoV) antibody sequences included were sourced from the CoV-AbDab database.
    • hd-0_CoV-1_flu-2.csv.zip: Paired antibody sequences utilized for the 3-way antibody specificity classification task, distinguishing between Healthy Donor (HD), Coronavirus (CoV), and Influenza (Flu) specific Abs. The influenza-specific antibody sequences included in this dataset were sourced from Wang et al.
    • shuffled_data.csv.zip: Contains the dataset used for the native vs. shuffled paired antibody sequence classification task. This dataset is derived from the test_dataset.csv.
    • per_position_inference.zip: The dataset utilized for per-residue prediction by the full-data models, including both unmutated and mutated antibody sequences.
    • test_datasets.zip: A compressed folder that contains twelve distinct test sets that were not utilized during model training. These datasets were specifically used for evaluating pretrained models and generating Cross-entropy loss curves. The data originates from both in-house laboratory sources and a study conducted by Ng et al.

    Code: The code for model training and evaluation is available under the MIT license on GitHub.
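
    The loading sketch referenced in the file list above: a minimal Python snippet reading the Full_data split once train-eval-test.zip has been extracted. The extraction directory name is an assumption; the subfolder and file names follow the layout described in the file list.

    ```python
    # Assumes train-eval-test.zip has been extracted to ./train-eval-test/
    # (directory name is an assumption); subfolder and file names follow the
    # layout described in the file list above.
    from pathlib import Path
    import pandas as pd

    root = Path("train-eval-test") / "Full_data"

    train = pd.read_csv(root / "training" / "train_dataset.csv")
    val = pd.read_csv(root / "eval" / "validation_dataset.csv")
    test = pd.read_csv(root / "test" / "test_dataset.csv")

    print(len(train), len(val), len(test))
    ```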

  5. Performance metrics of the Dynamic Criticality Index models developed from...

    • plos.figshare.com
    xls
    Updated Jan 29, 2024
    Cite
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack (2024). Performance metrics of the Dynamic Criticality Index models developed from the multi-institutional database applied to the single-site test dataset (A) and the single-site Dynamic Criticality Index models applied to the single-site test dataset (B). [Dataset]. http://doi.org/10.1371/journal.pone.0288233.t002
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Performance metrics of the Dynamic Criticality Index models developed from the multi-institutional database applied to the single-site test dataset (A) and the single-site Dynamic Criticality Index models applied to the single-site test dataset (B).

  6. Population characteristics of children’s national patient sample.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jan 29, 2024
    Cite
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack (2024). Population characteristics of children’s national patient sample. [Dataset]. http://doi.org/10.1371/journal.pone.0288233.t001
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Population characteristics of children’s national patient sample.

  7. Overall distribution of training, validation, and test data.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Keshab Raj Dahal; Nawa Raj Pokhrel; Santosh Gaire; Sharad Mahatara; Rajendra P. Joshi; Ankrit Gupta; Huta R. Banjade; Jeorge Joshi (2023). Overall distribution of training, validation, and test data. [Dataset]. http://doi.org/10.1371/journal.pone.0284695.t005
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Keshab Raj Dahal; Nawa Raj Pokhrel; Santosh Gaire; Sharad Mahatara; Rajendra P. Joshi; Ankrit Gupta; Huta R. Banjade; Jeorge Joshi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Overall distribution of training, validation, and test data.

