7 datasets found
  1. Input data, model output, and R scripts for a machine learning streamflow...

    • data.usgs.gov
    • datasets.ai
    • +1 more
    Updated Nov 19, 2021
    Cite
    Ryan McShane; Cheryl Miller (2021). Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17 [Dataset]. http://doi.org/10.5066/P9XCP1AE
    Dataset updated
    Nov 19, 2021
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Ryan McShane; Cheryl Miller
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Jan 1, 2012 - Dec 31, 2017
    Area covered
    Wyoming, Wyoming Range
    Description

    A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 p ...
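
    For orientation, a minimal Python/pandas sketch of loading the two tabular files described above. The modeling code itself is in R (the Rscripts folder); this snippet only inspects the CSVs and assumes the release has been unzipped into the working directory.

    ```python
    # Minimal sketch, not part of the release: the MLFLOW model lives in the
    # Rscripts folder. File names come from the dataset description; paths
    # assume the release has been unzipped into the current directory.
    import pandas as pd

    # Calibration/validation data: 971 discrete observations and predictions
    fitting = pd.read_csv("Model_Fitting_Site_Data.csv")
    print(fitting.shape)

    # Continuous predictions: 17,518 stream grid cells x 72 months = 1,261,296 rows
    streams = pd.read_csv("Model_Prediction_Stream_Data.csv")
    print(streams.shape)
    ```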

  2. The resulting ranking based on the expected relative feature contribution...

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Pål Vegard Johnsen; Inga Strümke; Mette Langaas; Andrew Thomas DeWan; Signe Riemer-Sørensen (2023). The resulting ranking based on the expected relative feature contribution (ERFC) for the particular XGBoost model investigated based on training data consisting of 64 000 individuals from UK Biobank. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010963.t003
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Pål Vegard Johnsen; Inga Strümke; Mette Langaas; Andrew Thomas DeWan; Signe Riemer-Sørensen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The resulting ranking based on the expected relative feature contribution (ERFC) for the particular XGBoost model investigated based on training data consisting of 64 000 individuals from UK Biobank.
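
    For context only, a small Python sketch of one common way to derive such a ranking: averaging absolute SHAP contributions from a fitted XGBoost model and normalising them to relative shares. The exact ERFC definition is given in the cited article, so this is a generic proxy on synthetic data, not the authors' computation or the UK Biobank data.

    ```python
    # Generic illustration on synthetic data; the paper's ERFC and the UK Biobank
    # training data are described in the cited article, not reproduced here.
    import numpy as np
    import shap
    import xgboost

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                           # synthetic features
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)

    mean_abs = np.abs(shap_values).mean(axis=0)
    relative = mean_abs / mean_abs.sum()                    # relative contribution per feature
    for idx in np.argsort(relative)[::-1]:
        print(f"feature {idx}: {relative[idx]:.3f}")
    ```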

  3. Data from: Evaluation of QSAR models for predicting mutagenicity: outcome of...

    • tandf.figshare.com
    xlsx
    Updated Dec 4, 2023
    Cite
    A. Furuhama; A. Kitazawa; J. Yao; C.E. Matos dos Santos; J. Rathman; C. Yang; J.V. Ribeiro; K. Cross; G. Myatt; G. Raitano; E. Benfenati; N. Jeliazkova; R. Saiakhov; S. Chakravarti; R.S. Foster; C. Bossa; C. Laura Battistelli; R. Benigni; T. Sawada; H. Wasada; T. Hashimoto; M. Wu; R. Barzilay; P.R. Daga; R.D. Clark; J. Mestres; A. Montero; E. Gregori-Puigjané; P. Petkov; H. Ivanova; O. Mekenyan; S. Matthews; D. Guan; J. Spicer; R. Lui; Y. Uesawa; K. Kurosaki; Y. Matsuzaka; S. Sasaki; M.T.D. Cronin; S.J. Belfield; J.W. Firman; N. Spînu; M. Qiu; J.M. Keca; G. Gini; T. Li; W. Tong; H. Hong; Z. Liu; Y. Igarashi; H. Yamada; K.-I. Sugiyama; M. Honma (2023). Evaluation of QSAR models for predicting mutagenicity: outcome of the Second Ames/QSAR international challenge project [Dataset]. http://doi.org/10.6084/m9.figshare.24720632.v1
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    A. Furuhama; A. Kitazawa; J. Yao; C.E. Matos dos Santos; J. Rathman; C. Yang; J.V. Ribeiro; K. Cross; G. Myatt; G. Raitano; E. Benfenati; N. Jeliazkova; R. Saiakhov; S. Chakravarti; R.S. Foster; C. Bossa; C. Laura Battistelli; R. Benigni; T. Sawada; H. Wasada; T. Hashimoto; M. Wu; R. Barzilay; P.R. Daga; R.D. Clark; J. Mestres; A. Montero; E. Gregori-Puigjané; P. Petkov; H. Ivanova; O. Mekenyan; S. Matthews; D. Guan; J. Spicer; R. Lui; Y. Uesawa; K. Kurosaki; Y. Matsuzaka; S. Sasaki; M.T.D. Cronin; S.J. Belfield; J.W. Firman; N. Spînu; M. Qiu; J.M. Keca; G. Gini; T. Li; W. Tong; H. Hong; Z. Liu; Y. Igarashi; H. Yamada; K.-I. Sugiyama; M. Honma
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Quantitative structure−activity relationship (QSAR) models are powerful in silico tools for predicting the mutagenicity of unstable compounds, impurities and metabolites that are difficult to examine using the Ames test. Ideally, Ames/QSAR models for regulatory use should demonstrate high sensitivity, low false-negative rate and wide coverage of chemical space. To promote superior model development, the Division of Genetics and Mutagenesis, National Institute of Health Sciences, Japan (DGM/NIHS), conducted the Second Ames/QSAR International Challenge Project (2020–2022) as a successor to the First Project (2014–2017), with 21 teams from 11 countries participating. The DGM/NIHS provided a curated training dataset of approximately 12,000 chemicals and a trial dataset of approximately 1,600 chemicals, and each participating team predicted the Ames mutagenicity of each trial chemical using various Ames/QSAR models. The DGM/NIHS then provided the Ames test results for trial chemicals to assist in model improvement. Although overall model performance on the Second Project was not superior to that on the First, models from the eight teams participating in both projects achieved higher sensitivity than models from teams participating in only the Second Project. Thus, these evaluations have facilitated the development of QSAR models.
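
    As a reminder of the headline metrics named above, a short Python sketch computing sensitivity and false-negative rate for binary mutagenicity calls; the labels and predictions below are made-up placeholders, not challenge data.

    ```python
    # Toy example of the metrics emphasised above; values are placeholders,
    # not results from the Ames/QSAR challenge datasets.
    import numpy as np

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = Ames-positive (mutagenic)
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model calls

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # true positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # false negatives

    sensitivity = tp / (tp + fn)            # fraction of true mutagens detected
    false_negative_rate = fn / (tp + fn)    # equals 1 - sensitivity
    print(f"sensitivity = {sensitivity:.2f}, FNR = {false_negative_rate:.2f}")
    ```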

  4. Scaling laws in antibody language models reveal data-constrained optima

    • zenodo.org
    Updated May 17, 2025
    Cite
    Mahdi Shafiei Neyestanak; Bryan Briney (2025). Scaling laws in antibody language models reveal data-constrained optima [Dataset]. http://doi.org/10.5281/zenodo.15447079
    Dataset updated
    May 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahdi Shafiei Neyestanak; Bryan Briney
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.

    Results: This study pretrained ESM-2 architecture models across five distinct parameterizations (8 million to 650 million weights) and three training data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid identity prediction, antibody specificity classification, and native heavy-light chain pairing recognition. Findings reveal that increasing model size does not monotonically improve performance; for instance, with the full dataset, loss began to increase beyond ~163M parameters. The 350M parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.

    Conclusion: These results underscore that in data-constrained domains like antibody sequences, strategically balancing model capacity with dataset size is crucial, as simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization.
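
    To make the cross-entropy evaluation above concrete, a minimal sketch using the Hugging Face transformers API; a public ESM-2 checkpoint stands in for the pretrained AbLMs in this record, and the sequence is a placeholder. The study's actual masking scheme and test sets are in the linked code.

    ```python
    # Illustrative only: a public ESM-2 checkpoint stands in for the AbLMs in
    # this record, and the sequence is a placeholder heavy-chain fragment.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    checkpoint = "facebook/esm2_t6_8M_UR50D"       # public 8M-parameter ESM-2 model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

    sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"          # placeholder antibody fragment
    inputs = tokenizer(sequence, return_tensors="pt")

    with torch.no_grad():
        # Passing the input ids as labels yields the mean cross-entropy loss over
        # positions; a faithful evaluation would mask positions before scoring.
        outputs = model(**inputs, labels=inputs["input_ids"])

    print(f"cross-entropy loss: {outputs.loss.item():.3f}")
    ```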

    Files. The following files are included in this repository:

    • model_weights.zip: Model weights for all pre-trained AbLMs in the study. The models can also be downloaded from HuggingFace.
    • train-eval-test.zip: The datasets used for training all models, with sequences obtained from Jaffe et al. and Hurtado et al., are provided in a compressed folder (see the loading sketch at the end of this entry). This folder contains three subfolders (Full_data, Half_data, and Quarter_data), each containing the training data used for the models. Specifically, the Full_data subfolder is further organized into training, eval, and test subdirectories, which respectively contain the train_dataset.csv, validation_dataset.csv, and test_dataset.csv files.
    • HD_vs_COV.csv.zip: The paired antibody sequences that were used for the antibody specificity binary classification task. The Coronavirus (CoV) antibody sequences included were sourced from the CoV-AbDab database.
    • hd-0_CoV-1_flu-2.csv.zip: Paired antibody sequences utilized for the 3-way antibody specificity classification task, distinguishing between Healthy Donor (HD), Coronavirus (CoV), and Influenza (Flu) specific Abs. The influenza-specific antibody sequences included in this dataset were sourced from Wang et al.
    • shuffled_data.csv.zip: Contains the dataset used for the native vs. shuffled paired antibody sequence classification task. This dataset is derived from the test_dataset.csv.
    • per_position_inference.zip: The dataset utilized for per-residue prediction by the full-data models, including both unmutated and mutated antibody sequences.
    • test_datasets.zip: A compressed folder that contains twelve distinct test sets that were not utilized during model training. These datasets were specifically used for evaluating pretrained models and generating Cross-entropy loss curves. The data originates from both in-house laboratory sources and a study conducted by Ng et al.

    Code: The code for model training and evaluation is available under the MIT license on GitHub.
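
    The loading sketch referenced in the file list above: a minimal Python snippet reading the Full_data split once train-eval-test.zip has been extracted. The extraction directory name is an assumption; the subfolder and file names follow the layout described in the file list.

    ```python
    # Assumes train-eval-test.zip has been extracted to ./train-eval-test/
    # (directory name is an assumption); subfolder and file names follow the
    # layout described in the file list above.
    from pathlib import Path
    import pandas as pd

    root = Path("train-eval-test") / "Full_data"

    train = pd.read_csv(root / "training" / "train_dataset.csv")
    val = pd.read_csv(root / "eval" / "validation_dataset.csv")
    test = pd.read_csv(root / "test" / "test_dataset.csv")

    print(len(train), len(val), len(test))
    ```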

  5. Performance metrics of the Dynamic Criticality Index models developed from...

    • plos.figshare.com
    xls
    Updated Jan 29, 2024
    Cite
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack (2024). Performance metrics of the Dynamic Criticality Index models developed from the multi-institutional database applied to the single-site test dataset (A) and the single-site Dynamic Criticality Index models applied to the single-site test dataset (B). [Dataset]. http://doi.org/10.1371/journal.pone.0288233.t002
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Performance metrics of the Dynamic Criticality Index models developed from the multi-institutional database applied to the single-site test dataset (A) and the single-site Dynamic Criticality Index models applied to the single-site test dataset (B).

  6. Population characteristics of children’s national patient sample.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jan 29, 2024
    Cite
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack (2024). Population characteristics of children’s national patient sample. [Dataset]. http://doi.org/10.1371/journal.pone.0288233.t001
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Anita K. Patel; Eduardo Trujillo-Rivera; James M. Chamberlain; Hiroki Morizono; Murray M. Pollack
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Population characteristics of children’s national patient sample.

  7. Overall distribution of training, validation, and test data.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Keshab Raj Dahal; Nawa Raj Pokhrel; Santosh Gaire; Sharad Mahatara; Rajendra P. Joshi; Ankrit Gupta; Huta R. Banjade; Jeorge Joshi (2023). Overall distribution of training, validation, and test data. [Dataset]. http://doi.org/10.1371/journal.pone.0284695.t005
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Keshab Raj Dahal; Nawa Raj Pokhrel; Santosh Gaire; Sharad Mahatara; Rajendra P. Joshi; Ankrit Gupta; Huta R. Banjade; Jeorge Joshi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Overall distribution of training, validation, and test data.

