U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 p ...
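As a quick sanity check on the table sizes reported above, the prediction row count should equal the number of stream grid cells times the number of months in the study period (a minimal sketch; the variable names are illustrative, not from the dataset's R scripts):

```python
# Sanity check of the reported prediction table size: one row per
# stream grid cell per month of the study period.
n_cells = 17_518   # stream grid cells in Streams.tif
n_months = 72      # monthly time steps, 2012-2017
print(n_cells * n_months)  # 1261296
```

The product matches the 1,261,296 rows reported for Model_Prediction_Stream_Data.csv.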
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The resulting ranking by expected relative feature contribution (ERFC) for the XGBoost model investigated, which was trained on data from 64,000 UK Biobank individuals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative structure–activity relationship (QSAR) models are powerful in silico tools for predicting the mutagenicity of unstable compounds, impurities, and metabolites that are difficult to examine using the Ames test. Ideally, Ames/QSAR models for regulatory use should demonstrate high sensitivity, a low false-negative rate, and wide coverage of chemical space. To promote superior model development, the Division of Genetics and Mutagenesis, National Institute of Health Sciences, Japan (DGM/NIHS), conducted the Second Ames/QSAR International Challenge Project (2020–2022) as a successor to the First Project (2014–2017), with 21 teams from 11 countries participating. The DGM/NIHS provided a curated training dataset of approximately 12,000 chemicals and a trial dataset of approximately 1,600 chemicals, and each participating team predicted the Ames mutagenicity of each trial chemical using various Ames/QSAR models. The DGM/NIHS then provided the Ames test results for trial chemicals to assist in model improvement. Although overall model performance on the Second Project was not superior to that on the First, models from the eight teams participating in both projects achieved higher sensitivity than models from teams participating in only the Second Project. Thus, these evaluations have facilitated the development of QSAR models.
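The evaluation criteria named above (sensitivity and false-negative rate) can be sketched from a confusion matrix; the counts below are purely illustrative, not results from the challenge:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true mutagens the model correctly flags (recall on positives)."""
    return tp / (tp + fn)

def false_negative_rate(tp: int, fn: int) -> float:
    """Fraction of true mutagens the model misses; complements sensitivity."""
    return fn / (tp + fn)

# Illustrative counts for a hypothetical QSAR model on trial chemicals.
tp, fn = 80, 20
print(sensitivity(tp, fn))          # 0.8
print(false_negative_rate(tp, fn))  # 0.2
```

Note the two metrics sum to 1, which is why the abstract treats high sensitivity and a low false-negative rate as the same regulatory goal.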
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.
Results: This study pretrained ESM-2 architecture models across five distinct parameterizations (8 million to 650 million weights) and three training data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid identity prediction, antibody specificity classification, and native heavy-light chain pairing recognition. Findings reveal that increasing model size does not monotonically improve performance; for instance, with the full dataset, loss began to increase beyond ~163M parameters. The 350M parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.
Conclusion: These results underscore that in data-constrained domains like antibody sequences, strategically balancing model capacity with dataset size is crucial, as simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization.
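The per-position cross-entropy used above to compare parameterizations can be sketched as follows; the token distribution is illustrative, not taken from the study:

```python
import math

def cross_entropy(probs: list[float], true_index: int) -> float:
    """Negative log-likelihood of the true amino acid at one sequence position."""
    return -math.log(probs[true_index])

# Toy model output over four amino-acid tokens; the true token is index 2.
probs = [0.1, 0.2, 0.6, 0.1]
print(round(cross_entropy(probs, 2), 4))  # 0.5108
```

Averaging this quantity over masked positions gives the loss curve that revealed the non-monotonic scaling behavior (loss rising beyond ~163M parameters on the full dataset).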
Files. The following files are included in this repository:
Code: The code for model training and evaluation is available under the MIT license on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance metrics of the Dynamic Criticality Index models developed from the multi-institutional database applied to the single-site test dataset (A) and the single-site Dynamic Criticality Index models applied to the single-site test dataset (B).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Population characteristics of the Children's National patient sample.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overall distribution of training, validation, and test data.