5 datasets found

m
Data, Models, and Python Code For: Machine learning models and performance...
data.mendeley.com
Updated May 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Armen Beck (2024). Data, Models, and Python Code For: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals [Dataset]. http://doi.org/10.17632/x925rnxzcb.1
Explore at:
Unique identifier
https://doi.org/10.17632/x925rnxzcb.1
Dataset updated
May 31, 2024
Authors
Armen Beck
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Training data, models, and python code for the manuscript: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals.

MOE descriptors of the METLIN SMRT dataset (original by Domingo-Almenara et. al. with RTs and structures available at: https://figshare.com/ndownloader/files/18130628), training scripts (python), UMAP, GMM, and SVR models with training splits results are within the SMRT_exp.zip file.

CSVs for feature importance for SVR models are standalone files.
n
Acoustic features as a tool to visualize and explore marine soundscapes:...
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Feb 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.3bk3j9kn8
Dataset updated
Feb 15, 2024
Dataset provided by
Memorial University of Newfoundland
University of Parma
Fisheries and Oceans Canada
Authors
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches, and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range, however, low and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication. Methods Data acquisition and preparation We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts'' page. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low frequency, or very high frequency were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted in 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1). The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st 2029. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings. Acoustic feature extraction The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolution: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area. UMAP ordination and visualization UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with duration of ~ 1 second (WMD dataset), ~ 5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared the umap-learn package for Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
Labelling sound sources The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from (Fig 1) an oceanographic buy located in proximity of the recorder. We choose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021) Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings. We reserved the recordings collected in August to test the precision of the final predictive model. Label prediction performance We used Balanced Random Forest models (BRF) provided in the imbalanced-learn python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We choose BRF as the algorithm as it is suited for datasets characterized by class imbalance. The BRF algorithm performs under sampling of the majority class prior to prediction, allowing to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models though a nested k-fold cross validation approach with ten-folds in the outer loop, and five-folds in the inner loop. We selected nested cross validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
f
Material for manifold learning techniques comparison on benchmark dataset
springernature.figshare.com
application/x-gzip
Updated Jul 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elodie Laine; Valentin Lombard; Sergei Grudinin (2024). Material for manifold learning techniques comparison on benchmark dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25112459.v1
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25112459.v1
Dataset updated
Jul 5, 2024
Dataset provided by
figshare
Authors
Elodie Laine; Valentin Lombard; Sergei Grudinin
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This archive contains the restricted 10 ensemble benchmark and the scripts used in the manifold learning techniques assessment. Files related to an ensemble are prefixed with ID1_ID2_, where ID1 is the first member in alphabetical order, and ID2 is the reference for the structural alignment.

The archive includes the following for each member of the benchmark: A _mm.pdb file containing the ensemble's conformations. A _aln.fa file, which is the multiple sequence alignment of the ensemble. A _rmsd.txt file with the all pairwise root mean squared deviation (RMSD) of the ensemble. A _raw_coords_ca.bin file with the raw coordinates in binary format. A _raw_coords_ca_mask.bin file with the binary format gap coordinates. A _features_pca.csv file detailing the positions of each sample in the ensemble's principal component space. A _dist_to_hull.csv file with the ID of each ensemble member, their label in the clustering in the PC space, and the squared distance of this sample to the convex hull formed by members of the other clusters. A _pca_errors.csv file containing the same information as the _dist_to_hull.csv file, but with the addition of the PCA reconstruction error, measured as the RMSD between the predicted and ground truth structures. The prediction of a sample is done by fitting the PCA to all clusters except the one being evaluated. Three _XXX_kcpa_errors.json files with the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA at different sigma and alpha parameters from the grid search. The XXX indicates the kernel used. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated. A _umap_errors.json file with the UMAP reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using UMAP at different n_neigh and min_dist parameters from the grid search. The prediction of a sample is done by fitting the UMAP to all clusters except the one being evaluated. UMAP could be run only on a subset of the ensembles. A _rbf_kpca_default_sigma.json file containing the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA with RBF kernel at the default alpha and sigma parameters. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated. A _rbf_kpca_errors_real.json file with the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA with RBF kernel with a predicted optimal sigma parameter and alpha parameters of 1.0, 1e-5, and 1e-6. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated. The scripts used to generate the convex hull and for the PCA-kPCA comparison are as follows: dist_to_hull.py computes the coordinates in the PC space of each member, divides the members into clusters, and computes the distance of each member to the convex hull formed by members of the other clusters in the PC space. This script uses polytope_app.cpp with a Python binding to compute the squared distance of each member to the convex hull. polytope_module.so is the compiled C++ module called by the Python script. interpol_apase.py computes the interpolation in the ATPase latent space, and outputs the .pdb files of the trajectories. pca_kpca.py calculates the reconstruction error for both PCA, kPCA, and UMAP for each ensemble member by fitting the PCA, kPCA, or UMAP to all members of other clusters, excluding the cluster of the member currently being evaluated. A procheck folder containing summary tables of the procheck analysis on original and reconstructed structures. The stats.csv file contains descriptive information about the benchmark. Please consult the related documentation to understand the meaning of each column in this file.
u
Data from: Supplementary Material for the "Virtual Screening of Gold(I)...
portalcientifico.uvigo.gal
Updated 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiriakidi, Sofia; Kiriakidi, Sofia (2025). Supplementary Material for the "Virtual Screening of Gold(I) Buchwald-type Catalysts for the Intramolecular Cyclization of α-Aminotropones" [Dataset]. https://portalcientifico.uvigo.gal/documentos/688b603017bb6239d2d497a4
Explore at:
Dataset updated
2025
Authors
Kiriakidi, Sofia; Kiriakidi, Sofia
Description
Supplementary material for the research paper with title: "Virtual Screening of Gold(I) Buchwald-type Catalysts for the Intramolecular Cyclization of α-Aminotropones"

derivatives.rar: contains Gaussian output files for the 129 Buchwald-type derivatives tested in this work

data_initial.csv: initial dataset used for ML

data_final.csv: final dataset used for ML

PCA_UMAP.py: python script for the PCA and UMAP dimensionality reduction analyses

Deepchem-Regression-initial.ipynb: Jupyter notebook with the python code used for the XGB ML model within the Deepchem framework on the initial dataset

Deepchem-Regression-final.ipynb: Jupyter notebook with the python code used for the XGB ML model within the Deepchem framework on the final dataset

INITIAL-Robert-ML.rar: Robert Report and all the files generated by the ROBERT analysis on the initial dataset

FINAL-Robert-ML.rar: Robert Report and all the files generated by the ROBERT analysis on the final dataset

Dataset on Bibliographic, Textual, and Embedding Data for General Relativity...

zenodo.org

bin

Updated Dec 31, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Raphael Schlattmann; Raphael Schlattmann (2024). Dataset on Bibliographic, Textual, and Embedding Data for General Relativity and Gravitation Publications (1911–2000) [Dataset]. http://doi.org/10.5281/zenodo.14581503

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14581503

Dataset updated

Dec 31, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Raphael Schlattmann; Raphael Schlattmann

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Dataset Overview

This dataset supplements the paper “Trajectories of Change: Approaches for Tracking Knowledge Evolution,” currently under review. It includes bibliographic, textual, and embedding data for 180,785 publications in General Relativity and Gravitation (GRG), spanning 1911 to 2000 and is based on the NASA/ADS. The file is in Parquet format with 33 columns.

Usage

The dataset is directly compatible with the UnigramKLD and EmbeddingDensities classes of the semanticlayertools Python package.

Data Structure

Column	Format	Description	Example
Bibcode	string	Unique publication identifier.	`"1995PASP..107..803U"`
Author	string	Authors listed as comma-separated names.	`"Urry CM, Padovani P"`
Title	string	Title of the publication.	`"Unified Schemes for Radio-Loud Active Galactic Nuclei"`
Title_en	string	Title translated into English.	`"Unified Schemes for Radio-Loud Active Galactic Nuclei"`
Year	integer	Year of publication.	`1995`
Journal	string	Journal name.	`"Publications of the Astronomical Society of the Pacific"`
Journal Abbreviation	string	Abbreviated journal name.	`"PASP"`
Volume	string	Volume number (if applicable).	`"107"`
Issue	string	Issue number (if applicable).	`"19"`
First Page	string	Starting page.	`"803"`
Last Page	string	Ending page.	`"25"`
Abstract	string	Abstract text.	`"The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."`
Abstract_en	string	Abstract translated into English.	`"The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."`
Keywords	string	Comma-separated keywords.	`"galaxies: active, galaxies: fundamental parameters, astrophysics"`
DOI	string	Digital Object Identifier.	`"10.1086/133630"`
Affiliation	string	Author affiliations.	`"AA(University of XYZ), AB(-)"`
Category	string	Publication type (e.g., article, book).	`"article"`
Citation Count	float	Number of citations.	`4380.0`
References	array of strings	List of cited Bibcodes.	`["1966Natur.209..751H", "1966Natur.211..468R", "1968ApJ...151..393S"]`
PDF_URL	string	Link to the publication PDF.	`"https://ui.adsabs.harvard.edu/link_gateway/1995PASP..107..803U/ADS_PDF"`
Title_lang	string	Language of the title.	`"en"`
Abstract_lang	string	Language of the abstract.	`"en"`
full_text	string	Full text of the publication (where available).	`"Unified Schemes for Radio-Loud Active Galactic Nuclei. The appearance of AGN depends so strongly on..."`
tokens	array of strings	Tokenized text of the title and abstract for computational analysis.	`["unify", "schemes", "radio", "loud", "active", "galactic", "nuclei"]`
UMAP-1	float32	UMAP embedding coordinate 1.	`10.423940`
UMAP-2	float32	UMAP embedding coordinate 2.	`7.890975`
Cluster	integer	Cluster label for topic modeling or grouping.	`15`
Name	string	Descriptive cluster name.	`"15_radio_quasars_sources_galaxies"`
KeyBERT	string	Key phrases extracted via KeyBERT.	`"radio galaxies, high redshift, radio sources, optical imaging"`
OpenAI	string	Embedding-based descriptive phrases.	`"Cosmological Evolution of Radio-Loud Quasars"`
MMR	string	Extracted key phrases using Maximal Marginal Relevance (MMR).	`"quasars, radio sources, redshift, luminosity, star formation"`
POS	string	Key terms extracted via part-of-speech tagging.	`"radio, quasars, sources, galaxies, redshift, optical"`
full_embeddings	array of floats	Text embeddings generated using OpenAI's text-embedding-3-large model.	`"[ 0.01164897 -0.00343577 -0.03168862 ... 0.00237622]"`

Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Armen Beck (2024). Data, Models, and Python Code For: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals [Dataset]. http://doi.org/10.17632/x925rnxzcb.1

Data, Models, and Python Code For: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals

Explore at:

Unique identifier

https://doi.org/10.17632/x925rnxzcb.1

Dataset updated

May 31, 2024

Authors

Armen Beck

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Training data, models, and python code for the manuscript: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals.

MOE descriptors of the METLIN SMRT dataset (original by Domingo-Almenara et. al. with RTs and structures available at: https://figshare.com/ndownloader/files/18130628), training scripts (python), UMAP, GMM, and SVR models with training splits results are within the SMRT_exp.zip file.

CSVs for feature importance for SVR models are standalone files.

Clear search

Close search

Google apps

Main menu

Data, Models, and Python Code For: Machine learning models and performance...

Acoustic features as a tool to visualize and explore marine soundscapes:...

Material for manifold learning techniques comparison on benchmark dataset

Data from: Supplementary Material for the "Virtual Screening of Gold(I)...

Dataset on Bibliographic, Textual, and Embedding Data for General Relativity...

Dataset Overview

Usage

Data Structure

Data, Models, and Python Code For: Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals