The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference.
The semantic categories of Places are defined by their function: the labels represent the entry-level categories of an environment. To illustrate, the dataset has different categories of bedrooms, streets, etc., because one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, a hotel bedroom, or a nursery. In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. Using convolutional neural networks (CNNs), the Places dataset allows learning of deep scene features for various scene recognition tasks, with the goal of establishing new state-of-the-art performance on scene-centric benchmarks.
Here we provide the Places Database and the trained CNNs for academic research and education purposes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('placesfull', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/placesfull-1.0.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip file contains three folders: "Data for Illapel earthquake", "HVCE and ABIC method implement on matlab" and "GDED method implement on tensorflow". Taking simulation experiment 1.2 and the actual Illapel earthquake as examples, the code for the GDED method is placed in the "GDED method implement on tensorflow" folder, and the code for the ABIC and HVCE methods is placed in the "HVCE and ABIC method implement on matlab" folder.
In the folder "HVCE and ABIC method implement on matlab", the purpose of each script is as follows:
ABIC_SIM.m: slip distribution inversion with the relative weight ratios determined by the ABIC method, for the simulation experiments and the Illapel earthquake.
HVCE.m: slip distribution inversion with the relative weight ratios determined by the HVCE method, for the simulation experiments and the Illapel earthquake.
GDED.m: slip distribution inversion with the relative weight ratios determined by the GDED method, for the simulation experiments and the Illapel earthquake (the relative weight ratios come from "GDED method implement on tensorflow").
savedata.m: saves matrices and data for the GDED method implemented in TensorFlow.
In the folder "GDED method implement on tensorflow", the purpose of each script is as follows:
joint_inver_tensor_ex_1.0(1.1).py: determines the relative weight ratios by the GDED method with (without) plotting figures, implemented on the TensorFlow platform.
The InSAR and GPS data of the Illapel earthquake are placed in the folder "Data for Illapel earthquake" (GPS_ori.txt and InSAR_ori.txt).
The Drug Cardiotoxicity dataset [1-2] is a molecule classification task to detect cardiotoxicity caused by binding to the hERG target, a protein associated with heart rhythm. The data covers over 9,000 molecules with hERG activity.
Note:
The data is split into four splits: train, test-iid, test-ood1, test-ood2.
Each molecule in the dataset has 2D graph annotations, which are designed to facilitate graph neural network modeling. Nodes are the atoms of the molecule and edges are the bonds. Each atom is represented as a vector encoding basic atom information such as atom type; similar logic applies to bonds.
We include the Tanimoto fingerprint distance (to the training data) for each molecule in the test sets to facilitate research on distributional shift in the graph domain.
For each example, the features include:
atoms: a 2D tensor with shape (60, 27) storing node features. Molecules with fewer than 60 atoms are padded with zeros. Each atom has 27 atom features.
pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has 12 edge features.
atom_mask: a 1D tensor with shape (60,) storing node masks. 1 indicates the corresponding atom is real; otherwise it is a padded one.
pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates the corresponding edge is real; otherwise it is a padded one.
active: a one-hot vector indicating whether the molecule is toxic. [0, 1] indicates toxic, [1, 0] non-toxic.
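As an illustration, here is a minimal sketch (using the feature names above; not part of the official dataset tooling) that loads one example and inspects the padding mask and the label:
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('cardiotox', split='train')
for ex in ds.take(1):
  n_real_atoms = tf.reduce_sum(ex['atom_mask'])  # number of non-padded atoms
  is_toxic = tf.argmax(ex['active']) == 1        # one-hot [0, 1] means toxic
  print(int(n_real_atoms), bool(is_toxic))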
[1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM, 2020. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00884
[2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under Distributional Shift. NeurIPS DistShift Workshop 2021. https://arxiv.org/abs/2111.12951
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cardiotox', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Data for my Yolo v3 Object Detection in Tensorflow kernel.
Contains sample images, fonts, class names and weights.
This dataset contains ILSVRC-2012 (ImageNet) validation images augmented with a new set of "Re-Assessed" (ReaL) labels from the "Are we done with ImageNet" paper, see https://arxiv.org/abs/2006.07159. These labels are collected using the enhanced protocol, resulting in multi-label and more accurate annotations.
Important note: about 3,500 examples contain no label; these should be excluded from the averaging when computing the accuracy. One possible way of doing this is with the following NumPy code:
import numpy as np

# Keep only examples whose ReaL label set is non-empty; a prediction is
# correct if it appears in the example's label set.
is_correct = [pred in real_labels[i] for i, pred in enumerate(predictions) if real_labels[i]]
real_accuracy = np.mean(is_correct)
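For context, a hedged end-to-end sketch of the same computation: it assumes the TFDS features are named 'image' and 'real_label' (the ReaL labels annotate the ILSVRC-2012 validation images, hence the 'validation' split) and a hypothetical classifier my_model_predict:
import numpy as np
import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012_real', split='validation')  # split name assumed
predictions, real_labels = [], []
for ex in tfds.as_numpy(ds):
  real_labels.append(list(ex['real_label']))         # may be empty (~3,500 cases)
  predictions.append(my_model_predict(ex['image']))  # hypothetical classifier

is_correct = [p in real_labels[i] for i, p in enumerate(predictions) if real_labels[i]]
real_accuracy = np.mean(is_correct)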
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012_real', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_real-1.0.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenSim is an open-source biomechanical package with a variety of applications. It is available to many users through bindings in MATLAB, Python, and Java via its application programming interfaces (APIs). Although the developers have documented OpenSim installation well for different operating systems (Windows, Mac, and Linux), installation is time-consuming and complex since each operating system requires a different configuration. This project aims to demystify the development of neuro-musculoskeletal modeling in OpenSim with zero installation configuration on any operating system (thus cross-platform), making it easy to share models while accessing free graphical processing units (GPUs) on the web-based Google Colab platform. To achieve this, OpenColab was developed: the OpenSim source code was used to build a Conda package that can be installed on Google Colab with a single block of code in less than 7 minutes. To use OpenColab, one only needs an internet connection and a Gmail account. Moreover, OpenColab can access the vast libraries of machine learning methods available within free Google products, e.g. TensorFlow. Next, we performed an inverse problem in biomechanics and compared OpenColab results with the OpenSim graphical user interface (GUI) for validation. The outcomes of OpenColab and the GUI matched well (r ≥ 0.82). OpenColab takes advantage of the zero configuration of cloud-based platforms, accesses GPUs, and enables users to share and reproduce modeling approaches for further validation, innovative online training, and research applications. Step-by-step installation processes and examples are available at: https://simtk.org/projects/opencolab.
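The general "conda package on Colab" pattern looks roughly like the sketch below. This is an assumption-laden illustration (the condacolab helper and the opensim-org channel name are my assumptions, not taken from the project); the project's own notebooks at https://simtk.org/projects/opencolab are authoritative:
!pip install -q condacolab          # Colab shell syntax
import condacolab
condacolab.install()                # installs conda; restarts the runtime once

!conda install -y -c opensim-org opensim   # channel/package names are assumptions

import opensim
print(opensim.GetVersion())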
TensorFlow reimplementation of the Swin Transformer model, based on the official PyTorch implementation.
![image](https://user-images.githubusercontent.com/24825165/121768619-038e6d80-cb9a-11eb-8cb7-daa827e7772b.png)
tensorflow >= 2.4.1
ImageNet-1K and ImageNet-22K Pretrained Checkpoints
| name | pretrain | resolution | acc@1 | #params | model |
| :---: | :---: | :---: | :---: | :---: | :---: |
| swin_tiny_224 | ImageNet-1K | 224x224 | 81.2 | 28M | github |
| swin_small_224 | ImageNet-1K | 224x224 | 83.2 | 50M | github |
| swin_base_224 | ImageNet-22K | 224x224 | 85.2 | 88M | github |
| swin_base_384 | ImageNet-22K | 384x384 | 86.4 | 88M | github |
| swin_large_224 | ImageNet-22K | 224x224 | 86.3 | 197M | github |
| swin_large_384 | ImageNet-22K | 384x384 | 87.3 | 197M | github |
Initializing the model:
```python
from swintransformer import SwinTransformer

model = SwinTransformer('swin_tiny_224', num_classes=1000, include_top=True, pretrained=False)
```
You can use a pretrained model like this:
```python
import tensorflow as tf
from swintransformer import SwinTransformer

# IMAGE_SIZE and NUM_CLASSES are user-defined for the target task.
model = tf.keras.Sequential([
  tf.keras.layers.Lambda(lambda data: tf.keras.applications.imagenet_utils.preprocess_input(tf.cast(data, tf.float32), mode="torch"), input_shape=[*IMAGE_SIZE, 3]),
  SwinTransformer('swin_tiny_224', include_top=False, pretrained=True),
  tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
```
If you use a pretrained model with TPU on kaggle, specify `use_tpu` option:
```python
import tensorflow as tf
from swintransformer import SwinTransformer

# Same as above, but with use_tpu=True for TPU execution on Kaggle.
model = tf.keras.Sequential([
  tf.keras.layers.Lambda(lambda data: tf.keras.applications.imagenet_utils.preprocess_input(tf.cast(data, tf.float32), mode="torch"), input_shape=[*IMAGE_SIZE, 3]),
  SwinTransformer('swin_tiny_224', include_top=False, pretrained=True, use_tpu=True),
  tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
```
Example: TPU training on Kaggle
@article{liu2021Swin,
title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
journal={arXiv preprint arXiv:2103.14030},
year={2021}
}
Machine learning can be as good as maximum likelihood when reconstructing phylogenetic topologies and determining the best evolutionary model on four-taxon alignments. Phylogenetic tree reconstruction with molecular data is important in many fields of life science research. The gold standard in this discipline is the maximum likelihood tree reconstruction method. Here we show that for quartet trees, machine learning using neural networks can be as good as the maximum likelihood method at inferring the best tree topology and the best model of sequence evolution for nucleotide as well as amino acid sequences. For this purpose we simulated data sets for a wide range of branch lengths, evolutionary models, and model parameters, and compared the topologies and inferred models obtained with machine learning to those obtained with the maximum likelihood and neighbour joining methods. Our results show that neural networks are a promising avenue for determining relatedness between taxa, which is ...

This archive is part of the DeepNNPhylogeny project, for which the code of the software is available on GitHub. It contains pre-trained neural networks to predict (a) the best models of sequence evolution and (b) the best quartet tree topologies for alignments of four nucleotide or amino acid sequences. For each use case, six neural networks with different architectures have been trained and saved for further usage with the Python library TensorFlow. The neural networks have been saved with the tf.keras.Model.save function in the so-called TensorFlow SavedModel format. All neural networks have been trained with a large number of alignments simulated with the software PolyMoSim v1.1.4, which is available on GitHub. For each simulated data set, model parameters (including proportion of invariant sites, shape parameter of the gamma distribution for site heterogeneity, transition/transversion ratio - if applicable, nucleotide base frequencies - if applicable, relative substitution ...

In this project, neural networks have been trained to: predict/classify the correct topology for four nucleotide or amino acid sequences that evolved on a quartet tree, and predict the best model of sequence evolution for four nucleotide or amino acid sequences that evolved on a quartet tree. Together with the software in the DeepNNPhylogeny project, the pre-trained neural networks can be used for the model and topology classification tasks. The GitHub repository DeepNNPhylogeny contains the software with which a) the neural networks presented here have been trained and new neural networks can be trained, and b) predictions can be made using the pre-trained neural networks available in this archive. They can predict the best evolutionary model and best topology for alignments of four nucleotide or amino acid sequences with an accuracy close or identical to the maximum likelihood method. The neural networks stored in this repository...
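Since the networks are stored in the TensorFlow SavedModel format, they can be restored directly with Keras. A minimal sketch (the directory name is illustrative, not an actual file from the archive):
import tensorflow as tf

# Load one of the pre-trained networks saved in SavedModel format.
model = tf.keras.models.load_model('topology_classifier_savedmodel')  # illustrative path
model.summary()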
This is a pose estimation dataset, consisting of symmetric 3D shapes where multiple orientations are visually indistinguishable. The challenge is to predict all equivalent orientations when only one orientation is paired with each image during training (as is the scenario for most pose estimation datasets). In contrast to most pose estimation datasets, the full set of equivalent orientations is available for evaluation.
There are eight shapes total, each rendered from 50,000 viewpoints distributed uniformly at random over the full space of 3D rotations. Five of the shapes are featureless -- tetrahedron, cube, icosahedron, cone, and cylinder. Of those, the three Platonic solids (tetrahedron, cube, icosahedron) are annotated with their 12-, 24-, and 60-fold discrete symmetries, respectively. The cone and cylinder are annotated with their continuous symmetries discretized at 1 degree intervals. These symmetries are provided for evaluation; the intended supervision is only a single rotation with each image.
The remaining three shapes are marked with a distinguishing feature. There is a tetrahedron with one red-colored face, a cylinder with an off-center dot, and a sphere with an X capped by a dot. Whether or not the distinguishing feature is visible, the space of possible orientations is reduced. We do not provide the set of equivalent rotations for these shapes.
Each example contains:
a shape index, so that the dataset may be filtered by shape (the indices presumably follow the order in which the shapes are described above);
the rotation used in the rendering process, represented as a 3x3 rotation matrix;
the set of known equivalent rotations under symmetry, for evaluation (in the case of the three marked shapes, this is only the rendering rotation); an evaluation sketch using these sets follows below.
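For evaluation, a natural metric is the minimum geodesic distance between a predicted rotation and any rotation in the example's equivalence set. A minimal NumPy sketch (not part of the dataset tooling):
import numpy as np

def min_angular_error(R_pred, equivalent_rotations):
  """Smallest geodesic angle (radians) between a predicted 3x3 rotation
  matrix and any symmetry-equivalent ground-truth rotation."""
  errors = []
  for R_gt in equivalent_rotations:
    R_rel = R_pred.T @ R_gt                     # relative rotation
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0   # rotation-angle identity
    errors.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
  return min(errors)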
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('symmetric_solids', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/symmetric_solids-1.0.0.png
We present a novel and efficient computing framework for segmenting overlapping nuclei by combining marker-controlled watershed with our proposed convolutional neural network (DIMAN). We implemented our method based on the open-source machine learning framework TensorFlow and the deep learning and reinforcement learning library TensorLayer. This repository contains all code used in our experiments, including data preparation, model construction, model training and result evaluation. For comparison with our method, we also used TensorFlow and TensorLayer to reimplement four known semantic segmentation convolutional neural networks: FCN8s, U-Net, HED and SharpMask. Besides this, we also compare our method with four published state-of-the-art methods.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the dataset that accompanies the paper titled "A Dual-Frequency Radar Retrieval of Snowfall Properties Using a Neural Network", submitted for peer review in August 2020. Please see the GitHub repository for the most up-to-date data after the revision process: https://github.com/dopplerchase/Chase_et_al_2021_NN
Authors: Randy J. Chase, Stephen W. Nesbitt and Greg M. McFarquhar. Corresponding author: Randy J. Chase (randyjc2@illinois.edu). Here we have the data used in the manuscript. Please email me if you have specific questions about units etc.
1) DDA/GMM database of scattering properties: base_df_DDA.csv. This is the combined dataset from the following papers: Leinonen & Moisseev, 2015; Leinonen & Szyrmer, 2015; Lu et al., 2016; Kuo et al., 2016; Eriksson et al., 2018. The column names are D: maximum dimension in meters; M: particle mass in grams; sigma_ku: backscatter cross-section at Ku band in m^2; sigma_ka: backscatter cross-section at Ka band in m^2; sigma_w: backscatter cross-section at W band in m^2. The first column is just an index column.
2) Synthetic data used to train and test the neural network: Unrimed_simulation_wholespecturm_train_V2.nc, Unrimed_simulation_wholespecturm_test_V2.nc. These result from randomly combining the PSDs and DDA/GMM particles to build the training and test datasets.
3) Notebook for training the network using the synthetic database and Google Colab (TensorFlow): Train_Neural_Network_Chase2020.ipynb. This is the notebook used to train the neural network.
4) Trained TensorFlow neural network: NN_6by8.h5. This is the HDF5 TensorFlow model that resulted from the training. You will need this to run the retrieval.
5) Scalers needed to apply the neural network: scaler_X_V2.pkl, scaler_y_V2.pkl. These are the sklearn scalers used in training the neural network. You will need these to scale your data if you wish to run the retrieval.
6) New in this version: an example notebook showing how to run the trained neural network on Ku- and Ka-band observations, demonstrated on the third case in the paper: Run_Chase2021_NN.ipynb
7) New in this version: APR data used to show how to run the neural network retrieval: Chase_2021_NN_APR03Dec2015.nc
The data for the analysis on the observations are not provided here because of the size of the radar data. Please see the GHRC website (https://ghrc.nsstc.nasa.gov/home/) if you wish to download the radar and in-situ data, or contact me and we can coordinate transferring the exact data files used. The GPM-DPR data are available here: http://dx.doi.org/10.5067/GPM/DPR/GPM/2A/05
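To apply the retrieval, the trained model (4) and the scalers (5) are used together. A minimal sketch, assuming inputs X arranged as the training features (the exact feature layout is not specified here; Run_Chase2021_NN.ipynb is the authoritative usage example):
import pickle
import tensorflow as tf

model = tf.keras.models.load_model('NN_6by8.h5')
with open('scaler_X_V2.pkl', 'rb') as f:
  scaler_X = pickle.load(f)
with open('scaler_y_V2.pkl', 'rb') as f:
  scaler_y = pickle.load(f)

# X: (n_samples, n_features) array of radar observables; layout is an assumption.
# preds = scaler_y.inverse_transform(model.predict(scaler_X.transform(X)))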
Introduction
This dataset contains the data described in the paper "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients" by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who generated and shared these data (please see below the relevant links and publications).
Content
File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag indicating whether they contain raw data ("raw", which means, in this case, the raw topological features) or TensorFlow-formatted data ("TF"). This tag is followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations, such as which topological features are present and which clinical outcome is considered. The code associated with the same manuscript that uses these data is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow-ready data is described in the paper.
File format
All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes. The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample, corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow-ready files do not contain the sample identifiers in the first column; however, they contain two extra columns at the end. The first extra column is the sample weights (for the classifiers, because we very often have a dominant class). The second extra column is the class labels (binary), based on the clinical outcome of interest.
Dataset details
The Fischer dataset is used to train, evaluate and validate the models, so the dataset is split into train / eval / valid files, which contain respectively 249, 125 and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file). The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation using the Maris and Versteeg datasets is only done for a single configuration, and there are therefore fewer files. For Fischer, a few configurations are listed in the global configuration file but have no corresponding raw data; this is because these items are derived from concatenations of the original raw data (see the global configuration file and manuscript for details).
References
This dataset is associated with Tranchevent L., Azuaje F., Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.
If you use these data in your research, please do not forget to also cite the researchers who generated the original expression datasets.
Fischer dataset: Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1; Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926-932 (2014). doi:10.1038/nbt.3001
Versteeg dataset: Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589-593 (2012). doi:10.1038/nature10910
Maris dataset: Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050-6062 (2006). doi:10.1158/0008-5472.CAN-05-4618
Project supported by the Fonds National de la Recherche (FNR), Luxembourg (SINGALUN project). This research was also partially supported by Tier-2 grant MOE2016-T2-1-029 from the Ministry of Education, Singapore.
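Returning to the file format described above: a minimal sketch for loading a TensorFlow-ready TSV file, where the last two columns are the sample weights and the binary class labels (the file name here is illustrative; actual names follow the dataset/tag/configuration pattern described):
import pandas as pd

df = pd.read_csv('Fischer_TF_2d0000_data.tsv', sep='\t', header=None)  # illustrative name
X = df.iloc[:, :-2].values  # topological features
w = df.iloc[:, -2].values   # per-sample weights (to handle the dominant class)
y = df.iloc[:, -1].values   # binary labels for the clinical outcome of interest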
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Forecasting the weather in an area characterized by erratic weather patterns and unpredictable climate change is a challenging endeavour. The weather is classified as a non-linear system since it is influenced by various factors that contribute to climate change, such as humidity, average temperature, sea level pressure, and rainfall. A reliable forecasting system is crucial in several industries, including transportation, agriculture, tourism, and development. This study showcases the effectiveness of data mining, meteorological analysis, and machine learning techniques such as RNN-LSTM, TensorFlow Decision Forests (TFDF), and model stacking (including ElasticNet, GradientBoost, KRR, and Lasso) in improving the precision and dependability of weather forecasting. The stacking model strategy entails aggregating multiple base models into a meta-model to address issues of overfitting and underfitting, hence improving the accuracy of the prediction model. To carry out the study, a comprehensive 60-year meteorological record from Bangladesh was gathered, encompassing data on rainfall, humidity, average temperature, and sea level pressure. The results of this study suggest that the stacking average model outperforms the TFDF and RNN-LSTM models in predicting average temperature. The stacking average model achieves an RMSLE of 1.3002, a 10.906% improvement over the TFDF model. It is worth noting that the TFDF model had previously outperformed the RNN-LSTM model. The performance of the individual stacking models is not as impressive as that of the average model, with the validation results being better for TFDF.
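The stacking strategy described above can be illustrated with scikit-learn. This is a hedged sketch of the general pattern only (base-model hyperparameters and the meta-model choice are assumptions, not the authors' exact configuration):
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso

# Base models feed their predictions to a meta-model (final_estimator).
stack = StackingRegressor(
  estimators=[
    ('enet', ElasticNet()),
    ('gboost', GradientBoostingRegressor()),
    ('krr', KernelRidge()),
    ('lasso', Lasso()),
  ],
  final_estimator=ElasticNet(),  # meta-model choice is an assumption
)
# stack.fit(X_train, y_train); y_pred = stack.predict(X_valid)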
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the exploitation of three-dimensional (3D) data in deep learning has gained momentum despite its inherent challenges. The necessity of 3D approaches arises from the limitations of two-dimensional (2D) techniques when applied to 3D data due to the lack of global context. A critical task in medical and microscopy 3D image analysis is instance segmentation, which is inherently complex due to the need to accurately identify and segment multiple object instances in an image. Here, we introduce a 3D adaptation of the Mask R-CNN, a powerful end-to-end network designed for instance segmentation. Our implementation adapts a widely used 2D TensorFlow Mask R-CNN by developing custom TensorFlow operations for 3D Non-Max Suppression and 3D Crop And Resize, facilitating efficient training and inference on 3D data. We validate our 3D Mask R-CNN in two experiments. The first uses a controlled environment of synthetic data with instances exhibiting a wide range of anisotropy and noise; our model achieves good results while illustrating the limits of the 3D Mask R-CNN on the noisiest objects. Second, applying it to real-world data involving cell instance segmentation during the morphogenesis of the ascidian embryo Phallusia mammillata, we show that our 3D Mask R-CNN outperforms the state-of-the-art method, achieving high recall and precision scores. The model preserves cell connectivity, which is crucial for applications in quantitative studies. Our implementation is open source, ensuring reproducibility and facilitating further research in 3D deep learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only...
Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].
Reference
[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is built for time-series Sentinel-2 cloud detection and stored in the TensorFlow TFRecord format (see https://www.tensorflow.org/tutorials/load_data/tfrecord).
Each file is compressed in 7z format and can be decompressed using Bandizip or 7-Zip.
Dataset Structure:
Each filename can be split into three parts using underscores: the first part indicates whether the file is designated for training or validation ('train' or 'val'); the second part is the Sentinel-2 tile name; and the last part is the number of samples in the file.
Each sample includes:
Sample ID;
Array of time-series 4-band image patches at 10 m resolution, shaped as (n_timestamps, 4, 42, 42);
Label list indicating the cloud cover status for the center 6x6 pixels at each timestamp;
Ordinal list for each timestamp;
Sample weight list (reserved);
Here is a demonstration function for parsing the TFRecord file:
import tensorflow as tf
def parseRecordDirect(fname):
  sep = '/'
  parts = tf.strings.split(fname, sep)
  tn = tf.strings.split(parts[-1], sep='_')[-2]
  nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
  t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
  t1 = tf.data.TFRecordDataset(fname)
  ds = tf.data.Dataset.zip((t, t1))
  return ds
keys_to_features_direct = {
  'localid': tf.io.FixedLenFeature([], tf.int64, -1),
  'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
  'labels': tf.io.FixedLenFeature((), tf.string, ''),
  'dates': tf.io.FixedLenFeature((), tf.string, ''),
  'weights': tf.io.FixedLenFeature((), tf.string, '')
}
class SeriesClassificationDirectDecorder(decoder.Decoder):
  """A tf.Example decoder for tfds classification datasets."""

  def __init__(self) -> None:
    super().__init__()

  def decode(self, tid, ds):
    parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
    encoded = parsed['image_raw_ldseries']
    labels_encoded = parsed['labels']
    decoded = tf.io.decode_raw(encoded, tf.uint16)
    label = tf.io.decode_raw(labels_encoded, tf.int8)
    dates = tf.io.decode_raw(parsed['dates'], tf.int64)
    weight = tf.io.decode_raw(parsed['weights'], tf.float32)
    decoded = tf.reshape(decoded, [-1, 4, 42, 42])
    sample_dict = {
      'tid': tid,                    # tile ID
      'dates': dates,                # date list
      'localid': parsed['localid'],  # sample ID
      'imgs': decoded,               # image array
      'labels': label,               # label list
      'weights': weight
    }
    return sample_dict
def preprocessDirect(tid, record):
  parsed = tf.io.parse_single_example(record, keys_to_features_direct)
  encoded = parsed['image_raw_ldseries']
  labels_encoded = parsed['labels']
  decoded = tf.io.decode_raw(encoded, tf.uint16)
  label = tf.io.decode_raw(labels_encoded, tf.int8)
  dates = tf.io.decode_raw(parsed['dates'], tf.int64)
  weight = tf.io.decode_raw(parsed['weights'], tf.float32)
  decoded = tf.reshape(decoded, [-1, 4, 42, 42])
  return tid, dates, parsed['localid'], decoded, label, weight
t1 = parseRecordDirect('filename here')
dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)
Class Definition:
0: clear
1: opaque cloud
2: thin cloud
3: haze
4: cloud shadow
5: snow
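For convenience, the same mapping as a Python dict (derived directly from the list above):
CLOUD_CLASSES = {
  0: 'clear',
  1: 'opaque cloud',
  2: 'thin cloud',
  3: 'haze',
  4: 'cloud shadow',
  5: 'snow'
}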
Dataset Construction:
First, we randomly generate 500 points for each tile; all points are aligned to the pixel-grid centers of the 60 m resolution subdatasets (e.g. B10) for consistency when comparing with other products, since other cloud detection methods may use the cirrus band, which has 60 m resolution, as a feature.
Then, time series image patches of two sizes are cropped with each point as the center. Patches of shape 42x42 are cropped from the 10 m resolution bands (B2, B3, B4, B8) and are used to construct this dataset. Patches of shape 348x348 are cropped from the True Colour Image (TCI; see the Sentinel-2 User Guide for details) and are used for interpreting the class labels.
Samples with a large number of timestamps can be time-consuming at the I/O stage, so the time series patches are divided into groups of at most 100 timestamps each.
This data collection was created for quick and easy application of machine learning. All images and labels are numeric arrays with the same data types and shapes. As with the original data, the collection is free for noncommercial and nongovernmental use.
1) DogBreedImages.h5 (3.77 GB). Origin: Homepage & Source code. Images (float32, 128x128 pixels, 3 color channels) and labels (int32, 120 classes): 12,000 for training & 8,580 for testing.
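These .h5 collections can be loaded with h5py; a minimal sketch, assuming the archive stores train/test arrays under names like 'train_images' (the actual key names may differ, so inspect f.keys() first):
import h5py

with h5py.File('DogBreedImages.h5', 'r') as f:
  print(list(f.keys()))            # discover the stored array names
  images = f['train_images'][:]    # hypothetical key: float32, (N, 128, 128, 3)
  labels = f['train_labels'][:]    # hypothetical key: int32, one of 120 classes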
This catalog really impressed me => TensorFlow Datasets
The goal is to discover the capabilities of algorithms in the recognition of biological objects based on identically formatted data.
This data collection was created for quick and easy application of machine learning. All images and labels are numeric arrays with the same data types and shapes. As with the original data, the collection is free for noncommercial and nongovernmental use.
1) Images of Biospecies 2
2) TfFlowerImages.h5 (688.14 MB). Origin: Homepage & Source code. Images (float32, 128x128 pixels, 3 color channels) and labels (int32, 5 classes): 3,303 for training & 367 for testing.
This catalog really impressed me => TensorFlow Datasets
The goal is to discover the capabilities of algorithms in the recognition of biological objects based on identically formatted data.