# torchtree: flexible phylogenetic model development and inference using PyTorch
Mathieu Fourment, Matthew Macaulay, Christiaan J Swanepoel, Xiang Ji, Marc A Suchard, Frederick A Matsen IV. torchtree: flexible phylogenetic model development and inference using PyTorch. arXiv:2406.18044 (2024)
Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Python that allows developers to easily implement rich phylogenetic models and algorithms using a fixed tree topology. One can either use automatic differentiation, or leverage torchtree's plug-in system to compute gradients analytically for model components for which automatic differentiation is slow. We demonstrate that the torchtree variational inference framework performs similarly to BEAST in terms of speed and approximation accuracy. Furthermore, we explore the use of the forward KL divergence as an optimizing criterion for variational inference, which can handle discontinuous and non-differentiable ...
The SI.pdf file contains supplementary methods and figures referenced in the main manuscript (found on Zenodo under Supplemental Information).
The data.zip contains input files and phylogenetic trees used for analyses in the associated manuscript. The data are organized by dataset (HCV and SC2) and by tool (beast and torchtree), and include sequence alignments (see next section for the SC2 alignment) and configuration files (XML and JSON files). torchtree uses variational Bayes while BEAST uses MCMC.
data/
├── HCV/
│ ├── HCV.fasta # Sequence alignment for HCV
│ ├── HCV.tree # Newick ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Personal Protective Equipment Dataset (PPED)
This dataset serves as a benchmark for PPE detection in chemical plants. We provide both the dataset and the experimental results.
We produced the dataset based on actual needs and the relevant regulations in chemical plants. The standard GB 39800.1-2020, formulated by the Ministry of Emergency Management of the People’s Republic of China, defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.
1.1. Image collection
We took more than 3,300 pictures, varying several characteristics: environment, distance, lighting conditions, camera angle, and the number of people photographed.
Backgrounds: There are 4 backgrounds: office, near machines, factory, and regular outdoor scenes.
Scale: By taking pictures from different distances, the captured PPE items are classified into small, medium, and large scales.
Light: Both good and poor lighting conditions were covered.
Diversity: Some images contain a single person, and some contain multiple people.
Angle: The pictures can be divided into front and side views.
In total, more than 3,300 raw photos were taken under all of these conditions. All images are located in the folder PPED/data/JPEGImages.
1.2. Label
We use LabelImg as the labeling tool, with annotations in the PASCAL VOC format. YOLO uses the txt format, so trans_voc2yolo.py can be used to convert the XML files in PASCAL VOC format to txt files. Annotations are stored in the folder PPED/data/Annotations.
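For reference, this VOC-to-YOLO conversion amounts to normalizing each bounding box by the image size and writing one class x_center y_center width height line per object. The sketch below illustrates that logic only; the class names and paths are placeholders, and the bundled trans_voc2yolo.py may differ in its details.

```python
# Minimal sketch of a PASCAL VOC -> YOLO label conversion.
# Class names and paths are placeholders; the bundled trans_voc2yolo.py may differ.
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = ["helmet", "goggles", "gloves"]  # hypothetical PPE class list

def voc_to_yolo(xml_path: str, out_dir: str) -> None:
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO line: class_id x_center y_center width height, all normalized to [0, 1]
        xc, yc = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{CLASSES.index(name)} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    out_file = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out_file.write_text("\n".join(lines))
```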
1.3. Dataset Features
The pictures were taken by us under the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file that notes the features of all images: for each picture it records the lighting conditions, angle, background, number of people, and scale.
1.4. Dataset Division
The dataset is divided into training and test sets at a ratio of 9:1.
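A 9:1 split of this kind can be reproduced with a simple random partition of the image file names. The sketch below is purely illustrative; the image folder, file extension, and random seed are assumptions, and it does not reproduce the official split shipped with the dataset.

```python
# Illustrative 9:1 random split of image file names (not the official split).
import random
from pathlib import Path

random.seed(0)  # assumed seed, only to make this sketch reproducible
images = sorted(p.name for p in Path("PPED/data/JPEGImages").glob("*.jpg"))
random.shuffle(images)
cut = int(0.9 * len(images))
Path("train.txt").write_text("\n".join(images[:cut]))
Path("test.txt").write_text("\n".join(images[cut:]))
```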
2. Baseline Experiments
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in the folder PPED/experiment.
2.1. Environment and Configuration:
Intel Core i7-8700 CPU
NVIDIA GTX1060 GPU
16 GB of RAM
Python: 3.8.10
pytorch: 1.9.0
pycocotools: pycocotools-win
Windows 10
2.2. Applied Models
The source code and results of the applied models are given in the folder PPED/experiment, with sub-folders corresponding to the model names.
2.2.1. Faster R-CNN
backbone: resnet50+fpn
We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.
We modified the dataset path, training classes and training parameters including batch size.
We run train_res50_fpn.py to start training (a hedged sketch of this fine-tuning pattern is given at the end of this subsection).
The model weights are then trained on the training set.
Finally, we validate the results on the test set.
backbone: mobilenetv2
The same training method as for resnet50+fpn was applied; the results are not as good as those of resnet50+fpn.
The Faster R-CNN source code used in our experiment is given in the folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in the files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in the folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).
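The training procedure described above (download COCO-pretrained weights, adapt the number of classes, adjust the batch size, then train on the PPED training set) follows the standard torchvision fine-tuning pattern. The sketch below shows that pattern only; the class count, optimizer settings, and data loader are assumptions and do not reproduce train_res50_fpn.py.

```python
# Sketch of fine-tuning torchvision's Faster R-CNN (resnet50+fpn) on a custom class set.
# Class count, optimizer settings, and the data loader are illustrative only.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 1 + 4  # background + a hypothetical number of PPE classes

# pretrained=True downloads the COCO weights (fasterrcnn_resnet50_fpn_coco);
# newer torchvision releases use the weights=... argument instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
model.train()
# A training loop would iterate over a loader yielding (images, targets),
# where each target dict holds "boxes" and "labels":
# for images, targets in train_loader:
#     loss_dict = model(images, targets)   # classification + box regression losses
#     loss = sum(loss_dict.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```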
2.2.2. SSD
backbone: resnet50
We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.
The same training method as Faster R-CNN is applied.
The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.
2.2.3. YOLOv3-spp
backbone: DarkNet53
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov3-spp-ultralytics-608.pt.
The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.
2.2.4. YOLOv5
backbone: CSP_DarkNet
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov5s.
The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.
2.3. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.
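With the pycocotools package listed in the environment, COCO-style detection metrics are typically computed via COCOeval. The sketch below assumes the ground truth and detections have been exported to COCO-format JSON files; the file names are placeholders, and this is not necessarily how the scripts in PPED/experiment/eval are organized.

```python
# COCO-style mAP evaluation with pycocotools; the JSON file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_coco.json")           # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_coco.json")  # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR, including the small/medium/large breakdown
```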
Faster R-CNN (R and M)
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py
SSD
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py
YOLOv3-spp
YOLOv5
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Chinese Chemical Safety Signs (CCSS)
This dataset is compiled as a benchmark for recognizing chemical safety signs from images. We provide both the dataset and the experimental results.
1. The Dataset
The complete dataset is contained in the folder ccss/data. The images include signs based on the Chinese standard "Safety Signs and their Application Guidelines" (GB 2894-2008) for safety signs in chemical environments. This standard, in turn, refers to the standards ISO 7010 (Graphical symbols – Safety Colours and Safety Signs – Safety signs used in workplaces and public areas), GB/T 10001 (Public Information Graphic Symbols for Signs), and GB 13495 (Fire Safety Signs).
1.1. Image Collection
We collect photos of commonly used chemical safety signs in chemical laboratories and chemistry teaching. For a discussion of the standards on which we base our collection, refer to the book "Talking about Hazardous Chemicals and Safety Signs" for common signs, and to the safety signs guidelines (GB 2894-2008).
Under all conditions, a total of 4,650 photos were taken as original data. These were expanded to 27,900 photos via data augmentation. All images are located in the folder ccss/data/JPEGImages. The file ccss/data/features/enhanced_data_to_original_data.csv provides a mapping between each enhanced image name and the corresponding original image.
1.2. Annotation and Labeling
We use LabelImg as the labeling tool, which, in turn, uses the PASCAL VOC annotation format. The annotations are stored in the folder ccss/data/Annotations.
Faster R-CNN and SSD are two algorithms that use this format directly. When training YOLOv3-spp or YOLOv5, you can run trans_voc2yolo.py to convert the XML files in PASCAL VOC format to txt files.
We provide further meta-information about the dataset in the form of a CSV file, features.csv, which notes, for each image, which other features it has (lighting conditions, scale, multiplicity, etc.). We apply the COCO standard for deciding whether a target is small, medium, or large in size.
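For reference, the COCO convention classifies a target by the pixel area of its bounding box: small below 32x32, medium between 32x32 and 96x96, and large above 96x96. A small illustrative helper:

```python
# COCO object-size rule: area < 32*32 is small, 32*32..96*96 is medium, larger is large.
def coco_size_class(box_width: float, box_height: float) -> str:
    area = box_width * box_height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```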
1.3. Dataset Features
As stated above, the images have been shot under different conditions. We provide all the feature information in the folder ccss/data/features. For each feature, there is a separate list of file names in that folder. The file ccss/data/features/features_on_original_data.csv is a CSV file which notes all the features of each original image.
1.4. Dataset Division
The dataset has a fixed 7:3 division into training and test sets. You can find the corresponding image names in the files ccss/data/training_data_file_names.txt and ccss/data/test_data_file_names.txt.
2. Baseline Experiments
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in the folder ccss/experiment.
2.2. Environment and Configuration:
2.3. Applied Models
The source code and results of the applied models are given in the folder ccss/experiment, with sub-folders corresponding to the model names.
2.3.1. Faster R-CNN
The Faster R-CNN (R) model is trained with train_res50_fpn.py; the source code used in our experiment is given in the folder ccss/experiment/sources/faster_rcnn (R). The weights of the fully-trained Faster R-CNN (R) model are stored in the file ccss/experiment/trained_models/faster_rcnn (R).pth. The performance measurements of Faster R-CNN (R) are stored in the folder ccss/experiment/performance_indicators/faster_rcnn (R).
The Faster R-CNN (M) model is trained with train_mobilenetv2.py; the source code is given in the folder ccss/experiment/sources/faster_rcnn (M). The weights of the fully-trained Faster R-CNN (M) model are stored in the file ccss/experiment/trained_models/faster_rcnn (M).pth. The performance measurements of Faster R-CNN (M) are stored in the folder ccss/experiment/performance_indicators/faster_rcnn (M).
2.3.2. SSD
The SSD source code used in our experiment is given in the folder ccss/experiment/sources/ssd. The weights of the fully-trained SSD model are stored in the file ccss/experiment/trained_models/ssd.pth. The performance measurements of SSD are stored in the folder ccss/experiment/performance_indicators/ssd.
2.3.3. YOLOv3-spp
We run trans_voc2yolo.py to convert the XML files in VOC format to txt files. The YOLOv3-spp source code used in our experiment is given in the folder ccss/experiment/sources/yolov3-spp. The weights of the fully-trained YOLOv3-spp model are stored in the file ccss/experiment/trained_models/yolov3-spp.pt. The performance measurements of YOLOv3-spp are stored in the folder ccss/experiment/performance_indicators/yolov3-spp.
2.3.4. YOLOv5
We run trans_voc2yolo.py to convert the XML files in VOC format to txt files. The YOLOv5 source code used in our experiment is given in the folder ccss/experiment/sources/yolov5. The weights of the fully-trained YOLOv5 model are stored in the file ccss/experiment/trained_models/yolov5.pt. The performance measurements of YOLOv5 are stored in the folder ccss/experiment/performance_indicators/yolov5.
2.4. Evaluation
The computed evaluation metrics, as well as the code needed to compute them from our dataset, are provided in the folder ccss/experiment/performance_indicators. They are provided over the complete test set as well as separately for each image feature (over the test set).
3. Code Sources
We are particularly thankful to the author of the GitHub repository WZMIAOMIAO/deep-learning-for-image-processing (with whom we are not affiliated). Their instructive videos and code were most helpful during our work.
This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.
Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.
The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, the database should be stored on the fastest storage hardware available to you.
IMPORTANT: We recommend going into each folder and downloading the files individually; if you attempt to download each folder in one go, it will download a zip file which will need to be decompressed. This is particularly an issue when downloading the TED database, as you will need roughly twice the storage space compared to downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
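For the default in-memory format, loading the database amounts to reading the PyTorch tensor and unpickling the accompanying list of tuples. The sketch below is illustrative only; the file names are placeholders and this is not the Foldclass/Merizo-search API.

```python
# Sketch of reading the default database format: a PyTorch tensor of embeddings plus
# a pickled list of Python tuples. File names are assumptions, not the actual layout.
import pickle
import torch

embeddings = torch.load("cath43_db.pt", map_location="cpu")  # hypothetical tensor file
with open("cath43_db.pkl", "rb") as fh:
    metadata = pickle.load(fh)                               # hypothetical pickled tuples

print(tuple(embeddings.shape), len(metadata))
```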
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
For fast reproduction of our results, we provide PyTorch datasets of precomputed interaction graphs for the entire PDBbind database on Zenodo. To enable quick establishment of leakage-free evaluation setups with PDBbind, we also provide pairwise similarity matrices for the entire PDBbind dataset on Zenodo.
I made this data annotation for a conference paper. I am trying to make an application that is fast and light enough to deploy on any cutting-edge device while maintaining accuracy comparable to state-of-the-art models.
The following pre-processing was applied to each image:
* Auto-orientation of pixel data (with EXIF-orientation stripping)
* Resize to 416x416 (Stretch)
The following augmentation was applied to create 3 versions of each source image in the training set:
* 50% probability of horizontal flip
* 50% probability of vertical flip
* Equal probability of one of the following 90-degree rotations: none, clockwise, counter-clockwise, upside-down
* Random crop of between 0 and 7 percent of the image
* Random rotation of between -40 and +40 degrees
* Random shear of between -29° and +29° horizontally and -15° and +15° vertically
* Random exposure adjustment of between -34 and +34 percent
* Random Gaussian blur of between 0 and 1.5 pixels
* Salt and pepper noise applied to 4 percent of pixels
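If you want to approximate these export-time augmentations on the fly in PyTorch, a rough torchvision equivalent is sketched below. The parameters only approximate the settings listed above, the random crop step is omitted, and for detection training the geometric transforms would also have to be applied to the bounding boxes, which this sketch does not do.

```python
# Rough on-the-fly approximation of the export-time augmentations using torchvision.
# Parameters only approximate the listed settings; the released versions were pre-generated.
import random
import torch
from torchvision import transforms
import torchvision.transforms.functional as TF

def random_90(img):
    # Equal probability of: none, 90 clockwise, 90 counter-clockwise, upside-down.
    return TF.rotate(img, random.choice([0, 90, -90, 180]))

def salt_and_pepper(img, frac=0.04):
    # img is a (C, H, W) float tensor in [0, 1]; corrupt about 4% of pixels.
    noise = torch.rand(img.shape[-2:])
    img = img.clone()
    img[..., noise < frac / 2] = 0.0
    img[..., noise > 1.0 - frac / 2] = 1.0
    return img

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.Lambda(random_90),
    transforms.RandomRotation(40),
    transforms.RandomAffine(degrees=0, shear=(-29, 29, -15, 15)),
    transforms.ColorJitter(brightness=0.34),              # stand-in for exposure adjustment
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
    transforms.ToTensor(),
    transforms.Lambda(salt_and_pepper),
])
```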
A big shoutout to Massey University for making this dataset public. The original dataset link is: here. Please keep in mind that the original dataset may be updated from time to time; however, I do not intend to update this annotated version.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This dataset contains training data of two types:
Each volume has the shape (128,128,128), and is of float32 precision. They are saved as .tiff files, which can be easily read via the tifffile Python library.
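Reading one of these volumes with the tifffile library mentioned above is a one-liner; the file name below is a placeholder.

```python
# Read one (128, 128, 128) float32 volume from a .tiff file; the path is a placeholder.
import tifffile

volume = tifffile.imread("example_volume.tiff")
print(volume.shape, volume.dtype)  # expected: (128, 128, 128) float32
```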
Additionally, the dataset contains:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Code and source data for the study "Knowledge-Guided Machine Learning can improve C cycle quantification in agroecosystems" (https://doi.org/10.1038/s41467-023-43860-5). All files belong to Licheng Liu and Zhenong Jin at the University of Minnesota. deposit_code_v2.zip contains packaged code and sample runs for KGML-ag-Carbon training, validation, and implementation. Source Data.zip contains the data for generating the figures in the study.
Note: We used PyTorch 1.6.0 (https://pytorch.org/get-started/previous-versions/, last access: 21 Oct 2023) and Python 3.7.11 (https://www.python.org/downloads/release/python-3711/, last access: 21 Oct 2023) as the programming environment for model development. Statistical analysis, such as linear regression, was conducted using Statsmodels 0.14.0 (https://github.com/statsmodels/statsmodels/, last access: 21 Oct 2023). In order to use a GPU to speed up the training process, we installed the CUDA Toolkit 10.1.243 (https://developer.nvidia.com/cuda-toolkit, last access: 21 Oct 2023).
To use the full kgml_lib functionality, please create a new environment with the same Python version and libraries as listed above.
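As a quick sanity check that a freshly created environment matches the versions listed above, something like the following can be run (purely illustrative):

```python
# Quick check that a new environment matches the versions listed above.
import sys
import torch
import statsmodels

print("python:", sys.version.split()[0])         # expected 3.7.11
print("pytorch:", torch.__version__)             # expected 1.6.0
print("statsmodels:", statsmodels.__version__)   # expected 0.14.0
print("cuda available:", torch.cuda.is_available())
```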
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
These are the models in http://hdl.handle.net/20.500.12537/125 trained with 40% layer drop. They are suitable for inference using every other layer, which speeds up inference at the cost of some translation performance. We refer to the prior submission for usage and to the documentation on LayerDrop at https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/layerdrop/README.md.
These models were trained with 40% layer drop on the models in http://hdl.handle.net/20.500.12537/125. They are well suited for translation with every other layer of the network dropped, which speeds up translation at the cost of quality. Instructions for using the networks can be found with the original models and in the Fairseq documentation at https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/layerdrop/README.md.