This dataset was created by Peeyush Kant Misra
The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications. Generating experimental data on a large scale is an expensive and arduous process. In domains such as medical diagnostics, where real-life data is hard to obtain, synthetic data has been shown to be extremely valuable. We, teams from IIIT Hyderabad, Intel, AWS, and Insilico Medicine, have performed physics-based calculations (molecular dynamics simulations) on about 20,000 protein-ligand complexes. The dataset comprises molecular dynamics snapshots, binding affinities calculated using the MM-PBSA method, and individual energy components, including electrostatic and van der Waals interactions.
Dataset file formats: (i) 3D coordinates of the protein-ligand complexes (PDB) in tar.gz files, and (ii) CSV files containing the energy data.
Dataset usages: (i) ML scoring functions for predicting binding affinities of given protein-ligand complexes, (ii) classification models for predicting correct binding poses of ligands, (iii) identification of cryptic binding pockets, and (iv) optimization of binding features by exploiting the individual components of the energy (experimental data has only the total binding affinity).
Existing AI/ML training datasets lack dynamic data and are inherently biased, and binding affinity data in the literature are obtained from different experimental protocols. This dataset was therefore generated with a single, consistent computational protocol: molecular dynamics (MD) simulations followed by free energy calculations.
The dynamic data-enriched protein-ligand coordinates can be used to effectively train convolutional neural network-based regression models for more accurate binding affinity prediction.
This dataset contains raw protein-ligand complexes sourced from PDBbind, along with their experimentally measured binding affinities in -log Kd/pKd units. It serves as a valuable benchmark for training and evaluating molecular docking, scoring functions, and deep learning models for binding affinity prediction.
The dataset includes:
• Protein-ligand complexes identified by PDB codes
• Binding affinity values in -log Kd (pKd)
• Structural data needed for molecular modeling and machine learning applications
• A dataset CSV file listing all protein-ligand complexes and their binding affinities (-log Kd/pKd)
This dataset is not preprocessed, meaning the raw structural files (PDB/MOL2/SDF) are intact and can be featurized using tools like DeepChem for deep learning applications, such as Atomic Convolutional Neural Networks (ACNNs).
References: (1) Li, Y.; Liu, Z. H.; Han, L.; Li, J.; Liu, J.; Zhao, Z. X.; Li, C. K.; Wang, R. X. (2014). Comparative Assessment of Scoring Functions on an Updated Benchmark: I. Compilation of the Test Set. J. Chem. Inf. Model. DOI: 10.1021/ci500080q. (2) Li, Y.; Han, L.; Liu, Z. H.; Wang, R. X. (2014). Comparative Assessment of Scoring Functions on an Updated Benchmark: II. Evaluation Methods and General Results. J. Chem. Inf. Model. DOI: 10.1021/ci500081m. (3) DeepChem: An Open-Source Toolkit for Deep Learning in Drug Discovery, Quantum Chemistry, Materials Science, and Biology. Available at: https://deepchem.io.
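The -log Kd (pKd) units used for the affinities in this entry follow the usual definition pKd = -log10(Kd), with Kd expressed in mol/L. The sketch below illustrates the conversion in both directions. The function names and the example values are hypothetical and are not taken from the dataset's CSV file.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return -math.log10(kd_molar)


def pkd_to_kd(pkd: float) -> float:
    """Convert pKd back to a dissociation constant in mol/L."""
    return 10.0 ** (-pkd)


# Hypothetical example values (not from the dataset).
for label, kd in [("1 nM binder", 1e-9), ("1 uM binder", 1e-6)]:
    print(f"{label}: pKd = {kd_to_pkd(kd):.2f}")
```

A higher pKd means tighter binding, so a one-unit difference corresponds to a tenfold change in Kd.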
Virtual screening of protein–protein and protein–peptide interactions is a challenging task that directly impacts the processes of hit identification and hit-to-lead optimization in drug design projects involving peptide-based pharmaceuticals. Although several screening tools designed to predict the binding affinity of protein–protein complexes have been proposed, methods specifically developed to predict protein–peptide binding affinity are comparatively scarce. Frequently, predictors trained to score the affinity of small molecules are used for peptides indistinctively, despite the larger complexity and heterogeneity of interactions rendered by peptide binders. To address this issue, we introduce PPI-Affinity, a tool that leverages support vector machine (SVM) predictors of binding affinity to screen datasets of protein–protein and protein–peptide complexes, as well as to generate and rank mutants of a given structure. The performance of the SVM models was assessed on four benchmark datasets, which include protein–protein and protein–peptide binding affinity data. In addition, we evaluated our model on a set of mutants of EPI-X4, an endogenous peptide inhibitor of the chemokine receptor CXCR4, and on complexes of the serine proteases HTRA1 and HTRA3 with peptides. PPI-Affinity is freely accessible at https://protdcal.zmb.uni-due.de/PPIAffinity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Antibody and Nanobody Design Dataset (ANDD): A Comprehensive Resource with Sequence, Structure, and Binding Affinity Data
DOI: 10.5281/zenodo.16894086
Resource Type: Dataset
Publisher: Zenodo
Publication Year: 2025
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Overview (Abstract):
The Antibody and Nanobody Design Dataset (ANDD) is a unified, large-scale dataset created to overcome the limitations of data fragmentation and incompleteness in antibody and nanobody research. It integrates sequence, structure, antigen information, and binding affinity data from 15 diverse sources, including OAS, PDB, SabDab, and others. ANDD comprises 48,800 antibody/nanobody sequences, structural data for 25,158 entries, antigen sequences for 12,617 entries, and a total of 9,569 binding affinity values for antibody/nanobody-antigen pairs. A key innovation is the augmentation of experimental affinity data with 5,218 high-quality predictions generated by the ANTIPASTI model. This makes ANDD the largest available dataset of its kind, providing a robust foundation for training and validating deep learning models in therapeutic antibody and nanobody design.
Keywords: Dataset, Antibody Design, Nanobody Design, VHH, Deep Learning, Protein Engineering, Binding Affinity, Therapeutic Antibodies, Computational Biology
Methods (Data Curation and Processing):
The ANDD was constructed through a rigorous multi-step process:
Data Specifications and Format:
The dataset is distributed in two parts:
ANDD.csv: A comprehensive spreadsheet containing all annotated metadata for each entry.
All_structures/ folder: A directory containing the corresponding PDB structure files for entries with structural data.
The ANDD.csv file includes the following key fields (a full description is available in the Data Record section of the paper):
Affinity_Kd(M), ∆Gbinding(kJ), and the Affinity_Method.
Ab/Nano_mutation).
Technical Validation:
The quality of ANDD has been ensured through extensive validation:
Potential Uses:
ANDD is designed to accelerate research in computational biology and drug discovery, including:
Access and License:
The ANDD dataset is publicly available for download under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to share and adapt the material for any purpose, even commercially, provided appropriate credit is given to the original authors and this data descriptor is cited.
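The fields Affinity_Kd(M) and ∆Gbinding(kJ) listed for ANDD.csv are presumably linked by the standard thermodynamic relation ΔG = RT ln(Kd). The sketch below shows that relation only. The temperature of 298.15 K, the per-mole unit (kJ/mol), and the function name are assumptions for illustration, and the data descriptor should be checked for the convention the authors actually used.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def kd_to_delta_g(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Binding free energy in kJ/mol from Kd in mol/L, via dG = R*T*ln(Kd).

    The default temperature is an assumption, not a value from ANDD.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


# Hypothetical Kd values (not from ANDD.csv).
for kd in (1e-6, 1e-9):
    print(f"Kd = {kd:.0e} M -> dG = {kd_to_delta_g(kd):.1f} kJ/mol")
```

Tighter binders (smaller Kd) give more negative ΔG, which is a quick sanity check when comparing the two columns.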
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A major shortcoming of empirical scoring functions for protein–ligand complexes is the low degree of correlation between predicted and experimental binding affinities, as frequently observed not only for large and diverse data sets but also for SAR series of individual targets. Improvements can be envisaged by developing new descriptors, employing larger training sets of higher quality, and resorting to more sophisticated regression methods. Herein, we describe the use of SFCscore descriptors to develop an improved scoring function by means of a PDBbind training set of 1005 complexes in combination with random forest for regression. This provided SFCscoreRF as a new scoring function with significantly improved performance on the PDBbind and CSAR–NRC HiQ benchmarks in comparison to previously developed SFCscore functions. A leave-cluster-out cross-validation and performance in the CSAR 2012 scoring exercise point out remaining limitations but also directions for further improvements of SFCscoreRF and empirical scoring functions in general.
Web-accessible database of data extracted from the scientific literature, focusing on proteins that are drug targets or candidate drug targets and for which structural data are present in the Protein Data Bank. The website supports query types including searches by chemical structure, substructure and similarity, protein sequence, ligand and protein names, affinity ranges, and molecular weight. Data sets generated by BindingDB queries can be downloaded in the form of annotated SDfiles for further analysis, or used as the basis for virtual screening of a compound database uploaded by the user. Data are linked to structural data in the PDB via PDB IDs and chemical and sequence searches, and to the literature in PubMed via PubMed IDs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary data file S4 from the manuscript 'The application of the Open Pharmacological Concepts Triple Store (Open PHACTS) to support Drug Discovery Research' to be published in PLOS ONE
https://github.com/DISIC/politique-de-contribution-open-source/blob/master/LICENSE.pdf
This Zenodo repository provides comprehensive resources for the paper titled "Spatio-temporal learning from molecular dynamics simulations for protein-ligand binding affinity prediction", published in Bioinformatics. We created a dataset of 63,000 molecular dynamics simulations by performing 10 simulations of 10 ns on 6,300 complexes. Neural networks were developed to learn from this data in order to predict the binding affinities of protein-ligand complexes. The implementation of these neural networks is available on GitHub. Our collection includes training/benchmark datasets, trained statistical models, and results on test sets (CSV & PDF files).
Training/benchmark datasets:
Training, validation and test sets are provided to train and evaluate the following neural networks:
For each training methodology (MD data augmentation and spatiotemporal learning), we provide the data for the whole complex, only the ligand or only the protein. Additionally for spatiotemporal learning, we provide the data with only the ligand using the tracking mode.
Statistical models:
We provide the models trained with Pafnucy, Proli, Densenucy, Timenucy and Videonucy. Each model was trained in 10 replicates.
For Pafnucy, Proli, Densenucy, we provide the models trained with random and systematic rotations, as well as with or without MD data augmentation.
For Proli, Densenucy, Timenucy and Videonucy, we provide the models trained on the whole complex, only the ligand or only the protein.
For Pafnucy we also provide the models trained on the reduced set (5932 complexes).
Results on test sets (CSV & PDF files):
We provide the predictions on the PDBbind v.2016 core set.
Results on the FEP dataset are also provided for Pafnucy, Proli and Densenucy.
The raw MD data (~4.5 TB) are stored on the MDDB, where they can be visualized and downloaded.
This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-A0100712496 & 2022-AD011013521) and CRIANN (Grant 2021002).
This dataset contains the predicted prices of the asset Affinity over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
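The projection described in this entry can be sketched as simple annual compounding. This is a minimal illustration, assuming the growth rate is compounded once per year and clamped to the stated range of -100 to 100 percent. The function name and the starting price are hypothetical, and the page's actual calculation may differ.

```python
def project_prices(start_price: float,
                   annual_growth_pct: float = 5.0,
                   years: int = 16) -> list[float]:
    """Return projected prices for years 1..years under annual compounding.

    The growth rate is clamped to [-100, 100] percent, mirroring the slider
    limits described in the text. Compounding is an assumption.
    """
    rate = max(-100.0, min(100.0, annual_growth_pct)) / 100.0
    return [start_price * (1.0 + rate) ** n for n in range(1, years + 1)]


# Hypothetical starting price of 100 units at the default 5 percent rate.
prices = project_prices(100.0)
print(f"Year 1: {prices[0]:.2f}, year 16: {prices[-1]:.2f}")
```

At the lower limit of -100 percent the projected price is zero from the first year onward, which is why the slider cannot go below that value.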
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 65 verified Affinity Group locations in United States with complete contact information, ratings, reviews, and location data.
Database of affinity data for protein-ligand complexes of the Protein Data Bank (PDB) providing direct and free access to the experimental affinity of a given complex structure. Affinity data are exclusively obtained from the scientific literature. As of Thursday, May 1st, 2014, AffinDB contains 748 affinity values covering 474 different PDB complexes. More than one affinity value may be associated with a single PDB complex, which is most frequently due to multiple references reporting affinity data for the same complex. AffinDB provides access to data in three different forms:
1. Summary information for PDB entry
2. Affinity information window
3. Tabular reports
https://dataintelo.com/privacy-and-policy
According to our latest research, the global affinity analysis platform market size reached USD 1.87 billion in 2024, demonstrating robust momentum across sectors. With a projected CAGR of 13.2% during the forecast period, the market is anticipated to attain a value of USD 5.58 billion by 2033. This impressive growth is primarily attributed to increasing demand for advanced data analytics solutions, rising adoption of AI-driven customer insights, and the ongoing digital transformation across industries. As organizations strive to gain a competitive edge through data-driven decision-making, affinity analysis platforms are rapidly becoming indispensable tools for uncovering actionable patterns and optimizing business strategies.
A major growth factor propelling the affinity analysis platform market is the exponential increase in data generation from digital channels, IoT devices, and customer interactions. Organizations across retail, BFSI, healthcare, and e-commerce are leveraging affinity analysis to mine relationships and associations within large datasets, enabling them to understand customer behavior, preferences, and trends with unprecedented accuracy. This demand is further amplified by the proliferation of omnichannel strategies, where businesses seek to create seamless and personalized experiences for their customers. As a result, the need for sophisticated analytics tools capable of real-time processing and actionable insights has never been higher, driving continuous innovation and investment in affinity analysis technologies.
Another significant driver is the integration of artificial intelligence and machine learning algorithms within affinity analysis platforms. These technologies empower organizations to automate complex analytical processes, enhance the accuracy of predictions, and uncover hidden correlations that traditional methods might overlook. The ability to deliver highly targeted marketing campaigns, optimize product recommendations, and detect fraudulent activities in real time has become a key differentiator for businesses. Furthermore, advancements in cloud computing have democratized access to these platforms, allowing even small and medium enterprises to benefit from enterprise-grade analytics without heavy upfront investments in infrastructure.
The increasing regulatory focus on data privacy and security is also shaping the affinity analysis platform market. As data-driven strategies become central to business operations, organizations are under pressure to comply with stringent regulations such as GDPR, CCPA, and HIPAA. This has led to a surge in demand for platforms that offer robust security features, data governance capabilities, and compliance tools. Vendors are responding by enhancing their offerings with advanced encryption, access controls, and audit trails, thereby building trust and ensuring the responsible use of customer data. This regulatory landscape, while challenging, is also fostering innovation and driving adoption among risk-averse industries like healthcare and finance.
From a regional perspective, North America continues to dominate the affinity analysis platform market, accounting for the largest share owing to the early adoption of advanced analytics, presence of key technology providers, and high digital maturity of enterprises. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid digitalization, booming e-commerce, and increasing investments in AI and big data. Europe remains a significant market, driven by stringent data protection regulations and a strong focus on customer-centric business models. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by expanding digital infrastructure and rising awareness of the benefits of affinity analysis.
The affinity analysis platform market by component is segmented into software and services, each playing a crucial role in delivering value to end-users. The software segment, which includes analytics engines, visualization tools, and data integration modules, holds the lion’s share of the market. This dominance is attributed to the continuous advancements in analytics algorithms, user-friendly interfaces, and integration capabilities with existing enterprise systems. Organizations are increasingly seeking scalable and customizable software solutions that can handle large vol
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For fast reproduction of our results, we provide PyTorch datasets of precomputed interaction graphs for the entire PDBbind database on Zenodo. To enable quick establishment of leakage-free evaluation setups with PDBbind, we also provide pairwise similarity matrices for the entire PDBbind dataset on Zenodo.
Version 2 - Updated to improve the accuracy of Tanimoto Scores in the pairwise similarity matrices, which also caused minor changes in the composition of PDBbind CleanSplit.
Version 3 - Including pairwise similarity matrix for sequence identity (from TM-align)
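To illustrate how a pairwise Tanimoto similarity matrix can support a leakage-free split, the sketch below computes Tanimoto scores on toy fingerprints and drops training items that are too similar to any test item. The fingerprints (sets of on-bit indices), the threshold, and all function names are hypothetical. This is not the procedure used to build PDBbind CleanSplit, which also relies on other similarity measures.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pairwise_matrix(fps: list[set[int]]) -> list[list[float]]:
    """Symmetric matrix of Tanimoto scores for all fingerprint pairs."""
    n = len(fps)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = tanimoto(fps[i], fps[j])
    return matrix


def leakage_free_train(matrix, train_idx, test_idx, threshold):
    """Keep training items whose similarity to every test item is below threshold."""
    return [i for i in train_idx
            if all(matrix[i][j] < threshold for j in test_idx)]


# Toy fingerprints (hypothetical, not derived from PDBbind ligands).
fingerprints = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
sim = pairwise_matrix(fingerprints)
kept = leakage_free_train(sim, train_idx=[0, 2], test_idx=[1], threshold=0.3)
print(sim[0][1], kept)
```

Precomputing the full matrix once, as the dataset authors provide, makes it cheap to try different thresholds without recomputing similarities.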
Background: Proteins HMG1 and HMG2 are two of the most abundant non-histone proteins in the nucleus of mammalian cells, and contain a domain of homology with many proteins implicated in the control of development, such as the sex-determination factor Sry and the Sox family of proteins. In vitro studies of interactions of HMG1/2 with DNA have shown that these proteins can bind to many unusual DNA structures, in particular to four-way junctions, with binding affinities of 10⁷ to 10⁹ M⁻¹.
Results: Here we show that HMG1 and HMG2 bind with a much higher affinity, at least 4 orders of magnitude higher, to a new structure, Form X, which consists of a DNA loop closed at its base by a semicatenated DNA junction, forming a DNA hemicatenane. The binding constant of HMG1 to Form X is higher than 5 × 10¹² M⁻¹, and the half-life of the complex is longer than one hour in vitro.
Conclusions: Of all DNA structures described so far with which HMG1 and HMG2 interact, we have found that Form X, a DNA loop with a semicatenated DNA junction at its base, is the structure with the highest affinity by more than 4 orders of magnitude. This suggests that, if similar structures exist in the cell nucleus, one of the functions of these proteins might be linked to the remarkable property of DNA hemicatenanes to associate two distant regions of the genome in a stable but reversible manner.
http://www.apache.org/licenses/LICENSE-2.0
The code, dataset, and model weights are described in the paper "Interformer: An Interaction-Aware Model for Protein-Ligand Docking and Affinity Prediction."
experiment_results.zip: Contains generated results that reproduce the results reported in the paper.
benchmark.zip: Contains the docking and affinity input data for Interformer. You can use the source code to make predictions and reproduce the numbers reported in the paper.
checkpoints.zip: Contains one set of weights for the Energy model and four for the PoseScore and Affinity models.
source_code_1.0.zip: Contains the initial version of the source code.
interformer_train.tar.gz: Contains prepared training data for Interformer. poses/ contains all structures needed for training, poses/ligand contains the re-docking poses generated by the Interformer energy model, poses/ligand/rcsb contains the conformations of the reference ligands, poses/pocket contains all pockets extracted from the raw PDB files from RCSB, poses/uff contains all ligand conformations minimized with UFF starting from the reference ligand, and train/ contains the training CSV.
You can also find the newest version of the source code at https://github.com/tencent-ailab/Interformer
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Receptor affinity data for THC collected from the literature. The columns identify the receptor, the radioligand used in determining affinity, the source species and the tissue from which the receptor was obtained, the Ki value in nanomolar (nM), and the literature reference from which the data were obtained.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 262 verified Guaranteed Rate Affinity locations in United States with complete contact information, ratings, reviews, and location data.
This dataset provides information about the number of properties, residents, and average property values for Affinity Street cross streets in Hoxie, AR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the Supporting Information for the experimental data paper associated with the CASP16 pharmaceutical protein-ligand pose- and affinity-prediction challenge. The contents are summarized as follows. The paper's DOI will be added to this Zenodo record once it is available.
Roche: Semicolon-delimited files with ligand SMILES strings, PDB identifiers, IC50 data (μM) for chymase and ATX, and ligand pKa data, as well as IC50 for cathepsin G, which is similar to chymase but was not used as a CASP16 target.
Idorsia: Semicolon-delimited files with ligand SMILES strings and PDB identifiers for 3CL/Mpro targets. Table of X-ray data processing statistics (Table S1) and structure refinement statistics (Table S2).
The SI also includes an inventory of the SI files with data definitions.
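Semicolon-delimited files with IC50 values in μM, as described for the Roche data, can be read with Python's standard csv module and converted to pIC50 = -log10(IC50 in mol/L). The sketch below assumes hypothetical column names (ligand_id, smiles, ic50_uM) and made-up rows; the real headers are defined in the SI inventory file and will likely differ.

```python
import csv
import io
import math

# Made-up sample in the semicolon-delimited layout; column names are assumptions.
SAMPLE = (
    "ligand_id;smiles;ic50_uM\n"
    "L1;CCO;0.5\n"
    "L2;c1ccccc1;12.0\n"
)


def read_pic50(handle) -> dict[str, float]:
    """Map each ligand id to pIC50, converting IC50 from micromolar to molar."""
    reader = csv.DictReader(handle, delimiter=";")
    result = {}
    for row in reader:
        ic50_molar = float(row["ic50_uM"]) * 1e-6
        result[row["ligand_id"]] = -math.log10(ic50_molar)
    return result


pic50 = read_pic50(io.StringIO(SAMPLE))
for ligand_id, value in pic50.items():
    print(f"{ligand_id}: pIC50 = {value:.2f}")
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` wrapper.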