The AntiBody Sequence Database is a public dataset for antibody sequence data. It provides unique identifiers for antibody sequences, including both immunoglobulin and single-chain variable fragment sequences. These are critical for immunological studies. The database allows users to search and retrieve antibody sequences based on sequence similarity, specificity, and other biological properties.
Database containing all antibody structures available in the PDB, annotated and presented in a consistent fashion. Each structure is annotated with a number of properties including experimental details, antibody nomenclature (e.g. heavy-light pairings), curated affinity data and sequence annotations. You can use the database to inspect individual structures, create and download datasets for analysis, search the database for structures with similar sequences to your query, and monitor the known structural repertoire of antibodies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Antibody and Nanobody Design Dataset (ANDD): A Comprehensive Resource with Sequence, Structure, and Binding Affinity Data
DOI: 10.5281/zenodo.16894086
Resource Type: Dataset
Publisher: Zenodo
Publication Year: 2025
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Overview (Abstract):
The Antibody and Nanobody Design Dataset (ANDD) is a unified, large-scale dataset created to overcome the limitations of data fragmentation and incompleteness in antibody and nanobody research. It integrates sequence, structure, antigen information, and binding affinity data from 15 diverse sources, including OAS, PDB, SAbDab, and others. ANDD comprises 48,800 antibody/nanobody sequences, structural data for 25,158 entries, antigen sequences for 12,617 entries, and a total of 9,569 binding affinity values for antibody/nanobody-antigen pairs. A key innovation is the augmentation of experimental affinity data with 5,218 high-quality predictions generated by the ANTIPASTI model. This makes ANDD the largest available dataset of its kind, providing a robust foundation for training and validating deep learning models in therapeutic antibody and nanobody design.
Keywords: Dataset, Antibody Design, Nanobody Design, VHH, Deep Learning, Protein Engineering, Binding Affinity, Therapeutic Antibodies, Computational Biology
Methods (Data Curation and Processing):
The ANDD was constructed through a rigorous multi-step process:
Data Specifications and Format:
The dataset is distributed in two parts:
ANDD.csv: A comprehensive spreadsheet containing all annotated metadata for each entry.
All_structures/ folder: A directory containing the corresponding PDB structure files for entries with structural data.
The ANDD.csv file includes the following key fields (a full description is available in the Data Record section of the paper):
Affinity_Kd(M), ∆Gbinding(kJ), and the Affinity_Method.
Ab/Nano_mutation).
Technical Validation:
The quality of ANDD has been ensured through extensive validation:
Potential Uses:
ANDD is designed to accelerate research in computational biology and drug discovery, including:
Access and License:
The ANDD dataset is publicly available for download under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to share and adapt the material for any purpose, even commercially, provided appropriate credit is given to the original authors and this data descriptor is cited.
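The Affinity_Kd(M) and ∆Gbinding(kJ) fields named in the data specification are linked by a standard thermodynamic relation, ΔG = RT ln(Kd / 1 M). The following is a minimal sketch of that conversion. Whether ANDD derived its ∆G values this way, and at which temperature, is not stated here, so the 298.15 K default and the function names are assumptions for illustration only.

```python
import math

# Gas constant in kJ/(mol*K).
R_KJ_PER_MOL_K = 8.314462618e-3


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Convert a dissociation constant (M) to binding free energy (kJ/mol).

    Uses dG = R * T * ln(Kd / 1 M). The temperature default is an assumption.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive number in molar units")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


def delta_g_to_kd(delta_g_kj, temperature_k=298.15):
    """Inverse conversion: binding free energy (kJ/mol) back to Kd (M)."""
    return math.exp(delta_g_kj / (R_KJ_PER_MOL_K * temperature_k))


if __name__ == "__main__":
    # A 1 nM binder corresponds to roughly -51 kJ/mol at 298.15 K.
    print(round(kd_to_delta_g(1e-9), 2))
```

Tighter binders (smaller Kd) give more negative ΔG, which is a quick sanity check when comparing the two columns of any affinity table.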
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically using the protease trypsin, which are then analyzed via mass spectrometry. These peptides are matched to sequences in standard databases like UniProt or NCBI-RefSeq for identification.
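The digestion step described above can be imitated in silico. The sketch below applies the commonly used trypsin rule (cleave after K or R, but not when the next residue is P). It is an illustration only, not the workflow's actual code, and the function name and parameters are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0, min_length=1):
    """Return tryptic peptides of a protein sequence.

    Cleaves C-terminal to K or R unless the following residue is P.
    Peptides spanning up to `missed_cleavages` uncut sites are included.
    """
    # Collect cut positions, including both sequence ends.
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start in range(len(sites) - 1):
        for missed in range(missed_cleavages + 1):
            end = start + 1 + missed
            if end >= len(sites):
                break
            peptide = protein[sites[start]:sites[end]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AKRPGKC"))
```

In this toy sequence the R is followed by P, so no cut is made there and the peptide RPGK stays intact.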
However, a major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. As a result, relying on such limited databases can lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.
Recently, through next-generation sequencing of antibody gene repertoires, it has become possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These large numbers of sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases.
We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub. The database and metadata files generated by this workflow are stored in this Zenodo dataset, and they are used in DAT-DB — a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).
Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. The "Sequence" column contains tryptic peptides. "Filename" is the file where the data was collected. "Patient" refers to the patient number as listed in metadata2.csv. "BSource" refers to the B-cells' source, and "BType" refers to the type of B-cells. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" indicates the number of patients having this peptide, and "N_antibody" specifies the number of antibodies containing this peptide. "Length_aa" indicates the number of amino acids in the peptide, while "CDR3" shows whether the peptide is found in the CDR3 region.
The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
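As a rough illustration of how the columns described above could be turned into a FASTA file, the sketch below filters rows and writes one record per unique peptide. It assumes the rows have already been read from a database table into dictionaries keyed by column name. The table names, the encoding of the CDR3 column (treated here as a truthy flag), and the FASTA header layout are all assumptions, not the format produced by DAT-DB.

```python
def rows_to_fasta(rows, isotype=None, cdr3_only=False, min_patients=1):
    """Build FASTA text from peptide rows.

    Each row is a dict with at least the keys Sequence, Isotype,
    N_patient, N_antibody and CDR3 (column names taken from the text).
    """
    records = []
    seen = set()
    for row in rows:
        if isotype is not None and row["Isotype"] != isotype:
            continue
        if cdr3_only and not row["CDR3"]:
            continue
        if row["N_patient"] < min_patients:
            continue
        sequence = row["Sequence"]
        if sequence in seen:
            continue  # keep each peptide once
        seen.add(sequence)
        # Hypothetical header layout, chosen only for this example.
        header = (
            f">pep{len(records) + 1}|isotype={row['Isotype']}"
            f"|n_patient={row['N_patient']}|n_antibody={row['N_antibody']}"
        )
        records.append(f"{header}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up rows for demonstration; not taken from the dataset.
example_rows = [
    {"Sequence": "GLEWVSR", "Isotype": "IgG", "N_patient": 3, "N_antibody": 12, "CDR3": False},
    {"Sequence": "DTAVYYCAR", "Isotype": "IgG", "N_patient": 1, "N_antibody": 2, "CDR3": True},
    {"Sequence": "GLEWVSR", "Isotype": "IgM", "N_patient": 2, "N_antibody": 5, "CDR3": False},
]

if __name__ == "__main__":
    print(rows_to_fasta(example_rows, isotype="IgG"))
```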
Tracks all antibody and nanobody related therapeutics recognized by the World Health Organisation, and identifies any corresponding structures in the Structural Antibody Database with near exact or exact variable domain sequence matches. Synchronized with SAbDab to update weekly, reflecting new Protein Data Bank entries and availability of new sequence data published by WHO.
A database of antibody structure containing sequences from Kabat, IMGT and the Protein Data Bank (PDB), as well as structure data from the PDB. It provides search of the sequence data on various criteria and display of results in different formats. For data from the PDB, sequence searches can be combined with structural constraints. For example, one can ask for all the antibodies with a 10-residue Kabat CDR-L1 with a serine at H23 and an arginine within 10 Å of H36. The site also has software for structure analysis and other information on antibody structure available.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Structural region to sequence mapping for RosettaAntibody.
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
A dataset of ~500 antibodies with binding affinity: antibody sequence, antigen sequence, Kd. Obtained from SAbDab via Therapeutic Data Commons.
Python code (get_antibody_affinity_data.py) and dataset (antibody_affinity_protein_sabdab.csv)
The antibody repertoire is a critical component of the adaptive immune system and is believed to reflect an individual’s immune history and current immune status. Delineating the antibody repertoire has advanced our understanding of humoral immunity, facilitated antibody discovery, and shown great potential for improving the diagnosis and treatment of disease. However, no tool to date has effectively integrated big Rep-seq data and prior knowledge of functional antibodies to elucidate the remarkably diverse antibody repertoire. We developed a Rep-seq dataset Analysis Platform with an Integrated antibody Database (RAPID; https://rapid.zzhlab.org/), a free and web-based tool that allows researchers to process and analyse Rep-seq datasets. RAPID consolidates 521 WHO-recognized therapeutic antibodies, 88,059 antigen- or disease-specific antibodies, and 306 million clones extracted from 2,449 human IGH Rep-seq datasets generated from individuals with 29 different health conditions. RAPID also integrates a standardized Rep-seq dataset analysis pipeline to enable users to upload and analyse their datasets. In the process, users can also select a set of existing repertoires for comparison. RAPID automatically annotates clones based on integrated therapeutic and known antibodies, and users can easily query antibodies or repertoires based on sequence or optional keywords. With its powerful analysis functions and rich set of antibody and antibody repertoire information, RAPID will benefit researchers in adaptive immune studies.
The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains.
CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
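The per-position variability measure mentioned above is usually given as the Wu-Kabat index: the number of different amino acids observed at a position divided by the relative frequency of the most common one. The sketch below computes it for a set of pre-aligned sequences. It assumes this standard definition matches the one used by the database, and the function name and gap handling are illustrative choices.

```python
from collections import Counter


def wu_kabat_variability(aligned_sequences, gap="-"):
    """Return the Wu-Kabat variability for each alignment column.

    variability = (number of distinct residues)
                  / (fraction of sequences carrying the most common residue)
    Columns containing only gaps yield None.
    """
    if not aligned_sequences:
        raise ValueError("at least one sequence is required")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all sequences must have the same aligned length")

    variability = []
    for position in range(length):
        residues = [seq[position] for seq in aligned_sequences if seq[position] != gap]
        if not residues:
            variability.append(None)
            continue
        counts = Counter(residues)
        most_common_count = counts.most_common(1)[0][1]
        variability.append(len(counts) / (most_common_count / len(residues)))
    return variability


if __name__ == "__main__":
    # Column 1 is invariant; column 2 has three residues, the commonest in 2 of 4.
    print(wu_kabat_variability(["AC", "AD", "AE", "AC"]))
```

An invariant position scores 1, and the theoretical maximum for 20 equally frequent amino acids is 400, which is why CDR positions stand out as peaks.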
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
More templates are available for all structural regions in the new database.
This data is released alongside "p-IgGen: A Paired Antibody Generative Language Model", which contains full details on the data processing and cleaning. p-IgGen Paper: https://www.biorxiv.org/content/10.1101/2024.08.06.606780v1. OAS: https://opig.stats.ox.ac.uk/webapps/oas/
An annotated, searchable collection of HIV-1 cytotoxic and helper T-cell epitopes and antibody binding sites, plus related tools and information. The goal of this database is to provide a comprehensive listing of defined HIV epitopes. These data are also printed in the HIV Molecular Immunology compendium, which is updated yearly and provided free of charge to scientific researchers, both by online download and as a printed copy. The data included in this database are extracted from the HIV immunology literature. HIV-specific B-cell and T-cell responses are summarized and annotated. Immunological responses are divided into three sections, CTL (CD8+), T helper (CD4+), and antibody. Within these sections, defined epitopes are organized by protein and binding sites within each protein, moving from left to right through the coding regions spanning the HIV genome. We include human responses to natural HIV infections, as well as vaccine studies in a range of animal models and human trials. Responses that are not specifically defined, such as responses to whole proteins or monoclonal antibody responses to discontinuous epitopes, are summarized at the end of each protein sub-section. Studies describing general HIV responses to the virus, but not to any specific protein, are included at the end of each section. The annotation includes information such as cross-reactivity, escape mutations, antibody sequence, TCR usage, functional domains that overlap with an epitope, immune response associations with rates of progression and therapy, and how specific epitopes were experimentally defined. Basic information such as HLA specificities for T-cell epitopes, isotypes of monoclonal antibodies, and epitope sequences are included whenever possible. All studies that we can find that incorporate the use of a specific monoclonal antibody are included in the entry for that antibody. A single T-cell epitope can have multiple entries, generally one entry per study.
Finally, tables and maps of all defined linear epitopes relative to the HXB2 reference proteins are provided. Alignments of CTL, helper T-cell, and antibody epitopes are available through the search interfaces. Only responses to HIV-1 and HIV-2 are included in the database.
Structural flexibility in germline gene-encoded antibodies allows promiscuous binding to diverse antigens. The binding affinity and specificity for a particular epitope typically increase as antibody genes acquire somatic mutations in antigen-stimulated B cells. In this work, we investigated whether germline gene-encoded antibodies are optimal for polyspecificity by determining the basis for recognition of diverse antigens by antibodies encoded by three VH gene segments. Panels of somatically mutated antibodies encoded by a common VH gene, but each binding to a different antigen, were computationally redesigned to predict antibodies that could engage multiple antigens at once. The Rosetta multi-state design process predicted antibody sequences for the entire heavy chain variable region, including framework, CDR1, and CDR2 mutations. The predicted sequences matched the germline gene sequences to a remarkable degree, revealing by computational design the residues that are predicted to enable polyspecificity, i.e., binding of many unrelated antigens with a common sequence. The process thereby reverses antibody maturation in silico. In contrast, when designing antibodies to bind a single antigen, a sequence similar to that of the mature antibody sequence was returned, mimicking natural antibody maturation in silico. We demonstrated that the Rosetta computational design algorithm captures important aspects of antibody/antigen recognition. While the hypervariable region CDR3 often mediates much of the specificity of mature antibodies, we identified key positions in the VH gene encoding CDR1, CDR2, and the immunoglobulin framework that are critical contributors for polyspecificity in germline antibodies. Computational design of antibodies capable of binding multiple antigens may allow the rational design of antibodies that retain polyspecificity for diverse epitope binding.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) sequences for MMP-IGHV-targeting set, (b) extracted features for MMP-IGHV-targeting set, (c) sequences for IGHV-reference set, (d) extracted features for IGHV-reference set, (e) distribution of features, (f) statistical testing and feature selection scores in the MMP-IGHV-targeting and IGHV-reference sets, (g) Jaccard coefficient association scores for features within the MMP-IGHV-targeting set and within IGHV-reference set. (XLSX)
Dataset Description
ProteinSpace-TheraHuman-mAbs is a curated collection of therapeutic antibody sequences derived from the Thera-SAbDab database. The dataset contains 1,400 antibody chain sequences (700 heavy chains and 700 light chains) from 700 therapeutic antibodies, all of which are either genetically human or humanized whole monoclonal antibodies (mAbs). Each sequence has been processed with ANARCI (Antibody Numbering and Receptor ClassIfication) to provide IMGT-numbered… See the full description on the dataset page: https://huggingface.co/datasets/melanierb/ProteinSpace-TheraHuman-mAbs.
Epitome is a database of structurally inferred antigenic epitopes in proteins. It includes all known antigenic residues and the antibodies that interact with them, including a detailed description of residues involved in the interaction and their sequence/structure environments. Additionally, interactions can be visualized using an interface into Jmol. The website also contains specialized software, NLProt, to enable users to extract protein names and sequences from natural language text, and links to several other databases involved in antibody/antigen interactions.
Keywords: antibody/antigen interactions, antigen epitope
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.
Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
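To make the idea of a 9-mer peptide lookup concrete, the sketch below splits a sequence into overlapping 9-mers and checks what fraction of them appear in a "peptides" table. It builds a tiny in-memory SQLite database for demonstration. The column name `peptide` and both helper functions are hypothetical, since the real schema of the attached database is not described here.

```python
import sqlite3


def kmers(sequence, k=9):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_toy_db(sequences, k=9):
    """Create an in-memory database with a 'peptides' table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE peptides (peptide TEXT PRIMARY KEY)")
    for sequence in sequences:
        conn.executemany(
            "INSERT OR IGNORE INTO peptides (peptide) VALUES (?)",
            [(p,) for p in kmers(sequence, k)],
        )
    conn.commit()
    return conn


def fraction_of_known_peptides(conn, sequence, k=9):
    """Fraction of the query's k-mers that are present in the peptides table."""
    peptides = kmers(sequence, k)
    if not peptides:
        return 0.0
    hits = 0
    for peptide in peptides:
        row = conn.execute(
            "SELECT 1 FROM peptides WHERE peptide = ? LIMIT 1", (peptide,)
        ).fetchone()
        hits += row is not None
    return hits / len(peptides)


if __name__ == "__main__":
    db = build_toy_db(["EVQLVESGGGLVQ"])
    # Changing the last residue affects only the final 9-mer.
    print(fraction_of_known_peptides(db, "EVQLVESGGGLVK"))
```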
Links:
AbSet is a dataset of antibodies extracted from the PDB, carefully standardized, and enriched with a subset of in silico-generated antibody-antigen complexes containing poses similar to the bound state, along with a novel set of decoys. In total, AbSet comprises over 800,000 structures, encompassing antibodies with paired heavy and light chains (VH-VL), only heavy chains (VH), only light chains (VL), and single-chain variable fragments (scFv), including both free antibodies and those complexed with protein antigens. The in silico dataset was generated through molecular docking using HADDOCK following two distinct approaches:
Blind Docking: Conducted using 2135 experimentally determined antibody-antigen complexes.
Site-Directed Docking: Applied to 1755 complexes, where the antibody sequences were extracted, modeled using ABodyBuilder2, and then docked with their original crystallized antigen.
Each docking run produced 250 poses, which were classified into four quality categories based on DockQ: high quality, medium quality, acceptable quality, and incorrect. This dataset includes molecular descriptors of amino acid residues. These descriptors were calculated for all standardized antibody structures obtained from the PDB. For in silico structures generated via docking, molecular descriptors were computed for 4 selected structures from a set of 250 poses generated per system. The code used to calculate molecular descriptors is available in the GitHub repository. The descriptors include:
Solvent Accessible Surface Area
Relative Accessible Surface Area
Atomic depth
Protrusion index
Hydrophobicity
Sequence
Half-sphere exposure calculations
C coordinates
ϕ and ψ dihedral angles
Secondary structure of the protein
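The four DockQ quality categories mentioned above can be assigned with a simple threshold function. The sketch below uses the cutoffs commonly associated with DockQ (0.23, 0.49 and 0.80). That AbSet uses exactly these cutoffs is an assumption, and the function names are illustrative.

```python
from collections import Counter


def dockq_category(score):
    """Map a DockQ score in [0, 1] to a quality category.

    Thresholds are the commonly cited DockQ cutoffs, assumed here:
    >= 0.80 high, >= 0.49 medium, >= 0.23 acceptable, otherwise incorrect.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("DockQ scores are expected to lie between 0 and 1")
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"


def summarize_poses(scores):
    """Count how many poses of a docking run fall into each category."""
    return Counter(dockq_category(score) for score in scores)


if __name__ == "__main__":
    # Made-up scores standing in for the poses of one docking run.
    print(summarize_poses([0.05, 0.23, 0.48, 0.49, 0.79, 0.80, 0.95]))
```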
In multiple myeloma diseases, monoclonal immunoglobulin light chains (LCs) are abundantly produced. In some cases this leads to the formation of deposits affecting various organs, such as the kidney, while in other cases the LCs remain soluble up to concentrations of several g/L in plasma. The exact factors crucial for the solubility of light chains are poorly understood, but it can be hypothesized that their amino acid sequence plays an important role. Determining the precise sequences of patient-derived light chains is therefore highly desirable. We establish here a novel de novo sequencing workflow for patient-derived LCs, based on the combination of bottom-up and top-down proteomics without database search. PEAKS is used for the de novo sequencing of peptides that are further assembled into full-length LC sequences using ALPS. Top-down proteomics provides the molecular masses of proteoforms and allows the exact determination of the amino acid sequence including all post-translational modifications. This pipeline is then used for the complete de novo sequencing of LCs extracted from the urine of 10 patients with multiple myeloma. We show that for the bottom-up part, digestions with trypsin and Nepenthes digestive fluid are sufficient to produce overlapping peptides able to generate the best sequence candidates. Top-down proteomics is absolutely required to achieve 100% final sequence coverage and characterize clinical samples containing several LCs. Our work highlights an unexpected range of modifications.