100+ datasets found

Data from: AntiBody Sequence Database
bioregistry.io
Updated Jan 23, 2025
Cite
(2025). AntiBody Sequence Database [Dataset]. https://bioregistry.io/absd
Explore at:
Dataset updated
Jan 23, 2025
Description
The AntiBody Sequence Database is a public dataset for antibody sequence data. It provides unique identifiers for antibody sequences, including both immunoglobulin and single-chain variable fragment sequences. These are critical for immunological studies, and the database allows users to search and retrieve antibody sequences based on sequence similarity, specificity, and other biological properties.
Structural Antibody Database
dknet.org
neuinfo.org
+2more
Updated Apr 20, 2022
Cite
(2022). Structural Antibody Database [Dataset]. http://identifiers.org/RRID:SCR_022096
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_022096
Dataset updated
Apr 20, 2022
Description
Database containing all antibody structures available in the PDB, annotated and presented in a consistent fashion. Each structure is annotated with a number of properties, including experimental details, antibody nomenclature (e.g. heavy-light pairings), curated affinity data and sequence annotations. You can use the database to inspect individual structures, create and download datasets for analysis, search for structures with sequences similar to your query, and monitor the known structural repertoire of antibodies.
Antibody and Nanobody Design Dataset (ANDD)
zenodo.org
zip
Updated Sep 26, 2025
Cite
Yikai Wu (2025). Antibody and Nanobody Design Dataset (ANDD) [Dataset]. http://doi.org/10.5281/zenodo.16894086
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.16894086
Dataset updated
Sep 26, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yikai Wu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Title: Antibody and Nanobody Design Dataset (ANDD): A Comprehensive Resource with Sequence, Structure, and Binding Affinity Data

DOI: 10.5281/zenodo.16894086

Resource Type: Dataset

Publisher: Zenodo

Publication Year: 2025

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Overview (Abstract):

The Antibody and Nanobody Design Dataset (ANDD) is a unified, large-scale dataset created to overcome the limitations of data fragmentation and incompleteness in antibody and nanobody research. It integrates sequence, structure, antigen information, and binding affinity data from 15 diverse sources, including OAS, PDB, and SAbDab. ANDD comprises 48,800 antibody/nanobody sequences, structural data for 25,158 entries, antigen sequences for 12,617 entries, and a total of 9,569 binding affinity values for antibody/nanobody-antigen pairs. A key innovation is the augmentation of experimental affinity data with 5,218 high-quality predictions generated by the ANTIPASTI model. This makes ANDD the largest available dataset of its kind, providing a robust foundation for training and validating deep learning models in therapeutic antibody and nanobody design.

Keywords: Dataset, Antibody Design, Nanobody Design, VHH, Deep Learning, Protein Engineering, Binding Affinity, Therapeutic Antibodies, Computational Biology

Methods (Data Curation and Processing):

The ANDD was constructed through a rigorous multi-step process:

Data Collection: Data was aggregated from 15 primary sources, including both antibody/nanobody-specific databases (e.g., OAS, SAbDab, INDI, sdAb-DB) and general protein databases (e.g., PDB, UniProt, PDBbind).

Integration and Standardization: Data from disparate sources was consolidated into a consistent format, addressing challenges of format inconsistency. Entries were manually validated to exclude non-relevant data (e.g., T-cell receptors).

Affinity Data Augmentation: The ANTIPASTI deep learning model was used to predict and add binding affinity values for entries that had structural data but lacked experimental affinity measurements.

Manual Curation: Web-based data and information from publicly available patents targeting key antigens (HER2, IL-6, CD45, SARS-CoV-2 RBD) were manually extracted to enhance completeness.

Hierarchical Organization: Data is organized in a hierarchical structure, offering four progressively detailed levels: Sequence-only, Sequence+Structure, Sequence+Structure+Antigen, and Sequence+Structure+Antigen+Affinity.
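
The four nested levels described above can be illustrated with a minimal sketch. The dictionary keys used here (`sequence`, `structure`, `antigen`, `affinity`) are hypothetical placeholders chosen for illustration, not the actual ANDD.csv column names, and the strict nesting is an assumption based on the phrase "progressively detailed".

```python
from typing import Mapping, Optional

# Order of the progressively detailed levels named in the description.
# The keys are hypothetical; the real ANDD.csv columns may be named differently.
LEVEL_ORDER = ["sequence", "structure", "antigen", "affinity"]


def completeness_level(entry: Mapping[str, Optional[str]]) -> str:
    """Return the most detailed level an entry reaches, e.g. 'Sequence+Structure'.

    Levels are treated as nested: an entry only reaches a level if every
    earlier level is also present.
    """
    reached = []
    for field in LEVEL_ORDER:
        if not entry.get(field):
            break  # stop at the first missing piece of information
        reached.append(field.capitalize())
    return "+".join(reached) if reached else "None"


if __name__ == "__main__":
    example = {"sequence": "EVQLVESGG", "structure": "example.pdb", "antigen": None}
    print(completeness_level(example))
```

Grouping entries by the returned label would give one subset per level, which is one way a training pipeline might pick only entries with the annotations it needs.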

Data Specifications and Format:

The dataset is distributed in two parts:

ANDD.csv: A comprehensive spreadsheet containing all annotated metadata for each entry.

All_structures/ folder: A directory containing the corresponding PDB structure files for entries with structural data.

The ANDD.csv file includes the following key fields (a full description is available in the Data Record section of the paper):

General Info: Source, Update_Date, PDB_ID, Experimental_Method, Ab_or_Nano, Source_Organism.

Chain Details: Entity IDs, Asym IDs, Database Accession Codes, and Macromolecule Names for Heavy (H) and Light (L) chains.

Antigen Details: Ag_Name, Ag_Seq, Ag_Source Organism, and relevant database identifiers.

Sequence Data: Full amino acid sequences for H/L chains and individual CDR regions (H1-H3, L1-L3).

Affinity Data: Experimentally measured or predicted Affinity_Kd(M), ∆Gbinding(kJ), and the Affinity_Method.

Mutation Data: Annotation of any amino acid mutations (Ab/Nano_mutation).

Technical Validation:

The quality of ANDD has been ensured through extensive validation:

Manual Curation: A rigorous manual review process was conducted to check for accuracy and consistency between sequence, structure, and affinity data across randomly selected entries.

Affinity Validation with AlphaBind: The experimental Kd values were validated by comparing them against enrichment ratios predicted by the AlphaBind model, showing a significant correlation (Pearson’s r = 0.750).

Cross-Mapping Validation: The internal consistency between Kd and ∆Gbinding values within the dataset was confirmed, showing a perfect correlation (Pearson’s r = 1.000) as per thermodynamic principles.

Proof-of-Concept Application: The dataset's utility was demonstrated by fine-tuning the Diffab generative model on a subset of ANDD. The fine-tuned model showed significant improvements in generating nanobodies with better predicted binding affinity, structural diversity, and developability metrics.
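
The cross-mapping check between Kd and ∆Gbinding can be made concrete with a short sketch. It assumes the standard thermodynamic relation ∆G = RT ln(Kd), with Kd in mol/L relative to a 1 M standard state and a temperature of 298.15 K; the source does not state the temperature or the exact procedure used, so both are assumptions. Because ∆G is linear in ln(Kd), the Pearson correlation between the two is 1 by construction.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def delta_g_kj_per_mol(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Binding free energy from a dissociation constant: dG = R*T*ln(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    kds = [1e-6, 1e-8, 1e-9, 1e-11]  # illustrative values, not taken from ANDD
    log_kds = [math.log(kd) for kd in kds]
    dgs = [delta_g_kj_per_mol(kd) for kd in kds]
    print(dgs)
    print(pearson(log_kds, dgs))
```

A check of this kind confirms that the two columns are mutually consistent; it does not independently validate the underlying measurements.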

Potential Uses:

ANDD is designed to accelerate research in computational biology and drug discovery, including:

Training and benchmarking deep learning models for de novo antibody/nanobody sequence and structure generation.

Developing and validating predictive models for antibody-antigen binding affinity.

Studying structure-function relationships in antibody-antigen interactions.

Facilitating the design of optimized therapeutic antibodies and nanobodies with improved specificity and efficacy.

Access and License:

The ANDD dataset is publicly available for download under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to share and adapt the material for any purpose, even commercially, provided appropriate credit is given to the original authors and this data descriptor is cited.
Data from: Data mining antibody sequences for database searching in...
zenodo.org
bin, csv
Updated Sep 17, 2024
Cite
Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle (2024). Data mining antibody sequences for database searching in bottom-up proteomics [Dataset]. http://doi.org/10.5281/zenodo.11045596
Explore at:
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.11045596
Dataset updated
Sep 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically using the protease trypsin, which are then analyzed via mass spectrometry. These peptides are matched to sequences in standard databases like UniProt or NCBI-RefSeq for identification.
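
The digestion step described above can be sketched in a few lines. This assumes the commonly used simplified trypsin rule (cleave after lysine or arginine unless the next residue is proline) and no missed cleavages; the description does not state which rule or settings the authors used, so this is an illustration only.

```python
from typing import List


def tryptic_digest(sequence: str, min_length: int = 1) -> List[str]:
    """In-silico trypsin digestion with no missed cleavages.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing peptide without K/R
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    print(tryptic_digest("AAKRPGGRTT"))
```

In practice a minimum peptide length is usually applied, since very short peptides are rarely informative for database matching.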

However, a major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. As a result, relying on such limited databases can lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.

Recently, through next-generation sequencing of antibody gene repertoires, it has become possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These large numbers of sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases.

We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub. The database and metadata files generated by this workflow are stored in this Zenodo dataset, and they are used in DAT-DB — a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).

Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. The "Sequence" column contains tryptic peptides. "Filename" is the file where the data was collected. "Patient" refers to the patient number as listed in metadata2.csv. "BSource" refers to the B-cells' source, and "BType" refers to the type of B-cells. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" indicates the number of patients having this peptide, and "N_antibody" specifies the number of antibodies containing this peptide. "Length_aa" indicates the number of amino acids in the peptide, while "CDR3" shows whether the peptide is found in the CDR3 region.
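
As a rough illustration of how rows with these columns might be turned into a FASTA file, the sketch below filters peptide records and formats them. It works on plain dictionaries rather than a .duckdb file, the FASTA header layout is invented for this example, and the CDR3 column is assumed to hold a truthy or falsy flag; none of this is taken from the DAT-DB implementation.

```python
from typing import Iterable, Mapping


def peptides_to_fasta(rows: Iterable[Mapping], min_patients: int = 1,
                      cdr3_only: bool = False) -> str:
    """Format peptide rows as FASTA text, skipping duplicate sequences.

    Each row is expected to carry at least the 'Sequence', 'Isotype',
    'N_patient' and 'CDR3' columns named in the description.
    """
    lines = []
    seen = set()
    for row in rows:
        seq = row["Sequence"]
        if seq in seen:
            continue
        if int(row["N_patient"]) < min_patients:
            continue
        if cdr3_only and not row["CDR3"]:
            continue
        seen.add(seq)
        # Hypothetical header format: running index plus two annotations.
        lines.append(f">pep{len(seen)}|isotype={row['Isotype']}|n_patient={row['N_patient']}")
        lines.append(seq)
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    demo_rows = [  # made-up rows for illustration
        {"Sequence": "GLEWVSAISGSGGSTYYADSVK", "Isotype": "IgG", "N_patient": 3, "CDR3": False},
        {"Sequence": "DYWGQGTLVTVSS", "Isotype": "IgG", "N_patient": 1, "CDR3": True},
    ]
    print(peptides_to_fasta(demo_rows, min_patients=2), end="")
```

Filtering on N_patient is one plausible way to keep only peptides shared across several patients before running a database search.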

The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
Therapeutic Structural Antibody Database
dknet.org
rrid.site
+2more
Updated Apr 20, 2022
+ more versions
Cite
(2022). Therapeutic Structural Antibody Database [Dataset]. http://identifiers.org/RRID:SCR_022093
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_022093
Dataset updated
Apr 20, 2022
Description
Tracks all antibody- and nanobody-related therapeutics recognized by the World Health Organisation (WHO), and identifies any corresponding structures in the Structural Antibody Database (SAbDab) with exact or near-exact variable domain sequence matches. It is synchronized with SAbDab and updated weekly, reflecting new Protein Data Bank entries and new sequence data published by the WHO.
Abysis Database
dknet.org
scicrunch.org
+2more
Updated Aug 1, 2024
Cite
(2024). Abysis Database [Dataset]. http://identifiers.org/RRID:SCR_000756
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_000756
Dataset updated
Aug 1, 2024
Description
A database of antibody structure containing sequences from Kabat, IMGT and the Protein Data Bank (PDB), as well as structure data from the PDB. It supports searching the sequence data on various criteria and displays results in different formats. For data from the PDB, sequence searches can be combined with structural constraints. For example, one can ask for all antibodies with a 10-residue Kabat CDR-L1, a serine at H23 and an arginine within 10 Å of H36. The site also offers software for structure analysis and other information on antibody structure.
Structural region to sequence mapping for RosettaAntibody.
figshare.com
xls
Updated Jun 11, 2023
Cite
Jeliazko R. Jeliazkov; Rahel Frick; Jing Zhou; Jeffrey J. Gray (2023). Structural region to sequence mapping for RosettaAntibody. [Dataset]. http://doi.org/10.1371/journal.pone.0234282.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0234282.t001
Dataset updated
Jun 11, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jeliazko R. Jeliazkov; Rahel Frick; Jing Zhou; Jeffrey J. Gray
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Structural region to sequence mapping for RosettaAntibody.
Antibody dataset Kd
zenodo.org
csv, text/x-python
Updated Aug 11, 2024
Cite
Akbar Rahmad (2024). Antibody dataset Kd [Dataset]. http://doi.org/10.5281/zenodo.13120765
Explore at:
Available download formats: csv, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.13120765
Dataset updated
Aug 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Akbar Rahmad
License
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
Description
A dataset of ~500 antibodies with binding affinity: antibody sequence, antigen sequence, Kd. Obtained from SAbDab via Therapeutic Data Commons

Python code (get_antibody_affinity_data.py) and dataset (antibody_affinity_protein_sabdab.csv)
Table_2_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated...
datasetcatalog.nlm.nih.gov
Updated Aug 13, 2021
Cite
Yu, Xueqing; Xu, Qingxian; Tang, Haipei; Zeng, Huikun; Chen, Yuan; Lan, Chunhong; Zhang, Yanxia; Wang, Minhui; Guan, Junjie; Zhu, Yan; Ma, Cuiyu; Wei, Lai; Zhang, Zhenhai; Xie, Wenxi; Chen, Sen; Yang, Wei; Zhang, Yan; Wang, Qilong; Zhang, Yanfang; Wang, Chengrui; Guo, Shixin; Chen, Tianjian; Yang, Xiujia; Ren, Jian (2021). Table_2_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated Antibody Database.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000844044
Explore at:
Dataset updated
Aug 13, 2021
Authors
Yu, Xueqing; Xu, Qingxian; Tang, Haipei; Zeng, Huikun; Chen, Yuan; Lan, Chunhong; Zhang, Yanxia; Wang, Minhui; Guan, Junjie; Zhu, Yan; Ma, Cuiyu; Wei, Lai; Zhang, Zhenhai; Xie, Wenxi; Chen, Sen; Yang, Wei; Zhang, Yan; Wang, Qilong; Zhang, Yanfang; Wang, Chengrui; Guo, Shixin; Chen, Tianjian; Yang, Xiujia; Ren, Jian
Description
The antibody repertoire is a critical component of the adaptive immune system and is believed to reflect an individual’s immune history and current immune status. Delineating the antibody repertoire has advanced our understanding of humoral immunity, facilitated antibody discovery, and shown great potential for improving the diagnosis and treatment of disease. However, no tool to date has effectively integrated big Rep-seq data and prior knowledge of functional antibodies to elucidate the remarkably diverse antibody repertoire. We developed a Rep-seq dataset Analysis Platform with an Integrated antibody Database (RAPID; https://rapid.zzhlab.org/), a free and web-based tool that allows researchers to process and analyse Rep-seq datasets. RAPID consolidates 521 WHO-recognized therapeutic antibodies, 88,059 antigen- or disease-specific antibodies, and 306 million clones extracted from 2,449 human IGH Rep-seq datasets generated from individuals with 29 different health conditions. RAPID also integrates a standardized Rep-seq dataset analysis pipeline to enable users to upload and analyse their datasets. In the process, users can also select a set of existing repertoires for comparison. RAPID automatically annotates clones based on integrated therapeutic and known antibodies, and users can easily query antibodies or repertoires based on sequence or optional keywords. With its powerful analysis functions and rich set of antibody and antibody repertoire information, RAPID will benefit researchers in adaptive immune studies.
Data from: Kabat Database of Sequences of Proteins of Immunological Interest...
neuinfo.org
scicrunch.org
+2more
Updated Jan 29, 2022
+ more versions
Cite
(2022). Kabat Database of Sequences of Proteins of Immunological Interest [Dataset]. http://identifiers.org/RRID:SCR_006465
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006465 https://identifiers.org/RRID:SCR_006465/resolver?q=&i=rrid
Dataset updated
Jan 29, 2022
Description
The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains. 
CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
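
The CDR position ranges quoted above lend themselves to a small lookup. The sketch below maps a Kabat position label (such as `35B`) on a heavy or light chain to its CDR, using only the ranges given in the description; the function name and the region labels are illustrative choices, and insertion letters are simply ignored when comparing against the numeric ranges.

```python
import re

# Kabat CDR boundaries as quoted in the description.
# 31-35B for CDR-H1 is represented as 31-35, with insertion letters ignored.
KABAT_CDRS = {
    "L": {"CDR-L1": (24, 34), "CDR-L2": (50, 56), "CDR-L3": (89, 97)},
    "H": {"CDR-H1": (31, 35), "CDR-H2": (50, 65), "CDR-H3": (95, 102)},
}


def kabat_region(chain: str, position: str) -> str:
    """Return the CDR name for a Kabat position, or 'framework' otherwise."""
    match = re.fullmatch(r"(\d+)([A-Z]?)", position.strip().upper())
    if match is None:
        raise ValueError(f"not a Kabat position label: {position!r}")
    number = int(match.group(1))
    for name, (first, last) in KABAT_CDRS[chain.upper()].items():
        if first <= number <= last:
            return name
    return "framework"


if __name__ == "__main__":
    for chain, pos in [("L", "23"), ("L", "24"), ("H", "22"), ("H", "35B")]:
        print(chain, pos, kabat_region(chain, pos))
```

The invariant cysteines mentioned in the text (light-chain Cys 23, heavy-chain Cys 22) fall outside the CDRs under this lookup, consistent with the description.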
More templates are available for all structural regions in the new database....
plos.figshare.com
xls
Updated Jun 11, 2023
Cite
Jeliazko R. Jeliazkov; Rahel Frick; Jing Zhou; Jeffrey J. Gray (2023). More templates are available for all structural regions in the new database. [Dataset]. http://doi.org/10.1371/journal.pone.0234282.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0234282.t002
Dataset updated
Jun 11, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jeliazko R. Jeliazkov; Rahel Frick; Jing Zhou; Jeffrey J. Gray
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
More templates are available for all structural regions in the new database.
p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for...
explore.openaire.eu
Updated Oct 2, 2024
Cite
Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane (2024). p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for machine learning applications. [Dataset]. http://doi.org/10.5281/zenodo.13880873
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13880873
Dataset updated
Oct 2, 2024
Authors
Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane
Description
This data is released alongside "p-IgGen: A Paired Antibody Generative Language Model", which contains full details on the data processing and cleaning. p-IgGen Paper: https://www.biorxiv.org/content/10.1101/2024.08.06.606780v1 . OAS: https://opig.stats.ox.ac.uk/webapps/oas/
HIV Molecular Immunology Database
neuinfo.org
rrid.site
+2more
Updated Aug 10, 2003
Cite
(2003). HIV Molecular Immunology Database [Dataset]. http://identifiers.org/RRID:SCR_002893
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002893
Dataset updated
Aug 10, 2003
Description
An annotated, searchable collection of HIV-1 cytotoxic and helper T-cell epitopes and antibody binding sites, plus related tools and information. The goal of this database is to provide a comprehensive listing of defined HIV epitopes. These data are also printed in the HIV Molecular Immunology compendium, which is updated yearly and provided free of charge to scientific researchers, both by online download and as a printed copy. The data included in this database are extracted from the HIV immunology literature. HIV-specific B-cell and T-cell responses are summarized and annotated. Immunological responses are divided into three sections, CTL (CD8+), T helper (CD4+), and antibody. Within these sections, defined epitopes are organized by protein and binding sites within each protein, moving from left to right through the coding regions spanning the HIV genome. We include human responses to natural HIV infections, as well as vaccine studies in a range of animal models and human trials. Responses that are not specifically defined, such as responses to whole proteins or monoclonal antibody responses to discontinuous epitopes, are summarized at the end of each protein sub-section. Studies describing general HIV responses to the virus, but not to any specific protein, are included at the end of each section. The annotation includes information such as cross-reactivity, escape mutations, antibody sequence, TCR usage, functional domains that overlap with an epitope, immune response associations with rates of progression and therapy, and how specific epitopes were experimentally defined. Basic information such as HLA specificities for T-cell epitopes, isotypes of monoclonal antibodies, and epitope sequences are included whenever possible. All studies that we can find that incorporate the use of a specific monoclonal antibody are included in the entry for that antibody. A single T-cell epitope can have multiple entries, generally one entry per study. 
Finally, tables and maps of all defined linear epitopes relative to the HXB2 reference proteins are provided. Alignments of CTL, helper T-cell, and antibody epitopes are available through the search interfaces. Only responses to HIV-1 and HIV-2 are included in the database.
Data from: Human Germline Antibody Gene Segments Encode Polyspecific...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Apr 25, 2013
Cite
Willis, Jordan R.; Crowe Jr, James E.; Meiler, Jens; DeLuca, Samuel L.; Briney, Bryan S. (2013). Human Germline Antibody Gene Segments Encode Polyspecific Antibodies [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001730034
Explore at:
Dataset updated
Apr 25, 2013
Authors
Willis, Jordan R.; Crowe Jr, James E.; Meiler, Jens; DeLuca, Samuel L.; Briney, Bryan S.
Description
Structural flexibility in germline gene-encoded antibodies allows promiscuous binding to diverse antigens. The binding affinity and specificity for a particular epitope typically increase as antibody genes acquire somatic mutations in antigen-stimulated B cells. In this work, we investigated whether germline gene-encoded antibodies are optimal for polyspecificity by determining the basis for recognition of diverse antigens by antibodies encoded by three VH gene segments. Panels of somatically mutated antibodies encoded by a common VH gene, but each binding to a different antigen, were computationally redesigned to predict antibodies that could engage multiple antigens at once. The Rosetta multi-state design process predicted antibody sequences for the entire heavy chain variable region, including framework, CDR1, and CDR2 mutations. The predicted sequences matched the germline gene sequences to a remarkable degree, revealing by computational design the residues that are predicted to enable polyspecificity, i.e., binding of many unrelated antigens with a common sequence. The process thereby reverses antibody maturation in silico. In contrast, when designing antibodies to bind a single antigen, a sequence similar to that of the mature antibody sequence was returned, mimicking natural antibody maturation in silico. We demonstrated that the Rosetta computational design algorithm captures important aspects of antibody/antigen recognition. While the hypervariable region CDR3 often mediates much of the specificity of mature antibodies, we identified key positions in the VH gene encoding CDR1, CDR2, and the immunoglobulin framework that are critical contributors for polyspecificity in germline antibodies. Computational design of antibodies capable of binding multiple antigens may allow the rational design of antibodies that retain polyspecificity for diverse epitope binding.
Detailed data for representative sequences in the MMP-IGHV-targeting and...
plos.figshare.com
xlsx
Updated Jun 10, 2023
Cite
Xinmeng Li; James A. Van Deventer; Soha Hassoun (2023). Detailed data for representative sequences in the MMP-IGHV-targeting and IGHV-reference sets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1007779.s006
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1007779.s006
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Xinmeng Li; James A. Van Deventer; Soha Hassoun
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
(a) sequences for MMP-IGHV-targeting set, (b) extracted features for MMP-IGHV-targeting set, (c) sequences for IGHV-reference set, (d) extracted features for IGHV-reference set, (e) distribution of features, (f) statistical testing and feature selection scores in the MMP-IGHV-targeting and IGHV-reference sets, (g) Jaccard coefficient association scores for features within the MMP-IGHV-targeting set and within IGHV-reference set. (XLSX)
ProteinSpace-TheraHuman-mAbs
huggingface.co
Updated Feb 14, 2026
Cite
Melanie Buechler (2026). ProteinSpace-TheraHuman-mAbs [Dataset]. https://huggingface.co/datasets/melanierb/ProteinSpace-TheraHuman-mAbs
Explore at:
Dataset updated
Feb 14, 2026
Authors
Melanie Buechler
Description
Dataset Description

ProteinSpace-TheraHuman-mAbs is a curated collection of therapeutic antibody sequences derived from the Thera-SAbDab database. The dataset contains 1,400 antibody chain sequences (700 heavy chains and 700 light chains) from 700 therapeutic antibodies, all of which are either genetically human or humanized whole monoclonal antibodies (mAbs). Each sequence has been processed with ANARCI (Antibody Numbering and Receptor ClassIfication) to provide IMGT-numbered… See the full description on the dataset page: https://huggingface.co/datasets/melanierb/ProteinSpace-TheraHuman-mAbs.
Epitome
scicrunch.org
neuinfo.org
+2more
Cite
Epitome [Dataset]. http://identifiers.org/RRID:SCR_007641
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007641
Description
Epitome is a database of structurally inferred antigenic epitopes in proteins. It includes all known antigenic residues and the antibodies that interact with them, including a detailed description of residues involved in the interaction and their sequence/structure environments. Additionally, interactions can be visualized using an interface into Jmol. The website also contains specialized software, NLProt, to enable users to extract protein names and sequences from natural language text, and links to several other databases involved in antibody/antigen interactions. Keywords: antibody/antigen interactions, antigen epitope
OASis peptide database
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Aug 7, 2021
Cite
David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton (2021). OASis peptide database [Dataset]. http://doi.org/10.5281/zenodo.5164685
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.5164685
Dataset updated
Aug 7, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.

Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
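
To show what a 9-mer peptide database is built from, the sketch below splits a sequence into overlapping 9-residue windows and reports what fraction of them appear in a reference collection. The in-memory set stands in for the "peptides" table, whose actual schema is not described here, and the scoring function is only one plausible use, not the BioPhi implementation.

```python
from typing import Iterable, List, Set


def kmers(sequence: str, k: int = 9) -> List[str]:
    """All overlapping k-residue windows of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fraction_in_reference(sequence: str, reference: Iterable[str], k: int = 9) -> float:
    """Fraction of a sequence's k-mers found in a reference peptide set."""
    ref: Set[str] = set(reference)
    windows = kmers(sequence, k)
    if not windows:
        return 0.0  # sequence shorter than k
    return sum(1 for w in windows if w in ref) / len(windows)


if __name__ == "__main__":
    query = "EVQLVESGGGLVQ"  # illustrative fragment
    reference_peptides = kmers("EVQLVESGGGLV")  # stand-in for database content
    print(kmers(query))
    print(fraction_in_reference(query, reference_peptides))
```

With the real database, the membership test would be a lookup against the SQLite table rather than a Python set.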

Links:

BioPhi codebase and documentation: https://github.com/Merck/BioPhi

Public BioPhi server: https://biophi.dichlab.org

OAS Database: http://opig.stats.ox.ac.uk/webapps/oas/
Data from: AbSet: A Standardized Dataset of Antibody Structures for Machine...
resodate.org
Updated May 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diego Almeida; Matheus Almeida; Jean Sampaio; Eduardo Gaieta; Andrielly Costa; Flávio Rabelo; César Calvacante; Geraldo Sartori; João Silva (2025). AbSet: A Standardized Dataset of Antibody Structures for Machine Learning Applications [Dataset]. https://resodate.org/resources/aHR0cHM6Ly96ZW5vZG8ub3JnL3JlY29yZHMvMTQ4ODgwMDI=
Dataset updated
May 18, 2025
Dataset provided by
Zenodo
Authors
Diego Almeida; Matheus Almeida; Jean Sampaio; Eduardo Gaieta; Andrielly Costa; Flávio Rabelo; César Calvacante; Geraldo Sartori; João Silva
Description
AbSet is a dataset of antibodies extracted from the PDB, carefully standardized, and enriched with a subset of in silico-generated antibody-antigen complexes containing poses similar to the bound state, along with a novel set of decoys. In total, AbSet comprises over 800,000 structures, covering antibodies with paired heavy and light chains (VH-VL), heavy chains only (VH), light chains only (VL), and single-chain variable fragments (scFv), both free and in complex with protein antigens. The in silico dataset was generated by molecular docking with HADDOCK, following two distinct approaches:

Blind Docking: Conducted using 2135 experimentally determined antibody-antigen complexes.

Site-Directed Docking: Applied to 1755 complexes, where the antibody sequences were extracted, modeled using ABodyBuilder2, and then docked with their original crystallized antigen.

Each docking run produced 250 poses, which were classified into four quality categories based on DockQ: high quality, medium quality, acceptable quality, and incorrect. The dataset also includes molecular descriptors of amino acid residues. These descriptors were calculated for all standardized antibody structures obtained from the PDB. For in silico structures generated via docking, descriptors were computed for 4 structures selected from the 250 poses generated per system. The code used to calculate the molecular descriptors is available in the GitHub repository. The descriptors include:

Solvent Accessible Surface Area

Relative Accessible Surface Area

Atomic depth

Protrusion index

Hydrophobicity

Sequence

Half-sphere exposure calculations

C coordinates

ϕ and ψ dihedral angles

Secondary structure of the protein

Organization of Available Data:

📂 PDBs_Files (Antibodies extracted from the PDB)
│── 📂 Structures
│── 📂 Descriptors

📂 InSilicoComplexStructures-MonomersFromXtal (Blind Docking)
│── 📂 Structures
│── 📂 Descriptors
│── 📂 Index DockQ

📂 InSilicoComplexStructures-MonomersFromModeling (Site-Directed Docking)
│── 📂 Structures
│── 📂 Descriptors
│── 📂 Index DockQ
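
The four DockQ quality categories mentioned in the description can be illustrated with a short sketch. The cutoffs below (0.23, 0.49, 0.80) are the conventional DockQ thresholds; the description does not state which cutoffs AbSet itself applied, so treat them as an assumption. The function names and the sample scores are hypothetical.

```python
from collections import Counter

# Conventional DockQ cutoffs (assumed; not confirmed for AbSet).
ACCEPTABLE_MIN = 0.23
MEDIUM_MIN = 0.49
HIGH_MIN = 0.80


def dockq_category(score: float) -> str:
    """Map a DockQ score in [0, 1] to one of four quality labels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"DockQ score out of range: {score}")
    if score >= HIGH_MIN:
        return "high"
    if score >= MEDIUM_MIN:
        return "medium"
    if score >= ACCEPTABLE_MIN:
        return "acceptable"
    return "incorrect"


def summarize(scores: list[float]) -> Counter:
    """Count how many poses fall into each quality category."""
    return Counter(dockq_category(s) for s in scores)
```

For a docking run of 250 poses, summarize(scores) would give the number of poses per category, which is one way the "Index DockQ" folders could be tabulated.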
Data from: De novo sequencing of antibody light chain proteoforms from...
ebi.ac.uk
Updated Nov 8, 2021
Cite
Martial Rey (2021). De novo sequencing of antibody light chain proteoforms from patients with multiple myeloma [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD025884
Dataset updated
Nov 8, 2021
Authors
Martial Rey
Variables measured
Proteomics
Description
In multiple myeloma, monoclonal immunoglobulin light chains (LCs) are abundantly produced. In some cases they form deposits that affect organs such as the kidney, while in others they remain soluble in plasma at concentrations of up to several g·L⁻¹. The exact factors that determine light chain solubility are poorly understood, but their amino acid sequence is hypothesized to play an important role. Determining the precise sequences of patient-derived light chains is therefore highly desirable. We establish here a novel de novo sequencing workflow for patient-derived LCs that combines bottom-up and top-down proteomics without database search. PEAKS is used for the de novo sequencing of peptides, which are then assembled into full-length LC sequences using ALPS. Top-down proteomics provides the molecular masses of proteoforms and allows exact determination of the amino acid sequence, including all post-translational modifications. This pipeline is then used for the complete de novo sequencing of LCs extracted from the urine of 10 patients with multiple myeloma. We show that, for the bottom-up part, digestions with trypsin and Nepenthes digestive fluid are sufficient to produce overlapping peptides that yield the best sequence candidates. Top-down proteomics is required to achieve 100% final sequence coverage and to characterize clinical samples containing several LCs. Our work highlights an unexpected range of modifications.

