Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OPIG Observed Antibody Dataset (OAS)
Filtering
If you know you only want a subset of this dataset, you can load a filtered subset like this: ds = load_dataset('OPIG/OAS', streaming=streaming, filters=[("meta_Species", "==", "human")]).
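The filter above is a list of (column, operator, value) tuples. As a minimal sketch of what such a filter selects, the snippet below applies the same tuple format to a few in-memory rows. It assumes that the tuples in a flat list are combined with AND, as in the pyarrow-style convention; the rows, the `id` column, and the helper `row_matches` are hypothetical, and only the `meta_Species` column name comes from the dataset card.

```python
import operator

# Comparison operators handled by this sketch (an assumed subset).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
}


def row_matches(row, filters):
    """Return True if the row satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[column], value) for column, op, value in filters)


# Hypothetical rows standing in for dataset records.
rows = [
    {"meta_Species": "human", "id": 1},
    {"meta_Species": "mouse", "id": 2},
    {"meta_Species": "human", "id": 3},
]

# Same filter as in the load_dataset call above.
human_rows = [r for r in rows if row_matches(r, [("meta_Species", "==", "human")])]
```

Only the rows whose `meta_Species` value equals "human" remain in `human_rows`. This illustrates the selection logic only and does not reproduce how the library applies the filter internally.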
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.
Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
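A minimal sketch of how such a file could be opened with the Python standard library: decompress the archive, then list the tables and their row counts. The file paths are placeholders, and because the column layout of "peptides" and "subjects" is not described here, the sketch only inspects table names and sizes.

```python
import gzip
import shutil
import sqlite3


def decompress_sqlite(gz_path, db_path):
    """Decompress a gzipped SQLite file to db_path."""
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from sqlite_master, so quoting them is sufficient here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `decompress_sqlite` on the downloaded archive and then `table_row_counts` on the result should list the two tables. Counting rows in a database of this size may take a while.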
This data is released alongside "p-IgGen: A Paired Antibody Generative Language Model", which contains full details on the data processing and cleaning.
p-IgGen Paper: https://www.biorxiv.org/content/10.1101/2024.08.06.606780v1
OAS: https://opig.stats.ox.ac.uk/webapps/oas/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data is released alongside "LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs", which contains full details on the model and the data processing and cleaning.
Data: Cleaned paired human antibody sequence data for machine learning applications.
Model: Model weights for LICHEN.
Preprint: https://doi.org/10.1101/2025.08.06.668938
GitHub: https://github.com/oxpig/LICHEN
WebApp: https://opig.stats.ox.ac.uk/webapps/lichen/
OAS: https://opig.stats.ox.ac.uk/webapps/oas/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically using the protease trypsin, which are then analyzed via mass spectrometry. These peptides are matched to sequences in standard databases like UniProt or NCBI-RefSeq for identification.
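The digestion step can be illustrated in silico. The sketch below uses the commonly cited trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages. It is a simplified illustration and not necessarily the rule used by the workflow described here; the function name and the `min_length` parameter are hypothetical.

```python
import re


def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence after K or R, except when followed by P.

    Missed cleavages are not modelled in this sketch.
    """
    # Zero-width split point: preceded by K/R and not followed by P.
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    # Drop the empty string produced when the sequence ends in K or R,
    # and any peptide shorter than min_length.
    return [p for p in peptides if len(p) >= max(min_length, 1)]


# Toy sequence: the K before P is not a cleavage site.
peptides = tryptic_digest("AAKPGRCDEKF")
```

For the toy sequence, `peptides` holds three fragments: "AAKPGR", "CDEK", and "F". In practice a minimum length is usually applied because very short peptides are rarely informative.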
However, a major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. As a result, relying on such limited databases can lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.
Recently, through next-generation sequencing of antibody gene repertoires, it has become possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These large numbers of sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases.
We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub. The database and metadata files generated by this workflow are stored in this Zenodo dataset, and they are used in DAT-DB — a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).
Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. "Sequence" contains the tryptic peptides. "Filename" is the file from which the data was collected. "Patient" is the patient number as listed in metadata2.csv. "BSource" is the source of the B-cells, and "BType" is the B-cell type. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" is the number of patients in whom the peptide was found, and "N_antibody" is the number of antibodies containing the peptide. "Length_aa" is the number of amino acids in the peptide, and "CDR3" indicates whether the peptide is found in the CDR3 region.
The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
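As a rough sketch of how rows with these columns could be turned into a FASTA file for a proteomics search engine, the function below takes rows as plain dictionaries keyed by the column names listed above. How the rows are read from the .duckdb files is left out, and the header format, the thresholds, and the function name are assumptions for illustration rather than what DAT-DB actually produces.

```python
def peptides_to_fasta(rows, min_patients=1, min_length=7):
    """Build FASTA text from rows keyed by the documented column names.

    Keeps each peptide once, and only if it was seen in at least
    min_patients patients and has at least min_length residues.
    """
    lines = []
    seen = set()
    for row in rows:
        seq = row["Sequence"]
        if seq in seen:
            continue
        if row["N_patient"] < min_patients or row["Length_aa"] < min_length:
            continue
        seen.add(seq)
        # Hypothetical header layout: running index plus two metadata fields.
        lines.append(
            f">pep{len(seen)}|isotype={row['Isotype']}|n_patient={row['N_patient']}"
        )
        lines.append(seq)
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical rows; the peptide sequences are made up for the example.
example_rows = [
    {"Sequence": "DTAVYYCAR", "Isotype": "IgG", "N_patient": 3, "Length_aa": 9},
    {"Sequence": "GLEWVSK", "Isotype": "IgM", "N_patient": 1, "Length_aa": 7},
    {"Sequence": "DTAVYYCAR", "Isotype": "IgA", "N_patient": 3, "Length_aa": 9},
]
fasta_text = peptides_to_fasta(example_rows, min_patients=2)
```

With `min_patients=2`, only the first peptide passes the filter, and its duplicate in the third row is skipped.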