5 datasets found
  1. OAS

    • huggingface.co
    Updated Jan 22, 2026
    Cite
    Oxford Protein Informatics Group (2026). OAS [Dataset]. https://huggingface.co/datasets/opig/OAS
    Dataset authored and provided by
    Oxford Protein Informatics Group
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OPIG Observed Antibody Dataset (OAS)

    Filtering

    If you know you only want a subset of this dataset, you can load a filtered subset like this: ds = load_dataset('OPIG/OAS', streaming=streaming, filters=[("meta_Species", "==", "human")]).
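
The filtered load described above can be wrapped in a small helper. This is a minimal sketch, not the dataset authors' code: the function names `species_filter` and `load_oas_subset` are made up for illustration, and it assumes the Hugging Face `datasets` package is installed and that the `filters` argument behaves as the description states.

```python
def species_filter(species):
    """Build the filter list in the (column, operator, value) form shown in the description."""
    return [("meta_Species", "==", species)]


def load_oas_subset(species="human", streaming=True):
    """Load a filtered view of OPIG/OAS (hypothetical helper; needs `datasets` and network access)."""
    # Imported lazily so species_filter can be used without the package installed.
    from datasets import load_dataset

    return load_dataset("OPIG/OAS", streaming=streaming, filters=species_filter(species))
```

Calling `load_oas_subset("human")` should then return only the rows whose `meta_Species` column equals `"human"`. Streaming avoids downloading the full dataset before the filter is applied.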

  2. OASis peptide database

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Aug 7, 2021
    Cite
    David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton (2021). OASis peptide database [Dataset]. http://doi.org/10.5281/zenodo.5164685
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.
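
A "9-mer" here is a window of nine consecutive residues in a sequence. The sketch below shows the sliding-window idea only. It is not the OASis pipeline, whose exact preprocessing is not described on this page, and the function name and example string are illustrative.

```python
def kmers(sequence, k=9):
    """Return every overlapping window of k residues, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Arbitrary 12-residue example string, not taken from the database.
example = "EVQLVESGGGLV"
windows = kmers(example)
```

A sequence of length n yields n - 8 windows when k is 9, so the 12-residue example produces four peptides.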

    Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
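
One way to inspect the attached file is to decompress it and list its tables with Python's standard library. This is a minimal sketch: the file paths are placeholders, and the column layout is discovered at run time because only the table names "peptides" and "subjects" are stated here.

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def decompress_sqlite(gz_path, out_path):
    """Decompress a gzipped SQLite file to out_path and return it as a Path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(out_path)


def describe_tables(db_path):
    """Return {table name: [column names]} for every table in the database."""
    conn = sqlite3.connect(str(db_path))
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()
```

Running `describe_tables(decompress_sqlite("downloaded.db.gz", "oasis.db"))` (both paths hypothetical) should list the two tables and their columns before any queries are written against them.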

    Links:

  3. p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for...

    • explore.openaire.eu
    Updated Oct 2, 2024
    Cite
    Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane (2024). p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for machine learning applications. [Dataset]. http://doi.org/10.5281/zenodo.13880873
    Authors
    Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane
    Description

    This data is released alongside "p-IgGen: A Paired Antibody Generative Language Model", which contains full details on the data processing and cleaning.

    p-IgGen Paper: https://www.biorxiv.org/content/10.1101/2024.08.06.606780v1
    OAS: https://opig.stats.ox.ac.uk/webapps/oas/

  4. LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the...

    • zenodo.org
    zip
    Updated Aug 11, 2025
    Cite
    Henriette Capel; Alexander Greenshields-Watson; Charlotte Deane (2025). LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs (Dataset and Model weights) [Dataset]. http://doi.org/10.5281/zenodo.15917096
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Henriette Capel; Alexander Greenshields-Watson; Charlotte Deane
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data is released alongside "LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs", which contains full details on the model and the data processing and cleaning.

    Data: Cleaned paired human antibody sequence data for machine learning applications.
    Model: Model weights for LICHEN.

    Preprint: https://doi.org/10.1101/2025.08.06.668938
    GitHub: https://github.com/oxpig/LICHEN
    WebApp: https://opig.stats.ox.ac.uk/webapps/lichen/
    OAS: https://opig.stats.ox.ac.uk/webapps/oas/

  5. Data from: Data mining antibody sequences for database searching in...

    • zenodo.org
    bin, csv
    Updated Sep 17, 2024
    Cite
    Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle (2024). Data mining antibody sequences for database searching in bottom-up proteomics [Dataset]. http://doi.org/10.5281/zenodo.11045596
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically using the protease trypsin, which are then analyzed via mass spectrometry. These peptides are matched to sequences in standard databases like UniProt or NCBI-RefSeq for identification.
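
The digestion step can be illustrated with an in-silico version. This is a minimal sketch that assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) and allows no missed cleavages. The workflow's actual settings are not given on this page, and the function name is illustrative.

```python
import re


def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence after K or R, except when the next residue is P."""
    # Zero-width split points: preceded by K or R, not followed by P (Python 3.7+).
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    # A cleavage site at the very end leaves an empty string, dropped by the length filter.
    return [p for p in peptides if len(p) >= min_length]


fragments = tryptic_digest("AKRPGR")
```

For the toy input above, the K at position 2 is cleaved, while the R that precedes P is left intact, giving the peptides `AK` and `RPGR`.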

    However, a major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. As a result, relying on such limited databases can lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.

    Recently, through next-generation sequencing of antibody gene repertoires, it has become possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These large numbers of sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases.

    We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub. The database and metadata files generated by this workflow are stored in this Zenodo dataset, and they are used in DAT-DB — a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).

    Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. The "Sequence" column contains tryptic peptides. "Filename" is the file where the data was collected. "Patient" refers to the patient number as listed in metadata2.csv. "BSource" refers to the B-cells' source, and "BType" refers to the type of B-cells. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" indicates the number of patients having this peptide, and "N_antibody" specifies the number of antibodies containing this peptide. "Length_aa" indicates the number of amino acids in the peptide, while "CDR3" shows whether the peptide is found in the CDR3 region.
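
The columns listed above can be queried directly once a file is downloaded. This is a minimal sketch that assumes the third-party `duckdb` Python package. The table name is left as a parameter because it is not stated here, and the function names, default isotype, and length threshold are illustrative choices.

```python
def build_peptide_query(table, limit=10):
    """Return SQL selecting the most widely shared peptides for one isotype."""
    # Table names cannot be bound as parameters, so quote the identifier instead.
    quoted = '"' + table.replace('"', '""') + '"'
    return (
        f"SELECT Sequence, N_patient, N_antibody FROM {quoted} "
        "WHERE Isotype = ? AND Length_aa >= ? "
        f"ORDER BY N_patient DESC LIMIT {int(limit)}"
    )


def top_peptides(db_path, table, isotype="IgG", min_length=7, limit=10):
    """Run the query against a .duckdb file (hypothetical helper; needs `duckdb`)."""
    # Imported lazily so build_peptide_query works without the package installed.
    import duckdb

    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(build_peptide_query(table, limit), [isotype, min_length]).fetchall()
    finally:
        con.close()
```

Running `SHOW TABLES` on the opened connection is one way to find the actual table names before calling `top_peptides`.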

    The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
