5 datasets found

OAS
huggingface.co
Updated Jan 22, 2026
Cite
Oxford Protein Informatics Group (2026). OAS [Dataset]. https://huggingface.co/datasets/opig/OAS
Dataset updated
Jan 22, 2026
Dataset authored and provided by
Oxford Protein Informatics Group
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
OPIG Observed Antibody Dataset (OAS)

Filtering

If you only need a subset of this dataset, you can load a filtered subset like this, where streaming is a boolean you define beforehand: ds = load_dataset('OPIG/OAS', streaming=streaming, filters=[("meta_Species", "==", "human")])
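
The filtered load described above can be wrapped in a small helper. This is a minimal sketch, not the dataset authors' code: the function names `build_species_filter` and `load_oas_subset` are hypothetical, and it assumes the Hugging Face `datasets` package is installed and that the installed version forwards the `filters` argument to the dataset builder as the description indicates.

```python
def build_species_filter(species):
    """Build the filter list in the form shown in the dataset description."""
    return [("meta_Species", "==", species)]


def load_oas_subset(species="human", streaming=True):
    """Load a species-filtered subset of OPIG/OAS (hypothetical helper).

    The import is deferred so the module can be imported without `datasets`.
    """
    from datasets import load_dataset  # assumption: `pip install datasets`

    return load_dataset(
        "OPIG/OAS",
        streaming=streaming,
        filters=build_species_filter(species),
    )
```

Calling `load_oas_subset("human")` with streaming enabled should avoid downloading the full dataset before iteration begins.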
OASis peptide database
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Aug 7, 2021
Cite
David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton (2021). OASis peptide database [Dataset]. http://doi.org/10.5281/zenodo.5164685
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.5164685
Dataset updated
Aug 7, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A. Bitton
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.

Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
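
A gzipped SQLite file like the one described can be unpacked and inspected with the Python standard library alone. The sketch below is illustrative: the helper names and any file paths are hypothetical, and it makes no assumption about column names, reading them from the database instead.

```python
import gzip
import shutil
import sqlite3


def decompress_sqlite(gz_path, out_path):
    """Unpack a gzipped SQLite file to out_path and return out_path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def describe_tables(db_path):
    """Return {table_name: [column names]} for every table in the database."""
    conn = sqlite3.connect(str(db_path))
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()
```

Running `describe_tables` on the unpacked file should list the "peptides" and "subjects" tables along with whatever columns they actually contain.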

Links:

BioPhi codebase and documentation: https://github.com/Merck/BioPhi

Public BioPhi server: https://biophi.dichlab.org

OAS Database: http://opig.stats.ox.ac.uk/webapps/oas/
p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for...
explore.openaire.eu
Updated Oct 2, 2024
Cite
Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane (2024). p-IgGen Dataset: Cleaned paired and unpaired antibody sequence data for machine learning applications. [Dataset]. http://doi.org/10.5281/zenodo.13880873
Unique identifier
https://doi.org/10.5281/zenodo.13880873
Dataset updated
Oct 2, 2024
Authors
Oliver Turnbull; Dino Oglic; Rebecca Croasdale-Wood; Charlotte Deane
Description
This data is released alongside "p-IgGen: A Paired Antibody Generative Language Model", which contains full details on the data processing and cleaning. p-IgGen Paper: https://www.biorxiv.org/content/10.1101/2024.08.06.606780v1 . OAS: https://opig.stats.ox.ac.uk/webapps/oas/
LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the...
zenodo.org
zip
Updated Aug 11, 2025
Cite
Henriette Capel; Alexander Greenshields-Watson; Charlotte Deane (2025). LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs (Dataset and Model weights) [Dataset]. http://doi.org/10.5281/zenodo.15917096
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15917096
Dataset updated
Aug 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Henriette Capel; Alexander Greenshields-Watson; Charlotte Deane
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data is released alongside "LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs", which contains full details on the model and the data processing and cleaning.

Data: Cleaned paired human antibody sequence data for machine learning applications.
Model: Model weights for LICHEN.

Preprint: https://doi.org/10.1101/2025.08.06.668938
GitHub: https://github.com/oxpig/LICHEN
WebApp: https://opig.stats.ox.ac.uk/webapps/lichen/
OAS: https://opig.stats.ox.ac.uk/webapps/oas/
Data from: Data mining antibody sequences for database searching in...
zenodo.org
bin, csv
Updated Sep 17, 2024
Cite
Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle (2024). Data mining antibody sequences for database searching in bottom-up proteomics [Dataset]. http://doi.org/10.5281/zenodo.11045596
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.11045596
Dataset updated
Sep 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically using the protease trypsin, which are then analyzed via mass spectrometry. These peptides are matched to sequences in standard databases like UniProt or NCBI-RefSeq for identification.
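
The digestion step can be illustrated in silico. The sketch below applies the commonly used trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages; it is a simplified illustration, not the workflow used to build this dataset, and the function name and `min_length` parameter are hypothetical.

```python
def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence into tryptic peptides.

    Cleaves after K or R, except when the following residue is P.
    Missed cleavages are not modelled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Whatever remains after the final cleavage site is the last peptide.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


print(tryptic_digest("MKRPAGK"))  # ['MK', 'RPAGK']
```

The R in the example is not cleaved because it is followed by P, so only two peptides result.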

However, a major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. As a result, relying on such limited databases can lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.

Recently, through next-generation sequencing of antibody gene repertoires, it has become possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These large numbers of sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases.

We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub. The database and metadata files generated by this workflow are stored in this Zenodo dataset, and they are used in DAT-DB — a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).

Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. The "Sequence" column contains tryptic peptides. "Filename" is the file where the data was collected. "Patient" refers to the patient number as listed in metadata2.csv. "BSource" refers to the B-cells' source, and "BType" refers to the type of B-cells. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" indicates the number of patients having this peptide, and "N_antibody" specifies the number of antibodies containing this peptide. "Length_aa" indicates the number of amino acids in the peptide, while "CDR3" shows whether the peptide is found in the CDR3 region.

The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
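
To show how the listed columns might be used, the sketch below turns rows with those column names into FASTA text after filtering on N_patient and Length_aa. It is a hypothetical helper, not DAT-DB's implementation: the header layout, the default thresholds, and the use of plain dictionaries as rows are all assumptions. In practice the rows would come from querying a .duckdb file, a step omitted here because the table names are not given.

```python
def peptides_to_fasta(rows, min_patients=1, min_length=7):
    """Format peptide rows as FASTA text.

    Each row is a dict using the column names from the dataset description.
    Rows below the patient-count or length thresholds are skipped.
    """
    records = []
    for row in rows:
        if row["N_patient"] < min_patients or row["Length_aa"] < min_length:
            continue
        # Assumed header layout: peptide, isotype, and patient count.
        header = (
            f'>{row["Sequence"]}|isotype={row["Isotype"]}'
            f'|n_patient={row["N_patient"]}'
        )
        records.append(f'{header}\n{row["Sequence"]}')
    return "\n".join(records) + ("\n" if records else "")
```

Raising `min_patients` keeps only peptides shared across several patients, which is one way to narrow a search database.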