Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paired SARS-COV-2 heavy/light chain sequences from the Observed Antibody Space database
Human paired heavy/light chain amino acid sequences from the Observed Antibody Space (OAS) database obtained from SARS-COV-2 studies. https://opig.stats.ox.ac.uk/webapps/oas/ Please include the following citation in your work: Olsen, TH, Boyles, F, Deane, CM. Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science.… See the full description on the dataset page: https://huggingface.co/datasets/bloyal/oas_paired_human_sars_cov_2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OASis human 9-mer peptide database, generated from 118 million human antibody sequences from the Observed Antibody Space database.
Attached is a gzipped SQLite database containing two tables: "peptides" and "subjects".
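A minimal sketch of how the attachment could be inspected with the Python standard library. It only assumes that the file is a gzip-compressed SQLite database; the function name and the temporary file name are illustrative, and the column layout of the "peptides" and "subjects" tables is deliberately read from the file rather than assumed.

```python
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def describe_tables(gz_path):
    """Return {table name: [column names]} for a gzipped SQLite database."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "unpacked.db"  # illustrative temporary name
        # sqlite3 cannot open gzip data directly, so decompress to disk first.
        with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        conn = sqlite3.connect(str(db_path))
        try:
            tables = [row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )]
            schema = {}
            for table in tables:
                # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
                info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
                schema[table] = [column[1] for column in info]
            return schema
        finally:
            conn.close()
```

For a database of this size, decompressing once to a permanent location and querying it there is likely more practical than unpacking it on every call.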
Links:
BioPhi codebase and documentation: https://github.com/Merck/BioPhi
Public BioPhi server: https://biophi.dichlab.org
OAS Database: http://opig.stats.ox.ac.uk/webapps/oas/
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used ABodyBuilder2 (https://doi.org/10.1038/s42003-023-04927-7) to model ~1.5M paired antibody structures from paired antibody sequences in Observed Antibody Space (https://opig.stats.ox.ac.uk/webapps/oas/oas_paired/). We have saved the structures in folders and subfolders that correspond to the OAS files they came from. Parent folders are named according to study. Within each parent folder are subfolders named according to the files (named by SRA ID) containing the sequences. Each structure is then named with the parent file followed by the row number from this file.
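The folder layout described above (study folder, then a subfolder per source file, then one structure per row) can be indexed with a short script. This is only a sketch: the `.pdb` extension and the function name are assumptions, since the description does not state the file format or the exact separator used in structure names.

```python
from collections import defaultdict
from pathlib import Path


def index_structures(root, suffix=".pdb"):
    """Group structure file names by (study folder, source-file subfolder).

    Assumes the two-level layout root/<study>/<source file>/<structure>;
    the default suffix is a guess and may need to be changed.
    """
    index = defaultdict(list)
    for path in sorted(Path(root).glob(f"*/*/*{suffix}")):
        study = path.parent.parent.name
        source_file = path.parent.name
        index[(study, source_file)].append(path.name)
    return dict(index)
```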
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bottom-up proteomics approaches rely on database searches that compare experimental values of peptides to theoretical values derived from protein sequences in a database. While the human body can produce millions of distinct antibodies, current databases for human antibodies such as UniProtKB are limited to only 1095 sequences (as of January 2024). This limitation may hinder the identification of new antibodies using bottom-up proteomics. Therefore, extending the databases is an important task for discovering new antibodies.
Herein, we adopted an extensive collection of antibody sequences from Observed Antibody Space to conduct efficient database searches in publicly available proteomics data, with a focus on SARS-CoV-2. Thirty million heavy-chain antibody sequences from 146 SARS-CoV-2 patients in the Observed Antibody Space were digested in silico to obtain 18 million unique peptides. These peptides were then used to create six databases (DB1-DB6) for bottom-up proteomics. We used those databases to search for antibody peptides in publicly available SARS-CoV-2 human plasma samples in the Proteomics Identification Database (PRIDE), and we consistently found new antibody peptides in those samples. The database searches were performed with the FragPipe software.
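To illustrate what an in silico digestion step might look like, here is a minimal sketch. It assumes a trypsin-like rule (cleave after K or R unless the next residue is P), no missed cleavages, and a minimum peptide length of 7; the description does not state which enzyme or settings were used, so all of these are assumptions, as are the function names.

```python
def tryptic_peptides(sequence):
    """Yield fragments of `sequence` cut after K/R, except when followed by P."""
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            yield sequence[start:i + 1]
            start = i + 1
    if start < len(sequence):
        # Trailing fragment that does not end in K or R.
        yield sequence[start:]


def unique_peptides(sequences, min_len=7):
    """Collect the set of distinct peptides of at least `min_len` residues."""
    peptides = set()
    for sequence in sequences:
        for peptide in tryptic_peptides(sequence):
            if len(peptide) >= min_len:
                peptides.add(peptide)
    return peptides
```

Because antibody sequences share framework regions, many sequences yield identical peptides, which is why the number of unique peptides can be smaller than the number of input sequences.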
Table 1. Database information. In addition to human SARS-CoV-2 antibody peptides, every database also contains human protein sequences from the UniProt database and contaminants from the cRAP database.
File | Database | Number of human SARS-CoV-2 antibody peptides | Number of covered antibodies |
DB1.fasta | DB1 | 100 | 1.28E7 |
DB2.fasta | DB2 | 1E3 | 1.93E7 |
DB3.fasta | DB3 | 1E4 | 2.40E7 |
DB4.fasta | DB4 | 1E5 | 2.66E7 |
DB5.fasta | DB5 | 1E6 | 2.83E7 |
DB6.fasta | DB6 | 1E7 | 3.01E7 |
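The figures in Table 1 can be turned into relative coverage with a few lines of arithmetic. In this sketch the DB6 count (3.01E7) is used as the reference total, which is an assumption made for illustration; the table itself does not state a denominator.

```python
# Values copied from Table 1.
peptide_counts = {"DB1": 100, "DB2": 1_000, "DB3": 10_000,
                  "DB4": 100_000, "DB5": 1_000_000, "DB6": 10_000_000}
covered_antibodies = {"DB1": 1.28e7, "DB2": 1.93e7, "DB3": 2.40e7,
                      "DB4": 2.66e7, "DB5": 2.83e7, "DB6": 3.01e7}

# Assumption: treat the DB6 figure as the full set of covered antibodies.
reference_total = covered_antibodies["DB6"]

coverage_fraction = {name: count / reference_total
                     for name, count in covered_antibodies.items()}

for name, fraction in coverage_fraction.items():
    print(f"{name}: {peptide_counts[name]:>10,} peptides, "
          f"{fraction:.1%} of the DB6 antibody count")
```

Read this way, the table shows diminishing returns: each tenfold increase in the number of peptides adds a smaller share of covered antibodies.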
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
### Training and test data for humanness evaluation
This data was collected in conjunction with and used for
training and testing for Parkinson / Wang et al 2024. The
data is organized as follows:
- Heavy chain training and multispecies test data (under the heavy chain folder)
  - The consolidated cAb rep file contains training human sequences
  - The test sample sequences folder contains fasta files with test sequences for each species
- Light chain training and multispecies test data (under the light chain folder)
  - The consolidated cAb rep file contains training human sequences
  - The test sample sequences folder contains fasta files with test sequences for each species
- Abybank data (under the abybank compiled data folder)
  - This folder contains separate folders for heavy and light chain
  - Each subfolder contains test data for a more diverse species set under fasta files for each species
- Humanization test data (under the humanization test data folder)
  - The sequences in the parental.fa file were originally humanized as part of drug discovery programs
  - The experimental.fa file contains the humanization results
- IMGT and ADA data (under the imgt test data folder)
  - The imgt mab db fa and tsv files contain sequences and species assignments for IMGT mAb DB
  - The thera ada fa file contains sequences evaluated in the clinic
  - The Therapeutic ADA txt file contains anti drug antibody results for those antibodies
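Most of the files listed above are in fasta format. As a minimal sketch, they can be read with a small generator like the one below; the function name is illustrative, and the exact file names and header conventions inside each folder are not assumed here.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                # Sequences may be wrapped over several lines.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```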
The data was retrieved from the following sources.
1. All heavy and light chain training data is from the cAb-Rep database from [Guo et al.](https://pubmed.ncbi.nlm.nih.gov/31649674/)
2. All testing data is from the Observed Antibody Space [(OAS) database](https://opig.stats.ox.ac.uk/webapps/oas/)
The training and test data shown here have already been filtered for quality. The testing data was additionally randomly sampled to yield a set of 50,000 sequences for each species, then filtered to remove duplicates. The human test data was checked to ensure no overlap with the human training set.
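The sampling, deduplication, and overlap check described above can be sketched as follows. This is an illustration only, not the script used to build the data: the function names, the fixed seed, and the use of exact string matching to define duplicates and overlap are all assumptions.

```python
import random


def sample_and_deduplicate(sequences, n=50_000, seed=0):
    """Randomly sample up to `n` sequences, then drop exact duplicates."""
    pool = list(sequences)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(pool, min(n, len(pool)))
    # dict.fromkeys keeps the first occurrence of each sequence, in order.
    return list(dict.fromkeys(sample))


def overlap_with_training(test_sequences, training_sequences):
    """Return the test sequences that also occur in the training set."""
    return set(test_sequences) & set(training_sequences)
```

Because duplicates are removed after sampling, the final per-species set can contain fewer than 50,000 sequences. An empty result from `overlap_with_training` corresponds to the no-overlap condition stated for the human test data.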
The IMGT, ADA and humanization test data was retrieved from Prihoda et al. and
the associated [GitHub repo](https://github.com/Merck/BioPhi-2021-publication).
See Parkinson et al. 2024 and the associated GitHub repos for more details on how models other than
SAM / AntPack were evaluated on this data.