91 datasets found
  1. Data from: AntiBody Sequence Database

    • bioregistry.io
    Updated Jan 23, 2025
    Cite
    (2025). AntiBody Sequence Database [Dataset]. https://bioregistry.io/registry/absd
    Explore at:
    Dataset updated
    Jan 23, 2025
    Description

    The AntiBody Sequence Database is a public dataset of antibody sequence data. It provides unique identifiers for antibody sequences, including both immunoglobulin and single-chain variable fragment sequences. These identifiers are critical for immunological studies, and the database allows users to search and retrieve antibody sequences based on sequence similarity, specificity, and other biological properties.

  2. Raw data from external antibody databases and scripts to homogenize and...

    • entrepot.recherche.data.gouv.fr
    application/x-gzip +1
    Updated Feb 4, 2025
    Cite
    Nicolas MAILLET; Simon MALESYS (2025). Raw data from external antibody databases and scripts to homogenize and standardize them used to build AntiBody Sequence Database (for reproducibility) [Dataset]. http://doi.org/10.57745/DDLHWU
    Explore at:
    Available download formats: application/x-gzip(620431), application/x-gzip(163643), application/x-gzip(6833391387), text/markdown(12475), application/x-gzip(80726198), application/x-gzip(65497009)
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Nicolas MAILLET; Simon MALESYS
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/3.1/customlicense?persistentId=doi:10.57745/DDLHWU

    Description

    Reproducibility data for the AntiBody Sequence Database (ABSD) article. This dataset contains the raw data (antibody sequences) extracted on June 20, 2024, from various databases, together with the scripts used to process them, to ensure the reproducibility of our results. External databases used: ABDB, AbPDB, CoV-AbDab, Genbank, IMGT, PDB, SACS, SAbDab, TheraSAbDab, UniProt, KABAT. Script usage: each external database has a corresponding script that formats all antibody sequences extracted from it. A final script merges all extracted antibody sequences while removing redundancy and standardizing and cleaning the data.
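    The merge step described above (combine per-source extractions, standardize, clean, remove redundancy) can be illustrated with a minimal sketch. This is not the ABSD scripts themselves: the function names, the cleaning rules (uppercase, strip whitespace and gap characters, keep only the 20 standard residues) and the example records are all assumptions made for illustration.

```python
from collections import defaultdict

# Standard 20 amino acids; anything else is treated as invalid in this sketch.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def standardize(sequence):
    """Uppercase a sequence and drop whitespace, gap and stop characters."""
    return "".join(sequence.split()).upper().replace("-", "").replace("*", "")


def merge_sources(records):
    """Merge (source, sequence) pairs, keeping one entry per unique sequence.

    Returns a dict mapping each cleaned sequence to the sorted list of
    sources it was seen in. Empty or invalid sequences are discarded.
    """
    merged = defaultdict(set)
    for source, raw in records:
        seq = standardize(raw)
        if seq and set(seq) <= VALID_RESIDUES:
            merged[seq].add(source)
    return {seq: sorted(sources) for seq, sources in merged.items()}


# Hypothetical records; the sequences are short placeholders.
records = [
    ("SAbDab", "evqlvesgg"),
    ("PDB", "EVQLV ESGG"),      # same sequence after standardization
    ("IMGT", "DIQMTQSP"),
    ("UniProt", "EVQLXB1"),     # non-standard characters, dropped
]
merged = merge_sources(records)
for seq, sources in merged.items():
    print(seq, sources)
```

    Keeping the list of originating sources per sequence is one possible design choice; it preserves provenance after redundancy is removed.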

  3. Structural Antibody Database

    • dknet.org
    • neuinfo.org
    • +2 more
    Updated Apr 20, 2022
    Cite
    (2022). Structural Antibody Database [Dataset]. http://identifiers.org/RRID:SCR_022096
    Explore at:
    Dataset updated
    Apr 20, 2022
    Description

    Database containing all antibody structures available in the PDB, annotated and presented in a consistent fashion. Each structure is annotated with a number of properties, including experimental details, antibody nomenclature (e.g. heavy-light pairings), curated affinity data and sequence annotations. You can use the database to inspect individual structures, create and download datasets for analysis, search for structures with sequences similar to your query, and monitor the known structural repertoire of antibodies.

  4. Abysis Database

    • scicrunch.org
    • dknet.org
    • +1 more
    Updated May 6, 2025
    Cite
    (2025). Abysis Database [Dataset]. http://identifiers.org/RRID:SCR_000756/resolver?q=&i=rrid
    Explore at:
    Dataset updated
    May 6, 2025
    Description

    A database of antibody structure containing sequences from Kabat, IMGT and the Protein Data Bank (PDB), as well as structure data from the PDB. It supports searching the sequence data on various criteria and displaying results in different formats. For data from the PDB, sequence searches can be combined with structural constraints. For example, one can ask for all the antibodies with a 10-residue Kabat CDR-L1 with a serine at H23 and an arginine within 10 Å of H36. The site also provides software for structure analysis and other information on antibody structure.

  5. Antibody Sequencing Services Market Report | Global Forecast From 2025 To...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Antibody Sequencing Services Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/antibody-sequencing-services-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Antibody Sequencing Services Market Outlook




    The global antibody sequencing services market size was valued at approximately USD 450 million in 2023 and is projected to reach around USD 950 million by 2032, growing at a compound annual growth rate (CAGR) of 8.5% during the forecast period. The primary growth factor driving this market is the increasing demand for therapeutic and diagnostic antibodies, which are crucial in developing targeted therapies for various diseases, including cancer and autoimmune disorders.
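    As a quick consistency check on the figures quoted above, the compound annual growth rate can be recomputed from the two endpoint values. The sketch below assumes nine compounding years (2023 to 2032); the report does not state the exact forecast window, so that span and the helper names are assumptions.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value reached after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


years = 2032 - 2023  # assumed compounding span
implied = cagr(450, 950, years)       # USD million, from the text
projected = project(450, 0.085, years)

print(f"implied CAGR: {implied:.2%}")
print(f"2032 value at 8.5% per year: USD {projected:.0f} million")
```

    Under this assumption the implied rate comes out slightly above the quoted 8.5%, and compounding 8.5% lands slightly below USD 950 million, which is consistent with the report's "approximately" and "around" wording.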




    One of the significant growth factors for the antibody sequencing services market is the rising prevalence of chronic diseases and the subsequent demand for advanced therapeutic options. With an aging population and the global burden of diseases like cancer, autoimmune disorders, and infectious diseases on the rise, there is an increased need for effective treatments. Antibody-based therapies have proven to be highly effective in targeting specific disease markers, leading to their growing adoption. This, in turn, is driving the demand for antibody sequencing services, which are essential for the development and optimization of these therapies.




    Another critical factor contributing to the market's growth is the advancements in sequencing technologies. Over the past decade, there have been significant improvements in sequencing methods, leading to faster, more accurate, and cost-effective sequencing solutions. Techniques such as next-generation sequencing (NGS) and single-cell sequencing have revolutionized the field, allowing for high-throughput and detailed analysis of antibody sequences. These technological advancements have made it easier for researchers and companies to obtain high-quality sequencing data, thereby boosting the adoption of antibody sequencing services.




    Furthermore, the increasing focus on personalized medicine is also fueling the growth of the antibody sequencing services market. Personalized medicine aims to tailor treatments based on an individual's unique genetic makeup, leading to more effective and targeted therapies. Antibody sequencing plays a crucial role in this approach by enabling the identification of specific antibodies that can be used to design personalized treatments. As the healthcare industry continues to shift towards personalized medicine, the demand for antibody sequencing services is expected to grow significantly.



    In addition to sequencing, the Antibody Labeling Service is gaining traction as an essential component in the development of therapeutic and diagnostic antibodies. This service involves the attachment of specific labels to antibodies, which can be used in various applications such as imaging, flow cytometry, and immunoassays. The ability to label antibodies accurately and efficiently enhances their utility in research and clinical settings, allowing for more precise targeting and detection of disease markers. As the demand for personalized medicine and targeted therapies continues to grow, the need for reliable antibody labeling services is expected to increase, complementing the advancements in antibody sequencing technologies.




    From a regional perspective, North America holds the largest share in the antibody sequencing services market, followed by Europe and the Asia Pacific. The dominance of North America can be attributed to the presence of a well-established healthcare infrastructure, significant investments in research and development, and the presence of major pharmaceutical and biotechnology companies. Additionally, the region has a high prevalence of chronic diseases, further driving the demand for advanced therapeutic options. The Asia Pacific region is expected to witness the highest growth during the forecast period, owing to the increasing healthcare expenditure, growing focus on research activities, and the rising prevalence of chronic diseases in countries like China and India.



    Service Type Analysis




    The antibody sequencing services market can be segmented by service type into De Novo Sequencing, Database Sequencing, and Hybrid Sequencing. Among these, De Novo Sequencing accounts for a significant market share due to its capability to provide a complete sequence of antibodies without any prior knowledge of the sequence. This service is particularly crucial for discovering novel antibodies and understanding their structure and f

  6. Serum Antibody Repertoire Profiling Using In Silico Antigen Screen

    • plos.figshare.com
    doc
    Updated Jun 1, 2023
    Cite
    Xinyue Liu; Qiang Hu; Song Liu; Luke J. Tallo; Lisa Sadzewicz; Cassandra A. Schettine; Mikhail Nikiforov; Elena N. Klyushnenkova; Yurij Ionov (2023). Serum Antibody Repertoire Profiling Using In Silico Antigen Screen [Dataset]. http://doi.org/10.1371/journal.pone.0067181
    Explore at:
    Available download formats: doc
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xinyue Liu; Qiang Hu; Song Liu; Luke J. Tallo; Lisa Sadzewicz; Cassandra A. Schettine; Mikhail Nikiforov; Elena N. Klyushnenkova; Yurij Ionov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Serum antibodies are a valuable source of information on the health state of an organism. Profiles of serum antibody reactivity can be generated by high-throughput sequencing of peptide-coding DNA from combinatorial random peptide phage display libraries selected for binding to serum antibodies. Here we demonstrate that the targets of the immune response, which are recognized by serum antibodies directed against sequential epitopes, can be identified using the serum antibody repertoire profiles generated by high-throughput sequencing. We developed an algorithm to filter the results of the protein database BLAST search for selected peptides to distinguish real antigens recognized by serum antibodies from irrelevant proteins retrieved randomly. When we used this algorithm to analyze serum antibodies from mice immunized with a human protein, we were able to identify the protein used for immunization among the top candidate antigens. When we analyzed a human serum sample from a metastatic melanoma patient, the recombinant protein corresponding to the top candidate from the list generated using the algorithm was recognized by antibodies from metastatic melanoma serum on a western blot, thus confirming that the method can identify autoantigens recognized by serum antibodies. We also demonstrated that our unbiased method of looking at the repertoire of serum antibodies reveals quantitative information on the epitope composition of the targets of the immune response. A method for deciphering information contained in serum antibody repertoire profiles may help to identify autoantibodies that can be used for diagnosing and monitoring autoimmune diseases or malignancies.

  7. Therapeutic Structural Antibody Database

    • dknet.org
    • scicrunch.org
    • +1 more
    Updated Oct 16, 2019
    Cite
    (2019). Therapeutic Structural Antibody Database [Dataset]. http://identifiers.org/RRID:SCR_022093
    Explore at:
    Dataset updated
    Oct 16, 2019
    Description

    Tracks all antibody- and nanobody-related therapeutics recognized by the World Health Organisation (WHO), and identifies any corresponding structures in the Structural Antibody Database (SAbDab) with exact or near-exact variable domain sequence matches. Synchronized with SAbDab and updated weekly, reflecting new Protein Data Bank entries and new sequence data published by the WHO.

  8. DataSheet_1_Complete variable domain sequences of monoclonal antibody light...

    • frontiersin.figshare.com
    txt
    Updated Jun 19, 2023
    Cite
    Allison Nau; Yun Shen; Vaishali Sanchorawala; Tatiana Prokaeva; Gareth J. Morgan (2023). DataSheet_1_Complete variable domain sequences of monoclonal antibody light chains identified from untargeted RNA sequencing data.fasta [Dataset]. http://doi.org/10.3389/fimmu.2023.1167235.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Frontiers
    Authors
    Allison Nau; Yun Shen; Vaishali Sanchorawala; Tatiana Prokaeva; Gareth J. Morgan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction. Monoclonal antibody light chain proteins secreted by clonal plasma cells cause tissue damage due to amyloid deposition and other mechanisms. The unique protein sequence associated with each case contributes to the diversity of clinical features observed in patients. Extensive work has characterized many light chains associated with multiple myeloma, light chain amyloidosis and other disorders, which we have collected in the publicly accessible database, AL-Base. However, light chain sequence diversity makes it difficult to determine the contribution of specific amino acid changes to pathology. Sequences of light chains associated with multiple myeloma provide a useful comparison to study mechanisms of light chain aggregation, but relatively few monoclonal sequences have been determined. Therefore, we sought to identify complete light chain sequences from existing high throughput sequencing data.

    Methods. We developed a computational approach using the MiXCR suite of tools to extract complete rearranged IGVL-IGJL sequences from untargeted RNA sequencing data. This method was applied to whole-transcriptome RNA sequencing data from 766 newly diagnosed patients in the Multiple Myeloma Research Foundation CoMMpass study.

    Results. Monoclonal IGVL-IGJL sequences were defined as those where >50% of assigned IGK or IGL reads from each sample mapped to a unique sequence. Clonal light chain sequences were identified in 705/766 samples from the CoMMpass study. Of these, 685 sequences covered the complete IGVL-IGJL region. The identity of the assigned sequences is consistent with their associated clinical data and with partial sequences previously determined from the same cohort of samples. Sequences have been deposited in AL-Base.

    Discussion. Our method allows routine identification of clonal antibody sequences from RNA sequencing data collected for gene expression studies. The sequences identified represent, to our knowledge, the largest collection of multiple myeloma-associated light chains reported to date. This work substantially increases the number of monoclonal light chains known to be associated with non-amyloid plasma cell disorders and will facilitate studies of light chain pathology.
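    The >50% criterion in the Results can be sketched as a small function over per-sample read counts. This is an illustration of the stated threshold only, not the authors' MiXCR-based pipeline; the function name, the input layout (a dict of sequence label to assigned read count) and the example counts are assumptions.

```python
def call_monoclonal(read_counts, threshold=0.5):
    """Return the dominant light chain sequence of a sample, or None.

    `read_counts` maps each distinct IGK/IGL sequence to the number of
    reads assigned to it. A sequence is called monoclonal only if it
    accounts for strictly more than `threshold` of all assigned reads.
    """
    total = sum(read_counts.values())
    if total == 0:
        return None
    top_seq, top_reads = max(read_counts.items(), key=lambda kv: kv[1])
    return top_seq if top_reads / total > threshold else None


# Hypothetical samples with made-up sequence labels and read counts.
clonal_sample = {"IGK_seqA": 9200, "IGK_seqB": 500, "IGL_seqC": 300}
polyclonal_sample = {"IGK_seqA": 400, "IGK_seqB": 350, "IGL_seqC": 250}

print(call_monoclonal(clonal_sample))
print(call_monoclonal(polyclonal_sample))
```

    The strict inequality mirrors the ">50%" wording, so a sample split exactly in half between two sequences is not called monoclonal.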

  9. Data from: Improving antibody language models with native pairing

    • zenodo.org
    application/gzip, zip
    Updated Jun 12, 2025
    Cite
    Sarah Burbach; Bryan Briney (2025). Improving antibody language models with native pairing [Dataset]. http://doi.org/10.5281/zenodo.12745725
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sarah Burbach; Bryan Briney
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.

    Results. Using a unique and recently reported dataset of approximately 1.6 x 10^6 natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.

    Files. The following files are included in this repository:

    • BALM-paired.tar.gz: Model weights for the BALM-paired model.
    • BALM-shuffled.tar.gz: Model weights for the BALM-shuffled model.
    • BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
    • ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
    • jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset were annotated with abstar and subsequently filtered to remove duplicates or unproductive sequences. The annotated sequences are provided in an AIRR-compliant format.
    • test-dataset_annotated.tar.gz: Two csv files, both with sequences annotated in an AIRR-compliant format. lc-coherence_test-unique_annotated.csv contains all sequences from the test dataset and fig3-20kembeddings_annotated.csv contains the 20k sequences from the test used for the Figure 2 UMAP embeddings. For both datasets, the sequences can be paired together based on their pair_id.
    • train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
    • train-test-eval_shuffled.tar.gz: Datasets used to train, test, and evaluate the BALM-shuffled model. Compressed folder containing three csv files, with two columns for the heavy and light chains.
    • train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.
    • classification-datasets.tar.gz: Three classification datasets used to train classification models in Figure 5. The datasets are: flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences were obtained from Wang et al., and healthy donor antibody sequences were obtained from Hurtado et al.

    Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub. An archived version of the GitHub repository (from the time of manuscript publication) is included here as code-archive.zip.
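    The file list notes that sequences in the annotated CSV files can be paired by their pair_id. A minimal sketch of that grouping is shown below, using only the standard library. Only the pair_id column is named in the description; the locus and sequence_aa column names follow common AIRR conventions and are assumptions here, as are the example rows.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows: only `pair_id` is named in the dataset description;
# `locus` and `sequence_aa` are assumed AIRR-style column names.
EXAMPLE_CSV = """pair_id,locus,sequence_aa
p1,IGH,EVQLVESGG
p1,IGK,DIQMTQSP
p2,IGL,QSVLTQPP
p2,IGH,QVQLQESGP
p3,IGH,EVQLLESGG
"""


def pair_chains(handle):
    """Group rows by pair_id and return {pair_id: (heavy, light)}.

    Rows with locus IGH are treated as heavy chains; any other locus
    (IGK or IGL) is treated as a light chain. Incomplete pairs are dropped.
    """
    by_pair = defaultdict(dict)
    for row in csv.DictReader(handle):
        chain = "heavy" if row["locus"] == "IGH" else "light"
        by_pair[row["pair_id"]][chain] = row["sequence_aa"]
    return {
        pair_id: (chains["heavy"], chains["light"])
        for pair_id, chains in by_pair.items()
        if "heavy" in chains and "light" in chains
    }


pairs = pair_chains(io.StringIO(EXAMPLE_CSV))
for pair_id, (heavy, light) in pairs.items():
    print(pair_id, heavy, light)
```

    To use this on a real file, replace the io.StringIO handle with an opened CSV file and adjust the column names to match the file's header.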

  10. Databases of human SARS-CoV-2 antibody peptides for bottom-up proteomics

    • zenodo.org
    bin
    Updated Feb 6, 2024
    Cite
    Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle (2024). Databases of human SARS-CoV-2 antibody peptides for bottom-up proteomics [Dataset]. http://doi.org/10.5281/zenodo.10566370
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xuan-Tung Trinh; Rebecca Freitag; Konrad Krawczyk; Veit Schwämmle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bottom-up proteomics approaches rely on database searches that compare experimental values of peptides to theoretical values derived from protein sequences in a database. While the human body can produce millions of distinct antibodies, current databases for human antibodies such as UniProtKB are limited to only 1095 sequences (as of January 2024). This limitation may hinder the identification of new antibodies using bottom-up proteomics. Extending the databases is therefore an important step toward discovering new antibodies.

    Herein, we adopted an extensive collection of antibody sequences from the Observed Antibody Space to conduct efficient database searches in publicly available proteomics data, with a focus on SARS-CoV-2. Thirty million heavy-chain antibody sequences from 146 SARS-CoV-2 patients in the Observed Antibody Space were digested in silico to obtain 18 million unique peptides. These peptides were then used to create six databases (DB1-DB6) for bottom-up proteomics. We used those databases to search for antibody peptides in publicly available SARS-CoV-2 human plasma samples in the Proteomics Identification Database (PRIDE), and we consistently found new antibody peptides in those samples. The database searches were performed with the FragPipe software.
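    The in silico digestion step can be illustrated with a minimal sketch. The description does not name the enzyme or its settings, so everything here is an assumption: trypsin-like cleavage (after K or R, but not before P), no missed cleavages, and a minimum peptide length of 6. The example sequences are placeholders.

```python
def digest_trypsin(sequence, min_length=6):
    """Cleave after K or R unless the next residue is P.

    Returns the peptides in order, without missed cleavages, keeping
    only peptides of at least `min_length` residues.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        next_is_proline = not at_end and sequence[i + 1] == "P"
        if at_end or (residue in "KR" and not next_is_proline):
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_length:
                peptides.append(peptide)
            start = i + 1
    return peptides


def unique_peptides(sequences, min_length=6):
    """Digest every sequence and collect the distinct peptides."""
    seen = set()
    for seq in sequences:
        seen.update(digest_trypsin(seq, min_length))
    return seen


# Placeholder sequences; the first peptide is shared between them.
sequences = ["GGGGGKAAAAAARPLLLK", "GGGGGKSSSSSSK"]
peptides = unique_peptides(sequences)
print(sorted(peptides))
```

    Collecting peptides in a set is what reduces many overlapping antibody sequences to a smaller pool of unique peptides, since shared regions produce identical peptides.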

    Table 1. Database information. In addition to human SARS-CoV-2 antibody peptides, every database also contains human protein sequences from the UniProt database and contaminants from the cRAP database.

    File        Database   Number of human SARS-CoV-2 antibody peptides   Number of covered antibodies
    DB1.fasta   DB1        100                                            1.28E7
    DB2.fasta   DB2        1E3                                            1.93E7
    DB3.fasta   DB3        1E4                                            2.40E7
    DB4.fasta   DB4        1E5                                            2.66E7
    DB5.fasta   DB5        1E6                                            2.83E7
    DB6.fasta   DB6        1E7                                            3.01E7

  11. AB-SR (AntiBody Sequence Reconstructor) software: datasets for complete...

    • entrepot.recherche.data.gouv.fr
    text/markdown, xz
    Updated Jul 28, 2023
    Cite
    Nicolas MAILLET; Bertrand SAUNIER (2023). AB-SR (AntiBody Sequence Reconstructor) software: datasets for complete benchmarking [Dataset]. http://doi.org/10.57745/4RNESM
    Explore at:
    Available download formats: xz(151168720), xz(123771256), xz(167698664), text/markdown(13509), xz(826456268), xz(177197812)
    Dataset updated
    Jul 28, 2023
    Dataset provided by
    Recherche Data Gouv
    Authors
    Nicolas MAILLET; Bertrand SAUNIER
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/4RNESM

    Description

    Files, folders, tabular data and some raw data used in the publication: AB-SR reconstructs polyclonal antibody Fv domains after bottom-up proteomic de-novo sequencing (N. Maillet & B. Saunier). The AB-SR software reconstructs the sequences of most pairs of heavy and light chain variable regions from (in silico) pools containing up to 500 immunoglobulins in just a few minutes. For each figure, the data before and after running the AB-SR software are available (see README.md for detailed explanations). The data presented here are used to benchmark AB-SR. More precisely, each experiment consists of IgGs from public databases that are digested in silico using the RPG software. The resulting peptides are then fed to AB-SR, which reconstructs most of the initial IgGs.

  12. oas_paired_human_sars_cov_2

    • huggingface.co
    Updated Aug 3, 2023
    Cite
    Brian Loyal (2023). oas_paired_human_sars_cov_2 [Dataset]. https://huggingface.co/datasets/bloyal/oas_paired_human_sars_cov_2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2023
    Authors
    Brian Loyal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paired SARS-COV-2 heavy/light chain sequences from the Observed Antibody Space database

    Human paired heavy/light chain amino acid sequences from the Observed Antibody Space (OAS) database obtained from SARS-COV-2 studies. https://opig.stats.ox.ac.uk/webapps/oas/ Please include the following citation in your work: Olsen, TH, Boyles, F, Deane, CM. Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science.… See the full description on the dataset page: https://huggingface.co/datasets/bloyal/oas_paired_human_sars_cov_2.

  13. Table_1_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated...

    • frontiersin.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Yanfang Zhang; Tianjian Chen; Huikun Zeng; Xiujia Yang; Qingxian Xu; Yanxia Zhang; Yuan Chen; Minhui Wang; Yan Zhu; Chunhong Lan; Qilong Wang; Haipei Tang; Yan Zhang; Chengrui Wang; Wenxi Xie; Cuiyu Ma; Junjie Guan; Shixin Guo; Sen Chen; Wei Yang; Lai Wei; Jian Ren; Xueqing Yu; Zhenhai Zhang (2023). Table_1_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated Antibody Database.xlsx [Dataset]. http://doi.org/10.3389/fimmu.2021.717496.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Yanfang Zhang; Tianjian Chen; Huikun Zeng; Xiujia Yang; Qingxian Xu; Yanxia Zhang; Yuan Chen; Minhui Wang; Yan Zhu; Chunhong Lan; Qilong Wang; Haipei Tang; Yan Zhang; Chengrui Wang; Wenxi Xie; Cuiyu Ma; Junjie Guan; Shixin Guo; Sen Chen; Wei Yang; Lai Wei; Jian Ren; Xueqing Yu; Zhenhai Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The antibody repertoire is a critical component of the adaptive immune system and is believed to reflect an individual’s immune history and current immune status. Delineating the antibody repertoire has advanced our understanding of humoral immunity, facilitated antibody discovery, and shown great potential for improving the diagnosis and treatment of disease. However, no tool to date has effectively integrated big Rep-seq data and prior knowledge of functional antibodies to elucidate the remarkably diverse antibody repertoire. We developed a Rep-seq dataset Analysis Platform with an Integrated antibody Database (RAPID; https://rapid.zzhlab.org/), a free, web-based tool that allows researchers to process and analyse Rep-seq datasets. RAPID consolidates 521 WHO-recognized therapeutic antibodies, 88,059 antigen- or disease-specific antibodies, and 306 million clones extracted from 2,449 human IGH Rep-seq datasets generated from individuals with 29 different health conditions. RAPID also integrates a standardized Rep-seq dataset analysis pipeline that lets users upload and analyse their own datasets. In the process, users can also select a set of existing repertoires for comparison. RAPID automatically annotates clones based on the integrated therapeutic and known antibodies, and users can easily query antibodies or repertoires by sequence or optional keywords. With its powerful analysis functions and rich set of antibody and antibody repertoire information, RAPID will benefit researchers in adaptive immune studies.

  14. Scaling laws in antibody language models reveal data-constrained optima

    • zenodo.org
    Updated May 17, 2025
    Cite
    Mahdi Shafiei Neyestanak; Bryan Briney (2025). Scaling laws in antibody language models reveal data-constrained optima [Dataset]. http://doi.org/10.5281/zenodo.15447079
    Explore at:
    Dataset updated
    May 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahdi Shafiei Neyestanak; Bryan Briney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.

    Results: This study pretrained ESM-2 architecture models across five distinct parameterizations (8 million to 650 million weights) and three training data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid identity prediction, antibody specificity classification, and native heavy-light chain pairing recognition. Findings reveal that increasing model size does not monotonically improve performance; for instance, with the full dataset, loss began to increase beyond ~163M parameters. The 350M parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.

    Conclusion: These results underscore that in data-constrained domains like antibody sequences, strategically balancing model capacity with dataset size is crucial, as simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization.

    Files. The following files are included in this repository:

    • model_weights.zip: Model weights for all pre-trained AbLMs in the study. The models can also be downloaded from HuggingFace.
    • train-eval-test.zip: The datasets used for training all models, with sequences obtained from Jaffe et al. and Hurtado et al., are provided in a compressed folder. This folder contains three subfolders—Full_data, Half_data, and Quarter_data—each containing the training data used for the models. Specifically, the Full_data subfolder is further organized into training, eval, and test subdirectories, which respectively contain the train_dataset.csv, validation_dataset.csv, and test_dataset.csv files.
    • HD_vs_COV.csv.zip: The paired antibody sequences that were used for the antibody specificity binary classification task. The Coronavirus (CoV) antibody sequences included were sourced from the CoV-AbDab database.
    • hd-0_CoV-1_flu-2.csv.zip: Paired antibody sequences used for the 3-way antibody specificity classification task, distinguishing between Healthy Donor (HD), Coronavirus (CoV), and Influenza (Flu) specific antibodies. The influenza-specific antibody sequences included in this dataset were sourced from Wang et al.
    • shuffled_data.csv.zip: Contains the dataset used for the native vs. shuffled paired antibody sequence classification task. This dataset is derived from the test_dataset.csv.
    • per_position_inference.zip: The dataset utilized for per-residue prediction by the full-data models, including both unmutated and mutated antibody sequences.
    • test_datasets.zip: A compressed folder that contains twelve distinct test sets that were not used during model training. These datasets were used specifically for evaluating pretrained models and generating cross-entropy loss curves. The data originates from both in-house laboratory sources and a study conducted by Ng et al.
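
    The native vs. shuffled pairing task listed above needs negative examples in which a heavy chain is paired with a light chain from a different antibody. The description does not say how shuffled_data.csv was built, so the following is only a minimal sketch of one plausible approach: a random cyclic reassignment that guarantees no heavy chain keeps its own light chain. The function name and the toy sequence labels are hypothetical.

```python
import random


def make_shuffled_pairs(pairs, seed=0):
    """Return (heavy, light) pairs where each heavy chain gets another pair's light chain.

    `pairs` is a list of (heavy, light) tuples. A random cyclic shift over the
    indices guarantees that no index keeps its native partner. Note: if two
    different antibodies share an identical light chain sequence, a "shuffled"
    pair can still coincide with a native one by sequence.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to shuffle")
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)
    shuffled = [None] * len(pairs)
    for k, idx in enumerate(order):
        donor = order[(k + 1) % len(order)]  # next index in the random cycle
        shuffled[idx] = (pairs[idx][0], pairs[donor][1])
    return shuffled


# Toy placeholders, not real antibody sequences.
native = [("HEAVY_A", "LIGHT_A"), ("HEAVY_B", "LIGHT_B"),
          ("HEAVY_C", "LIGHT_C"), ("HEAVY_D", "LIGHT_D")]
shuffled = make_shuffled_pairs(native, seed=42)
```

    A classifier would then be trained with the native pairs labelled as positives and the shuffled pairs as negatives.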

    Code: The code for model training and evaluation is available under the MIT license on GitHub.

  15. Data from: Kabat Database of Sequences of Proteins of Immunological Interest...

    • dknet.org
    • neuinfo.org
    • +1 more
    Updated Sep 29, 2024
    Cite
    (2024). Kabat Database of Sequences of Proteins of Immunological Interest [Dataset]. http://identifiers.org/RRID:SCR_006465/resolver/mentions
    Dataset updated
    Sep 29, 2024
    Description

    The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest.

    The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats.

    The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from humans, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains. CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
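
    The description mentions a per-position variability measure without defining it. A minimal sketch follows, assuming the commonly cited Wu-Kabat definition: the number of distinct amino acids at a position divided by the frequency of the most common one. The function names and the toy alignment are illustrative only and are not taken from the Kabat tools.

```python
from collections import Counter


def kabat_variability(column):
    """Wu-Kabat variability for one alignment column.

    variability = (number of distinct residues) / (frequency of the most common residue)
    Gap characters ('-' and '.') are ignored; an all-gap column returns 0.0.
    """
    residues = [aa for aa in column if aa not in ("-", ".")]
    if not residues:
        return 0.0
    counts = Counter(residues)
    most_common_freq = counts.most_common(1)[0][1] / len(residues)
    return len(counts) / most_common_freq


def variability_profile(aligned_sequences):
    """Compute variability for every column of equal-length aligned sequences."""
    return [kabat_variability(col) for col in zip(*aligned_sequences)]


# Toy alignment (made-up residues, not real antibody data).
toy_alignment = ["ACD", "ACE", "ACF", "AGD"]
profile = variability_profile(toy_alignment)
```

    Under this definition an invariant position scores 1, and positions inside the CDR ranges quoted above would be expected to appear as peaks in the profile.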

  16. Antibody dataset Kd

    • zenodo.org
    csv, text/x-python
    Updated Aug 11, 2024
    Cite
    Akbar Rahmad (2024). Antibody dataset Kd [Dataset]. http://doi.org/10.5281/zenodo.13120765
    Available download formats: csv, text/x-python
    Dataset updated
    Aug 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Akbar Rahmad
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    A dataset of ~500 antibodies with binding affinity data. Each record gives the antibody sequence, the antigen sequence, and the Kd. The data were obtained from SAbDab via Therapeutic Data Commons.

    The record includes the Python code (get_antibody_affinity_data.py) and the dataset (antibody_affinity_protein_sabdab.csv).
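
    Kd values span many orders of magnitude, so affinity data of this kind is often converted to a log scale before modelling. The sketch below shows the standard pKd = -log10(Kd) conversion. It assumes Kd is expressed in molar units, which the description does not state, and the function name is hypothetical rather than taken from get_antibody_affinity_data.py.

```python
import math


def kd_to_pkd(kd_molar):
    """Convert a dissociation constant in molar units to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


# Example: a 1 nM binder (1e-9 M) corresponds to a pKd of about 9.
pkd_example = kd_to_pkd(1e-9)
```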

  17. Data from: A scalable model for simulating multi-round antibody evolution...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Nov 30, 2023
    Cite
    Chao Zhang; Andrey V. Bzikadze; Yana Safonova; Siavash Mirarab (2023). A scalable model for simulating multi-round antibody evolution and benchmarking of clonal tree reconstruction methods [Dataset]. http://doi.org/10.6076/D1530H
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Chao Zhang; Andrey V. Bzikadze; Yana Safonova; Siavash Mirarab
    Time period covered
    Jan 1, 2023
    Description

    Affinity maturation (AM) of B cells through somatic hypermutations (SHMs) enables the immune system to evolve to recognize diverse pathogens. The accumulation of SHMs leads to the formation of clonal lineages of antibody-secreting B cells that have evolved from a common naïve B cell. Advances in high-throughput sequencing have enabled deep scans of B cell receptor repertoires, paving the way for reconstructing clonal trees. However, it is not clear if clonal trees, which capture microevolutionary time scales, can be reconstructed using traditional phylogenetic reconstruction methods with adequate accuracy. In fact, several clonal tree reconstruction methods have been developed to fix supposed shortcomings of phylogenetic methods. Nevertheless, no consensus has been reached regarding the relative accuracy of these methods, partially because evaluation is challenging. Benchmarking the performance of existing methods and developing better methods would both benefit from realistic models of...

    The data was created using the simulation method DimSIM. As described in the paper, the analyses include two sets of simulations, one based on real target antibodies (SARS-CoV2) and the other based on flu. SARS-CoV2 simulations had 3-5 rounds with 50 replicates. For targets, we first selected all heavy chain sequences of human antibodies with IGHV1-58 and IGHJ3 from the Coronavirus Antibody Database that neutralize some variants of SARS-CoV2 and have 16 amino acids in their CDR3. Per upload date, we chose the antibody that neutralizes the most variants of SARS-CoV2, resulting in 14 sequences. We then randomly chose targets among them. The infection start date was set to be the upload date. Each round of infections was set to last 50 days. At the end of simulations, we sampled ς = 50, 100, 200, 500 antibody-coding nucleotide sequences from the last round of infection and built the clonal tree. For flu simulations, we performed several simulations with r = 56 rounds of flu, using sequence...

  18. DataSheet_1_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated...

    • frontiersin.figshare.com
    docx
    Updated Jun 6, 2023
    Cite
    Yanfang Zhang; Tianjian Chen; Huikun Zeng; Xiujia Yang; Qingxian Xu; Yanxia Zhang; Yuan Chen; Minhui Wang; Yan Zhu; Chunhong Lan; Qilong Wang; Haipei Tang; Yan Zhang; Chengrui Wang; Wenxi Xie; Cuiyu Ma; Junjie Guan; Shixin Guo; Sen Chen; Wei Yang; Lai Wei; Jian Ren; Xueqing Yu; Zhenhai Zhang (2023). DataSheet_1_RAPID: A Rep-Seq Dataset Analysis Platform With an Integrated Antibody Database.docx [Dataset]. http://doi.org/10.3389/fimmu.2021.717496.s001
    Available download formats: docx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Yanfang Zhang; Tianjian Chen; Huikun Zeng; Xiujia Yang; Qingxian Xu; Yanxia Zhang; Yuan Chen; Minhui Wang; Yan Zhu; Chunhong Lan; Qilong Wang; Haipei Tang; Yan Zhang; Chengrui Wang; Wenxi Xie; Cuiyu Ma; Junjie Guan; Shixin Guo; Sen Chen; Wei Yang; Lai Wei; Jian Ren; Xueqing Yu; Zhenhai Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The antibody repertoire is a critical component of the adaptive immune system and is believed to reflect an individual’s immune history and current immune status. Delineating the antibody repertoire has advanced our understanding of humoral immunity, facilitated antibody discovery, and shown great potential for improving the diagnosis and treatment of disease. However, no tool to date has effectively integrated big Rep-seq data and prior knowledge of functional antibodies to elucidate the remarkably diverse antibody repertoire. We developed a Rep-seq dataset Analysis Platform with an Integrated antibody Database (RAPID; https://rapid.zzhlab.org/), a free and web-based tool that allows researchers to process and analyse Rep-seq datasets. RAPID consolidates 521 WHO-recognized therapeutic antibodies, 88,059 antigen- or disease-specific antibodies, and 306 million clones extracted from 2,449 human IGH Rep-seq datasets generated from individuals with 29 different health conditions. RAPID also integrates a standardized Rep-seq dataset analysis pipeline to enable users to upload and analyse their datasets. In the process, users can also select a set of existing repertoires for comparison. RAPID automatically annotates clones based on integrated therapeutic and known antibodies, and users can easily query antibodies or repertoires based on sequence or optional keywords. With its powerful analysis functions and rich set of antibody and antibody repertoire information, RAPID will benefit researchers in adaptive immune studies.

  19. Antibody Sequencing Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 4, 2024
    Cite
    Dataintelo (2024). Antibody Sequencing Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/antibody-sequencing-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Antibody Sequencing Market Outlook

    The global antibody sequencing market size was valued at approximately USD 360 million in 2023 and is projected to reach about USD 880 million by 2032, growing at a compound annual growth rate (CAGR) of 10.2% over the forecast period. The tremendous growth in this market can be attributed to advancements in sequencing technologies and the increasing need for precise and high-throughput antibody characterization in various applications such as therapeutics, diagnostics, and research.
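
    The growth rate quoted above can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes nine compounding years (2023 to 2032) and uses the rounded USD figures from the text; the function name is illustrative. With these inputs the implied rate comes out at roughly 10.4%, close to the stated 10.2%, and the small gap is plausibly due to rounding of the endpoint values or a different base year.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Rounded figures from the report text (USD millions); nine years is an assumption.
implied_rate = cagr(360, 880, 2032 - 2023)
print(f"Implied CAGR: {implied_rate:.1%}")
```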

    One of the primary growth factors driving the antibody sequencing market is the rapid advancement of sequencing technologies. Next-generation sequencing (NGS) and other high-throughput sequencing methods have revolutionized the way antibodies are characterized and studied. These technologies provide precise, comprehensive, and rapid sequencing data, which is crucial for the development of new therapeutic antibodies, vaccines, and diagnostic tools. As these technologies become more affordable and accessible, their adoption across various sectors is expected to proliferate, further propelling market growth.

    Another significant driver is the increasing prevalence of chronic diseases and the growing demand for personalized medicine. Antibody-based therapies have shown tremendous potential in treating a wide range of diseases, including cancer, autoimmune disorders, and infectious diseases. The need for tailored therapeutic solutions has spurred extensive research and development activities in the field of antibody sequencing. This increased focus on personalized medicine is expected to contribute significantly to the expansion of the market over the forecast period.

    Additionally, the rising investment in biopharmaceutical research and development by both public and private entities is fueling market growth. Governments and healthcare organizations across the globe are recognizing the importance of antibody-based therapies and are investing heavily in research initiatives. Furthermore, collaborations between academic institutions, research organizations, and pharmaceutical companies are fostering innovation and accelerating the development of novel antibody-based solutions. These collaborative efforts are expected to drive market growth by expanding the scope and scale of antibody sequencing applications.

    Regionally, North America dominates the antibody sequencing market due to its well-established healthcare infrastructure, significant investment in research and development, and the presence of major market players. The Asia Pacific region is expected to witness the fastest growth during the forecast period, driven by increasing healthcare expenditure, growing awareness about personalized medicine, and the expansion of biotechnology industries in countries like China and India. Europe also holds a substantial market share, supported by robust research activities and favorable government policies promoting biotechnology.

    Technology Analysis

    The technology segment of the antibody sequencing market is primarily categorized into Next-Generation Sequencing (NGS), Sanger Sequencing, and Mass Spectrometry. Each of these technologies offers distinct advantages and finds unique applications in the field of antibody sequencing, contributing to the overall growth of the market.

    Next-Generation Sequencing (NGS) has emerged as a revolutionary technology in the field of genomics and proteomics, including antibody sequencing. NGS allows for the parallel sequencing of millions of DNA molecules, providing high-throughput, accurate, and comprehensive data. The ability of NGS to sequence entire antibody repertoires quickly and cost-effectively has made it a preferred choice for many researchers and biopharmaceutical companies. The rapid advancements in NGS platforms, coupled with decreasing costs, are expected to drive the adoption of this technology, further fueling market growth.

    Sanger Sequencing, also known as the chain termination method, was one of the first methods used for sequencing DNA. Despite the advent of newer technologies, Sanger Sequencing remains a gold standard for its accuracy and reliability in sequencing short DNA fragments. It is widely used for validating NGS results and for applications requiring high precision. The enduring relevance of Sanger Sequencing in confirmatory testing and its integration with other advanced technologies continue to support its presence in the antibody sequencing market.

    Mass Spectrometry is another critical technology used in antibody sequ

  20. Data from: Antibody production relies on the tRNA inosine wobble...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Dec 30, 2023
    Cite
    Sophie Giguère; Xuesong Wang; Sabrina Huber; Liling Xu; John Warner; Stephanie Weldon; Jennifer Hu; Quynh Anh Phan; Katie Tumang; Thavaleak Prum; Duanduan Ma; Kathrin Kirsch; Usha Nair; Peter Dedon; Facundo Batista (2023). Antibody production relies on the tRNA inosine wobble modification to meet biased codon demand [Dataset]. http://doi.org/10.5061/dryad.f4qrfj723
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Sophie Giguère; Xuesong Wang; Sabrina Huber; Liling Xu; John Warner; Stephanie Weldon; Jennifer Hu; Quynh Anh Phan; Katie Tumang; Thavaleak Prum; Duanduan Ma; Kathrin Kirsch; Usha Nair; Peter Dedon; Facundo Batista
    Time period covered
    Jan 1, 2023
    Description

    Antibodies are produced at high rates to provide immunoprotection, which puts pressure on the B cell translational machinery. Here, we identified a pattern of codon usage conserved across antibody genes. One feature thereof is the hyperutilization of codons which lack genome-encoded Watson–Crick tRNAs, instead relying on the post-transcriptional tRNA modification inosine (I34), which expands the decoding capacity of specific tRNAs through wobbling. Antibody-secreting cells had increased I34 levels and were more reliant on I34 for protein production than naive B cells. Furthermore, antibody I34-dependent codon usage may influence B cell passage through regulatory checkpoints. Our work elucidates the interface between the tRNA pool and protein production in the immune system and has implications for the design and selection of antibodies for vaccines and therapeutics.

    tRNA sequences from J558, MPC11, WEHI231, and Bcl clone 5b1b. See Methods and Materials, sections on "RNA extraction" and "tRNA sequencing". The raw, unpaired sequence FASTQ files are provided.

    # Antibody production relies on the tRNA inosine wobble modification to meet biased codon demand

    tRNA sequencing of murine B cell lines

    Description of the data and file structure

    tRNA sequencing (2 independent sequencing runs)

    Raw sequencing files from small RNA extracts processed using the AQRNAseq pipeline. RNA extracted from two PC-like murine cell lines (J558, MPC11) and two NBC-like murine cell lines (WEHI231, Bcl clone 5b1b)

    Samples are numbered as follows:

    run_date  cell     type  file_tag
    22-04-05  bcl5b1b  nbc   220405Bat_D22-4079
    22-04-05  bcl5b1b  nbc   220405Bat_D22-4080
    22-04-05  j558     pc    220405Bat_D22-4081
    22-04-05  j558     pc    220405Bat_D22-4082
    22-04-05  j558     pc    220405Bat_D22-4083
    22-04-05  j558     pc    220405Bat_D22-4084
    22-04-05  j558     pc    220405Bat_D22-4085
    22-04-05  j558     pc    220405Bat_D22-4...