Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unlike conventional T cells, which react to major histocompatibility complex (MHC) class I and II molecules, unconventional T cell subpopulations recognize various non-polymorphic antigen-presenting molecules and are typically characterized by simplified T cell receptor (TCR) patterns, rapid effector responses, and ‘public’ antigen specificities. Dissecting how unconventional TCRs recognize non-MHC antigens can further our understanding of unconventional T cell immunity. However, the released unconventional TCR sequence sets are small and irregular, and fall short of the quality needed to support systematic analysis of the unconventional TCR repertoire. Here we present UcTCRdb, a database that contains 669,900 unconventional TCRs collected from 34 corresponding studies in humans, mice, and cattle. In UcTCRdb, users can interactively browse TCR features of different unconventional T cell subsets in different species, and search and download sequences under different conditions. Additionally, basic and advanced online TCR analysis tools have been integrated into the database, which will facilitate the study of unconventional TCR patterns for users with different backgrounds. UcTCRdb is freely available at http://uctcrdb.cn/.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The file contains human T cell receptor (TCR) sequences obtained by multiplex PCR amplification of cDNA molecules followed by Illumina sequencing. Sequences were aligned to the human genome using MIGEC software (see doi: 10.1038/nmeth.2960 for details). Except for the header row, each row contains information about a unique TCR nucleotide sequence. Column 1 stores the TCR chain (a, alpha; b, beta). Column 2 stores the T cell subset. Column 3 is an identifier for the thymus sample of origin. Columns 4 and 5 store the nucleotide sequence and amino acid sequence, respectively, of the complementarity-determining region 3 (CDR3). Columns 6 and 7 store the TCR variable (v) and joining (j) gene segment information.
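The column layout described above can be read positionally. The following is a minimal sketch only: it assumes the file is tab-delimited, and the sample rows and dictionary keys are hypothetical illustrations rather than values or names taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: the real delimiter and header names may differ.
SAMPLE = (
    "chain\tsubset\tsample\tcdr3nt\tcdr3aa\tv\tj\n"
    "b\tCD4\tT1\tTGTGCCAGCAGCTTAGGG\tCASSLG\tTRBV5-1\tTRBJ2-7\n"
    "a\tCD8\tT1\tTGTGCTGTGAGAGAC\tCAVRD\tTRAV1-2\tTRAJ33\n"
)


def read_tcr_rows(handle, delimiter="\t"):
    """Yield one dict per TCR row, mapping the seven described columns by position."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row
    for row in reader:
        yield {
            "chain": row[0],      # column 1: a (alpha) or b (beta)
            "subset": row[1],     # column 2: T cell subset
            "sample": row[2],     # column 3: thymus sample identifier
            "cdr3_nt": row[3],    # column 4: CDR3 nucleotide sequence
            "cdr3_aa": row[4],    # column 5: CDR3 amino acid sequence
            "v": row[5],          # column 6: variable gene segment
            "j": row[6],          # column 7: joining gene segment
        }


rows = list(read_tcr_rows(io.StringIO(SAMPLE)))
chain_counts = Counter(row["chain"] for row in rows)
```

Replacing `io.StringIO(SAMPLE)` with an open file handle applies the same reader to the downloaded file, provided the delimiter assumption holds.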
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker in the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.
HuggingFace provides a "one-stop shop" to train and deploy AI models. In this case, we use Facebook's open-source Evolutionary Scale Model (ESM-2). Its embeddings turn each protein sequence into a vector of numbers that the computer can use in a mathematical model.
To load the data files into Python, use the pandas library:
import pandas as pd
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It is the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
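One way to use the two vectors is to concatenate them into a single feature vector per pair and fit a binary classifier on the label. The sketch below is illustrative only: the helper names are hypothetical, the toy vectors stand in for real embeddings (which are far longer), and the plain-Python logistic regression is a stand-in for whatever model the repository actually trains.

```python
import math


def pair_features(tcr_vector, epitope_vector):
    """Concatenate the TCR and epitope embeddings into one feature vector."""
    return list(tcr_vector) + list(epitope_vector)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return the predicted binding probability for one feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic(features, labels, learning_rate=0.5, epochs=500):
    """Fit logistic regression with per-sample gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy stand-ins for (tcr_vector, epitope_vector, label) rows.
toy_rows = [
    ([1.0, 0.0], [1.0, 0.0], 1),
    ([0.9, 0.1], [0.8, 0.2], 1),
    ([0.0, 1.0], [1.0, 0.0], 0),
    ([0.1, 0.9], [0.9, 0.1], 0),
]
X = [pair_features(tcr, epitope) for tcr, epitope, _ in toy_rows]
y = [label for _, _, label in toy_rows]
weights, bias = train_logistic(X, y)
```

With the real data, the same concatenation would be applied to each row's tcr_vector and epitope_vector before handing the result to a library classifier.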
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction) the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
References:
Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
Dines, Jennifer N., et al. “The immunerace study: A prospective multicohort study of immune response action to covid-19 events with the immunecode™ open access database.” medRxiv (2020).
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
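The filtering, downsampling, and negative-sampling steps in the dataset description can be sketched as follows. This is an illustration of the general idea under stated assumptions, not the authors' code: the function name, the toy identifiers, and the choice of one shuffled negative per retained positive are all hypothetical.

```python
import random
from collections import defaultdict


def build_pairs(positives, min_tcrs=15, max_tcrs=400, seed=0):
    """Return (tcr, epitope, label) triples from known binding pairs.

    Epitopes with fewer than min_tcrs TCRs are dropped, the rest are
    downsampled to at most max_tcrs, and each retained TCR is paired with
    an epitope it is not known to bind to form a negative example.
    """
    positives = list(positives)
    rng = random.Random(seed)
    by_epitope = defaultdict(set)
    for tcr, epitope in positives:
        by_epitope[epitope].add(tcr)

    kept = []
    for epitope in sorted(by_epitope):
        tcrs = sorted(by_epitope[epitope])
        if len(tcrs) < min_tcrs:
            continue  # too few TCRs for this epitope
        if len(tcrs) > max_tcrs:
            tcrs = rng.sample(tcrs, max_tcrs)
        kept.extend((tcr, epitope) for tcr in tcrs)

    known = set(positives)
    epitopes = sorted({epitope for _, epitope in kept})
    negatives = []
    for tcr, _ in kept:
        candidates = [e for e in epitopes if (tcr, e) not in known]
        if candidates:
            negatives.append((tcr, rng.choice(candidates)))

    labeled = [(t, e, 1) for t, e in kept] + [(t, e, 0) for t, e in negatives]
    rng.shuffle(labeled)
    return labeled


# Toy input: two well-covered epitopes and one rare epitope.
toy_positives = (
    [(f"TCR_A{i}", "EPI_A") for i in range(20)]
    + [(f"TCR_B{i}", "EPI_B") for i in range(20)]
    + [(f"TCR_C{i}", "EPI_C") for i in range(5)]
)
labeled = build_pairs(toy_positives, min_tcrs=15, max_tcrs=10)
```

A TCR that binds several epitopes could yield repeated negatives in this sketch; deduplicating the negative list would be a reasonable refinement.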
The Facebook ESM-2 model has the MIT license and was published in:
HuggingFace has several versions of the trained model.
Checkpoint name        Number of layers   Number of parameters
esm2_t48_15B_UR50D     48                 15B
esm2_t36_3B_UR50D      36                 3B
esm2_t33_650M_UR50D    33                 650M
esm2_t30_150M_UR50D    30                 150M
esm2_t12_35M_UR50D     12                 35M
esm2_t6_8M_UR50D       6                  8M
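The checkpoint names in the table encode the layer count and parameter count, so both can be recovered from the name alone. The helper below is a hypothetical convenience for comparing checkpoints, based only on the naming pattern visible in the table.

```python
import re

# Pattern inferred from the names listed in the table above.
CHECKPOINT_PATTERN = re.compile(r"^esm2_t(?P<layers>\d+)_(?P<params>\d+[MB])_UR50D$")


def describe_checkpoint(name):
    """Return the layer count and parameter count encoded in a checkpoint name."""
    match = CHECKPOINT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    size = match["params"]
    scale = {"M": 1_000_000, "B": 1_000_000_000}[size[-1]]
    return {"layers": int(match["layers"]), "parameters": int(size[:-1]) * scale}


checkpoints = [
    "esm2_t48_15B_UR50D",
    "esm2_t36_3B_UR50D",
    "esm2_t33_650M_UR50D",
    "esm2_t30_150M_UR50D",
    "esm2_t12_35M_UR50D",
    "esm2_t6_8M_UR50D",
]
smallest_first = sorted(checkpoints, key=lambda n: describe_checkpoint(n)["parameters"])
```

Sorting by parameter count is a simple way to start with the smallest model and scale up only if needed.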
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of pooled T-cell receptor (TCR) sequences for TCR alpha and beta chains of human and mouse.
Sequences are obtained from various samples of healthy individuals/mice using our conventional protocols: see for example [Britanova et al "Dynamics of individual T cell repertoires: from cord blood to centenarians" The Journal of Immunology 2016] and [Izraelson et al. "Comparative analysis of murine T‐cell receptor repertoires." Immunology 2018].
The sequences are stored as gzipped clonotype tables in VDJtools format, see [https://vdjtools-doc.readthedocs.io/en/master/input.html#vdjtools-format].
This control dataset can be used as a proxy for a generative VDJ rearrangement model to estimate the expected frequency distribution of TCRs and check for enrichment of rare TCR clonotypes and groups of similar TCR sequences. For the implementation of the enrichment analysis, please see CalcDegreeStats routine from VDJtools software, see [https://vdjtools-doc.readthedocs.io/en/master/annotate.html#calcdegreestats].
Files named "human.tra.strict.txt.gz", etc. are pools of random/naive TCR clonotypes containing the unique V/J/CDR3 nucleotide sequence combinations observed in the data. The pools.zip file is used for TCR motif inference in the VDJdb database [https://github.com/antigenomics/vdjdb-motifs]; it contains files such as human.tra.aa.txt that hold random/naive TCR clonotypes grouped by CDR3 amino acid sequence, each with its most frequent representative V and J.
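A gzipped clonotype table can be streamed with the standard library to estimate background CDR3 frequencies. This is a minimal sketch: it assumes a tab-delimited table with a header row, and the column names `count` and `cdr3aa` are assumptions about the VDJtools header that should be checked against the actual files. The toy table is invented.

```python
import csv
import gzip
import io


def read_clonotypes(source):
    """Yield one dict per clonotype; source is a path or a binary file object."""
    with gzip.open(source, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def cdr3_frequency(source, cdr3_column="cdr3aa", count_column="count"):
    """Return the fraction of total counts carried by each CDR3 amino acid sequence."""
    totals = {}
    for row in read_clonotypes(source):
        cdr3 = row[cdr3_column]
        totals[cdr3] = totals.get(cdr3, 0.0) + float(row[count_column])
    grand_total = sum(totals.values())
    return {cdr3: n / grand_total for cdr3, n in totals.items()}


# Invented in-memory table standing in for a file such as human.tra.strict.txt.gz.
toy_table = gzip.compress(
    b"count\tfreq\tcdr3nt\tcdr3aa\tv\td\tj\n"
    b"3\t0.75\tTGTGCTGTGAGAGAC\tCAVRD\tTRAV1-2\t.\tTRAJ33\n"
    b"1\t0.25\tTGTGCTGTGAGA\tCAVR\tTRAV1-2\t.\tTRAJ12\n"
)
frequencies = cdr3_frequency(io.BytesIO(toy_table))
```

Passing a file path instead of the in-memory buffer applies the same functions to a downloaded pool.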
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example dataset containing B-cell receptor (BCR) gene sequences. This dataset is intended to be used for testing software tools developed to annotate (i.e. map Variable, Diversity and Joining segments) and perform clonal analysis of BCR sequencing data.
Sequencing:
Libraries were prepared using 5'RACE from PBMCs of a healthy donor. Input molecules were tagged with unique molecular identifiers (UMIs). Sequencing was run on MiSeq with 300+300 bp reads.
Contents:
The dataset contains both raw sequencing reads and high-quality consensus sequences assembled using a unique molecular identifier (UMI) approach. Consensus assembly corrects for sequencing errors and eliminates sequencing artifacts.
All files contain a UMI tag sequence in their headers, in the form UMI:NNNN:QQQQ, where N is the base character and Q is the quality character (for assembled consensuses, the total number of reads is given instead of the Q string).
Note that consensus sequences were assembled using only raw sequences that correspond to UMI tags supported by at least 10 sequencing reads. That means that consensus sequence files contain a subset of all UMI tags found in raw sequences. Thus, if one wants to assess software performance on raw sequencing reads using assembled consensus sequences as a high-quality data standard, raw sequencing reads should be filtered to contain only those UMI tags that are present in consensus sequence file.
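The filtering step described above can be sketched as follows. This is a minimal illustration that assumes four-line FASTQ records and a header containing the UMI:NNNN:QQQQ tag; the function names and the toy reads are hypothetical.

```python
import re

# Matches the UMI:NNNN:QQQQ tag; the third field is a quality string in raw
# reads and a read count in consensus sequences.
UMI_PATTERN = re.compile(r"UMI:([ACGTN]+):(\S+)")


def extract_umi(header):
    """Return the UMI base string from a read header, or None if absent."""
    match = UMI_PATTERN.search(header)
    return match.group(1) if match else None


def iter_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of 4-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, "")
        next(stripped, "")  # '+' separator line
        quality = next(stripped, "")
        yield header, sequence, quality


def filter_raw_by_consensus_umis(raw_lines, consensus_lines):
    """Keep only raw reads whose UMI also appears in the consensus file."""
    consensus_umis = {extract_umi(h) for h, _, _ in iter_fastq(consensus_lines)}
    consensus_umis.discard(None)
    return [rec for rec in iter_fastq(raw_lines) if extract_umi(rec[0]) in consensus_umis]


# Invented reads: UMI ACGT has a consensus, UMI GGCC does not.
raw = [
    "@r1 UMI:ACGT:IIII", "TTGA", "+", "IIII",
    "@r2 UMI:GGCC:IIII", "TTGA", "+", "IIII",
    "@r3 UMI:ACGT:IIII", "TTGC", "+", "IIII",
]
consensus = ["@c1 UMI:ACGT:12", "TTGA", "+", "IIII"]
kept = filter_raw_by_consensus_umis(raw, consensus)
```

Opened file handles can be passed in place of the lists, since both functions only iterate over lines.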
Citations:
The whole dataset was used to benchmark MiXCR software and was originally referenced in Bolotin DA, et al. MiXCR: software for comprehensive adaptive immunity profiling Nature methods 12(5):380-381, 2015.
Data pre-processing was carried out using MIGEC software, Shugay M et al. Towards error-free profiling of immune repertoires. Nature Methods 11(6):653-655, 2014.
Contributors:
The dataset was generated in Prof. Chudakov's lab (Adaptive Immunity Group at Masaryk University, Brno, and Genomics of Adaptive Immunity Lab at the Institute of Bioorganic Chemistry, Moscow). Sample preparation and sequencing were performed by Dr. Olga Britanova and Dr. Maria Turchaninova. Raw sequencing reads were pre-processed and uploaded by Dr. Mikhail Shugay.
High-throughput T cell receptor (TCR) sequencing allows the characterization of an individual's TCR repertoire and directly queries their immune state. However, it remains a non-trivial task to couple these sequenced TCRs to their antigenic targets. In this paper, we present a novel strategy to annotate full TCR sequence repertoires with their epitope specificities. The strategy is based on a machine learning algorithm to learn the TCR patterns common to the recognition of a specific epitope. These results are then combined with a statistical analysis to evaluate the occurrence of specific epitope-reactive TCR sequences per epitope in repertoire data. In this manner, we can directly study the capacity of full TCR repertoires to target specific epitopes of the relevant vaccines or pathogens. We demonstrate the usability of this approach on three independent datasets related to vaccine monitoring and infectious disease diagnostics by independently identifying the epitopes that are targeted by the TCR repertoire. The developed method is freely available as a web tool for academic use at tcrex.biodatamining.be.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
T cell receptor sequence data of 26 people living with HIV on long-term anti-retroviral therapy, and 12 HIV-negative healthy controls, produced using the UCL Chain lab protocol. All participants were Caucasian male adults recruited from London, UK. People living with HIV were on anti-retroviral therapy for a median of 8.5 years (interquartile range 3-16 years). They had undetectable plasma HIV viral load (
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: T cells influence COVID-19 severity and establish long-lasting immune memory in response to vaccination and infection. The diversity of the T cell repertoire, and complexity of T cell epitope recognition, make it challenging to define protective epitope-specific T cells. In this study, we created a highly specific TCR meta-database to identify T cell epitopes from the nearly complete SARS-CoV-2 proteome and determine whether vaccination with mRNA vaccines influenced the TCR repertoire.
Methods: Using this meta-database, we analyzed immunosequencing data of genomic DNA to define the variable region of T cell receptor (TCR) β chain (TCRB) sequences among participants in a longitudinal COVID-19 cohort study. The TCR repertoire was compared between participants who were vaccinated or unvaccinated against SARS-CoV-2 and stratified by disease severity. TCR diversity was measured using clonality, an index defined as the inverted normalized Shannon entropy.
Results: Highly clonal TCR repertoires correlated with age and comorbidities. Using our meta-database approach, we found that vaccinated participants hospitalized with infection had the most restricted SARS-CoV-2-specific CD8 TCR repertoire. However, TCRB with predicted specificity to non-spike SARS-CoV-2 proteins dominated the response, even in vaccinated participants. We identified a peptide sequence in the ORF10 accessory protein that was more frequently recognized in study participants with mild disease. Conversely, CD8 T cell recognition of a peptide sequence in ORF1ab more closely correlated with severe disease.
Discussion: Overarchingly, TCR repertoire analysis revealed that CD8 T cells responding to SARS-CoV-2 broadly recognize epitopes across the SARS-CoV-2 proteome, and provided opportunities to identify epitopes associated with disease.
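The clonality index mentioned in the methods, the inverted normalized Shannon entropy, can be computed from clone counts as one minus the entropy divided by the logarithm of the number of clones. The sketch below is a generic implementation of that formula; treating a repertoire with fewer than two clones as fully clonal is a convention chosen here, not something stated in the abstract.

```python
import math


def clonality(clone_counts):
    """Return 1 - H / log(N) for a list of per-clone counts.

    H is the Shannon entropy of the clone frequency distribution and N is
    the number of clones with a non-zero count. Values near 0 indicate an
    even repertoire; values near 1 indicate dominance by a few clones.
    """
    counts = [c for c in clone_counts if c > 0]
    if len(counts) < 2:
        return 1.0  # convention chosen for this sketch
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return 1.0 - entropy / math.log(len(counts))


even_repertoire = clonality([10, 10, 10, 10])
skewed_repertoire = clonality([97, 1, 1, 1])
```

An evenly distributed repertoire gives a clonality close to zero, while a repertoire dominated by one clone gives a value close to one.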
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thymic selection is crucial for forming a pool of T-cells that can efficiently discriminate self from non-self using their T-cell receptors (TCRs) to develop adaptive immunity. In the present study we analyzed how a diverse set of physicochemical and sequence features of a TCR can affect the chances of successfully passing the selection. On a global scale we identified differences in selection probabilities based on CDR3 loop length, hydrophobicity, and residue sizes depending on variable genes and TCR chain context. We also observed a substantial decrease in N-glycosylation sites and other short sequence motifs for both alpha and beta chains. At the local scale we used dedicated statistical and machine learning methods coupled with a probabilistic model of the V(D)J rearrangement process to infer patterns in the CDR3 region that are either enriched or depleted during the course of selection. While the abundance of patterns containing poly-glycines can improve CDR3 flexibility in selected TCRs, the “holes” in the TCR repertoire induced by negative selection can be related to arginines in the N-Diversity (D)-N (NDN) region. We stored the corresponding patterns in a database available online. We demonstrated how TCR sequence composition affects lineage commitment during thymic selection. Structural modeling reveals that TCRs with “flat” and “bulged” CDR3 loops are more likely to commit T-cells to the CD4+ and CD8+ lineage, respectively. Finally, we highlighted the effect of an individual MHC haplotype on the selection process, suggesting that those “holes” can be donor-specific. Our results can be further applied to identify potentially self-reactive TCRs in donor repertoires and aid in TCR selection for immunotherapies.
https://www.scilifelab.se/data/restricted-access/
This dataset contains genomic TCR beta sequences from single cell DNA samples amplified by multiple displacement amplification (MDA) and subjected to nested PCR targeting the genomic TCR beta locus. The individual files contain raw data representing nucleotide sequences including both productive and non-productive rearrangements of the TCR beta sequence (with dropout in some cases). The dataset also includes FASTQ files corresponding to single cell RNAseq data from single CD8+ T cells prepared by the smart-seq2 method, and FASTQ files for 25-cell ‘mini-bulk’ RNAseq for CD8+ T cells prepared according to the smart-seq2 protocol.
The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains.
CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
T cell receptor sequence data of alopecia patients before and during sensitisation with diphenylcyclopropenone and healthy volunteers at equivalent timepoints, using the UCL Chain lab protocol. Details of the study are provided in Ronel et al, eLife 2021 (10.7554/eLife.54747). The processed data files have been generated using Decombinator V4 (https://github.com/innate2adaptive/Decombinator). The raw data files are available at the NCBI Sequence Read Archive, accession number PRJNA592875.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Processing
Samples were demultiplexed via their Illumina indices and processed using the Immcantation toolkit (1,2). Raw fastq files were filtered based on a quality score threshold of 20. Paired reads were joined if they had a minimum length of 10 nt, a maximum error rate of 0.3, and a significance threshold of 0.0001. Reads with identical UMIs were collapsed to a consensus sequence. Reads with identical full-length sequence and identical constant primer but differing UMI were further collapsed. Sequences were then submitted to IgBlast (3) for VDJ assignment and sequence annotation. Constant region sequences were mapped to germline using Stampy (4). The number and type of V gene mutations were calculated using the shazam R package (2).
software_versions: pRESTO 0.5.3, Change-O 0.3.4, IgBlast 1.6.1, Stampy 1.0.21, shazam 0.1.8
quality_thresholds: FilterSeq.py (pRESTO), Q > 20
paired_reads_assembly: AssemblePairs.py (pRESTO), minlen 10, maxerror 0.3, alpha 0.0001
primer_match_cutoffs: MaskPrimers.py (pRESTO), C primer and V primer, maxerror 0.2
consensus_building: BuildConsensus.py (pRESTO), maxerror 0.1, maxgap 0.5
collapsing_method: CollapseSeq.py (pRESTO)
germline_database: IMGT
Format
Processed sequences are provided in a tab delimited file format, including the following annotations:
C_CALL Isotype subclass
SEQUENCE_ID Sequence identifier
V_CALL V segment gene and allele
D_CALL D segment gene and allele
J_CALL J segment gene and allele
JUNCTION_LENGTH Junction length
CONSCOUNT Raw read count from which UMI consensus sequences were generated, summed over all UMIs for the given unique sequence.
DUPCOUNT UMI count for the given unique sequence
ISOTYPE Constant region primer (isotype)
MU_COUNT_CDR_R Number of replacement mutations in CDR region
MU_COUNT_CDR_S Number of silent mutations in CDR region
MU_COUNT_FWR_R Number of replacement mutations in FWR region
MU_COUNT_FWR_S Number of silent mutations in FWR region
MUT_TOTAL Total number of mutations in V gene
SEQUENCE_INPUT Full length sequence
SEQUENCE_IMGT Gapped IMGT sequence
V_GERM_START_VDJ Position of the first nucleotide in the ungapped V germline sequence alignment
JUNCTION Junction nucleotide sequence
GERMLINE_IMGT_D_MASK IMGT-gapped germline nucleotide sequence with Ns masking the NP1-D-NP2 regions
Run ID of sequencing run
Sample_type The tissue sampled (e.g. peripheral blood, bone marrow)
Sex Sex of the Subject
Age Age of the subject
UNIQUE_ID Subject identifier
SAMPLE_ID Sample identifier, linking back to raw data
Subset Defined B cell subset
Repertoire Defined B cell repertoire (Naive, Memory IgM/IgD, IgA, IgG)
R_SCDR R/S ratio in CDR region
R_SFWR R/S ratio in FWR region
V_FAM V family gene
V_GENE V segment gene
D_GENE D segment gene
J_GENE J segment gene
Clust_Rank Cluster rank
Clust_REPRES Cluster representative
Clust_SIZE Cluster size
Clust_MAXFREQ Cluster maximum frequency
Clust_SHAREDNESS Cluster sharedness
CDR3_AA_GRAVY CDR3 hydrophobicity index
CDR3_AA_CHARGE CDR3 charge
CDRH3PDB CDRH3 PDB (Structure) code
H1Canon H1 Canonical class
H2Canon H2 Canonical class
H1_GERMLINE H1 Germline Canonical class
H2_GERMLINE H2 Germline Canonical class
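As an illustration of how the annotation columns relate, a replacement-to-silent (R/S) ratio can be derived from the MU_COUNT_* columns of the tab-delimited file. The sketch below is an assumption about that relationship, not a description of how the dataset's own R_SCDR and R_SFWR columns were computed; the sample rows, the output keys, and the choice to return None when there are no silent mutations are all hypothetical.

```python
import csv
import io

# Invented two-row excerpt using column names from the list above.
SAMPLE = (
    "SEQUENCE_ID\tMU_COUNT_CDR_R\tMU_COUNT_CDR_S\tMU_COUNT_FWR_R\tMU_COUNT_FWR_S\n"
    "seq1\t6\t2\t3\t3\n"
    "seq2\t1\t0\t0\t4\n"
)


def rs_ratio(replacement, silent):
    """Return the R/S ratio, or None when there are no silent mutations."""
    return replacement / silent if silent else None


def with_rs_ratios(handle):
    """Yield each row with CDR and FWR R/S ratios added under new keys."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["cdr_rs"] = rs_ratio(float(row["MU_COUNT_CDR_R"]), float(row["MU_COUNT_CDR_S"]))
        row["fwr_rs"] = rs_ratio(float(row["MU_COUNT_FWR_R"]), float(row["MU_COUNT_FWR_S"]))
        yield row


rows = list(with_rs_ratios(io.StringIO(SAMPLE)))
```

Comparing the derived values with the file's own R/S columns on a few rows is a quick way to confirm whether the same definition applies.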
References
1. Vander Heiden, J. A., G. Yaari, M. Uduman, J. N. H. Stern, K. C. O’Connor, D. A. Hafler, F. Vigneault, and S. H. Kleinstein. 2014. pRESTO: A toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30: 1930–1932.
2. Gupta, N. T., J. A. Vander Heiden, M. Uduman, D. Gadala-Maria, G. Yaari, and S. H. Kleinstein. 2015. Change-O: A toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31: 3356–3358.
3. Ye, J., N. Ma, T. L. Madden, and J. M. Ostell. 2013. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41.
4. Lunter, G., and M. Goodson. 2011. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21: 936–939.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example repertoire data generated by AIRRSHIP (https://github.com/Cowanlab/airrship). Four repertoires are available (two with SHM, two without), each of which contains 100,000 sequences produced using the default AIRRSHIP parameters. Sequence data are contained in the FASTA files, the TSV files give details of each step in the generation process, the summary file shows the command given to AIRRSHIP, and the locus file contains the alleles used in the repertoire. See https://airrship.readthedocs.io/en/latest/output/ for more information on the file formats.
Repertoires were created using version 0.1.2 of AIRRSHIP.
Liquid biopsy is a promising non-invasive technology that is capable of diagnosing cancer. However, current ctDNA-based approaches detect only a minority of early-stage disease. We set out to improve the sensitivity of liquid biopsy by harnessing tumor recognition by T cells through the sequencing of the circulating T-cell receptor repertoire. We studied a cohort of 463 patients with lung cancer (86% stage I) and 587 subjects without cancer using gDNA extracted from blood buffy coats. We performed TCR β chain sequencing to yield a median of 113,571 TCR clonotypes per sample and built a TCR sequence similarity graph to cluster clonotypes into TCR repertoire functional units (RFUs). The TCR frequencies of RFUs were tested for association with cancer status and RFUs with a statistically significant association were combined into a cancer score using a support vector machine model. The model was evaluated by 10-fold cross-validation and compared with a ctDNA panel of 237 mutation hotspots in 154 lung cancer driver genes and 17 cancer related protein biomarkers in 85 subjects. We identified 327 cancer-associated TCR RFUs with a false discovery rate (FDR) ≤ 0.1, including 157 enriched in cancer samples and 170 enriched in controls. Levels of 247/327 (76%) RFUs were correlated with the presence of an HLA allele at FDR ≤ 0.1 and tumor-infiltrating lymphocyte TCRs from multiple RFUs bound HLA presented tumor antigen peptides, suggesting antigen recognition as a driver of the cancer-RFU associations found. The RFU cancer score detected nearly 50% of stage I lung cancers at a specificity of 80% and boosted the sensitivity by up to 20 percentage points when added to ctDNA and circulating proteins in a multi-analyte cancer screening test. Overall, we show that circulating TCR repertoire functional unit analysis can complement established analytes to improve liquid biopsy sensitivity for early-stage cancer.
This dataset contains the CellRanger output for 20 cancer patients. Please refer to https://www.10xgenomics.com/support/software/cell-ranger/latest for documentation. For details on how the data was generated, please see Li Y. et al. 2025: Circulating T-cell Receptor Repertoire for Cancer Early Detection.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The adaptation of high-throughput sequencing to the B cell receptor and T cell receptor has made it possible to characterize the adaptive immune receptor repertoire (AIRR) at unprecedented depth. These AIRR sequencing (AIRR-seq) studies offer tremendous potential to increase the understanding of adaptive immune responses in vaccinology, infectious disease, autoimmunity, and cancer. The increasingly wide application of AIRR-seq is leading to a critical mass of studies being deposited in the public domain, offering the possibility of novel scientific insights through secondary analyses and meta-analyses. However, effective sharing of these large-scale data remains a challenge. The AIRR community has proposed minimal information about adaptive immune receptor repertoire (MiAIRR), a standard for reporting AIRR-seq studies. The MiAIRR standard has been operationalized using the National Center for Biotechnology Information (NCBI) repositories. Submissions of AIRR-seq data to the NCBI repositories typically use a combination of web-based and flat-file templates and include only a minimal amount of terminology validation. As a result, AIRR-seq studies at the NCBI are often described using inconsistent terminologies, limiting scientists’ ability to access, find, interoperate, and reuse the data sets. In order to improve metadata quality and ease submission of AIRR-seq studies to the NCBI, we have leveraged the software framework developed by the Center for Expanded Data Annotation and Retrieval (CEDAR), which develops technologies involving the use of data standards and ontologies to improve metadata quality. The resulting CEDAR-AIRR (CAIRR) pipeline enables data submitters to: (i) create web-based templates whose entries are controlled by ontology terms, (ii) generate and validate metadata, and (iii) submit the ontology-linked metadata and sequence files (FASTQ) to the NCBI BioProject, BioSample, and Sequence Read Archive databases. 
Overall, CAIRR provides a web-based metadata submission interface that supports compliance with the MiAIRR standard. This pipeline is available at http://cairr.miairr.org, and will facilitate the NCBI submission process and improve the metadata quality of AIRR-seq studies.
According to our latest research, the global B Cell Receptor Sequencing market size reached USD 382.4 million in 2024, demonstrating robust momentum driven by technological advancements and the growing demand for precision medicine. The market is expected to expand at a CAGR of 16.2% during the forecast period, reaching a projected value of USD 1,346.7 million by 2033. This substantial growth is propelled by increasing applications in immunology, oncology, and vaccine development, alongside the widespread adoption of next-generation sequencing technologies.
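The relationship between a start value, an end value, and a compound annual growth rate (CAGR) can be checked with two short formulas. The sketch below uses the market figures quoted above as inputs; the number of compounding periods is an assumption, because the text does not state the base year of the forecast period.

```python
def implied_cagr(start_value, end_value, periods):
    """Return the constant annual growth rate linking two values over `periods` years."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


def project(start_value, rate, periods):
    """Return the value reached after compounding `rate` for `periods` years."""
    return start_value * (1.0 + rate) ** periods


# Figures quoted in the text (USD million); 9 periods assumes 2024 as the base year.
start_2024 = 382.4
end_2033 = 1346.7
rate_from_values = implied_cagr(start_2024, end_2033, periods=9)
value_from_rate = project(start_2024, 0.162, periods=9)
```

Varying `periods` shows how sensitive the implied rate is to the choice of base year.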
One of the most significant growth factors for the B Cell Receptor Sequencing market is the surging focus on personalized medicine and immunotherapy. The ability to sequence B cell receptors at a high resolution provides researchers and clinicians with deep insights into the adaptive immune system, enabling the identification of disease-specific antibodies and the development of targeted therapies. The rise of chronic diseases, including various types of cancers and autoimmune conditions, has further fueled the need for advanced immunoprofiling techniques. As a result, pharmaceutical and biotechnology companies are increasingly investing in B cell receptor sequencing technologies to accelerate drug discovery and enhance the efficacy of immunotherapeutic interventions, thereby driving market expansion.
Another major driver is the technological evolution in sequencing platforms, particularly the adoption of next-generation sequencing (NGS). NGS has revolutionized the field by allowing high-throughput, cost-effective, and accurate sequencing of B cell receptors, surpassing the limitations of traditional methods like Sanger sequencing. The integration of bioinformatics and advanced data analysis tools has further streamlined the process, making it more accessible for both research and clinical applications. Continuous improvements in sequencing accuracy, speed, and scalability are encouraging a broader range of end-users, including academic institutes, hospitals, and pharmaceutical companies, to integrate B cell receptor sequencing into their workflows, which is anticipated to further boost market growth.
Regulatory support and increasing investments in biomedical research have also played a pivotal role in market development. Governments and funding agencies worldwide are prioritizing immunology research, infectious disease monitoring, and vaccine development, especially in the wake of recent global health crises. Collaborative initiatives between public and private sectors have led to the establishment of research consortia and biobanks, fostering the adoption of advanced sequencing technologies. The expansion of clinical trials involving immunotherapies and monoclonal antibodies has further emphasized the importance of comprehensive B cell receptor profiling, thereby creating a conducive environment for market growth over the coming years.
From a regional perspective, North America continues to dominate the B Cell Receptor Sequencing market, accounting for the largest share due to its well-established healthcare infrastructure, high research and development spending, and presence of leading biotechnology firms. Europe follows closely, supported by strong academic research and government initiatives. The Asia Pacific region is witnessing the fastest growth, attributed to increasing investments in healthcare, rising awareness about precision medicine, and the rapid expansion of research facilities. As global collaborations intensify and technological adoption accelerates, the market is poised for significant growth across all major regions during the forecast period.
The B Cell Receptor Sequencing market is segmented by product type into Reagents & Kits, Instruments, and Software & Services, each playing a distinct role in the overall ecosystem. Reagents & Kits represent the largest and most dynamic segment, driven by their recurring demand in sequencing
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The COVID-19 pandemic, driven by the continuous evolution of severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) and the emergence of immune-evasive variants, remains a global threat, elevating reinfection risks and challenging existing therapeutics. I therefore conducted a comprehensive study of SARS-CoV-2 antibodies, under the hypothesis that antibody engineering strategies such as bispecific antibodies could overcome such immune evasion and that integrating B cell receptor (BCR) sequencing with functional screening assays would enable efficient antibody discovery. This study focused on two main aspects: the engineering of broadly neutralizing antibodies to combat immune evasion and the development of efficient strategies for screening potent antibodies.
For the first aspect, following the fifth wave of COVID-19 in Hong Kong, I conducted a serological survey (n=36) to assess herd immunity against emerging variants. Using neutralization assays, I demonstrated that convalescents from the third and fourth waves, infected by B.1.1.63 and B.1.36, showed significantly weaker responses to Omicron sublineages than those infected with BA.2/BA.5 during the fifth wave. These results indicated a higher susceptibility to reinfection among patients infected during earlier waves. Moreover, I found that breakthrough infections elicited stronger neutralizing responses than infection alone, underscoring the role of hybrid immunity in providing better protection. Subsequently, to overcome the immune escape of BA.4/5 from the previously identified broadly neutralizing antibody (bnAb) ZCB11, I engineered bispecific antibodies in DVD-Ig format by fusing the class I antibody ZCB11 with the class III neutralizing antibodies P2D9 or P3E6. These bispecific antibodies restored neutralization activity against BA.4/5, although with reduced potency: their IC50 values (ZCB11-P2D9: 0.5746 μg/mL; ZCB11-P3E6: 0.1639 μg/mL) were higher than those of the parental monoclonal antibodies (P2D9: 0.0753 μg/mL; P3E6: 0.0743 μg/mL). Structure-guided design targeting the F486V-driven disruption of a hydrophobic interface failed to yield functional gain-of-binding mutants, underscoring the challenges of rational affinity maturation. These results indicated that pairing class I with class III neutralizing antibodies is unlikely to be a good strategy for constructing potent bispecific broadly neutralizing antibodies, probably because of structural hindrance.
For the second aspect, I tried to optimize antibody screening by integrating BCR sequencing with functional validation. From the BCR repertoire of a well-defined bnAb donor, obtained by sequencing 3395 single B cell clones, a total of 146 BCR sequences were selected using phylogenetic and similarity-based criteria and tested. None of them, however, showed neutralization activity. Concurrently, several ultrapotent broadly neutralizing antibodies were isolated from this donor using the conventional single B cell sorting method. Unexpectedly, identical BCR clones were not found in the sequenced repertoire. This result indicated the low frequency of ultrapotent bnAbs in the donor. Lastly, I adopted a method of linking B cell receptor to antigen specificity through sequencing (LIBRA-seq). I successfully identified 20 cross-reactive antibodies from memory B cells, with the top candidate showing broad but weak neutralization. In conclusion, my findings not only revealed polyclonal antibody responses against SARS-CoV-2 but also indicated useful technology platforms for engineering bispecific antibodies and a promising sequence-guided screening framework for rapid antibody discovery.
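As a rough illustration of the reduced potency reported above, the IC50 of each bispecific construct can be divided by that of its class III parental antibody. The sketch below uses only the IC50 values quoted in the abstract; the choice of P2D9 and P3E6 as the comparison baseline and all variable names are assumptions made for illustration, not the author's analysis.

```python
# IC50 values against BA.4/5 in ug/mL, as quoted in the abstract above.
ic50_bispecific = {"ZCB11-P2D9": 0.5746, "ZCB11-P3E6": 0.1639}
ic50_parental = {"P2D9": 0.0753, "P3E6": 0.0743}

# Assumption: each bispecific is compared with its class III parent,
# named by the part of the construct label after the hyphen.
fold_change = {}
for construct, ic50 in ic50_bispecific.items():
    parent = construct.split("-")[1]
    # A ratio above 1 means the bispecific needs a higher concentration
    # for half-maximal neutralization, i.e. it is less potent.
    fold_change[construct] = ic50 / ic50_parental[parent]

for construct, ratio in fold_change.items():
    print(f"{construct}: {ratio:.1f}-fold higher IC50 than parent")
```

Under these assumptions, ZCB11-P2D9 has an IC50 roughly 7.6 times that of P2D9, and ZCB11-P3E6 roughly 2.2 times that of P3E6.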
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using long-read sequencing, we sequenced four Indian-origin rhesus macaque tissues. From raw full-length, non-chimeric circular consensus sequencing (CCS) reads, we obtained high-quality, full-length sequences for over 6,000 unique immunoglobulin and T-cell receptor transcripts, without the need for sequence assembly.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed sequencing data from BioProject PRJNA349143.
Study Design
Samples were collected from human volunteers as described in Laserson and Vigneault et al, 2014 (1). Briefly, blood samples were collected from three individuals both pre- and post-vaccination for seasonal influenza. Samples were collected for sequencing at time points -8 days, -2 days, -1 hour, +1 hour, +1 day, +3 days, +7 days, +14 days, +21 days and +28 days relative to injection with seasonal influenza vaccine.
Library Preparation and Sequencing
The original samples from Laserson and Vigneault et al, 2014 (1) were re-sequenced as described in Gupta et al, 2017 (2). Briefly, sequencing libraries were prepared from mRNA using 5'RACE with addition of 17-nucleotide unique molecular identifiers (UMIs). Amplification was performed using constant region primers specific to IGHA, IGHD, IGHE, IGHG, IGHM, IGKC and IGLC. Sequencing was conducted on the Illumina MiSeq platform using the 600 cycle kit with 325 cycles for read 1 and 275 cycles for read 2. A 10% PhiX spike-in was added for sequencing.
Data Processing
Sequences were processed using the pRESTO (3) and Change-O (4) toolkits as described in Gupta et al, 2017 (2).
Note: the provided data have been heavily filtered, including removal of sequences that failed V(D)J alignment and exclusion of non-functional sequences.
Format
Processed sequences are provided in FASTA format annotated using the pRESTO scheme.
Annotations included are as follows:
CONSCOUNT: Raw read count from which UMI consensus sequences were generated, summed over all UMIs for the given unique sequence.
DUPCOUNT: UMI count for the given unique sequence.
PRCONS: Constant region primer (isotype).
SUBJECT: Subject identifier.
TIME_POINT: Time point label.
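The annotations listed above can be read directly from the FASTA headers. The following is a minimal sketch, assuming the usual pRESTO header layout in which fields are separated by `|` and keys from values by `=`; the sequence identifiers, subject label, time point label, and isotype values in the example records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def parse_presto_header(header):
    """Split a pRESTO-style FASTA header into (sequence_id, annotations)."""
    fields = header.lstrip(">").strip().split("|")
    annotations = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        annotations[key] = value
    return fields[0], annotations


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def umi_counts(lines):
    """Sum DUPCOUNT (UMI count) per (SUBJECT, TIME_POINT, PRCONS)."""
    totals = defaultdict(int)
    for header, _sequence in read_fasta(lines):
        _seq_id, ann = parse_presto_header(header)
        key = (ann["SUBJECT"], ann["TIME_POINT"], ann["PRCONS"])
        totals[key] += int(ann["DUPCOUNT"])
    return dict(totals)


# Hypothetical records for illustration only.
example = """\
>seq1|CONSCOUNT=12|DUPCOUNT=3|PRCONS=IGHM|SUBJECT=S1|TIME_POINT=+7d
ACGTACGTAC
>seq2|CONSCOUNT=5|DUPCOUNT=2|PRCONS=IGHM|SUBJECT=S1|TIME_POINT=+7d
ACGTTCGTAC
>seq3|CONSCOUNT=9|DUPCOUNT=4|PRCONS=IGHG|SUBJECT=S1|TIME_POINT=+7d
ACGTACGAAC
""".splitlines()

totals = umi_counts(example)
print(totals)
```

Summing DUPCOUNT rather than CONSCOUNT counts distinct UMIs instead of raw reads, which is usually the more appropriate abundance measure when UMIs are available. To apply the sketch to a real file, pass an open file handle to `umi_counts` in place of `example`.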
Citations
1. Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. Proc Natl Acad Sci USA 111, 4928–33 (2014).
2. Gupta NT, et al. Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data. J Immunol 1601850 (2017).
3. Vander Heiden JA and Yaari G, et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930–2 (2014).
4. Gupta NT and Vander Heiden JA, et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–8 (2015).