Designing full-length, epitope-specific TCR αβ remains challenging due to vast sequence space, data biases and incomplete modeling of immunogenetic constraints. We present LSMTCR, a scalable multi-architecture framework that separates specificity from constraint learning to enable de novo, epitope-conditioned generation of paired, full-length TCRs. A diffusion-enhanced BERT encoder learns time-conditioned epitope representations; conditional GPT decoders, pretrained on CDR3β and transferred to CDR3α, generate chain-specific CDR3s under cross-modal conditioning with temperature-controlled diversity; and a gene-aware Transformer assembles complete αβ sequences by predicting V/J usage to ensure immunogenetic fidelity. Across GLIPH, TEP, MIRA, McPAS and our curated dataset, LSMTCR achieves higher predicted binding than baselines on most datasets, more faithfully recovers positional and length grammars, and delivers superior, temperature-tunable diversity. For α-chain generation, transfer learning improves predicted binding, length realism and diversity over representative methods. Full-length assembly from known or de novo CDR3s preserves k-mer spectra, yields low edit distances to references, and, in paired αβ co-modelling with epitope, attains higher pTM/ipTM than single-chain settings. LSMTCR outputs diverse, gene-contextualized, full-length TCR designs from epitope input alone, enabling high-throughput screening and iterative optimization.
@article{Zhang2025LSMTCR,
  title={LSMTCR: A Scalable Multi-Architecture Model for Epitope-Specific T Cell Receptor de novo Design},
  author={Zhang, Ruihao and Liu, Xiao},
  journal={arXiv preprint},
  eprint={2509.07627},
  archiveprefix={arXiv},
  primaryclass={cs.CE},
  year={2025},
  month=sep,
  url={https://arxiv.org/abs/2509.07627},
}
arXiv
Classification of autoimmune diseases from peripheral blood TCR repertoires by multimodal multi-instance learning
Ruihao Zhang, Mao Chen, Fei Ye, Dandan Meng, Yixuan Huang, and Xiao Liu
T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that uses TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA). By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model was robust across multiple disease categories: it used the SLEDAI score to stratify SLE patients by disease severity and to identify the site of damage, while controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification, with broad potential clinical applications across immune-mediated conditions.
@article{Zhang2025EAMil,
  title={Classification of autoimmune diseases from peripheral blood TCR repertoires by multimodal multi-instance learning},
  author={Zhang, Ruihao and Chen, Mao and Ye, Fei and Meng, Dandan and Huang, Yixuan and Liu, Xiao},
  journal={arXiv preprint},
  eprint={2507.04981},
  archiveprefix={arXiv},
  primaryclass={q-bio.QM},
  year={2025},
  month=jul,
  url={https://arxiv.org/abs/2507.04981},
}
BiB
LightCTL: lightweight contrastive TCR-pMHC specificity learning with context-aware prompt
Fei Ye, Mao Chen, Yixuan Huang, and 6 more authors