TCR-VALID: TCR Variational Autoencoder Landscape for Interpretable Dimensions

This is a package for TCR-VALID: capacity controlled VAEs for TCR sequence data.

This package is associated with the publication: Leary, Allen Y., et al. "Designing meaningful continuous representations of T cell receptor sequences with deep generative models." Nature Communications (2024)

Any and all future code development will occur on the forked repo: https://github.com/regeneron-mpds/tcrvalid/.

Installation

We advise initially creating a conda environment via

And subsequently we can install TCRVALID via pip pointing at the directory where you cloned the repo:

In case there are any issues in versioning we additionally provide:

Installation via pip should take no more than a few minutes. For an idea of TCRVALID's runtime, see the results in results_data/timings.

Models and data

All required models are co-packaged with TCR-VALID, and can be loaded by name with a loading utility function.

The following will load a single named model for TRB CDR2-CDR3 sequences:

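from tcrvalid.load_models import load_named_models
model_name = '1_2'
# load_named_models returns a dictionary keyed by model name; index it to get the model
loaded_trb_model = load_named_models(model_name, chain='TRB', as_keras=True)[model_name]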

where for TRB available model names are:

'0_0', '1_1', '1_2', '1_5', '1_10', '1_20', '1_2_full_40'

The first number represents the value of $\beta$ used in the KL-loss term of the model, and the second the capacity of the model.

The model with 'full' in the name is the final TRB model, trained on the largest TRB training dataset and collected at the 40th epoch, at the minimum of the validation loss. All other models were trained on the "smallTRB" dataset of approximately 4 million sequences.
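
The naming scheme can be unpacked programmatically. The following is a minimal sketch, not part of the package: `ModelSpec` and `parse_model_name` are hypothetical names introduced here, and the parser assumes names always follow the `beta_capacity` or `beta_capacity_full_epoch` patterns listed above.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    """Hypothetical container for the fields encoded in a model name."""
    beta: int              # weight on the KL-loss term
    capacity: int          # capacity of the model
    full_dataset: bool     # True if 'full' appears in the name
    epoch: Optional[int]   # epoch at which the model was collected, if given


def parse_model_name(name: str) -> ModelSpec:
    """Split a name such as '1_2' or '1_2_full_40' into its parts."""
    parts = name.split("_")
    beta, capacity = int(parts[0]), int(parts[1])
    full = len(parts) > 2 and parts[2] == "full"
    epoch = int(parts[3]) if full and len(parts) > 3 else None
    return ModelSpec(beta, capacity, full, epoch)


for name in ["0_0", "1_1", "1_2", "1_5", "1_10", "1_20", "1_2_full_40"]:
    print(name, parse_model_name(name))
```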

For TRA we provide the TCR-VALID model with capacity=2: '1_2'.

Multiple models can be collected into a python dictionary for looping through them via:

from tcrvalid.load_models import load_named_models
# model_names is a list of model names to collect
model_names = ['1_2', '1_5']
# the result is a dictionary mapping each model name to its loaded model
loaded_trb_models = load_named_models(model_names, chain='TRB', as_keras=True)
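
Looping over such a dictionary might look like the sketch below. `embed_with_each_model` is a hypothetical helper, not a function from the package. It assumes each model was loaded with `as_keras=True` and that `predict` returns three outputs with the embedding first, as in the embedding example in this README.

```python
def embed_with_each_model(models, x):
    """Return {model_name: embedding} for every model in the dictionary.

    `models` is a dict such as the one returned by load_named_models;
    `x` is an input array in the format the models expect.
    """
    embeddings = {}
    for name, model in models.items():
        # Assumption: predict returns three outputs, the first being z.
        z, _, _ = model.predict(x)
        embeddings[name] = z
    return embeddings


# Intended usage (requires tcrvalid to be installed, so left as comments):
# from tcrvalid.load_models import load_named_models
# models = load_named_models(['1_2', '1_5'], chain='TRB', as_keras=True)
# embeddings = embed_with_each_model(models, x)
```

Keeping the results keyed by model name makes it straightforward to compare representations from models with different capacities side by side.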

Example embedding

from tcrvalid.load_models import load_named_models
from tcrvalid.physio_embedding import SeqArrayDictConverter
# get model for TRB
model_name = '1_2_full_40'
model = load_named_models(model_name, chain='TRB', as_keras=True)[model_name]

# get mapping object to convert protein sequence 
# to physicochemical representation
mapping = SeqArrayDictConverter()

# convert seq to physicochemical - list could have many sequences
# sequences are "CDR2-CDR3" formatted
x = mapping.seqs_to_array(['FYNNEI-ASSETLAGANTQY'],maxlen=28)

# get TCR-VALID representation
z,_,_ = model.predict(x)
print(z)
# expect:
# [[ 0.2436572  -1.5467906  -0.63847804  0.83660173  0.10755217 -0.28501382
#    0.9832421  -0.19073558  0.38733137 -0.5093988   0.5247447  -0.660075
#    0.04878296  0.5692204  -1.3631787   1.3796847 ]]

Embeddings of TCRs such as this can be used for clustering, classification, and generation. More examples of such cases can be found in notebooks/. Package imports should take ~5s, model loading ~1.5s, and embedding calculation <1s; these times are based on a 4-core CPU machine.
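
As a rough illustration of clustering in the embedding space, the sketch below groups points whose Euclidean distance falls under a threshold (simple single-linkage grouping). This is not necessarily the clustering method used in the publication or the notebooks. `threshold_clusters`, the toy 2-D points, and the threshold are all made up for illustration; real embeddings from `model.predict` can be passed in after converting them with `z.tolist()`.

```python
import math


def threshold_clusters(embeddings, max_dist):
    """Label points so that any two within max_dist share a cluster."""
    n = len(embeddings)
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of points that lies within the distance threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(embeddings[i], embeddings[j]) <= max_dist:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels, ids = [], {}
    for i in range(n):
        labels.append(ids.setdefault(find(i), len(ids)))
    return labels


# Toy data only: three well-separated groups in two dimensions.
toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
print(threshold_clusters(toy, max_dist=0.5))  # [0, 0, 1, 1, 2]
```

The pairwise loop is quadratic in the number of sequences, so this form is only suitable for small sets.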

data

We also co-package essential datasets for reproducing our findings. Namely:

Some of the data is stored in "results_data"

comparitor_tooling

We provide the tooling and wrappers to perform our TCR-antigen clustering benchmarking with and without irrelevant TCR spike-ins. Briefly:

Examples

notebooks:

In the notebooks directory you will find several examples and plotting methods:

scripts:

In the scripts directory you will find:

License

Copyright 2023 Regeneron Pharmaceuticals Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0.

Please see the "LICENSE" file.

Third-party software and data

Note that the hinton.py module in /notebooks/ contains a function from matplotlib which has its own license, which is available in the notebooks directory as "matplotlib_v.1.2.0_LICENSE".

tcrvalid/data/TR{A/B}_reference.csv:

tcrvalid/data/tts_full_TRA_te and tcrvalid/data/tts_TRBte are a subset of data collated from iReceptor [Corrie et al. Immunological Reviews 284 (2018)] and VDJServer [Christley et al. Frontiers in Immunology 9 (2018)].

The wrappers around, and changes to, tcr-dist and iSMART in tcrvalid/comparitor_tooling/distance_based_tools, scripts/modified_third_party, and tcrclustering/modified_third_party are subject to those programs' own licenses, which are included with those programs.