Parsing T- and B-cell receptor segments from IMGT database to a flexible plain-text format
Getting raw sequences
Instructions for downloading raw IMGT files:
- Go to the genedb page and click the first submit button
- Next, scroll down to the end of the resulting page (loading can take a while) and mark Select all genes
- Select the F+ORF+in-frame P nucleotide sequences with IMGT gaps format and click submit
- Copy-paste the resulting FASTA records and use them as an input file
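Before running the parser it can help to sanity-check the pasted FASTA file. The sketch below is not part of SegmentParser; it assumes the usual pipe-delimited IMGT header layout (accession, allele name, species, functionality, region label, ...), and the function names `read_fasta` and `summarize` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines: Iterable[str]) -> Counter:
    """Count records per (species, functionality) pair.

    Assumption: headers look like
    accession|allele|species|functionality|region|...
    Records with fewer than five fields are counted as unparsed.
    """
    counts = Counter()
    for header, _sequence in read_fasta(lines):
        fields = header.split("|")
        if len(fields) < 5:
            counts[("unparsed", "unparsed")] += 1
            continue
        counts[(fields[2], fields[3])] += 1
    return counts
```

A typical use would be `summarize(open("imgt_raw_file"))`, printing the resulting counts to confirm that the expected species and functionality classes are present before passing the file to the parser.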
Running the software
Get the compiled binaries and run the software as:
java -jar segmentparser.jar [options] imgt_raw_file output_prefix
The following options can be selected:
- -n: include non-functional segments in the output (pseudogenes, etc.)
- -m: include minor alleles (segments with *02, *03, etc. suffixes)
- -s: toggle detailed species reporting (e.g. BALBc and C57Bl6 for MusMusculus)
- -b: report IMGT records that cannot be parsed properly (missing conserved residues, etc.)
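When the parser is called from a script, the command line can be assembled programmatically. The following is a minimal sketch that only uses the flags listed above; the helper name `build_command`, its keyword arguments, and the default jar path are assumptions made for illustration, not part of the tool.

```python
from typing import List


def build_command(raw_file: str, output_prefix: str, *,
                  non_functional: bool = False,
                  minor_alleles: bool = False,
                  species_detail: bool = False,
                  report_bad: bool = False,
                  jar: str = "segmentparser.jar") -> List[str]:
    """Return the argument list for one SegmentParser run.

    The jar location is an assumption; adjust it to wherever the
    compiled binary actually lives.
    """
    flags = []
    if non_functional:
        flags.append("-n")  # include pseudogenes and other non-functional segments
    if minor_alleles:
        flags.append("-m")  # include *02, *03, ... alleles
    if species_detail:
        flags.append("-s")  # toggles species detail; the default state is not documented here
    if report_bad:
        flags.append("-b")  # report records that cannot be parsed
    return ["java", "-jar", jar, *flags, raw_file, output_prefix]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the standard library, which avoids shell quoting problems with file names.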
Output files include:
- A $output_prefix$.metadata.txt file with summary statistics.
- Files with erroneous/bad records: $output_prefix$.nojrefpoint.txt, $output_prefix$.novrefpoint.txt, $output_prefix$.othersegm.txt.
- An output file containing sequences, CDR3 reference points and CDR1,2,2.5 coordinates: $output_prefix$.txt.
SegmentParser generates a tab-delimited table with the species name, gene and segment id, nucleotide sequence and reference point position. The reference point is a 0-based coordinate: the first nucleotide after the conserved Cys for Variable segments, and the first nucleotide before the conserved Phe/Trp for Joining segments. The metadata table provided with the results lists all species and genes and indicates whether any V/D/J segments are associated with them (0 or 1 in the corresponding row).
- A file with CDR1,2,2.5 nucleotide and amino acid sequences: $output_prefix$.txt (only includes V segments).
Note that CDR2.5 is a putative MHC-binding region of the TCR V segment, defined in recent work from the Paul Thomas lab (Dash et al., Nature 2017).
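The reference point convention described above can be checked on individual records. The sketch below is an illustration rather than the parser's own logic. It assumes ungapped nucleotide sequences, and for Joining segments it assumes the reference point is the index of the last nucleotide before the Phe/Trp codon, which is one reading of the definition. The function names are hypothetical.

```python
CYS_CODONS = {"TGT", "TGC"}
PHE_TRP_CODONS = {"TTT", "TTC", "TGG"}


def v_has_conserved_cys(sequence: str, ref_point: int) -> bool:
    """Check that the codon ending right before ref_point encodes Cys.

    ref_point is the 0-based index of the first nucleotide after the
    conserved Cys, so the Cys codon is sequence[ref_point - 3:ref_point].
    """
    if ref_point < 3 or ref_point > len(sequence):
        return False
    return sequence[ref_point - 3:ref_point].upper() in CYS_CODONS


def j_has_conserved_phe_trp(sequence: str, ref_point: int) -> bool:
    """Check that the codon starting right after ref_point encodes Phe or Trp.

    Assumption: ref_point is the 0-based index of the last nucleotide
    before the conserved Phe/Trp codon, so that codon is
    sequence[ref_point + 1:ref_point + 4].
    """
    start = ref_point + 1
    if start < 0 or start + 3 > len(sequence):
        return False
    return sequence[start:start + 3].upper() in PHE_TRP_CODONS
```

For example, `v_has_conserved_cys("GCCTGTGCCAGC", 6)` inspects the codon `TGT` at positions 3 to 5. If the check fails for many records, the indexing assumption for that segment type is probably off by one and should be revisited against the actual output.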