ncbi_peregrine

Branch  GitHub Actions Codecov
------- -------------- ----------
master  R-CMD-check    codecov.io
develop R-CMD-check    codecov.io

Branch  GitHub Actions
------- --------------
master  make
develop make

NCBI results, part of the code of Bilderbeek, Richèl JC, et al. "Transmembrane Helices Are an Over-Presented and Evolutionarily Conserved Source of Major Histocompatibility Complex Class I and II Epitopes." Frontiers in immunology 12 (2021).

Experiment

  1. Collect 1123 membrane proteins' gene IDs
  2. Convert all 1123 gene IDs to gene names
  3. Per gene name, find the SNP IDs
  4. Per SNP ID, get the variation (in HGVS format)
  5. Per variation that changes the protein structure, score the topology
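
The result is a table with one row per variation:

Gene ID Gene name SNP ID     variation               is_in_tmh p_in_tmh
------- --------- ---------- ----------------------- --------- --------
7124    TNF       1583049783 NP_000585.2:p.Gly144Asp FALSE     0.1
.       .         Another    NP_000585.2:p.Gly144Asp FALSE     0.2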
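
Step 5 only considers variations that change the protein. As a minimal sketch, assuming this means that the HGVS protein notation does not end in `=` (the synonymous marker, as in `p.Val35=`), such a filter could look like the Python below. The repository's own scripts are written in R; the names `parse_variation` and `changes_protein` and the regular expression are illustrative assumptions, and only simple substitutions of the form shown in this README are handled.

```python
import re

# Hypothetical pattern for simple HGVS protein substitutions, such as
# 'NP_000585.2:p.Gly144Asp', and synonymous variations, such as
# 'NP_001156469.1:p.Val35='. Other HGVS forms (deletions, frameshifts,
# and so on) are deliberately not covered by this sketch.
HGVS_PROTEIN = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):p\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<position>\d+)(?P<alt>[A-Z][a-z]{2}|=)$"
)


def parse_variation(variation):
    """Split a variation into its parts, or return None if it is not recognised."""
    match = HGVS_PROTEIN.match(variation.strip())
    if match is None:
        return None
    return {
        "accession": match.group("accession"),
        "ref": match.group("ref"),
        "position": int(match.group("position")),  # 1-based residue index
        "alt": match.group("alt"),
    }


def changes_protein(variation):
    """Assumption: a variation changes the protein if it is a recognised
    substitution whose new residue differs from the original one."""
    parsed = parse_variation(variation)
    if parsed is None:
        return False
    return parsed["alt"] != "=" and parsed["alt"] != parsed["ref"]


print(changes_protein("NP_000585.2:p.Gly144Asp"))  # a substitution
print(changes_protein("NP_001156469.1:p.Val35="))  # a synonymous variation
```

Variations that the pattern does not recognise are treated as not changing the protein here; a real pipeline would more likely report them.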

Files

Input files

None

Intermediate files

:heavy_check_mark: gene_ids.csv

gene_id
-------
1956
7124
348
7040
3091
3586

:heavy_check_mark: gene_names.csv

One file, gene_names.csv, created by create_gene_names.R: a tibble with two columns, gene_id and gene_name.

gene_id gene_name
------- ---------
1956    EGFR
7124    TNF
348     APOE
7040    TGFB1
3091    HIF1A
3586    IL10

:heavy_check_mark: [gene_name]_snps.csv

Per gene name, a file named [gene_name]_snps.csv, created by create_gene_name_snps.R, each a tibble with one column, snp_id.

When all [gene_name]_snps.csv files are created, the file create_gene_name_snps_is_done.txt is created.

snp_id    
----------
1583049783
...       

:white_check_mark: [gene_name]_variations.rds

Per [gene_name]_snps.csv, a file named [gene_name]_variations.rds, created by create_snp_variations_rds.R, each a list of tibbles with two columns: snp_id and variation. Each tibble can have zero to dozens of rows.

When all [gene_name]_variations.rds files are created, the file create_snp_variations_rds_is_done.txt is created.

[[1]]
# A tibble: 0 x 2
# ... with 2 variables: snp_id <dbl>, variation <chr>

[[2]]
# A tibble: 1 x 2
      snp_id variation              
       <dbl> <chr>                  
1 1599031008 NP_001156469.1:p.Val35=

[[4]]
# A tibble: 2 x 2
      snp_id variation                
       <dbl> <chr>                    
1 1599030856 NP_001156469.1:p.Trp20Arg
2 1599030856 NP_001156469.1:p.Trp20Cys

[[15]]
# A tibble: 0 x 2
# ... with 2 variables: snp_id <dbl>, variation <chr>

:heavy_check_mark: [gene_name]_variations.csv

Per [gene_name]_variations.rds, a file named [gene_name]_variations.csv, created by create_snp_variations_csv.R, each a tibble with two columns: snp_id and variation.

When all [gene_name]_variations.csv files are created, the file create_snp_variations_csv_is_done.txt is created.

snp_id     variation
---------- -----------------------
1583049783 NP_000585.2:p.Gly144Asp
...        ...
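
Going from the .rds list to the .csv table amounts to stacking the per-SNP tibbles, where empty tibbles contribute no rows. A minimal sketch in Python follows; the actual script, create_snp_variations_csv.R, is written in R and may work differently, and `flatten_variations` and `to_csv_text` are hypothetical names. The rows are the ones shown in the .rds example above.

```python
import csv
import io


def flatten_variations(tables):
    """Stack a list of per-SNP tables (lists of row dicts) into one list of rows."""
    return [row for table in tables for row in table]


def to_csv_text(rows):
    """Render the rows as CSV text with the columns snp_id and variation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["snp_id", "variation"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One table per SNP; a SNP without protein-level variations gives an empty table
tables = [
    [],
    [{"snp_id": 1599031008, "variation": "NP_001156469.1:p.Val35="}],
    [
        {"snp_id": 1599030856, "variation": "NP_001156469.1:p.Trp20Arg"},
        {"snp_id": 1599030856, "variation": "NP_001156469.1:p.Trp20Cys"},
    ],
    [],
]

rows = flatten_variations(tables)
csv_text = to_csv_text(rows)
print(csv_text)
```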

[gene_name].fasta

The script create_fasta_files.R, per gene name, reads the [gene_name]_variations.csv file and creates a file [gene_name].fasta with the sequences of all proteins named in the variations.

When all [gene_name].fasta files are created, the file create_fasta_files_is_done.txt is created.

> NP_001007554.1
FANTASTICALLY
> NP_001229821.1
FAMILYVW

One SNP can affect multiple proteins. For example, https://www.ncbi.nlm.nih.gov/snp/rs1570884790 has these variations:

NP_001007554.1:p.Val754Gly
NP_001229821.1:p.Val754Gly
NP_009089.4:p.Val723Gly
NP_001229822.1:p.Val723Gly
NP_001123995.1:p.Val769Gly
NP_001229820.1:p.Val800Gly 
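
Because one SNP can name several proteins, the set of sequences to put in a FASTA file follows from the distinct accessions in front of the `:p.` part of each variation. A minimal Python sketch of that extraction, assuming the variations look like the ones listed above (`protein_accessions` is a hypothetical helper, and fetching the sequences themselves is not shown):

```python
def protein_accessions(variations):
    """Return the distinct protein accessions, in order of first appearance."""
    accessions = []
    for variation in variations:
        # The accession is everything before the first colon
        accession = variation.strip().split(":", 1)[0]
        if accession and accession not in accessions:
            accessions.append(accession)
    return accessions


variations = [
    "NP_001007554.1:p.Val754Gly",
    "NP_001229821.1:p.Val754Gly",
    "NP_009089.4:p.Val723Gly",
    "NP_001229822.1:p.Val723Gly",
    "NP_001123995.1:p.Val769Gly",
    "NP_001229820.1:p.Val800Gly ",
]

for accession in protein_accessions(variations):
    print(accession)
```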

[gene_name].topo

The script create_topo_files.R, per gene name, reads the [gene_name].fasta file and creates a file [gene_name].topo with the topology of these proteins.

When all [gene_name].topo files are created, the file create_topo_files_is_done.txt is created.

> NP_001007554.1
0000000110000
> NP_001229821.1
0000000000000
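
The .topo format shown above mirrors FASTA: a line starting with `>` names the protein, and the following line holds one character per residue. A minimal Python sketch of a reader for this layout (`parse_topo` is a hypothetical helper; it also accepts topologies wrapped over several lines, which the example above does not show):

```python
def parse_topo(text):
    """Map each protein name to its topology string."""
    topologies = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: drop the '>' and any spaces around the name
            name = line[1:].strip()
            topologies[name] = ""
        elif name is not None:
            # Topology line: append to the protein currently being read
            topologies[name] += line
    return topologies


example = """> NP_001007554.1
0000000110000
> NP_001229821.1
0000000000000
"""

topologies = parse_topo(example)
for name, topology in topologies.items():
    print(name, topology)
```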

:white_check_mark: [gene_name]_is_in_tmh.csv

The script create_is_in_tmh_files.R, per gene name, reads the [gene_name]_variations.csv and [gene_name].topo files. For each variation, it determines whether the variation is in a TMH (is_in_tmh), as well as the proportion of the protein that is TMH (p_in_tmh).

variation               is_in_tmh p_in_tmh
----------------------- --------- --------
NP_000585.2:p.Gly144Asp FALSE     0.123
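
How these two values could follow from a variation and a topology is sketched below in Python. This is not the code of create_is_in_tmh_files.R: it assumes that a `1` in the topology marks a residue inside a TMH, that the number in the variation is a 1-based residue position, and `tmh_stats` is a hypothetical name. The variations in the usage lines are made up to fit the 13-residue toy topology shown earlier.

```python
import re


def tmh_stats(variation, topology):
    """Return (is_in_tmh, p_in_tmh) for one variation and its protein's topology.

    Assumptions: '1' marks a residue inside a TMH, and the number in the
    variation (144 in 'p.Gly144Asp') is a 1-based residue position.
    """
    match = re.search(r":p\.[A-Za-z]{3}(\d+)", variation)
    if match is None:
        raise ValueError("cannot find a residue position in: " + variation)
    position = int(match.group(1))
    if not 1 <= position <= len(topology):
        raise ValueError("position lies outside the topology: " + variation)
    is_in_tmh = topology[position - 1] == "1"
    p_in_tmh = topology.count("1") / len(topology)
    return is_in_tmh, p_in_tmh


topology = "0000000110000"  # toy topology from the .topo example

# Made-up variations that fit the toy topology
print(tmh_stats("NP_001007554.1:p.Val8Gly", topology))
print(tmh_stats("NP_001007554.1:p.Val3Gly", topology))
```

Raising an error for a position outside the topology is a design choice of this sketch: a mismatch between the variation and the sequence is more likely a data problem than a variation outside a TMH.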

Results files

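gene_id gene_name snp_id     variation               is_in_tmh p_in_tmh n_tmh
------- --------- ---------- ----------------------- --------- -------- -----
7124    TNF       1583049783 NP_000585.2:p.Gly144Asp FALSE     0.123    314
...     ...       ...        ...                     ...       ...      271

The bars below indicate which file supplies which of these columns:
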
|----------------|
  gene_names.csv

                 |----------|
                 [gene_name]_snps.csv

                 |----------------------------------|
                      [gene_name]_variations.csv

                            |------------------------------------------|
                                     [gene_name]_is_in_tmh.csv

                            |-----------------------|                  |-----|
                                                      [gene_name].topo

Estimated time

In reality (four runs):

real	63m32.145s
user	49m42.254s
sys	0m31.846s

real	61m2.904s
user	44m32.908s
sys	0m28.639s

real	63m14.851s
user	47m47.423s
sys	0m32.378s

real	63m25.927s
user	45m41.377s
sys	0m29.540s
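
To summarise these runs as one number, the `real` (wall-clock) values can be converted to seconds and averaged. A small Python sketch, with `to_seconds` as a hypothetical helper for the `XmY.YYYs` notation used above:

```python
import re


def to_seconds(duration):
    """Convert a duration such as '63m32.145s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", duration.strip())
    if match is None:
        raise ValueError("unrecognised duration: " + duration)
    return int(match.group(1)) * 60 + float(match.group(2))


# The four 'real' values measured above
real_times = ["63m32.145s", "61m2.904s", "63m14.851s", "63m25.927s"]

mean_seconds = sum(to_seconds(t) for t in real_times) / len(real_times)
print(f"mean wall-clock time: {mean_seconds / 60:.1f} minutes")
```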

How are the figures created?

By running the tests of ncbi_results locally.

Downloads