ncbi_peregrine
Branch | GitHub Actions |
---|---|
master | |
develop | |
NCBI results, part of the code of Bilderbeek, Richèl JC, et al. "Transmembrane Helices Are an Over-Presented and Evolutionarily Conserved Source of Major Histocompatibility Complex Class I and II Epitopes." Frontiers in immunology 12 (2021).
Experiment
- Collect 1123 membrane proteins' gene IDs
- Convert all 1123 gene IDs to gene names
- Per gene name, find the SNP IDs
- Per SNP ID, get the variation (in HGVS format)
- Per variation that changes the protein structure, score the topology
Gene ID | Gene name | SNP ID | variation | is_in_tmh | p_in_tmh |
---|---|---|---|---|---|
7124 | TNF | 1583049783 | NP_000585.2:p.Gly144Asp | FALSE | 0.1 |
. | . | Another | NP_000585.2:p.Gly144Asp | FALSE | 0.2 |
Files
Input files
None
Intermediate files
:heavy_check_mark: gene_ids.csv
- 1 gene IDs file: gene_ids.csv, created by create_gene_ids.R, a tibble with one column gene_id
gene_id |
---|
1956 |
7124 |
348 |
7040 |
3091 |
3586 |
:heavy_check_mark: gene_names.csv
- 1 gene names file: gene_names.csv, created by create_gene_names.R, a tibble with two columns: gene_id and gene_name
gene_id | gene_name |
---|---|
1956 | EGFR |
7124 | TNF |
348 | APOE |
7040 | TGFB1 |
3091 | HIF1A |
3586 | IL10 |
:heavy_check_mark: [gene_name]_snps.csv
Per gene name, a file named [gene_name]_snps.csv, created by create_gene_name_snps.R, each a tibble with one column: snp_id.
When all [gene_name]_snps.csv files are created, the file create_gene_name_snps_is_done.txt is created.
snp_id |
---|
1583049783 |
... |
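The "is done" marker convention described above can be sketched as follows. This is a minimal Python illustration, not the repository's R code: the function name `write_done_marker`, its parameters and the marker's content are assumptions made for the example.

```python
from pathlib import Path


def write_done_marker(folder, gene_names, suffix, marker_name):
    """Write a marker file once every gene name has its output file.

    Returns True if the marker was written, False if any file is missing.
    Hypothetical helper: the real pipeline does this in its R scripts.
    """
    folder = Path(folder)
    # Collect the gene names whose per-gene file does not exist yet
    missing = [g for g in gene_names if not (folder / f"{g}{suffix}").exists()]
    if missing:
        return False
    # The marker's content is arbitrary; only its existence matters
    (folder / marker_name).write_text("done\n")
    return True
```

A call such as `write_done_marker(".", ["TNF", "EGFR"], "_snps.csv", "create_gene_name_snps_is_done.txt")` would then only create the marker when both TNF_snps.csv and EGFR_snps.csv are present.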
:white_check_mark: [gene_name]_variations.rds
Per [gene_name]_snps.csv, a file named [gene_name]_variations.rds, created by create_snp_variations_rds.R, each a list of tibbles with two columns: snp_id and variation.
Each tibble can have zero to dozens of rows.
When all [gene_name]_variations.rds files are created, the file create_snp_variations_rds_is_done.txt is created.
[[1]]
# A tibble: 0 x 2
# ... with 2 variables: snp_id <dbl>, variation <chr>
[[2]]
# A tibble: 1 x 2
snp_id variation
<dbl> <chr>
1 1599031008 NP_001156469.1:p.Val35=
[[4]]
# A tibble: 2 x 2
snp_id variation
<dbl> <chr>
1 1599030856 NP_001156469.1:p.Trp20Arg
2 1599030856 NP_001156469.1:p.Trp20Cys
[[15]]
# A tibble: 0 x 2
# ... with 2 variables: snp_id <dbl>, variation <chr>
:heavy_check_mark: [gene_name]_variations.csv
Per [gene_name]_variations.rds, a file named [gene_name]_variations.csv, created by create_snp_variations_csv.R, each a tibble with two columns: snp_id and variation.
When all [gene_name]_variations.csv files are created, the file create_snp_variations_csv_is_done.txt is created.
snp_id | variation |
---|---|
1583049783 | NP_000585.2:p.Gly144Asp |
... | ... |
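The variation column holds protein-level HGVS strings. As a rough illustration of how such a string splits into its parts, here is a Python sketch. It is not the repository's R code, the name `parse_variation` is hypothetical, and it assumes only the two forms shown in this README: substitutions such as p.Gly144Asp and synonymous variations such as p.Val35=. Other HGVS forms (deletions, frameshifts and so on) are deliberately not handled and return None.

```python
import re

# Simple protein-level HGVS: accession, reference residue, position, new residue.
# "=" as the new residue denotes a synonymous variation (no protein change).
HGVS_PROTEIN = re.compile(
    r"^(?P<protein>[A-Z]{2}_\d+\.\d+):p\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2}|=)$"
)


def parse_variation(variation):
    """Split a variation string into its parts, or return None if unsupported."""
    match = HGVS_PROTEIN.match(variation)
    if match is None:
        return None
    ref, alt = match["ref"], match["alt"]
    return {
        "protein": match["protein"],
        "ref": ref,
        "pos": int(match["pos"]),  # 1-based position in the protein
        "alt": ref if alt == "=" else alt,
        "changes_protein": alt not in ("=", ref),
    }
```

With this sketch, `parse_variation("NP_000585.2:p.Gly144Asp")` yields position 144 with `changes_protein` set to True, whereas `NP_001156469.1:p.Val35=` yields `changes_protein` set to False.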
[gene_name].fasta
The script create_fasta_files.R, per gene name, reads the [gene_name]_variations.csv file, and creates a file [gene_name].fasta with the sequences of all the variations' proteins.
When all [gene_name].fasta files are created, the file create_fasta_files_is_done.txt is created.
> NP_001007554.1
FANTASTICALLY
> NP_001229821.1
FAMILYVW
For example, https://www.ncbi.nlm.nih.gov/snp/rs1570884790 is a SNP that affects multiple proteins:
NP_001007554.1:p.Val754Gly
NP_001229821.1:p.Val754Gly
NP_009089.4:p.Val723Gly
NP_001229822.1:p.Val723Gly
NP_001123995.1:p.Val769Gly
NP_001229820.1:p.Val800Gly
[gene_name].topo
The script create_topo_files.R, per gene name, reads the [gene_name].fasta file, and creates a file [gene_name].topo with the topology of these proteins.
When all [gene_name].topo files are created, the file create_topo_files_is_done.txt is created.
> NP_001007554.1
0000000110000
> NP_001229821.1
0000000000000
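Both the .fasta and the .topo examples use the same layout: a line starting with > that names the protein, followed by its sequence or topology. A minimal Python sketch of reading that layout is shown below. The function name `parse_records` is hypothetical, and the handling of blank lines and multi-line records is an assumption; the repository itself does this in R.

```python
def parse_records(text):
    """Parse '> name' / content records into a dict of name -> content."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: drop the '>' and any surrounding whitespace
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Content line: append, so multi-line records are joined
            records[name] += line
    return records


example = "> NP_001007554.1\n0000000110000\n> NP_001229821.1\n0000000000000\n"
topologies = parse_records(example)
```

Here `topologies["NP_001007554.1"]` holds the 13-character topology string of the first protein.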
:white_check_mark: [gene_name]_is_in_tmh.csv
The script create_is_in_tmh_files.R, per gene name, reads the [gene_name]_variations.csv and [gene_name].topo files.
For each variation, it determines whether the variation is in a TMH, as well as the proportion of TMH in the protein.
variation | is_in_tmh | p_in_tmh |
---|---|---|
NP_000585.2:p.Gly144Asp | FALSE | 0.123 |
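The scoring step can be illustrated with a short Python sketch. It assumes that a topology string marks each residue in a TMH with 1 and every other residue with 0, and that variation positions are 1-based as in HGVS. Both are assumptions drawn from the examples in this README, and `score_topology` is a hypothetical name, not a function from the repository.

```python
def score_topology(topology, position):
    """Return (is_in_tmh, p_in_tmh) for a 1-based position in a topology string.

    Assumes '1' marks a residue inside a transmembrane helix (TMH)
    and '0' marks a residue outside of one.
    """
    if not 1 <= position <= len(topology):
        raise ValueError("position lies outside the protein")
    # Convert the 1-based HGVS position to a 0-based string index
    is_in_tmh = topology[position - 1] == "1"
    # Proportion of residues in the protein that are part of a TMH
    p_in_tmh = topology.count("1") / len(topology)
    return is_in_tmh, p_in_tmh
```

For the example topology 0000000110000, position 8 is inside the TMH and position 1 is not, while the proportion of TMH is 2 out of 13 residues in both cases.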
Results files
- All raw output files in one table, results.csv, created by create_results.sh
gene_id | gene_name | snp_id | variation | is_in_tmh | p_in_tmh | n_tmh |
---|---|---|---|---|---|---|
7124 | TNF | 1583049783 | NP_000585.2:p.Gly144Asp | FALSE | 0.123 | 314 |
... | ... | ... | ... | ... | ... | 271 |
The columns of results.csv originate from these files: gene_names.csv, [gene_name]_snps.csv, [gene_name]_variations.csv, [gene_name]_is_in_tmh.csv and [gene_name].topo.
Estimated time
- 8 seconds per gene ID,
- Must be done in sequence for NCBI
- 8 seconds * 1123 jobs = 8984 secs ≈ 150 mins = 2.5 hours
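The estimate can be checked with a few lines of Python. The constant names are chosen for this illustration only; the exact product is 8984 seconds, which comes to roughly 150 minutes or 2.5 hours.

```python
SECONDS_PER_GENE_ID = 8  # estimated time per gene ID
N_GENE_IDS = 1123        # number of membrane proteins' gene IDs

# The jobs run in sequence, so the total is a simple product
total_seconds = SECONDS_PER_GENE_ID * N_GENE_IDS
total_minutes = total_seconds / 60
total_hours = total_minutes / 60

print(f"{total_seconds} s = {total_minutes:.0f} min = {total_hours:.1f} h")
```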
In reality:
real 63m32.145s
user 49m42.254s
sys 0m31.846s
real 61m2.904s
user 44m32.908s
sys 0m28.639s
real 63m14.851s
user 47m47.423s
sys 0m32.378s
real 63m25.927s
user 45m41.377s
sys 0m29.540s
How are the figures created?
By running the tests of ncbi_results locally.
Downloads
- 30 SNPs per gene ID: ncbi_peregrine_data_20201214.zip
- 60 SNPs per gene ID: ncbi_peregrine_data_20201219.zip