Frame2seq

Official repository for Frame2seq, a structure-conditioned masked language model for protein sequence design, as described in our preprint, "Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space."

<p align="center"><img src="https://github.com/dakpinaroglu/Frame2seq/blob/main/.github/frame2seq_net_arc.png"/></p>

Colab notebook

Colab notebook for generating sequences with Frame2seq: Open In Colab

Setup

To use Frame2seq, install via pip:

pip install frame2seq

If previously installed via pip, upgrade to the latest version:

pip install --upgrade frame2seq

Usage

Sequence design

To generate sequences with Frame2seq, call the design method of Frame2seqRunner.

from frame2seq import Frame2seqRunner

runner = Frame2seqRunner()
runner.design(pdb_file, chain_id, temperature, num_samples)
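The call above uses placeholder names. A minimal sketch with concrete values might look like the following. Only `Frame2seqRunner()` and `runner.design(pdb_file, chain_id, temperature, num_samples)` come from this README; the helper names `check_design_inputs` and `design_sequences`, the default values, the file name `2fra.pdb`, and the input checks (including the requirement that temperature be positive) are illustrative assumptions, not part of the Frame2seq API.

```python
from pathlib import Path


def check_design_inputs(pdb_file, chain_id, temperature, num_samples):
    """Hypothetical pre-flight checks; Frame2seq itself may validate differently."""
    problems = []
    if not Path(pdb_file).is_file():
        problems.append(f"PDB file not found: {pdb_file}")
    if not isinstance(chain_id, str) or not chain_id:
        problems.append("chain_id must be a non-empty string such as 'A'")
    if temperature <= 0:
        # Assumption: a sampling temperature is expected to be positive.
        problems.append("temperature must be positive")
    if int(num_samples) < 1:
        problems.append("num_samples must be at least 1")
    return problems


def design_sequences(pdb_file, chain_id, temperature=1.0, num_samples=10):
    """Check the inputs, then call runner.design as shown in this README."""
    problems = check_design_inputs(pdb_file, chain_id, temperature, num_samples)
    if problems:
        raise ValueError("; ".join(problems))
    # Imported lazily so the checks above also run where frame2seq is absent.
    from frame2seq import Frame2seqRunner

    runner = Frame2seqRunner()
    # Outputs are written to disk (see Outputs); any return value is passed through.
    return runner.design(pdb_file, chain_id, temperature, num_samples)


# Example call (requires frame2seq and a local PDB file):
# design_sequences("2fra.pdb", "A", temperature=1.0, num_samples=10)
```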

Arguments

Outputs

A .fasta file containing all sampled sequences is automatically saved. If save_indiv_seqs is True, individual .fasta files for each sampled sequence are also saved.

>pdbid=2fra chain_id=A recovery=62.67% score=0.83 temperature=1.0
PPSSVDWRDLGCITDVLDMGGCGACWAFSAVGALEARTTQKTGELTRLSAQDLVDCAREKYGNEGCDGGRMKSSFQFIIDKNGIDSHQAYPFTASDQECLYNSKYKAATCTDYTVLPEGDEDKLREAVSNVGPVAVGIDATHPEFRNFKSGVYHDPKCTTETNHGVLVVGYGTLKGKRFYKVKTCWGTYFGEDGFIRVAKNQGNHCGISTDPSYPEM
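For downstream filtering, the header fields in the example above can be read into a dictionary. This is a small sketch that assumes every header follows the same space-separated `key=value` layout as the single example shown here; the function names `parse_design_header` and `read_designs` are hypothetical and not part of Frame2seq.

```python
def parse_design_header(header):
    """Parse '>pdbid=2fra chain_id=A recovery=62.67% ...' into a dict."""
    fields = {}
    for token in header.lstrip(">").split():
        key, _, value = token.partition("=")
        fields[key] = value
    # Convert the numeric fields seen in the example header, when present.
    if "recovery" in fields:
        fields["recovery"] = float(fields["recovery"].rstrip("%"))
    for key in ("score", "temperature"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields


def read_designs(fasta_text):
    """Return a list of (metadata, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((parse_design_header(header), "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((parse_design_header(header), "".join(chunks)))
    return records
```

The resulting list can then be sorted or filtered by `recovery` or `score` with ordinary Python tools.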

If save_indiv_neg_pll is True, a .csv file containing the per-residue negative pseudo-log-likelihoods of the sampled sequences is also saved.

Advanced sequence design

For finer control over sequence generation, pass additional arguments to the design method.

from frame2seq import Frame2seqRunner

runner = Frame2seqRunner()
runner.design(pdb_file, chain_id, temperature, num_samples, omit_AA=['C'], fixed_positions=[1, 3, 11])
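To build intuition for how options like these usually interact, here is a generic sketch of temperature sampling from per-position logits with excluded amino acids and fixed positions. It is not Frame2seq's implementation. The function `sample_sequence`, the 20-letter alphabet ordering, the 1-indexed interpretation of `fixed_positions`, and the choice to keep the native residue at fixed positions are all assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the logit columns


def sample_sequence(logits, native_seq, temperature=1.0,
                    omit_AA=(), fixed_positions=(), rng=None):
    """Sample one sequence from per-position logits (illustrative only)."""
    rng = rng or random.Random()
    fixed = set(fixed_positions)
    designed = []
    for pos, (row, native) in enumerate(zip(logits, native_seq), start=1):
        if pos in fixed:
            # Assumption: fixed positions keep the native residue.
            designed.append(native)
            continue
        # Lower temperature sharpens the distribution, higher flattens it.
        scaled = [x / temperature for x in row]
        peak = max(s for aa, s in zip(AMINO_ACIDS, scaled) if aa not in omit_AA)
        # Omitted amino acids get zero weight, so they are never drawn.
        weights = [
            0.0 if aa in omit_AA else math.exp(s - peak)
            for aa, s in zip(AMINO_ACIDS, scaled)
        ]
        designed.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(designed)


# Toy usage with uniform logits over a 12-residue chain.
toy_logits = [[0.0] * 20 for _ in range(12)]
toy_design = sample_sequence(
    toy_logits, "C" * 12, temperature=1.0,
    omit_AA=["C"], fixed_positions=[1, 3, 11], rng=random.Random(0),
)
```

In this toy run, positions 1, 3, and 11 stay as the native residue while every other position is drawn from the 19 amino acids that remain after omitting cysteine.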

Scoring

To score sequences with Frame2seq, call the score method of Frame2seqRunner.

The following scores the sequence contained in the PDB file against the PDB backbone.

from frame2seq import Frame2seqRunner

runner = Frame2seqRunner()
runner.score(pdb_file, chain_id)

The following scores every sequence in the given .fasta file against the PDB backbone.

from frame2seq import Frame2seqRunner

runner = Frame2seqRunner()
runner.score(pdb_file, chain_id, fasta_file)
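If the candidate sequences live in Python rather than on disk, they first need to be written to a .fasta file. The helper below is a sketch; `write_fasta`, the record names, and the restriction to the 20 standard amino acid letters are assumptions for illustration and are not part of Frame2seq.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def write_fasta(sequences, path):
    """Write a {name: sequence} mapping to a FASTA file and return its path."""
    # Validate everything first so a bad record never leaves a partial file.
    for name, seq in sequences.items():
        unknown = set(seq) - STANDARD_AA
        if unknown:
            raise ValueError(f"{name}: unexpected characters {sorted(unknown)}")
    with open(path, "w") as handle:
        for name, seq in sequences.items():
            handle.write(f">{name}\n{seq}\n")
    return path


# Usage with the scoring call shown above (requires frame2seq):
# fasta_file = write_fasta({"design_1": "MKTAYIAK"}, "candidates.fasta")
# runner.score(pdb_file, chain_id, fasta_file)
```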

Arguments

Outputs

A .csv file containing the average negative pseudo-log-likelihoods of the given sequence(s) is automatically saved. If save_indiv_neg_pll is True, per-residue negative pseudo-log-likelihoods are also saved in individual .csv files.
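As a rough illustration of how the two outputs relate, the sketch below computes per-residue negative pseudo-log-likelihoods and their average from a list of probabilities. It assumes each per-residue value is the negative natural log of the probability the model assigns to the residue at that position, and that the reported average is the arithmetic mean over positions; this README does not state either detail, and both function names are hypothetical.

```python
import math


def neg_pll_per_residue(residue_probs):
    """Negative log of the probability assigned to each residue (assumed natural log)."""
    return [-math.log(p) for p in residue_probs]


def average_neg_pll(per_residue):
    """Arithmetic mean of per-residue values (assumed reduction)."""
    return sum(per_residue) / len(per_residue)


# Toy probabilities for a three-residue sequence.
per_residue = neg_pll_per_residue([1.0, 0.5, 0.25])
average = average_neg_pll(per_residue)
```

Under these assumptions, lower values mean the sequence is more probable given the backbone, so candidates can be ranked in ascending order of their average.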

Citing this work

@article{akpinaroglu2023structure,
  title={Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space},
  author={Akpinaroglu, Deniz and Seki, Kosuke and Guo, Amy and Zhu, Eleanor and Kelly, Mark JS and Kortemme, Tanja},
  journal={bioRxiv},
  pages={2023--12},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
