Source code for the paper "Protein Structure Accuracy Estimation using Geometry-Complete Perceptron Networks".

NOTE: A web server implementation is freely available at http://gcpnet-ema.missouri.edu.



Install Mamba

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result

Install dependencies

# clone project
git clone https://github.com/BioinfoMachineLearning/GCPNet-EMA

# create conda environment
mamba env create -f environment.yaml # NOTE: use `cpu_environment.yaml` instead for web server deployment
conda activate GCPNet-EMA  # NOTE: one still needs to use `conda` to (de)activate environments

# install local project as package
pip3 install -e .

# install ProDy separately to avoid a Biopython version conflict with Ankh
pip3 install prody==2.4.1

# uninstall protobuf (if installed) due to (potential) local GLIBCXX conflicts
pip3 uninstall protobuf

Note: TM-score is required to score predicted protein structures. It can be installed as follows:

# download and compile TM-score
mkdir -p ~/Programs && cd ~/Programs
wget https://zhanggroup.org/TM-score/TMscore.cpp
g++ -static -O3 -ffast-math -lm -o TMscore TMscore.cpp
rm TMscore.cpp

Make sure to update the tmscore_exec_path value (e.g., in configs/paths/default.yaml) to reflect where you placed the TM-score executable on your machine. Also make sure that lddt_exec_path points to bin/lddt within your GCPNet-EMA Conda environment, where lddt is installed automatically as specified in environment.yaml.
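Before training or evaluating, it can help to confirm that both executables are actually reachable. The following is a minimal Python sketch, not part of the repository: the helper name `is_runnable` is hypothetical, and the example paths assume the TM-score build location used above and a default `~/mambaforge` install. Substitute the values you set for tmscore_exec_path and lddt_exec_path.

```python
import os
from pathlib import Path


def is_runnable(exec_path: str) -> bool:
    """Return True if exec_path is an existing file with execute permission."""
    path = Path(exec_path).expanduser()
    return path.is_file() and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Assumed example locations; replace with the values from your paths config.
    candidates = {
        "tmscore_exec_path": "~/Programs/TMscore",
        "lddt_exec_path": "~/mambaforge/envs/GCPNet-EMA/bin/lddt",
    }
    for key, value in candidates.items():
        status = "ok" if is_runnable(value) else "missing or not executable"
        print(f"{key}: {value} -> {status}")
```

Running the script prints one status line per configured path, so a misplaced executable is caught before a long training or evaluation run starts.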

GCPNet for protein structure EMA (GCPNet-EMA)

How to prepare data and checkpoints for GCPNet-EMA

Download training and evaluation data as well as GCPNet-EMA model checkpoints

cd data/EMA/
wget https://zenodo.org/record/10719475/files/ema_decoy_model.tar.gz
wget https://zenodo.org/record/10719475/files/ema_true_model.tar.gz
tar -xzf ema_decoy_model.tar.gz
tar -xzf ema_true_model.tar.gz
cd ../../  # head back to the root project directory

wget -P checkpoints/ https://zenodo.org/record/10719475/files/structure_ema_finetuned_gcpnet_i2d5t9xh_best_epoch_106.ckpt
wget -P checkpoints/ https://zenodo.org/record/10719475/files/structure_denoising_pretrained_gcpnet.ckpt
wget -P checkpoints/ https://zenodo.org/record/10719475/files/structure_ema_finetuned_gcpnet_without_plddt_ije6iplr_best_epoch_055.ckpt
wget -P checkpoints/ https://zenodo.org/record/10719475/files/default_structure_ema_finetuned_gcpnet_without_plddt_or_esm_emb_p0p8c6pz_best_epoch_099.ckpt
wget -P checkpoints/ https://zenodo.org/record/10719475/files/structure_ema_finetuned_gcpnet_without_esm_emb_x8tjgsf4_best_epoch_027.ckpt
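Interrupted downloads are easy to miss, so one may want to verify that every checkpoint arrived. The sketch below is illustrative only: the function name `missing_checkpoints` is hypothetical, and the file names are copied from the wget commands above. It checks only for presence and non-zero size, not file integrity.

```python
from pathlib import Path
from typing import List

# Checkpoint file names taken from the download commands above.
EXPECTED_CHECKPOINTS = [
    "structure_ema_finetuned_gcpnet_i2d5t9xh_best_epoch_106.ckpt",
    "structure_denoising_pretrained_gcpnet.ckpt",
    "structure_ema_finetuned_gcpnet_without_plddt_ije6iplr_best_epoch_055.ckpt",
    "default_structure_ema_finetuned_gcpnet_without_plddt_or_esm_emb_p0p8c6pz_best_epoch_099.ckpt",
    "structure_ema_finetuned_gcpnet_without_esm_emb_x8tjgsf4_best_epoch_027.ckpt",
]


def missing_checkpoints(checkpoint_dir: str = "checkpoints") -> List[str]:
    """Return the expected checkpoint names that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [
        name
        for name in EXPECTED_CHECKPOINTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing or empty checkpoints:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected checkpoints are present.")
```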

How to train GCPNet-EMA

Train a model for the estimation of protein structure model accuracy (EMA) task

# NOTE: adjust feature ablation arguments as desired
python3 src/train.py experiment=gcpnet_ema.yaml model.ablate_af2_plddt=true model.ablate_gtn=true data.ablate_ankh_embeddings=true data.ablate_esm_embeddings=true

How to evaluate GCPNet-EMA

Reproduce our results for the (tertiary structure) EMA task


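# define checkpoint paths (downloaded to `checkpoints/` above)
default_ema_model_ckpt_path="checkpoints/default_structure_ema_finetuned_gcpnet_without_plddt_or_esm_emb_p0p8c6pz_best_epoch_099.ckpt"
af2_ema_model_ckpt_path="checkpoints/structure_ema_finetuned_gcpnet_without_esm_emb_x8tjgsf4_best_epoch_027.ckpt"

# NOTE: ensure feature ablation arguments match checkpoint type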
python3 src/eval.py data=ema model=gcpnet_ema logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$default_ema_model_ckpt_path" model.ablate_af2_plddt=true model.ablate_gtn=true data.ablate_ankh_embeddings=true data.ablate_esm_embeddings=true
python3 src/eval.py data=ema model=gcpnet_ema logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$af2_ema_model_ckpt_path" model.ablate_af2_plddt=false model.ablate_gtn=true data.ablate_ankh_embeddings=true data.ablate_esm_embeddings=true

Default EMA Model - No AlphaFold plDDT or ESM Embeddings as Inputs

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃          Test metric           ┃          DataLoader 0          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        test/PerModelMAE        │      0.04611478000879288       │
│        test/PerModelMSE        │      0.004228705074638128      │
│  test/PerModelPearsonCorrCoef  │       0.8075723052024841       │
│       test/PerResidueMAE       │      0.07066802680492401       │
│       test/PerResidueMSE       │      0.010494622401893139      │
│ test/PerResiduePearsonCorrCoef │       0.7123321890830994       │
│           test/loss            │      0.005345446057617664      │
└────────────────────────────────┴────────────────────────────────┘

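AlphaFold EMA Model - No ESM Embeddings as Inputs

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃          Test metric           ┃          DataLoader 0          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        test/PerModelMAE        │      0.042016904801130295      │
│        test/PerModelMSE        │      0.003771992400288582      │
│  test/PerModelPearsonCorrCoef  │       0.8381679654121399       │
│       test/PerResidueMAE       │      0.06481857597827911       │
│       test/PerResidueMSE       │      0.009247069247066975      │
│ test/PerResiduePearsonCorrCoef │       0.7482331991195679       │
│           test/loss            │      0.004621841479092836      │
└────────────────────────────────┴────────────────────────────────┘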

Note: Please contact us if you are interested in reproducing our results for the multimer structure EMA task described in the manuscript. We would be happy to provide a copy of the corresponding dataset.

How to predict lDDT scores for protein structures using GCPNet-EMA

Predict per-residue and per-model lDDT scores for 3D protein structures

predict_batch_size=1  # adjust as desired according to available GPU memory
num_workers=0  # note: required when initially processing new PDB file inputs, due to ESM's GPU usage

# NOTE: ensure feature ablation arguments match checkpoint type
python3 src/predict.py model=gcpnet_ema data=ema data.predict_input_dir=$MY_INPUT_PDB_DIR data.predict_true_dir=$MY_OPTIONAL_TRUE_PDB_DIR data.predict_output_dir=$MY_OUTPUTS_DIR data.predict_batch_size=$predict_batch_size data.num_workers=$num_workers logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$default_ema_model_ckpt_path" model.ablate_af2_plddt=true model.ablate_gtn=true data.ablate_ankh_embeddings=true data.ablate_esm_embeddings=true
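Because the ablation flags must match the checkpoint, it can be convenient to assemble the long prediction command programmatically rather than by hand. The sketch below is a hypothetical helper (`build_predict_command` is not part of the repository) that reproduces the key=value overrides from the command above and exposes only the two flags that differ between checkpoint types. It omits the optional data.predict_true_dir argument.

```python
import shlex
from typing import List


def build_predict_command(
    input_dir: str,
    output_dir: str,
    ckpt_path: str,
    ablate_af2_plddt: bool,
    ablate_esm_embeddings: bool,
    batch_size: int = 1,
    num_workers: int = 0,
) -> List[str]:
    """Assemble the argument list for src/predict.py using the overrides shown above."""

    def flag(value: bool) -> str:
        # The overrides above use lowercase true/false.
        return "true" if value else "false"

    return [
        "python3", "src/predict.py",
        "model=gcpnet_ema", "data=ema",
        f"data.predict_input_dir={input_dir}",
        f"data.predict_output_dir={output_dir}",
        f"data.predict_batch_size={batch_size}",
        f"data.num_workers={num_workers}",
        "logger=csv", "trainer.accelerator=gpu", "trainer.devices=1",
        f"ckpt_path={ckpt_path}",
        f"model.ablate_af2_plddt={flag(ablate_af2_plddt)}",
        "model.ablate_gtn=true",
        "data.ablate_ankh_embeddings=true",
        f"data.ablate_esm_embeddings={flag(ablate_esm_embeddings)}",
    ]


if __name__ == "__main__":
    # Default EMA checkpoint: no AlphaFold plDDT and no ESM embeddings as inputs.
    command = build_predict_command(
        input_dir="data/EMA/examples/decoy_model",
        output_dir="data/EMA/examples/outputs",
        ckpt_path="checkpoints/default_structure_ema_finetuned_gcpnet_without_plddt_or_esm_emb_p0p8c6pz_best_epoch_099.ckpt",
        ablate_af2_plddt=True,
        ablate_esm_embeddings=True,
    )
    print(shlex.join(command))  # pass `command` to subprocess.run(...) to execute it
```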

For example, one can predict per-residue and per-model lDDT scores for a batch of tertiary protein structure inputs, 6W6VE.pdb and 6W77K.pdb within data/EMA/examples/decoy_model, as follows:

python3 src/predict.py model=gcpnet_ema data=ema data.predict_input_dir=data/EMA/examples/decoy_model data.predict_output_dir=data/EMA/examples/outputs data.predict_batch_size=1 data.num_workers=0 data.python_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/python data.lddt_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/lddt data.pdbtools_dir="$HOME"/mambaforge/envs/gcpnet/lib/python3.10/site-packages/pdbtools/ logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path=checkpoints/default_structure_ema_finetuned_gcpnet_without_plddt_or_esm_emb_p0p8c6pz_best_epoch_099.ckpt model.ablate_af2_plddt=true model.ablate_gtn=true data.ablate_ankh_embeddings=true data.ablate_esm_embeddings=true

Note: After running the above command, an output CSV containing metadata for the predictions will be located at logs/predict/runs/YYYY-MM-DD_HH-MM-SS/predict_YYYYMMDD_HHMMSS_rank_0_predictions.csv, where the timestamp placeholders are replaced by the time at which the command completed. This CSV contains a column called predicted_annotated_pdb_filepath that identifies the temporary location of each input PDB file after it has been annotated with GCPNet-EMA's predicted per-residue lDDT scores. If a directory of ground-truth PDB files corresponding one-to-one with the inputs in data.predict_input_dir is provided as data.predict_true_dir, then metrics and PDB annotation filepaths are also reported in the output CSV to describe, quantitatively and qualitatively, how well GCPNet-EMA improved upon AlphaFold's initial per-residue plDDT values.
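To locate the annotated PDB files after a run, one could read the predictions CSV with Python's standard library. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the file layout follows the path pattern described in the note, and only the predicted_annotated_pdb_filepath column named there is read.

```python
import csv
import glob
import os
from typing import List, Optional


def latest_predictions_csv(log_root: str = "logs/predict/runs") -> Optional[str]:
    """Return the most recent predictions CSV under log_root, or None if there is none."""
    pattern = os.path.join(log_root, "*", "predict_*_rank_0_predictions.csv")
    # Run directories are named YYYY-MM-DD_HH-MM-SS, so sorted order is chronological.
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


def annotated_pdb_paths(csv_path: str) -> List[str]:
    """Collect the predicted_annotated_pdb_filepath column from a predictions CSV."""
    with open(csv_path, newline="") as handle:
        return [row["predicted_annotated_pdb_filepath"] for row in csv.DictReader(handle)]


if __name__ == "__main__":
    csv_path = latest_predictions_csv()
    if csv_path is None:
        print("No predictions CSV found under logs/predict/runs")
    else:
        for path in annotated_pdb_paths(csv_path):
            print(path)
```

Since the annotated files are written to a temporary location, copying them elsewhere soon after the run is a reasonable precaution.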

For developers

Set up pre-commit (one time only) for automatic code linting and formatting upon each git commit

pre-commit install

Manually reformat all files in the project, as desired

pre-commit run -a

Update dependencies in environment.yaml

mamba env export > env.yaml # e.g., run this after installing new dependencies locally
diff environment.yaml env.yaml # note the differences and copy accepted changes back into `environment.yaml`
rm env.yaml # clean up temporary environment file

Use Gunicorn to parallelize responses to web server requests across 4 workers using port 5000

SERVER_USE_CONFIG_0=true gunicorn -w 4 -b :5000 --timeout 300 src.wsgi:app

Test server locally using curl

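curl -X POST -F "title=6KHVA" -F "structure_upload=@data/EMA/test_examples/decoy_model/6KHVA.pdb" -F "results_email=username@email.com" "$SERVER_URL"  # NOTE: SERVER_URL is a placeholder for your server's upload endpoint (e.g., a route on http://localhost:5000)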

Create a user cronjob (via crontab -e) that checks every five minutes to make sure the Gunicorn web server is running and, if it is not, starts the server by running the Gunicorn command above

# NOTE: add this to your user cronjobs using `crontab -e`
*/5 * * * * pgrep -f "gunicorn -w 4 -b :5000 --timeout 300 src.wsgi:app" || cd /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA && ~/mambaforge/condabin/mamba run -n GCPNet-EMA SERVER_USE_CONFIG_0=true gunicorn -w 4 -b :5000 --timeout 300 --chdir /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA src.wsgi:app >> /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA/server_crontab_logfile.log 2>&1

The server should now be publicly available at gcpnet-ema.missouri.edu when running on port 5000, at gcpnet-ema-1.missouri.edu when running on port 5001, and so on up to port 5003 (as configured locally via one's Apache server proxy).

NOTE: You should substitute the /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA references above with the absolute path to your personal copy of the repository.

NOTE: Make sure to create in the project's local directory (i.e., ./) a .env file that contains values for four key environment variables for server support: (1) SERVER_EMAIL_ADDRESS (e.g., "bml@missouri.edu"); (2) SERVER_EMAIL_SMTP_SERVER (e.g., "massmail.missouri.edu"); (3) SERVER_EMAIL_PORT (e.g., "587" by default); and (4) SERVER_SECRET_KEY (initially generated by the Python secrets package).
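One way to produce such a .env file is sketched below. The helper name `render_env_file` is hypothetical, and the use of `secrets.token_hex(32)` is an assumption: the note above only says the key is generated with the Python secrets package, not which function or key length is used.

```python
import secrets


def render_env_file(email_address: str, smtp_server: str, smtp_port: str = "587") -> str:
    """Return .env contents for the four server variables, with a fresh secret key."""
    # Assumption: a 32-byte hex token (64 characters) is an acceptable secret key.
    secret_key = secrets.token_hex(32)
    lines = [
        f'SERVER_EMAIL_ADDRESS="{email_address}"',
        f'SERVER_EMAIL_SMTP_SERVER="{smtp_server}"',
        f'SERVER_EMAIL_PORT="{smtp_port}"',
        f'SERVER_SECRET_KEY="{secret_key}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values from the note above; redirect the output to ./.env to use it.
    print(render_env_file("bml@missouri.edu", "massmail.missouri.edu"), end="")
```

Keep the resulting file out of version control, since it contains the secret key.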

NOTE: You can configure deployment of different server versions using the environment variables SERVER_USE_CONFIG_{0,1,2,3}=true. Base config 0 ablates AF2 plDDT and ESM embeddings; 1 ablates just ESM embeddings; 2 ablates just AF2 plDDT; and 3 uses both AF2 plDDT and ESM embeddings.
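The four configurations can be summarized as a small lookup table. The sketch below is illustrative only and does not reproduce the server's actual code: the names `SERVER_CONFIGS` and `active_server_config` are hypothetical, and choosing the lowest-numbered variable set to "true" is an assumption about how conflicts might be resolved.

```python
import os
from typing import Dict, Mapping, Optional

# Ablation settings per server config, as described in the note above.
SERVER_CONFIGS: Dict[int, Dict[str, bool]] = {
    0: {"ablate_af2_plddt": True, "ablate_esm_embeddings": True},
    1: {"ablate_af2_plddt": False, "ablate_esm_embeddings": True},
    2: {"ablate_af2_plddt": True, "ablate_esm_embeddings": False},
    3: {"ablate_af2_plddt": False, "ablate_esm_embeddings": False},
}


def active_server_config(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the index of the first SERVER_USE_CONFIG_{i} variable set to "true"."""
    env = os.environ if environ is None else environ
    for index in sorted(SERVER_CONFIGS):
        if env.get(f"SERVER_USE_CONFIG_{index}", "").lower() == "true":
            return index
    raise ValueError("no SERVER_USE_CONFIG_{0,1,2,3}=true variable is set")


if __name__ == "__main__":
    index = active_server_config({"SERVER_USE_CONFIG_1": "true"})
    print(index, SERVER_CONFIGS[index])
```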


GCPNet-EMA builds upon the source code and data from the following project(s):

We thank all their contributors and maintainers!

Citing this work

If you use the code or data associated with this project, or otherwise find this work useful, please cite:

  title={Protein Structure Accuracy Estimation using Geometry-Complete Perceptron Networks},
  author={Morehead, Alex and Liu, Jian and Cheng, Jianlin},
  journal={Protein Science}