AF2Rank

Code for the paper "State-of-the-Art Estimation of Protein Model Accuracy using AlphaFold" (https://www.biorxiv.org/content/10.1101/2022.03.11.484043v3). Experiments were run using the latest AlphaFold GitHub commit as of 5/16/2022 (https://github.com/deepmind/alphafold at commit b85ffe10799ca08cc62146f1dabb4e4ee6c0a580).

The script test_templates.py was used to run the evaluations in the paper for both the Rosetta decoy set and CASP. At a high level, this script takes a series of decoy structures and uses AlphaFold to rank them by predicted accuracy. Its command-line arguments and behavior are as follows:

Example Usage:

python test_templates.py [name] --targets_file [list of targets] --seed 1 --recycles 1 --decoy_dir [decoy directory] --seq_replacement - --mask_sidechains_add_cb --use_native

This runs the script with the template sequence replaced by the gap token (-) and the side chains masked, which is the configuration used for the results in the paper.
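As a minimal sketch, the same invocation can be assembled from Python, which is convenient when sweeping over seeds or recycle counts. The helper `build_command` is hypothetical and not part of this repository; the flags are the ones shown in the example usage, the script name is supplied by the caller, and the run name and paths in the usage lines are placeholders.

```python
import shlex
import sys


def build_command(script, name, targets_file, decoy_dir, seed=1, recycles=1):
    """Assemble the argument list for one ranking run (hypothetical helper)."""
    return [
        sys.executable, script, name,
        "--targets_file", targets_file,
        "--seed", str(seed),
        "--recycles", str(recycles),
        "--decoy_dir", decoy_dir,
        "--seq_replacement", "-",      # gap token for the template sequence
        "--mask_sidechains_add_cb",    # mask the side chains
        "--use_native",
    ]


if __name__ == "__main__":
    # Placeholder run name and paths; substitute real ones.
    cmd = build_command("test_templates.py", "my_run", "targets.txt", "decoys/")
    print(shlex.join(cmd))
    # To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems, which matters here because the gap token is a bare `-`.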

Directory Structure

The decoy dataset directory specified by decoy_dir should have the following structure:

The output directory specified by output_dir + name will have the following structure:

Data Availability

The raw data from the analyses in the paper can be found here: https://drive.google.com/drive/folders/1Q0aCR_lk4R67XlX9IHl6Jk0-dUI19rhA?usp=sharing

This folder contains the following files with ranking results from the Rosetta decoy dataset:

These files have the following subset of the fields described above:

In addition, the following files are included from the CASP14 evaluation:

The fields in these files are as follows:

Here all fields are as described above, and gdt_ts_out is the GDT_TS of the AlphaFold output structure relative to the native structure. Note that, for the CASP data, we were unable to access native structures for targets T1085 and T1086, so output accuracies are unavailable for these targets. In general, numeric values are set to -1 when they are not applicable (for instance, the input TM Score for a line representing AlphaFold's behavior with no template input).
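Because -1 doubles as a "not applicable" marker, it should be filtered out before averaging or correlating any numeric column. The following is a minimal sketch, assuming the result rows have already been parsed into dictionaries keyed by field name. The helper name and the sample values are illustrative and are not taken from the data files.

```python
from statistics import mean

NOT_APPLICABLE = -1  # sentinel the result files use for values that do not apply


def valid_values(rows, field):
    """Return the values of `field` as floats, skipping not-applicable rows."""
    values = (float(row[field]) for row in rows)
    return [v for v in values if v != NOT_APPLICABLE]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [{"gdt_ts_out": "0.82"}, {"gdt_ts_out": "-1"}, {"gdt_ts_out": "0.64"}]
    print(mean(valid_values(rows, "gdt_ts_out")))
```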

The Rosetta decoy set can be found here: https://files.ipd.uw.edu/pub/decoyset/decoys.zip

Some extra .txt files have been added to this dataset. The full set of .txt files can be found here: https://drive.google.com/drive/folders/1ew0Y8N55U--2m9fIWm9gJTuKC1LzN9K2?usp=sharing

Notebook

To run AF2Rank in Google Colab, take a look at this notebook:

https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/af/examples/AF2Rank.ipynb#scrollTo=UCUZxJdbBjZt

Update to previous version (5/23/2022)

The previous version of this repository contained a slightly different set of data obtained from setting the decoy sequence to either the target sequence and a sequence of all alanines. This earlier version of the results contained a bug which caused the target sequence to be incorrectly encoded. Specifically, the old code used the amino acid encoding specified by residue_constants.restypes, while it should have used the encoding given by residue_constants.HHBLITS_AA_TO_ID. This bug caused significant changes to the results using the target sequence, which are discussed in the paper. The old code and data with the erroneous sequence encoding can be found here: https://github.com/jproney/AF2Rank/tree/d7c9ec1fda03604b95f05132a9c2b4b2739a77a5. These results are described in an earlier version of the preprint from before the error was corrected: https://www.biorxiv.org/content/10.1101/2022.03.11.484043v2